Day 3: Transformer Mechanics & Token Economics

Understanding how transformers turn text into context, and why the length of that text (in tokens) affects cost and speed, doesn't have to be daunting. In this post, we'll break down these concepts with simple analogies, much like Luis Serrano's famous "gravity" analogy for attention in language models. We'll explore what tokens are, how transformers use them to understand context, and how tokenization and context length drive cost and latency.

What Exactly Are Tokens?

Think of tokens as the basic building blocks of language for AI models. While humans think in characters or words, transformers think in tokens. A token is a small unit of text. It could be a whole word, a sub-word piece, or even just a few characters. For example, the word "transformer" might be one token in one model, but in another model it could be split into "trans" and "former" as two tokens. There's no fixed rule: different models and tokenizers (the algorithms that split text) might break the same sentence into tokens differently. Efficient tokenization is important because it determines how much text fits in the model's context window (more on that soon).

Real-World Example: The phrase "I have a cat" could be 4 tokens: "I," "have," "a," and "cat." But a more complex word like "internationalization" might be broken into smaller pieces like "international" + "ization". In fact, major language models use clever schemes like Byte-Pair Encoding (BPE) and WordPiece that balance between splitting text too much (character by character, which makes sequences very long) and too little (whole words, which can fail on unfamiliar words). The goal is to keep the token count manageable (to save compute time) while still being able to represent any word.

Some common tokenization methods include:

  • Whitespace splitting: e.g., "I love AI" → ["I", "love", "AI"]

  • Sub-word tokens: breaking words into meaningful pieces (e.g., "learning" → ["learn", "ing"])

  • Byte/character tokens: treating each byte or character as a token. This covers any text, but it can explode the sequence length, which makes the model work harder.

No matter the method, all tokens are ultimately converted to numbers (token IDs) that the model can process. So, when you input text into an LLM, it's first chopped into tokens, each token gets an ID number, and then the transformer begins its real magic.
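To make the text-to-IDs pipeline concrete, here is a minimal sketch of a sub-word tokenizer in plain Python. The vocabulary and the greedy longest-prefix splitting rule are invented for illustration; real tokenizers like BPE and WordPiece learn their merge rules from data.

```python
# Toy sub-word tokenizer: a sketch, not a real BPE/WordPiece implementation.
# The vocabulary and IDs below are invented for illustration.
VOCAB = {"i": 0, "have": 1, "a": 2, "cat": 3, "learn": 4, "ing": 5, "<unk>": 6}

def tokenize(text):
    """Split on whitespace, then fall back to known sub-word pieces."""
    tokens = []
    for word in text.lower().split():
        if word in VOCAB:
            tokens.append(word)
            continue
        rest = word
        while rest:
            # Greedy sub-word split: peel off the longest known prefix.
            for piece in sorted(VOCAB, key=len, reverse=True):
                if piece != "<unk>" and rest.startswith(piece):
                    tokens.append(piece)
                    rest = rest[len(piece):]
                    break
            else:
                tokens.append("<unk>")  # no known piece fits
                break
    return tokens

def encode(text):
    """Map tokens to integer IDs, the form the transformer actually sees."""
    return [VOCAB[t] for t in tokenize(text)]

print(tokenize("I have a cat"))  # ['i', 'have', 'a', 'cat']
print(encode("I have a cat"))    # [0, 1, 2, 3]
print(tokenize("learning"))      # ['learn', 'ing']
```

Note how "learning" is not in the vocabulary, yet it still gets represented by falling back to the pieces "learn" + "ing", exactly the trick that lets real models handle unfamiliar words.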

Transformers Turn Tokens into Context

Once text is tokenized, the transformer model needs to make sense of those tokens in context. How does it do that? The key innovation is the self-attention mechanism. Essentially, the model learns to pay attention to relevant tokens when interpreting a given token. This is where a great analogy comes in:

Imagine each word in a sentence is a planet in space. Initially, each token (word) is represented as a vector; you can picture this as a point in space determined by the word's meaning (this is called an embedding). Words with similar meanings start out near each other in this space. For example, in an embedding space, you might find "orange," "banana," and "cherry" all clustered in a "fruit" region, while in a far-off corner, "Microsoft," "Android," and "laptop" cluster in a "technology" region. Now consider the word "apple"—is it a fruit or a tech term? Its initial position might be ambiguous, somewhere between the two clusters.

Here's where self-attention acts like gravity between those word-planets. When the transformer processes a sentence, each word pulls on other words based on their relatedness—relevant words exert a stronger "gravitational pull." In Luis Serrano's metaphor, if the sentence is "I bought an apple and an orange," the word "orange" (a fruit) pulls "apple" toward the fruit cluster, nudging its representation to mean apple-as-fruit. Conversely, in "I missed a call on my Apple phone," the word "phone" pulls "Apple" toward the tech cluster, indicating Apple-as-tech-brand. In effect, each word looks at the others and adjusts its meaning based on their influence. Words that are related end up reinforcing each other's context: "apple" and "orange" draw together strongly (both fruits), whereas unrelated words have little effect on each other. After this attention step (think of it as one round of gravitational adjustment), the ambiguous word's representation is much closer to the correct context.

In the analogy above, similar words have a strong gravitational pull: they move closer together in meaning. After one "gravity step," the word "apple" in the fruit sentence drifts nearer to "orange," solidifying that it means the fruit. If many fruit words surround "apple" (imagine a whole galaxy of fruit terms), their combined pull ensures "apple" is understood as a fruit in that context.

Mathematically, what the transformer is doing is computing a bunch of similarity scores (like dot products) between the token in question and all other tokens to decide which ones are most relevant. Those scores become weights. Literally, attention weights are telling the model how much to mix each other word into the current word's representation. But you don't need to dig into the math to appreciate the result: the model dynamically reinterprets each token based on the others, which is why it can understand that "apple" means different things in "apple pie" vs "Apple iPhone." This process of paying attention gives transformers a powerful sense of context that static embeddings or older models couldn't match.
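The dot-product-then-weights idea above can be sketched in a few lines of Python. The 2-D "embeddings" here are invented toy numbers (one axis loosely meaning "fruitness," the other "techness"); real models use hundreds or thousands of dimensions and separate query/key/value projections.

```python
import math

def softmax(scores):
    """Turn raw similarity scores into attention weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(query, contexts):
    """One 'gravity step': mix the context vectors into the query,
    weighted by how similar each context vector is to the query."""
    weights = softmax([dot(query, c) for c in contexts])
    mixed = [sum(w * c[i] for w, c in zip(weights, contexts))
             for i in range(len(query))]
    return mixed, weights

# Toy 2-D embeddings (invented): axis 0 ~ "fruitness", axis 1 ~ "techness".
apple  = [0.5, 0.5]   # ambiguous, halfway between the clusters
banana = [1.0, 0.0]   # clearly a fruit
phone  = [0.0, 1.0]   # clearly tech

fruit_mix, _ = attend(apple, [apple, banana])  # "an apple and a banana"
tech_mix, _  = attend(apple, [apple, phone])   # "my Apple phone"
print(fruit_mix)  # drifts toward the fruit axis: [0.75, 0.25]
print(tech_mix)   # drifts toward the tech axis:  [0.25, 0.75]
```

The same ambiguous "apple" vector ends up in two different places depending on which neighbors pull on it—exactly the context-dependence the analogy describes.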

Position Matters Too: Positional Encoding

One more piece of the puzzle: word order. Transformers don't inherently know which token came first or last, because self-attention treats the input as a set of tokens. So transformers add a bit of information to each token to indicate its position in the sequence, known as positional encoding. You can imagine this as giving each token a unique badge saying "I'm the 1st word, 2nd word, 3rd word, etc." This way, the model knows, for example, that "apple orange" and "orange apple" are different inputs even though the words are the same. Positional encoding ensures the transformer's attention considers not just which tokens, but also where they are in the sentence.

Why is this important? Because word meaning often comes from context and order. Take the word "trunk." If a story says, "The elephant raised its trunk," the surrounding words ("elephant", "raised") tell us "trunk" refers to the animal's nose. But if we read "the car's trunk was full," we know it's talking about a car's storage compartment. The transformer relies on neighboring tokens and their positions to resolve such ambiguities. Without positional info, it wouldn't know if a word is at the beginning or end of a sentence (which can affect meaning or grammatical role). The original transformers famously used a combination of sine and cosine waves to encode positions in a way that's smooth and doesn't blow up in magnitude, but the takeaway is simple: order is encoded so that context isn't just "bag of words." Each token's final understanding comes from both the content of other tokens and their positions relative to it.
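The sine-and-cosine scheme mentioned above can be sketched directly. This follows the original Transformer recipe—even dimensions use sine, odd dimensions use cosine, at geometrically spaced frequencies—though the dimension count here is tiny for readability.

```python
import math

def positional_encoding(position, dim):
    """Sinusoidal positional encoding: each position gets a unique,
    bounded pattern (every value stays in [-1, 1]), so positions can
    be added to embeddings without blowing up their magnitude."""
    pe = []
    for i in range(dim):
        # Frequency shrinks geometrically as the dimension index grows.
        freq = 1.0 / (10000 ** ((2 * (i // 2)) / dim))
        angle = position * freq
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

p0 = positional_encoding(0, 8)
p1 = positional_encoding(1, 8)
print(p0)  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0] -- position 0's badge
print(p1)  # a different badge: the model can tell the positions apart
```

Each badge is added to the token's embedding, so the vector the attention layers see carries both "what word" and "which slot."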

The Context Window: A Transformer's Working Memory

We've talked about tokens and context; now let's talk about the context window. This term refers to the amount of text (measured in tokens) a model can handle at once. You can think of it as the AI model's working memory. Just as you might only hold a certain number of ideas in your head at one time, an AI model has a limit to how many tokens it can "remember" from the conversation or document. Typical limits were 2,000 to 4,000 tokens for older models, and much more for newer ones.

A larger context window means the model can consider a longer prompt or have a longer conversation without forgetting earlier content. Early generative models could only handle a couple of thousand tokens (GPT-3's window was 2,048 tokens, roughly 1,500 words of English text). Modern models have blown past that; for instance, Anthropic's Claude can juggle 100,000 tokens or more, which might be around 75,000 words! By 2024, models like GPT-4 Turbo and Google's Gemini pushed to 128,000 tokens, even up to 1–2 million tokens in some versions. That's essentially an entire book's worth of text in one go.

Sounds amazing, right? More context means the model can take into account more information, long documents, entire codebases, or multi-turn conversations, leading to more accurate and coherent responses. It reduces issues like the model "forgetting" what was said 5 paragraphs ago. However, this comes with trade-offs in token economics, namely cost and latency.

Why Longer Context = Higher Cost and Latency

Upgrading a model to a bigger context window is expensive because attention computation grows quadratically with token count. Double the tokens, and you need roughly four times the memory and FLOPs. A jump from 4K to 16K tokens therefore demands far more GPU capacity and often a larger architecture. Generation also slows: an autoregressive decoder must compare each new token with the entire history, so long answers trickle out as the sequence grows.
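The quadratic growth is easy to see with a back-of-the-envelope calculation: self-attention compares every token with every token, so the score matrix has n × n entries.

```python
def attention_scores(n_tokens):
    """Self-attention compares every token with every other token
    (itself included), so the score matrix has n * n entries."""
    return n_tokens * n_tokens

# Quadrupling the context from 4K to 16K multiplies the work by 16:
print(attention_scores(16_384) // attention_scores(4_096))  # 16
```

This is why a context-window upgrade is never "free"—even a modest 4x length increase implies an order of magnitude more attention computation.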

Cloud providers charge per token, so bigger prompts hit your wallet. OpenAI lists GPT-3.5 at about $0.002 per 1K tokens for 4K context, while the 16K variant costs roughly twice that. GPT-4 pricing scales similarly. Self-hosted models face VRAM ceilings, which is why hard limits like 512 or 4,096 tokens exist. Research extends context with tricks such as RoPE scaling and streaming attention, but these add complexity and sometimes reduce accuracy. Long contexts can also hurt quality—the Lost in the Middle study shows that models may ignore information buried away from the ends of very long prompts. Thus, more tokens bring higher cost, higher latency, and no guarantee the model will actually use them well.

The Cost Side of Token Economics

Most API LLMs follow a pay-per-token model, so both prompt tokens (input) and generated tokens (output) count toward your bill. Output tokens are pricier because each one requires its own forward pass; for example, OpenAI once charged $0.0015 per 1K input tokens and $0.0020 per 1K output tokens. These tiny figures add up quickly at scale.

Providers usually offer on-demand token pricing or throughput-based provisioning. Pay-as-you-go suits sporadic traffic, while a reserved throughput plan can cut unit costs for high, steady loads—think bulk buying. Teams further trim spending with batching, caching, and other token-saving tricks. For everyday users the rule is simple: longer prompts and answers mean more tokens, higher latency, and higher fees, so concise requests keep both time and money in check.
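The billing arithmetic is simple enough to sketch. The default rates below are illustrative (they mirror the example figures earlier in this post), not any provider's current pricing.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_1k=0.0015, output_price_per_1k=0.0020):
    """Pay-per-token bill: prompt and completion tokens are priced
    separately, with output tokens typically costing more.
    Default rates are illustrative, not current pricing."""
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k

# A 2,000-token prompt with a 500-token answer:
print(round(estimate_cost(2_000, 500), 4))  # 0.004 -- under half a cent
```

Half a cent per request sounds negligible, but at a million requests a day it becomes thousands of dollars—which is exactly why teams batch, cache, and trim prompts.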

Choosing the Next Token: Creativity vs. Predictability

After processing all tokens in context, the model must decide which token to emit next. It does this by sampling from a probability distribution, and two key knobs shape that sampling: temperature and top-p (nucleus sampling).

  • Temperature adjusts randomness. A low value (≈ 0) tells the model to pick the most likely token, producing predictable text. A higher value (≈ 0.8 – 1.0) lets it choose less-likely tokens, adding creativity.

  • Top-p sets a probability cutoff. With top-p = 0.9, the model considers only the tokens whose cumulative probability reaches 90%, discarding the rest. Higher top-p widens the candidate pool and yields more varied prose; lower values keep outputs tight and factual.

These settings do not affect context length or cost, only style. High temperature plus high top-p encourages imaginative language useful for storytelling, whereas low settings keep the model conservative for factual tasks. Research shows that within normal ranges (temperature 0–1) raising temperature does not boost accuracy; it merely changes wording. Extreme values can lead to incoherent text, so randomness must be tuned with care.
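Both knobs can be implemented in a few lines. The logits and token names below are invented for illustration; real models produce a score for every token in a vocabulary of tens of thousands.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Temperature rescales the logits; top-p (nucleus sampling) keeps
    only the smallest set of tokens whose cumulative probability reaches
    the cutoff, then samples from those survivors."""
    scaled = [l / max(temperature, 1e-6) for l in logits.values()]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    probs = sorted(zip(logits.keys(), (e / total for e in exps)),
                   key=lambda kv: kv[1], reverse=True)
    # Nucleus filtering: keep the most likely tokens until we pass top_p.
    kept, cum = [], 0.0
    for token, p in probs:
        kept.append((token, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize the survivors and draw one at random.
    norm = sum(p for _, p in kept)
    r, acc = rng.random() * norm, 0.0
    for token, p in kept:
        acc += p
        if acc >= r:
            return token
    return kept[-1][0]

# Invented logits for the token after "I baked an apple ...":
logits = {"pie": 2.0, "iPhone": 1.0, "orbit": -1.0}
print(sample_next_token(logits, temperature=0.1, top_p=0.5))  # "pie" every time
```

With temperature 0.1 and top-p 0.5, the nucleus collapses to the single most likely token and the output is effectively deterministic; raise both knobs and "iPhone" (or even "orbit") starts to appear.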

Conclusion

  • Tokens are the tiny text pieces (whole words like "cat," sub-words like "inter-" and "national," or even single characters) that a tokenizer feeds to a transformer. Treating input as tokens keeps sequences predictable in length and lets the model handle rare or misspelled words by splitting them into manageable pieces.

  • Each token is converted into a high-dimensional embedding vector—often 256 to 16,384 numbers—that positions semantically similar words close together. In this learned space, "king" and "queen" end up nearer each other than either is to "banana," giving the model an intuitive starting point for meaning.

  • Self-attention assigns every token a set of queries, keys, and values, compares each query with every key, and builds weights that highlight the most relevant tokens. Those weights remix the values so "apple" is pulled toward a fruit meaning if "orange" and "banana" are nearby, or toward a tech meaning if "phone" is present.

  • Positional encoding injects order information by adding a unique pattern to each embedding, telling the model "this is the 3rd word, this is the 4th." Classical transformers use sine and cosine waves that vary smoothly, while newer methods learn these patterns or encode only relative distances to generalize to longer texts.

  • Stacking many attention plus feed-forward layers deepens understanding: lower layers capture short-range relations like adjective–noun pairs, middle layers model clauses, and higher layers integrate sentence-level or paragraph-level themes. Residual connections and layer normalization keep signals stable during this buildup.

  • The context window is the maximum token span the model can see at once—commonly 4,096 tokens for GPT-3.5-era models and up to 128,000 tokens for cutting-edge systems. A larger window lets the model reference early passages in a long document or maintain multi-turn chat history without forgetting, at the cost of more memory and compute.

  • Bigger windows and richer layers raise runtime and pricing, so practitioners trim prompts, summarize background, or reuse cached context to stay within hardware limits and budget. Balancing window size, prompt brevity, and sampling settings keeps responses accurate while avoiding slowdowns and unnecessary cost.
