Transformers
Transformers are perhaps the most pivotal development in the field of machine learning. The architecture may seem intimidating at first, but trust me, it’s much more straightforward than you think.
Some background work is required before we jump straight into the architecture. I first want to introduce the idea of tokens and attention.
Tokens are simply chunks of a sequence. They are the building blocks that transformers understand. In the context of LLMs, tokens are parts of words, whole words, or even punctuation. For example, the word “is” can be a token, but a word can also be composed of multiple tokens such as “danc-ing”, where “danc” is one token and “ing” is another. How text is split into tokens is determined by the tokenizer, a process of its own.
One example of a tokenizer, and the one used in the LLM we implemented, is the Byte Pair Encoding (BPE) tokenizer. The algorithm repeatedly finds the most frequent pair of adjacent symbols (initially single characters, later previously merged tokens) and merges that pair into a single new token, repeating until a target vocabulary size is reached.
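To make the merge loop concrete, here is a minimal BPE trainer in Python. This is a simplified sketch of the idea, not the exact tokenizer used in the LLM; the toy corpus and merge count are made up for illustration:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words (each word is a tuple of symbols)."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])  # fuse the pair into one symbol
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules from a list of words."""
    words = Counter(tuple(w) for w in corpus)  # start from individual characters
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges

merges = train_bpe(["dancing", "dance", "dancer", "ring", "sing"], 4)
```

A real tokenizer also needs byte-level fallback and an encode step that replays the learned merges on new text, but the training loop above is the core of the algorithm.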
Attention
Next is attention. Attention can be thought of as how important different tokens are to other tokens.
Transformers are designed around attention, though attention as a concept existed before Google’s transformer paper. The transformer’s predecessors, recurrent neural networks (RNNs), also used attention mechanisms, but they processed sequences token by token, which bottlenecked performance and caused information to degrade over long sequences. Transformers solve this by directly computing pairwise attention between all tokens, enabling parallelization and opening up optimization opportunities, while reducing the aforementioned information degradation.
In transformers, we focus specifically on left-to-right causal attention. I will first explain general self-attention.
Mathematically, an attention function can be described as mapping a query and a set of key-value pairs to a weighted sum of values. Put abstractly, a score is computed for each token indicating how important it is to a given token, and these scores are used to produce a final value that represents information about that token.
A useful analogy: imagine you’re in a library looking for information on a topic.
- Query is the question you’re asking, what this token is looking for
- Key is the title of each book, what each token is about
- Value is the content inside each book, the actual information each token provides

You compare your query against every key (book title) to decide which books are most relevant, then extract information from those books (values) proportionally.
Let’s say we start with the sentence, “I’m building a minimal ML framework.” We can work through this from left-to-right, starting with the query “I’m”. We compare this query against the keys of every token in the sentence.
The only catch is that we don’t use the raw text “I’m” and “building” directly; rather, we use vector representations of them. We use a dot product as our compatibility function to measure the relationship between the query and each key, producing a score for each pair. We normalize those scores into an attention vector using softmax, and get something like the following:

$$[0.60,\ 0.15,\ 0.05,\ 0.05,\ 0.10,\ 0.05]$$

These numbers are made up, but the idea is that a number close to 1 is more “important” and should receive more focus than one close to 0. Also note that since we used softmax, all entries are nonnegative and they sum to 1.
We now give each token a value vector $v_i$. Then the attention output for the query token “I’m” is the weighted sum of the value vectors:

$$\text{output} = \sum_i \alpha_i v_i$$

where $\alpha_i$ are the attention weights produced by the softmax.
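This weighted sum can be sketched in a few lines of NumPy. The attention weights and the two-dimensional value vectors below are purely illustrative, made-up numbers:

```python
import numpy as np

# Toy value vectors for the six tokens
# ("I'm", "building", "a", "minimal", "ML", "framework.") -- illustrative only.
values = np.array([
    [1.0, 0.0],   # "I'm"
    [0.0, 1.0],   # "building"
    [0.5, 0.5],   # "a"
    [0.2, 0.8],   # "minimal"
    [0.9, 0.1],   # "ML"
    [0.3, 0.7],   # "framework."
])

# Made-up softmax output for the query "I'm": nonnegative, sums to 1.
weights = np.array([0.60, 0.15, 0.05, 0.05, 0.10, 0.05])

# The attention output is the weighted sum of the value vectors.
output = weights @ values  # -> [0.74, 0.26]
```

The output leans heavily toward the value vector of “I’m” itself, because that token received the largest attention weight.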
Implementation
Mathematically, each token has three vectors that are produced from learned weight matrices: a query vector $q$, a key vector $k$, and a value vector $v$. The same token produces all three, but through different learned transformations. Rather than computing attention one query at a time, we pack all tokens’ queries, keys, and values into matrices $Q$, $K$, and $V$ and compute attention for every token simultaneously:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
The matrix multiplication $QK^T$ computes the compatibility score between every pair of tokens at once. This is why transformers parallelize so well, connecting back to our discussion in the previous post. We divide by $\sqrt{d_k}$ (the square root of the key dimension) to prevent the dot products from growing too large, which would push softmax into regions with very small gradients.
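The matrix form translates almost directly into NumPy. The sketch below assumes $Q$, $K$, and $V$ have already been produced by the learned projections, so random matrices stand in for them:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise compatibility, shape (T, T)
    weights = softmax(scores, axis=-1)   # each row is an attention distribution
    return weights @ V                   # weighted sum of value vectors

rng = np.random.default_rng(0)
T, d = 4, 8                              # 4 tokens, dimension 8
Q, K, V = rng.normal(size=(3, T, d))     # stand-ins for the projected matrices
out = attention(Q, K, V)                 # shape (4, 8): one output vector per token
```

Every token’s output is computed in the same pair of matrix multiplications, which is exactly the parallelism the prose describes.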
Attention Masking
In causal (left-to-right) attention, tokens can only attend to themselves and tokens that came before them. Tokens to the right are masked. This is enforced by setting the upper-triangular entries of the attention score matrix to $-\infty$ before applying softmax, which drives those attention weights to zero.
This masking is essential during both training and inference. During training, it prevents the model from “cheating” by seeing future tokens it’s supposed to predict. During inference, the future tokens simply don’t exist yet. This is what enables the core idea of the transformer: predict the next token given all previous tokens.
A key benefit of this design is that even though each token can only look backward, we can still compute attention for all positions in parallel during training, since the masking is applied as a matrix operation.
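A sketch of masking as a matrix operation, building on the scaled dot-product attention above (random stand-ins for $Q$, $K$, $V$):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Attention where token t may only attend to positions <= t."""
    T, d_k = Q.shape[0], K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True strictly above diagonal
    scores = np.where(mask, -np.inf, scores)          # future positions -> -inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax; exp(-inf) = 0
    return weights, weights @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 5, 4))  # 5 tokens, dimension 4
w, out = causal_attention(Q, K, V)
```

The first token can only attend to itself (its row of `w` is `[1, 0, 0, 0, 0]`), while the last token attends over the whole prefix; all rows are still computed in one batched operation.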
Multi-head self-attention
Multi-head self-attention means the model runs several separate self-attention operations in parallel, each with its own learned $W^Q$, $W^K$, and $W^V$ projection matrices. Each head can focus on different patterns. For example, one head might learn syntactic relationships while another captures semantic similarity.
The outputs of all heads are concatenated and then passed through a final linear projection $W^O$ to combine them back into the expected dimension:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O, \qquad \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
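A minimal sketch of the concat-and-project step, with randomly initialized projection matrices standing in for the learned weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, params):
    """Run one attention pass per head, concatenate, then project with W^O."""
    d_k = params["heads"][0][0].shape[-1]
    outputs = []
    for Wq, Wk, Wv in params["heads"]:        # each head has its own W^Q, W^K, W^V
        Q, K, V = X @ Wq, X @ Wk, X @ Wv      # per-head shape (T, d_k)
        w = softmax(Q @ K.T / np.sqrt(d_k))
        outputs.append(w @ V)
    concat = np.concatenate(outputs, axis=-1)  # back to (T, d_model)
    return concat @ params["Wo"]               # final output projection

rng = np.random.default_rng(0)
T, d_model, h = 4, 16, 4
d_k = d_model // h                             # heads split the model dimension
params = {
    "heads": [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
              for _ in range(h)],
    "Wo": rng.normal(size=(d_model, d_model)),
}
X = rng.normal(size=(T, d_model))
out = multi_head_attention(X, params)          # shape (4, 16)
```

Note the common convention, used here, that $d_k = d_{\text{model}} / h$, so running $h$ heads costs about the same as one full-width head.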
Encoders and Decoders
The original transformer from “Attention Is All You Need” (Vaswani et al., 2017) uses an encoder-decoder architecture. The encoder processes the input sequence and the decoder generates the output sequence. For example, in a translation model, the encoder takes an English sentence and the decoder produces the French translation.
It’s worth noting that LLMs, including the one we trained, use a decoder-only architecture, dropping the encoder entirely. We’ll cover both here, since seeing the full architecture makes the decoder-only variant much easier to grasp.
Encoder
The encoder’s job is to transform the input sequence into a rich set of representations that capture the meaning and relationships between tokens.
Step 1: Input Embedding
First, each token is converted into a vector using an **embedding layer**, a learned lookup table that maps each token ID to a dense vector of dimension $d_{\text{model}}$.
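The lookup itself is just array indexing. The vocabulary size and dimension below are hypothetical, and a random matrix stands in for the learned table:

```python
import numpy as np

vocab_size, d_model = 1000, 64            # hypothetical sizes
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, d_model))  # learned lookup table

token_ids = np.array([17, 4, 256])        # three token IDs from the tokenizer
X = embedding[token_ids]                  # shape (3, 64): one vector per token
```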
Step 2: Positional Encoding
Self-attention has no inherent sense of order, treating the input as a set. To incorporate position information, we add a positional encoding vector to each token’s embedding. Each position gets a unique vector, typically generated using sinusoidal functions:

$$PE_{(pos,\ 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\ 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

where $pos$ is the position in the sequence and $i$ is the dimension index. These vectors are added to the token embeddings so the model can distinguish between the same words appearing in different positions.
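The sinusoidal formula vectorizes cleanly; a sketch, assuming an even $d_{\text{model}}$:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model / 2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                  # even dimensions
    pe[:, 1::2] = np.cos(angle)                  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=64)
# Usage: X = X + pe[:T]  -- simply added to the token embeddings.
```

Each position gets a unique pattern of frequencies, so two identical tokens at different positions end up with different input vectors.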
Step 3: Multi-Head Self-Attention
The positionally encoded embeddings are passed into the multi-head self-attention layer described above. In the encoder, there is no causal masking. Every token can attend to every other token, since the full input is available.
Step 4: Add & Normalize
After attention, we apply a residual connection followed by layer normalization:

$$\text{LayerNorm}(x + \text{Sublayer}(x))$$

where $x$ is the input to the sublayer (multi-head attention in this case) and $\text{Sublayer}(x)$ is its output. The residual connection (adding $x$ back) allows the model to build on its existing representations rather than starting from scratch, and helps gradients flow during backpropagation. Layer normalization provides stability and helps prevent vanishing/exploding gradients.
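A sketch of the add-and-norm step; the learned scale and shift parameters ($\gamma$, $\beta$) that a full layer norm carries are omitted here for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's vector to zero mean and unit variance.
    (Learned gamma/beta parameters are omitted in this sketch.)"""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + sublayer_out)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))     # input to the sublayer
sub = rng.normal(size=(4, 8))   # sublayer output (e.g. multi-head attention)
y = add_and_norm(x, sub)        # each row now has mean 0, variance ~1
```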
Step 5: Feed-Forward Network
The output is then passed through a position-wise feed-forward network (FFN), the same network applied independently to each token:

$$\text{FFN}(x) = \max(0,\ xW_1 + b_1)W_2 + b_2$$
This gives the model additional capacity to transform the representations. It is followed by another residual connection and layer normalization.
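The FFN is two linear layers with a ReLU in between; a sketch with random stand-ins for the learned weights:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: max(0, x W1 + b1) W2 + b2, applied per token."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                    # the inner dimension is wider than d_model
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(4, d_model))        # 4 tokens
y = feed_forward(x, W1, b1, W2, b2)      # same shape as the input: (4, 8)
```

Because the same weights are applied to every row independently, running the FFN on one token gives the same result as slicing that token out of the batched output.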
The encoder typically stacks $N$ of these layers (the original paper uses $N = 6$), with each layer refining the representations further.
Decoder
The decoder generates the output sequence one token at a time, using both the encoder’s output and the tokens generated so far.
Step 1: Output Embedding + Positional Encoding
Just like the encoder, the target tokens are embedded and positionally encoded.
Step 2: Masked Multi-Head Self-Attention
This is the same multi-head self-attention, but with causal masking applied. Each token in the decoder can only attend to itself and previous tokens, preventing information from the future from leaking in.
Step 3: Add & Normalize
Residual connection and layer normalization, same as the encoder.
Step 4: Cross-Attention
This is the key difference from the encoder. The decoder performs a second attention step where:
- The queries come from the decoder’s previous layer
- The keys and values come from the encoder’s output

This is how the decoder “looks at” the input sequence. For our translation example, this is where the French side attends to the English side to figure out what to say next.
This is followed by another residual connection and layer normalization.
Step 5: Feed-Forward Network
Identical to the encoder, a position-wise FFN followed by residual connection and layer normalization.
The decoder also stacks $N$ identical layers ($N = 6$ in the original paper).
Final Output
The decoder’s output is passed through a final linear layer that projects it to the vocabulary size, followed by a softmax to produce a probability distribution over all possible next tokens. The token with the highest probability can then be selected as the output, or the next token can be sampled from the distribution, with temperature controlling how random that sampling is.
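A sketch of this final step, with a random stand-in for the learned output projection and hypothetical sizes:

```python
import numpy as np

def next_token_distribution(h, W_out, temperature=1.0):
    """Project the final hidden state to vocabulary logits, then softmax.
    Temperature < 1 sharpens the distribution; > 1 flattens it."""
    logits = h @ W_out / temperature
    e = np.exp(logits - logits.max())        # stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
d_model, vocab = 8, 100                      # hypothetical sizes
W_out = rng.normal(size=(d_model, vocab))    # final linear projection
h = rng.normal(size=(d_model,))              # hidden state at the last position

probs = next_token_distribution(h, W_out, temperature=0.8)
greedy = int(np.argmax(probs))               # pick the most likely token
sampled = int(rng.choice(vocab, p=probs))    # or sample from the distribution
```

Greedy selection is deterministic; sampling with temperature trades determinism for diversity in the generated text.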
Decoder-Only Models
As mentioned, most modern LLMs use a decoder-only architecture. This is essentially just the decoder described above, but without the cross-attention step. There’s no encoder and no separate input/output sequence — the model simply takes a sequence of tokens and predicts the next one, using causal masking to prevent looking ahead.
By training on massive amounts of text with the goal of predicting the next token, decoder-only transformers learn language and concepts that generalize to a variety of tasks.