Attention is All You Need: A Deep Dive into the Transformer
As deep learning has evolved, the Transformer has become a cornerstone for numerous applications in natural language processing and beyond. In this post, we will explore its architecture and key components, keeping the underlying math intact along the way.
Introduction
The Transformer model revolutionized sequence transduction tasks by replacing recurrent and convolutional layers with a fully attention-based mechanism. This innovation not only reduced training costs significantly but also enabled much more parallelization during training.
"The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs." (p. 1)
Model Architecture
The Transformer consists of an encoder-decoder framework:
- Encoder: Processes the input sequence and generates continuous representations.
- Decoder: Consumes encoder outputs and predicts the target sequence step by step.
Both the encoder and decoder are composed of 6 identical layers, each having:
- A Multi-Head Self-Attention mechanism.
- A Position-wise Feed Forward Network.
"Residual connections are employed around each sub-layer, followed by layer normalization." (p. 3)
Self-Attention
Self-attention allows the model to compute a representation of a word based on all other words in the sequence, eliminating the need for sequential computation typical of recurrent networks.
What is Self-Attention?
Consider the sentence:
"The cat sat on the mat."
When processing the word "cat," the self-attention mechanism assigns higher weights to related words such as "sat" and "mat" to build a refined representation.
Steps in Self-Attention
- Compute Query, Key, and Value matrices
  Multiply the input embeddings by learned weight matrices:
  Q = X W_Q,   K = X W_K,   V = X W_V
- Calculate Attention Scores
  Compute the dot product between Q and K, and scale by sqrt(d_k):
  scores = Q K^T / sqrt(d_k)
- Apply Softmax
  Normalize the scores so that each row sums to 1:
  A = softmax(scores)
- Compute the Output
  Multiply the normalized scores with the Value matrix:
  Output = A V
Example: With three words (embeddings x_1, x_2, x_3), the computed pairwise similarities produce attention weights that aggregate the value vectors into a refined representation for each word, as in the sketch below.
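To make the four steps concrete, here is a minimal NumPy sketch for a toy three-token input. The matrix sizes and the randomly initialized weights are illustrative placeholders, not the paper's dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 4                       # toy sizes, not the paper's 512 / 64
X = rng.normal(size=(3, d_model))         # 3 token embeddings, e.g. "the", "cat", "sat"

# Step 1: learned projections (randomly initialized here)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Step 2: scaled dot-product scores
scores = Q @ K.T / np.sqrt(d_k)           # shape (3, 3): one row of scores per token

# Step 3: softmax so that each row sums to 1
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Step 4: weighted sum of the value vectors
output = weights @ V                      # refined representation for each token
print(weights.round(2))                   # the 3x3 attention weights
print(output.shape)                       # (3, 4)
```

With trained weights, the row for "cat" would place more mass on related words such as "sat" and "mat"; with random weights the pattern is arbitrary, but the shapes and the flow of the computation are the same.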
Attention Mechanism
Scaled Dot-Product Attention
The core attention mechanism is defined as:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
Where:
- Q, K, and V are the Query, Key, and Value matrices.
- d_k is the dimensionality of the keys.
Multi-Head Attention
Instead of applying a single attention function, the Transformer projects Q, K, and V into multiple subspaces and computes attention in parallel. The outputs are then concatenated, allowing each head to learn different relationships between words.
Example:
For the sentence “She went to the bank to deposit money”:
- One head might interpret “bank” as a financial institution.
- Another head might see “bank” in the context of a riverbank.
"Multi-head attention allows the model to jointly attend to information from different representation subspaces." (p. 4)
Positional Encoding
Since the Transformer model lacks recurrence, positional encodings are added to the token embeddings to capture word order.
"We hypothesized it would allow the model to easily learn to attend by relative positions." (p. 5)
How Positional Encodings Work
For each position pos and dimension i, the positional encoding is computed as:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
- Even indices use sine.
- Odd indices use cosine.
- d_model is the embedding dimension (512 in the base model).
These functions produce smooth, repeating patterns that help the model generalize to sequences of varying lengths.
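A small NumPy sketch of the sinusoidal encoding defined above; max_len and d_model here are arbitrary small values chosen for illustration:

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # the even dimensions 0, 2, 4, ...
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                  # even indices use sine
    pe[:, 1::2] = np.cos(angle)                  # odd indices use cosine
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)   # (50, 16); each row is added to the embedding at that position
```

Because the wavelengths form a geometric progression across dimensions, nearby positions receive similar encodings while distant ones diverge, which is what makes attending by relative position easy to learn.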
Why Self-Attention?
Self-attention excels for several reasons:
- Computational Efficiency:
- Faster than recurrent layers whenever the sequence length is smaller than the representation dimension, which is the common case in practice.
- Supports parallel processing.
- Reduced Path Length:
- The dependency path between any two positions is reduced to O(1), aiding gradient flow.
- Interpretability:
- Attention heads often reveal syntactic and semantic roles, making the model more interpretable.
Training the Transformer
Optimizer & Regularization
- The model utilizes the Adam optimizer with a specialized learning rate schedule: a linear warmup followed by inverse-square-root decay (see the sketch after this list).
- Dropout is applied to both attention and feed-forward layers.
- Label Smoothing (with value 0.1) discourages over-confident predictions during training; it hurts perplexity but improves accuracy and BLEU.
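A minimal sketch of the learning rate schedule described in the paper, which rises linearly during warmup and then decays with the inverse square root of the step number. The values warmup_steps = 4000 and d_model = 512 are the paper's; everything else is illustrative:

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)   # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate rises for the first 4000 steps, peaks, then decays as 1/sqrt(step).
for s in (1, 1000, 4000, 10000, 100000):
    print(s, round(transformer_lr(s), 6))
```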
Results
The Transformer established state-of-the-art performance in machine translation:
- English-to-German BLEU score: 28.4 (previous best: 26.3).
- English-to-French BLEU score: 41.8 (previous best: 41.16).
Lecture Notes
Recurrent Neural Networks
Before Transformers, Recurrent Neural Networks (RNNs) were commonly used for processing sequential data.
- Process:
  RNNs process a sequence sequentially, where each token's representation is based on the previous hidden state (see the sketch after this list).
- Limitations:
- Slow for long sequences.
- Prone to vanishing and exploding gradients.
- Difficulty in capturing long-range dependencies.
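For contrast with self-attention, the sketch below shows the sequential dependence that makes RNNs slow: each hidden state must wait for the previous one. The sizes and the simple tanh cell are illustrative, not a specific model from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, seq_len = 8, 16, 5
W_xh = rng.normal(size=(d_in, d_hidden)) * 0.1     # input-to-hidden weights
W_hh = rng.normal(size=(d_hidden, d_hidden)) * 0.1 # hidden-to-hidden weights

x = rng.normal(size=(seq_len, d_in))   # the input sequence
h = np.zeros(d_hidden)                 # initial hidden state

# This loop cannot be parallelized across time steps: h[t] depends on h[t-1].
for t in range(seq_len):
    h = np.tanh(x[t] @ W_xh + h @ W_hh)

print(h.shape)   # (16,) -- the final hidden state summarizes the whole sequence
```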
Input & Positional Embedding
Tokenization splits a sentence into tokens (words or sub-words). Each token is assigned:
- A unique input ID.
- A learned embedding vector (e.g., size 512).
Positional Embeddings are added to these token embeddings to encode word order:
input representation = token embedding + positional embedding
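A PyTorch sketch of this input pipeline. The vocabulary size, the token IDs, and the use of a learned positional embedding table are illustrative assumptions; the paper itself uses the fixed sinusoidal encoding shown earlier:

```python
import torch
import torch.nn as nn

d_model, vocab_size, max_len = 512, 10000, 128

tok_embed = nn.Embedding(vocab_size, d_model)   # one 512-dim vector per token ID
pos_embed = nn.Embedding(max_len, d_model)      # learned positional embeddings (assumption)

# Hypothetical token IDs for a 6-token sentence after tokenization.
input_ids = torch.tensor([[12, 845, 7, 2300, 4, 99]])        # shape (1, 6)
positions = torch.arange(input_ids.size(1)).unsqueeze(0)     # [[0, 1, 2, 3, 4, 5]]

x = tok_embed(input_ids) + pos_embed(positions)  # word meaning + word order
print(x.shape)   # torch.Size([1, 6, 512])
```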
Single Head and Multi-Head Attention
Single Head Self-Attention
- Process:
  - Compute Q = X W_Q, K = X W_K, and V = X W_V.
  - Calculate attention scores:
    scores = Q K^T / sqrt(d_k)
  - Scale, apply softmax, and multiply by V to get the final representation.
Multi-Head Self-Attention
- Split Q, K, and V into h subspaces, one per head (a code sketch follows this list).
- Compute Scaled Dot-Product Attention for each head:
  head_i = Attention(Q W_Q^i, K W_K^i, V W_V^i)
- Concatenate the outputs from all heads.
- Final Linear Transformation:
  MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O
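The split / attend / concatenate / project steps can be written out explicitly. A minimal PyTorch sketch with untrained random weights and batch size 1; the tensor layout and variable names are my own choices for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, h = 512, 8
d_k = d_model // h                     # 64 dimensions per head
x = torch.randn(1, 10, d_model)        # (batch, seq_len, d_model)

W_q, W_k, W_v, W_o = (nn.Linear(d_model, d_model, bias=False) for _ in range(4))

# 1. Project, then split into h heads: (batch, h, seq_len, d_k)
def split_heads(t):
    return t.view(1, -1, h, d_k).transpose(1, 2)

Q, K, V = split_heads(W_q(x)), split_heads(W_k(x)), split_heads(W_v(x))

# 2. Scaled dot-product attention per head (computed in parallel over the head axis)
scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
heads = F.softmax(scores, dim=-1) @ V              # (1, h, seq_len, d_k)

# 3. Concatenate the heads back to (batch, seq_len, d_model)
concat = heads.transpose(1, 2).contiguous().view(1, -1, d_model)

# 4. Final linear transformation
out = W_o(concat)
print(out.shape)   # torch.Size([1, 10, 512])
```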
Layer Normalization
Normalization helps stabilize training by ensuring consistent input distributions. For each feature vector x:
mu = mean(x),   sigma^2 = var(x),   x_hat = (x - mu) / sqrt(sigma^2 + epsilon)
Then, applying learnable parameters:
y = gamma * x_hat + beta
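A NumPy sketch of the two steps, normalize then rescale; the epsilon value and the toy input vector are illustrative:

```python
import numpy as np

def layer_norm(x: np.ndarray, gamma: np.ndarray, beta: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    mu = x.mean(axis=-1, keepdims=True)       # mean over the feature dimension
    var = x.var(axis=-1, keepdims=True)       # variance over the feature dimension
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalized features
    return gamma * x_hat + beta               # learnable scale and shift

x = np.array([[1.0, 2.0, 3.0, 4.0]])
print(layer_norm(x, gamma=np.ones(4), beta=np.zeros(4)))
# approximately [[-1.34, -0.45, 0.45, 1.34]]
```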
Decoder Overview
The decoder utilizes cross-attention to combine encoder outputs with its own masked self-attention:
- Masked Multi-Head Self-Attention: Prevents the decoder from “seeing” future tokens.
- Cross-Attention: Attends to encoder outputs.
- Feed Forward Network: Applies further transformations.
- Add & Norm Layers: Stabilize training.
During inference, the model generates tokens one-by-one in an auto-regressive manner.
Inference and Training
Training is parallelized:
- Input tokenization and positional encoding.
- The encoder processes the input sequence.
- The decoder uses a shifted target sequence (starting with <SOS>) and applies masked self-attention.
- The final output is generated via a linear layer followed by a softmax activation, with cross-entropy loss used for training (see the sketch after this list).
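A sketch of the training objective: logits from the final linear layer are compared against the target sequence shifted by one position, with label smoothing. The vocabulary size, the token IDs, and the random logits are placeholders; label_smoothing=0.1 matches the value used in the paper, and the softmax is applied implicitly inside the cross-entropy loss:

```python
import torch
import torch.nn as nn

vocab_size, seq_len = 10000, 6
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # 0.1 as in the paper

# The decoder input starts with <SOS>; the target is the same sequence shifted
# left by one, ending with <EOS>. The IDs here are placeholders.
decoder_input = torch.tensor([[1, 57, 209, 8, 644, 3]])   # [<SOS>, w1, w2, w3, w4, w5]
target        = torch.tensor([[57, 209, 8, 644, 3, 2]])   # [w1, w2, w3, w4, w5, <EOS>]

# Stand-in for the decoder output after the final linear layer (random, untrained);
# in a real model, decoder_input would be fed through masked self-attention first.
logits = torch.randn(1, seq_len, vocab_size)

loss = criterion(logits.view(-1, vocab_size), target.view(-1))
print(loss.item())
```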
Inference involves:
- Encoding the input.
- Iterative decoding (often enhanced with beam search) until an <EOS> token is produced (see the sketch after this list).
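A sketch of greedy decoding, i.e. beam search with beam width 1. Here `model`, `encode`, `decode`, and the special-token IDs are hypothetical names standing in for a trained Transformer, not a real API:

```python
import torch

SOS_ID, EOS_ID, MAX_LEN = 1, 2, 50   # hypothetical special-token IDs and length limit

@torch.no_grad()
def greedy_decode(model, src_ids: torch.Tensor) -> list[int]:
    """Generate target tokens one by one until <EOS>. `model` is a hypothetical
    object exposing .encode(src) and .decode(memory, tgt) -> logits."""
    memory = model.encode(src_ids)                 # the encoder runs only once
    tgt = torch.tensor([[SOS_ID]])                 # decoding starts from <SOS>
    for _ in range(MAX_LEN):
        logits = model.decode(memory, tgt)         # (1, tgt_len, vocab_size)
        next_id = logits[0, -1].argmax().item()    # pick the most probable next token
        tgt = torch.cat([tgt, torch.tensor([[next_id]])], dim=1)
        if next_id == EOS_ID:                      # stop once <EOS> is produced
            break
    return tgt[0].tolist()
```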
This post offers a comprehensive look at the Transformer model, preserving the detailed math necessary to grasp its innovative approach to sequence modeling.