Attention is All You Need: A Deep Dive into the Transformer
As deep learning has evolved, the Transformer has become a cornerstone for numerous applications in natural language processing and beyond. In this post, we will explore its architecture and key components, keeping the underlying math intact along the way.
Introduction
The Transformer model revolutionized sequence transduction tasks by replacing recurrent and convolutional layers with a fully attention-based mechanism. This innovation not only reduced training costs significantly but also enabled much more parallelization during training.
"The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs." (p. 1)
Model Architecture
The Transformer consists of an encoder-decoder framework:
- Encoder: Processes the input sequence and generates continuous representations.
- Decoder: Consumes encoder outputs and predicts the target sequence step by step.
Both the encoder and decoder are composed of 6 identical layers, each having:
- A Multi-Head Self-Attention mechanism.
- A Position-wise Feed Forward Network.
"Residual connections are employed around each sub-layer, followed by layer normalization." (p. 3)
Self-Attention
Self-attention allows the model to compute a representation of a word based on all other words in the sequence, eliminating the need for sequential computation typical of recurrent networks.
What is Self-Attention?
Consider the sentence:
"The cat sat on the mat."
When processing the word "cat," the self-attention mechanism assigns higher weights to related words such as "sat" and "mat" to build a refined representation.
Steps in Self-Attention
- Compute Query, Key, and Value matrices
  Multiply the input embeddings by learned weight matrices:
  Q = X W_Q,   K = X W_K,   V = X W_V
- Calculate Attention Scores
  Compute the dot product between Q and K, and scale by sqrt(d_k):
  scores = Q K^T / sqrt(d_k)
- Apply Softmax
  Normalize the scores so that each row sums to 1:
  A = softmax(scores)
- Compute the Output
  Multiply the normalized scores with the Value matrix:
  Output = A V
Example: With three words (embeddings x_1, x_2, x_3), the computed pairwise similarities produce attention weights that aggregate the value vectors into a refined representation for each word, as in the sketch below.
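To make the four steps concrete, here is a minimal NumPy sketch for a toy three-token input. The matrix sizes and the randomly initialized weights are illustrative placeholders, not the paper's dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 4                       # toy sizes, not the paper's 512 / 64
X = rng.normal(size=(3, d_model))         # 3 token embeddings, e.g. "the", "cat", "sat"

# Step 1: learned projections (randomly initialized here)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Step 2: scaled dot-product scores
scores = Q @ K.T / np.sqrt(d_k)           # shape (3, 3): one row of scores per token

# Step 3: softmax so that each row sums to 1
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Step 4: weighted sum of the value vectors
output = weights @ V                      # refined representation for each token
print(weights.round(2))                   # the 3x3 attention weights
print(output.shape)                       # (3, 4)
```

With trained weights, the row for "cat" would place more mass on related words such as "sat" and "mat"; with random weights the pattern is arbitrary, but the shapes and the flow of the computation are the same.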
Attention Mechanism
Scaled Dot-Product Attention
The core attention mechanism is defined as:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
Where:
- Q, K, and V are the Query, Key, and Value matrices.
- d_k is the dimensionality of the keys.
Multi-Head Attention
Instead of applying a single attention function, the Transformer projects Q, K, and V into multiple subspaces and computes attention in parallel. The outputs are then concatenated, allowing each head to learn different relationships between words.
Example:
For the sentence “She went to the bank to deposit money”:
- One head might interpret “bank” as a financial institution.
- Another head might see “bank” in the context of a riverbank.
"Multi-head attention allows the model to jointly attend to information from different representation subspaces." (p. 4)
Positional Encoding
Since the Transformer model lacks recurrence, positional encodings are added to the token embeddings to capture word order.
"We hypothesized it would allow the model to easily learn to attend by relative positions." (p. 5)
How Positional Encodings Work
For each position pos and dimension i, the positional encoding is computed as:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
- Even indices use sine.
- Odd indices use cosine.
- d_model is the embedding dimension (512 in the base model).
These functions produce smooth, repeating patterns that help the model generalize to sequences of varying lengths.
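A small NumPy sketch of the sinusoidal encoding defined above; max_len and d_model here are arbitrary small values chosen for illustration:

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # the even dimensions 0, 2, 4, ...
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                  # even indices use sine
    pe[:, 1::2] = np.cos(angle)                  # odd indices use cosine
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)   # (50, 16); each row is added to the embedding at that position
```

Because the wavelengths form a geometric progression across dimensions, nearby positions receive similar encodings while distant ones diverge, which is what makes attending by relative position easy to learn.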
Why Self-Attention?
Self-attention excels for several reasons:
- Computational Efficiency:
- Faster than recurrent layers whenever the sequence length is smaller than the representation dimension, which is the common case in practice.
- Supports parallel processing.
- Reduced Path Length:
- The dependency path between any two positions is reduced to O(1), aiding gradient flow.
- Interpretability:
- Attention heads often reveal syntactic and semantic roles, making the model more interpretable.
Training the Transformer
Optimizer & Regularization
- The model utilizes the Adam optimizer with a specialized learning rate schedule: a linear warmup followed by inverse-square-root decay (see the sketch after this list).
- Dropout is applied to both attention and feed-forward layers.
- Label Smoothing (with value 0.1) discourages over-confident predictions during training; it hurts perplexity but improves accuracy and BLEU.
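A minimal sketch of the learning rate schedule described in the paper, which rises linearly during warmup and then decays with the inverse square root of the step number. The values warmup_steps = 4000 and d_model = 512 are the paper's; everything else is illustrative:

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)   # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate rises for the first 4000 steps, peaks, then decays as 1/sqrt(step).
for s in (1, 1000, 4000, 10000, 100000):
    print(s, round(transformer_lr(s), 6))
```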
Results
The Transformer established state-of-the-art performance in machine translation:
- English-to-German BLEU score: 28.4 (previous best: 26.3).
- English-to-French BLEU score: 41.8 (previous best: 41.16).
Lecture Notes
Recurrent Neural Networks
Before Transformers, Recurrent Neural Networks (RNNs) were commonly used for processing sequential data.
- Process:
  RNNs process a sequence sequentially, where each token's representation is based on the previous hidden state (see the sketch after this list).
- Limitations:
- Slow for long sequences.
- Prone to vanishing and exploding gradients.
- Difficulty in capturing long-range dependencies.
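For contrast with self-attention, the sketch below shows the sequential dependence that makes RNNs slow: each hidden state must wait for the previous one. The sizes and the simple tanh cell are illustrative, not a specific model from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, seq_len = 8, 16, 5
W_xh = rng.normal(size=(d_in, d_hidden)) * 0.1     # input-to-hidden weights
W_hh = rng.normal(size=(d_hidden, d_hidden)) * 0.1 # hidden-to-hidden weights

x = rng.normal(size=(seq_len, d_in))   # the input sequence
h = np.zeros(d_hidden)                 # initial hidden state

# This loop cannot be parallelized across time steps: h[t] depends on h[t-1].
for t in range(seq_len):
    h = np.tanh(x[t] @ W_xh + h @ W_hh)

print(h.shape)   # (16,) -- the final hidden state summarizes the whole sequence
```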
Input & Positional Embedding
Tokenization splits a sentence into tokens (words or sub-words). Each token is assigned:
- A unique input ID.
- A learned embedding vector (e.g., size 512).
Positional Embeddings are added to these token embeddings to encode word order:
input representation = token embedding + positional embedding
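A PyTorch sketch of this input pipeline. The vocabulary size, the token IDs, and the use of a learned positional embedding table are illustrative assumptions; the paper itself uses the fixed sinusoidal encoding shown earlier:

```python
import torch
import torch.nn as nn

d_model, vocab_size, max_len = 512, 10000, 128

tok_embed = nn.Embedding(vocab_size, d_model)   # one 512-dim vector per token ID
pos_embed = nn.Embedding(max_len, d_model)      # learned positional embeddings (assumption)

# Hypothetical token IDs for a 6-token sentence after tokenization.
input_ids = torch.tensor([[12, 845, 7, 2300, 4, 99]])        # shape (1, 6)
positions = torch.arange(input_ids.size(1)).unsqueeze(0)     # [[0, 1, 2, 3, 4, 5]]

x = tok_embed(input_ids) + pos_embed(positions)  # word meaning + word order
print(x.shape)   # torch.Size([1, 6, 512])
```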
Single Head and Multi-Head Attention
Single Head Self-Attention
- Process:
  - Compute Q = X W_Q, K = X W_K, and V = X W_V.
  - Calculate attention scores:
    scores = Q K^T / sqrt(d_k)
  - Scale, apply softmax, and multiply by V to get the final representation.
Multi-Head Self-Attention
- Split Q, K, and V into h subspaces, one per head (a code sketch follows this list).
- Compute Scaled Dot-Product Attention for each head:
  head_i = Attention(Q W_Q^i, K W_K^i, V W_V^i)
- Concatenate the outputs from all heads.
- Final Linear Transformation:
  MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O
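The split / attend / concatenate / project steps can be written out explicitly. A minimal PyTorch sketch with untrained random weights and batch size 1; the tensor layout and variable names are my own choices for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, h = 512, 8
d_k = d_model // h                     # 64 dimensions per head
x = torch.randn(1, 10, d_model)        # (batch, seq_len, d_model)

W_q, W_k, W_v, W_o = (nn.Linear(d_model, d_model, bias=False) for _ in range(4))

# 1. Project, then split into h heads: (batch, h, seq_len, d_k)
def split_heads(t):
    return t.view(1, -1, h, d_k).transpose(1, 2)

Q, K, V = split_heads(W_q(x)), split_heads(W_k(x)), split_heads(W_v(x))

# 2. Scaled dot-product attention per head (computed in parallel over the head axis)
scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
heads = F.softmax(scores, dim=-1) @ V              # (1, h, seq_len, d_k)

# 3. Concatenate the heads back to (batch, seq_len, d_model)
concat = heads.transpose(1, 2).contiguous().view(1, -1, d_model)

# 4. Final linear transformation
out = W_o(concat)
print(out.shape)   # torch.Size([1, 10, 512])
```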
Layer Normalization
Normalization helps stabilize training by ensuring consistent input distributions. For each feature vector x:
mu = mean(x),   sigma^2 = var(x),   x_hat = (x - mu) / sqrt(sigma^2 + epsilon)
Then, applying learnable parameters:
y = gamma * x_hat + beta
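A NumPy sketch of the two steps, normalize then rescale; the epsilon value and the toy input vector are illustrative:

```python
import numpy as np

def layer_norm(x: np.ndarray, gamma: np.ndarray, beta: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    mu = x.mean(axis=-1, keepdims=True)       # mean over the feature dimension
    var = x.var(axis=-1, keepdims=True)       # variance over the feature dimension
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalized features
    return gamma * x_hat + beta               # learnable scale and shift

x = np.array([[1.0, 2.0, 3.0, 4.0]])
print(layer_norm(x, gamma=np.ones(4), beta=np.zeros(4)))
# approximately [[-1.34, -0.45, 0.45, 1.34]]
```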
Decoder Overview
The decoder utilizes cross-attention to combine encoder outputs with its own masked self-attention:
- Masked Multi-Head Self-Attention: Prevents the decoder from “seeing” future tokens.
- Cross-Attention: Attends to encoder outputs.
- Feed Forward Network: Applies further transformations.
- Add & Norm Layers: Stabilize training.
During inference, the model generates tokens one-by-one in an auto-regressive manner.
Inference and Training
Training is parallelized:
- Input tokenization and positional encoding.
- The encoder processes the input sequence.
- The decoder uses a shifted target sequence (starting with <SOS>) and applies masked self-attention.
- The final output is generated via a linear layer followed by a softmax activation, with cross-entropy loss used for training (see the sketch after this list).
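A sketch of the training objective: logits from the final linear layer are compared against the target sequence shifted by one position, with label smoothing. The vocabulary size, the token IDs, and the random logits are placeholders; label_smoothing=0.1 matches the value used in the paper, and the softmax is applied implicitly inside the cross-entropy loss:

```python
import torch
import torch.nn as nn

vocab_size, seq_len = 10000, 6
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # 0.1 as in the paper

# The decoder input starts with <SOS>; the target is the same sequence shifted
# left by one, ending with <EOS>. The IDs here are placeholders.
decoder_input = torch.tensor([[1, 57, 209, 8, 644, 3]])   # [<SOS>, w1, w2, w3, w4, w5]
target        = torch.tensor([[57, 209, 8, 644, 3, 2]])   # [w1, w2, w3, w4, w5, <EOS>]

# Stand-in for the decoder output after the final linear layer (random, untrained);
# in a real model, decoder_input would be fed through masked self-attention first.
logits = torch.randn(1, seq_len, vocab_size)

loss = criterion(logits.view(-1, vocab_size), target.view(-1))
print(loss.item())
```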
Inference involves:
- Encoding the input.
- Iterative decoding (often enhanced with beam search) until an <EOS> token is produced (see the sketch after this list).
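A sketch of greedy decoding, i.e. beam search with beam width 1. Here `model`, `encode`, `decode`, and the special-token IDs are hypothetical names standing in for a trained Transformer, not a real API:

```python
import torch

SOS_ID, EOS_ID, MAX_LEN = 1, 2, 50   # hypothetical special-token IDs and length limit

@torch.no_grad()
def greedy_decode(model, src_ids: torch.Tensor) -> list[int]:
    """Generate target tokens one by one until <EOS>. `model` is a hypothetical
    object exposing .encode(src) and .decode(memory, tgt) -> logits."""
    memory = model.encode(src_ids)                 # the encoder runs only once
    tgt = torch.tensor([[SOS_ID]])                 # decoding starts from <SOS>
    for _ in range(MAX_LEN):
        logits = model.decode(memory, tgt)         # (1, tgt_len, vocab_size)
        next_id = logits[0, -1].argmax().item()    # pick the most probable next token
        tgt = torch.cat([tgt, torch.tensor([[next_id]])], dim=1)
        if next_id == EOS_ID:                      # stop once <EOS> is produced
            break
    return tgt[0].tolist()
```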
This post offers a comprehensive look at the Transformer model, preserving the detailed math necessary to grasp its innovative approach to sequence modeling.