Transformers Explained: A Beginner’s Guide to the Attention-Based Model

Understanding Transformer Architecture

Table of Contents:

1. Introduction
1.1. Understanding Transformer Architecture

2. Attention Is All You Need — Summary
2.1. Introduction
2.2. Background
2.3. Model Architecture
2.4. Attention Mechanism
2.5. Positional Encoding
2.6. Training and Optimization
2.7. Experiments and Results
2.8. Conclusion

3. Transformer Architecture Overview
3.1. Key Terms and Concepts

4. Transformer High-level Overview with an Example
4.1. Input
4.2. Tokenization
4.3. Embeddings
4.4. Positional Embeddings
4.5. Encoder
4.6. Decoder
4.7. Output

5. Understanding Word Embeddings
5.1. Example and Explanation

6. Positional Encodings

7. Transformer Decoder Block

8. Building a Multi-layer Decoder

9. Transformer with Real Vocabulary — Example

10. Transformer Encoder Block
10.1. Questions on Encoder Block

11. Masked Language Modeling (MLM) with BERT
11.1. Questions on MLM

12. Understanding the Math Behind Transformers
12.1. Embeddings
12.2. Positional Encoding
12.3. Attention Weighting
12.4. Multi-Head Attention
12.5. Feed-Forward Network (FFN)
12.6. Residual Connection
12.7. Layer Normalization
12.8. Softmax Activation
12.9. Dropout Regularization

13. Conclusion
13.1. Example: Translating “The quick brown fox jumps over the lazy dog” to Spanish

14. References

Introduction:

The Transformer architecture, introduced in the paper “Attention Is All You Need” by Vaswani et al. in 2017, has revolutionized the field of natural language processing (NLP). This powerful model has become the foundation for state-of-the-art NLP systems, enabling remarkable advancements in tasks such as machine translation, text generation, and sentiment analysis.

In this article, we’ll explore the core building blocks of the Transformer, including self-attention, encoder-decoder structure, multi-head attention, and positional encoding. We'll provide illustrative examples and step-by-step explanations to guide you through the flow of information within the Transformer.

Let's dive in.

Image by the Author

Attention Is All You Need — Summary:

The paper “Attention Is All You Need” was published in 2017 by Google researchers Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Here’s a summary of the key points from the paper:

1. Introduction:
— The paper proposes a new architecture called the Transformer, which relies solely on attention mechanisms and eliminates the need for recurrent or convolutional layers commonly used in sequence-to-sequence models.

2. Background:
— Previous sequence-to-sequence models often used recurrent neural networks (RNNs) or convolutional neural networks (CNNs) along with attention mechanisms.
— However, these models faced challenges such as difficulty in parallelization and capturing long-range dependencies.

3. Model Architecture:
— The Transformer architecture consists of an encoder and a decoder, both composed of multiple layers.
— Each layer in the encoder and decoder includes a multi-head self-attention mechanism and a position-wise fully connected feed-forward network.
— The encoder maps the input sequence to a continuous representation, while the decoder generates the output sequence one element at a time.

4. Attention Mechanism:
— The paper introduces scaled dot-product attention, which computes attention weights from the dot products of the query and key vectors, scaled by the square root of the key dimension (a minimal code sketch appears just after this summary).
— Multi-head attention is used to allow the model to jointly attend to information from different representation subspaces.
— The attention mechanism enables the model to capture dependencies between input and output sequences without relying on recurrent or convolutional operations.

5. Positional Encoding:
— Since the Transformer does not include recurrence or convolution, it incorporates positional encodings to inject information about the relative or absolute position of the tokens in the sequence.
— The positional encodings are added to the input embeddings and allow the model to learn and utilize positional information.

6. Training and Optimization:
— The Transformer model is trained using the standard Adam optimizer with a custom learning rate schedule.
— The training objective is to minimize the cross-entropy loss between the predicted and target sequences.
— The paper also employs label smoothing as a regularization technique to improve the model’s generalization.

7. Experiments and Results:
— The Transformer model is evaluated on machine translation and English constituency parsing.
— The model achieves state-of-the-art performance on the WMT 2014 English-to-German and English-to-French translation tasks.
— The Transformer also outperforms previous models on the English constituency parsing task.

8. Conclusion:
— The Transformer architecture demonstrates the effectiveness of relying solely on attention mechanisms for sequence-to-sequence tasks.
— The self-attention mechanism allows for more parallelization and captures long-range dependencies effectively.
— The Transformer has become a foundational architecture for many subsequent advancements in natural language processing tasks.
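To make point 4 concrete, here is a minimal sketch of scaled dot-product attention in PyTorch. It is a simplified, single-head version written only for illustration (no learned projection matrices), not the full multi-head module used later in this article:

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    # query, key, value: (sequence_length, d_k)
    d_k = query.size(-1)
    # Attention scores: dot products of queries with keys, scaled by sqrt(d_k)
    scores = query @ key.transpose(-2, -1) / (d_k ** 0.5)
    # Softmax turns the scores into weights that sum to 1 over the keys
    weights = F.softmax(scores, dim=-1)
    # The output is a weighted sum of the value vectors
    return weights @ value

x = torch.randn(5, 8)  # 5 tokens, d_k = 8
print(scaled_dot_product_attention(x, x, x).shape)  # torch.Size([5, 8])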

The paper: “Attention Is All You Need” (Vaswani et al., 2017)

Image credit: Attention Is All You Need

Transformer:

- The transformer is a neural network architecture used for processing sequential data, such as natural language.

- It consists of an encoder and a decoder, which are composed of multiple layers.

- The encoder takes the input sequence (e.g., a sentence) and generates a representation of it.
— The input is first converted into embeddings, which are dense vector representations of each word or token.
— Positional encodings are added to the embeddings to provide information about the position of each word in the sequence.
— The encoder applies a self-attention mechanism to the input embeddings.
— Self-attention allows the model to weigh the importance of different words in the sequence when generating the representation.
— It computes attention scores between each word and every other word in the sequence.
— The attention scores are used to create a weighted sum of the word embeddings, generating context-aware representations.
— The encoder also includes a feed-forward neural network layer to further process the representations.
— The encoder can have multiple layers, each consisting of self-attention and feed-forward components.

- The decoder generates the output sequence (e.g., a translated sentence) based on the encoder’s representation.
— It also uses word embeddings and positional encodings for the output sequence.
— The decoder applies self-attention to the output embeddings, allowing it to consider the previously generated words.
— It also applies encoder-decoder attention, which attends to the encoder’s representations to gather relevant information for generating each output word.
— The decoder includes a feed-forward neural network layer similar to the encoder.
— The decoder can have multiple layers, each consisting of self-attention, encoder-decoder attention, and feed-forward components.

- The transformer uses multi-head attention, which allows the model to attend to different aspects of the input simultaneously.
— Instead of a single attention mechanism, multi-head attention computes multiple attention heads in parallel.
— Each attention head captures different relationships and patterns in the input.
— The outputs of the attention heads are concatenated and linearly transformed to obtain the final representations.

- The transformer architecture also includes residual connections and layer normalization.
— Residual connections help in gradient flow and allow the model to learn deeper representations.
— Layer normalization helps stabilize the training process and improves the model’s performance.

- The transformer is trained using techniques like masked language modeling (MLM) and next sentence prediction (NSP).
— MLM involves masking random words in the input and training the model to predict the masked words based on the surrounding context.
— NSP involves training the model to predict whether two sentences follow each other in the original text (see the short sketch after this list).

- The transformer architecture has been widely adopted and has given rise to powerful language models like BERT, GPT, and their variants.
— These models are pre-trained on large amounts of text data and can be fine-tuned for specific tasks like sentiment analysis, named entity recognition, and question answering.
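Since next sentence prediction is not revisited later in this article, here is a short, hedged sketch of how NSP looks with the Hugging Face transformers library (BertForNextSentencePrediction is the relevant class; the two sentences are made up for illustration):

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
nsp_model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "I love cats."
sentence_b = "They are my favorite pets."
inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
with torch.no_grad():
    logits = nsp_model(**inputs).logits

# Index 0 = "sentence B follows sentence A", index 1 = "sentence B is random"
print(torch.softmax(logits, dim=-1))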

Please watch the videos below before jumping into the code:

The authors of the Transformer paper: 8 Google Employees Invented Modern AI. Here’s the Inside Story

Transformer High-level Overview:

Image by the Author

Let’s say we want to translate the sentence “I love cats” from English to French using a transformer.

1. Input:
— The input sentence is “I love cats.”

2. Tokenization:
— The sentence is split into tokens: [“I”, “love”, “cats”]

3. Embeddings:
— Each token is converted into a vector (a list of numbers) called an embedding.
— Example: “I” → [0.2, 0.5, 0.1], “love” → [0.7, 0.2, 0.3], “cats” → [0.4, 0.1, 0.8]

4. Positional Embeddings:
— Positional embeddings are added to the token embeddings to represent the position of each token in the sentence.
— Example: “I” → [0.2, 0.5, 0.1] + [0.1, 0.2, 0.3], “love” → [0.7, 0.2, 0.3] + [0.4, 0.5, 0.6], “cats” → [0.4, 0.1, 0.8] + [0.7, 0.8, 0.9]

5. Encoder:
— The encoder processes the input embeddings through multiple layers.
— Each encoder layer has two main components:
a. Multi-Head Attention:
— The tokens attend to each other to understand the relationships between them.
— Example: “I” attends to “love” and “cats” to understand the context.
b. Feedforward Neural Network:
— The attended representations are passed through a feedforward neural network to transform the information.
— Residual connections and layer normalization are used between the components to facilitate training.

6. Decoder:
— The decoder generates the output sentence in the target language (French).
— It works similarly to the encoder but has an additional component called Encoder-Decoder Attention.
— The decoder attends to the encoder’s output to gather relevant information for generating the output tokens.

7. Output:
— The decoder generates the output tokens one by one.
— Example: “J’” → “aime” → “les” → “chats”
— The final output is the translated sentence: “J’aime les chats” (I love cats in French).

In summary, the transformer takes the input sentence, converts it into embeddings, and processes it through the encoder. The encoder uses multi-head attention and feedforward networks to understand the relationships between the tokens. The decoder then generates the output sentence by attending to the encoder’s output and the previously generated tokens.
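Using the toy numbers from steps 3 and 4 above, the vectors the encoder actually sees are simply the element-wise sums of the token and positional embeddings (a tiny sketch; the example values are illustrative, not real embeddings):

import torch

token_embeddings = torch.tensor([[0.2, 0.5, 0.1],   # "I"
                                 [0.7, 0.2, 0.3],   # "love"
                                 [0.4, 0.1, 0.8]])  # "cats"
positional_embeddings = torch.tensor([[0.1, 0.2, 0.3],
                                      [0.4, 0.5, 0.6],
                                      [0.7, 0.8, 0.9]])
print(token_embeddings + positional_embeddings)
# tensor([[0.3000, 0.7000, 0.4000],
#         [1.1000, 0.7000, 0.9000],
#         [1.1000, 0.9000, 1.7000]])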

Key Terms and Concepts:

1. Encoder:
— The encoder is the component of the transformer that processes the input sequence.
— It takes the input embeddings and generates a representation of the input.
— The encoder consists of multiple layers, each performing self-attention and feed-forward processing.

2. Decoder:
— The decoder is the component of the transformer that generates the output sequence.
— It takes the encoder’s output and generates the output tokens one at a time.
— The decoder also consists of multiple layers, each performing self-attention, encoder-decoder attention, and feed-forward processing.

3. Self-Attention:
— Self-attention is a mechanism that allows the model to attend to different positions of the input sequence.
— It computes attention scores between each position and every other position in the sequence.
— The attention scores determine the importance of each position when generating the output.
— Self-attention enables the model to capture dependencies and relationships within the input sequence.

4. Encoder-Decoder Attention:
— Encoder-decoder attention is a mechanism used in the decoder to attend to the encoder’s output.
— It allows the decoder to focus on relevant parts of the encoder’s representation when generating each output token.
— The decoder computes attention scores between each position in the decoder and every position in the encoder’s output.

5. Multi-Head Attention:
— Multi-head attention is an extension of the attention mechanism that allows the model to attend to different representations of the input simultaneously.
— Instead of a single attention function, multi-head attention computes multiple attention heads in parallel.
— Each attention head captures different relationships and patterns in the input.
— The outputs of the attention heads are concatenated and linearly transformed to obtain the final attention output (see the shape-check sketch after this list).

6. Positional Encoding:
— Positional encoding is a technique used to inject information about the position of each token in the input sequence.
— Since the transformer architecture does not have inherent knowledge of token order, positional encodings are added to the input embeddings.
— The positional encodings are typically sine and cosine functions of different frequencies, allowing the model to learn positional information.

7. Feed-Forward Network (FFN):
— The feed-forward network is a component used in both the encoder and decoder layers.
— It consists of two linear layers with a non-linear activation function (usually ReLU) in between.
— The FFN processes the attention outputs and applies non-linear transformations to capture complex patterns.

8. Residual Connection:
— Residual connections are used in the transformer architecture to facilitate the flow of information and gradients.
— They connect the input of a layer directly to its output, allowing the model to learn identity functions and prevent vanishing gradients.
— Residual connections are typically added after the self-attention and FFN components in each layer.

9. Layer Normalization:
— Layer normalization is a technique used to normalize the activations of a layer.
— It normalizes the inputs across the features (i.e., across the embedding dimension) for each layer.
— Layer normalization helps stabilize the training process and improves the model’s performance.

10. Dropout:
— Dropout is a regularization technique used to prevent overfitting in neural networks.
— It randomly sets a fraction of the input units to zero during training, which helps the model learn more robust representations.
— Dropout is applied to the outputs of the self-attention and FFN components in the transformer.

11. Masked Language Modeling (MLM):
— Masked language modeling is a pre-training task used to train transformer-based models like BERT.
— In MLM, a fraction of the input tokens are randomly masked, and the model is trained to predict the original tokens based on the surrounding context.
— MLM helps the model learn rich representations of the input language.

12. Next Sentence Prediction (NSP):
— Next sentence prediction is another pre-training task used in some transformer models like BERT.
— In NSP, the model is given two sentences and is trained to predict whether the second sentence follows the first sentence in the original text.
— NSP helps the model learn relationships between sentences and can be useful for tasks like question answering and natural language inference.
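To make terms 5 and 9 more tangible, here is a small shape-check sketch using PyTorch's built-in modules (the sizes are arbitrary and chosen only for illustration):

import torch
import torch.nn as nn

embed_dim, num_heads, seq_len, batch = 16, 4, 6, 1

# Multi-head attention: 4 heads, each working on 16/4 = 4 dimensions,
# whose outputs are concatenated back to a 16-dimensional vector per token
attention = nn.MultiheadAttention(embed_dim, num_heads)
x = torch.randn(seq_len, batch, embed_dim)  # (sequence, batch, embedding)
attn_output, attn_weights = attention(x, x, x)
print(attn_output.shape)   # torch.Size([6, 1, 16])
print(attn_weights.shape)  # torch.Size([1, 6, 6]) (averaged over heads)

# Layer normalization: normalizes each token's 16 features to zero mean, unit variance
layer_norm = nn.LayerNorm(embed_dim)
normalized = layer_norm(x + attn_output)  # residual connection + layer norm
print(normalized.shape)    # torch.Size([6, 1, 16])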

Understanding Word Embeddings:

Image by the Author
import torch
import torch.nn as nn
import math
import numpy as np
import matplotlib.pyplot as plt
import seaborn

# Define an example sentence and create a word-to-index mapping
example_sentence = "Transformers are a powerful deep learning architecture"
word_to_index = {word: idx for idx, word in enumerate(set(example_sentence.split()))}
print(word_to_index)

# Convert the words to their corresponding indices
input_indices = torch.tensor([word_to_index[word] for word in example_sentence.split()])
print(input_indices)

# Function to get word embeddings
def get_word_embeddings(input_indices, embedding_dimension):
    # Create an embedding layer
    embedding_layer = nn.Embedding(input_indices.max() + 1, embedding_dimension)
    # Convert input indices to embeddings
    return embedding_layer(input_indices)

# Get the word embeddings
embedding_dimension = 16
word_embeddings = get_word_embeddings(input_indices, embedding_dimension)
print(word_embeddings)
{'architecture': 0, 'are': 1, 'deep': 2, 'a': 3, 'Transformers': 4, 'learning': 5, 'powerful': 6}
tensor([4, 1, 3, 6, 2, 5, 0])
tensor([[ 0.7307, -1.8295, -1.3415, -0.1667, -2.0778, 0.0379, 2.5125, -0.3417,
1.5476, -3.0841, 0.0403, 0.5197, 1.4371, -1.3395, 0.7904, 0.2756],
[-1.1684, -0.8106, -1.7614, 2.1557, -1.1091, -0.7754, 0.9326, 0.2574,
1.5657, 0.1843, 0.6166, 1.1669, -0.7471, 1.2733, -1.5429, -0.0356],
[-0.2008, -0.1224, -1.6218, -1.0898, -1.8558, -0.8685, 0.4534, 0.5661,
-0.9941, 0.1106, -2.2490, 0.3350, 1.3688, 0.3328, 0.5519, -1.0396],
[-0.6339, 2.0293, -0.8659, 1.2129, -0.0095, -2.2154, 1.4974, -1.6500,
-0.9725, -0.1531, -0.9813, 0.1600, 0.0884, 0.8129, 0.2332, 0.3755],
[-0.0713, 0.6791, -0.5991, 0.4398, 0.1938, 0.6582, 1.4959, -0.4695,
-1.1536, -0.0134, 0.0583, 0.5929, 0.0181, -0.4272, -0.7521, 0.2874],
[ 0.1824, 0.3780, -0.5110, -1.1229, -0.0886, 0.3397, -0.0549, -0.1871,
-1.6917, 0.5702, 1.1167, 0.7260, -0.5458, -0.1543, 0.5022, 0.3850],
[-1.6483, -0.2890, -0.4969, 1.0841, 0.5940, -0.8370, -0.8878, 0.5833,
1.3709, -0.9165, 0.2592, -0.3366, 1.4882, -0.2669, -1.0840, -0.2538]],
grad_fn=<EmbeddingBackward0>)

Let me explain what’s happening here:

  1. We define an example sentence to work with: “Transformers are a powerful deep learning architecture”. This is the sentence we’ll be encoding.
  2. We create a word_to_index dictionary that maps each unique word in the sentence to a unique index. This allows us to represent words as numbers that can be used by the model. We print out this dictionary to see the mappings.
  3. We convert the words in the sentence to their corresponding indices using the word_to_index mapping. This gives us a tensor input_indices that represents the sentence as a sequence of indices.
  4. We define a function get_word_embeddings that takes input_indices and embedding_dimension as arguments. This function converts the word indices into dense vector representations (embeddings).
  • Inside the function, we create an nn.Embedding layer. This layer takes two arguments: the number of unique words (obtained by taking the maximum index in input_indices and adding 1), and the desired dimensionality of the embeddings (embedding_dimension).
  • We pass the input_indices through this embedding layer to get the actual word embeddings.
  5. We specify the desired embedding_dimension as 16. This means each word will be represented by a 16-dimensional vector.
  6. We call the get_word_embeddings function with input_indices and embedding_dimension, and store the result in word_embeddings.
  7. Finally, we print out the word_embeddings tensor to see the dense vector representations of the words.
Image by the Author

The key idea here is that we’re converting the words into dense vectors (embeddings) that capture semantic meaning and relationships between words. These embeddings will serve as the input to the transformer model. (Check the math behind transformers section below for more details on embeddings.)

Positional Encodings:

Image by the Author
# Function to generate positional encodings
def get_positional_encoding(sequence_length, embedding_dimension):
    # Create a matrix of positions
    positions = np.arange(sequence_length)[:, np.newaxis]
    # Calculate the denominator term for the sinusoidal function
    denominator_term = np.exp(np.arange(0, embedding_dimension, 2) * -(np.log(10000.0) / embedding_dimension))
    # Calculate the positional encodings
    positional_encodings = np.zeros((sequence_length, embedding_dimension))
    positional_encodings[:, 0::2] = np.sin(positions * denominator_term)
    positional_encodings[:, 1::2] = np.cos(positions * denominator_term)
    return torch.tensor(positional_encodings, dtype=torch.float)

# Function to plot a heatmap
def plot_heatmap(data, title):
    plt.figure(figsize=(5, 5))
    seaborn.heatmap(data, cmap="viridis", vmin=-1, vmax=1)
    plt.title(title)
    plt.show()

# Generate positional encodings
sequence_length = len(example_sentence.split())
positional_encodings = get_positional_encoding(sequence_length, embedding_dimension)
plot_heatmap(positional_encodings, "Positional Encodings")

Here’s what’s happening:

  1. We define a function get_positional_encoding that takes the sequence_length (number of words in the sentence) and the embedding_dimension as arguments. This function will generate the positional encodings.
  2. Inside the function:
  • We create a matrix positions using np.arange(sequence_length). This gives us a range of positions from 0 to sequence_length - 1. We add a new axis to this matrix using [:, np.newaxis] to make it a column vector.
  • We calculate the denominator term for the sinusoidal function. This term is used to create unique encodings for each position. It’s calculated using np.exp(np.arange(0, embedding_dimension, 2) * -(np.log(10000.0) / embedding_dimension)).
  • We initialize a matrix positional_encodings of shape (sequence_length, embedding_dimension) filled with zeros.
  • We fill the even indices of positional_encodings with the sine of the positions multiplied by the denominator term.
  • We fill the odd indices of positional_encodings with the cosine of the positions multiplied by the denominator term.
  • Finally, we convert the positional_encodings to a PyTorch tensor and return it.
  3. We define a helper function plot_heatmap to visualize the positional encodings as a heatmap.
  4. We calculate the sequence_length by counting the number of words in the example sentence.
  5. We generate the positional encodings by calling get_positional_encoding with sequence_length and embedding_dimension.
  6. We plot the positional encodings using the plot_heatmap function.
Image by the Author

The purpose of positional encodings is to give the transformer model information about the position of each word in the sentence. Since the transformer architecture doesn’t inherently capture the order of the words, positional encodings are added to the word embeddings to inject this positional information.

The sinusoidal functions used for positional encodings have the property that each position has a unique encoding, and the model can learn to attend to relative positions easily.
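For reference, the sinusoidal encoding used in the original paper (and computed by the code above) is:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where pos is the position of the token in the sequence, i indexes pairs of embedding dimensions, and d_model is the embedding dimension (16 in our example). The denominator_term in the code is exactly the 10000^(2i / d_model) factor written in exponential form.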

Add the word embeddings and positional encodings:

Image by the Author
# Add word embeddings and positional encodings
final_embeddings = word_embeddings + positional_encodings
print(final_embeddings)
tensor([[ 0.5863, -0.6661,  0.0167,  2.6542,  1.1245,  0.3778,  0.6896,  1.4321,
-1.5676, 2.8123, -2.8880, 1.4742, -0.8564, 1.7869, -0.3574, 1.3814],
[ 1.3501, 1.4485, 0.1985, -0.7612, 0.6105, -0.2798, 1.5297, 0.5854,
-2.1236, 0.6867, 1.0052, 2.0823, -0.5564, 1.3908, 0.3202, 0.1098],
[-0.1334, 0.4834, 0.6642, -0.3959, 0.6143, -0.0725, -0.2075, 0.2209,
-0.6934, 2.5294, 0.2628, 0.6040, -0.7490, 0.2714, 1.0503, 1.2154],
[ 0.0840, -0.4689, -0.5019, 1.6178, 0.7040, -0.5275, -1.2964, 3.4568,
-0.9409, -0.6549, -0.4368, 1.4268, -1.1184, 3.0026, 0.2321, 0.9161],
[-2.3014, -1.8047, 0.8805, 0.8195, -1.6781, 0.3765, -1.1197, 0.9758,
-0.0973, 0.1430, 0.2548, 0.6830, -0.1079, 2.9859, -0.7838, -1.1137],
[-0.5497, -0.0602, 1.5168, 1.1369, 0.0073, 2.4789, 1.1451, 0.9991,
0.4892, 0.7323, 0.8595, -0.4209, 0.4353, 1.4884, -1.7009, 0.6675],
[ 0.1927, 0.8313, -1.3919, 0.2572, 1.5193, 0.5110, 0.1726, 1.6686,
0.6747, 0.8584, 1.9237, 1.5792, -1.5368, 0.9396, -0.2742, 0.3545]],
grad_fn=<AddBackward0>)

The final_embeddings tensor now contains the sum of the word embeddings and positional encodings. This serves as the input to the transformer model, providing it with both semantic information (from the word embeddings) and positional information (from the positional encodings).

Image by the Author

Building a decoder:

Image by the Author
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    def __init__(self, embedding_dimension, num_attention_heads, feedforward_dimension, dropout_rate):
        super(DecoderBlock, self).__init__()

        self.multihead_attention = nn.MultiheadAttention(embedding_dimension, num_attention_heads, dropout=dropout_rate)
        self.layer_norm1 = nn.LayerNorm(embedding_dimension)
        self.dropout1 = nn.Dropout(dropout_rate)
        self.feedforward1 = nn.Linear(embedding_dimension, feedforward_dimension)
        self.feedforward2 = nn.Linear(feedforward_dimension, embedding_dimension)
        self.layer_norm2 = nn.LayerNorm(embedding_dimension)
        self.dropout2 = nn.Dropout(dropout_rate)

    def forward(self, input_embeddings, attention_mask):
        # Masked multi-head self-attention (the mask blocks attention to future positions)
        attention_output, _ = self.multihead_attention(input_embeddings, input_embeddings, input_embeddings, attn_mask=attention_mask)
        # Residual connection + layer normalization
        residual_output = input_embeddings + self.dropout1(attention_output)
        normalized_output1 = self.layer_norm1(residual_output)
        # Position-wise feed-forward network, then another residual connection + layer norm
        feedforward_output = self.feedforward2(F.relu(self.feedforward1(normalized_output1)))
        residual_output = normalized_output1 + self.dropout2(feedforward_output)
        normalized_output2 = self.layer_norm2(residual_output)
        return normalized_output2

Let’s break it down:

  1. We define a DecoderBlock class that inherits from nn.Module. This class represents a single layer (block) of the transformer decoder.
  2. The __init__ method takes several parameters:
  • embedding_dimension: The size of the input embeddings.
  • num_attention_heads: The number of attention heads in the multi-head attention mechanism.
  • feedforward_dimension: The size of the hidden layer in the feedforward neural network.
  • dropout_rate: The dropout rate to be applied.
  3. Inside the __init__ method, we define the components of the decoder block:
  • multihead_attention: An instance of nn.MultiheadAttention that performs multi-head self-attention on the input embeddings.
  • layer_norm1: An instance of nn.LayerNorm for layer normalization after the attention mechanism.
  • dropout1: An instance of nn.Dropout for applying dropout after the attention mechanism.
  • feedforward1: The first linear layer of the feedforward neural network.
  • feedforward2: The second linear layer of the feedforward neural network.
  • layer_norm2: Another instance of nn.LayerNorm for layer normalization after the feedforward neural network.
  • dropout2: Another instance of nn.Dropout for applying dropout after the feedforward neural network.
  4. The forward method defines the forward pass of the decoder block. It takes two inputs:
  • input_embeddings: The input embeddings to the decoder block.
  • attention_mask: The attention mask to prevent attending to future positions.
  5. Inside the forward method:
  • We apply multi-head self-attention to the input embeddings using the multihead_attention module. The input embeddings serve as the query, key, and value for the attention mechanism. The attention_mask is used to mask out future positions.
  • We add the input embeddings to the attention output and apply dropout using dropout1. This is the residual connection.
  • We apply layer normalization to the residual output using layer_norm1.
  • We pass the normalized output through the feedforward neural network, which consists of two linear layers (feedforward1 and feedforward2) with a ReLU activation in between.
  • We add the normalized output to the feedforward output and apply dropout using dropout2. This is another residual connection.
  • We apply layer normalization to the final residual output using layer_norm2.
  • Finally, we return the normalized output.
Image by the Author
Image by the Author

The key ideas here are:

  • The decoder block consists of a multi-head attention mechanism followed by a feedforward neural network.
  • Residual connections are used to add the input embeddings to the outputs of the attention and feedforward layers. This helps in preserving information and facilitating gradient flow.
  • Layer normalization is applied after each residual connection to normalize the activations and stabilize training.
  • Dropout is applied after the attention and feedforward layers to regularize the model and prevent overfitting.

This decoder block can be stacked multiple times to form the complete transformer decoder.
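To make the shapes concrete, here is a quick smoke test of the DecoderBlock defined above. The sizes are arbitrary, and the causal mask is built inline (the article defines an equivalent generate_square_subsequent_mask helper later):

seq_len, batch, embed_dim = 7, 1, 16
block = DecoderBlock(embedding_dimension=embed_dim, num_attention_heads=4,
                     feedforward_dimension=64, dropout_rate=0.1)

x = torch.randn(seq_len, batch, embed_dim)  # nn.MultiheadAttention expects (sequence, batch, embedding)
# Causal mask: True marks positions a query is NOT allowed to attend to
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

output = block(x, causal_mask)
print(output.shape)  # torch.Size([7, 1, 16])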

Let’s move on to the positional encoding and the overall transformer decoder:

class PositionalEncoding(nn.Module):
    def __init__(self, embedding_dimension, dropout_rate=0.1, max_sequence_length=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout_rate)

        positional_encodings = torch.zeros(max_sequence_length, embedding_dimension)
        positions = torch.arange(0, max_sequence_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embedding_dimension, 2).float() * (-math.log(10000.0) / embedding_dimension))
        positional_encodings[:, 0::2] = torch.sin(positions * div_term)
        positional_encodings[:, 1::2] = torch.cos(positions * div_term)
        positional_encodings = positional_encodings.unsqueeze(0).transpose(0, 1)
        self.register_buffer("positional_encodings", positional_encodings)

    def forward(self, input_embeddings):
        input_embeddings = input_embeddings + self.positional_encodings[:input_embeddings.size(0), :]
        return self.dropout(input_embeddings)

class TransformerDecoder(nn.Module):
    def __init__(self, vocab_size, embedding_dimension, num_attention_heads, feedforward_dimension, dropout_rate, num_decoder_layers):
        super(TransformerDecoder, self).__init__()

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dimension)
        self.positional_encoding = PositionalEncoding(embedding_dimension, dropout_rate)
        self.decoder_layers = nn.ModuleList([
            DecoderBlock(embedding_dimension, num_attention_heads, feedforward_dimension, dropout_rate)
            for _ in range(num_decoder_layers)
        ])
        self.output_linear = nn.Linear(embedding_dimension, vocab_size)
        self.output_softmax = nn.LogSoftmax(dim=-1)

    def forward(self, input_indices):
        embeddings = self.word_embeddings(input_indices)
        embeddings_with_positions = self.positional_encoding(embeddings)
        decoder_output = embeddings_with_positions
        for decoder_layer in self.decoder_layers:
            attention_mask = generate_square_subsequent_mask(decoder_output.size(0))
            decoder_output = decoder_layer(decoder_output, attention_mask)
        linear_output = self.output_linear(decoder_output)
        softmax_output = self.output_softmax(linear_output)
        return softmax_output

Here’s what’s happening:

  1. The PositionalEncoding class is defined to generate positional encodings.
  • In the __init__ method, we calculate the positional encodings using sinusoidal functions.
  • The positional encodings are stored as a buffer in the module.
  • In the forward method, the positional encodings are added to the input embeddings.
  2. The TransformerDecoder class represents the complete transformer decoder.
  • It consists of an embedding layer (word_embeddings) to convert input indices to embeddings.
  • The positional_encoding module is used to add positional information to the embeddings.
  • Multiple decoder layers (decoder_layers) are stacked using nn.ModuleList.
  • An output linear layer (output_linear) is used to project the decoder output to the vocabulary size.
  • A log-softmax layer (output_softmax) is applied to obtain the final output log-probabilities.
  3. In the forward method of TransformerDecoder:
  • The input indices are passed through the embedding layer to get the word embeddings.
  • Positional encodings are added to the embeddings using the positional_encoding module.
  • The embeddings with positional information are passed through each decoder layer.
  • An attention mask is generated using the generate_square_subsequent_mask function to prevent attending to future positions.
  • The output of the last decoder layer is passed through the output linear layer and log-softmax to obtain the final output log-probabilities.

The key ideas here are:

  • The transformer decoder consists of an embedding layer, positional encoding, multiple decoder layers, and output layers.
  • The positional encoding module adds positional information to the word embeddings.
  • The decoder layers are stacked sequentially, and each layer applies multi-head attention and feedforward processing.
  • The attention mask is used to prevent the decoder from attending to future positions during training, ensuring that the predictions are based only on the current and previous positions.
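The forward pass above calls generate_square_subsequent_mask, which is only defined later in the article (in the real-vocabulary example). For reference, it builds a boolean causal mask where True marks the blocked (future) positions; for a 4-token sequence it looks like this:

def generate_square_subsequent_mask(sequence_length):
    mask = torch.triu(torch.ones(sequence_length, sequence_length), diagonal=1).bool()
    return mask

print(generate_square_subsequent_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])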
Image by the Author

Building a multi-layer decoder:

Image by the Author
class MultiLayerTransformerDecoder(nn.Module):
    def __init__(self, vocab_size, embedding_dimension, num_attention_heads, feedforward_dimension, dropout_rate, num_decoder_layers):
        super(MultiLayerTransformerDecoder, self).__init__()

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dimension)
        self.positional_encoding = PositionalEncoding(embedding_dimension, dropout_rate)
        self.decoder_layers = nn.ModuleList([
            DecoderBlock(embedding_dimension, num_attention_heads, feedforward_dimension, dropout_rate)
            for _ in range(num_decoder_layers)
        ])
        self.output_linear = nn.Linear(embedding_dimension, vocab_size)
        self.output_softmax = nn.LogSoftmax(dim=-1)

    def forward(self, input_indices):
        embeddings = self.word_embeddings(input_indices)
        embeddings_with_positions = self.positional_encoding(embeddings)
        decoder_output = embeddings_with_positions
        for decoder_layer in self.decoder_layers:
            attention_mask = generate_square_subsequent_mask(decoder_output.size(0))
            decoder_output = decoder_layer(decoder_output, attention_mask)
        linear_output = self.output_linear(decoder_output)
        softmax_output = self.output_softmax(linear_output)
        return softmax_output

The MultiLayerTransformerDecoder class is similar to the TransformerDecoder class from the previous section, but it allows for multiple decoder layers.

  • The __init__ method takes a num_decoder_layers parameter to specify how many DecoderBlock layers are stacked.
  • The decoder_layers attribute is defined as an nn.ModuleList that contains num_decoder_layers instances of DecoderBlock.
  • In the forward method, the input embeddings are passed through each decoder layer in sequence, with an attention mask generated for each layer.

The rest of the code remains the same as in the previous section.

Now, let’s move on to adding a real vocabulary to the model:

import torch
import torch.nn as nn
import time  # Import the time module

def generate_square_subsequent_mask(sequence_length):
    mask = torch.triu(torch.ones(sequence_length, sequence_length), diagonal=1).bool()
    return mask


# Define the hyperparameters and the vocabulary
embedding_dimension = 100
num_attention_heads = 1
feedforward_dimension = 4 * embedding_dimension
dropout_rate = 0.1
num_decoder_layers = 4
context_length = 5
batch_size = 1
vocabulary = ["transformer", "deep", "learning", "natural", "language", "processing", "model", "architecture"]
vocab_size = len(vocabulary)

# Create dictionaries to map words to indices and vice versa
word_to_index = {word: index for index, word in enumerate(vocabulary)}
index_to_word = {index: word for index, word in enumerate(vocabulary)}

# Initialize the model
model = MultiLayerTransformerDecoder(vocab_size, embedding_dimension, num_attention_heads,
                                     feedforward_dimension, dropout_rate, num_decoder_layers)

# Create an example sequence
sequence = ["transformer", "architecture", "natural", "language", "processing"][:context_length]
input_indices = torch.tensor([[word_to_index[word] for word in sequence]])

# Generate a sequence of words
generated_words = []
for _ in range(10):  # Generate 10 words
    output = model(input_indices)
    predicted_index = output.argmax(dim=-1)[0, -1]  # Get the predicted word index
    predicted_word = index_to_word[predicted_index.item()]
    print(predicted_word, end=" ")
    generated_words.append(predicted_word)
    input_indices = torch.cat([input_indices, predicted_index.unsqueeze(0).unsqueeze(0)], dim=-1)  # Append the predicted word to the input
    time.sleep(0.75)  # Pause for 0.75 seconds
processing natural language language language model language language language language 

In this section, we define a real vocabulary for the model and generate a sequence of words. Note that the model is randomly initialized and has not been trained, so the generated sequence is not meaningful; the goal is simply to demonstrate the generation loop.

  • We define the hyperparameters and the vocabulary as a list of words.
  • We create dictionaries word_to_index and index_to_word to map words to indices and vice versa.
  • We initialize the MultiLayerTransformerDecoder model with the specified hyperparameters and vocabulary size.
  • We create an example sequence of words and convert it to input indices using the word_to_index mapping.
  • We generate a sequence of words by iteratively passing the input indices through the model, obtaining the predicted word index, converting it back to a word using index_to_word, and appending the predicted word to the input indices for the next iteration.
  • We print the generated words with a pause of 0.75 seconds between each word.
Image by the Author

Let's see an example with pre-trained GPT-2 models:

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
import time

# Load pre-trained models and tokenizers from Hugging Face Model Hub
tokenizer_small = GPT2Tokenizer.from_pretrained("gpt2")
model_small = GPT2LMHeadModel.from_pretrained("gpt2")

tokenizer_large = GPT2Tokenizer.from_pretrained("gpt2-xl")
model_large = GPT2LMHeadModel.from_pretrained("gpt2-xl")

# Define a prompt
prompt = "Transformers have revolutionized the field of"

# Generate text with GPT-2 Small
inputs_small = tokenizer_small.encode(prompt, return_tensors="pt")
attention_mask_small = torch.ones(inputs_small.shape, dtype=torch.long)
pad_token_id_small = tokenizer_small.eos_token_id

print(prompt, end=" ", flush=True)
for _ in range(10):  # Generate 10 words
    outputs_small = model_small.generate(inputs_small, max_length=inputs_small.shape[-1] + 1, do_sample=True,
                                         pad_token_id=pad_token_id_small, attention_mask=attention_mask_small)
    generated_word = tokenizer_small.decode(outputs_small[0][-1])
    print(generated_word, end=" ", flush=True)
    inputs_small = torch.cat([inputs_small, outputs_small[0][-1].unsqueeze(0).unsqueeze(0)], dim=-1)
    attention_mask_small = torch.cat([attention_mask_small, torch.ones((1, 1), dtype=torch.long)], dim=-1)
    time.sleep(0.7)
print("\nGPT-2 Small completed.")

# Generate text with GPT-2 XL
inputs_large = tokenizer_large.encode(prompt, return_tensors="pt")
attention_mask_large = torch.ones(inputs_large.shape, dtype=torch.long)
pad_token_id_large = tokenizer_large.eos_token_id

print(prompt, end=" ", flush=True)
for _ in range(10):  # Generate 10 words
    outputs_large = model_large.generate(inputs_large, max_length=inputs_large.shape[-1] + 1, do_sample=True,
                                         pad_token_id=pad_token_id_large, attention_mask=attention_mask_large)
    generated_word = tokenizer_large.decode(outputs_large[0][-1])
    print(generated_word, end=" ", flush=True)
    inputs_large = torch.cat([inputs_large, outputs_large[0][-1].unsqueeze(0).unsqueeze(0)], dim=-1)
    attention_mask_large = torch.cat([attention_mask_large, torch.ones((1, 1), dtype=torch.long)], dim=-1)
    time.sleep(0.7)
print("\nGPT-2 XL completed.")
Transformers have revolutionized the field of natural language processing (NLP) by enabling the
GPT-2 Small completed.

Transformers have revolutionized the field of natural language processing, machine learning, and artificial intelligence.
GPT-2 XL completed.

Code Explanation:

  • The code imports the necessary libraries: GPT2LMHeadModel and GPT2Tokenizer from the transformers library, torch, and time.
  • It loads two pre-trained models and their corresponding tokenizers from the Hugging Face Model Hub: GPT-2 Small ("gpt2") and GPT-2 XL ("gpt2-xl").
  • A prompt string is defined: "Transformers have revolutionized the field of".
  • The code then generates text using both the GPT-2 Small and GPT-2 XL models based on the provided prompt.

For each model (GPT-2 Small and GPT-2 XL):

  • The prompt is encoded using the corresponding tokenizer, and the input tensors, attention masks, and padding token IDs are prepared.
  • The model generates text word by word, up to a maximum of 10 words.
  • For each generated word:
  • The model’s generate() function is called with the current input tensor, attention mask, and other parameters.
  • The generated word is decoded using the tokenizer and printed.
  • The generated word is appended to the input tensor and attention mask for the next iteration.
  • There is a short delay of 0.7 seconds between each generated word.
  • After generating 10 words, a completion message is printed.

Output:

  • The code will output the generated text for both GPT-2 Small and GPT-2 XL models.
  • The output will start with the provided prompt, followed by the generated words.
  • Each generated word will be printed with a space separator.
  • After generating 10 words for each model, a completion message will be printed.
Image by the Author

Building an Encoder Transformer:

Image by the Author
Image by the Author

This section focuses on building an encoder transformer from scratch.

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class FeedForward(nn.Module):
    def __init__(self, input_dim, hidden_dim, dropout=0.1):
        super(FeedForward, self).__init__()
        self.linear1 = nn.Linear(input_dim, hidden_dim)
        self.linear2 = nn.Linear(hidden_dim, input_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = self.linear1(x)
        x = F.relu(x)
        x = self.dropout(x)
        x = self.linear2(x)
        return x

class TransformerEncoderBlock(nn.Module):
    def __init__(self, input_dim, num_heads, feedforward_dim, dropout=0.1):
        super(TransformerEncoderBlock, self).__init__()
        self.self_attention = nn.MultiheadAttention(input_dim, num_heads, dropout=dropout)
        self.norm1 = nn.LayerNorm(input_dim)
        self.norm2 = nn.LayerNorm(input_dim)
        self.feedforward = FeedForward(input_dim, feedforward_dim, dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        attention_output, _ = self.self_attention(x, x, x, attn_mask=mask)
        x = x + self.dropout(attention_output)
        x = self.norm1(x)

        feedforward_output = self.feedforward(x)
        x = x + self.dropout(feedforward_output)
        x = self.norm2(x)

        return x

class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size, input_dim, num_heads, feedforward_dim, num_layers, dropout=0.1):
        super(TransformerEncoder, self).__init__()
        self.word_embeddings = nn.Embedding(vocab_size, input_dim)
        self.position_embeddings = nn.Embedding(1000, input_dim)
        self.layers = nn.ModuleList(
            [
                TransformerEncoderBlock(input_dim, num_heads, feedforward_dim, dropout)
                for _ in range(num_layers)
            ]
        )

    def forward(self, x, mask=None):
        seq_length = x.shape[1]
        positions = torch.arange(0, seq_length).expand(x.shape[0], seq_length).to(x.device)
        x = self.word_embeddings(x) + self.position_embeddings(positions)

        for layer in self.layers:
            x = layer(x, mask)

        return x

Explanation:

  1. We define a FeedForward class that represents a simple feed-forward neural network with two linear layers separated by a ReLU activation function and a dropout layer.
  2. The TransformerEncoderBlock class represents a single block of the transformer encoder. It consists of a multi-head self-attention layer (self_attention), layer normalization (norm1 and norm2), and a feed-forward network (feedforward). The forward method applies these components in sequence.
  3. The TransformerEncoder class represents the complete transformer encoder. It consists of word embeddings (word_embeddings), position embeddings (position_embeddings), and a series of transformer encoder blocks (layers). The forward method applies the embeddings and passes the input through each encoder block.
  4. We instantiate the TransformerEncoder model with specific hyperparameters, such as vocabulary size, input dimension, number of attention heads, feed-forward dimension, number of layers, and dropout rate.
  5. We generate some random input data (input_tensor) and perform a forward pass through the model to obtain the output embeddings.
vocab_size = 10000
input_dim = 512
num_heads = 8
feedforward_dim = 2048
num_layers = 6
dropout = 0.1

model = TransformerEncoder(vocab_size, input_dim, num_heads, feedforward_dim, num_layers, dropout)

input_sentence = "The quick brown fox jumps over the lazy dog"
input_tensor = torch.randint(0, vocab_size, (1, len(input_sentence.split())))

output_embeddings = model(input_tensor)

print(f"The model has {sum(p.numel() for p in model.parameters() if p.requires_grad):,} trainable parameters")
The model has 24,546,304 trainable parameters

Explanation:

  1. We define the hyperparameters for the transformer encoder, such as vocab_size, input_dim, num_heads, feedforward_dim, num_layers, and dropout.
  2. We instantiate the TransformerEncoder model with these hyperparameters.
  3. We provide an example input sentence: “The quick brown fox jumps over the lazy dog”.
  4. We convert the input sentence into a tensor of word indices (input_tensor) using torch.randint(). Each word in the sentence is represented by a random index from 0 to vocab_size - 1.
  5. We perform a forward pass through the model using model(input_tensor) to obtain the output embeddings (output_embeddings).
  6. Finally, we print the number of trainable parameters in the model using a generator expression with the sum() function.
Image by the Author

Question 1: How does changing different hyperparameters affect the overall size of the model?

Answer:

  • Changing the input_dim (dimension of the word embeddings) directly affects the size of the word embedding matrix and the input/output dimensions of the linear layers in the feed-forward network.
  • Increasing the num_heads (number of attention heads) increases the number of parameters in the multi-head attention layer.
  • Modifying the feedforward_dim (dimension of the hidden layer in the feed-forward network) affects the size of the linear layers in the feed-forward network.
  • Increasing the num_layers (number of transformer encoder blocks) adds more layers to the model, increasing the overall parameter count.
  • The dropout rate does not directly affect the model size but helps regularize the model during training.

Here’s an example of creating a transformer encoder with modified hyperparameters:

new_model = TransformerEncoder(vocab_size=5000, input_dim=256, num_heads=4, feedforward_dim=1024, num_layers=4, dropout=0.2)
print(f"The new model has {sum(p.numel() for p in new_model.parameters() if p.requires_grad):,} trainable parameters")
The new model has 4,695,040 trainable parameters

Question 2: Visualize the embeddings of a different set of words.

Answer: Here’s an example of visualizing the embeddings of a set of food-related words, using a pre-trained BERT model to produce the embeddings:

words = ["apple", "banana", "orange", "pizza", "burger", "salad", "soup", "cake", "cookie", "ice cream"]

embeddings = []
for word in words:
inputs = tokenizer(word, return_tensors='pt')
with torch.no_grad():
outputs = model(**inputs)
embeddings.append(outputs.last_hidden_state[0, 0, :].numpy())

pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)

plt.figure(figsize=(8, 8))
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], s=500)
for i, word in enumerate(words):
plt.annotate(word, xy=(embeddings_2d[i, 0], embeddings_2d[i, 1]))
plt.show()

Question 3: Compute the cosine similarity between the embeddings of a sentence and its scrambled version.

Answer: Here’s an example of computing the cosine similarity between a sentence and its scrambled version:

import torch

# Define the cosine_similarity function
def cosine_similarity(vec1, vec2):
    vec1 = vec1.squeeze()
    vec2 = vec2.squeeze()
    return torch.dot(vec1, vec2) / (torch.norm(vec1) * torch.norm(vec2))

# Define the sentence_to_embeddings function
def sentence_to_embeddings(sentence, model, tokenizer):
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state
    return embeddings

# Load the pre-trained model and tokenizer (replace with your own)
from transformers import BertModel, BertTokenizer
model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

sentence = "The quick brown fox jumps over the lazy dog"
scrambled_sentence = "dog lazy the over jumps fox brown quick The"

original_embedding = sentence_to_embeddings(sentence, model, tokenizer)
scrambled_embedding = sentence_to_embeddings(scrambled_sentence, model, tokenizer)

avg_embedding_original = torch.mean(original_embedding, dim=1)
avg_embedding_scrambled = torch.mean(scrambled_embedding, dim=1)

similarity = cosine_similarity(avg_embedding_original, avg_embedding_scrambled)
print(f"Cosine similarity between original and scrambled sentence embeddings: {similarity.item():.2f}")
Cosine similarity between original and scrambled sentence embeddings: 0.81

Question 4: Compute the cosine similarity between the embeddings of a word used in two different contexts.

Answer: Here’s an example of computing the cosine similarity between the embeddings of the word “bank” in two different contexts:

import torch

# Define the cosine_similarity function
def cosine_similarity(vec1, vec2):
    vec1 = vec1.squeeze()
    vec2 = vec2.squeeze()
    return torch.dot(vec1, vec2) / (torch.norm(vec1) * torch.norm(vec2))

# Define the sentence_to_embeddings function
def sentence_to_embeddings(sentence, model, tokenizer):
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state
    return embeddings

# Load the pre-trained model and tokenizer (replace with your own)
from transformers import BertModel, BertTokenizer
model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

sentence1 = "I need to deposit money at the bank"
sentence2 = "The river bank is full of sand"

embedding1 = sentence_to_embeddings(sentence1, model, tokenizer)
embedding2 = sentence_to_embeddings(sentence2, model, tokenizer)

avg_embedding1 = torch.mean(embedding1, dim=1)
avg_embedding2 = torch.mean(embedding2, dim=1)

similarity = cosine_similarity(avg_embedding1, avg_embedding2)
print(f"Cosine similarity between embeddings of the sentences: {similarity.item():.2f}")
Cosine similarity between embeddings of the sentences: 0.62

The similarity between the two averaged sentence embeddings is relatively low because the transformer encoder captures the contextual meaning of each word, including “bank”, based on the surrounding words in the sentence.
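As a follow-up sketch (reusing model, tokenizer, sentence1, sentence2, and cosine_similarity from the snippet above), you can also compare the contextual embedding of the “bank” token itself rather than the sentence averages, which isolates how the surrounding context changes the representation of that one word. The token_embedding helper below is illustrative and assumes the word maps to a single WordPiece token:

def token_embedding(sentence, word, model, tokenizer):
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    word_position = tokens.index(word)  # assumes the word is a single WordPiece token
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0, word_position]

bank_finance = token_embedding(sentence1, "bank", model, tokenizer)
bank_river = token_embedding(sentence2, "bank", model, tokenizer)
print(f"Cosine similarity of the two 'bank' embeddings: {cosine_similarity(bank_finance, bank_river).item():.2f}")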

Masked Language Modeling (MLM) with BERT:

We load the BERT model and tokenizer, define a function to predict masked words, and experiment with different sentences.

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

def predict_masked_words(sentence, model, tokenizer):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Take the most likely token at every position (including the [MASK] position)
    predicted_token_ids = outputs.logits.argmax(dim=-1)
    predicted_tokens = tokenizer.convert_ids_to_tokens(predicted_token_ids[0])
    return " ".join(predicted_tokens)

sentence = "I enjoy playing [MASK] games on weekends."
print(predict_masked_words(sentence, model, tokenizer))
. i enjoy playing video games on weekends . .

Explanation:

  1. We import the necessary libraries and load the pre-trained BERT tokenizer and model.
  2. We define a function predict_masked_words that takes a sentence, the BERT model, and the tokenizer as input. This function tokenizes the sentence, passes it through the model, and predicts the masked words.
  3. We provide an example sentence with a masked word: “I enjoy playing [MASK] games on weekends.”
  4. We call the predict_masked_words function with the example sentence, model, and tokenizer to predict the masked word.

Question 5: What happens when you mask more than one word in a sentence?

Answer: Here’s an example of masking multiple words in a sentence:

sentence = "The [MASK] brown [MASK] jumps over the lazy [MASK]."
print(predict_masked_words(sentence, model, tokenizer))

The model can accurately predict multiple masked words in a sentence. It uses the surrounding context to infer the most likely words that fit the masked positions.

Question 6: Use the model to predict the masked word in a sentence in a language other than English.

Answer: Here’s an example of predicting a masked word in a French sentence:

sentence = "Je suis allé au [MASK] pour acheter du pain."
print(predict_masked_words(sentence, model, tokenizer))
. je sui ##s all ##s au ##tre pour ache ##t du pain . .

The model may not accurately predict the masked word in a non-English sentence because it was pre-trained on English text data. To handle other languages, we would need to use a multilingual BERT model or a model specifically trained on the target language.
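As a hedged follow-up sketch (reusing the predict_masked_words function from above), you could swap in a multilingual checkpoint such as bert-base-multilingual-cased, which saw French during pre-training, and compare its prediction:

from transformers import BertTokenizer, BertForMaskedLM

multilingual_tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
multilingual_model = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")

sentence = "Je suis allé au [MASK] pour acheter du pain."
print(predict_masked_words(sentence, multilingual_model, multilingual_tokenizer))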

Question 7: Mask a word that has different meanings in different contexts.

Answer: Here’s an example of masking a word with multiple meanings:

sentence1 = "I need to [MASK] money at the bank."
sentence2 = "The river [MASK] is full of sand."
print(predict_masked_words(sentence1, model, tokenizer))
print(predict_masked_words(sentence2, model, tokenizer))
. i need to get money at the bank . .
. the river bed is full of sand . .

The model predicts a word that fits each context: in the first sentence it fills the mask with a banking-related verb (“get” here, though “deposit” would also fit), and in the second it predicts “bed”, completing the river-bank reading of the sentence.

Question 8: Mask a word in a sentence that makes sense only in a specific cultural context.

Answer: Here’s an example of masking a word in a culturally specific sentence:

sentence = "I love eating [MASK] on Thanksgiving."
print(predict_masked_words(sentence, model, tokenizer))
i love eating pizza on thanksgiving . i

The model may struggle to accurately predict culturally specific words or concepts that are not well-represented in its training data. In this case, it may predict a more generic word related to food rather than the specific Thanksgiving dish.

Question 9: Mask a word in a sentence that contains an idiomatic expression.

Answer: Here’s an example of masking a word in an idiomatic expression:

sentence = "It's raining cats and [MASK]."
print(predict_masked_words(sentence, model, tokenizer))
it ' s raining cats and dogs . .

Common idioms like this one appear frequently in the training data, so the model completes it correctly with “dogs”. For rarer idioms, however, the model may predict a word that fits the literal context rather than the idiomatic meaning.

Question 10: How can you visualize the attention weights between words in a sentence using transformers?

from transformers import BertTokenizer, BertModel
import matplotlib.pyplot as plt
import seaborn as sns

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)

# Process input
sentence = "The quick brown fox jumps over the lazy dog"
inputs = tokenizer(sentence, return_tensors="pt")
outputs = model(**inputs)

# Get attention weights (first head of the last layer, first batch element)
attention = outputs.attentions[-1][0, 0].detach().numpy()

# Use the model's own tokens (including [CLS] and [SEP]) so the labels match the matrix
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Plotting
sns.heatmap(attention, annot=True, fmt=".2f", xticklabels=tokens, yticklabels=tokens)
plt.title('Attention Weights')
plt.show()

This code visualizes the attention weights from the last layer of a BERT model, helping understand which words the model focuses on when processing different words in the sentence.

Question 11: How to load a pre-trained transformer model and tokenizer?

Expected Answer with Code:

from transformers import BertModel, BertTokenizer

# Load pre-trained BERT model and tokenizer
model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

print("Model and tokenizer are loaded.")

This code demonstrates how to load a pre-trained BERT model and its corresponding tokenizer. These tools are essential for processing text in a way that the model can understand.

Question 12: How do you encode text using a transformer tokenizer?

Expected Answer with Code:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "Hello, world! This is a test sentence."

# Encode text
encoded_input = tokenizer(text, return_tensors='pt')
print(encoded_input)


{'input_ids': tensor([[ 101, 7592, 1010, 2088, 999, 2023, 2003, 1037, 3231, 6251, 1012, 102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

This snippet shows how to use a tokenizer to convert a text string into a format suitable for input to a transformer model, including conversion to token IDs and creation of attention masks.

Question 13: How to perform token classification using a pre-trained transformer model?

Expected Answer with Code:

import torch
from transformers import BertForTokenClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForTokenClassification.from_pretrained('bert-base-uncased')

text = "Add a token classification layer to BERT."
inputs = tokenizer(text, return_tensors="pt")

# Assuming a token classification task (e.g., NER)
with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.argmax(outputs.logits, dim=-1)
print(predictions)

This code is for using a pre-trained BERT model configured for token classification, like named entity recognition (NER), demonstrating how to get predictions for each token in a sentence.

Question 14: How do you visualize the output of a transformer model?

Expected Answer with Code:

import torch
from transformers import BertTokenizer, BertModel
import matplotlib.pyplot as plt

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize input
input_ids = tokenizer.encode("Example sentence for BERT.", return_tensors="pt")

# Get last hidden states
with torch.no_grad():
    outputs = model(input_ids)

hidden_states = outputs.last_hidden_state.squeeze()

# Visualize the output for the first token's embedding
plt.figure(figsize=(10, 1))
plt.imshow(hidden_states[0:1].numpy())
plt.colorbar()
plt.title("Visualization of BERT Output for First Token")
plt.show()

This snippet visualizes the output embedding for the first token in a sentence processed by BERT, helping to understand the model’s internal representations.

Question 15: How to modify a transformer model to change its dropout rate?

Expected Answer with Code:

from transformers import BertConfig, BertModel

# Load model with custom configuration
config = BertConfig.from_pretrained('bert-base-uncased', hidden_dropout_prob=0.2, attention_probs_dropout_prob=0.2)
model = BertModel.from_pretrained('bert-base-uncased', config=config)

print("Modified dropout rates in the model.")

This code modifies the dropout rates in a BERT model’s configuration, which can be useful for experimenting with different regularization strengths during fine-tuning.

Question 16: How to preprocess text data for input into a transformer model?

Expected Answer with Code:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = "Example text that needs to be tokenized."

# Tokenizing text
encoded_input = tokenizer(text, return_tensors='pt')
print(encoded_input)
{'input_ids': tensor([[  101,  2742,  3793,  2008,  3791,  2000,  2022, 19204,  3550,  1012,
102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

This snippet demonstrates the essential first step in working with transformer models: tokenizing text so that it’s in the proper format (token IDs, attention mask) for the model.

Question 17: How to calculate the number of parameters in a transformer model?

Expected Answer with Code:

from transformers import BertModel

model = BertModel.from_pretrained('bert-base-uncased')

# Calculate the number of trainable parameters
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"The model has {total_params:,} trainable parameters.")
The model has 109,482,240 trainable parameters.

Question 18: How to visualize token embeddings using PCA?

Expected Answer with Code:

import torch
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Assume embeddings is obtained from a model
embeddings = torch.rand(10, 512) # Simulated embedding for 10 tokens

# Convert to numpy array for PCA
embeddings_np = embeddings.detach().numpy()

# Perform PCA to reduce dimensions
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings_np)

# Visualize
plt.figure(figsize=(8, 6))
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], c='blue', label='Tokens')
for i in range(embeddings_2d.shape[0]):
    plt.text(embeddings_2d[i, 0], embeddings_2d[i, 1], f'Token {i+1}')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('PCA of Token Embeddings')
plt.legend()
plt.show()

This code snippet visualizes the distribution of token embeddings in a two-dimensional space using PCA, helping to understand the variance and relationships between different tokens’ embeddings.

Understanding the Math Behind Transformers:

Embeddings:

Embeddings are a way to represent discrete objects, like words or categories, as continuous vectors in a high-dimensional space. This allows neural networks to work with these objects more effectively.

Example:
Imagine you have a small vocabulary of five words: “king”, “queen”, “man”, “woman”, and “child”. We want to create embeddings for these words.

Step 1: Create a lookup table
First, we create a lookup table that assigns a unique index to each word:

- "king": 0
- "queen": 1
- "man": 2
- "woman": 3
- "child": 4

Step 2: Define the embedding dimension
Next, we choose the size of the embedding vectors. Let’s say we want to represent each word with a 3-dimensional vector.

Step 3: Initialize the embedding matrix
We create an embedding matrix with the size (vocabulary_size, embedding_dimension). In this case, it would be a 5x3 matrix:


[
[0.1, 0.2, 0.3], # Embedding for "king"
[0.4, 0.5, 0.6], # Embedding for "queen"
[0.7, 0.8, 0.9], # Embedding for "man"
[1.0, 1.1, 1.2], # Embedding for "woman"
[1.3, 1.4, 1.5] # Embedding for "child"
]

These values are usually initialized randomly and learned during training.

Step 4: Look up the embeddings
To get the embedding for a word, we use its index to look up the corresponding row in the embedding matrix. For example:

- Embedding for "king" (index 0): [0.1, 0.2, 0.3]
- Embedding for "woman" (index 3): [1.0, 1.1, 1.2]

Step 5: Use the embeddings
Now that we have the embeddings, we can use them as input to our neural network. The network can learn to interpret these embeddings and discover relationships between words.

For instance, we might find that the embeddings for “king” and “queen” are closer to each other than to “child”, indicating that they share similar properties.

Embedding space (conceptual sketch): words with related meanings, such as “king” and “queen”, end up near each other, while “child” lies farther away.

In the context of transformers, word embeddings are used to convert input words into continuous vectors that can be processed by the attention mechanism and feed-forward layers.

The beauty of embeddings is that they allow the model to learn meaningful representations of discrete objects, capturing their semantic relationships and similarities.
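To make the lookup concrete, here is a minimal PyTorch sketch. The vocabulary, indices, and 3-dimensional embedding size mirror the toy example above; the vectors themselves are initialized randomly and would normally be learned during training.

import torch
import torch.nn as nn

# Toy vocabulary from the example above
vocab = {"king": 0, "queen": 1, "man": 2, "woman": 3, "child": 4}

# Embedding layer: 5 words, each mapped to a 3-dimensional vector (randomly initialized)
embedding = nn.Embedding(num_embeddings=5, embedding_dim=3)

# Look up the embeddings for "king" and "woman" by their indices
indices = torch.tensor([vocab["king"], vocab["woman"]])
print(embedding(indices))  # two rows of the 5x3 embedding matrix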

Positional Encoding:

Imagine you have a sentence: “Alice loves Bob.” The meaning would change if we switched the order of words: “Bob loves Alice.” To teach the model about the position of each word, we assign a unique number (encoding) to each position.

Example:

  • Sentence: “Alice loves Bob.”
  • Positions: 1, 2, 3

We can represent the positional encoding using sine and cosine functions:

Position 1: [sin(1/1), cos(1/1), sin(1/2), cos(1/2), ...]
Position 2: [sin(2/1), cos(2/1), sin(2/2), cos(2/2), ...]
Position 3: [sin(3/1), cos(3/1), sin(3/2), cos(3/2), ...]

These positional encodings are then added to the word embeddings (numerical representations of words) before being processed by the attention mechanism. This way, the model can learn the relative positions of words in a sentence.
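The example above is intentionally simplified. In the original paper, the position is divided by 10000^(2i/d_model), so each pair of embedding dimensions oscillates at a different frequency. Here is a small NumPy sketch of that formula:

import numpy as np

def positional_encoding(num_positions, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    positions = np.arange(num_positions)[:, np.newaxis]        # shape (num_positions, 1)
    div_terms = 10000 ** (np.arange(0, d_model, 2) / d_model)  # one frequency per dimension pair
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(positions / div_terms)
    pe[:, 1::2] = np.cos(positions / div_terms)
    return pe

# Encodings for the three positions in "Alice loves Bob." with a 4-dimensional model
print(positional_encoding(num_positions=3, d_model=4))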

Attention Weighting:

Imagine you have a group of friends, and each friend has a unique set of skills. Let’s say you have a task that requires a combination of these skills. To decide which friends to involve, you assign a weight to each friend based on how relevant their skills are to the task.

Friends’ Skills Matrix (F):

- Alice: [Cooking (0.8), Singing (0.2)]
- Bob: [Cooking (0.3), Dancing (0.7)]
- Charlie: [Singing (0.6), Dancing (0.4)]

Attention Weights (A) — Focused on Cooking:

- Alice: 0.8
- Bob: 0.3
- Charlie: 0.0

Each friend’s skill vector is scaled by their respective attention weight:

- Alice's weighted skills: \(0.8 \times [0.8, 0.2] = [0.64, 0.16]\)
- Bob's weighted skills: \(0.3 \times [0.3, 0.7] = [0.09, 0.21]\)
- Charlie's weighted skills: \(0.0 \times [0.6, 0.4] = [0.00, 0.00]\)

Resulting Attention-Weighted Skills Matrix (Z):

- Alice: [0.64, 0.16]
- Bob: [0.09, 0.21]
- Charlie: [0.00, 0.00]
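The analogy above uses hand-picked weights. In an actual transformer, the weights come from scaled dot-product attention: the dot products of query and key vectors are scaled by the square root of the key dimension and passed through a softmax, and the resulting weights are used to combine the value vectors. Here is a minimal NumPy sketch (the Q, K, and V matrices are arbitrary illustrative numbers, not taken from the friends example):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                                         # query-key similarities
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                                                      # weighted sum of the values

# Three tokens with 2-dimensional query, key, and value vectors
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])

print(scaled_dot_product_attention(Q, K, V))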

Multi-Head Attention:

Image by the Author

For multi-head attention, assume we have three heads, each focusing on a different skill:

- Head 1 (Cooking)
- Head 2 (Singing)
- Head 3 (Dancing)

Each head computes its attention weights and outputs:

- Head 1 Weights: Alice (0.8), Bob (0.3), Charlie (0)
- Head 2 Weights: Alice (0.2), Bob (0), Charlie (0.6)
- Head 3 Weights: Alice (0), Bob (0.7), Charlie (0.4)

Weighted Skills for Each Head:

- Head 1 Outputs:
- Alice: \(0.8 \times [0.8, 0.2] = [0.64, 0.16]\)
- Bob: \(0.3 \times [0.3, 0.7] = [0.09, 0.21]\)
- Charlie: \([0, 0]\)
- Head 2 Outputs:
- Alice: \(0.2 \times [0.8, 0.2] = [0.16, 0.04]\)
- Charlie: \(0.6 \times [0.6, 0.4] = [0.36, 0.24]\)
- Head 3 Outputs:
- Bob: \(0.7 \times [0.3, 0.7] = [0.21, 0.49]\)
- Charlie: \(0.4 \times [0.6, 0.4] = [0.24, 0.16]\)

Concatenated Results:

- Alice: \([0.64, 0.16, 0.16, 0.04, 0, 0]\)
- Bob: \([0.09, 0.21, 0, 0, 0.21, 0.49]\)
- Charlie: \([0, 0, 0.36, 0.24, 0.24, 0.16]\)

Let’s perform these calculations and confirm the results. The NumPy code below computes both the single-head and multi-head attention outputs, and the results follow.

import numpy as np

# Skills matrix for Alice, Bob, and Charlie
skills_matrix = np.array([
[0.8, 0.2], # Alice: [Cooking, Singing]
[0.3, 0.7], # Bob: [Cooking, Dancing]
[0.6, 0.4] # Charlie: [Singing, Dancing]
])

# Attention weights for Head 1 (Cooking), Head 2 (Singing), Head 3 (Dancing)
weights_head_1 = np.array([0.8, 0.3, 0])
weights_head_2 = np.array([0.2, 0, 0.6])
weights_head_3 = np.array([0, 0.7, 0.4])

# Function to compute weighted skills
def compute_weighted_skills(skills, weights):
    # Scale each person's skill vector by their attention weight
    return skills * weights[:, np.newaxis]

# Compute weighted skills for each head
weighted_skills_head_1 = compute_weighted_skills(skills_matrix, weights_head_1)
weighted_skills_head_2 = compute_weighted_skills(skills_matrix, weights_head_2)
weighted_skills_head_3 = compute_weighted_skills(skills_matrix, weights_head_3)

# Concatenate results from all heads
concatenated_results = np.hstack((weighted_skills_head_1, weighted_skills_head_2, weighted_skills_head_3))
print(concatenated_results)

Single-Head Attention Results (Focused on Cooking)

- Alice: [0.64, 0.16]
- Bob: [0.09, 0.21]
- Charlie: [0.00, 0.00]

These values represent the scaled skills of Alice, Bob, and Charlie, where only their cooking skills were emphasized for the task.

Multi-Head Attention Results

The concatenated results from the three heads are as follows:

- Alice:
- [0.64, 0.16, 0.16, 0.04, 0.00, 0.00]
- Bob:
- [0.09, 0.21, 0.00, 0.00, 0.21, 0.49]
- Charlie:
- [0.00, 0.00, 0.36, 0.24, 0.24, 0.16]

Each row shows the concatenated outputs from all three heads for each individual, representing weighted skills for cooking, singing, and dancing. These results can then be processed further, typically with a linear layer in a neural network, to integrate the features from all the skills and produce a final representation for each person.

Feed-Forward Network (FFN):

A typical FFN used in a transformer model consists of two linear transformations with a non-linear activation function in between.

Let’s consider the output from the multi-head attention for Alice, Bob, and Charlie. We’ll apply an FFN to these outputs. To keep it simple, let’s assume the following for our FFN:

  • First layer transforms the concatenated six-dimensional vector to a four-dimensional vector (dimensionality reduction).
  • Second layer transforms back from four dimensions to two dimensions (to match the original skill dimensions for simplicity).

The code is:

import numpy as np

# Define the concatenated results from the multi-head attention
concatenated_results = np.array([
[0.64, 0.16, 0.16, 0.04, 0.00, 0.00], # Alice
[0.09, 0.21, 0.00, 0.00, 0.21, 0.49], # Bob
[0.00, 0.00, 0.36, 0.24, 0.24, 0.16] # Charlie
])

# Define weights and biases for the FFN
W1 = np.random.rand(6, 4) # First layer weights (6x4 matrix)
b1 = np.random.rand(4) # First layer biases (4-dimensional vector)
W2 = np.random.rand(4, 2) # Second layer weights (4x2 matrix)
b2 = np.random.rand(2) # Second layer biases (2-dimensional vector)

# Activation function: ReLU
def relu(x):
    return np.maximum(0, x)

# FFN function applying two layers of transformations
def apply_ffn(x, W1, b1, W2, b2):
    # First layer transformation
    x = np.dot(x, W1) + b1
    # Apply ReLU activation
    x = relu(x)
    # Second layer transformation
    x = np.dot(x, W2) + b2
    return x

# Applying the FFN to each concatenated vector
ffn_outputs = np.array([apply_ffn(person, W1, b1, W2, b2) for person in concatenated_results])

# Print the outputs for Alice, Bob, and Charlie
print("FFN Output for Alice:", ffn_outputs[0])
print("FFN Output for Bob:", ffn_outputs[1])
print("FFN Output for Charlie:", ffn_outputs[2])
FFN Output for Alice: [3.0401607 2.5599341]
FFN Output for Bob: [3.32439661 2.65308496]
FFN Output for Charlie: [2.98914309 2.39759574]

(The exact values will differ from run to run because the weights and biases are initialized randomly.)

Residual Connection:

Residual connections, also known as skip connections, allow information to bypass one or more layers in a neural network. They help in training deep networks by allowing gradients to flow more easily through the network.

Imagine you’re solving a complex math problem. Sometimes, it’s helpful to break the problem into smaller steps and then combine the results at the end. Residual connections work similarly by allowing the model to learn simpler functions and then combine them with the original input.

Example:

  • Input: x = [2, 3, 4]
  • Layer output: f(x) = [1, 2, 3]

A residual connection would add the input (x) to the layer output (f(x)):

  • Residual output: f(x) + x = [1, 2, 3] + [2, 3, 4] = [3, 5, 7]

In transformers, residual connections are used after the attention and FFN layers. This allows the model to learn both the original input and the transformed representation, making it easier to capture complex patterns.
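A one-line NumPy sketch of the same addition:

import numpy as np

x = np.array([2.0, 3.0, 4.0])    # layer input
fx = np.array([1.0, 2.0, 3.0])   # layer output f(x)

# Residual (skip) connection: add the input back onto the layer output
print(fx + x)  # [3. 5. 7.]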

Layer Normalization:

Layer normalization is a technique used to normalize the activations (outputs) of a layer in a neural network. It helps stabilize the training process and improves the model’s performance.

Imagine you have a group of friends, and each friend has a score in different subjects. Some friends might have scores that are much higher or lower than others. Layer normalization helps bring all the scores to a similar range.

Example:

Original scores:
Alice: [85, 92, 78]
Bob: [60, 70, 65]
Charlie: [95, 88, 92]

To normalize the scores, we first calculate the mean and standard deviation for each friend:

Mean:
Alice: (85 + 92 + 78) / 3 = 85
Bob: (60 + 70 + 65) / 3 = 65
Charlie: (95 + 88 + 92) / 3 = 91.67
Standard Deviation:
Alice: sqrt(((85-85)² + (92-85)² + (78-85)²) / 3) = 5.72
Bob: sqrt(((60-65)² + (70-65)² + (65-65)²) / 3) = 4.08
Charlie: sqrt(((95-91.67)² + (88-91.67)² + (92-91.67)²) / 3) = 2.87

Then, we subtract the mean and divide by the standard deviation for each score:

Normalized scores:
Alice: [(85-85)/5.72, (92-85)/5.72, (78-85)/5.72] = [0, 1.22, -1.22]
Bob: [(60-65)/4.08, (70-65)/4.08, (65-65)/4.08] = [-1.22, 1.22, 0]
Charlie: [(95-91.67)/2.87, (88-91.67)/2.87, (92-91.67)/2.87] = [1.16, -1.28, 0.12]

The normalized scores now have a mean of approximately 0 and a standard deviation of approximately 1, making them easier to process by the next layer.
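The same normalization in NumPy, where each row is normalized with its own mean and standard deviation as in the formula above (the learnable scale and shift parameters used in real layer normalization are omitted for simplicity):

import numpy as np

scores = np.array([
    [85, 92, 78],   # Alice
    [60, 70, 65],   # Bob
    [95, 88, 92],   # Charlie
], dtype=float)

# Normalize each row: subtract its mean and divide by its standard deviation
mean = scores.mean(axis=1, keepdims=True)
std = scores.std(axis=1, keepdims=True)
normalized = (scores - mean) / std
print(normalized.round(2))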

Softmax Activation:

It is used to convert a vector of real numbers into a probability distribution. In the context of transformers, it’s often used in the attention mechanism to compute the attention weights.

Imagine you have a list of scores for different options, and you want to choose the best one. The softmax function helps you convert these scores into probabilities, making it easier to make a decision.

Example:

Scores: [2.0, 1.0, 0.5]

To apply the softmax function, we first compute the exponential of each score:

Exponentials: [exp(2.0), exp(1.0), exp(0.5)] = [7.39, 2.72, 1.65]

Then, we sum up all the exponentials:

Sum: 7.39 + 2.72 + 1.65 = 11.76

Finally, we divide each exponential by the sum to get the probabilities:

Probabilities: [7.39/11.76, 2.72/11.76, 1.65/11.76] = [0.63, 0.23, 0.14]

The resulting probabilities add up to 1, and the option with the highest score gets the highest probability. In transformers, these probabilities are used as attention weights to determine the importance of different input elements.
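The same calculation in NumPy, using the scores from the example:

import numpy as np

scores = np.array([2.0, 1.0, 0.5])

# Softmax: exponentiate each score, then normalize so the values sum to 1
exp_scores = np.exp(scores)
probabilities = exp_scores / exp_scores.sum()
print(probabilities.round(2))  # approximately [0.63, 0.23, 0.14]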

Dropout Regularization:

It is a regularization technique used to prevent overfitting in neural networks. It works by randomly “dropping out” (setting to zero) some of the neurons during training, which helps the model learn more robust features.

Imagine you’re learning a new skill, like playing the piano. If you always practice with the same group of friends, you might become too reliant on them. Dropout is like randomly removing some of your friends during practice, forcing you to adapt and learn independently.

Example:

Input: [1.0, 2.0, 3.0, 4.0]
Dropout probability: 0.5 (50% chance of dropping a neuron)

During training, we randomly generate a mask based on the dropout probability:

Mask: [1, 0, 1, 0] (1 means keep, 0 means drop)

We apply the mask to the input by multiplying element-wise:

Masked input: [1.0 * 1, 2.0 * 0, 3.0 * 1, 4.0 * 0] = [1.0, 0, 3.0, 0]

The masked input is then passed through the network. At test time, dropout is not applied; instead, the activations are scaled by the keep probability (1 - p) so that their expected magnitude matches what the next layer saw during training. In practice, most frameworks use “inverted dropout”, which performs this scaling during training instead.
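A minimal NumPy sketch of the masking step, reusing the fixed mask from the example (in practice the mask is drawn randomly on every forward pass):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([1, 0, 1, 0])  # normally drawn at random, e.g. np.random.binomial(1, keep_prob, x.shape)

# Apply the mask element-wise during training
print(x * mask)  # [1. 0. 3. 0.]

# "Inverted dropout" variant: rescale by the keep probability during training,
# so no extra scaling is needed at test time
keep_prob = 0.5
print(x * mask / keep_prob)  # [2. 0. 6. 0.]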

In transformers, dropout is often applied to the attention weights and the outputs of the FFN layers to improve generalization and prevent overfitting.

Conclusion:

Let’s walk through one more example to close out the article, this time translating the sentence “The quick brown fox jumps over the lazy dog” from English to Spanish.

1. Input:
— The input sentence is “The quick brown fox jumps over the lazy dog.”

2. Tokenization:
— The sentence is split into tokens: [“The”, “quick”, “brown”, “fox”, “jumps”, “over”, “the”, “lazy”, “dog”]

3. Embeddings:
— Each token is converted into a vector (a list of numbers) called an embedding.
— Example:

"The" → [0.1, 0.2, 0.3]
"quick" → [0.4, 0.5, 0.6]
"brown" → [0.7, 0.8, 0.9]

4. Positional Embeddings:
— Positional embeddings are added to the token embeddings to represent the position of each token in the sentence.
— Example:

"The" → [0.1, 0.2, 0.3] + [0.01, 0.02, 0.03]
"quick" → [0.4, 0.5, 0.6] + [0.04, 0.05, 0.06]
"brown" → [0.7, 0.8, 0.9] + [0.07, 0.08, 0.09]

5. Encoder:
— The encoder processes the input embeddings through multiple layers.
— Each encoder layer has two main components:
a. Multi-Head Attention:
— The tokens attend to each other to understand the relationships between them.
— Example: “quick” attends to “brown” and “fox” to understand that the fox is described as quick and brown.
b. Feedforward Neural Network:
— The attended representations are passed through a feedforward neural network to transform the information.
— Residual connections and layer normalization are used between the components to facilitate training.

6. Decoder:
— The decoder generates the output sentence in the target language (Spanish).
— It works similarly to the encoder but has an additional component called Encoder-Decoder Attention.
— The decoder attends to the encoder’s output to gather relevant information for generating the output tokens.
— Example:

Decoder input: ["<start>"]
Encoder-Decoder Attention: The decoder attends to relevant parts of the encoder's output, such as "quick", "brown", and "fox".
Decoder output: ["El"]

7. Output:
— The decoder generates the output tokens one by one.
— Example:
* “El” → “rápido” → “zorro” → “marrón” → “salta” → “sobre” → “el” → “perro” → “perezoso”
— The final output is the translated sentence: “El rápido zorro marrón salta sobre el perro perezoso” (The quick brown fox jumps over the lazy dog in Spanish).

In this example, the transformer takes the input sentence “The quick brown fox jumps over the lazy dog”, converts it into embeddings, and processes it through the encoder. The encoder understands the relationships between the tokens, such as “quick” and “brown” describing the “fox”. The decoder then generates the output sentence in Spanish by attending to the encoder’s output and the previously generated tokens. The final output is the translated sentence in Spanish.

References:

1. Attention Is All You Need (Vaswani et al., 2017)

2. Papers with Code

3. Jay Alammar’s blog

4. Hugging Face blog
