Transformers: A Quick Explanation with Code

Transformers are a class of models that has gained a lot of traction over the years, especially in the domain of natural language processing and understanding.

In this post, I will discuss, part by part, how this architecture is formed. Each part will discuss a component or concept of the architecture by giving a concrete definition using code. The explanations are based on this great video by Dr. Pascal Poupart.

A few quick notes before we start
Regular Dictionary Method
Attention Method
Implementation
Implementation
Encoder
Decoder
Implementation
- Defining the encoder and the decoder
- Inference

A few quick notes before we start

The only dependency for all the code sections is NumPy. This is in the hope that it’ll be clearer due to the higher transparency of NumPy.
Some code sections (like that of MHA), have the potential to be optimized by vectorization. However, I have chosen not to do so, to avoid complexity.
Due to the same reason, this post doesn’t pay much “attention” (pun unintended), to how transformers are trained, but rather focuses on how transformers can be used to generate sequences based on inputs. The prior of the two would likely require the usage of other packages such as PyTorch or Tensorflow.

Attention

Attention is essentially like a generalization of a database/dictionary.

Let’s say we have a query.

q = np.array([0, 1])

In a dictionary, we check if that query is equal to one of the keys, and then we get the value for that key.
In attention, the query doesn’t need to be equal to one of the keys. We simply look at the similarity between the query and the key, and get a sum of the values weighted by the similarities.

Regular Dictionary Method

dictionary = {
    str(np.array([0, 1])): np.array([0, 0]),    # converting to string because numpy arrays are not hashable
    str(np.array([1, 0])): np.array([1, 1])
}

print(f"Query = {q} => Value = {dictionary[str(q)]}")

# Output:
# Query = [0 1] => Value = [0 0]

Attention Method

def q_to_v(query, keys, values):
    total_sim = 0
    total_v = 0
    for i, k in enumerate(keys):
        similarity = np.dot(query, k)
        v = similarity * values[i]
        
        total_sim += similarity
        total_v += v
    
    return total_v / (total_sim or 1)


keys = np.array([[0, 1], [1, 0]])
values = np.array([[0, 0], [1, 1]])

print(f"\nQuery = {q} => Value = {q_to_v(q, keys, values)}")

# Output:
# Query = [0 1] => Value = [0. 0.]

Similarity

In the above example, we use a dot product to measure the similarity between the query and the key.

In addition to this, there are several options for measuring similarity:

Dot product (q.T * k)
Scaled dot product (q.T * k / sqrt(len(k)))
General dot product (q.T * W * k), where W is a trainable weight matrix
Additive similarity (W.T * q + W.T * k)

Multi-Head Attention (MHA) in Transformers

Multi-head Attention architecture

As seen in the above diagram, each of the value, key, and query is first projected using 3 linear layers (you can think of it as simply multiplying by a matrix). The results of the 3 linear layers are then passed through a scaled attention mechanism.

The idea behind multi-head attention is that we can have multiple such sets that are finally concatenated and passed through a linear layer to get the output of the MHA unit. You can think of the idea behind this to be similar to kernels/filters in CNN.

The following equations can denote this entire process:

MHA Equations

Implementation

Initialization

# Initialize queries, keys, and values
queries = np.random.random((3, 2))
keys = np.random.random((3, 2))
values = np.random.random((3, 2))

seq_len, d = queries.shape

Linear Layer

Next, I’ll define the function for the linear layer. In practice, this often gets implemented as a neural network, rather than a simple matrix multiplication. However, for the simplicity of using NumPy, I’ll treat it as matrix multiplication.

# Define the function to apply the linear layer weights onto the vector
def linear(weights, vecs):
    return (np.expand_dims(weights, 0) @ np.expand_dims(vecs, 2)).reshape(vecs.shape[0], weights.shape[0])

MHA Function

I’ll also define a function for the Multi-head Attention functionality. Here, as the inputs, q_s, k_s, and v_s refer to the queries, keys, and values, respectively. weights refer to the weights of the linear layers in the MHA module. n_heads refers to the number of heads in the MHA module.

# Define the Multi-head Attention function
def mha(q_s, k_s, v_s, weights, n_heads=4):
    W_Q, W_K, W_V, W_O = weights
    seq_len, d = q_s.shape
    
    head_outputs = []
    for i in range(n_heads):
        total_sim = 0
        total_v = 0
        
        # Linear layer
        qs_head = linear(W_Q[i], q_s)
        ks_head = linear(W_K[i], k_s)
        vs_head = linear(W_V[i], v_s)
        
        # Attention
        for i, k_head in enumerate(ks_head):
            similarity = np.exp(np.dot(qs_head, k_head) / d ** 0.5)         # (seq_len,) : one similarity for each query
            v = similarity.reshape(seq_len, 1) * vs_head[i].reshape(1, d)   # (seq_len, d) : one weighted value for each query
            
            total_sim += similarity
            total_v += v
        
        head_outputs.append(total_v / total_sim[:, None])       # broadcast and divide by total weight

    h_out = np.concatenate(head_outputs, axis=1)            # (seq_len, d * n_heads)
    
    return linear(W_O, h_out)        # (seq_len, d)

Let’s see how we can use these functions in code. I’ll first define the variables that I’ll be using.

# Number of heads
num_heads = 4

# The weight matrices
# - Initialized randomly for demonstrating
# - These, too, get tuned during training
linear_weights = [
    [np.random.random((d, d)) for _ in range(num_heads)],   # W_Q
    [np.random.random((d, d)) for _ in range(num_heads)],   # W_K
    [np.random.random((d, d)) for _ in range(num_heads)],   # W_V
    np.random.random((d, num_heads * d))                    # W_O
]

mha(queries, keys, values, linear_weights, num_heads)

Masked Multi-Head Attention in Transformers

The issue with regular MHA is that since we always input the entire sequence into the module, the module can just look up what comes next and just use that.

Masked MHA simply masks out the vectors representing the future tokens.

Implementation

def masked_mha(q_s, k_s, v_s, weights, n_heads=4):
    W_Q, W_K, W_V, W_O = weights
    seq_len, d = q_s.shape
    
    head_outputs = []
    for i in range(n_heads):
        total_sim = 0
        total_v = 0
        
        # Linear layer
        qs_head = linear(W_Q[i], q_s)
        ks_head = linear(W_K[i], k_s)
        vs_head = linear(W_V[i], v_s)
        
        # Attention
        for i, k_head in enumerate(ks_head):
            # Mask out the keys occuring in the future
            mask = np.zeros((seq_len,))
            mask[i:] = float('-inf')
            
            similarity = np.exp((np.dot(qs_head, k_head) + mask) / d ** 0.5)        # (seq_len,) : one similarity for each query for a given key
            v = similarity.reshape(seq_len, 1) * vs_head[i].reshape(1, d)           # (seq_len, d) : one weighted value for each query for a given key
            
            total_sim += similarity
            total_v += v
        
        head_outputs.append(total_v / (total_sim[:, None] + 0.0000001))       # broadcast and divide by total weight

    h_out = np.concatenate(head_outputs, axis=1)            # (seq_len, d * n_heads)
    
    return linear(W_O, h_out)        # (seq_len, d)

The usage of this function in code looks as follows:

masked_mha(queries, keys, values, linear_weights, num_heads)

Embeddings

Let’s take text as an example (although it doesn’t have to be).

Say we have a sentence: “The apple fell”

What we do is, for each word of this sentence, we find a vector that captures the meaning of the sentence. This vector is something known as an embedding vector.

The mapping from a word to its embedding vector can be achieved using a lookup table. For words, there are usually lookup tables made by other people. We can also create this lookup table by optimizing for the values of the table.

Self Attention

Self-attention simply means that the queries, keys, and values are all the same set of vectors.

The reasoning behind this is that, by passing the same set of vectors to all three inputs, we’re essentially getting a representation of how every pair of words of a sentence relates to each other.

embeddings = np.array([[0, 0], [0, 1], [0.1, 0.5]])
queries = keys = values = embeddings

Layer Normalization

It turns out that for very deep neural networks, it is difficult to perform the gradient descent updates in a smooth manner. This is because, when we update with gradient descent, we are not just updating one parameter or even one layer; we’re updating multiple layers. Since the final loss and output depend on all the layers, the target is constantly moving. Practically, normalizing the outputs of the layer has been shown to greatly improve the time and performance of gradient descent.

Generally, this means that we are simply making it so that the distribution of the outputs from a layer has mean = 0 and standard deviation = 1. However, we can also multiply the normalization expression by a scalar called the gain in order to control what the layers get normalized to.

Layer Normalization

def layer_norm(inp, gain=1):
    return gain * (inp - inp.mean()) / inp.std()


layer_outputs = np.random.random((10,))
layer_norm(layer_outputs)

Positional Embeddings in Transformers

As you know, when it comes to getting the meaning of a sentence or text, the order of the words matters. Since we are basically getting a weighted sum of the values corresponding to different tokens of the query, by default, the order is not taken into consideration.

To address this, we can add another term to each embedding vector so that the token’s position in a sentence is also captured. This added term is known as positional embedding.

Positional Embedding equations

def get_position_encoded_embeddings(embeddings):
    seq_len, d = embeddings.shape

    pe = np.zeros(embeddings.shape)
    for pos in range(seq_len):
        for i in range(d):
            if i % 2 == 0:
                pe[pos, i] = np.sin(pos / 10000 ** (2 * i / d))
            else:
                pe[pos, i] = np.cos(pos / 10000 ** (2 * i / d))
    return embeddings + pe

Here’s an example of how we can use this function:

get_position_encoded_embeddings(queries)

Transformer

Transformer Architecture

The transformer can be divided into two main components:

Encoder (left)
Decoder (right)

The inputs to both these components are summed with the positional embeddings prior to being fed into the component.

Encoder

The encoder of a transformer can be used to encode an input sentence/sequence into a vector representation.

The encoder consists of sets of blocks placed sequentially, each of which consists of the following components:

Multi-head Attention Unit - Set up as self-attention
Add MHA inputs to MHA outputs and apply Layer Normalization
Feed Forward NN
Add NN inputs to NN outputs and apply Layer Normalization

Note: The encoder, although often used, may not be a compulsory component of the architecture, based on the context. For example, a text generation task where there is no input can work without the encoder.

Decoder

The decoder of a transformer takes the encoder outputs and the previous outputs as inputs and generates a new output based on them.

The decoder, too, consists of a set of blocks, each of which consists of the following components:

Masked Multi-head Attention Unit - Masks out the embeddings of the future tokens
Add Masked-MHA inputs to Masked-MHA outputs and apply Layer Normalization
Multi-head Attention Unit - values = keys = encoder_outputs, queries = normalized_masked_mha_outputs
Add MHA inputs to MHA outputs and apply Layer Normalization
Feed Forward NN
Add NN inputs to NN outputs and apply Layer Normalization

Note: The decoder has two main inputs, the encoder outputs and the tokens generated up to the current state by the decoder. Since there is no such token as the first input of the decoder, the first token is usually specified to be a special start token.

Implementation

Let’s first define some variables and functions we can use in the implementation. Feel free to read the comments for an idea of what each variable defines.

def get_random_mha_weights(n_heads):
    # Generate a random set of weights for a single MHA module
    return [
        [np.random.random((d, d)) for _ in range(n_heads)],   # W_Q
        [np.random.random((d, d)) for _ in range(n_heads)],   # W_K
        [np.random.random((d, d)) for _ in range(n_heads)],   # W_V
        np.random.random((d, n_heads * d))                    # W_O
    ]


# Number of words in the vocabulary
vocab_size = 10
# Embedding length
d = 2
# Embedding definition
embedding_table = np.random.random((vocab_size, d))

# Number of heads in MHA blocks
num_heads = 4
# Number of encoder blocks
num_enc_blocks = 3
# Number of decoder blocks
num_dec_blocks = 3

# MHA linear layer weights for each MHA component
enc_mha_weights = [get_random_mha_weights(num_heads) for _ in range(num_enc_blocks)]
dec_mha_weights = [get_random_mha_weights(num_heads) for _ in range(num_dec_blocks)]

# Linear layer weights for decoder masked MHA component
dec_mmha_weights = [get_random_mha_weights(num_heads) for _ in range(num_dec_blocks)]

# Weights for linear layer after decoder
dec_lin_weights = np.random.random((d, vocab_size))


# NOTE: For the sake of keeping numpy as the only dependency and for simplicity, I will use a matrix as a substitute for the neural network.
# However, keep in mind that a neural network works somewhat differently than a linear layer since it can have multiple layers with non-linear layers between layers.
enc_nn_weights = [np.random.random((d, d)) for _ in range(num_enc_blocks)]
dec_nn_weights = [np.random.random((d, d)) for _ in range(num_dec_blocks)]

Defining the encoder and the decoder

Next, let’s define the functions of the encoder and decoder.

def encoder(enc_inputs, mha_weights, nn_weights, n_blocks=3):
    for i in range(n_blocks):
        # Self Attention with MHA + Add & Norm
        q = k = v = enc_inputs
        out = mha(q, k, v, mha_weights[i], num_heads)
        out_norm = layer_norm(out + q)
        
        # Linear + Add & Norm
        nn_out = linear(nn_weights[i], out_norm)
        nn_out_norm = layer_norm(nn_out + out_norm)
        
        enc_inputs = nn_out_norm
    
    return q

def decoder(dec_inputs, enc_outputs, mha_weights, mmha_weights, nn_weights, vocab_size, n_blocks=3):
    for i in range(n_blocks):
        # Self Attention with MHA + Add & Norm
        q = k = v = dec_inputs
        out = masked_mha(q, k, v, mmha_weights[i], num_heads)
        out_norm = layer_norm(out + q)
        
        # MHA + Add & Norm
        out_mha = mha(out_norm, enc_outputs, enc_outputs, mha_weights[i], num_heads)
        out_mha_norm = layer_norm(out_mha + out_norm)
        
        # Linear + Add & Norm
        nn_out = linear(nn_weights[i], out_mha_norm)
        nn_out_norm = layer_norm(nn_out + out_norm)
        
        dec_inputs = nn_out_norm
    
    # Take the last vector since it corresponds to the new word
    lin_out = nn_out_norm[-1] @ dec_lin_weights
    
    # Softmax
    lin_out_exp = np.exp(lin_out)
    softmax_out = lin_out_exp / np.sum(lin_out_exp)
    
    # Sample word using the probabilities from softmax
    return random.choices(range(vocab_size), weights=softmax_out)[0]

Inference

Finally, let’s use all these functions to generate a new sentence, given an input sentence.

# Input sentence
sentence = np.random.randint(0, vocab_size, size=3)
enc_embeddings = embedding_table[sentence]       # (seq_len, d)
pos_enc_embeddings = get_position_encoded_embeddings(enc_embeddings)

# Start token embedding
words = [0]
dec_embeddings = [embedding_table[0]]
pos_dec_embeddings = get_position_encoded_embeddings(np.array(dec_embeddings))

# Encoder
enc_out = encoder(pos_enc_embeddings, enc_mha_weights, enc_nn_weights, num_enc_blocks)

max_chars = 20
# Decoder loop
for _ in range(20):
    word = decoder(pos_dec_embeddings, enc_out, dec_mha_weights, dec_mmha_weights, dec_nn_weights, vocab_size, num_dec_blocks)
    if word == vocab_size - 1:
        break
    words.append(word)
    dec_embeddings.append(embedding_table[word])
    pos_dec_embeddings = get_position_encoded_embeddings(np.array(dec_embeddings))

# Output sentence
print(words)

Final Thoughts

That’s it. Hope you found it helpful.

Do keep in mind the notes/caveats I mentioned at the beginning of the post. On that note, I will try to connect these concepts with the training side of transformers, at a later date. You can keep an eye out for that here if you wish.