Transformer Deep Dive

sangjun_park · Sep 6 · Dev Community

1. Introduction

Transformers have become one of the most important concepts in Natural Language Processing (NLP), emerging as a response to key limitations in existing approaches. The architecture addresses several challenges faced by earlier models like Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks:

  • Sequential processing bottleneck: RNNs and LSTMs process input sequences one element at a time, limiting their efficiency.

  • Long-range dependencies: These earlier models struggled to capture relationships between distant elements in a sequence.

  • Attention mechanism potential: Previous research showed that attention mechanisms could improve sequence modeling. Transformers explored whether attention alone could replace recurrent structures.

  • Training efficiency: The Transformer architecture enables faster and more efficient training on large datasets compared to its predecessors.

By tackling these issues, Transformers revolutionized NLP tasks, offering improved performance and scalability.

2. Main Concept of Transformer

2-1. Attention Mechanism

[Figure: Attention Mechanism]

The attention mechanism in Transformers is inspired by human cognitive processes. Just as humans focus on specific, relevant parts of information when processing complex input, the attention mechanism allows the model to do something similar:

  • 1. Selective focus: It enables the model to "pay attention" to certain parts of the input sequence more than others.

  • 2. Importance weighting: The mechanism assigns different weights to different parts of the input, emphasizing more relevant or important information.

  • 3. Context-dependent processing: Unlike fixed weighting, the importance of each part can change based on the current context or task.

  • 4. Efficient information extraction: This selective focus helps the model efficiently extract and utilize the most relevant information from the entire input sequence.

  • 5. Improved handling of long-range dependencies: By directly connecting different positions in the sequence, attention helps capture relationships between distant elements more effectively than sequential processing.

2-1-1. Self-Attention

Self-attention is a mechanism that allows a model to focus on different parts of an input sequence when processing each element of that same sequence. The key distinction between conventional attention and self-attention lies in the reference point. In self-attention, each element in the sequence attends to every other element within the same sequence. This means that the input and the reference sequence are identical, which is the defining characteristic of self-attention.

So why do we need self-attention instead of conventional attention? There are several reasons.

The first reason is capturing intra-sequence dependencies. Self-attention allows each element to directly interact with every other element in the sequence, which helps capture long-range dependencies within the input and is crucial for understanding context in language.

The second reason for using self-attention is its ability to enable parallelization. Unlike RNNs, self-attention can be computed for all elements in parallel, which significantly improves computational efficiency. While there are other forms of attention, such as cross-attention used in encoder-decoder architectures, self-attention offers unique advantages. Cross-attention computes the relevance between encoder states and decoder states, but a decoder that conditions on its own previous outputs still has to generate sequentially. Self-attention, by contrast, allows for parallel processing within a single sequence, making it particularly well suited to modern, high-performance computing hardware.

Also, there is no sequential bottleneck in self-attention. Conventional attention in seq2seq models often relies on sequential processing of the input, whereas self-attention removes this bottleneck, allowing for more efficient processing of long sequences.
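To make this concrete, here is a minimal sketch of scaled dot-product self-attention in PyTorch. It assumes a single sequence of embeddings x of shape (seq_len, d_model); the function name, projection matrices, and dimensions are illustrative placeholders rather than details from the article.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a single sequence.

    x:             (seq_len, d_model) input embeddings
    w_q, w_k, w_v: (d_model, d_k) learned projection matrices
    """
    q = x @ w_q   # queries come from the same sequence...
    k = x @ w_k   # ...as the keys...
    v = x @ w_v   # ...and the values (that is what makes it "self"-attention)
    d_k = q.size(-1)
    scores = q @ k.transpose(0, 1) / d_k ** 0.5   # (seq_len, seq_len) pairwise relevance
    weights = F.softmax(scores, dim=-1)           # every position attends to every position
    return weights @ v                            # weighted sum of value vectors

# Toy usage: 5 tokens, d_model = 8, d_k = 4
x = torch.randn(5, 8)
w_q, w_k, w_v = (torch.randn(8, 4) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)   # shape (5, 4)
```

Note that the score matrix for all positions is computed in one matrix multiplication, which is exactly where the parallelism over the sequence comes from.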

2-1-2. Multi-Head Attention

[Figure: Multi-Head Attention]

Next, we'll look at multi-head attention and why we need this mechanism. Multi-head attention extends the idea of single-head attention by running multiple attention heads in parallel on the same input sequence. This allows the model to learn different types of relationships and patterns within the input data simultaneously, considerably enhancing the expressive power of the model compared to using just a single attention head.

Multi-head attention enhances the normal attention mechanism by utilizing multiple sets of learnable Query (Q), Key (K), and Value (V) matrices instead of a single set. This approach offers two key advantages. Firstly, it enables parallel processing, potentially increasing computational speed compared to normal attention. Secondly, it allows the model to simultaneously focus on different aspects of the input. By creating several sets of Q, K, and V matrices, multi-head attention can capture various facets of the input in parallel, facilitating the learning of diverse features and relationships within the data. This capability typically results in improved model performance, particularly for complex tasks such as machine translation. While multi-head attention does require more computational resources, the enhanced results it produces often justify this additional cost.
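As a rough illustration, the sketch below implements multi-head self-attention in PyTorch by projecting the input once per Q/K/V and splitting the result into several heads, each with its own slice of the projection. The class name, dimensions, and the choice to fuse the per-head projections into single linear layers are my own simplifications, not details from the article.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention: num_heads sets of Q/K/V run in parallel."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One big projection per Q/K/V; the reshape below splits it into per-head matrices.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # mixes the concatenated heads back together

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        b, t, _ = x.shape

        def split(proj):
            # Project, then split the last dimension into (num_heads, d_head).
            return proj(x).view(b, t, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.w_q), split(self.w_k), split(self.w_v)  # (b, heads, t, d_head)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5        # per-head attention scores
        weights = scores.softmax(dim=-1)
        out = weights @ v                                            # (b, heads, t, d_head)
        out = out.transpose(1, 2).reshape(b, t, -1)                  # concatenate the heads
        return self.w_o(out)

mha = MultiHeadSelfAttention(d_model=64, num_heads=8)
y = mha(torch.randn(2, 10, 64))   # shape (2, 10, 64)
```

Each head sees only a d_model / num_heads slice of the representation, which is how several attention patterns can be learned in parallel at roughly the cost of one full-width head.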

2-2. Positional Encoding

It's true that the Transformer model can process input in parallel, making it faster than sequential models like RNNs or LSTMs. However, this parallel processing means the Transformer can't inherently capture the sequence of words. To address this, we need to add positional information separately. Word position is crucial because changing the order of words can alter the sentence's meaning.

When applying positional encoding, we must consider two key factors. First, each position should have a unique identifier that remains consistent regardless of the sequence length or input. This ensures the positional embedding works identically even if the sequence changes. Second, we must be careful not to make the positional values too large, as this could overshadow the semantic or syntactic information, hindering effective training in the attention layer.

[Figure: Sine & Cosine Functions]

A technique that addresses the factors mentioned earlier is the use of Sine & Cosine Functions for positional encoding. This approach offers several advantages. Firstly, sine and cosine functions always maintain values between -1 and 1, ensuring that the positional encoding doesn't overshadow the input embeddings. While sigmoid functions also satisfy this constraint, they're less suitable because the gap between values for adjacent positions can become extremely small.

Some might worry that different positions could yield identical values. However, this concern is mitigated by using multiple sine and cosine functions of different frequencies to create a vector representation for each position. By increasing the frequency of these functions, we can create more varied encodings, making it highly unlikely for different positions to have the same representation.

The equation that produces these different frequencies depends on both the position and the dimension, and looks like this. Here pos denotes the token position, i indexes the embedding dimension (even dimensions use sine, odd dimensions use cosine), and d_model is the total number of dimensions.

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
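For illustration, here is a small PyTorch sketch that builds this sinusoidal table, assuming an even d_model; the function name and shapes are illustrative.

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Builds the (seq_len, d_model) table of sine/cosine positional encodings."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)           # even dimension indices
    angles = pos / (10000 ** (i / d_model))                        # one frequency per pair of dims
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(angles)   # odd dimensions use cosine
    return pe                         # values stay in [-1, 1], so they do not swamp the embeddings

# The encoding is simply added to the token embeddings before the first encoder layer.
embeddings = torch.randn(20, 512)
x = embeddings + sinusoidal_positional_encoding(20, 512)
```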

2-3. Encoder-Decoder

The Transformer architecture consists of two main components: the Encoder and Decoder.

2-3-1. Encoder

[Figure: Encoder]

The encoder, which is composed of a stack of identical layers, processes the input sequence. Each layer has two sub-layers: the first is the multi-head self-attention mechanism, and the second, applied after it, is a feedforward neural network. Both the multi-head self-attention and the feedforward network are followed by a residual connection and layer normalization. The entire process within each encoder layer looks like this:

Input → Multi-head Self-Attention → Add & Norm → Feedforward NN → Add & Norm → Output

This entire process is then repeated for each subsequent encoder layer in the stack; a sketch of one such layer follows. After that, I'll discuss the Add (residual connection) & Norm (layer normalization) steps in more detail.
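The compact PyTorch sketch below mirrors the pipeline above, assuming the usual ReLU feedforward block and post-sub-layer normalization; the class name, dimensions, and use of nn.MultiheadAttention are illustrative choices, not details taken from the article.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention and a feedforward network, each wrapped in Add & Norm."""

    def __init__(self, d_model: int, num_heads: int, d_ff: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        attn_out, _ = self.self_attn(x, x, x)  # Q, K, V all come from the same input
        x = self.norm1(x + attn_out)           # Add (residual) & Norm
        x = self.norm2(x + self.ffn(x))        # Add (residual) & Norm
        return x

# Stacking several identical layers gives the full encoder.
encoder = nn.Sequential(*[EncoderLayer(512, 8, 2048) for _ in range(6)])
out = encoder(torch.randn(2, 20, 512))   # shape (2, 20, 512)
```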

2-3-1-1. Residual Connections

Residual connections, also known as "skip connections" or "shortcut connections", serve to preserve information flow in neural networks. These connections allow unmodified input information to pass directly through the network, helping to retain important features from the original input that might otherwise be lost through multiple-layer transformations. Crucially, this approach helps mitigate the difficulty of learning in deep neural networks by providing a direct path for information and gradients to flow.

2-3-1-2. Layer Normalization

Layer normalization is applied in the "Add & Norm" step of Transformers to stabilize learning, reduce training time, and decrease dependence on careful initialization. It works in three steps: calculating the mean and variance across the last dimension (also called the feature dimension), normalizing the input using these statistics, and finally scaling and shifting the result with learnable parameters. This process helps maintain consistent value scales throughout the network, allowing each layer to learn more effectively and independently.
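The sketch below walks through those three steps in PyTorch for a tensor whose last dimension is the feature dimension, and checks the result against torch.nn.functional.layer_norm; the variable names and shapes are illustrative.

```python
import torch

def layer_norm(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor, eps: float = 1e-5):
    """Layer normalization over the last (feature) dimension.

    x:     (..., d_model) activations
    gamma: (d_model,) learnable scale
    beta:  (d_model,) learnable shift
    """
    mean = x.mean(dim=-1, keepdim=True)                  # 1. statistics per token, across features
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)           # 2. normalize to zero mean, unit variance
    return gamma * x_hat + beta                          # 3. learnable scale and shift

x = torch.randn(2, 10, 512)
gamma, beta = torch.ones(512), torch.zeros(512)
assert torch.allclose(layer_norm(x, gamma, beta),
                      torch.nn.functional.layer_norm(x, (512,)), atol=1e-5)
```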

2-3-2. Decoder

[Figure: Transformer Decoder]

The Transformer's decoder generates the output sequence, differing from the encoder in key aspects. It generates tokens sequentially at inference time rather than in parallel, employs both self-attention and cross-attention mechanisms, and uses masked self-attention to preserve the autoregressive property. The decoder maintains a causal dependency on previous outputs and includes an additional output projection layer. These distinctions enable the decoder to produce coherent outputs based on the encoded input.

2-3-2-1. Masked Multi-Head Attention

The first thing to discuss in the decoder is masked multi-head attention. This structure exists primarily to maintain the autoregressive property during training and inference. The decoder generates output tokens one at a time, from left to right, and each token should only depend on previously generated tokens. The mask ensures the model can only attend to previous tokens in the sequence, not future ones. This lets the model be trained in a way that mimics the inference process, where future tokens are unknown, while still allowing parallel computation during training.
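A minimal sketch of the masking step, assuming single-head scaled dot-product attention in PyTorch with illustrative shapes: future positions are set to negative infinity before the softmax, so each row's weights cover only the current and earlier tokens. Because the mask is applied to the full score matrix at once, all positions can still be computed in parallel during training.

```python
import torch
import torch.nn.functional as F

def masked_self_attention(q, k, v):
    """Causally masked self-attention: position t may only attend to positions <= t.

    q, k, v: (seq_len, d_k) projections of the decoder input.
    """
    seq_len, d_k = q.shape
    scores = q @ k.transpose(0, 1) / d_k ** 0.5
    # Upper-triangular entries correspond to "future" tokens; setting them to -inf
    # makes softmax assign them zero weight.
    causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(5, 16)
out = masked_self_attention(q, k, v)   # every row mixes only current and earlier tokens
```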

2-3-2-2. Multi-Head Attention for Decoder

In the Transformer architecture, the interaction between the encoder and decoder is facilitated by a critical mechanism known as cross-attention. This process utilizes queries derived from the decoder and key-value pairs obtained from the encoder, serving as a vital link between the two components. The queries, originating from the current decoder layer, represent the specific information the decoder seeks at each step of the output generation. Conversely, the keys and values, products of the encoder's processing, encapsulate the essence of the input sequence.

This interplay allows the decoder to align its output generation with the most relevant elements of the input, capturing and leveraging context from the entire input sequence rather than a narrow window of information. The mechanism is also dynamic: for each output token being generated, the decoder can shift its focus to different parts of the input, so the most pertinent information is always available to the generation process.

Essentially, cross-attention acts as a bridge between the input (the source sequence processed by the encoder) and the output, which the decoder generates one token at a time. This bridging is crucial in enabling the Transformer to perform tasks such as translation, summarization, and question answering with strong accuracy and contextual awareness, ensuring that each output token is produced with consideration of the full context provided by the input sequence.
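As a rough sketch of the shapes involved, the snippet below wires decoder states (queries) to encoder outputs (keys and values) using PyTorch's nn.MultiheadAttention; the batch size, sequence lengths, and dimensions are made-up examples.

```python
import torch
import torch.nn as nn

# Queries come from the decoder's current states; keys and values come from the
# encoder's output, so every decoding step can look back over the whole input.
d_model, num_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

encoder_output = torch.randn(2, 20, d_model)   # processed source sequence (keys and values)
decoder_states = torch.randn(2, 7, d_model)    # current target-side states (queries)

context, attn_weights = cross_attn(query=decoder_states,
                                   key=encoder_output,
                                   value=encoder_output)
# context:      (2, 7, 512) — one context vector per target position
# attn_weights: (2, 7, 20)  — how much each target position attends to each source position
```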

6. Conclusion

The Transformer architecture represents a significant leap forward in natural language processing, addressing key limitations of previous models like RNNs and LSTMs. Its core innovations – the self-attention mechanism, multi-head attention, and positional encoding – have revolutionized how machines process and understand language.

These innovations have not only improved performance on various NLP tasks but have also paved the way for larger, more powerful language models. The Transformer's scalability and efficiency have become the foundation for models like BERT, GPT, and their successors, driving rapid advancements in the field.

As NLP continues to evolve, the principles introduced by the Transformer architecture remain central to cutting-edge research and applications, underscoring its lasting impact on the field of artificial intelligence.
