LLMs are ubiquitous tools for processing and producing natural language text. Since their inception in 2018, several generations of LLMs have continuously pushed the frontier of LLM capabilities. Today’s LLMs such as LLaMA 2 and GPT-4 are universally applicable to all classical NLP tasks, but this was not the case for the early models of 2018. These gen1 LLMs have around 150M parameters. They are typically trained on the Toronto Book Corpus and Wikipedia text, with the goal of optimizing the prediction of a word given its context of previous and following words. While the model architectures differ, e.g. in the number of attention heads and hidden dimensions, the resulting models need to be fine-tuned for any downstream NLP task.
But what exactly does it mean to fine-tune a model? This article explains the general concepts and techniques to fine-tune generation 1 LLMs. It starts with describing the most common NLP benchmarks and datasets, then details the technical realization of fine-tuning, and shows how to persist the fine-tuned model for inference.
Pre-Trained LLMs: Components of the Transformer Architecture
A transformer is a specific type of neural network. It is the result of a steady stream of scientific innovations that led to a scalable mechanism for providing both absolute and relative meaning to an individual token inside a whole text. This attention mechanism was created in the famous paper Attention is All You Need. Nowadays, the transformer is the de-facto standard architecture for LLMs.
A transformer consists of several interrelated building blocks. The complete architecture is shown in the following picture:
Source: Wikipedia, Transformer (machine learning model), https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)
The individual building blocks work as follows:
- Input transformation: Each token is converted to a numerical representation using byte pair encoding, and then positional information is added.
- Encoder: Iteratively processes each token through all layers. The self-attention layer creates a token representation in which both the absolute and relative representation in the context of the other tokens is considered.
- Decoder: Iteratively processes each output token from the encoder and the tokens produced by the decoder itself. The cross-attention layer consumes the encoder output, and the self-attention layer processes the decoder's output tokens in the same fashion as the encoder's self-attention layer.
- Feed-forward networks: Each encoder and decoder layer includes a feed-forward neural network. In this network, information flows only forward, but residual connections enable skipping some layers. The task of this network is to further transform each token representation individually, position by position.
- Attention heads: In an attention head, three different weight matrices are learned: query, key, and value. Each token processed by a head is multiplied with each matrix to generate the token's output representation. Typically, layers have multiple attention heads, and these learn different types of relevance between the tokens, for example to anticipate the very next token, or to identify the subject to which a verb applies, and more. A minimal sketch of a single attention head follows this list.
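To make the attention computation concrete, here is a minimal NumPy sketch of a single attention head. It is illustrative only: the function name, dimensions, and random inputs are chosen for this example, and real implementations add batching, masking, and an output projection.

```python
import numpy as np

def attention_head(tokens, W_q, W_k, W_v):
    """One attention head: every token is related to every other token.

    tokens: (seq_len, d_model) input representations
    W_q, W_k, W_v: the three learned weight matrices, here (d_model, d_head)
    """
    Q = tokens @ W_q   # queries: what each token is looking for
    K = tokens @ W_k   # keys: what each token offers
    V = tokens @ W_v   # values: the content that gets mixed together
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise relevance of tokens
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # row-wise softmax
    return weights @ V   # each output row is a relevance-weighted mix of values

# Toy example: 4 tokens, model dimension 8, head dimension 4
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
W_q, W_k, W_v = [rng.normal(size=(8, 4)) for _ in range(3)]
print(attention_head(tokens, W_q, W_k, W_v).shape)  # (4, 4)
```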
Concrete transformer models use either both an encoder and decoder (BART), an encoder only (BERT), or a decoder only (GPT, XLNet). Decoder-only became the dominant type of transformer architecture starting from 2021, because causal self-attention over the combined input and output tokens can take over the role of the entire encoder block.
Fine-Tuning Tasks, Benchmarks and Datasets
To fine-tune an LLM for a specific task, an adequate dataset needs to be chosen, a benchmark for performance comparison created, and the fine-tuning training applied. Interestingly, with the increased sophistication of LLMs and general research attention, these three disparate aspects merged and created several benchmarks and accompanying datasets that are routinely used to train and test models.
The following table summarizes the most important benchmarks.
| Abbreviation | Name | Metric |
|---|---|---|
| cola | The Corpus of Linguistic Acceptability | Matthew's Corr |
| sst2 | The Stanford Sentiment Treebank | Accuracy |
| mrpc | Microsoft Research Paraphrase Corpus | F1 / Accuracy |
| stsb | Semantic Textual Similarity Benchmark | Pearson-Spearman Corr |
| qqp | Quora Question Pairs | F1 / Accuracy |
| mnli | MultiNLI | Accuracy |
| qnli | Question NLI | Accuracy |
| rte | Recognizing Textual Entailment | Accuracy |
| wnli | Winograd NLI | Accuracy |
| - | Diagnostics Main | Matthew's Corr |
All of these distinct benchmarks are summarized and included in GLUE, the General Language Understanding Evaluation. Specific tasks measure and calculate a score for the detection of linguistically acceptable sentences, the textual entailment of two paragraphs, or question answering.
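For experimentation, the GLUE tasks are directly available via the Hugging Face datasets library. A minimal sketch, assuming the library is installed, for loading the sst2 task:

```python
from datasets import load_dataset

# Download the SST-2 task from the GLUE benchmark collection
sst2 = load_dataset("glue", "sst2")

print(sst2)              # splits: train, validation, test
print(sst2["train"][0])  # one example: a sentence plus a 0/1 sentiment label
```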
Introspection of a Pre-Trained Transformer Model
To understand how fine-tuning changes a model, we need to understand what a pre-trained model looks like. The encoder-only BERT model serves as the example to understand this process.
The BERT model is trained on masked language modelling and next-sentence prediction. In masked language modelling, for a given set of tokens, the probability of a missing token is determined. And in next-sentence prediction, the model is provided with a pair of sentences and needs to determine whether the second sentence is actually the next sentence of the first.
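The masked language modelling objective can be observed directly with the transformers library's fill-mask pipeline; a quick sketch (the example sentence is made up for illustration):

```python
from transformers import pipeline

# BERT predicts the token hidden behind the [MASK] placeholder
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for candidate in fill_mask("The capital of France is [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
# The top candidate should be "paris" with a high probability
```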
Here is a very technical description of how input data is processed by a transformer. The input is processed through each layer, and in each layer the token passes through self-attention => cross-attention => feed-forward network. The input token representation starts as an individual encoded byte pair plus its positional encoding, is then modified in the context of all previously produced tokens, and is finally modified by the feed-forward network to normalize its information. This representation is passed to the next layer. When the decoder produces its first token, its representation will be consumed by all subsequent self-attention heads as the first output token while the second (new) input token is processed.
The training process of a transformer model is essentially a continued application of gradient backpropagation. The decoder's output is a multi-dimensional matrix attributing a probability to each token contained in the model's vocabulary. Using the expected token as the training goal, the model's weights and biases are updated after each training step to optimize the model's accuracy.
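In code, this training goal is commonly expressed as a cross-entropy loss between the predicted token probabilities and the expected tokens. A minimal PyTorch sketch with random stand-in data:

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 30522, 10

# Stand-in for the decoder output: one logit per vocabulary entry and position
logits = torch.randn(seq_len, vocab_size, requires_grad=True)
# The expected token ids act as the training goal
targets = torch.randint(0, vocab_size, (seq_len,))

loss = F.cross_entropy(logits, targets)  # distance between prediction and goal
loss.backward()                          # backpropagation computes the gradients
print(logits.grad.shape)                 # torch.Size([10, 30522])
```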
The BERT model has 110M parameters. This number results from the embedding matrices plus, for each layer, the attention weight matrices and the weights of the connected feed-forward network, summed over the total number of layers. When the pre-training stage is finished, the 110M parameters represent the final learned state of the LLM based on its masked language modelling and next-sentence prediction tasks.
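A back-of-the-envelope calculation reproduces this number from the published BERT-base dimensions (12 layers, hidden size 768, feed-forward size 3072, vocabulary of 30522 tokens); biases, layer norms and token-type embeddings are ignored here for simplicity:

```python
vocab_size, hidden, ffn_hidden, layers, max_positions = 30522, 768, 3072, 12, 512

embeddings = (vocab_size + max_positions) * hidden  # token + position embeddings
attention = 4 * hidden * hidden                     # W_q, W_k, W_v and output projection
ffn = 2 * hidden * ffn_hidden                       # two linear transformations
per_layer = attention + ffn

total = embeddings + layers * per_layer
print(f"{total / 1e6:.0f}M parameters")             # ~109M, close to the quoted 110M
```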
Fine-Tuning Transformer Models
Fine-tuning means to adapt an LLM in such a way that it performs well on a specific downstream task. This adaptation takes different forms and changes different aspects of the LLM, depending on the task.
Typically not changed are the model's vocabulary and embeddings. The tokenization process needs to consider whether all input tokens are contained in the LLM's vocabulary, and if not, either drop them or change them to an "unknown" token type (not a "masked" type though). Also, the total number of input tokens cannot exceed the LLM's maximum input length. The input processing pipeline also does not change: tokens are changed to byte-pair encodings, then enriched with positional encodings and passed through the LLM's layers to create the probability matrix of output tokens. From this matrix, the output word or text is created step by step.
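The following sketch shows this unchanged input processing with the transformers tokenizer for BERT; the question/paragraph pair is made up for illustration:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A question/paragraph pair, padded to a fixed sequence length of 16 tokens
encoding = tokenizer("Who wrote the book?", "The author wrote the book.",
                     padding="max_length", max_length=16, truncation=True)

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# ['[CLS]', 'who', 'wrote', 'the', 'book', '?', '[SEP]',
#  'the', 'author', 'wrote', 'the', 'book', '.', '[SEP]', '[PAD]', '[PAD]']
```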
Typically changed are the model's architecture, the training goal and the specific training metric that is used to determine the backpropagation changes. In question answering, the output tokens generated by the decoder layer are stacked until an end-of-sequence token is produced. The resulting sentence is then scored with a specific metric, like the cross-entropy loss, to determine the performance. Transformer-based LLMs are trained by gradient updates through backpropagation, and backpropagation is the method applied during fine-tuning as well.
While in principle the modification of the LLM's complete weights and biases is possible, practical considerations and limitations need to be taken into account. Training is a compute-intensive process, and full re-training changes the model from the ground up, which could lead to worse performance on benchmarks it was originally tested on. Instead, either a few of the last layers of an LLM are modified, or new layers are stacked on top that can be fully changed during fine-tuning training.
The following model modification strategies are possible:
- Full re-training: All model parameters (cross-attention/self-attention head weight matrices, feed-forward network weights) are opened for value updates. These updates can be achieved with backpropagation or other means.
- Gradual unfreezing: The model parameter updates are restricted to individual layers only. At the start of the fine-tuning process only the parameters of the final layer are updated, then after training for a certain number of updates the parameters of the second-to-last layer are also included, and so on.
- Layer specific unfreezing: Instead of opening all layer parameters for gradient updates, only some of them can be modified, for example only the self-attention heads, or only the parameters of the feed-forward network.
- Layer modification: The transformer architecture is changed to include additional blocks inside each layer. Typically following the feed-forward network, a new neural network block, such as a linear layer with ReLU activation, is added. This block has the same dimensions as the feed-forward network. During training, only the weights of these new blocks are updated.
- Layer addition: Additional layers are added on top of the encoder or decoder layer stack. These layers can be self-contained decoders themselves or a combination of other neural network types. Training then only updates the weights of the added layers.
To the best of my knowledge, fine-tuning processes are not full re-trainings, but typically a combination of gradual unfreezing and layer addition.
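As a minimal sketch of how layer addition and unfreezing look in code, here is the transformers library's `BertForSequenceClassification` with a frozen encoder and only the last layer plus the new classification head left trainable (which layer to unfreeze is an illustrative choice):

```python
from transformers import BertForSequenceClassification

# Layer addition: this class stacks a fresh, untrained classification head on BERT
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Freeze the whole pre-trained encoder ...
for param in model.bert.parameters():
    param.requires_grad = False

# ... then unfreeze only the last encoder layer; gradual unfreezing would later
# extend this loop to the second-to-last layer, and so on
for param in model.bert.encoder.layer[-1].parameters():
    param.requires_grad = True

# The newly added head (model.classifier) remains trainable by default
```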
Fine-Tuning Process Steps
Machine learning projects follow a well-established sequence of steps: starting with dataset loading and exploration, continuing with model definition and training configuration, and ending with training and evaluation.
In general, fine-tuning an LLM follows these steps too, but with a different focus. To make the explanation more understandable, the following phases are discussed in the context of fine-tuning a BERT model for question-answering.
- Base Model Selection: The first step is to determine which concrete LLM will be used. This sets several constraints, most importantly the tokenization scheme. For example, when using BERT, all sentences need to start with the `CLS` token, two consecutive sentences need to be distinguished by `SEP`, and all sequences need to add `PAD` tokens to arrive at their defined sequence length.
- Fine-Tuning Goal and Approach: The concrete task goal is determined, and a general training approach created that funnels decisions in all following stages. Equally important is to fully understand all base model limitations that will shape the fine-tuning process.
- Dataset Selection: Both the training and evaluation datasets are required to be in pure text format, since other encodings or tokenization schemes might not work with the intended model.
- Dataset Exploration & Preprocessing: As in all machine learning projects, the dataset should be checked for any kind of anomaly, for example unknown encodings, text in another language, or tokens that are not defined in the intended model. Also, the general quality of the input material can be calculated, and e.g. all inputs below a threshold value dropped.
- Train and Test Dataset Tokenization: This essential phase transforms the raw text input into the format required by the LLM. If the raw text contains tokens that were not included in the model's original input data, they either need to be masked or dropped. This phase also incorporates the training goal to create suitable tokenized output. In the BERT model example, a typical strategy is to create sequences that combine the question and the paragraph containing the answer.
- Model Modification: The intended model is loaded and customized with one or several of the modification strategies. In the example of a BERT model fine-tuned for question answering, a new output layer is added that computes the start and end positions of the answer tokens in the paragraph.
- Training Parameters: A suitable training metric needs to be defined with respect to the training goal, such as accuracy for text generation, or cross-entropy for classification. Also, all training hyperparameters are determined, such as the batch size, learning rate and optimizer.
- Train and Evaluate: The model is trained for the defined number of epochs, the required metrics are calculated and compared between runs, and the best fitting model is determined. How much of the training process needs to be defined manually depends heavily on the libraries used, since each supports different abstractions and features, such as parallelizing training over multiple servers or GPUs. A condensed sketch of a single training step follows this list.
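To illustrate the later phases, here is a condensed, hypothetical sketch of a single training step for BERT question answering. The toy input and the hard-coded answer span positions are made up for this example; a real project would compute the positions from the dataset and loop over many batches:

```python
import torch
from transformers import AutoTokenizer, BertForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Model modification: a fresh span-prediction output layer is added on top of BERT
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

# One toy question/paragraph pair; real training iterates over a whole dataset
batch = tokenizer("Who wrote the book?", "The author wrote the book.",
                  return_tensors="pt")
# Start/end token positions of the answer span ("author"), hard-coded here
start_positions, end_positions = torch.tensor([8]), torch.tensor([8])

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

outputs = model(**batch, start_positions=start_positions,
                end_positions=end_positions)
outputs.loss.backward()  # cross-entropy over the start/end logits
optimizer.step()
optimizer.zero_grad()
```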
Training libraries provide helpful abstractions that facilitate fine-tuning. The transformers library explicitly supports layer addition and gradual unfreezing with the `BertForSequenceClassification` and `BertForQuestionAnswering` pipelines, and also full re-training with `BertForPreTraining`, `BertForMaskedLM`, and `BertForNextSentencePrediction`. A complete end-to-end code example is out of scope for this article, but if you are curious, check out the excellent BERT Fine-Tuning Tutorial with PyTorch and the Colab notebook Fine-tuning BART for summarization in two languages.
Conclusion
Large Language Models from 2018 are transformer models trained on masked language modelling and next-sentence prediction. These models exhibit fascinating text generation capabilities and can infer information from their training material. To make them applicable to NLP tasks like classification, summarization and question answering, fine-tuning is required. This article explained all required fine-tuning aspects. The first part showed benchmarks and datasets. The second part detailed technical aspects of the transformer models and illuminated that fine-tuning boils down to modifying weight and bias parameters. The third part generalized this to model modification strategies and the concrete fine-tuning process steps.