Chapter 4 - Fine-Tune Models and Training Algorithms
HCM
Institute of Information Technology, Electrical and Electronics
Finetune model
1. Choose a Pre-trained Model: Select a model that has been trained on a large
dataset and whose architecture is suitable for the target task.
2. Prepare Your Dataset: Gather and preprocess the data specific to your task.
3. Modify the Model: Adapt the pre-trained model's architecture if necessary,
typically by replacing the final classification layer.
4. Freeze Layers (Optional but Common): Initially freeze the weights of the
early layers of the pre-trained model to prevent them from being drastically
changed by the new, smaller dataset.
5. Train the Model: Train the modified model on your dataset, typically using a
lower learning rate than you would for training from scratch.
6. Unfreeze and Retrain (Optional): After initial training, you might unfreeze
some of the earlier layers and continue training with an even lower learning rate
to fine-tune the entire model.
7. Evaluate Performance: Assess the performance of the finetuned model on a
validation set.
8. Hyperparameter Tuning: Adjust hyperparameters like learning rate, batch
size, and the number of frozen layers to optimize performance.
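A minimal PyTorch sketch of the steps above, assuming a torchvision ResNet-18 as the pre-trained model and a two-class target task; the number of classes, the learning rates, and the unfreezing schedule are placeholders for illustration.

import torch
import torch.nn as nn
from torchvision import models

# 1. Choose a pre-trained model (trained on ImageNet)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# 4. Freeze the pre-trained layers
for param in model.parameters():
    param.requires_grad = False

# 3. Modify the model: replace the final classification layer for the new task
num_classes = 2  # placeholder: number of classes in your dataset
model.fc = nn.Linear(model.fc.in_features, num_classes)

# 5. Train only the new head, typically with a fairly low learning rate
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

# 6. Optionally unfreeze everything later and continue with an even lower learning rate
for param in model.parameters():
    param.requires_grad = True
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)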
Finetune model
Key Considerations for Finetuning:
● Similarity of Datasets: The more similar your target task and
dataset are to the original task and dataset the pre-trained
model was trained on, the better the finetuning will likely
work.
● Size of the New Dataset: The size of your task-specific
dataset will influence how many layers you should unfreeze
and the learning rate you should use. Smaller datasets might
benefit from freezing more layers to prevent overfitting.
● Computational Resources: Finetuning can still be
computationally intensive, especially for large pre-trained
models.
● Potential for Catastrophic Forgetting: If you finetune too
aggressively on a very different task, the model might
"forget" the useful general features it learned during
pre-training.
When to Finetune:
● You have a limited amount of labeled data for your specific
task.
● A good pre-trained model exists for a related task or domain.
● You want to achieve good performance quickly without
training a large model from scratch.
Training Algorithms
Training algorithms are the methods used to teach a machine learning model to learn from data. They define how the model's internal
parameters (weights and biases in neural networks) are adjusted based on the training data to minimize a defined loss function. The loss
function measures the difference between the model's predictions and the actual target values.
Core Components of a Training Algorithm:
1. Loss Function (Objective Function): A function that quantifies the error or discrepancy between the model's predictions and the true
values in the training data. The goal of training is to minimize this function. Examples include:
○ Mean Squared Error (MSE): For regression tasks.
○ Binary Cross-Entropy: For binary classification.
○ Categorical Cross-Entropy: For multi-class classification.
2. Optimizer: An algorithm that determines how the model's parameters are updated to reduce the loss function. Common optimizers
include:
○ Gradient Descent (GD): A basic optimization algorithm that iteratively moves the parameters in the direction of the negative
gradient of the loss function.
○ Stochastic Gradient Descent (SGD): Updates parameters using the gradient calculated on a single randomly chosen training
example (or a small batch). This is more efficient for large datasets.
○ Adam (Adaptive Moment Estimation): An adaptive learning rate optimization algorithm that is widely used and often
performs well.
○ RMSprop (Root Mean Square Propagation): Another adaptive learning rate optimizer.
○ Adagrad (Adaptive Gradient Algorithm): Adapts the learning rate for each parameter based on the historical gradients.
3. Learning Rate: A hyperparameter that controls the step size at each iteration while moving towards a minimum of the loss function.
A high learning rate might lead to overshooting the minimum, while a low learning rate might result in slow convergence.
4. Batch Size: The number of training examples used in one iteration to calculate the gradient and update the model's parameters.
5. Number of Epochs: The number of times the entire training dataset is passed through the model during training.
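These components come together in a basic training loop; a minimal sketch with a toy model and synthetic data (all values are illustrative):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy data and model purely for illustration
X, y = torch.randn(256, 20), torch.randint(0, 3, (256,))
train_loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)  # batch size

model = nn.Linear(20, 3)
criterion = nn.CrossEntropyLoss()                          # loss function (multi-class)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # optimizer with learning rate
num_epochs = 10                                            # number of epochs

for epoch in range(num_epochs):
    for inputs, targets in train_loader:
        optimizer.zero_grad()                     # reset accumulated gradients
        loss = criterion(model(inputs), targets)  # forward pass and loss computation
        loss.backward()                           # backpropagation: compute gradients
        optimizer.step()                          # update parameters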
Training Algorithms
Common Training Algorithms (Optimization Algorithms):
● Gradient Descent and its Variants (SGD, Mini-batch GD): These form the foundation of many training algorithms, especially
for neural networks. They iteratively adjust model parameters based on the gradient of the loss function.
● Backpropagation: An algorithm used to efficiently calculate the gradients of the loss function with respect to the weights in a
neural network. It's a crucial part of training deep learning models.
● Evolutionary Algorithms (e.g., Genetic Algorithms): While less common for training deep learning models directly, they can be
used for tasks like hyperparameter optimization or neural architecture search.
Advanced Training Techniques:
● Learning Rate Scheduling: Adjusting the learning rate during training (e.g., decreasing it over time) can help the model
converge better.
● Regularization (L1, L2, Dropout): Techniques to prevent overfitting by adding a penalty to the loss function or randomly
dropping out neurons during training.
● Batch Normalization: A technique to stabilize and accelerate training by normalizing the activations of intermediate layers.
● Early Stopping: Monitoring the performance on a validation set and stopping training when the performance starts to degrade to
prevent overfitting.
● Data Augmentation: Creating artificial variations of the training data to increase its size and improve the model's generalization
ability.
Relationship Between Finetuning and Training Algorithms:
Finetuning is a specific application of the general training process. The key differences in finetuning often lie in:
● Initialization: The model's weights are initialized with the values learned during pre-training, rather than random initialization.
● Layer Freezing/Unfreezing: You strategically choose which layers to update during training.
● Learning Rate Adjustment: You often use different learning rates for different parts of the model.
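For example, PyTorch optimizers accept parameter groups, which is one way to give the pre-trained layers a smaller learning rate than the new head; the tiny stand-in modules and the learning rates below are illustrative only.

import torch
import torch.nn as nn

# Stand-ins for a pre-trained backbone plus a newly added classification head
backbone = nn.Linear(128, 64)
head = nn.Linear(64, 10)

optimizer = torch.optim.Adam([
    {"params": backbone.parameters(), "lr": 1e-5},  # small updates for pre-trained weights
    {"params": head.parameters(), "lr": 1e-3},      # larger updates for the new head
])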
Optimization Algorithms in PyTorch
In PyTorch, training a neural network typically involves using an optimization algorithm to update the model's parameters (weights and
biases) based on the gradients of a loss function with respect to those parameters. These algorithms aim to minimize the loss function,
thereby improving the model's performance on the training data. PyTorch provides a rich set of optimization algorithms within its
torch.optim module.
The choice of optimization algorithm can significantly impact the training process and the final performance of your model. There's no
single "best" optimizer for all tasks. Here are some general guidelines:
● AdamW is often a good starting point for many modern deep learning tasks and architectures.
● Adam is also a very popular and generally effective choice.
● SGD with momentum can work well, especially with careful tuning of the learning rate and other hyperparameters. It might
generalize better in some cases but often takes longer to converge.
● RMSprop is another good alternative to Adam.
● Adagrad and Adadelta were more popular in the past but are less commonly used now compared to Adam and its variants.
● LBFGS is often used for problems where you can afford full-batch training and need faster convergence in terms of the number of iterations (e.g., certain types of optimization problems in computer vision or physics).
1. Stochastic Gradient Descent (SGD):
● Concept: The most basic and fundamental optimization algorithm. It updates the model's parameters in the direction of the
negative gradient of the loss function computed on a single random sample (or a small batch) of the training data.
● Pros: Simple to understand and implement.
● Cons: Can be slow to converge, especially with noisy gradients. May get stuck in local minima. The learning rate is crucial and
often needs careful tuning.
● PyTorch Implementation: torch.optim.SGD(params, lr=0.01, momentum=0, dampening=0, weight_decay=0, nesterov=False)
○ lr: Learning rate.
○ momentum: Helps accelerate SGD in the relevant direction and dampens oscillations.
○ weight_decay: L2 regularization to prevent overfitting.
○ nesterov: Enables Nesterov momentum, which often leads to faster convergence.
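A brief usage sketch of this constructor with illustrative hyperparameter values:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,            # learning rate
    momentum=0.9,       # accelerates SGD and dampens oscillations
    weight_decay=1e-4,  # L2 regularization
    nesterov=True,      # Nesterov momentum
)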
Optimization Algorithms in PyTorch
2. Adam (Adaptive Moment Estimation):
● Concept: An adaptive learning rate optimization algorithm that combines the benefits of both AdaGrad and RMSprop. It maintains
per-parameter learning rates that are adapted based on estimates of the first and second moments of the gradients.
● Pros: Generally converges faster than SGD and requires less hyperparameter tuning. Effective for a wide range of problems.
● Cons: Can sometimes generalize worse than SGD in certain scenarios.
● PyTorch Implementation: torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
○ lr: Learning rate.
○ betas: Coefficients used for computing running averages of the gradient and its square.
○ eps: Term added to improve numerical stability.
○ amsgrad: Whether to use the AMSGrad variant of this algorithm.
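A short sketch of Adam inside a single training step; the model, data, and hyperparameter values are placeholders.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), weight_decay=1e-5)

inputs, targets = torch.randn(32, 10), torch.randn(32, 1)  # one toy batch
optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()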
Learning Rate Scheduling in PyTorch
Learning rate scheduling in PyTorch is a technique to adjust the learning rate of your optimizer during training. Instead of using a constant learning rate throughout the entire training process, you can dynamically change it based on the number of epochs, the performance on a validation set, or other criteria.
Why Use Learning Rate Scheduling?
● Improved Convergence: Starting with a higher learning rate can help the model quickly move towards a good region in the weight space. Then, reducing the learning rate allows for finer adjustments and helps the model converge to a better minimum.
● Avoiding Local Minima: A fluctuating or decreasing learning rate can help the model escape shallow local minima.
● Better Generalization: Carefully scheduled learning rates can sometimes lead to models that generalize better to unseen data.
● Faster Training: By starting with a higher learning rate, you might reach a reasonable performance level faster.
PyTorch torch.optim.lr_scheduler Module
PyTorch provides a dedicated module, torch.optim.lr_scheduler, which implements several common learning rate scheduling strategies.
from torch.optim import lr_scheduler
Common Learning Rate Schedulers in PyTorch
Here are some of the most commonly used learning rate schedulers in PyTorch:
1. StepLR: Reduces the learning rate by a fixed factor at specified epochs.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
# step_size: Number of epochs after which learning rate will be reduced.
# gamma: Multiplicative factor of learning rate decay.
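Other schedulers in torch.optim.lr_scheduler follow the same pattern; a brief sketch with illustrative values (the placeholder model and optimizer exist only so the snippet runs):

import torch
from torch.optim import lr_scheduler

model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# ExponentialLR: multiplies the learning rate by gamma every epoch
exp_sched = lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

# CosineAnnealingLR: anneals the learning rate along a cosine curve over T_max epochs
cos_sched = lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

# ReduceLROnPlateau: reduces the learning rate when a monitored metric stops improving
plateau_sched = lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10)

In the training loop, scheduler.step() is typically called once per epoch after the optimizer updates; ReduceLROnPlateau expects the monitored value, e.g. plateau_sched.step(val_loss).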
Data Handling in PyTorch
PyTorch provides powerful and flexible tools for managing and processing your data during training. The core components involved
are:
● torch.utils.data.Dataset: An abstract class representing a dataset. You need to implement custom dataset classes that define
how to access your data and labels.
● torch.utils.data.DataLoader: An iterator that provides batches of data from a Dataset. It handles shuffling, batching, and
parallel data loading.
● torchvision.transforms (for image data): A module containing common image transformations that can be used for data
augmentation.
1. Batching Strategies in PyTorch:
PyTorch makes batching straightforward using the DataLoader.
● torch.utils.data.DataLoader: This class takes a Dataset object as input and provides an iterable over the data in batches.
● Key Parameters for Batching:
○ batch_size (int, optional): How many samples per batch to load (default: 1).
○ drop_last (bool, optional): If True, the last incomplete batch is dropped if its size is less than batch_size. Defaults to
False.
Data Handling in PyTorch
2. Data Augmentation in PyTorch:
For common data types like images, PyTorch provides the torchvision.transforms module. You can define a sequence of
transformations to apply to your data.
● torchvision.transforms: This module offers a wide range of image transformations.
● Common Augmentation Transforms:
○ transforms.ToTensor(): Converts a PIL Image or NumPy ndarray into a PyTorch Tensor.
○ transforms.Normalize(mean, std): Normalizes a tensor image with mean and standard deviation.
○ transforms.RandomHorizontalFlip(p=0.5): Randomly flips the image horizontally with a given probability.
○ transforms.RandomVerticalFlip(p=0.5): Randomly flips the image vertically with a given probability.
○ transforms.RandomRotation(degrees): Rotates the image by a random angle within the specified degrees.
○ transforms.RandomResizedCrop(size, scale=(0.08, 1.0), ratio=(3/4, 4/3)): Crops a random sized and aspect ratio
patch of the original image and then resizes it to the given size.
○ transforms.ColorJitter(brightness=0, contrast=0, saturation=0, hue=0): Randomly changes the brightness,
contrast, saturation, and hue of an image.
○ transforms.GaussianBlur(kernel_size, sigma=(0.1, 2.0)): Applies Gaussian blur to the image.
○ transforms.Compose(transforms): Chains multiple transforms together.
● Applying Augmentations: You typically define a transform object using transforms.Compose and then apply it within your
custom Dataset class's __getitem__ method or when creating the DataLoader (though applying within Dataset is more
common).
● Custom Augmentations: For more specific or complex augmentations, you can create your own custom transformation classes: any callable that implements the __call__ method works. A sketch of a typical transforms.Compose pipeline is shown below.
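A sketch of a typical transform pipeline; the specific transforms and parameter values are just an example (the mean/std values are the standard ImageNet statistics):

from torchvision import transforms

# Training-time augmentation pipeline for image data
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Evaluation pipeline: no random augmentation, only resizing and normalization
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])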
Data Handling in PyTorch
3. Data Shuffling in PyTorch:
Shuffling the training data in each epoch is crucial to prevent the model from learning spurious patterns based on the order of the data.
● torch.utils.data.DataLoader: The DataLoader handles shuffling.
● Key Parameter for Shuffling:
○ shuffle (bool, optional): Set to True to have the data reshuffled at every epoch (default: False).
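Putting Dataset, DataLoader, batching, and shuffling together, a minimal sketch with synthetic data (the class name, tensor shapes, and batch size are placeholders):

import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    # Minimal custom dataset over in-memory samples and labels
    def __init__(self, samples, labels, transform=None):
        self.samples = samples
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        x, y = self.samples[idx], self.labels[idx]
        if self.transform is not None:
            x = self.transform(x)  # e.g. an augmentation pipeline
        return x, y

# Synthetic data purely for illustration
samples = torch.randn(100, 3, 32, 32)
labels = torch.randint(0, 2, (100,))

train_loader = DataLoader(
    MyDataset(samples, labels),
    batch_size=16,    # samples per batch
    shuffle=True,     # reshuffle at every epoch
    drop_last=False,  # keep the final (possibly smaller) batch
)

for images, targets in train_loader:
    pass  # one training step per batch would go here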
Regularization Techniques
Dropout is a regularization technique where randomly selected neurons are "dropped out" (set to zero) during the training process.
This means their contribution to the activation of downstream neurons is temporarily removed. The probability of a neuron being
dropped out is controlled by a hyperparameter, typically denoted as 'p'.
→ Implementation in PyTorch:
○ torch.nn.Dropout(p=0.5, inplace=False): add dropout layers between the layers of your model.
○ p parameter: the probability of each neuron being dropped during training.
○ Training vs. Evaluation: dropout is active during training (model.train()) but turned off during evaluation (model.eval()).
Weight decay is a technique that adds a penalty to the loss function proportional to the square of the magnitude of the model's
weights.
→ Implementation in PyTorch: the weight_decay parameter of the optimizer (e.g., torch.optim.Adam or torch.optim.SGD).
Early stopping is a regularization technique that involves monitoring the model's performance on a validation set during training.
Training is stopped prematurely when the performance on the validation set starts to degrade (e.g., the validation loss starts
increasing or the validation accuracy starts decreasing), even if the training loss is still decreasing.
→ Implementation in PyTorch: Early stopping is typically implemented manually within your training loop.
1. Keep track of the validation loss (or another relevant metric).
2. Define a "patience" value: This is the number of epochs to wait after the validation loss has stopped improving before
stopping training.
3. Keep track of the best validation loss seen so far and the corresponding model state.
4. In each epoch:
○ Train on the training data.
○ Evaluate on the validation data and calculate the validation loss.
○ If the current val_loss is better than the best val_loss seen so far, update the best loss and save the current model state.
○ If the val_loss has not improved for 'patience' epochs, stop training.
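A minimal sketch combining dropout, weight decay, and manual early stopping; the network, the synthetic data, and the patience value are illustrative only.

import copy
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dropout inside the model; p is the probability of zeroing each activation
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 2))

# Weight decay (L2 penalty) via the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

# Toy train/validation data purely for illustration
train_loader = DataLoader(TensorDataset(torch.randn(200, 20), torch.randint(0, 2, (200,))), batch_size=32, shuffle=True)
val_X, val_y = torch.randn(50, 20), torch.randint(0, 2, (50,))

patience, best_val_loss, epochs_without_improvement = 5, float('inf'), 0
best_state = copy.deepcopy(model.state_dict())

for epoch in range(100):
    model.train()  # dropout active
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()

    model.eval()  # dropout disabled during evaluation
    with torch.no_grad():
        val_loss = criterion(model(val_X), val_y).item()

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_state = copy.deepcopy(model.state_dict())
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # early stopping

model.load_state_dict(best_state)  # restore the best model seen on validation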
Gradient Clipping
Gradient clipping is a common remedy for exploding gradients. It sets a threshold on the magnitude of the gradients: if the gradients exceed this threshold, they are scaled down to prevent overly large weight updates. PyTorch provides convenient utilities in the torch.nn.utils module to implement gradient clipping.
There are two main ways to implement gradient clipping in PyTorch:
1. Clipping by Value (torch.nn.utils.clip_grad_value_): This method directly clips the individual values of the gradients to a
specified range.
2. Clipping by Norm (torch.nn.utils.clip_grad_norm_): This is the more common and often recommended approach. It clips
the L2 norm (or another specified norm) of the gradients of all parameters together. If the total norm exceeds a threshold, all
gradients are scaled down proportionally.
Gradient clipping is particularly useful in the following scenarios:
● Recurrent Neural Networks (RNNs): RNNs, especially those with many time steps or complex architectures like LSTMs
and GRUs, are prone to exploding gradients.
● Deep Neural Networks: Very deep feedforward networks can sometimes experience this issue as well.
● Training with High Learning Rates: If you are using a relatively high learning rate, gradient clipping can help maintain
stability.
● Observing Unstable Training: If you notice your training loss fluctuating wildly or increasing, it might be a sign of exploding gradients, and gradient clipping could help.
Implementation Steps in the Training Loop: apply clipping
1. after the gradients have been computed by loss.backward(), and
2. before the optimizer's step() call updates the model's parameters.
Choosing the Clipping Threshold:
The optimal clipping threshold (clip_value for value clipping or max_norm for norm clipping) often needs to be determined through
experimentation. You can try different values and monitor the training process (e.g., loss curves, gradient magnitudes) to find a
suitable threshold that stabilizes training without hindering learning. Common values for max_norm often range between 0.1 and 10.
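A sketch of both clipping variants inside one training step; the model, data, and thresholds are placeholders.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

inputs, targets = torch.randn(32, 10), torch.randn(32, 1)  # one toy batch

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()  # 1. compute gradients

# 2. clip gradients before the optimizer step (use one of the two)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)       # clip by total L2 norm
# torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)  # or clip individual values

optimizer.step()  # 3. update parameters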
Debugging and Logging: Tools and techniques for understanding the training process.
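One common approach is to log per-epoch metrics with TensorBoard's SummaryWriter (torch.utils.tensorboard); the metric values below are placeholders.

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/experiment_1")  # view with: tensorboard --logdir runs

for epoch in range(10):
    train_loss = 1.0 / (epoch + 1)  # placeholder values; log your real metrics here
    val_loss = 1.2 / (epoch + 1)
    writer.add_scalar("Loss/train", train_loss, epoch)
    writer.add_scalar("Loss/val", val_loss, epoch)
    writer.add_scalar("LearningRate", 0.001, epoch)

writer.close()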
Practice 3 - Get started with Hugging Face
Exercise 1: Sentiment Analysis with Hugging Face
1. Install the Hugging Face transformers library.
2. Use a pre-trained sentiment analysis model from the Hugging Face Hub.
3. Tokenize a sample sentence.
4. Perform sentiment analysis on the sentence.
Exercise 2: Finetuning a Pretrained Model for Binary Text Classification
In this exercise, you will:
1. Install the necessary Hugging Face libraries (transformers, datasets, evaluate).
2. Load a simple dataset for binary text classification.
3. Load a pretrained model and its tokenizer.
4. Preprocess the dataset to be suitable for the model.
5. Define training arguments.
6. Create a Trainer object and finetune the model.
7. Evaluate the finetuned model.
References:
https://huggingface.co/docs/transformers/en/training#fine-tune-a-pretrained-model
https://huggingface.co/blog/sentiment-analysis-python
https://www.kaggle.com/code/gauravduttakiit/sentiment-analysis-using-hugging-face
https://www.kaggle.com/code/neerajmohan/fine-tuning-bert-for-text-classification
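A possible starting point for Exercise 1, using the transformers pipeline API; distilbert-base-uncased-finetuned-sst-2-english is one commonly used sentiment checkpoint and is only a suggestion here.

from transformers import AutoTokenizer, pipeline

# Pre-trained sentiment analysis model from the Hugging Face Hub
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
classifier = pipeline("sentiment-analysis", model=model_name)

sentence = "Fine-tuning pre-trained models saves a lot of training time."

# Tokenize the sample sentence
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(tokenizer(sentence))   # input_ids and attention_mask

# Perform sentiment analysis on the sentence
print(classifier(sentence))  # e.g. [{'label': 'POSITIVE', 'score': ...}]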
Q&A