Chapter 4 - Fine-Tune Models and Training Algorithms
HCM
Institute of Information Technology, Electrical and Electronics
Finetune model
1. Choose a Pre-trained Model: Select a model that has been trained on a large
dataset and whose architecture is suitable for the target task.
2. Prepare Your Dataset: Gather and preprocess the data specific to your task.
3. Modify the Model: Adapt the pre-trained model's architecture if necessary,
typically by replacing the final classification layer.
4. Freeze Layers (Optional but Common): Initially freeze the weights of the
early layers of the pre-trained model to prevent them from being drastically
changed by the new, smaller dataset.
5. Train the Model: Train the modified model on your dataset, typically using a
lower learning rate than you would for training from scratch.
6. Unfreeze and Retrain (Optional): After initial training, you might unfreeze
some of the earlier layers and continue training with an even lower learning rate
to fine-tune the entire model.
7. Evaluate Performance: Assess the performance of the finetuned model on a
validation set.
8. Hyperparameter Tuning: Adjust hyperparameters like learning rate, batch
size, and the number of frozen layers to optimize performance.
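A minimal PyTorch sketch of the steps above, assuming a torchvision ResNet-18 as the pre-trained model and a two-class target task; the number of classes, the learning rates, and the unfreezing schedule are placeholders for illustration.

import torch
import torch.nn as nn
from torchvision import models

# 1. Choose a pre-trained model (trained on ImageNet)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# 4. Freeze the pre-trained layers
for param in model.parameters():
    param.requires_grad = False

# 3. Modify the model: replace the final classification layer for the new task
num_classes = 2  # placeholder: number of classes in your dataset
model.fc = nn.Linear(model.fc.in_features, num_classes)

# 5. Train only the new head, typically with a fairly low learning rate
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

# 6. Optionally unfreeze everything later and continue with an even lower learning rate
for param in model.parameters():
    param.requires_grad = True
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)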
Finetune model
Key Considerations for Finetuning:
● Similarity of Datasets: The more similar your target task and
dataset are to the original task and dataset the pre-trained
model was trained on, the better the finetuning will likely
work.
● Size of the New Dataset: The size of your task-specific
dataset will influence how many layers you should unfreeze
and the learning rate you should use. Smaller datasets might
benefit from freezing more layers to prevent overfitting.
● Computational Resources: Finetuning can still be
computationally intensive, especially for large pre-trained
models.
● Potential for Catastrophic Forgetting: If you finetune too
aggressively on a very different task, the model might
"forget" the useful general features it learned during
pre-training.
When to Finetune:
● You have a limited amount of labeled data for your specific
task.
● A good pre-trained model exists for a related task or domain.
● You want to achieve good performance quickly without
training a large model from scratch.
Training Algorithms
Training algorithms are the methods used to teach a machine learning model to learn from data. They define how the model's internal
parameters (weights and biases in neural networks) are adjusted based on the training data to minimize a defined loss function. The loss
function measures the difference between the model's predictions and the actual target values.
Core Components of a Training Algorithm:
1. Loss Function (Objective Function): A function that quantifies the error or discrepancy between the model's predictions and the true
values in the training data. The goal of training is to minimize this function. Examples include:
○ Mean Squared Error (MSE): For regression tasks.
○ Binary Cross-Entropy: For binary classification.
○ Categorical Cross-Entropy: For multi-class classification.
2. Optimizer: An algorithm that determines how the model's parameters are updated to reduce the loss function. Common optimizers
include:
○ Gradient Descent (GD): A basic optimization algorithm that iteratively moves the parameters in the direction of the negative
gradient of the loss function.
○ Stochastic Gradient Descent (SGD): Updates parameters using the gradient calculated on a single randomly chosen training
example (or a small batch). This is more efficient for large datasets.
○ Adam (Adaptive Moment Estimation): An adaptive learning rate optimization algorithm that is widely used and often
performs well.
○ RMSprop (Root Mean Square Propagation): Another adaptive learning rate optimizer.
○ Adagrad (Adaptive Gradient Algorithm): Adapts the learning rate for each parameter based on the historical gradients.
3. Learning Rate: A hyperparameter that controls the step size at each iteration while moving towards a minimum of the loss function.
A high learning rate might lead to overshooting the minimum, while a low learning rate might result in slow convergence.
4. Batch Size: The number of training examples used in one iteration to calculate the gradient and update the model's parameters.
5. Number of Epochs: The number of times the entire training dataset is passed through the model during training.
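These components come together in a basic training loop; a minimal sketch with a toy model and synthetic data (all values are illustrative):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy data and model purely for illustration
X, y = torch.randn(256, 20), torch.randint(0, 3, (256,))
train_loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)  # batch size

model = nn.Linear(20, 3)
criterion = nn.CrossEntropyLoss()                          # loss function (multi-class)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # optimizer with learning rate
num_epochs = 10                                            # number of epochs

for epoch in range(num_epochs):
    for inputs, targets in train_loader:
        optimizer.zero_grad()                     # reset accumulated gradients
        loss = criterion(model(inputs), targets)  # forward pass and loss computation
        loss.backward()                           # backpropagation: compute gradients
        optimizer.step()                          # update parameters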
Training Algorithms
Common Training Algorithms (Optimization Algorithms):
● Gradient Descent and its Variants (SGD, Mini-batch GD): These form the foundation of many training algorithms, especially
for neural networks. They iteratively adjust model parameters based on the gradient of the loss function.
● Backpropagation: An algorithm used to efficiently calculate the gradients of the loss function with respect to the weights in a
neural network. It's a crucial part of training deep learning models.
● Evolutionary Algorithms (e.g., Genetic Algorithms): While less common for training deep learning models directly, they can be
used for tasks like hyperparameter optimization or neural architecture search.
Advanced Training Techniques:
● Learning Rate Scheduling: Adjusting the learning rate during training (e.g., decreasing it over time) can help the model
converge better.
● Regularization (L1, L2, Dropout): Techniques to prevent overfitting by adding a penalty to the loss function or randomly
dropping out neurons during training.
● Batch Normalization: A technique to stabilize and accelerate training by normalizing the activations of intermediate layers.
● Early Stopping: Monitoring the performance on a validation set and stopping training when the performance starts to degrade to
prevent overfitting.
● Data Augmentation: Creating artificial variations of the training data to increase its size and improve the model's generalization
ability.
Relationship Between Finetuning and Training Algorithms:
Finetuning is a specific application of the general training process. The key differences in finetuning often lie in:
● Initialization: The model's weights are initialized with the values learned during pre-training, rather than random initialization.
● Layer Freezing/Unfreezing: You strategically choose which layers to update during training.
● Learning Rate Adjustment: You often use different learning rates for different parts of the model.
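For example, PyTorch optimizers accept parameter groups, which is one way to give the pre-trained layers a smaller learning rate than the new head; the tiny stand-in modules and the learning rates below are illustrative only.

import torch
import torch.nn as nn

# Stand-ins for a pre-trained backbone plus a newly added classification head
backbone = nn.Linear(128, 64)
head = nn.Linear(64, 10)

optimizer = torch.optim.Adam([
    {"params": backbone.parameters(), "lr": 1e-5},  # small updates for pre-trained weights
    {"params": head.parameters(), "lr": 1e-3},      # larger updates for the new head
])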
Optimization Algorithms in PyTorch
In PyTorch, training a neural network typically involves using an optimization algorithm to update the model's parameters (weights and
biases) based on the gradients of a loss function with respect to those parameters. These algorithms aim to minimize the loss function,
thereby improving the model's performance on the training data. PyTorch provides a rich set of optimization algorithms within its
torch.optim module.
The choice of optimization algorithm can significantly impact the training process and the final performance of your model. There's no
single "best" optimizer for all tasks. Here are some general guidelines:
● AdamW is often a good starting point for many modern deep learning tasks and architectures.
● Adam is also a very popular and generally effective choice.
● SGD with momentum can work well, especially with careful tuning of the learning rate and other hyperparameters. It might
generalize better in some cases but often takes longer to converge.
● RMSprop is another good alternative to Adam.
● Adagrad and Adadelta were more popular in the past but are less commonly used now compared to Adam and its variants.
● LBFGS is often used for problems where you can afford full-batch training and need faster convergence in terms of the number of iterations (e.g., certain types of optimization problems in computer vision or physics).
1. Stochastic Gradient Descent (SGD):
● Concept: The most basic and fundamental optimization algorithm. It updates the model's parameters in the direction of the
negative gradient of the loss function computed on a single random sample (or a small batch) of the training data.
● Pros: Simple to understand and implement.
● Cons: Can be slow to converge, especially with noisy gradients. May get stuck in local minima. The learning rate is crucial and
often needs careful tuning.
● PyTorch Implementation: torch.optim.SGD(params, lr=0.01, momentum=0, dampening=0, weight_decay=0, nesterov=False)
○ lr: Learning rate.
○ momentum: Helps accelerate SGD in the relevant direction and dampens oscillations.
○ weight_decay: L2 regularization to prevent overfitting.
○ nesterov: Enables Nesterov momentum, which often leads to faster convergence.
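A brief usage sketch of this constructor with illustrative hyperparameter values:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,            # learning rate
    momentum=0.9,       # accelerates SGD and dampens oscillations
    weight_decay=1e-4,  # L2 regularization
    nesterov=True,      # Nesterov momentum
)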
Optimization Algorithms in PyTorch
2. Adam (Adaptive Moment Estimation):
● Concept: An adaptive learning rate optimization algorithm that combines the benefits of both AdaGrad and RMSprop. It maintains
per-parameter learning rates that are adapted based on estimates of the first and second moments of the gradients.
● Pros: Generally converges faster than SGD and requires less hyperparameter tuning. Effective for a wide range of problems.
● Cons: Can sometimes generalize worse than SGD in certain scenarios.
● PyTorch Implementation: torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
○ lr: Learning rate.
○ betas: Coefficients used for computing running averages of the gradient and its square.
○ eps: Term added to improve numerical stability.
○ amsgrad: Whether to use the AMSGrad variant of this algorithm.
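A short sketch of Adam inside a single training step; the model, data, and hyperparameter values are placeholders.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), weight_decay=1e-5)

inputs, targets = torch.randn(32, 10), torch.randn(32, 1)  # one toy batch
optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()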
Learning Rate Scheduling in PyTorch
Learning rate scheduling in PyTorch is a technique to adjust the learning rate of your optimizer during training. Instead of using a constant learning rate throughout the entire training process, you can dynamically change it based on the number of epochs, the performance on a validation set, or other criteria.
Why Use Learning Rate Scheduling?
● Improved Convergence: Starting with a higher learning rate can help the model quickly move towards a good region in the weight space. Then, reducing the learning rate allows for finer adjustments and helps the model converge to a better minimum.
● Avoiding Local Minima: A fluctuating or decreasing learning rate can help the model escape shallow local minima.
● Better Generalization: Carefully scheduled learning rates can sometimes lead to models that generalize better to unseen data.
● Faster Training: By starting with a higher learning rate, you might reach a reasonable performance level faster.
PyTorch torch.optim.lr_scheduler Module
PyTorch provides a dedicated module, torch.optim.lr_scheduler, which implements several common learning rate scheduling strategies.
from torch.optim import lr_scheduler
Common Learning Rate Schedulers in PyTorch
Here are some of the most commonly used learning rate schedulers in PyTorch:
1. StepLR: Reduces the learning rate by a fixed factor at specified epochs.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
# step_size: Number of epochs after which learning rate will be reduced.
# gamma: Multiplicative factor of learning rate decay.
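Other schedulers in torch.optim.lr_scheduler follow the same pattern; a brief sketch with illustrative values (the placeholder model and optimizer exist only so the snippet runs):

import torch
from torch.optim import lr_scheduler

model = torch.nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# ExponentialLR: multiplies the learning rate by gamma every epoch
exp_sched = lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

# CosineAnnealingLR: anneals the learning rate along a cosine curve over T_max epochs
cos_sched = lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

# ReduceLROnPlateau: reduces the learning rate when a monitored metric stops improving
plateau_sched = lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10)

In the training loop, scheduler.step() is typically called once per epoch after the optimizer updates; ReduceLROnPlateau expects the monitored value, e.g. plateau_sched.step(val_loss).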
Data Handling in PyTorch
PyTorch provides powerful and flexible tools for managing and processing your data during training. The core components involved
are:
● torch.utils.data.Dataset: An abstract class representing a dataset. You need to implement custom dataset classes that define
how to access your data and labels.
● torch.utils.data.DataLoader: An iterator that provides batches of data from a Dataset. It handles shuffling, batching, and
parallel data loading.
● torchvision.transforms (for image data): A module containing common image transformations that can be used for data
augmentation.
1. Batching Strategies in PyTorch:
PyTorch makes batching straightforward using the DataLoader.
● torch.utils.data.DataLoader: This class takes a Dataset object as input and provides an iterable over the data in batches.
● Key Parameters for Batching:
○ batch_size (int, optional): How many samples per batch to load (default: 1).
○ drop_last (bool, optional): If True, the last incomplete batch is dropped if its size is less than batch_size. Defaults to
False.
Data Handling in PyTorch
2. Data Augmentation in PyTorch:
For common data types like images, PyTorch provides the torchvision.transforms module. You can define a sequence of
transformations to apply to your data.
● torchvision.transforms: This module offers a wide range of image transformations.
● Common Augmentation Transforms:
○ transforms.ToTensor(): Converts a PIL Image or NumPy ndarray into a PyTorch Tensor.
○ transforms.Normalize(mean, std): Normalizes a tensor image with mean and standard deviation.
○ transforms.RandomHorizontalFlip(p=0.5): Randomly flips the image horizontally with a given probability.
○ transforms.RandomVerticalFlip(p=0.5): Randomly flips the image vertically with a given probability.
○ transforms.RandomRotation(degrees): Rotates the image by a random angle within the specified degrees.
○ transforms.RandomResizedCrop(size, scale=(0.08, 1.0), ratio=(3/4, 4/3)): Crops a random sized and aspect ratio
patch of the original image and then resizes it to the given size.
○ transforms.ColorJitter(brightness=0, contrast=0, saturation=0, hue=0): Randomly changes the brightness,
contrast, saturation, and hue of an image.
○ transforms.GaussianBlur(kernel_size, sigma=(0.1, 2.0)): Applies Gaussian blur to the image.
○ transforms.Compose(transforms): Chains multiple transforms together.
● Applying Augmentations: You typically define a transform object using transforms.Compose and then apply it within your
custom Dataset class's __getitem__ method or when creating the DataLoader (though applying within Dataset is more
common).
● Custom Augmentations: For more specific or complex augmentations, you can create your own custom transformation classes: any callable that implements the __call__ method works. A sketch of a typical transforms.Compose pipeline is shown below.
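A sketch of a typical transform pipeline; the specific transforms and parameter values are just an example (the mean/std values are the standard ImageNet statistics):

from torchvision import transforms

# Training-time augmentation pipeline for image data
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Evaluation pipeline: no random augmentation, only resizing and normalization
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])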
Data Handling in PyTorch
3. Data Shuffling in PyTorch:
Shuffling the training data in each epoch is crucial to prevent the model from learning spurious patterns based on the order of the data.
● torch.utils.data.DataLoader: The DataLoader handles shuffling.
● Key Parameter for Shuffling:
○ shuffle (bool, optional): Set to True to have the data reshuffled at every epoch (default: False).
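Putting Dataset, DataLoader, batching, and shuffling together, a minimal sketch with synthetic data (the class name, tensor shapes, and batch size are placeholders):

import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    # Minimal custom dataset over in-memory samples and labels
    def __init__(self, samples, labels, transform=None):
        self.samples = samples
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        x, y = self.samples[idx], self.labels[idx]
        if self.transform is not None:
            x = self.transform(x)  # e.g. an augmentation pipeline
        return x, y

# Synthetic data purely for illustration
samples = torch.randn(100, 3, 32, 32)
labels = torch.randint(0, 2, (100,))

train_loader = DataLoader(
    MyDataset(samples, labels),
    batch_size=16,    # samples per batch
    shuffle=True,     # reshuffle at every epoch
    drop_last=False,  # keep the final (possibly smaller) batch
)

for images, targets in train_loader:
    pass  # one training step per batch would go here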
Regularization Techniques
Dropout is a regularization technique where randomly selected neurons are "dropped out" (set to zero) during the training process.
This means their contribution to the activation of downstream neurons is temporarily removed. The probability of a neuron being
dropped out is controlled by a hyperparameter, typically denoted as 'p'.
→ Implementation in PyTorch:
○ torch.nn.Dropout(p=0.5, inplace=False): add dropout layers between the layers of your model.
○ p parameter: the probability of each neuron being dropped during training.
○ Training vs. Evaluation: dropout is active during training (model.train()) but turned off during evaluation (model.eval()).
Weight decay is a technique that adds a penalty to the loss function proportional to the square of the magnitude of the model's
weights.
→ Implementation in PyTorch: the weight_decay parameter of the optimizer (e.g., torch.optim.Adam or torch.optim.SGD).
Early stopping is a regularization technique that involves monitoring the model's performance on a validation set during training.
Training is stopped prematurely when the performance on the validation set starts to degrade (e.g., the validation loss starts
increasing or the validation accuracy starts decreasing), even if the training loss is still decreasing.
→ Implementation in PyTorch: Early stopping is typically implemented manually within your training loop.
1. Keep track of the validation loss (or another relevant metric).
2. Define a "patience" value: This is the number of epochs to wait after the validation loss has stopped improving before
stopping training.
3. Keep track of the best validation loss seen so far and the corresponding model state.
4. In each epoch:
○ Train on the training data.
○ Evaluate on the validation data and calculate the validation loss.
○ If the current val_loss is better than the best val_loss seen so far, update the best loss and save the current model state.
○ If the val_loss has not improved for 'patience' epochs, stop training.
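A minimal sketch combining dropout, weight decay, and manual early stopping; the network, the synthetic data, and the patience value are illustrative only.

import copy
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dropout inside the model; p is the probability of zeroing each activation
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 2))

# Weight decay (L2 penalty) via the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

# Toy train/validation data purely for illustration
train_loader = DataLoader(TensorDataset(torch.randn(200, 20), torch.randint(0, 2, (200,))), batch_size=32, shuffle=True)
val_X, val_y = torch.randn(50, 20), torch.randint(0, 2, (50,))

patience, best_val_loss, epochs_without_improvement = 5, float('inf'), 0
best_state = copy.deepcopy(model.state_dict())

for epoch in range(100):
    model.train()  # dropout active
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()

    model.eval()  # dropout disabled during evaluation
    with torch.no_grad():
        val_loss = criterion(model(val_X), val_y).item()

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_state = copy.deepcopy(model.state_dict())
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # early stopping

model.load_state_dict(best_state)  # restore the best model seen on validation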
Gradient Clipping
Gradient clipping is a common remedy for exploding gradients. It sets a threshold on the magnitude of the gradients: if the gradients exceed this threshold, they are scaled down to prevent overly large weight updates. PyTorch provides convenient utilities in the torch.nn.utils module to implement gradient clipping.
There are two main ways to implement gradient clipping in PyTorch:
1. Clipping by Value (torch.nn.utils.clip_grad_value_): This method directly clips the individual values of the gradients to a
specified range.
2. Clipping by Norm (torch.nn.utils.clip_grad_norm_): This is the more common and often recommended approach. It clips
the L2 norm (or another specified norm) of the gradients of all parameters together. If the total norm exceeds a threshold, all
gradients are scaled down proportionally.
Gradient clipping is particularly useful in the following scenarios:
● Recurrent Neural Networks (RNNs): RNNs, especially those with many time steps or complex architectures like LSTMs
and GRUs, are prone to exploding gradients.
● Deep Neural Networks: Very deep feedforward networks can sometimes experience this issue as well.
● Training with High Learning Rates: If you are using a relatively high learning rate, gradient clipping can help maintain
stability.
● Observing Unstable Training: If you notice your training loss fluctuating wildly or increasing, it might be a sign of exploding gradients, and gradient clipping could help.
Implementation Steps in the Training Loop: apply clipping
1. after the gradients have been computed by loss.backward(), and
2. before the optimizer's step() call updates the model's parameters.
Choosing the Clipping Threshold:
The optimal clipping threshold (clip_value for value clipping or max_norm for norm clipping) often needs to be determined through
experimentation. You can try different values and monitor the training process (e.g., loss curves, gradient magnitudes) to find a
suitable threshold that stabilizes training without hindering learning. Common values for max_norm often range between 0.1 and 10.
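A sketch of both clipping variants inside one training step; the model, data, and thresholds are placeholders.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

inputs, targets = torch.randn(32, 10), torch.randn(32, 1)  # one toy batch

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()  # 1. compute gradients

# 2. clip gradients before the optimizer step (use one of the two)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)       # clip by total L2 norm
# torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)  # or clip individual values

optimizer.step()  # 3. update parameters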
Debugging and Logging: Tools and techniques for understanding the training process.
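One common approach is to log per-epoch metrics with TensorBoard's SummaryWriter (torch.utils.tensorboard); the metric values below are placeholders.

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/experiment_1")  # view with: tensorboard --logdir runs

for epoch in range(10):
    train_loss = 1.0 / (epoch + 1)  # placeholder values; log your real metrics here
    val_loss = 1.2 / (epoch + 1)
    writer.add_scalar("Loss/train", train_loss, epoch)
    writer.add_scalar("Loss/val", val_loss, epoch)
    writer.add_scalar("LearningRate", 0.001, epoch)

writer.close()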
Practice 3 - Get started with Hugging Face
Exercise 1: Sentiment Analysis with Hugging Face
1. Install the Hugging Face transformers library.
2. Use a pre-trained sentiment analysis model from the Hugging Face Hub.
3. Tokenize a sample sentence.
4. Perform sentiment analysis on the sentence.
Exercise 2: Finetuning a Pretrained Model for Binary Text Classification
In this exercise, you will:
1. Install the necessary Hugging Face libraries (transformers, datasets, evaluate).
2. Load a simple dataset for binary text classification.
3. Load a pretrained model and its tokenizer.
4. Preprocess the dataset to be suitable for the model.
5. Define training arguments.
6. Create a Trainer object and finetune the model.
7. Evaluate the finetuned model.
References:
https://huggingface.co/docs/transformers/en/training#fine-tune-a-pretrained-model
https://huggingface.co/blog/sentiment-analysis-python
https://www.kaggle.com/code/gauravduttakiit/sentiment-analysis-using-hugging-face
https://www.kaggle.com/code/neerajmohan/fine-tuning-bert-for-text-classification
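A possible starting point for Exercise 1, using the transformers pipeline API; distilbert-base-uncased-finetuned-sst-2-english is one commonly used sentiment checkpoint and is only a suggestion here.

from transformers import AutoTokenizer, pipeline

# Pre-trained sentiment analysis model from the Hugging Face Hub
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
classifier = pipeline("sentiment-analysis", model=model_name)

sentence = "Fine-tuning pre-trained models saves a lot of training time."

# Tokenize the sample sentence
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(tokenizer(sentence))   # input_ids and attention_mask

# Perform sentiment analysis on the sentence
print(classifier(sentence))  # e.g. [{'label': 'POSITIVE', 'score': ...}]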
Q&A