Fix fine-tuning training loss accumulation #725
Merged
What does this PR do?
Problem:
In /src/llama_recipes/utils/train_utils.py the training loss is correctly divided by the number of gradient accumulation steps to scale down the gradient:

```python
loss = loss / gradient_accumulation_steps
```

The training loss is then accumulated:

```python
total_loss += loss.detach().float()
```

and used to calculate the average loss across all samples in the epoch:

```python
train_epoch_loss = total_loss / len(train_dataloader)
```

Because the accumulated loss has already been scaled down by gradient_accumulation_steps, and len(train_dataloader) counts every step (including the gradient accumulation ones), train_epoch_loss ends up gradient_accumulation_steps times lower than it should be.
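To make the effect concrete, here is a minimal sketch of the bug (not the actual train_utils.py loop; the values and names are illustrative):

```python
# Illustration of the reported-loss bug with made-up values.
gradient_accumulation_steps = 4
per_step_losses = [2.0] * 8          # true mean loss over the epoch is 2.0
total_loss = 0.0

for loss in per_step_losses:
    loss = loss / gradient_accumulation_steps   # scaled for gradient accumulation
    total_loss += loss                           # accumulated *after* scaling (bug)

train_epoch_loss = total_loss / len(per_step_losses)
print(train_epoch_loss)  # 0.5 instead of 2.0 -> low by a factor of gradient_accumulation_steps
```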
Solution:
Accumulate the loss

```python
total_loss += loss.detach().float()
```

before scaling it down:

```python
loss = loss / gradient_accumulation_steps
```
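A minimal sketch of the corrected ordering (assuming the same variable names as train_utils.py; the surrounding FSDP/autocast/profiling logic is omitted):

```python
for step, batch in enumerate(train_dataloader):
    loss = model(**batch).loss

    # Accumulate the unscaled loss so the epoch average is not
    # divided by gradient_accumulation_steps a second time.
    total_loss += loss.detach().float()

    # Scale the loss only for the backward pass / gradient accumulation.
    loss = loss / gradient_accumulation_steps
    loss.backward()

    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

train_epoch_loss = total_loss / len(train_dataloader)
```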
Thanks for contributing 🎉!