
Finetuning Large Language Models for Vulnerability Detection

Alexey Shestov, Rodion Levichev, Ravil Mussabayev (corresponding author), Evgeny Maslov, Anton Cheshkov, Pavel Zadorozhny
Huawei Russian Research Institute, Moscow, Russia
shestov.alexey1@huawei-partners.com, rodion.levichev@huawei-partners.com, ravil.mussabayev@huawei-partners.com, evgeny.maslov@huawei.com, anton@huawei.com, pavel.zadorozhny@huawei-partners.com

arXiv:2401.17010v4 [cs.CR] 1 Mar 2024

Abstract

This paper presents the results of finetuning large language models (LLMs) for the task of detecting vulnerabilities in source code. We leverage WizardCoder, a recent improvement of the state-of-the-art LLM StarCoder, and adapt it for vulnerability detection through further finetuning. To accelerate training, we modify WizardCoder's training procedure, and we investigate optimal training regimes. For the imbalanced dataset with many more negative examples than positive ones, we also explore different techniques to improve classification performance. The finetuned WizardCoder model achieves improvements in ROC AUC and F1 measures on balanced and imbalanced vulnerability datasets over a CodeBERT-like model, demonstrating the effectiveness of adapting pretrained LLMs for vulnerability detection in source code. The key contributions are finetuning the state-of-the-art code LLM WizardCoder, increasing its training speed without harming performance, optimizing the training procedure and regimes, handling class imbalance, and improving performance on difficult vulnerability detection datasets. This demonstrates the potential for transfer learning by finetuning large pretrained language models for specialized source code analysis tasks.

Keywords: Large language models, vulnerability detection, finetuning, StarCoder, WizardCoder, PEFT, LoRA, defect detection, Java dataset, deep learning, software vulnerabilities, vulnerability prediction, vulnerability classification, cybersecurity, AI

1 Introduction

Detecting vulnerabilities in source code is an important problem in software engineering. This project explores finetuning of LLMs for binary classification of Java functions as vulnerable or not vulnerable. The goals are the following:

• Investigate the possibility of applying the new class of big LLM models to the task of vulnerability detection;
• Improve over previous results of CodeBERT-style models by using big LLM models;
• Investigate whether the encountered performance limit is due to the limited capacity of CodeBERT-like models.

1.1 Motivation

Transformer models pretrained on large code corpora, like CodeBERT [5] and ContraBERT [10], have shown promising results on many code understanding tasks. LineVul [6], a model built on CodeBERT, achieved state-of-the-art results on the vulnerability detection task, surpassing graph network approaches like ReVeal [2]. Recent LLMs with billions of parameters, such as StarCoder [8] and WizardCoder [11], are pretrained on trillions of tokens. They are two orders of magnitude bigger than CodeBERT (which has approximately 125 million parameters) and use orders of magnitude more data for pretraining.

They achieve better results on language and code understanding tasks, providing an opportunity for improved transfer learning. The vulnerability detection task can benefit from the increased expressiveness and knowledge contained in state-of-the-art LLMs.

Vulnerability detection is a complex task with many aspects influencing and potentially limiting the solution quality. One of these factors could be limits on the capacity and code understanding ability of CodeBERT-like models, as vulnerability detection requires a deep understanding of many code aspects. Checking this hypothesis is a crucial problem in the field of vulnerability detection.

Furthermore, our previous study [3] revealed that simple prompting techniques for the ChatGPT LLM were unable to achieve noticeably better results than random prediction on the vulnerability detection task. This finding suggests that finetuning of LLMs on this task naturally becomes a promising research direction.

The real-world distribution of vulnerable functions is highly class imbalanced, containing far more negative samples than positive ones. This imbalance makes the task inherently more difficult, as standard training objectives, like cross-entropy loss, may lead the model to focus on the dominant negative class. Specialized training techniques, such as focal loss [9] and sample weighting, are required to properly emphasize the minority positive examples during finetuning.

This project conducts experiments to find the optimal model architecture, training hyperparameters, and loss formulations for effective LLM finetuning on the vulnerability detection task (both for balanced and imbalanced datasets). The aim is to leverage recent advances in pretrained models to improve over prior benchmark results.

The remainder of the paper is organized as follows:

• We describe our approach for building and adapting the model. Specifically, we describe how we solved the following problems:
  – Choosing the model;
  – Using limited GPU resources for the model size;
  – Adapting a decoder model for the classification task;
  – Underutilization of the input sequence length, resulting in a slow training speed.
• We describe our main instruments: datasets, models, metrics, as well as measurement procedures;
• We describe the conducted experiments:
  – Comparison with ContraBERT (a state-of-the-art non-LLM transformer model) on balanced and imbalanced datasets;
  – Ablation study of the context size and classification loss;
  – Investigation of the batch packing technique;
  – Improving the usage of a more informative part of the imbalanced dataset.
• We conclude the study, highlighting our contributions and stating promising future research directions.

2 Implementation details

2.1 Challenges

We briefly describe the main milestones and challenges that we encountered during the implementation of our approach.

LLM model selection for finetuning. Recently, there has been a rise in the number of open-sourced LLM models trained on large code corpora [8, 11, 13, 16]. The task is to choose an optimal model for finetuning, both from the perspective of quality and of potential technical difficulties. The goal is to select a model possessing the following properties:

• It shows a strong potential for effective transfer learning to the vulnerability detection task;
• It is technically easy to use and integrate into our framework.

Adapting LLMs for limited computational resources. Given our computing capabilities, finetuning complete LLM models is infeasible and would be inefficient even with increased resources. To enable effective finetuning of a 13-billion-parameter model on our hardware, we need methods to reduce the number of trained parameters and the memory requirements.

Adapting LLMs for the classification task. The standard pretraining task for LLM decoders is next token prediction, which differs from our end task of vulnerability classification. Including the next token prediction component in the loss may misguide weight updates away from the optimal state for classification. To fully leverage the capabilities of the LLM weights for our target task, a better strategy might be to leave only the classification term in the loss.

Improving a slow training speed. The standard sequence classification approach of padding every example to a fixed length makes training extremely slow. To mitigate the issue of short sequence lengths, the task is to pack multiple functions into each training sequence. This decreases the number of unused padding tokens, improving computational efficiency.

2.2 Choosing an LLM to finetune

Recently, there has been a proliferation of open-sourced LLM models trained on large code corpora [8, 11, 13, 16]. The task is to select an optimal model for finetuning.

First, we attempted to use the CodeGeeX model [16]. Unfortunately, we found that using this model was very difficult from the technical perspective, as the code provided by the authors does not support standard libraries for transformer training and large model adaptation. We were able to run the model on 8 V100 GPUs for inference, but this memory capacity was insufficient for finetuning. The lack of support for common LLM adaptation libraries made it very challenging to overcome the memory limitations, so we decided to discontinue attempts with this model.

Next, we explored two other models, WizardCoder [11] and CodeGen [13]. In contrast to CodeGeeX, these models include support for major transformer training frameworks like DeepSpeed, HuggingFace Transformers, PEFT [12], etc. This compatibility with standard libraries significantly eased the application of low-memory adaptation techniques.

When conducting a simple question answering training run, we found that WizardCoder is much better than CodeGen at directly answering "YES" and "NO". CodeGen sometimes responded with "TRUE", "FALSE", or erratic "YES" and "NO" answers with additional characters, and has worse sequence stopping behavior. Given WizardCoder's stronger capabilities on this task, we focused our subsequent experiments on this model.

Choosing the best StarCoder-family model. In our study, we evaluate three models from the StarCoder family for finetuning: StarCoderBase, StarCoder [8] (which is StarCoderBase finetuned on a Python dataset), and WizardCoder [11] (which is an improved version of StarCoder trained using the Evol-Instruct technique). We compared the performance of StarCoder and StarCoderBase on the question answering task and found no significant differences between the two models, despite the fact that StarCoder was specifically adapted for Python. Subsequently, we primarily used the WizardCoder model, as it is claimed to be superior to StarCoder. However, our experiments did not reveal any significant differences in performance between these two models. In the remainder of the paper, we use the term "StarCoder" to refer specifically to the WizardCoder variant of the StarCoder family. Generally, more elaborate experiments are needed to find out whether there are any differences in the performance of these models.

2.3 Adapting LLMs for limited computational resources

Finetuning complete multi-billion-parameter LLMs is infeasible given our hardware constraints, and would be inefficient even if more resources were obtained. To enable effective finetuning, we require techniques to reduce the number of trained parameters and the memory needs. The HuggingFace PEFT [12] library implements several methods tackling this, including the LoRA method [7]. LoRA enables full-model finetuning by decomposing the weight updates into low-rank approximations, drastically decreasing the number of trained parameters and the memory footprint.

We successfully applied LoRA to the WizardCoder model, using the following LoRA settings: r = 8, alpha = 32, dropout = 0.05. This reduced the 13-billion-parameter model to just 25 million trainable parameters, fewer than CodeBERT has. The results validate LoRA's effectiveness for LLM adaptation under memory limitations.

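As an illustration, the sketch below shows how a LoRA adapter with the settings above could be attached to a causal code LLM using the HuggingFace PEFT library. The checkpoint name and the target_modules choice are assumptions for illustration, not necessarily the exact configuration used in this study.

    # Illustrative sketch: attaching a LoRA adapter (r=8, alpha=32, dropout=0.05)
    # to a code LLM with HuggingFace PEFT. The checkpoint name and target_modules
    # are assumptions, not necessarily the configuration used in this study.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, TaskType, get_peft_model

    base_model_name = "WizardLM/WizardCoder-15B-V1.0"  # assumed checkpoint

    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    model = AutoModelForCausalLM.from_pretrained(base_model_name)

    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=8,                         # rank of the low-rank update matrices
        lora_alpha=32,               # scaling factor for the LoRA update
        lora_dropout=0.05,           # dropout applied inside the LoRA layers
        target_modules=["c_attn"],   # assumed attention projection name for StarCoder-style blocks
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # reports the small fraction of trainable parameters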
2.4 Adapting LLMs for classification

Typically, code LLMs are trained on a large amount of unlabeled code data in a special way. The principle is to iteratively take code tokens as input, predict the next token, and compare it with the ground truth. This principle drives the next token prediction loss. Specifically, for any input sequence {x_1, x_2, ..., x_N} of length N, the output of an LLM is a probability distribution over the next token, P(x_{N+1} | x_1, x_2, ..., x_N, Θ) = p_{N+1} ∈ [0, 1]^v, where Θ represents all parameters of the model and v is the vocabulary size. By comparing it with the real distribution, i.e., a one-hot vector y_{N+1} ∈ {0, 1}^v of the ground-truth token, we can optimize the cumulative cross-entropy loss:

    \mathcal{L}_{\mathrm{ntp}} = -\sum_{n=1}^{N-1} y_{n+1} \log P(x_{n+1} \mid x_1, x_2, \ldots, x_n, \Theta)    (1)

In the context of adapting a language model for classification tasks, the objective function used during training needs to be aligned with the classification goal rather than with next token prediction. For an input sequence {x_1, x_2, ..., x_N} of length N, the traditional generative pretraining objective (next token prediction loss) would compute the loss across all predicted tokens in the sequence. However, this is suboptimal for a classification task such as vulnerability classification, which is concerned with categorizing the entire input sequence rather than predicting subsequent tokens.

To address this, we propose a classification loss that leverages only the predicted probability at the final token x_N, which is matched against the ground truth label y using cross-entropy. This loss is expressed mathematically as:

    \mathcal{L}_{\mathrm{class}} = -\log P(y \mid x_1, x_2, \ldots, x_N, \Theta)    (2)

Here, y represents the correct class label for the sequence, and P(y | x_1, x_2, ..., x_N, Θ) is the probability that the model, parameterized by Θ, assigns to the correct class. This formulation ensures that weight updates during training are driven exclusively by the classification objective, without any influence from the generative task of next token prediction. The classification loss can be viewed as the last term of the generative pretraining objective's (next token prediction) cross-entropy loss, focusing entirely on the classification output for n = N.

Eliminating the next token prediction objective enables full utilization of the pretrained weights for vulnerability detection, without interference from unrelated generation tasks.

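To make the loss in Eq. (2) concrete, the sketch below shows one way of computing a last-token binary classification loss from the logits of a causal LM by scoring only the "YES"/"NO" continuation tokens. It is an illustrative reading of the loss described above, not the authors' exact implementation; in particular, the answer tokens and the right-padding assumption are illustrative.

    # Illustrative sketch: classification loss taken only from the last token of a
    # causal LM, as in Eq. (2). The "YES"/"NO" answer tokens and right padding are
    # assumptions for illustration, not the exact implementation from the paper.
    import torch
    import torch.nn.functional as F

    def last_token_classification_loss(model, tokenizer, input_ids, attention_mask, labels):
        """input_ids: (batch, seq_len) right-padded tokens; labels: (batch,) with 1 = vulnerable."""
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        # Index of the last non-padding token of every sequence in the batch.
        last_positions = attention_mask.sum(dim=1) - 1                                  # (batch,)
        last_logits = outputs.logits[torch.arange(input_ids.size(0)), last_positions]   # (batch, vocab)

        # Assumed single-token ids for the negative / positive answers.
        no_id = tokenizer.convert_tokens_to_ids("NO")
        yes_id = tokenizer.convert_tokens_to_ids("YES")

        # Restrict the next-token distribution to the two answer tokens.
        two_way_logits = last_logits[:, [no_id, yes_id]]          # (batch, 2)
        loss = F.cross_entropy(two_way_logits, labels)            # Eq. (2) averaged over the batch
        prob_vulnerable = two_way_logits.softmax(dim=-1)[:, 1]    # probability used for ROC AUC
        return loss, prob_vulnerable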
2.5 Speeding up the classification training

The standard approach of passing code samples to the LLM by writing their tokens into the context window and padding the unused slots is problematic for our data. Most dataset functions have a length under 50 tokens (see Table 1), so this padding wastes computation.

Function length (tokens)    Number of functions
0-10                        1976
10-20                       31627
20-50                       51972
50-100                      32194
100-300                     35769
300-500                     8756
500-1000                    5515
1000-2000                   2236
>2000                       767

Table 1. Function length histogram, obtained from parsing several projects.

We see that most functions have fewer than 50 tokens, with many having 10-20 tokens. Using one function per batch element (padded to 2048 tokens) is highly redundant. To mitigate the issue of short sequence lengths, we pack multiple functions into each training sequence. This increases the actual batch size compared to using one function per sequence.

Thus, during our study we focus on the following two regimes of using code LLMs:

1. Next token prediction. This approach corresponds to using the next token prediction loss (1) during training and prediction. During training, an LLM's input sequence is packed with the code of training methods and their ground truth labels one after another in the following format:

   Method's code + YES / NO (binary ground truth label) + <EOS>

   Here <EOS> denotes the special "end of sequence" token from the vocabulary. We fit as many full training examples as possible into the input sequence length, which is equal to 2048 for StarCoder-based models and 512 for CodeBERT-based ones. The rest of the positions are filled with the special padding token. For prediction, the binary classification token is generated according to (1), and its probability is taken for the ROC AUC calculation;

2. Binary classification. This regime uses the binary cross-entropy classification loss (2) for training and prediction. Specifically, the training scheme for the binary classification loss is formalized as follows:

   a. Load a batch with as many complete function codes as it can fit, in the following format:

      Method's code + <EOS>

   b. Populate an array of labels, assigning a label to each function in the batch, where the label is either 1 or 0, indicating whether the function is vulnerable or not, respectively;
   c. During loss computation, match the predicted probability of the next token at the last token of each function with its corresponding label from the label array;
   d. Calculate the cross-entropy loss for each pair of predicted probability and actual label using (2);
   e. Sum up the cross-entropy loss values across all predictions in the batch (not the average, which usually works worse, as will be investigated later);
   f. The resulting sum is the loss for the batch, which is used to update the model's weights during training.

   The loss for a batch is thus given by:

       \mathcal{L}_{\mathrm{batch}} = -\sum_{i=1}^{B} \big[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \big]    (3)

   where B is the number of functions in the batch, y_i is the true label for the i-th function, and p_i is the model's predicted probability that the i-th function is vulnerable.

   Due to the memory limits of our hardware, we limited the batch size to only one input sequence during training. For prediction, no batch packing is used and only one test function is fitted into the entire input sequence, with the rest of the token positions filled by the padding token.

The baseline classification approach achieved only 1500 training samples per hour. This extremely slow training throughput made iterative experiments and tuning infeasible on our dataset. In contrast, batch packing provided a 13x speedup, to 20000 samples per hour. By concatenating multiple short sequences into one input example, we can significantly boost efficiency for tasks like ours with small input functions. The key advantage of batch packing is reducing the number of unused padding tokens: by packing sequences, a much larger proportion of computation goes towards informative function tokens rather than wasted padding.

In our case, batch packing enables practical finetuning by speeding up training by an order of magnitude, which allows reasonable iteration times for experimentation. This approach is also generally applicable when finetuning decoder models on tasks with short input sequences.

There is room for further enhancement of the batch packing method. The dynamic batch size could be stabilized using dynamic gradient accumulation steps: gradients would be accumulated until a target batch size is reached before applying an optimizer step. This would stabilize the batch size while retaining the efficiency benefits of packing sequences. However, this poses some technical challenges, as it requires non-trivial modifications to the training loop code in HuggingFace Transformers to support dynamic gradient accumulation.

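As an illustration of the packing scheme described above, the following sketch concatenates several tokenized functions (each followed by an <EOS> token) into one fixed-length input and records the position of each function's last token, where the classification loss of Eq. (2) is later applied with sum reduction. This is a simplified sketch under assumed tokenization details, not the exact training code.

    # Illustrative sketch of batch packing: several short functions share one input
    # sequence of length 2048, and the position of each function's last token is
    # remembered so that the classification loss can be evaluated there.
    # Special-token handling and the truncation policy are assumptions.
    from dataclasses import dataclass, field
    from typing import List

    MAX_LEN = 2048  # input sequence length for StarCoder-based models

    @dataclass
    class PackedExample:
        input_ids: List[int] = field(default_factory=list)
        label_positions: List[int] = field(default_factory=list)  # last-token index of each function
        labels: List[int] = field(default_factory=list)           # 1 = vulnerable, 0 = not vulnerable

    def pack_functions(tokenized_functions, labels, eos_id, pad_id, max_len=MAX_LEN):
        """Greedily pack (function tokens + <EOS>) units into sequences of at most max_len tokens."""
        packed, current = [], PackedExample()
        for tokens, label in zip(tokenized_functions, labels):
            unit = tokens + [eos_id]
            if len(unit) > max_len:      # overly long functions are simply skipped in this sketch
                continue
            if len(current.input_ids) + len(unit) > max_len:
                packed.append(current)
                current = PackedExample()
            current.input_ids.extend(unit)
            current.label_positions.append(len(current.input_ids) - 1)
            current.labels.append(label)
        if current.input_ids:
            packed.append(current)
        for example in packed:           # pad every packed sequence to the full input length
            example.input_ids.extend([pad_id] * (max_len - len(example.input_ids)))
        return packed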
3 Experimental setup

3.1 Datasets

Sources. Our vulnerability dataset is constructed from several open-source vulnerability datasets: CVEfixes [1], a Manually-Curated Dataset [14], and VCMatch [15].

The CVEfixes dataset includes information about vulnerabilities and their fixes in open-source software, with data collected from the NVD database and other repositories. However, some inconsistencies and errors may be present due to the automatic construction process.

The Manually-Curated Dataset focuses on the Java language only and includes data for 624 publicly disclosed vulnerabilities across 205 distinct open-source Java projects. The dataset is considered more reliable, as fix-commits are found by experts.

VCMatch provides a ranking-based approach for automatic security patch localization for OSS vulnerabilities. It includes data for only 10 popular repositories and only contains fix-commits.

Dataset labeling. Each of the aforementioned datasets has standalone functions as its elements. In each of the datasets, functions are labeled as vulnerable or non-vulnerable based on heuristics derived from analyzing vulnerability-fixing commits:

1. From vulnerability-fixing commits, the changed functions are extracted;
2. The pre-change versions are taken as vulnerable functions, while the post-change versions are taken as non-vulnerable ones;
3. Pairs of changed functions are used to create a balanced dataset. Functions that remained unchanged are labeled as non-vulnerable as well.

Dataset filtering. Vulnerability-fixing commits often contain changes related to cleanups, refactorings, and irrelevant functionality changes. Therefore, the described labeling heuristics produce some amount of false positive labels. To address this issue, we create a modified dataset: we extract a function as vulnerable from a vulnerability-fixing commit only if this function is the only one that has changed in the commit. We call this dataset X1. This dataset has the following advantages:

• Its labels are more robust, as commits fixing only one function are more likely to fix only vulnerabilities;
• The dataset is small, which makes it a good instrument for investigating algorithmic performance and finding optimal parameters.

More specifically, extracting functions from the commits fixing only one function mitigates labeling errors that stem from irrelevant code and cleanup changes [4].

Addition of easy negative functions. As a result of these procedures, we obtain a set of vulnerable functions (we call it P1) and a set of functions with the vulnerability fixed (we call it P2). We can also augment these sets with a set of functions randomly scraped from random projects, which do not have any vulnerabilities. We call this part P3.

Datasets characteristics. Finally, we obtain the following two datasets:

• X1 without P3. It has 1334 samples, with 810 samples in the train part, 272 samples in the validation part, and 252 samples in the test part. The balance of positive to negative classes is approximately 1:1;
• X1 with P3. It has 22945 samples, with 13247 samples in the train part, 5131 samples in the validation part, and 4567 samples in the test part. The balance of positive to negative classes is approximately 1:34, and the majority of the negative class is drawn from the P3 part.

3.2 Evaluation metrics

We use a standard procedure for model training and evaluation. We train a model for a predefined number of epochs, choose the best checkpoint by ROC AUC on the validation dataset, and evaluate the chosen checkpoint on the test dataset.

The reported metrics include ROC AUC, the F1 score for the positive class (vulnerable functions), and the optimal classification threshold (used for the F1 score). The optimal classification threshold is determined on the validation dataset. In most cases, we report both validation and test metrics.

Hyperparameter optimization is performed on the validation dataset using the best validation metric performance.

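The following sketch summarizes this evaluation protocol, assuming the optimal threshold is the one maximizing F1 on the validation split; it is a minimal illustration with scikit-learn over arrays of predicted probabilities and labels, not the project's exact evaluation code.

    # Minimal sketch of the evaluation procedure of Section 3.2: pick the
    # F1-optimal threshold on the validation split (an assumption about how the
    # "optimal classification threshold" is chosen), then report ROC AUC and F1
    # for the positive class on the test split.
    import numpy as np
    from sklearn.metrics import f1_score, roc_auc_score

    def best_threshold(val_labels, val_probs):
        """Scan candidate thresholds and keep the one maximizing F1 on validation."""
        candidates = np.unique(val_probs)
        scores = [f1_score(val_labels, val_probs >= t) for t in candidates]
        return float(candidates[int(np.argmax(scores))])

    def evaluate(val_labels, val_probs, test_labels, test_probs):
        threshold = best_threshold(val_labels, val_probs)
        return {
            "val_roc_auc": roc_auc_score(val_labels, val_probs),
            "test_roc_auc": roc_auc_score(test_labels, test_probs),
            "test_f1": f1_score(test_labels, test_probs >= threshold),
            "threshold": threshold,
        }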
3.3 Pretrained models

In this work, the primary pretrained model architecture employed is WizardCoder [11], a transformer-based neural network with a decoder-only architecture. With over 13 billion parameters, WizardCoder was pretrained using a causal language modeling objective on a large collection of GitHub source code, endowing the model with extensive knowledge of programming language constructs.

We compare WizardCoder to ContraBERT [10], which is the state-of-the-art model in the CodeBERT family. In contrast to vanilla CodeBERT, ContraBERT introduces an enhanced encoder architecture obtained by the cooperation of two pretrained CodeBERT encoders. ContraBERT is pretrained using the usual masked language modeling (MLM) objective along with a number of contrastive pretraining tasks, and can be finetuned for the defect detection task [10]. ContraBERT can be considered an enhancement of the CodeBERT model that showcases more accurate and robust results than CodeBERT for the defect detection task. Hence, ContraBERT is selected for the experiments.

3.4 Full reproducibility package

All necessary code and data for reproducing our experiments are available in our GitHub repository: https://github.com/rmusab/vul-llm-finetune

4 Research questions

4.1 Comparison to CodeBERT-based models.

RQ1: Is a StarCoder-based model more effective than a CodeBERT-based one for the balanced vulnerability detection task?

The models were finetuned on the dataset without easy P3 negatives, using different hyperparameters like the batch size, learning rate, and number of epochs. Optimal settings for this dataset were determined.

RQ2: Is a StarCoder-based model more effective than a CodeBERT-based one for the imbalanced vulnerability detection task?

Different training recipes, like the number of epochs and loss functions, were compared between the datasets with and without P3 in order to analyze the transferability of hyperparameters.

4.2 Ablation study.

RQ3: Is the standard LLM training approach effective for vulnerability detection?

We formulate the vulnerability detection task as a question answering problem where the model predicts the next token in addition to a binary vulnerability label. The goal is to leverage the standard training approach for LLM decoder models, applying it to our classification task. Training utilizes a cross-entropy loss for both the token prediction and vulnerability classification objectives.

Batch packing properties. Batch packing enables dynamic batch sizes but may cause unstable gradients. The key questions in this direction are:

• RQ4: Does the dynamic batch size harm the model quality or not?
• RQ5: Does the mean or the sum loss reduction perform better?

RQ6: Does an increased context size provide improvements in quality?

To determine the sensitivity to context, the model was trained with a reduced input sequence length of 512 tokens instead of 2048 on the dataset without the P3 part. The results were then compared in order to analyze the impact.

4.3 Imbalanced classification improvement.

RQ7: Is the focal loss with sample weighting effective for tackling the vulnerability detection class imbalance problem?

On the full dataset, focal loss and sample weighting techniques were applied to emphasize the minority positive examples from P1. The gamma parameter and the sampling weights of the focal loss were systematically varied to assess the performance.

For all experiments and analyses, the main evaluation metrics were ROC AUC, F1 score, accuracy, and the optimal classification threshold. The goal was to make finetuning possible for classification, as well as to determine the best practices for finetuning on an imbalanced task.

5 Experimental results

5.1 Comparison to CodeBERT-based models

RQ1: Is a StarCoder-based model more effective than a CodeBERT-based one for the balanced vulnerability detection task?

We compared the StarCoder-based model with the CodeBERT-based model on the dataset X1 without P3. This is a balanced dataset without easy negative examples.

The methodology is the following: we take a pretrained model, manually optimize its learning hyperparameters using the training and validation datasets, and then determine the quality of the best model on the test dataset. The optimized hyperparameters include the batch size, learning rate, and number of epochs.

The results are presented in Table 2. We see that the finetuned WizardCoder surpasses the finetuned ContraBERT both in ROC AUC and F1 metrics on the balanced vulnerability detection task. This superior performance of WizardCoder can be attributed to its larger model capacity of 13 billion parameters and its pretraining on a much larger code corpus.

Base model     ROC AUC    F1
ContraBERT     0.66       0.68
WizardCoder    0.69       0.71

Table 2. ROC AUC and F1 scores of the finetuned ContraBERT and WizardCoder models on the X1 without P3 dataset.

To finetune the WizardCoder model, several important hyperparameters were identified: a batch size of 1 (the actual batch size becomes approximately 120 by packing functions into the batch), a learning rate of 0.0001, and 50 epochs of training with a cosine annealing schedule. The best checkpoint occurred at epoch 19, indicating that sufficient training time is needed. The number of epochs for the cosine schedule is crucial, as it controls the learning rate distribution across epochs and greatly impacts the training regime.

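For illustration, these reported settings roughly correspond to the following HuggingFace Trainer configuration; the output path, the checkpoint-selection wiring, and the compute_metrics hook are assumptions, not the authors' exact training script.

    # Illustrative sketch of the finetuning configuration reported above
    # (batch size 1 with packed sequences, learning rate 1e-4, 50 epochs, cosine schedule).
    # The output path and the metric wiring are assumptions for illustration.
    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="wizardcoder-vuln-detection",  # assumed output path
        per_device_train_batch_size=1,            # one packed 2048-token sequence per step
        learning_rate=1e-4,
        num_train_epochs=50,
        lr_scheduler_type="cosine",               # cosine annealing over the whole run
        evaluation_strategy="epoch",              # evaluate each epoch on the validation split
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="roc_auc",          # assumes compute_metrics returns a "roc_auc" entry
    )
    # The Trainer would then be constructed with the LoRA-wrapped model, the packed
    # datasets, and a compute_metrics function producing ROC AUC on the validation set.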
RQ2: Is a StarCoder-based model more effective than a CodeBERT-based one for the imbalanced vulnerability detection task?

We compared the StarCoder-based model with the CodeBERT-based model on the X1 with P3 dataset. This is an imbalanced dataset in which the majority of samples are easy negative examples, which makes it closer to the real-world vulnerability distribution.

The methodology of the comparison is the same as in RQ1. The optimal training hyperparameters appeared to be the same for both datasets (X1 with and without P3). The identical optimal settings imply that the models train similarly on both datasets.

The results are presented in Table 3. We see that the finetuned WizardCoder surpasses the finetuned ContraBERT both in ROC AUC and F1 metrics on the imbalanced vulnerability detection task. However, the achieved gains over ContraBERT in ROC AUC are smaller compared to the X1 without P3 dataset. A potential reason is underutilization of the more informative P1 and P2 partitions of this dataset.

Base model     ROC AUC    F1
ContraBERT     0.85       0.22
WizardCoder    0.86       0.27

Table 3. ROC AUC and F1 scores of the finetuned ContraBERT and WizardCoder models on the X1 with P3 dataset.

5.2 Ablation study

RQ3: Is the standard LLM training approach effective for vulnerability detection?

The WizardCoder model was finetuned on the vulnerability dataset by formulating the task as a question answering problem. Specifically, the model was trained to predict the next token in addition to the binary vulnerability label on the X1 with P3 dataset. The results are presented in Table 4.

Base model     Training approach        ROC AUC
WizardCoder    Next token prediction    0.75
CodeBERT       Binary classification    0.85
WizardCoder    Binary classification    0.86

Table 4. ROC AUC scores for finetuning with the next token prediction and classification objectives on the imbalanced X1 with P3 dataset.

The next token prediction approach achieved a ROC AUC score of 0.75 for WizardCoder on the X1 with P3 dataset. This result is inferior to the CodeBERT and WizardCoder models trained with classification-only objectives (0.85 and 0.86, respectively), but still surpasses random performance.

Batch packing properties. The batch packing approach introduces some unique properties:

• The actual batch size becomes dynamic during training as sequences are packed;
• Using the standard mean loss reduction gives different functions unequal influence on the gradient, which may harm the overall performance;
• Using the sum loss reduction weights training steps unevenly, which may also be harmful to performance.

This raises two research questions:

• RQ4: Does the dynamic batch size harm the model quality or not?
• RQ5: Does the mean or the sum loss reduction perform better?

RQ4: Does the dynamic batch size harm the model quality or not?

The batch packing approach introduces a dynamic batch size during training. This dynamic batch size allows decreasing the training time by up to 13 times. However, a dynamic batch size is not a standard approach for training neural networks, so the following question arises: does it sacrifice model quality for the training speed improvement? In order to answer this question, we tried a variant of batch packing that limits the maximum number of functions in a batch. This limiting decreases the dispersion of the actual batch size distribution.

We tested two different values for the maximum number of functions per batch, 50 and 100, on the validation part of the X1 without P3 dataset. Our results are presented in Table 5. They imply that there is no significant influence of limiting the maximum number of functions per batch, indicating that dynamic batching may not have a detrimental effect on the model's performance.

Max. functions per batch    Validation ROC AUC
No limit                    0.72
50                          0.72
100                         0.72

Table 5. WizardCoder's results with a limited maximum number of functions per batch on the validation part of the X1 without P3 dataset.

In summary, the dynamic batch size works well empirically, and limiting it does not lead to any improvements.

RQ5: Does the mean or the sum loss reduction perform better?

Each kind of loss reduction has arguments against it:

• Using the standard mean loss reduction gives different functions unequal influence on the gradient, which may harm the overall performance;
• Using the sum loss reduction weights training steps unevenly, which may also be harmful to performance.

Thus, it becomes important to investigate whether the mean or the sum loss reduction is better suited for the batch packing approach. In order to answer this question, we performed testing on the validation part of the X1 without P3 dataset. The results are presented in Table 6. The mean loss reduction performed poorly, highlighting issues with the mean loss in comparison with the sum reduction.

Loss reduction    Validation ROC AUC
Mean              0.57
Sum               0.72

Table 6. WizardCoder's performance by loss reduction method on the validation part of the X1 without P3 dataset.

The superior performance of the sum loss reduction indicates that it better handles the uneven sequence lengths arising from batch packing. By summing over the packed sequences, each function contributes equally to the gradient.

In summary, the sum loss reduction method outperforms the mean one by better handling uneven sequence lengths.

RQ6: Does an increased context size provide improvements in quality?

An open question was whether the quality improvements obtained by WizardCoder stemmed from the larger 2048-token context size (versus the 512-token baseline) or from better code understanding by the model. We conducted an experiment to isolate the impact of the context size: the 2048-context-size WizardCoder model was compared to the 512-context-size version on the test part of the X1 without P3 dataset.

Context size    Test ROC AUC
512 tokens      0.69
2048 tokens     0.69

Table 7. The impact of WizardCoder's context size on performance.

Table 7 shows that reducing the context size from 2048 tokens to 512 tokens resulted in an identical ROC AUC score of 0.69 for WizardCoder. This suggests the performance gains are mainly due to the model's improved code representations rather than the increased context size.

5.3 Imbalanced classification improvement

RQ7: Is the focal loss with sample weighting effective for tackling the vulnerability detection class imbalance problem?

The gains of the StarCoder-based model over the CodeBERT-based one on the imbalanced X1 with P3 dataset are minor compared to the X1 without P3 dataset. A potential reason is underutilization of the more informative P1 and P2 parts of this dataset. In order to explore the influence of better utilization of the P1 and P2 parts on the model's quality, we conducted experiments that incorporate the focal loss and sample weighting techniques:

• Focal loss emphasizes hard and rare examples contained in the P1 and P2 parts;
• Sample weighting also focuses the model on the minority cases from the P1 and P2 parts.

Focal loss. Focal loss [9] applies a modulating factor (1 - p_t)^γ to the standard cross-entropy loss. This factor down-weights well-classified examples and focuses training on hard, misclassified cases. The standard cross-entropy loss corresponds to γ = 0.

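The sketch below illustrates this binary focal loss with an optional per-sample weight of the kind applied to the P1 and P2 examples later in this section. It is an illustration of the loss from [9] as used here, with assumed tensor shapes, not the exact training code.

    # Illustrative sketch of the binary focal loss (Lin et al. [9]) with an optional
    # per-sample weight used to emphasize the P1/P2 examples. gamma = 0 recovers the
    # ordinary cross-entropy loss. Integration into the training loop is not shown.
    import torch

    def focal_loss(probs, labels, gamma=1.0, sample_weights=None):
        """probs: (batch,) predicted probability of the positive class; labels: (batch,) 0/1."""
        p_t = torch.where(labels == 1, probs, 1.0 - probs)    # probability of the true class
        ce = -torch.log(p_t.clamp(min=1e-8))                  # standard cross-entropy term
        loss = (1.0 - p_t) ** gamma * ce                      # focal modulation (1 - p_t)^gamma
        if sample_weights is not None:                        # e.g. 3.0 for P1/P2 samples, 1.0 for P3
            loss = loss * sample_weights
        return loss.sum()                                     # sum reduction, as preferred in RQ5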
We tested γ values from 1 to 5 on the validation part of the X1 with P3 dataset. The results are presented in Table 8.

                   γ = 0    γ = 1    γ = 3    γ = 5
ROC AUC            0.858    0.873    0.868    0.852
F1                 0.277    0.265    0.272    0.250
Best val. epoch    12       12       17       11
Best threshold     0.087    0.288    0.265    0.4

Table 8. Resulting quality of WizardCoder for different γ values in the focal loss on the imbalanced X1 with P3 dataset.

In summary, the focal loss with γ = 1 achieves a small improvement in ROC AUC over the standard cross-entropy loss (γ = 0). However, the gains are small.

The peak quality at γ = 1, followed by a decrease at higher γ values, suggests the importance of balancing the emphasis on hard examples with retaining a sufficient training signal from easier instances.

One potential reason for such small gains is that noisy or imperfect labels limit the ability of the focal loss to accurately identify the most important hard examples to prioritize. If the label of an example does not precisely reflect its difficulty, a heavy focus on hard cases may be less effective. The inconsistent improvements demonstrate the need for further exploration of methods that leverage hard examples without detriment to learning on easier cases, while taking the label quality into account.

An interesting observation is that the focal loss leads to better calibrated models, with thresholds closer to 0.5. This indicates that the predicted class probabilities are closer to the true values under the focal loss. The better calibration is a useful characteristic.

Adding sample weights. Sample weighting assigns higher importance weights to the P1 and P2 examples compared to P3 during training. Weights from 1.0 to 30.0 were tried; larger weights place more emphasis on the minority informative data. A similar technique was reported to be effective with the focal loss in the original article [9]. The applied technique is different from the usual class weighting scheme, since it puts more weight on samples of both the positive (P1) and negative (P2) parts. In previous experiments, we found that the class weighting technique was ineffective, so the sample weighting scheme provides a different perspective. The numeric results are presented in Table 9.

Weight        Best val. AUC    Best epoch    Test AUC    Test F1    Threshold
30            0.876            6             0.863       0.22       0.42
10.0          0.878            13            0.86        0.24       0.44
3.0           0.88             12            0.875       0.265      0.07
1.0 (base)    0.87             12            0.858       0.277      0.08

Table 9. Resulting quality of WizardCoder for different values of the P1 + P2 weights on the imbalanced X1 with P3 dataset.

From Table 9 we conclude that adding sample weights for the P1 and P2 partitions can provide small improvements in ROC AUC and F1 score over the baseline. However, large weight values degrade performance. The best weighting scheme of 3x attains marginal gains in both ROC AUC and F1 over unweighted training, while excessive weighting, like 30x, hurts the validation and test metrics.

This indicates that there is an optimal moderate weighting that slightly emphasizes the hard and rare P1 and P2 examples without skewing the distribution too heavily. Further tuning of the weighting factor may yield additional gains.

Combination of sample weights with focal loss. Both the focal loss and the additional weights do similar work. In the original work [9], it was stated that adding weights to the focal loss leads to some improvements. Therefore, we combined the focal loss with the sample weighting technique. The results are presented in Table 10.

Weight    γ      Val. AUC    Test AUC    Test F1
3         1.0    0.88        0.877       0.273
10        3      0.872       0.853       0.226
10        1      0.877       0.863       0.243
3         3      0.874       0.865       0.286
None      1      0.87        0.873       0.265
None      3      0.87        0.868       0.272

Table 10. The results of WizardCoder using the focal loss and sample weighting on the imbalanced X1 with P3 dataset.

Experiments with the focal loss and sample weighting demonstrated minor improvements over the baseline training, with the best model achieving 0.877 ROC AUC versus 0.86 for the baseline. However, neither technique provided substantial gains. More advanced methods that can explicitly account for the severe class imbalance are needed.

6 Threats to validity

There are some potential threats to validity that could limit the objectiveness of our study:

• The dataset is not split by projects; splitting by projects might have resulted in a significant decrease in quality;
• When finetuning a model in the classification regime with batch packing, the model is able to see the functions that precede the current one in the input. This could limit the training effectiveness of the resulting model. However, the negative impact on training should be mitigated, since these functions are placed randomly, so it would be hard for the model to learn irrelevant features. Also, during inference on test data, only one function was considered in any single input sequence, so there is no bias in the test scores;
• The function-level granularity of our dataset excludes the types of vulnerabilities that span multiple functions;
• The dataset size is relatively small. This might limit its representativeness of the true vulnerability distribution.

7 Conclusion

Our work demonstrates the effectiveness of finetuning large language models for the vulnerability detection problem in source code. The WizardCoder model achieved state-of-the-art results, with a ROC AUC score of 0.69 on the dataset without easy negative examples. This improves over previous CodeBERT-based models, likely due to WizardCoder's larger model capacity and pretraining corpus.

Several key contributions are made:

• An efficient batch packing strategy is developed to mitigate small sequence lengths, providing over a 13x speedup in training time. This enables faster iteration and tuning;
• An improvement in ROC AUC from 0.66 to 0.69 was obtained over the state-of-the-art non-LLM model ContraBERT on the X1 dataset without the P3 part. An improvement of the F1 score was obtained for the dataset with P3 (0.27 vs 0.22);
• For the highly imbalanced dataset, it was shown that the focal loss with sample weighting improves ROC AUC from 0.86 to 0.878. Despite these improvements, more advanced methods are needed to properly emphasize the minority positive examples;
• A new, more precise vulnerability benchmark dataset is collected. It is smaller than most of the available datasets, and it possesses quality labels that are free from the errors stemming from irrelevant code and cleanup changes [4].

Opportunities remain for further improvements in this approach, mainly related to training on the dataset with the P3 part included. Future work should explore techniques like curriculum learning, active sampling, and data augmentation to better leverage the scarce minority data. The insights gained can guide research on broader tasks involving code analysis and understanding.

While we show an improvement in quality on the vulnerability detection task, the improvement is not as big as one could expect given the used LLM's scale and capacity. This gives us an answer to the question posed at the beginning of the paper, namely whether the encountered performance limit is due to the limited capacity of CodeBERT-like models. We see that the main bottlenecks limiting performance lie elsewhere, possibly in the field of dataset quality and the usage of project-level context information.

References

[1] Guru Bhandari, Amara Naseer, and Leon Moonen. 2021. CVEfixes: automated collection of vulnerabilities and their fixes from open-source software. In Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering (Athens, Greece) (PROMISE 2021). Association for Computing Machinery, New York, NY, USA, 30-39. https://doi.org/10.1145/3475960.3475985
[2] S. Chakraborty, R. Krishna, Y. Ding, and B. Ray. 2022. Deep Learning Based Vulnerability Detection: Are We There Yet? IEEE Transactions on Software Engineering 48, 09 (Sep 2022), 3280-3296. https://doi.org/10.1109/TSE.2021.3087402
[3] Anton Cheshkov, Pavel Zadorozhny, and Rodion Levichev. 2023. Evaluation of ChatGPT Model for Vulnerability Detection. arXiv:2304.07232 [cs.CR]
[4] Roland Croft, M. Ali Babar, and M. Mehdi Kholoosi. 2023. Data Quality for Software Vulnerability Datasets. In Proceedings of the 45th International Conference on Software Engineering (Melbourne, Victoria, Australia) (ICSE '23). IEEE Press, 121-133. https://doi.org/10.1109/ICSE48619.2023.00022
[5] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 1536-1547. https://doi.org/10.18653/v1/2020.findings-emnlp.139
[6] Michael Fu and Chakkrit Tantithamthavorn. 2022. LineVul: A Transformer-based Line-Level Vulnerability Prediction. In 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR). 608-620. https://doi.org/10.1145/3524842.3528452
[7] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations. https://openreview.net/forum?id=nZeVKeeFYf9

[8] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. 2023. StarCoder: may the source be with you! arXiv:2305.06161 [cs.CL]
[9] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal Loss for Dense Object Detection. In 2017 IEEE International Conference on Computer Vision (ICCV). 2999-3007. https://doi.org/10.1109/ICCV.2017.324
[10] Shangqing Liu, Bozhi Wu, Xiaofei Xie, Guozhu Meng, and Yang Liu. 2023. ContraBERT: Enhancing Code Pre-Trained Models via Contrastive Learning. In Proceedings of the 45th International Conference on Software Engineering (Melbourne, Victoria, Australia) (ICSE '23). IEEE Press, 2476-2487. https://doi.org/10.1109/ICSE48619.2023.00207
[11] Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. arXiv:2306.08568 [cs.CL]
[12] Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. 2022. PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods. https://github.com/huggingface/peft
[13] Erik Nijkamp, Hiroaki Hayashi, Caiming Xiong, Silvio Savarese, and Yingbo Zhou. 2023. CodeGen2: Lessons for Training LLMs on Programming and Natural Languages. arXiv:2305.02309 [cs.LG]
[14] Serena E. Ponta, Henrik Plate, Antonino Sabetta, Michele Bezzi, and Cédric Dangremont. 2019. A manually-curated dataset of fixes to vulnerabilities of open-source software. In Proceedings of the 16th International Conference on Mining Software Repositories (Montreal, Quebec, Canada) (MSR '19). IEEE Press, 383-387. https://doi.org/10.1109/MSR.2019.00064
[15] Shichao Wang, Yun Zhang, Lingfeng Bao, Xin Xia, and Minghui Wu. 2022. VCMatch: A Ranking-based Approach for Automatic Security Patches Localization for OSS Vulnerabilities. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). 589-600. https://doi.org/10.1109/SANER53432.2022.00076
[16] Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. 2023. CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Long Beach, CA, USA) (KDD '23). Association for Computing Machinery, New York, NY, USA, 5673-5684. https://doi.org/10.1145/3580305.3599790
