Finetuning Large Language Models for Vulnerability Detection

Alexey Shestov, Rodion Levichev, Ravil Mussabayev∗
Huawei Russian Research Institute, Moscow, Russia
of magnitude more data used for pretraining. They achieve better results on language and code understanding tasks, providing an opportunity for improved transfer learning. The vulnerability detection task can benefit from the increased expressiveness and knowledge contained in state-of-the-art LLMs.

Vulnerability detection is a complex task with many aspects influencing and potentially limiting the solution quality. One of these factors could be limits on the capacity and code understanding ability of CodeBERT-like models, as vulnerability detection requires a deep understanding of many code aspects. Checking this hypothesis is a crucial problem in the field of vulnerability detection.

Furthermore, our previous study [3] revealed that simple prompting techniques for the ChatGPT LLM were unable to achieve noticeably better results than random prediction on the vulnerability detection task. This finding suggests that finetuning LLMs on this task naturally becomes a promising research direction.

The real-world distribution of vulnerable functions is highly class imbalanced, containing far more negative samples than positive ones. This imbalance makes the task inherently more difficult, as standard training objectives, like the cross-entropy loss, may lead the model to focus on the dominant negative class. Specialized training techniques, such as focal loss [9] and sample weighting, are required to properly emphasize the minority positive examples during finetuning.

This project conducts experiments to find the optimal model architecture, training hyperparameters, and loss formulations for effective LLM finetuning on the vulnerability detection task (both for balanced and imbalanced datasets). The aim is to leverage recent advances in pretrained models to improve over prior benchmark results.

The remainder of the paper is organized as follows:

• We describe our approach for building and adapting the model. Specifically, we describe how we solved the following problems:
  – Choosing the model;
  – Using limited GPU resources for the model size;
  – Adapting a decoder model for the classification task;
  – Underutilization of the input sequence length, resulting in a slow training speed.
• We describe our main instruments: datasets, models, metrics, as well as measurement procedures;
• We describe the conducted experiments:
  – Comparison with ContraBERT (a state-of-the-art non-LLM transformer model) on the balanced and imbalanced datasets;
  – An ablation study of the context size and classification loss;
  – An investigation of the batch packing technique;
  – Improving the usage of the more informative part of the imbalanced dataset.
• We conclude the study, highlighting our contributions and stating promising future research directions.

2 Implementation details

2.1 Challenges

We briefly describe the main milestones and challenges that we encountered during the implementation of our approach.

LLM model selection for finetuning. Recently, there has been a rise in the number of open-sourced LLM models trained on large code corpora [8, 11, 13, 16]. The task is to choose an optimal model for finetuning, both from the perspective of quality and of the potential technical difficulties involved. The goal is to select a model possessing the following properties:

• It shows a strong potential for effective transfer learning to the vulnerability detection task;
• It is technically easy to use and integrate into our framework.

Adapting LLMs for limited computational resources. Given our computing capabilities, finetuning complete LLM models is infeasible and would be inefficient even with increased resources. To enable effective finetuning of a 13-billion-parameter model on our hardware, we need methods that reduce the number of trained parameters and the memory requirements.

Adapting LLMs for the classification task. The standard pretraining task for LLM decoders is next token prediction, which differs from our end task of vulnerability classification. Including the next token prediction component in the loss may misguide weight updates away from the optimal state for classification. To fully leverage the capabilities of the LLM weights for our target task, a better strategy might be to keep only the classification term in the loss.
Improving a slow training speed. The standard sequence classification approach of padding each example to a fixed length makes training extremely slow. To mitigate the issue of short sequence lengths, the task is to pack multiple functions into each training sequence. This decreases the number of unused padding tokens, improving computational efficiency.

2.2 Choosing an LLM to finetune

Recently, there has been a proliferation of open-sourced LLM models trained on large code corpora [8, 11, 13, 16]. The task is to select an optimal model for finetuning.

First, we attempted to use the CodeGeeX model [16]. Unfortunately, we found that using this model was very difficult from the technical perspective, as the code provided by the authors does not support the standard libraries for transformer training and large model adaptation. We were able to run the model on 8 V100 GPUs for inference, but this memory capacity was insufficient for finetuning. The lack of support for common LLM adaptation libraries made it very challenging to overcome the memory limitations, so we discontinued our attempts with this model.

Next, we explored two other models, WizardCoder [11] and CodeGen [13]. In contrast to CodeGeeX, these models support major transformer training frameworks like DeepSpeed, HuggingFace Transformers, Peft [12], etc. This compatibility with standard libraries significantly eased the application of low-memory adaptation techniques.

When conducting a simple question answering training, we found that WizardCoder is much better than CodeGen at directly answering "YES" and "NO": CodeGen sometimes responded with "TRUE" or "FALSE", or with erratic "YES" and "NO" answers containing additional characters, and it has worse sequence stopping behavior. Given WizardCoder's stronger capabilities on this task, we focused our subsequent experiments on this model.

Choosing the best StarCoder-family model. In our study, we evaluate three models from the StarCoder family for finetuning: StarCoderBase, StarCoder [8] (which is StarCoderBase finetuned on a Python dataset), and WizardCoder [11] (which is an improved version of StarCoder trained using the Evol-Instruct technique). We compared the performance of StarCoder and StarCoderBase on the question answering task and found no significant differences between the two models, despite the fact that StarCoder was specifically adapted for Python. Subsequently, we primarily used the WizardCoder model, as it is claimed to be superior to StarCoder; however, our experiments did not reveal any significant differences in performance between these two models. In the remainder of the paper, we use the term "StarCoder" to refer specifically to the WizardCoder variant of the StarCoder family. Generally, more elaborate experiments are needed to find out whether there are any differences in the performance of these models.

2.3 Adapting LLMs for limited computational resources

Finetuning complete multi-billion-parameter LLMs is infeasible given our hardware constraints, and would be inefficient even if more resources were obtained. To enable effective finetuning, we require techniques that reduce the number of trained parameters and the memory needs. The HuggingFace Peft [12] library implements several methods tackling this, including the LoRA method [7]. LoRA enables full-model finetuning by decomposing the weight update matrices into low-rank approximations, drastically decreasing the number of trained parameters and the memory footprint.

We successfully applied LoRA to the WizardCoder model, using the following optimal LoRA settings: r = 8, alpha = 32, dropout = 0.05. This reduced the 13-billion-parameter model to just 25 million trained parameters, fewer than CodeBERT has. The results validate LoRA's effectiveness for LLM adaptation under memory limitations.
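As an illustration, the following sketch shows how such an adapter can be attached with the Peft library. The checkpoint identifier and the choice of target modules are our assumptions for a StarCoder-family model, not an excerpt from our training code:

    # A minimal LoRA sketch with HuggingFace Peft, assuming a
    # StarCoder-family checkpoint; the model id is a placeholder.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("path/to/wizardcoder-checkpoint")

    config = LoraConfig(
        r=8,                        # rank of the low-rank decomposition
        lora_alpha=32,              # scaling factor for the LoRA updates
        lora_dropout=0.05,          # dropout on the LoRA branches
        target_modules=["c_attn"],  # attention projection in GPTBigCode-style
                                    # models (an assumption)
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(model, config)
    model.print_trainable_parameters()  # tens of millions instead of billions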
2.4 Adapting LLMs for classification

Typically, code LLMs are trained on a large amount of unlabeled code data in a special way: the principle is to iteratively take code tokens as input, predict the next token, and compare it with the ground truth. This principle drives the next token prediction loss. Specifically, for any input sequence {x_1, x_2, . . . , x_N} of length N, the output of an LLM is a probability distribution of the next token P(x_{N+1} | x_1, x_2, . . . , x_N, Θ) = p_{N+1} ∈ [0, 1]^v, where Θ represents all parameters of the model, and v is the vocabulary size. By comparing it with the real distribution, i.e., a one-hot vector y_{N+1} ∈ {0, 1}^v of the ground-truth token, we can optimize the cumulative cross-entropy loss:

    L_ntp = − Σ_{n=1}^{N−1} y_{n+1} · log P(x_{n+1} | x_1, x_2, . . . , x_n, Θ)    (1)
In the context of adapting a language model for classification tasks, the objective function used during training needs to be aligned with the classification goal rather than with next token prediction. For an input sequence {x_1, x_2, . . . , x_N} of length N, the traditional generative pretraining objective (the next token prediction loss) would compute the loss across all predicted tokens in the sequence. However, this is suboptimal for a classification task such as vulnerability classification, which is concerned with categorizing the entire input sequence rather than predicting subsequent tokens.

To address this, we propose a classification loss that leverages only the probability predicted after the final token x_N, which is then matched against the ground truth label y using cross-entropy. This loss is expressed mathematically as:

    L_class = − log P(y | x_1, x_2, . . . , x_N, Θ)    (2)

Here, y represents the correct class label for the sequence, and P(y | x_1, x_2, . . . , x_N, Θ) is the probability that the model, parameterized by Θ, assigns to the correct class. This formulation ensures that weight updates during training are driven exclusively by the classification objective, without any influence from the generative task of next token prediction. The classification loss may be viewed as the last term of the generative pretraining objective's cross-entropy loss (1), focusing entirely on the classification output for n = N.

Eliminating the next token prediction objective enables full utilization of the pretrained weights for vulnerability detection, without interference from unrelated generation tasks.

Thus, during our study we focus on the following two regimes of using code LLMs:

1. Next token prediction. This approach corresponds to using the next token prediction loss (1) during training and prediction. During training, an LLM's input sequence is packed with the code of training methods and their ground truth labels, one after another, in the following format:

   Method's code + YES / NO (binary ground truth label) + <EOS>

Here <EOS> denotes the special "end of sequence" token from the vocabulary. We fit as many full training examples as possible into the input sequence length, which is equal to 2048 for StarCoder-based models and 512 for CodeBERT-based ones. The rest of the positions are filled with the special padding token. For prediction, the binary classification token is generated according to (1), and its probability is taken for the ROC AUC calculation;

2. Binary classification. This regime uses the binary cross-entropy classification loss (2) for training and prediction. Specifically, the training scheme for the binary classification loss is formalized as follows:

   a. Load a batch with as many complete function codes as it can fit, in the following format:

      Method's code + <EOS>

   b. Populate an array of labels, assigning a label to each function in the batch, where the label is either 1 or 0, indicating whether the function is vulnerable or not, respectively;
   c. During loss computation, match the probability of the next token predicted at the last token of each function with the function's corresponding label from the label array;
   d. Calculate the cross-entropy loss for each pair of predicted probability and actual label using (2);
   e. Sum up the cross-entropy loss values across all predictions in the batch (not the average, which usually works worse, as will be investigated later);
   f. The resulting sum is the loss for the batch, which is used to update the model's weights during training.

The loss L_batch for a batch is thus given by:

    L_batch = Σ_{i=1}^{B} [ −y_i · log(p_i) − (1 − y_i) · log(1 − p_i) ]    (3)

where B is the number of functions in the batch, y_i is the true label for the i-th function, and p_i is the model's predicted probability that the i-th function is vulnerable.

Due to the memory limits of our hardware, we limited the batch size to only one input sequence during training. For prediction, no batch packing is used: only one test function is fitted into the entire input sequence, with the rest of the token positions filled by the padding token.
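As a concrete illustration of this regime, the sketch below computes the loss (2)–(3) for one packed sequence by reading the next-token distribution at each function's final token. Restricting that distribution to the two label tokens, as well as the helper signature, are our illustrative assumptions:

    import torch
    import torch.nn.functional as F

    def classification_loss(logits, last_positions, labels, no_id, yes_id):
        """Sketch of the classification loss for one packed sequence.

        logits:         (seq_len, vocab) next-token logits of the causal LM
        last_positions: index of the final token of each packed function
        labels:         (num_functions,) 1 = vulnerable, 0 = non-vulnerable
        no_id, yes_id:  assumed vocabulary ids of the "NO"/"YES" tokens
        """
        step_logits = logits[last_positions]        # (num_functions, vocab)
        two_way = step_logits[:, [no_id, yes_id]]   # keep the two label tokens
        log_probs = F.log_softmax(two_way, dim=-1)  # renormalize over them
        # Cross-entropy per function, summed over the batch as in (3).
        return F.nll_loss(log_probs, labels, reduction="sum")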
2.5 Speeding up the classification training

The standard approach of passing code samples to the LLM by writing tokens into the context window and padding the unused slots is problematic for our data. Most dataset functions have a length under 50 tokens (see Table 1), so this padding wastes computation.

    Function length (tokens)   Number of functions
    0–10                       1976
    10–20                      31627
    20–50                      51972
    50–100                     32194
    100–300                    35769
    300–500                    8756
    500–1000                   5515
    1000–2000                  2236
    >2000                      767

Table 1. Function length histogram, obtained from parsing several projects.

We see that most functions have fewer than 50 tokens, with many having 10–20 tokens. Using one function per batch element (padded to 2048) is highly redundant. To mitigate the issue of short sequence lengths, we pack multiple functions into each training sequence. This increases the actual batch size compared to using one function per sequence.

The baseline classification approach achieved only 1500 training samples per hour. This extremely slow training throughput made iterative experiments and tuning infeasible on our dataset. In contrast, batch packing provided a 13x speedup, to 20000 samples per hour. By concatenating multiple short sequences into one input example, we can significantly boost efficiency for tasks like ours with small input functions. The key advantage of batch packing is the reduction of unused padding tokens: by packing sequences, a much larger proportion of computation goes towards informative function tokens rather than wasted padding.

In our case, batch packing enables practical finetuning by speeding up training by an order of magnitude, which allows reasonable iteration times for experimentation. This approach is also generally applicable when finetuning decoder models on tasks with short input sequences.
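A greedy version of this packing procedure can be sketched as follows; the function and variable names are ours, and real training code would additionally track attention masks and truncation:

    def pack_functions(tokenized_functions, labels, max_len=2048, eos_id=0):
        """Greedily concatenate <EOS>-terminated functions into sequences
        of at most max_len tokens, recording for every packed function the
        sequence it landed in, the position of its last token, and its label."""
        sequences, current, index = [], [], []
        for tokens, label in zip(tokenized_functions, labels):
            example = tokens + [eos_id]
            if len(example) > max_len:
                continue  # skip (or truncate) functions longer than the context
            if len(current) + len(example) > max_len:
                sequences.append(current)
                current = []
            current.extend(example)
            index.append((len(sequences), len(current) - 1, label))
        if current:
            sequences.append(current)
        return sequences, index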
There is room for further enhancement of the batch packing method. The dynamic batch size could be stabilized using dynamic gradient accumulation steps: gradients would be accumulated until a target batch size is reached before applying an optimizer step. This would stabilize the batch size while retaining the efficiency benefits of packing sequences. However, this poses some technical challenges, as it requires non-trivial modifications of the training loop code in HuggingFace Transformers to support dynamic gradient accumulation.
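We did not implement this enhancement, but a training loop with dynamic accumulation could look roughly as follows (the target batch size and the helper names are assumptions):

    target_functions = 128     # desired number of functions per optimizer step
    accumulated = 0

    for sequences, index in loader:           # one packed sequence per iteration
        loss = packed_loss(sequences, index)  # sum-reduced loss, as in (3)
        loss.backward()                       # gradients accumulate across calls
        accumulated += len(index)
        if accumulated >= target_functions:   # step only once enough functions
            optimizer.step()                  # have contributed gradients
            optimizer.zero_grad()
            accumulated = 0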
3 Experimental setup

3.1 Datasets

Sources. Our vulnerability dataset is constructed from several open-source vulnerability datasets: CVEfixes [1], a Manually-Curated Dataset [14], and VCMatch [15].

The CVEfixes dataset includes information about vulnerabilities and their fixes in open-source software, with data collected from the NVD database and other repositories. However, some inconsistencies and errors may be present due to the automatic construction process.

The Manually-Curated Dataset focuses on the Java language only and includes data for 624 publicly disclosed vulnerabilities across 205 distinct open-source Java projects. The dataset is considered more reliable, as its fix-commits were found by experts.

VCMatch provides a ranking-based approach for automatic security patch localization for OSS vulnerabilities. It includes data for only 10 popular repositories and contains only fix-commits.

Dataset labeling. Each of the aforementioned datasets has standalone functions as its elements. In each of the datasets, functions are labeled as vulnerable or non-vulnerable based on heuristics derived from analyzing vulnerability-fixing commits (a sketch of this procedure is given after the list):

1. From vulnerability-fixing commits, the changed functions are extracted;
2. The pre-change versions are taken as vulnerable functions, while the post-change versions are taken as non-vulnerable ones;
3. Pairs of changed functions are used to create a balanced dataset. Functions that remained unchanged are labeled as non-vulnerable as well.
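In pseudocode, the heuristic amounts to the following sketch; the commit interface is an assumption for illustration:

    def label_functions(fix_commit):
        """Sketch of the labeling heuristic for one vulnerability-fixing
        commit; `changed_functions` and `unchanged_functions` are an
        assumed interface over the commit diff."""
        vulnerable, non_vulnerable = [], []
        for pre_change, post_change in fix_commit.changed_functions:
            vulnerable.append(pre_change)       # version before the fix
            non_vulnerable.append(post_change)  # version after the fix
        # Functions untouched by the fix are also treated as non-vulnerable.
        non_vulnerable.extend(fix_commit.unchanged_functions)
        return vulnerable, non_vulnerable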
Dataset filtering. Vulnerability-fixing commits often contain changes related to cleanups, refactoring, or irrelevant functionality. Therefore, the described labeling heuristics produce some amount of false positive labels. To address this issue, we create a modified dataset: we extract a function as vulnerable from a vulnerability-fixing commit only if this function is the only one that has changed in the commit. We call this dataset X1. This dataset has the following advantages:

• Its labels are more robust, as commits fixing only one function are more likely to fix only vulnerabilities;
• The dataset is small, which makes it a good instrument for investigating the algorithmic performance and finding optimal parameters.

More specifically, extracting functions from the commits fixing only one function mitigates labeling errors that stem from irrelevant code and cleanup changes [4].

Addition of easy negative functions. As a result of these procedures, we obtain a set of vulnerable functions (we call it P1) and a set of functions with fixed vulnerabilities (we call it P2). We can also augment these sets with a set of functions randomly scraped from random projects, which do not have any vulnerabilities. We call this part P3.

Datasets characteristics. Finally, we obtain the following two datasets:

• X1 without P3. It has 1334 samples, with 810 samples in the train part, 272 samples in the validation part, and 252 samples in the test part. The balance of positive to negative classes is approximately 1:1;
• X1 with P3. It has 22945 samples, with 13247 samples in the train part, 5131 samples in the validation part, and 4567 samples in the test part. The balance of positive to negative classes is approximately 1:34, and the majority of the negative class is drawn from the P3 part.
3.2 Evaluation metrics

We use a standard procedure for model training and evaluation. We train a model for a predefined number of epochs, choose the best checkpoint by ROC AUC on the validation dataset, and evaluate the chosen checkpoint on the test dataset.

The reported metrics include ROC AUC, the F1 score for the positive class (vulnerable functions), and the optimal classification threshold (used for the F1 score). The optimal classification threshold is determined on the validation dataset. In most cases, we report both validation and test metrics. Hyperparameter optimization is performed on the validation dataset using the best validation metric performance.
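This evaluation protocol can be summarized by the following sketch; the threshold grid is our assumption, and the inputs are assumed to be NumPy arrays:

    import numpy as np
    from sklearn.metrics import roc_auc_score, f1_score

    def evaluate(val_y, val_p, test_y, test_p):
        """Sketch of our protocol: ROC AUC on both splits; the F1 threshold
        is tuned on validation predictions and reused on the test split."""
        grid = np.linspace(0.01, 0.99, 99)  # candidate thresholds (assumed)
        best = max(grid, key=lambda t: f1_score(val_y, val_p >= t))
        return {
            "val_roc_auc": roc_auc_score(val_y, val_p),
            "test_roc_auc": roc_auc_score(test_y, test_p),
            "test_f1": f1_score(test_y, test_p >= best),
            "threshold": float(best),
        }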
3.3 Pretrained models

In this work, the primary pretrained model architecture employed is WizardCoder [11], a transformer-based neural network with a decoder-only architecture. With over 13 billion parameters, WizardCoder was pretrained using a causal language modeling objective on a large collection of GitHub source code, endowing the model with extensive knowledge of natural and programming language constructs.

We compare WizardCoder to ContraBERT [10], the state-of-the-art model in the CodeBERT family. In contrast to vanilla CodeBERT, ContraBERT introduces an enhanced encoder architecture obtained by the cooperation of two pretrained CodeBERT encoders. ContraBERT is pretrained using the usual masked language modeling (MLM) objective along with a number of contrastive pretraining tasks, and can be finetuned for the defect detection task [10]. ContraBERT can be considered an enhancement of the CodeBERT model that showcases more accurate and robust results than CodeBERT on the defect detection task. Hence, ContraBERT is selected for the experiments.

3.4 Full reproducibility package

All necessary code and data for reproducing our experiments are available in our GitHub repository: https://github.com/rmusab/vul-llm-finetune.

4 Research questions

4.1 Comparison to CodeBERT-based models

RQ1: Is a StarCoder-based model more effective than a CodeBERT-based one for the balanced vulnerability detection task?

The models were finetuned on the dataset without the easy P3 negatives, using different hyperparameters like the batch size, learning rate, and number of epochs. Optimal settings for this dataset were determined.

RQ2: Is a StarCoder-based model more effective than a CodeBERT-based one for the imbalanced vulnerability detection task?

Different training recipes, like the number of epochs and loss functions, were compared between the datasets.
The methodology of comparison is the same as in RQ1. The optimal training hyperparameters appeared to be the same for both datasets (X1 with and without P3). The identical optimal settings imply that the models train similarly on both datasets.

The results are presented in Table 3. We see that finetuned WizardCoder surpasses finetuned ContraBERT both in the ROC AUC and F1 metrics on the imbalanced vulnerability detection task. However, the achieved gains over ContraBERT in the ROC AUC score are smaller compared to the X1 without P3 dataset. A potential reason is underutilization of the more informative P1 and P2 partitions of this dataset.

    Base model    ROC AUC   F1
    ContraBERT    0.85      0.22
    WizardCoder   0.86      0.27

Table 3. ROC AUC and F1 scores for the finetuned ContraBERT and WizardCoder models on the X1 with P3 dataset.

5.2 Ablation Study

RQ3: Is the standard LLM training approach effective for vulnerability detection?

The WizardCoder model was finetuned on the vulnerability dataset by formulating the task as a question answering problem. Specifically, the model was trained to predict the next token in addition to the binary vulnerability label on the X1 with P3 dataset. The results are presented in Table 4.

    Base model    Training approach       ROC AUC
    WizardCoder   Next token prediction   0.75
    CodeBERT      Binary classification   0.85
    WizardCoder   Binary classification   0.86

Table 4. ROC AUC scores for finetuning with the next token prediction and classification objectives on the imbalanced X1 with P3 dataset.

The next token prediction approach achieved a ROC AUC score of 0.75 for WizardCoder on the X1 with P3 dataset. This result is inferior to the CodeBERT and WizardCoder models trained with classification-only objectives (0.85 and 0.86, respectively), but still surpasses random performance.

Batch packing properties. The batch packing approach introduces some unique properties:

• The actual batch size becomes dynamic during training as sequences are packed;
• Using the standard mean loss reduction gives different functions unequal influence on the gradient, which may harm the overall performance;
• Using the sum loss reduction weights training steps unevenly, which may also be harmful to performance.

This raises two research questions:

• RQ4: Does the dynamic batch size harm the model quality or not?
• RQ5: Does the mean or the sum loss reduction perform better?

RQ4: Does the dynamic batch size harm the model quality or not?

The batch packing approach introduces a dynamic batch size during training. This dynamic batch size allows us to decrease the training time by up to 13 times. Dynamic batch size is not a standard approach for training neural networks, so the following question arises: does it sacrifice model quality for the training speed improvement? In order to answer this question, we tried a variant of batch packing that limits the maximum number of functions in a batch. This limiting decreases the dispersion of the distribution of actual batch sizes.

We tested two different values for the maximum number of functions per batch, 50 and 100, on the validation part of the X1 without P3 dataset. Our results are presented in Table 5. The results imply that there is no significant influence of limiting the maximum number of functions per batch, indicating that dynamic batching may not have a detrimental effect on the model's performance.

    Max. functions per batch   Validation ROC AUC
    No limit                   0.72
    50                         0.72
    100                        0.72

Table 5. WizardCoder's results with limiting the maximum number of functions per batch on the validation part of the X1 without P3 dataset.

In summary, the dynamic batch size works well empirically, and limiting it does not lead to any improvements.

RQ5: Does the mean or the sum loss reduction perform better?

Each kind of loss reduction has arguments against it:
• Using the standard mean loss reduction gives different functions unequal influence on the gradient, which may harm the overall performance;
• Using the sum loss reduction weights training steps unevenly, which may also be harmful to performance.

Thus, it becomes important to investigate whether the mean or the sum loss reduction is better suited for the batch packing approach. In order to answer this question, we performed testing on the validation part of the X1 without P3 dataset. The results are presented in Table 6. The mean loss reduction performed poorly, highlighting issues with the mean loss in comparison with the sum reduction.

    Loss reduction   Validation ROC AUC
    Mean             0.57
    Sum              0.72

Table 6. WizardCoder's performance by loss reduction method on the validation part of the X1 without P3 dataset.

The superior performance of the sum loss reduction indicates that it better handles the uneven sequence lengths arising from the batch packing. By summing over the packed sequences, each function contributes equally to the gradient.

In summary, the sum loss reduction method outperforms the mean one by better handling uneven sequence lengths.
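The difference between the two reductions can be seen on a toy example (an illustration, not our training code): with the mean reduction, a function packed together with many neighbors receives a smaller gradient weight than the same function packed with few, while the sum reduction weights every function equally at the cost of a step magnitude that varies with the batch content.

    import torch

    per_function_dense  = torch.rand(10)  # losses in a sequence packing 10 functions
    per_function_sparse = torch.rand(2)   # losses in a sequence packing 2 functions

    # Mean reduction: each function in the dense sequence carries weight 1/10,
    # each function in the sparse one carries weight 1/2 - unequal influence.
    loss_mean = per_function_dense.mean() + per_function_sparse.mean()

    # Sum reduction: every function carries weight 1 regardless of packing, but
    # the effective step size depends on how many functions were packed.
    loss_sum = per_function_dense.sum() + per_function_sparse.sum()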
RQ6: Does an increased context size provide improvements in quality?

An open question was whether the quality improvements obtained by WizardCoder stemmed from its larger context size (2048 tokens versus the 512-token baseline) or from better code understanding by the model. We conducted an experiment to isolate the impact of the context size: the 2048-context WizardCoder model was compared to the 512-context version on the test part of the X1 without P3 dataset.

    Context size   Test ROC AUC
    512 tokens     0.69
    2048 tokens    0.69

Table 7. The impact of WizardCoder's context size on performance.

Table 7 shows that reducing the context size from 2048 tokens to 512 tokens resulted in an identical ROC AUC score of 0.69 for WizardCoder. This suggests that the performance gains are mainly due to the model's learning of improved code representations, rather than to the increased context size.

5.3 Imbalanced classification improvement

RQ7: Is the focal loss with sample weighting effective for tackling the vulnerability detection class imbalance problem?

The obtained gains of the StarCoder-based model over the CodeBERT-based one on the imbalanced X1 with P3 dataset are minor compared to the X1 without P3 dataset. A potential reason is underutilization of the more informative P1 and P2 parts of this dataset. In order to explore the influence of better utilization of the P1 and P2 parts on the model's quality, we conducted experiments that incorporate the focal loss and sample weighting techniques:

• Focal loss emphasizes the hard and rare examples contained in the P1 and P2 parts;
• Sample weighting also focuses the model on the minority cases from the P1 and P2 parts.

Focal loss. The focal loss [9] applies a modulating factor (1 − p_t)^γ to the standard cross-entropy loss. This factor down-weights well-classified examples and focuses training on hard misclassified cases. The standard cross-entropy loss corresponds to γ = 0. We tested γ values from 1 to 5 on the validation part of the X1 with P3 dataset. The results are presented in Table 8.

                      γ = 0   γ = 1   γ = 3   γ = 5
    ROC AUC           0.858   0.873   0.868   0.852
    F1                0.277   0.265   0.272   0.250
    Best val. epoch   12      12      17      11
    Best threshold    0.087   0.288   0.265   0.4

Table 8. Resulting quality of WizardCoder for different γ values in the focal loss on the imbalanced X1 with P3 dataset.

In summary, the focal loss with γ = 1 achieves a small improvement in ROC AUC over the standard cross-entropy loss (γ = 0). However, the gains are small.
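A minimal sketch of this loss, under the assumption that the model outputs a probability of the vulnerable class per function, is:

    import torch

    def focal_loss(probs, labels, gamma=1.0):
        """Sketch of the binary focal loss [9]: (1 - p_t)**gamma down-weights
        well-classified examples; gamma = 0 recovers plain cross-entropy.

        probs:  (N,) predicted probabilities of the vulnerable class
        labels: (N,) ground-truth labels, 1 = vulnerable, 0 = non-vulnerable
        """
        p_t = torch.where(labels == 1, probs, 1.0 - probs)  # prob. of true class
        per_sample = -((1.0 - p_t) ** gamma) * torch.log(p_t.clamp_min(1e-8))
        return per_sample.sum()  # sum reduction, consistent with (3)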
The peak quality at γ = 1, followed by a decrease at higher γ values, suggests the importance of balancing the emphasis on hard examples with retaining a sufficient training signal from easier instances.

One potential reason for such small gains is that noisy or imperfect labels limit the ability of the focal loss to accurately identify the most important hard examples to prioritize. If the label of an example does not precisely reflect its difficulty, a heavy focus on hard cases may be less effective. The inconsistent improvements demonstrate the need for further exploration of methods to leverage hard examples without detriment to learning on easier cases, while taking the label quality into account.

An interesting observation is that the focal loss leads to better calibrated models, with thresholds closer to 0.5. This indicates that the predicted class probabilities are closer to the true values under the focal loss. The better calibration is a useful characteristic.

Adding sample weights. Sample weighting assigns higher importance weights to the P1 and P2 examples compared to P3 during training. Weights from 1.0 to 30.0 were tried. Larger weights place more emphasis on the minority informative data. A similar technique was reported to be effective with the focal loss in the original article [9].

The applied technique is different from the usual class weighting scheme, since it puts more weight on the samples of both the positive (P1) and negative (P2) parts. In previous experiments, we found that the class weighting technique was ineffective, so the sample weighting scheme provides a different perspective.

The numeric results are presented in Table 9.

    Weight       Best val. AUC   Best val. epoch   Test AUC   Test F1   Threshold
    30           0.876           6                 0.863      0.22      0.42
    10.0         0.878           13                0.86       0.24      0.44
    3.0          0.88            12                0.875      0.265     0.07
    1.0 (base)   0.87            12                0.858      0.277     0.08

Table 9. Resulting quality of WizardCoder for different values of the P1 + P2 weights on the imbalanced X1 with P3 dataset.

From Table 9 we conclude that adding sample weights for the P1 and P2 partitions can provide small improvements in the ROC AUC and F1 scores over the baseline. However, large weight values degrade performance. The best weighting scheme of 3x attains marginal gains in both ROC AUC and F1 over unweighted training, while excessive weighting, like 30x, hurts the validation and test metrics. This indicates that there is an optimal moderate weighting that slightly emphasizes the hard and rare P1 and P2 examples without skewing the distribution too heavily. Further tuning of the weighting factor may yield additional gains.

Combination of sample weights with focal loss. Both the focal loss and the additional weights do similar work. In the original work [9], it was stated that adding weights to the focal loss leads to some improvements. Therefore, we combined the focal loss with the sample weighting technique. The results are presented in Table 10.

    Weight   γ     Val. AUC   Test AUC   Test F1
    3        1.0   0.88       0.877      0.273
    10       3     0.872      0.853      0.226
    10       1     0.877      0.863      0.243
    3        3     0.874      0.865      0.286
    None     1     0.87       0.873      0.265
    None     3     0.87       0.868      0.272

Table 10. The results of WizardCoder using the focal loss and the sample weighting on the imbalanced X1 with P3 dataset.

Experiments with the focal loss and sample weighting demonstrated minor improvements over the baseline training, with the best model achieving a 0.877 ROC AUC versus 0.86 for the baseline. However, neither technique provided substantial gains, and more advanced methods that can explicitly account for the severe class imbalance are needed.
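For completeness, a sketch of the combined loss we experimented with, assuming a per-sample weight tensor built from the dataset partition (weight w > 1 for P1 and P2 samples, 1.0 for P3), is:

    def weighted_focal_loss(probs, labels, weights, gamma=1.0):
        """Sketch of the focal loss combined with sample weighting: samples
        from the informative P1 and P2 parts receive weight w > 1 (e.g., 3.0),
        while the easy P3 negatives keep weight 1.0."""
        p_t = torch.where(labels == 1, probs, 1.0 - probs)
        per_sample = -((1.0 - p_t) ** gamma) * torch.log(p_t.clamp_min(1e-8))
        return (weights * per_sample).sum()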
6 Threats to validity

There are some potential threats to validity that could limit the objectiveness of our study:

• The dataset is not split by projects, and splitting by projects might have resulted in a significant decrease in quality;
• When finetuning a model in the classification regime with batch packing, the model is able to see the functions that precede the current one in the input. This could limit the training effectiveness of the resulting model. However, the negative impact on training should be mitigated, since these functions are placed randomly, so it would be hard for the model to learn irrelevant features. Also, during inference on test data, only one function was considered in any single input sequence, so there is no bias in the test scores;
• The function-level granularity of our dataset limits the detectable vulnerability types, excluding those that span across multiple functions;
• The dataset size is relatively small. This might limit its representativeness of the true vulnerability distribution.
7 Conclusion

Our work demonstrates the effectiveness of finetuning large language models for the vulnerability detection problem in source code. The WizardCoder model achieved state-of-the-art results, with a ROC AUC score of 0.69 on the dataset without easy negative examples. This improves over previous CodeBERT-based models, likely due to WizardCoder's larger model capacity and pretraining corpus.

Several key contributions are made:

• An efficient batch packing strategy is developed to mitigate small sequence lengths, providing over a 13x speedup in training time. This enables faster iteration and tuning;
• An improvement in ROC AUC from 0.66 to 0.69 was obtained over the state-of-the-art non-LLM model ContraBERT on the X1 dataset without the P3 part. An improvement of the F1 score was obtained for the dataset with P3 (0.27 vs 0.22);
• For the highly imbalanced dataset, it was shown that the focal loss with sample weighting gives improvements in ROC AUC from 0.86 to 0.878. Despite these improvements, more advanced methods are needed to properly emphasize the minority positive examples;
• A new, more precise vulnerability benchmark dataset is collected. It is smaller than most of the available datasets, and it possesses quality labels that are free from the errors stemming from irrelevant code and cleanup changes [4].

Opportunities remain for further improvements in this approach, mainly related to training on the dataset with the P3 part included. Future work should explore techniques like curriculum learning, active sampling, and data augmentation to better leverage the scarce minority data. The insights gained can guide research on broader tasks involving code analysis and understanding.

While we show an improvement in the quality of the vulnerability detection task, the improvement is not as big as one could expect knowing the used LLM's scale and capacity. This gives us an answer to the question posed at the beginning of the paper, namely whether the encountered performance limit is due to the limited capacity of CodeBERT-like models. We see that the main bottlenecks of the task lie somewhere else, possibly in the field of dataset quality and in the usage of project-level context information.
References

[1] Guru Bhandari, Amara Naseer, and Leon Moonen. 2021. CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software. In Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering (Athens, Greece) (PROMISE 2021). Association for Computing Machinery, New York, NY, USA, 30–39. https://doi.org/10.1145/3475960.3475985

[2] S. Chakraborty, R. Krishna, Y. Ding, and B. Ray. 2022. Deep Learning Based Vulnerability Detection: Are We There Yet? IEEE Transactions on Software Engineering 48, 09 (Sep 2022), 3280–3296. https://doi.org/10.1109/TSE.2021.3087402

[3] Anton Cheshkov, Pavel Zadorozhny, and Rodion Levichev. 2023. Evaluation of ChatGPT Model for Vulnerability Detection. arXiv:2304.07232 [cs.CR]

[4] Roland Croft, M. Ali Babar, and M. Mehdi Kholoosi. 2023. Data Quality for Software Vulnerability Datasets. In Proceedings of the 45th International Conference on Software Engineering (Melbourne, Victoria, Australia) (ICSE '23). IEEE Press, 121–133. https://doi.org/10.1109/ICSE48619.2023.00022

[5] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 1536–1547. https://doi.org/10.18653/v1/2020.findings-emnlp.139

[6] Michael Fu and Chakkrit Tantithamthavorn. 2022. LineVul: A Transformer-based Line-Level Vulnerability Prediction. In 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR). 608–620. https://doi.org/10.1145/3524842.3528452

[7] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations. https://openreview.net/forum?id=nZeVKeeFYf9

[8] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. 2023. StarCoder: may the source be with you! arXiv:2305.06161 [cs.CL]

[9] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal Loss for Dense Object Detection. In 2017 IEEE International Conference on Computer Vision (ICCV). 2999–3007. https://doi.org/10.1109/ICCV.2017.324

[10] Shangqing Liu, Bozhi Wu, Xiaofei Xie, Guozhu Meng, and Yang Liu. 2023. ContraBERT: Enhancing Code Pre-Trained Models via Contrastive Learning. In Proceedings of the 45th International Conference on Software Engineering (Melbourne, Victoria, Australia) (ICSE '23). IEEE Press, 2476–2487. https://doi.org/10.1109/ICSE48619.2023.00207

[11] Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. arXiv:2306.08568 [cs.CL]

[12] Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. 2022. PEFT: State-of-the-art Parameter-Efficient Fine-Tuning Methods. https://github.com/huggingface/peft

[13] Erik Nijkamp, Hiroaki Hayashi, Caiming Xiong, Silvio Savarese, and Yingbo Zhou. 2023. CodeGen2: Lessons for Training LLMs on Programming and Natural Languages. arXiv:2305.02309 [cs.LG]

[14] Serena E. Ponta, Henrik Plate, Antonino Sabetta, Michele Bezzi, and Cédric Dangremont. 2019. A Manually-Curated Dataset of Fixes to Vulnerabilities of Open-Source Software. In Proceedings of the 16th International Conference on Mining Software Repositories (Montreal, Quebec, Canada) (MSR '19). IEEE Press, 383–387. https://doi.org/10.1109/MSR.2019.00064

[15] Shichao Wang, Yun Zhang, Lingfeng Bao, Xin Xia, and Minghui Wu. 2022. VCMatch: A Ranking-based Approach for Automatic Security Patches Localization for OSS Vulnerabilities. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). 589–600. https://doi.org/10.1109/SANER53432.2022.00076

[16] Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. 2023. CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Long Beach, CA, USA) (KDD '23). Association for Computing Machinery, New York, NY, USA, 5673–5684. https://doi.org/10.1145/3580305.3599790