Keywords: TF-IDF, BERT, Word2Vec, Regex, Stack Overflow, Programming language classification
Abstract: Software Question-Answer (SQA) sites such as Stack Overflow (SO) form a significant part of developers' resources for knowledge sharing. Owing to their popularity, these SQA sites receive a huge number of queries every day and therefore require an appropriate tagging mechanism to facilitate discussion among users. An intrinsic part of predicting these tags is predicting the programming language of the code segments associated with the questions. Although source code classification is a common task in the software engineering domain, identifying the programming language of short code snippets is considerably more difficult than working with complete source code. State-of-the-art models such as BERT and embedding-based algorithms such as word2vec are usually preferred for text classification; however, code snippets differ from natural language in both syntactic and semantic composition, so embedding techniques may not yield results as precise as traditional methods. To address this predicament, we propose a regex-based TF-IDF vectorization approach followed by chi-square feature reduction over an ANN classifier. We record interesting observations: our modified TF-IDF approach outperforms both word2vec over a Bi-LSTM and DistilBERT in classifying the snippets, while consuming fewer resources and less computational time. Our method achieves an accuracy of 85% over a corpus of 232,727 Stack Overflow code snippets, surpassing several baselines.
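As a minimal sketch of the pipeline described above (regex-driven TF-IDF vectorization, chi-square feature reduction, and an ANN classifier), the following assumes a scikit-learn implementation; the token pattern, the number of selected features, and the network architecture are illustrative placeholders, not the exact configuration used in our experiments.

    # Illustrative sketch only: regex-driven TF-IDF vectorization, chi-square
    # feature selection, and a small feed-forward ANN (MLPClassifier).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import Pipeline

    # Hypothetical token pattern: keep identifiers/keywords as well as the
    # punctuation and operator characters that act as language cues in snippets.
    TOKEN_PATTERN = r"[A-Za-z_][A-Za-z0-9_]*|[{}()\[\];:<>#$@!&|*+=/\\.,%-]"

    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(token_pattern=TOKEN_PATTERN, lowercase=True)),
        ("select", SelectKBest(chi2, k=5000)),   # chi-square feature reduction
        ("ann", MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=50)),
    ])

    # snippets: list of code-snippet strings; labels: their programming languages
    # pipeline.fit(snippets, labels)
    # predicted = pipeline.predict(test_snippets)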
Table 4: Performance record for individual programming languages for the ANN classifier (all results in %)

    Language        Precision  Recall  F1
    R                   85       92    88
    C                   80       88    84
    Scala               96       92    94
    C#                  78       82    80
    Perl                95       86    90
    VB.NET              93       89    91
    CSS                 80       79    80
    Swift               97       94    96
    Haskell             93       93    93
    HTML                53       53    53
    PHP                 88       83    85
    Java                86       76    81
    JavaScript          71       78    75
    Bash                76       89    82
    Ruby                88       89    89
    Lua                 95       89    92
    C++                 88       81    84
    Objective-C         93       93    93
    Markdown            94       19    32
    Python              87       87    87
    SQL                 79       88    83
    Accuracy             -        -    85
    Macro Avg.          86       82    82
    Weighted Avg.       85       85    85

In this subsection, we compare our model's performance with existing baselines and also investigate whether embedding-based techniques and state-of-the-art models such as BERT can perform better on source code.

Most of the earlier works have either considered snippets too insignificant to include or have merged their features with the features extracted from the body and description of the post. We can nevertheless make an overall comparison, keeping in view the end goal of classifying the programming language of code snippets. Identifying the language of large source code files, however, is very different from classifying snippets, and therefore an equivalent comparison between the two cannot be made.

Table 5 shows the overall comparison with similar works in the domain. Although Alreshedy et al. (2018; 2020) achieve better performance overall, it should be noted that when applied to snippets alone their accuracy drops, and our model surpasses their work. Combining the effect of the textual part of the body and the description is left as future work. Table 6 further draws a comparison between the work of Alreshedy et al. (2020) and our work at the level of individual programming languages. Besides, our investigation concerning the performance of embedding-based approaches and pre-trained models is presented in Table 7 and Figure 5.

Table 5: Comparative analysis of our approach against existing baselines for identifying tags in SO posts

    Study                       Dataset based on                      Acc., Pr., Rc., F1
    (Kuo, 2011)                 Title and description                 47%
    (Saha et al., 2013)         Only title                            68%
    (Stanley et al., 2013)      Title and description                 65%
    (Baquero, 2017)             Title and description                 60.8%, 68%, 60%, and 60%
    (Alreshedy et al., 2018)    Title and description                 81%, 83%, 81%, and 81%
                                Title, description and code snippet   91.1%, 91%, 91%, and 91%
                                Only code snippets                    73%, 72%, 72%, and 72%
    (Saini and Tripathi, 2018)  Title, description and code snippet   65%, 59%, 36%, and 42%
    (Jain and Lodhavia, 2020)   Only title                            Accuracy 75%, F1 81%
    (Kavuk and Tosun, 2020)     Title and description                 75%, 62%, 55%, and 39%
    (Alreshedy et al., 2020)    Title and description                 78.9%, 81%, 79%, and 79%
                                Title, description and code snippet   88.9%, 88%, 88%, and 88%

Table 6: Language-wise comparison (F1, in %) of our approach against PLI (PLI tool, 2018), SCC (Alreshedy et al., 2018), and SCC++ (Alreshedy et al., 2020)

    Language       PLI  SCC  SCC++  Our work
    Lua             50   84     70        92
    C               56   76     81        84
    Ruby            43   70     72        89
    C#              51   79     78        80
    C++             65   51     73        84
    Python          69   88     79        87
    R               72   77     78        88
    CSS             30   86     77        80
    VB.NET          60   83     77        91
    Swift           54   84     89        96
    Haskell         67   89     78        93
    HTML            35   54     55        53
    SQL             50   65     79        83
    Java            46   70     76        81
    Markdown        28   76     91        32
    Bash            67   76     85        82
    Objective-C     77   57     88        93
    Perl            69   74     41        90
    PHP             62   74     88        85
    Scala           72   76     81        94
    JavaScript      48   78     74        75

Table 7: Comparative analysis of our approach against Word2Vec and DistilBERT

    Classifier/Metric     TF-IDF + ANN   Word2Vec   DistilBERT
    Accuracy              85.04%         68.30%     61.2%
    Time taken per step   139.4 s        3936.6 s   36996 s
The significant performance improvement of our TF-IDF approach over existing works can be attributed to the decisive pre-processing of the corpus based on regular expressions and to the feature reduction technique. The neural network classifier outperforming ML-based classifiers further adds to the boost in performance.

Further, from Tables 4 and 6 we can see that, of all 21 languages, HTML and Markdown performed worst. Markdown had only 1,300 snippets compared to the 12,000 snippets of the other 20 languages, which largely explains its low performance through lack of training data. The poor performance on HTML can be attributed to the ambiguity of HTML snippets, which simultaneously contain CSS and JavaScript code segments and therefore lead to misclassification. One possible way to fix this anomaly is to increase the training data for Markdown and to generate multiple tags for HTML snippets, so that predictions of the JavaScript and CSS tags on such snippets are not counted as misclassifications.
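A hypothetical sketch of this lenient evaluation is given below: each HTML snippet is assigned a set of acceptable tags (e.g., html, css, javascript), and a prediction is counted as correct if it falls inside that set. The function name and tag-set construction are illustrative only and do not reflect our actual evaluation code.

    # Hypothetical lenient accuracy: a prediction counts as correct if it falls
    # inside the snippet's set of acceptable tags (multi-tag HTML snippets).
    def lenient_accuracy(predictions, acceptable_tag_sets):
        correct = sum(
            1 for pred, tags in zip(predictions, acceptable_tag_sets)
            if pred in tags
        )
        return correct / len(predictions)

    # Example: HTML snippets also accept css/javascript predictions.
    preds = ["css", "javascript", "python", "html"]
    tag_sets = [
        {"html", "css", "javascript"},   # HTML snippet with embedded CSS
        {"html", "css", "javascript"},   # HTML snippet with inline JS
        {"python"},
        {"html", "css", "javascript"},
    ]
    print(lenient_accuracy(preds, tag_sets))  # -> 1.0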
Elsewhere, the TF-IDF approach outperforming the embedding-based approaches can be attributed to the fact that code classification depends far less on the semantic association of words than regular text does.

One may argue that recent advances such as CodeBERT (Feng et al., 2020), which is specifically trained on a code corpus comprising six languages, could outperform traditional methods. In this regard, we point out that the computational time needed for DistilBERT alone was around 1000 times more than that of the TF-IDF approach with an RF classifier, and DistilBERT is the most lightweight member of the BERT family. Therefore, on the basis of the performance and computational-time trade-off, we can safely argue that traditional approaches are better suited for code classification purposes. However, a proper in-depth investigation in this regard remains promising future work.
5 CONCLUSION

Stack Overflow is one of the foremost resources for developers seeking technical assistance. Owing to its massive popularity, the site receives heavy traffic on a daily basis, and thus a proper mechanism for segregating posts is needed. Most of the queries on SO contain code snippets, which are very small compared to complete source files. Besides, source code classification is a topic of importance in itself. However, earlier works have focused on the classification of large source code files from GitHub repositories, and the prediction of programming languages in Stack Overflow posts is still a less explored and challenging task. To address this predicament, we present our work on classifying programming languages in SO post snippets, trained with a regex-based TF-IDF vectorizer over an ANN. We achieve a decent accuracy of 85%, which exceeds several baselines in the task of code snippet classification.

Further, we also investigate the utility of embedding-based algorithms such as word2vec and pre-trained models such as DistilBERT. Our investigation shows that traditional approaches are better suited for source code classification tasks, owing to the lack of semantic dependence in code scripts.
REFERENCES

Alreshedy, K., Dharmaretnam, D., German, D. M., Srinivasan, V., & Gulliver, T. A. (2018). Predicting the programming language of questions and snippets of Stack Overflow using natural language processing. arXiv preprint arXiv:1809.07954.

Alrashedy, K., Dharmaretnam, D., German, D. M., Srinivasan, V., & Gulliver, T. A. (2020). SCC++: Predicting the programming language of questions and snippets of Stack Overflow. Journal of Systems and Software, 162, 110505.

Barua, A., Thomas, S. W., & Hassan, A. E. (2014). What are developers talking about? An analysis of topics and trends in Stack Overflow. Empirical Software Engineering, 19, 619-654.

Baquero, J. F., Camargo, J. E., Restrepo-Calle, F., Aponte, J. H., & González, F. A. (2017, September). Predicting the programming language: Extracting knowledge from Stack Overflow posts. In Colombian Conference on Computing (pp. 199-210). Springer, Cham.

Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.

Cao, K., Chen, C., Baltes, S., Treude, C., & Chen, X. (2021, May). Automated query reformulation for efficient search based on query logs from Stack Overflow. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE) (pp. 1273-1285). IEEE.

Chen, T., & Guestrin, C. (2016, August). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).

Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., ... & Zhou, M. (2020). CodeBERT: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155.

Gilda, S. (2017, July). Source code classification using neural networks. In 2017 14th International Joint Conference on Computer Science and Software Engineering (JCSSE) (pp. 1-6). IEEE.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jain, V., & Lodhavia, J. (2020, June). Automatic question tagging using k-nearest neighbors and random forest. In 2020 International Conference on Intelligent Systems and Computer Vision (ISCV) (pp. 1-4). IEEE.

Kavuk, E. M., & Tosun, A. (2020, June). Predicting Stack Overflow question tags: A multi-class, multi-label classification. In Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops (pp. 489-493).

Saini, T., & Tripathi, S. (2018, March). Predicting tags for Stack Overflow questions using different classifiers. In 2018 4th International Conference on Recent Advances in Information Technology (RAIT) (pp. 1-5). IEEE.

Stanley, C., & Byrne, M. D. (2013, July). Predicting tags for StackOverflow posts. In Proceedings of ICCM (Vol. 2013).

Swaraj, A., & Kumar, S. (2022). A methodology for detecting programming languages in Stack Overflow questions. In Proceedings of the 17th International Conference on Software Technologies (ICSOFT 2022). DOI: 10.5220/0011310400003266.

Van Dam, J. K., & Zaytsev, V. (2016, March). Software language identification with natural language classifiers. In 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER) (Vol. 1, pp. 624-628). IEEE.