Keywords: TF-IDF, BERT, Word2Vec, Regex, Stack Overflow, Programming language classification
Abstract: Software Question-Answer (SQA) sites such as Stack Overflow (SO) form a significant part of developers' resources for knowledge sharing. Owing to their popularity, these SQA sites receive a huge number of queries every day and therefore require an appropriate tagging mechanism to facilitate discussion among users. An intrinsic part of predicting these tags is predicting the programming language of the code segments associated with the questions. Although source code classification is a common task in the software engineering domain, identifying the programming language of short code snippets is considerably more difficult than working with complete source code. State-of-the-art models such as BERT and embedding-based algorithms such as word2vec are usually preferred for text classification; however, code snippets differ from natural language in both syntactic and semantic composition, so embedding techniques may not yield results as precise as traditional methods. To address this predicament, we propose a regex-based TF-IDF vectorization approach followed by chi-square feature reduction over an ANN classifier. We record interesting observations: our modified TF-IDF approach outperforms both word2vec over a Bi-LSTM and DistilBERT in classifying the snippets, while consuming fewer resources and less computational time. Our method achieves an accuracy of 85% over a corpus of 232,727 Stack Overflow code snippets, surpassing several baselines.
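As a minimal sketch of the pipeline described above (regex-driven TF-IDF vectorization, chi-square feature reduction, and an ANN classifier), the following assumes a scikit-learn implementation; the token pattern, the number of selected features, and the network architecture are illustrative placeholders, not the exact configuration used in our experiments.

    # Illustrative sketch only: regex-driven TF-IDF vectorization, chi-square
    # feature selection, and a small feed-forward ANN (MLPClassifier).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import Pipeline

    # Hypothetical token pattern: keep identifiers/keywords as well as the
    # punctuation and operator characters that act as language cues in snippets.
    TOKEN_PATTERN = r"[A-Za-z_][A-Za-z0-9_]*|[{}()\[\];:<>#$@!&|*+=/\\.,%-]"

    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(token_pattern=TOKEN_PATTERN, lowercase=True)),
        ("select", SelectKBest(chi2, k=5000)),   # chi-square feature reduction
        ("ann", MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=50)),
    ])

    # snippets: list of code-snippet strings; labels: their programming languages
    # pipeline.fit(snippets, labels)
    # predicted = pipeline.predict(test_snippets)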
Table 4: Performance record for individual programming languages for the ANN classifier (all results in %)

    Language        Precision  Recall  F1
    R                   85       92    88
    C                   80       88    84
    Scala               96       92    94
    C#                  78       82    80
    Perl                95       86    90
    VB.NET              93       89    91
    CSS                 80       79    80
    Swift               97       94    96
    Haskell             93       93    93
    HTML                53       53    53
    PHP                 88       83    85
    Java                86       76    81
    JavaScript          71       78    75
    Bash                76       89    82
    Ruby                88       89    89
    Lua                 95       89    92
    C++                 88       81    84
    Objective-C         93       93    93
    Markdown            94       19    32
    Python              87       87    87
    SQL                 79       88    83
    Accuracy             -        -    85
    Macro Avg.          86       82    82
    Weighted Avg.       85       85    85

In this subsection, we compare our model's performance with existing baselines and also investigate whether embedding-based techniques and state-of-the-art models such as BERT can perform better on source code.

Most of the earlier works have either considered snippets too insignificant to include or have merged their features with the features extracted from the body and description of the post. We can nevertheless make an overall comparison, keeping in view the end goal of classifying the programming language of code snippets. Identifying the language of large source code files, however, is very different from classifying snippets, and therefore an equivalent comparison between the two cannot be made.

Table 5 shows the overall comparison with similar works in the domain. Although Alreshedy et al. (2018; 2020) achieve better performance overall, it should be noted that when applied to snippets alone their accuracy drops, and our model surpasses their work. Combining the effect of the textual part of the body and the description is left as future work. Table 6 further draws a comparison between the work of Alreshedy et al. (2020) and our work at the level of individual programming languages. Besides, our investigation concerning the performance of embedding-based approaches and pre-trained models is presented in Table 7 and Figure 5.

Table 5: Comparative analysis of our approach against existing baselines for identifying tags in SO posts

    Study                       Dataset based on                      Acc., Pr., Rc., F1
    (Kuo, 2011)                 Title and description                 47%
    (Saha et al., 2013)         Only title                            68%
    (Stanley et al., 2013)      Title and description                 65%
    (Baquero, 2017)             Title and description                 60.8%, 68%, 60%, and 60%
    (Alreshedy et al., 2018)    Title and description                 81%, 83%, 81%, and 81%
                                Title, description and code snippet   91.1%, 91%, 91%, and 91%
                                Only code snippets                    73%, 72%, 72%, and 72%
    (Saini and Tripathi, 2018)  Title, description and code snippet   65%, 59%, 36%, and 42%
    (Jain and Lodhavia, 2020)   Only title                            Accuracy 75%, F1 81%
    (Kavuk and Tosun, 2020)     Title and description                 75%, 62%, 55%, and 39%
    (Alreshedy et al., 2020)    Title and description                 78.9%, 81%, 79%, and 79%
                                Title, description and code snippet   88.9%, 88%, 88%, and 88%

Table 6: Language-wise comparison (F1, in %) of our approach against PLI (PLI tool, 2018), SCC (Alreshedy et al., 2018), and SCC++ (Alreshedy et al., 2020)

    Language       PLI  SCC  SCC++  Our work
    Lua             50   84     70        92
    C               56   76     81        84
    Ruby            43   70     72        89
    C#              51   79     78        80
    C++             65   51     73        84
    Python          69   88     79        87
    R               72   77     78        88
    CSS             30   86     77        80
    VB.NET          60   83     77        91
    Swift           54   84     89        96
    Haskell         67   89     78        93
    HTML            35   54     55        53
    SQL             50   65     79        83
    Java            46   70     76        81
    Markdown        28   76     91        32
    Bash            67   76     85        82
    Objective-C     77   57     88        93
    Perl            69   74     41        90
    PHP             62   74     88        85
    Scala           72   76     81        94
    JavaScript      48   78     74        75

Table 7: Comparative analysis of our approach against Word2Vec and DistilBERT

    Classifier/Metric     TF-IDF + ANN   Word2Vec   DistilBERT
    Accuracy              85.04%         68.30%     61.2%
    Time taken per step   139.4 s        3936.6 s   36996 s
The significant performance improvement of our TF-IDF approach over existing works can be attributed to the decisive pre-processing of the corpus based on regular expressions and to the feature reduction technique. The neural network classifier outperforming ML-based classifiers further adds to the boost in performance.

Further, from Tables 4 and 6 we can see that, of all 21 languages, HTML and Markdown performed worst. Markdown had only 1,300 snippets compared to the 12,000 snippets of the other 20 languages, which largely explains its low performance through lack of training data. The poor performance on HTML can be attributed to the ambiguity of HTML snippets, which simultaneously contain CSS and JavaScript code segments and therefore lead to misclassification. One possible way to fix this anomaly is to increase the training data for Markdown and to generate multiple tags for HTML snippets, so that predictions of the JavaScript and CSS tags on such snippets are not counted as misclassifications.
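A hypothetical sketch of this lenient evaluation is given below: each HTML snippet is assigned a set of acceptable tags (e.g., html, css, javascript), and a prediction is counted as correct if it falls inside that set. The function name and tag-set construction are illustrative only and do not reflect our actual evaluation code.

    # Hypothetical lenient accuracy: a prediction counts as correct if it falls
    # inside the snippet's set of acceptable tags (multi-tag HTML snippets).
    def lenient_accuracy(predictions, acceptable_tag_sets):
        correct = sum(
            1 for pred, tags in zip(predictions, acceptable_tag_sets)
            if pred in tags
        )
        return correct / len(predictions)

    # Example: HTML snippets also accept css/javascript predictions.
    preds = ["css", "javascript", "python", "html"]
    tag_sets = [
        {"html", "css", "javascript"},   # HTML snippet with embedded CSS
        {"html", "css", "javascript"},   # HTML snippet with inline JS
        {"python"},
        {"html", "css", "javascript"},
    ]
    print(lenient_accuracy(preds, tag_sets))  # -> 1.0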
Elsewhere, the TF-IDF approach outperforming the embedding-based approaches can be attributed to the fact that code classification depends far less on the semantic association of words than regular text does.

One may argue that recent advances such as CodeBERT (Feng et al., 2020), which is specifically trained on a code corpus comprising six languages, could outperform traditional methods. In this regard, we point out that the computational time needed for DistilBERT alone was around 1000 times more than that of the TF-IDF approach with an RF classifier, and DistilBERT is the most lightweight member of the BERT family. Therefore, on the basis of the performance and computational-time trade-off, we can safely argue that traditional approaches are better suited for code classification purposes. However, a proper in-depth investigation in this regard remains promising future work.
5 CONCLUSION

Stack Overflow is one of the foremost resources for developers seeking technical assistance. Owing to its massive popularity, the site receives heavy traffic on a daily basis, and thus a proper mechanism for segregating posts is needed. Most of the queries on SO contain code snippets, which are very small compared to complete source files. Besides, source code classification is a topic of importance in itself. However, earlier works have focused on the classification of large source code files from GitHub repositories, and the prediction of programming languages in Stack Overflow posts is still a less explored and challenging task. To address this predicament, we present our work on classifying programming languages in SO post snippets, trained with a regex-based TF-IDF vectorizer over an ANN. We achieve a decent accuracy of 85%, which exceeds several baselines in the task of code snippet classification.

Further, we also investigate the utility of embedding-based algorithms such as word2vec and pre-trained models such as DistilBERT. Our investigation shows that traditional approaches are better suited for source code classification tasks, owing to the lack of semantic dependence in code scripts.
REFERENCES

Alreshedy, K., Dharmaretnam, D., German, D. M., Srinivasan, V., & Gulliver, T. A. (2018). Predicting the programming language of questions and snippets of Stack Overflow using natural language processing. arXiv preprint arXiv:1809.07954.

Alrashedy, K., Dharmaretnam, D., German, D. M., Srinivasan, V., & Gulliver, T. A. (2020). SCC++: Predicting the programming language of questions and snippets of Stack Overflow. Journal of Systems and Software, 162, 110505.

Barua, A., Thomas, S. W., & Hassan, A. E. (2014). What are developers talking about? An analysis of topics and trends in Stack Overflow. Empirical Software Engineering, 19, 619-654.

Baquero, J. F., Camargo, J. E., Restrepo-Calle, F., Aponte, J. H., & González, F. A. (2017, September). Predicting the programming language: Extracting knowledge from Stack Overflow posts. In Colombian Conference on Computing (pp. 199-210). Springer, Cham.

Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.

Cao, K., Chen, C., Baltes, S., Treude, C., & Chen, X. (2021, May). Automated query reformulation for efficient search based on query logs from Stack Overflow. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE) (pp. 1273-1285). IEEE.

Chen, T., & Guestrin, C. (2016, August). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).

Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., ... & Zhou, M. (2020). CodeBERT: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155.

Gilda, S. (2017, July). Source code classification using neural networks. In 2017 14th International Joint Conference on Computer Science and Software Engineering (JCSSE) (pp. 1-6). IEEE.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jain, V., & Lodhavia, J. (2020, June). Automatic question tagging using k-nearest neighbors and random forest. In 2020 International Conference on Intelligent Systems and Computer Vision (ISCV) (pp. 1-4). IEEE.

Kavuk, E. M., & Tosun, A. (2020, June). Predicting Stack Overflow question tags: A multi-class, multi-label classification. In Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops (pp. 489-493).

Saini, T., & Tripathi, S. (2018, March). Predicting tags for Stack Overflow questions using different classifiers. In 2018 4th International Conference on Recent Advances in Information Technology (RAIT) (pp. 1-5). IEEE.

Stanley, C., & Byrne, M. D. (2013, July). Predicting tags for StackOverflow posts. In Proceedings of ICCM (Vol. 2013).

Swaraj, A., & Kumar, S. (2022). A methodology for detecting programming languages in Stack Overflow questions. In Proceedings of the 17th International Conference on Software Technologies (ICSOFT 2022). DOI: 10.5220/0011310400003266.

Van Dam, J. K., & Zaytsev, V. (2016, March). Software language identification with natural language classifiers. In 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER) (Vol. 1, pp. 624-628). IEEE.