Predicting Students' Academic Performance U
Sujith Jayaprakash
Sr. Lecturer, Department of ICT
BlueCrest College
Accra, Ghana.
sujith.jayaprakash@bluecrest.edu.gh
0263011390
Balamurugan E.
Assoc. Professor, Department of ICT
BlueCrest College
Accra, Ghana.
e.balamurugan@bluecrest.edu.gh
0279509431
Vibin Chandar
Lecturer, Department of ICT
BlueCrest College
Accra, Ghana.
Vibin.chandar@bluecrest.edu.gh
0263011399
Abstract: Education plays a vital role in helping people lead more comfortable lives. With the rapid growth in the
number of educational institutions around the world, many institutions are struggling to survive. Institutions
offering higher education in particular are striving to maintain the quality they offer to students. Many factors
influence the quality of an educational institution, such as infrastructure, teaching and learning methods,
laboratories, campus placements and linkages with industry. One of the major factors is student feedback.
Institutions now pay closer attention to students' feedback on their experience with lecturers and on the quality of
delivery of course content in the classroom. Retaining a good number of students depends on understanding and
satisfying students' needs. Hence, maintaining high quality standards is essential for any institution that wants to
improve the academic performance of its students and retain them in the system. In this paper, the Naive Bayes
algorithm is applied to predict students' academic performance in the end-of-semester exams by analysing student
feedback and performance in the mid-semester exams. This work helps educational institutions identify weaker
students in advance and arrange the necessary training before they sit their final exams.
Keywords: Naive Bayes Algorithm, Student Feedback, Academic Performance, Student Retention, Knowledge discovery.
1. INTRODUCTION
Data mining has been used in many areas of science and engineering, such as education, genetics, medicine,
bioinformatics and electrical power engineering. Data mining techniques and tools are used to extract meaning from
the large sets of data generated by people's learning activities. Data mining has also been widely used in business to
analyse customer relationship management, human resource management, marketing and so on. It has had a high
impact in the business sector, and education is now also tapping into its power.
Data mining (sometimes called data or knowledge discovery) is the process of analysing data from different
perspectives and summarising it into useful information, that is, information that can be used to increase revenue,
cut costs, or both. Data mining methods can be classified as supervised or unsupervised learning. In supervised
learning, the training data specifies what we are trying to learn (the classes), whereas in unsupervised learning the
training data does not specify what we are trying to learn (the clusters). Supervised learning is analogous to human
learning from past experience to gain new knowledge in order to improve our ability to perform real-world tasks
[1]. Various algorithms are used to perform supervised learning; a few among them are symbolic machine learning
algorithms, semi-symbolic machine learning algorithms, the Nearest Neighbour algorithm and the Naive Bayes algorithm.
The Naive Bayes algorithm is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian
statistics) with strong (naive) independence assumptions. It is one of the most basic classification techniques, with
applications in email spam detection, personal email sorting, document categorization, sexually explicit content
detection, language detection and sentiment detection. Despite its naive design and oversimplified assumptions,
Naive Bayes performs well in many complex real-world problems. The algorithm is highly scalable and requires a
number of parameters linear in the number of variables. In simple terms, a Naive Bayes classifier assumes that,
given the class, the presence (or absence) of a particular feature is unrelated to the presence (or absence) of any
other feature. Several colleges and universities have adopted feedback analysis systems using various data mining
models to improve student retention and to channel students to courses
In this paper, a supervised learning approach using the Naive Bayes algorithm [2] is used to predict students' final
examination results, with their feedback and mid-semester results serving as training data for analysing their
academic performance. Various attributes of the students' feedback are taken as dependent variables and the
mid-semester exam result is taken as an explanatory variable. This paper is organised as follows: Chapter I:
Introduction, Chapter II: Related Works, Chapter III: Proposed Methodology, Chapter IV: Results and Discussions,
Chapter V: Conclusion.
2. RELATED WORKS
Recent research has applied various data mining techniques to analyse the academic performance of students at
various levels. The following are a few of the studies concerned with academic progression.
M. Wook, Y. Hani Yamaya, N. Wahab, M. Rizal Mohd Isa, N. Fatimah Awang and H. Yann Seong compared two data
mining techniques, an Artificial Neural Network and a combination of clustering and decision tree classification, for
predicting and classifying students' academic performance. The technique that provided the more accurate
prediction and classification was chosen as the best model. Using this model, the patterns that influence students'
academic performance were identified.
S. Kumar Yadav, B. Bharadwaj and S. Pal obtained university students' data such as attendance, class test, seminar
and assignment marks from the students' database to predict performance at the end of the semester using three
algorithms, ID3, C4.5 and CART, and showed that CART is the best algorithm for classification of the data [3].
N. Thai Nghe, P. Janecek and P. Haddawy compared the accuracy of decision tree and Bayesian network algorithms
for predicting the academic performance of undergraduate and postgraduate students at two very different
academic institutes. These predictions are most useful for identifying and assisting failing students and for better
determining scholarships. The decision tree classifier provided better accuracy than the Bayesian network
classifier [4].
M. Alam and S. A. Alam presented a novel algorithm implementing decision trees to maximize a profit-based
objective function under resource constraints. More specifically, they take any decision tree as input and mine the
best actions to be chosen in order to maximize the expected net profit of all the customers.
NBTree, the Naive Bayesian tree learner (Kohavi 1996), combines Naive Bayesian classification and decision tree
learning. Bayesian classifiers are statistical classifiers. The Naive Bayes algorithm is a simple probabilistic classifier
that calculates a set of probabilities by counting the frequency and combinations of values in a given data set. In an
NBTree, a local naive Bayes classifier is deployed on each leaf of a traditional decision tree: after the tree is grown, a
naive Bayes classifier is constructed for each leaf using the data associated with that leaf. An NBTree classifies an
example by sorting it to a leaf and applying the naive Bayes classifier in that leaf to assign a class label to it.
3. METHODOLOGY
This research proposes a data mining technique for predicting students' academic performance by analysing
student feedback using the Naive Bayes algorithm [5]. The research process includes the following steps (Figure 1).
A. Data Selection
B. Data Transformation
C. Implementation of Naive Bayes algorithm
D. Classification
A. Data Selection :
The data was collected by means of a feedback rating-scale questionnaire, which is presented in Table 1. Table 1
contains nine questions, all related to the teaching and learning process of an institute. The questions in the
questionnaire are measured on a scale of 1 to 5, as shown in Table 2. The data was collected from 700 students in
various departments of BlueCrest College, Accra, Ghana in the academic year 2014, together with their internal
examination scores. The internal score is taken as an average course-wise score (Average of Internal
[Figure 1: Research process. Recoverable stage labels: Data Transformation, Naive Bayes, Classification Results]
B. Data Transformation :
The data derived from the feedback questionnaire was transformed into the proper format in order to be analysed
Option   Rating          Scale value
a        Below average   1
b        Average         2
c        Satisfactory    3
d        Good            4
e        Excellent       5
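The transformation step can be illustrated with a minimal Python sketch. The dictionary name `RATING_SCALE`, the helper `transform_response` and the example answers are hypothetical and are not taken from the study; only the five rating labels and their scale values come from the table above.

```python
# Mapping from the questionnaire's rating labels to scale values (1 to 5).
RATING_SCALE = {
    "Below average": 1,
    "Average": 2,
    "Satisfactory": 3,
    "Good": 4,
    "Excellent": 5,
}


def transform_response(labels):
    """Convert one student's rating labels into integers from 1 to 5."""
    return [RATING_SCALE[label] for label in labels]


# Made-up answers to the nine feedback questions (illustration only).
example = ["Good", "Excellent", "Average", "Good", "Satisfactory",
           "Good", "Below average", "Excellent", "Good"]
encoded = transform_response(example)
print(encoded)
```

Keeping the mapping in a single dictionary means an unexpected label raises a `KeyError` instead of silently producing a wrong scale value.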
C. Implementation of Naive Bayes algorithm :
The Naive Bayesian algorithm is based on Bayes' theorem with independence assumptions between predictors. A
Naive Bayesian model is easy to build, with no complicated iterative parameter estimation, which makes it
particularly useful for very large datasets. Despite its simplicity, the Naive Bayesian classifier often does surprisingly
well, and it is widely used because it often outperforms more sophisticated classification methods.
Bayes' theorem provides a way of calculating the posterior probability, P(c|x), from P(c), P(x), and P(x|c). The Naive
Bayes classifier assumes that the effect of the value of a predictor (x) on a given class (c) is independent of the values
of the other predictors.
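As a worked form of this statement, the posterior can be written out for n predictors. The symbols x_1, ..., x_n (one per predictor, for example n = 9 if each feedback question is treated as a predictor) and the hat notation for the predicted class are assumptions introduced here for illustration.

```latex
\begin{align*}
P(c \mid x_1, \ldots, x_n)
  &= \frac{P(x_1, \ldots, x_n \mid c)\, P(c)}{P(x_1, \ldots, x_n)}
  && \text{(Bayes' theorem)} \\
  &= \frac{P(c) \prod_{i=1}^{n} P(x_i \mid c)}{P(x_1, \ldots, x_n)}
  && \text{(independence of predictors given } c\text{)} \\
\hat{c}
  &= \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
  && \text{(the denominator does not depend on } c\text{)}
\end{align*}
```

Because the denominator is the same for every class, only the prior and the per-predictor likelihoods need to be estimated from the training data.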
D. Classification rule
Classification rules are generated from the classification process according to the user's request or the research
needs. They can be derived specifically to give a better understanding of each class of data in a database.
Apply the Bayesian rule to convert them into posterior probabilities from (1) and (2):
\[
P(C = c_i \mid X = x) = \frac{P(X = x \mid C = c_i)\, P(C = c_i)}{P(X = x)}
\propto P(X = x \mid C = c_i)\, P(C = c_i), \quad \text{for } i = 1, 2, \ldots, L \tag{3}
\]
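The posterior rule can be sketched in Python for categorical ratings. This is a minimal illustration and not the authors' Java implementation: the function names, the toy three-question data and the use of Laplace (add-one) smoothing are all assumptions made here.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels, n_values=5):
    """Count classes and per-feature rating frequencies in the training data."""
    class_counts = Counter(labels)
    # value_counts[(feature_index, class)][rating] = frequency
    value_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts, n_values


def posterior(model, row):
    """Return P(C = c_i | X = x) for every class c_i."""
    class_counts, value_counts, n_values = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(C = c_i)
        for i, value in enumerate(row):
            # Likelihood P(X_i = x_i | C = c_i) with add-one smoothing
            # (smoothing is an assumption, not stated in the paper).
            score *= (value_counts[(i, c)][value] + 1) / (n_c + n_values)
        scores[c] = score
    evidence = sum(scores.values())  # P(X = x), the normalising constant
    return {c: s / evidence for c, s in scores.items()}


def predict(model, row):
    """Pick the class with the highest posterior probability."""
    post = posterior(model, row)
    return max(post, key=post.get)


# Toy training data: three ratings per student (made up for illustration).
rows = [[5, 4, 5], [4, 4, 4], [5, 5, 4], [2, 1, 2], [1, 2, 1], [2, 2, 1]]
labels = ["PASS", "PASS", "PASS", "FAIL", "FAIL", "FAIL"]
model = train_naive_bayes(rows, labels)
post = posterior(model, [4, 5, 4])
prediction = predict(model, [4, 5, 4])
print(prediction, post)
```

Smoothing keeps a rating that never occurred with a class from forcing that class's posterior to zero, which matters for small samples.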
The implementation is based on the collected data, which covers various data mining aspects [6]. The student data
is used for the performance prediction. The proposed research work is divided into two modules: first the
feedback results are analysed, and then they are compared with the internal test performance.
The same can be implemented in the open source language Java, where the problem is designed as follows:
(ii) Input values are given on a scale of 1 to 5 for all 9 feedback questions.
(iii) Mean and variance values are computed for each question in the questionnaire.
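The per-question statistics can be sketched as follows. This is an illustrative Python sketch rather than the Java procedure the paper refers to; the function name, the three-question toy responses and the choice of population variance as the measure of spread are assumptions.

```python
from statistics import mean, pvariance


def question_statistics(responses):
    """Return a (mean, variance) pair for each question.

    `responses` holds one list of ratings per student; zip(*responses)
    regroups the ratings so that each tuple belongs to one question.
    """
    columns = list(zip(*responses))
    return [(mean(col), pvariance(col)) for col in columns]


# Made-up ratings from three students for three questions.
responses = [
    [4, 5, 3],
    [2, 5, 3],
    [3, 5, 3],
]
stats = question_statistics(responses)
for number, (avg, var) in enumerate(stats, start=1):
    print(f"Question {number}: mean={avg}, variance={var:.3f}")
```

A question with zero variance, such as the second and third ones here, carries no information for separating PASS from FAIL students.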
For the above implementation, test samples were taken from the students' feedback and transformed as input
for Procedure 1; the prediction result is then shown as PASS / FAIL. A random sample of 200 values was taken from
the student feedback dataset and imported into an MS-SQL database, from which it is given as input to Procedure 1.
The output is the result dataset with a value of PASS / FAIL.
4. RESULTS AND DISCUSSIONS
From a dataset of 700 student records, we chose a sample of 200 student records for our analysis [7]. The
confusion matrix [8] shows the number of passes and fails in an internal examination. The performance of the
above algorithm was evaluated using the following three methods, which are explained below:
Performance Measures
The performance of the classifiers is evaluated on the basis of parameters such as TP rate, FP rate, precision,
recall and F-measure. The accuracy of a classifier on a given test set is the percentage of test set tuples that are
correctly classified by the classifier. The error rate or misclassification rate of a classifier M is 1 - Acc(M), where
Acc(M) is the accuracy of M. The confusion matrix is a useful tool for analysing how well a classifier can recognize
tuples of different classes. The sensitivity and specificity measures can be used to calculate the accuracy of
classifiers. Sensitivity is also referred to as the true positive rate (the proportion of positive tuples that are
correctly identified), while specificity is the true negative rate (that is, the proportion of negative tuples that are
correctly identified). Precision is the proportion of tuples labelled as positive that actually are positive.
\[ \text{sensitivity} = \frac{\text{T-Pos}}{\text{Pos}} \tag{4} \]
\[ \text{specificity} = \frac{\text{T-Neg}}{\text{Neg}} \tag{5} \]
\[ \text{precision} = \frac{\text{T-Pos}}{\text{T-Pos} + \text{F-Pos}} \tag{6} \]
In (4), (5) and (6), T-Pos is the number of true positives (positive tuples that were correctly classified), Pos is the
number of positive tuples, T-Neg is the number of true negatives (negative tuples that were correctly classified),
Neg is the number of negative tuples, and F-Pos is the number of false positives (negative tuples that were
incorrectly labelled as positive). It can be shown that accuracy is a function of sensitivity and specificity:
\[ \text{accuracy} = \text{sensitivity} \cdot \frac{\text{Pos}}{\text{Pos} + \text{Neg}} + \text{specificity} \cdot \frac{\text{Neg}}{\text{Pos} + \text{Neg}} \tag{7} \]
True Positive Rate: It is the proportion of actual positives which are predicted as positive. The formula is defined as
\[ \text{TP rate} = \frac{Tp}{Tp + Fn} \tag{8} \]
where Tp stands for true positives and Fn stands for false negatives.
FP rate: It is the proportion of negative tuples that are incorrectly labelled as positive. The formula is defined as,
(9.a)
(9.b)
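The measures described in this section can be computed directly from the four cells of a two-class confusion matrix. The sketch below uses the standard textbook definitions, including FP rate = F-Pos / Neg, which is assumed here; the function name and the counts are hypothetical and are not the study's results.

```python
def classifier_metrics(t_pos, f_neg, f_pos, t_neg):
    """Compute common measures from the cells of a 2x2 confusion matrix."""
    pos = t_pos + f_neg  # all actual positive tuples
    neg = t_neg + f_pos  # all actual negative tuples
    sensitivity = t_pos / pos  # true positive rate
    specificity = t_neg / neg  # true negative rate
    fp_rate = f_pos / neg      # equals 1 - specificity
    accuracy = (t_pos + t_neg) / (pos + neg)
    # Accuracy rebuilt from sensitivity and specificity, weighted by class share.
    weighted = (sensitivity * pos / (pos + neg)
                + specificity * neg / (pos + neg))
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fp_rate": fp_rate,
        "accuracy": accuracy,
        "weighted_accuracy": weighted,
        "error_rate": 1 - accuracy,
    }


# Hypothetical counts, chosen only to exercise the formulas.
metrics = classifier_metrics(t_pos=90, f_neg=10, f_pos=5, t_neg=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The `accuracy` and `weighted_accuracy` entries agree for any counts, which is a quick check on the relationship between accuracy, sensitivity and specificity.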
200 0 0
0 166 0
0 0 34
From a dataset of 700 student records, we chose a sample of 200 student records for our analysis. The confusion
matrix shows the number of passes and fails in the internal examination: 166 students passed and 34 students
failed. The data analysis is performed using precision, recall and F-measure.
Precision
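Precision is the positive predictive value; in information retrieval it is the fraction of retrieved documents
that are relevant. Precision is calculated using the formula:
\[ \text{precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|} \tag{10} \]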
Precision takes all retrieved documents into account, but it can also be evaluated at a given cut-off rank,
considering only the topmost results returned by the system. This measure is called precision at n.
Recall
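Recall in information retrieval is the fraction of the documents relevant to the query that are successfully
retrieved:
\[ \text{recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|} \tag{11} \]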
F-Measure
The measure that combines precision and recall as their harmonic mean is known as the traditional F-measure:
\[ F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{12} \]
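Precision, recall and the F-measure can be computed from paired lists of actual and predicted labels. The sketch below is illustrative only: the function name, the choice of PASS as the positive class and the six example labels are assumptions, not data from the study.

```python
def precision_recall_f(actual, predicted, positive="PASS"):
    """Return (precision, recall, F-measure) for the chosen positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up labels for six students.
actual = ["PASS", "PASS", "PASS", "FAIL", "FAIL", "PASS"]
predicted = ["PASS", "FAIL", "PASS", "PASS", "FAIL", "PASS"]
precision, recall, f_measure = precision_recall_f(actual, predicted)
print(precision, recall, f_measure)
```

Because the F-measure is a harmonic mean, it stays low whenever either precision or recall is low, so it cannot be inflated by one strong value alone.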
[Figure: chart of Naive Bayes results; axis ticks 0, 0.5 and 1]
5. CONCLUSIONS
Using the Naive Bayes algorithm, we predicted the pass and fail percentages of all students who appeared for a
particular examination by comparing their feedback on their course sessions with their internal marks. The results
reflect the students' performance and appear to be accurate. In the comparison between feedback and internal
examination marks, the Naive Bayes algorithm gives a better prediction result, as measured using the confusion
matrix. The results are predicted within 2 seconds. This simple analysis shows that a proper data mining
application on student performance data can be used efficiently to retrieve vital hidden knowledge from large
volumes of data, which the management of an educational institution can use for decision making. It helps
institutions identify weaker students in advance so that they can arrange special measures to help those students
achieve good scores. This paper also concludes that data mining applications give effective and faster results for
prediction, classification and clustering, and that institutions can improve their quality based on the
6. REFERENCES
1. Alam, M. and Alam, S. (2012). Actionable Knowledge Mining from Improved Post Processing Decision Trees,
2. Ayinde, A., Adetunji, A., Bello, M. and Odeniyi, O. (2013). Performance Evaluation of Naive Bayes and Decision
Stump Algorithms in Mining Students' Educational Data. IJCSI International Journal of Computer Science Issues,
3. Aziz, A. A., Ismail, N. H. and Ahmad, F. (2014). First Semester Computer Science
Students' Academic Performances Analysis by Using Data Mining Classification Algorithms, Proceeding of the
4. Durairaj, M. and Vanitha, M. (2014). Educational Data mining for Prediction of Student Performance Using
Clustering Algorithms, International Journal of Computer Science and Information Technologies, pp. 5987-5991.
5. Jason, D. and Rennie, M. and Lawrence, S. (2010). Tackling the Poor Assumptions of Naive Bayes Text
6. Liu, B. (2011). Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Data-Centric Systems and
7. Rauf, A. and Sheeba (2012). Enhanced K-Mean Clustering Algorithm to Reduce Number of Iterations and
Dealing with Class Imbalance, International Swaps and Derivatives Association, pp. 878-883.