Predicting Student Retention in Higher Education Using Random Forest
5 authors, including Said A. Salloum and Khaled Shaalan (British University in Dubai, khaled.shaalan@buid.ac.ae)
1 Introduction
Student retention is a critical concern for higher education institutions, impacting both
institutional success and student outcomes. High dropout rates can lead to significant
financial losses for universities and adverse effects on students’ personal and professional
futures. According to [1], student retention is influenced by a complex interplay of
factors including academic performance, socioeconomic background, engagement with
the university community, and personal circumstances. Early identification of at-risk
students allows institutions to provide targeted interventions that can improve retention
rates. For instance, academic support programs, counseling services, and financial aid
can be strategically deployed to support students who are at risk of dropping out.
Despite the importance of student retention, many higher education institutions strug-
gle to develop effective predictive models that can accurately identify students who are
likely to drop out. Traditional methods often fail to capture the complexity of the factors
involved and may not be adaptable to the unique contexts of different institutions. Astin
[2] emphasizes that understanding what matters in college is crucial for improving reten-
tion rates. However, there is a need for more sophisticated approaches that can leverage
the vast amounts of data available to institutions and provide actionable insights. This gap
in the existing literature and practices highlights the necessity for innovative solutions
in predicting student retention.
This study leverages advanced machine learning techniques to predict student reten-
tion. Specifically, we use a RandomForestClassifier to analyze a comprehensive dataset
of student records. The dataset includes various features related to demographics, aca-
demic performance, and other relevant factors that influence retention. Our methodol-
ogy involves several steps: data preprocessing to encode categorical variables and scale
numerical features, hyperparameter tuning using GridSearchCV to optimize the model,
and evaluation of the model’s performance using metrics such as accuracy, precision,
recall, F1-score, and ROC curves. Visualizations are generated to provide deeper insights
into the model’s behavior.
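The paper does not reproduce its code, but the steps described above (scaling numerical features, tuning a RandomForestClassifier with GridSearchCV, and scoring on a held-out split) can be sketched with scikit-learn. The synthetic data and the grid values below are illustrative assumptions, not the study's actual dataset or settings:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in for the student-records dataset:
# numeric features plus a 3-class target (0=Dropout, 1=Enrolled, 2=Graduate).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = rng.integers(0, 3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Scale numerical features (categorical encoding would precede this step).
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Hyperparameter grid -- these values are illustrative assumptions.
param_grid = {"n_estimators": [100, 200], "max_depth": [None, 10]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid, cv=3, scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))
```

Fitting the scaler on the training split only, then applying it to the test split, avoids leaking test-set statistics into the model.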
The primary contribution of this study is the development of a predictive model for
student retention that can be used by higher education institutions to identify at-risk
students early. By integrating such predictive models into their student support systems,
universities can implement targeted interventions to enhance student success. This proac-
tive approach can lead to better academic outcomes for students and reduced financial
losses for institutions due to dropouts. Furthermore, the study provides a methodological
framework that can be adapted and extended by other researchers and practitioners in the
field. The remainder of this paper is structured as follows: Sect. 2 reviews the related
literature, Sect. 3 describes the dataset and methodology, Sect. 4 presents the results,
and Sect. 5 presents the conclusions and suggests directions for future research.
2 Literature Review
Research has extensively studied the various factors influencing student retention in
higher education. Tinto [1] identifies academic integration and social integration as crit-
ical components affecting a student’s decision to stay or leave an institution. Academic
integration refers to the extent to which students feel connected to their academic pro-
grams, while social integration pertains to their involvement in the campus community
[3, 4]. Additionally, financial issues, family responsibilities, and personal challenges
significantly impact retention rates. These findings suggest that retention strategies need
to be multifaceted, addressing both academic and non-academic aspects of student life.
The advent of data analytics has brought new dimensions to predicting student reten-
tion. According to [5], traditional statistical models have been supplemented by machine
learning algorithms, which can handle large datasets and identify complex patterns.
Logistic regression, decision trees, and neural networks are among the models used to
predict retention. Recent studies, such as those by [6], have demonstrated that machine
learning models can achieve higher accuracy in predicting dropouts compared to tra-
ditional methods. These models consider a wide range of variables, from academic
performance to extracurricular involvement, providing a more comprehensive analysis.
Machine learning techniques have proven effective in educational contexts due to
their ability to analyze vast amounts of data and uncover hidden patterns. Random
forests, support vector machines, and ensemble methods are popular choices for educa-
tional data mining. For instance, [7] found that ensemble methods combining multiple
algorithms often outperform single models in predicting student outcomes. Furthermore,
the flexibility of machine learning models allows them to be tailored to the specific char-
acteristics of different institutions, enhancing their predictive power. However, these
techniques also require careful tuning and validation to avoid overfitting and ensure
generalizability.
Despite the advances in predictive modeling, several challenges and gaps remain.
Many studies, including those by [8], highlight the need for more personalized and adap-
tive models that can accommodate the unique needs of individual students. There is also
a call for integrating qualitative data, such as student feedback and engagement metrics,
which are often overlooked in quantitative analyses. Moreover, most existing models
focus on short-term predictions, while long-term retention strategies require continu-
ous monitoring and adjustment. Addressing these gaps could lead to more effective and
sustainable retention initiatives.
This study aims to fill these gaps by developing a predictive model that not only
leverages advanced machine learning techniques but also incorporates a broad spectrum
of variables, including demographic, academic, and engagement factors. By optimizing
and validating the model using GridSearchCV, this research seeks to enhance its accuracy
and applicability across different institutional contexts. The goal is to create a tool that
higher education institutions can use to implement timely and targeted interventions,
ultimately improving student retention rates.
3 Methodology
3.1 Data Source
The dataset used in this study is sourced from Kaggle [9]. It contains various features
related to students’ demographics, academic performance, and other relevant factors. The
dataset includes columns such as Marital status, Application mode, Application order,
Course, Daytime/evening attendance, Previous qualification, Nationality, Mother’s qual-
ification, Father’s qualification, Mother’s occupation, Father’s occupation, and several
others, ending with the target variable indicating whether a student is a “Dropout,”
“Graduate,” or “Enrolled.”
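A minimal sketch of how such a dataset might be loaded and its target encoded with scikit-learn. The miniature DataFrame below stands in for the real Kaggle file, and its column values are placeholders:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny stand-in for the Kaggle file; the real data would be loaded with
# pd.read_csv(...) using the downloaded file's path.
df = pd.DataFrame({
    "Marital status": [1, 1, 2],
    "Application mode": [17, 1, 39],
    "Previous qualification": [1, 1, 19],
    "Target": ["Dropout", "Graduate", "Enrolled"],
})

# Encode the three-class target. Alphabetical order gives
# Dropout -> 0, Enrolled -> 1, Graduate -> 2, matching the labels in Sect. 4.
le = LabelEncoder()
df["Target"] = le.fit_transform(df["Target"])
print(le.classes_.tolist())  # ['Dropout', 'Enrolled', 'Graduate']
```

Note that LabelEncoder's alphabetical ordering happens to produce the same 0/1/2 class indices used in the results section.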
4 Results
4.1 Confusion Matrix
The confusion matrix (Fig. 1) provides a detailed breakdown of the model’s perfor-
mance in classifying the students into the three categories: Dropout (0), Enrolled (1),
and Graduate (2). The matrix shows that the model correctly identified 241 Dropouts,
42 Enrolled, and 396 Graduates. However, there were also some misclassifications: 52
Dropouts were classified as Graduates, 72 Enrolled were classified as Graduates, and 14
Graduates were classified as Enrolled.
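A matrix like Fig. 1 can be produced with scikit-learn's confusion_matrix; the toy labels below are illustrative, not the study's actual predictions:

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration; in the study, y_true and y_pred
# come from the held-out test split.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 2, 1, 2, 2, 2, 1]

# Rows are true classes, columns predicted: 0=Dropout, 1=Enrolled, 2=Graduate.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)
```

The diagonal holds the correctly classified counts (e.g. the 241 Dropouts reported above), and each off-diagonal cell holds one kind of misclassification.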
The Receiver Operating Characteristic (ROC) curve (Fig. 2) illustrates the true positive
rate (sensitivity) against the false positive rate (1-specificity) for the RandomForestClas-
sifier. The ROC curve is plotted for each class, and the area under the curve (AUC) is
calculated to measure the model’s ability to distinguish between the classes. The AUC
values for Class 0 (Dropout) and Class 1 (Enrolled) are 0.91 and 0.79, respectively.
These values indicate that the model performs well in differentiating between Dropouts
and Graduates but has some difficulty distinguishing between Enrolled and the other
two classes.
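Per-class AUC values like those reported for Fig. 2 can be computed in a one-vs-rest fashion. The synthetic data here is a stand-in for the student dataset, so the resulting AUCs will not match the paper's 0.91 and 0.79:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic 3-class data standing in for the student dataset.
X, y = make_classification(n_samples=400, n_classes=3,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)

# One-vs-rest AUC per class: binarize the labels, then score each
# class's predicted probability column against its indicator column.
y_bin = label_binarize(y_te, classes=[0, 1, 2])
aucs = [roc_auc_score(y_bin[:, k], proba[:, k]) for k in range(3)]
print([round(a, 2) for a in aucs])
```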
The training and testing accuracy and loss curves (Fig. 3) provide insights into the
model’s learning process and performance across different hyperparameter combina-
tions. The curves show that the model achieves high training accuracy, indicating that it
fits well to the training data. However, there is a slight drop in testing accuracy, suggesting
some degree of overfitting. The testing loss is also higher than the training loss, which
is expected, since models generally fit the data they were trained on better than unseen
data.
The classification report (Table 1) summarizes the precision, recall, and F1-score for
each class. Precision measures the accuracy of the positive predictions, recall measures
the ability to capture all positive instances, and the F1-score is the harmonic mean of
precision and recall. The results indicate that the model performs best for the Graduate
class (2), with a precision of 0.76, recall of 0.95, and F1-score of 0.84. The model
performs moderately well for the Dropout class (0) with an F1-score of 0.80, but less
effectively for the Enrolled class (1) with an F1-score of 0.37.
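Per-class metrics of the kind summarized in Table 1 correspond to scikit-learn's classification_report. A sketch on toy labels follows; the numbers it produces are illustrative, not Table 1's:

```python
from sklearn.metrics import classification_report, f1_score

# Toy labels for illustration (0=Dropout, 1=Enrolled, 2=Graduate).
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 2, 1, 2, 2, 2, 2, 1]

# Precision, recall, and F1 per class, as in Table 1.
print(classification_report(y_true, y_pred, digits=2))

# F1 for the Graduate class: precision 3/5, recall 3/4, harmonic mean 2/3.
print(round(f1_score(y_true, y_pred, average=None)[2], 2))  # 0.67
```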
5 Conclusion
This study demonstrates the potential of using a RandomForestClassifier to predict stu-
dent retention in higher education. By leveraging a comprehensive dataset that includes
demographic, academic, and socio-economic factors, we were able to build a model that
achieved an overall accuracy of 76.72%. The model’s performance, particularly in accu-
rately identifying Dropouts and Graduates, underscores its robustness and applicability
in real-world educational settings. The detailed analysis provided by the confusion matrix
revealed the model’s strengths and areas for improvement. While the model performed
well in classifying Dropouts and Graduates, it faced challenges in accurately identifying
Enrolled students. The ROC curves and corresponding AUC values further highlighted
these findings, showing strong performance for Dropouts and Graduates but a lower
AUC for the Enrolled category. This indicates that while the model is effective in certain
areas, there is room for refinement. The implications of these findings are significant
for higher education institutions. Integrating predictive models like the RandomForest-
Classifier into student support systems can enable institutions to proactively identify
and assist at-risk students. This proactive approach aligns with Tinto’s framework [1], which
emphasizes the importance of both academic and social integration in improving stu-
dent retention. By identifying students who are likely to drop out, institutions can deploy
targeted interventions such as academic support, counseling, and financial aid, thereby
enhancing student success and retention rates. Furthermore, the insights gained from this
study can inform the development of more effective retention strategies. Understanding
the factors that contribute to student retention allows institutions to design policies and
programs that address these issues directly. This can lead to more holistic support systems
that not only help students academically but also address their social and personal needs.
Despite its strengths, this study has several limitations. The model’s lower performance
in predicting Enrolled students suggests that additional features or alternative models
may be needed to improve accuracy. The dataset, while comprehensive, may not capture
all relevant factors influencing student retention. Future research could incorporate qual-
itative data, such as student surveys or engagement metrics, to provide a more holistic
view of the factors affecting retention. Another limitation is the potential for bias in the
dataset. If certain groups of students are underrepresented, the model’s predictions may
not be equally accurate across all demographics. Ensuring a diverse and representative
dataset is crucial for developing fair and equitable predictive models [15]. Additionally,
the study’s reliance on historical data means that the model’s predictions are based on
past trends, which may not fully capture future changes in student behavior or insti-
tutional policies. There are several avenues for future research that could build on the
findings of this study. Firstly, incorporating additional features, such as behavioral data
from learning management systems or psychological factors, could enhance the model’s
predictive accuracy. Exploring other machine learning models, such as neural networks
or gradient boosting machines, may also yield better performance, particularly for the
Enrolled category [17]. Moreover, future studies should consider longitudinal data to
assess the model’s performance over time and its ability to adapt to changing student
populations and institutional contexts. Longitudinal studies would provide insights into
how retention predictors evolve and whether early interventions have long-term impacts
on student success. Additionally, developing adaptive and personalized models that can
provide individualized predictions and recommendations would be a valuable direc-
tion. Such models could continuously learn from new data and adjust predictions and
interventions in real-time, offering more dynamic and responsive support to students.
Finally, collaboration between researchers and educational institutions is essential to
ensure that predictive models are practically applicable and aligned with the needs of
the institutions. Engaging with stakeholders, including educators, administrators, and
students, can provide valuable feedback and ensure that the models are used ethically
and effectively. In conclusion, this study demonstrates the significant potential of using
machine learning techniques to predict student retention in higher education. The Ran-
domForestClassifier, with its robust performance and detailed insights, offers a valuable
tool for institutions aiming to improve student retention rates. By continuing to refine and
enhance these predictive models, higher education institutions can better understand and
support their students, ultimately leading to improved academic outcomes and reduced
dropout rates. The journey towards more effective retention strategies is ongoing, and
this study provides a solid foundation for future research and practical applications.
References
1. Tinto, V.: Leaving College: Rethinking the Causes and Cures of Student Attrition. University
of Chicago Press, Chicago (2012)
2. Astin, A.W.: What Matters in College: Four Critical Years Examined. Jossey-Bass, San
Francisco (1997)
3. Tinto, V.: Leaving College: Rethinking the Causes and Cures of Student Attrition. ERIC (1987)
4. Braxton, J.M.: Leaving college: rethinking the causes and cures of student attrition by Vincent
Tinto. J. Coll. Stud. Dev. 60, 129–134 (2019)
5. Herzog, S.: Estimating student retention and degree-completion time: decision trees and
neural networks vis-à-vis regression. New Dir. Inst. Res. 131, 17–33 (2006)
6. Aulck, L., Velagapudi, N., Blumenstock, J., West, J.: Predicting student dropout in higher
education. arXiv Preprint arXiv:1606.06364 (2016)
7. Rahul, Katarya, R.: A systematic review on predicting the performance of students in higher
education in offline mode using machine learning techniques. Wireless Pers. Commun. 133(1),
1–32 (2024). https://doi.org/10.1007/s11277-023-10838-x
8. Braxton, J.M., Doyle, W.R., Hartley, H.V., III., et al.: Rethinking College Student Retention.
Wiley (2013)
9. Predict students’ dropout and academic success. In: Kaggle (2023). https://www.kaggle.com/
datasets/thedevastator/higher-education-predictors-of-student-retention
10. Pedregosa, F., Varoquaux, G., Gramfort, A., et al.: Scikit-learn: machine learning in Python.
J. Mach. Learn. Res. 12, 2825–2830 (2011)
11. Bishop, C.M., Nasrabadi, N.M.: Pattern Recognition and Machine Learning. Springer (2006)
12. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
13. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn.
Res. 3, 1157–1182 (2003)
14. Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. J. Mach. Learn.
Res. 13(2) (2012)
15. Sokolova, M., Lapalme, G.: A systematic analysis of performance measures for classification
tasks. Inf. Process. Manag. 45, 427–437 (2009)
16. Fawcett, T.: An introduction to ROC analysis. Pattern Recognit. Lett. 27, 861–874 (2006)
17. Akmeşe, Ö.F., Kör, H., Erbay, H.: Use of machine learning techniques for the forecast of
student achievement in higher education. Inf. Technol. Learn. Tools 82, 297–311 (2021)