Predicting Student Retention in Higher Education Using Machine Learning

Said A. Salloum1,2(B), Azza Basiouni3, Raghad Alfaisal4, Ayham Salloum5, and Khaled Shaalan6

1 Health Economic and Financing Group, University of Sharjah, Sharjah, UAE
  ssalloum@sharjah.ac.ae
2 School of Science, Engineering, and Environment, University of Salford, Salford, United Kingdom
3 Faculty of Information Technology, Liwa College, Abu Dhabi, UAE
  azza.basiouni@lc.ac.ae
4 Faculty of Computing and Meta-Technology, Universiti Pendidikan Sultan Idris, Perak, Malaysia
5 College of Medicine, University of Sharjah, Sharjah, UAE
6 Faculty of Engineering and IT, The British University in Dubai, Dubai, UAE
  khaled.shaalan@buid.ac.ae

Abstract. Student retention is a critical concern for higher education institutions worldwide, impacting both institutional success and student outcomes. High dropout rates can lead to significant financial losses for universities and detrimental effects on students’ personal and professional futures. Predicting student retention accurately enables institutions to proactively address factors leading to dropouts and implement targeted interventions to support at-risk students. This study addresses the problem of student retention prediction by leveraging advanced machine learning techniques. Specifically, we utilized a RandomForestClassifier to analyze a comprehensive dataset of student records, which includes various features related to demographics, academic performance, and other relevant factors influencing student retention. Our methodology involved several steps: data preprocessing to encode categorical variables and scale numerical features, hyperparameter tuning using GridSearchCV to optimize the model, and evaluation of the model’s performance using metrics such as accuracy, precision, recall, F1-score, and ROC curves. Visualizations were generated to provide deeper insights into the model’s performance and behavior. The results of our analysis indicate that the RandomForestClassifier can effectively predict student retention, achieving an accuracy score of 76.72%. This performance demonstrates the model’s potential as a valuable tool for higher education institutions aiming to improve student retention rates. By integrating such predictive models into their student support systems, universities can identify at-risk students early and provide targeted support to enhance their chances of success. This proactive approach can lead to better academic outcomes for students and reduced financial losses for institutions due to dropouts. Future research could explore the integration of additional features and alternative machine learning models to further improve predictive accuracy and applicability in diverse educational contexts.


Keywords: Dropout · higher education · machine learning · predictive model · Random Forest Classifier · student retention

1 Introduction

Student retention is a critical concern for higher education institutions, impacting both
institutional success and student outcomes. High dropout rates can lead to significant
financial losses for universities and adverse effects on students’ personal and professional
futures. According to [1], student retention is influenced by a complex interplay of
factors including academic performance, socioeconomic background, engagement with
the university community, and personal circumstances. Early identification of at-risk
students allows institutions to provide targeted interventions that can improve retention
rates. For instance, academic support programs, counseling services, and financial aid
can be strategically deployed to support students who are at risk of dropping out.
Despite the importance of student retention, many higher education institutions strug-
gle to develop effective predictive models that can accurately identify students who are
likely to drop out. Traditional methods often fail to capture the complexity of the factors
involved and may not be adaptable to the unique contexts of different institutions. Astin
[2] emphasizes that understanding what matters in college is crucial for improving reten-
tion rates. However, there is a need for more sophisticated approaches that can leverage
the vast amounts of data available to institutions and provide actionable insights. This gap
in the existing literature and practices highlights the necessity for innovative solutions
in predicting student retention.
This study leverages advanced machine learning techniques to predict student reten-
tion. Specifically, we use a RandomForestClassifier to analyze a comprehensive dataset
of student records. The dataset includes various features related to demographics, aca-
demic performance, and other relevant factors that influence retention. Our methodol-
ogy involves several steps: data preprocessing to encode categorical variables and scale
numerical features, hyperparameter tuning using GridSearchCV to optimize the model,
and evaluation of the model’s performance using metrics such as accuracy, precision,
recall, F1-score, and ROC curves. Visualizations are generated to provide deeper insights
into the model’s behavior.
The primary contribution of this study is the development of a predictive model for
student retention that can be used by higher education institutions to identify at-risk
students early. By integrating such predictive models into their student support systems,
universities can implement targeted interventions to enhance student success. This proac-
tive approach can lead to better academic outcomes for students and reduced financial
losses for institutions due to dropouts. Furthermore, the study provides a methodological
framework that can be adapted and extended by other researchers and practitioners in the
field. The remainder of this paper is structured as follows: Sect. 2 reviews the related
literature, Sect. 3 describes the dataset and outlines the methodology, Sect. 4 presents
the results, and Sect. 5 concludes and suggests directions for future research.

2 Literature Review

Research has extensively studied the various factors influencing student retention in
higher education. Tinto [1] identifies academic integration and social integration as crit-
ical components affecting a student’s decision to stay or leave an institution. Academic
integration refers to the extent to which students feel connected to their academic pro-
grams, while social integration pertains to their involvement in the campus community
[3, 4]. Additionally, financial issues, family responsibilities, and personal challenges
significantly impact retention rates. These findings suggest that retention strategies need
to be multifaceted, addressing both academic and non-academic aspects of student life.
The advent of data analytics has brought new dimensions to predicting student reten-
tion. According to [5], traditional statistical models have been supplemented by machine
learning algorithms, which can handle large datasets and identify complex patterns.
Logistic regression, decision trees, and neural networks are among the models used to
predict retention. Recent studies, such as those by [6], have demonstrated that machine
learning models can achieve higher accuracy in predicting dropouts compared to tra-
ditional methods. These models consider a wide range of variables, from academic
performance to extracurricular involvement, providing a more comprehensive analysis.
Machine learning techniques have proven effective in educational contexts due to
their ability to analyze vast amounts of data and uncover hidden patterns. Random
forests, support vector machines, and ensemble methods are popular choices for educa-
tional data mining. For instance, [7] found that ensemble methods combining multiple
algorithms often outperform single models in predicting student outcomes. Furthermore,
the flexibility of machine learning models allows them to be tailored to the specific char-
acteristics of different institutions, enhancing their predictive power. However, these
techniques also require careful tuning and validation to avoid overfitting and ensure
generalizability.
Despite the advances in predictive modeling, several challenges and gaps remain.
Many studies, including those by [8], highlight the need for more personalized and adap-
tive models that can accommodate the unique needs of individual students. There is also
a call for integrating qualitative data, such as student feedback and engagement metrics,
which are often overlooked in quantitative analyses. Moreover, most existing models
focus on short-term predictions, while long-term retention strategies require continu-
ous monitoring and adjustment. Addressing these gaps could lead to more effective and
sustainable retention initiatives.
This study aims to fill these gaps by developing a predictive model that not only
leverages advanced machine learning techniques but also incorporates a broad spectrum
of variables, including demographic, academic, and engagement factors. By optimizing
and validating the model using GridSearchCV, this research seeks to enhance its accuracy
and applicability across different institutional contexts. The goal is to create a tool that
higher education institutions can use to implement timely and targeted interventions,
ultimately improving student retention rates.

3 Methodology
3.1 Data Source
The dataset used in this study is sourced from Kaggle [9]. It contains various features
related to students’ demographics, academic performance, and other relevant factors. The
dataset includes columns such as Marital status, Application mode, Application order,
Course, Daytime/evening attendance, Previous qualification, Nationality, Mother’s qual-
ification, Father’s qualification, Mother’s occupation, Father’s occupation, and several
others, ending with the target variable indicating whether a student is a “Dropout,”
“Graduate,” or “Enrolled.”
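To make the pipeline concrete, the following minimal sketch loads the dataset with pandas. The file name dataset.csv and the Target column name follow the Kaggle listing [9] and are assumptions rather than details reported in this chapter:

    import pandas as pd

    # Load the Kaggle CSV (file name assumed; adjust to the downloaded file).
    df = pd.read_csv("dataset.csv")

    # The target column holds the three labels described above.
    print(df["Target"].value_counts())  # Dropout / Graduate / Enrolled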

3.2 Data Preprocessing


Data preprocessing is a crucial step in preparing the data for machine learning algo-
rithms. This involves handling missing values, encoding categorical variables, and scal-
ing numerical features. In our study, we used LabelEncoder from the sklearn library
to encode categorical variables, transforming them into numerical format suitable for
machine learning models [10]. Numerical features were standardized using StandardScaler
to ensure they have a mean of zero and a standard deviation of one, which
helps improve the performance of the machine learning models by ensuring all features
contribute equally to the model’s learning process [11].
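A minimal sketch of this preprocessing step, continuing from the loading snippet in Sect. 3.1 (column names are assumptions; in the Kaggle file the categorical feature columns appear to be integer-coded already, so only the target needs encoding here):

    from sklearn.preprocessing import LabelEncoder, StandardScaler

    # Encode the categorical target ("Dropout", "Enrolled", "Graduate") as integers.
    label_encoder = LabelEncoder()
    df["Target"] = label_encoder.fit_transform(df["Target"])

    # Standardize all remaining columns to zero mean and unit variance.
    feature_cols = [c for c in df.columns if c != "Target"]
    scaler = StandardScaler()
    df[feature_cols] = scaler.fit_transform(df[feature_cols])

Note that LabelEncoder orders classes alphabetically, so Dropout, Enrolled, and Graduate map to 0, 1, and 2, which matches the class coding used in Sect. 4.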

3.3 Model Selection


The Random Forest algorithm was selected for this study due to its robustness and effec-
tiveness in handling both binary and multi-class classification problems. Random Forest
operates by constructing a multitude of decision trees at training time and outputting the
class that is the mode of the classes predicted by individual trees. This ensemble method
is advantageous due to its ability to improve prediction accuracy and control overfitting,
making it highly reliable for complex classification tasks [12]. The One-vs-Rest (OvR)
strategy was employed to adapt the Random Forest algorithm for multi-class classifica-
tion tasks. In the OvR approach, a separate model is trained for each class to distinguish
instances of that class from all other classes, which simplifies the multi-class problem
into multiple binary classification problems. This method was chosen because of its
scalability and ease of implementation, especially useful when dealing with a limited
number of classes as in this dataset.
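The following sketch mirrors this setup. The 80/20 split and random seed are assumptions: the 4,424-record Kaggle dataset under a 20% test split yields the 885 test instances reported in Table 1, which suggests a split of roughly this size, though the exact proportion is not stated in the chapter. Note also that scikit-learn's RandomForestClassifier handles multi-class problems natively; the explicit OneVsRestClassifier wrapper simply makes the strategy described above visible:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.multiclass import OneVsRestClassifier

    X_train, X_test, y_train, y_test = train_test_split(
        df[feature_cols], df["Target"], test_size=0.2, random_state=42)

    # One binary Random Forest per class, each separating that class from the rest.
    ovr_model = OneVsRestClassifier(RandomForestClassifier(random_state=42))
    ovr_model.fit(X_train, y_train)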

3.4 Feature Selection and Engineering


Feature selection and engineering are vital in enhancing the predictive power of the
model. We explored various techniques to select the most relevant features that con-
tribute significantly to the target variable. According to [13], feature selection helps
reduce the dimensionality of the dataset, remove noise, and improve model perfor-
mance. In this study, we initially included all features provided in the dataset, and later
refined them based on their importance scores from the RandomForestClassifier. Feature
engineering involved creating new features and transforming existing ones to capture
more information that might improve the model’s predictive accuracy.
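One way to obtain such importance scores is the fitted forest's feature_importances_ attribute (mean decrease in impurity); a sketch under the same assumptions as above:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

    # Impurity-based importance per feature, sorted from most to least important.
    importances = pd.Series(rf.feature_importances_, index=X_train.columns)
    print(importances.sort_values(ascending=False).head(10))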

3.5 Model Training and Evaluation


For model training, we employed the RandomForestClassifier, a robust ensemble learn-
ing method known for its high accuracy and ability to handle large datasets with numer-
ous features [12]. Random forests operate by constructing multiple decision trees dur-
ing training and outputting the mode of the classes (classification) or mean prediction
(regression) of the individual trees. This method is effective in reducing overfitting and
improving generalization.
Hyperparameter tuning is essential for optimizing the model’s performance. We used
GridSearchCV to systematically explore the hyperparameters of the
RandomForestClassifier, including the number of trees in the forest (n_estimators), the maximum depth
of the tree (max_depth), and the function to measure the quality of a split (criterion).
GridSearchCV performs an exhaustive search over specified parameter values and uses
cross-validation to ensure the model’s performance is robust and not reliant on specific
data splits [14].
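A sketch of this search over the three hyperparameters named above; the candidate values and the five-fold cross-validation setting are illustrative assumptions, not the grid reported in this chapter:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "n_estimators": [100, 200, 500],
        "max_depth": [None, 10, 20],
        "criterion": ["gini", "entropy"],
    }
    grid = GridSearchCV(RandomForestClassifier(random_state=42),
                        param_grid, cv=5, scoring="accuracy")
    grid.fit(X_train, y_train)  # exhaustive, cross-validated search
    best_model = grid.best_estimator_
    print(grid.best_params_)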
The model’s performance was evaluated using several metrics to provide a com-
prehensive understanding of its effectiveness. The confusion matrix, precision, recall,
F1-score, and accuracy score were calculated to assess the classification performance.
According to [15], these metrics offer insights into different aspects of the model’s pre-
diction capabilities, such as its ability to correctly identify true positives and avoid false
positives.
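All of these metrics are available in scikit-learn; a sketch assuming the tuned model from the grid search above:

    from sklearn.metrics import (accuracy_score, classification_report,
                                 confusion_matrix)

    y_pred = best_model.predict(X_test)
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred,
                                target_names=["Dropout", "Enrolled", "Graduate"]))
    print("Accuracy:", round(accuracy_score(y_test, y_pred), 4))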
Moreover, we plotted ROC (Receiver Operating Characteristic) curves and calculated
the AUC (Area Under the Curve) to evaluate the model’s discriminatory power. ROC
curves are a graphical representation of a classifier’s ability to distinguish between
classes, and the AUC provides a single scalar value summarizing the performance [16].
Higher AUC values indicate better performance of the model in distinguishing between
the classes.
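In the multi-class setting, a common recipe, sketched below under the same assumptions as before, is to binarize the labels and plot one ROC curve per class:

    import matplotlib.pyplot as plt
    from sklearn.metrics import auc, roc_curve
    from sklearn.preprocessing import label_binarize

    # One-vs-rest binarization: column i is 1 where the true class is i.
    y_test_bin = label_binarize(y_test, classes=[0, 1, 2])
    y_score = best_model.predict_proba(X_test)

    for i, name in enumerate(["Dropout", "Enrolled", "Graduate"]):
        fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_score[:, i])
        plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.2f})")
    plt.plot([0, 1], [0, 1], "k--")  # chance-level reference line
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()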

3.6 Visualization and Interpretation


Visualizations were generated to interpret the model’s performance and understand its
behavior. We used seaborn and matplotlib libraries to create detailed plots of the confu-
sion matrix, ROC curves, and the training and testing accuracy and loss curves. These
visualizations help in diagnosing the model’s strengths and weaknesses, providing a
clearer picture of its predictive power and areas for improvement.
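For example, a confusion matrix in the style of Fig. 1 can be rendered as an annotated seaborn heatmap; a sketch reusing the predictions computed in Sect. 3.5:

    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.metrics import confusion_matrix

    labels = ["Dropout", "Enrolled", "Graduate"]
    cm = confusion_matrix(y_test, y_pred)
    # Annotated counts make per-class misclassifications easy to read off.
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
                xticklabels=labels, yticklabels=labels)
    plt.xlabel("Predicted class")
    plt.ylabel("True class")
    plt.show()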

4 Results
4.1 Confusion Matrix
The confusion matrix (Fig. 1) provides a detailed breakdown of the model’s perfor-
mance in classifying the students into the three categories: Dropout (0), Enrolled (1),
and Graduate (2). The matrix shows that the model correctly identified 241 Dropouts,
42 Enrolled, and 396 Graduates. However, there were also some misclassifications: 52
Dropouts were classified as Graduates, 72 Enrolled were classified as Graduates, and 14
Graduates were classified as Enrolled.

Fig. 1. The confusion matrix

4.2 ROC Curve

The Receiver Operating Characteristic (ROC) curve (Fig. 2) illustrates the true positive
rate (sensitivity) against the false positive rate (1 − specificity) for the
RandomForestClassifier. The ROC curve is plotted for each class, and the area under the curve (AUC) is
calculated to measure the model’s ability to distinguish between the classes. The AUC
values for Class 0 (Dropout) and Class 1 (Enrolled) are 0.91 and 0.79, respectively.
These values indicate that the model performs well in differentiating between Dropouts
and Graduates but has some difficulty distinguishing between Enrolled and the other
two classes.

4.3 Training and Testing Loss/Accuracy Curves

The training and testing accuracy and loss curves (Fig. 3) provide insights into the
model’s learning process and performance across different hyperparameter combina-
tions. The curves show that the model achieves high training accuracy, indicating that it
fits well to the training data. However, there is a slight drop in testing accuracy, suggest-
ing some degree of overfitting. The testing loss is higher than the training loss, which
is expected, since models typically perform better on the data they were trained on.

Fig. 2. ROC Curve

Fig. 3. The training and testing accuracy and loss curves

4.4 Classification Report

The classification report (Table 1) summarizes the precision, recall, and F1-score for
each class. Precision measures the accuracy of the positive predictions, recall measures
the ability to capture all positive instances, and the F1-score is the harmonic mean of
precision and recall. The results indicate that the model performs best for the Graduate
class (2), with a precision of 0.76, recall of 0.95, and F1-score of 0.84. The model
performs moderately well for the Dropout class (0) with an F1-score of 0.80, but less
effectively for the Enrolled class (1) with an F1-score of 0.37.
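As a worked check, the Graduate F1-score follows directly from the values in Table 1: F1 = 2 × (0.76 × 0.95) / (0.76 + 0.95) ≈ 0.84.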

Table 1. Classification Report

Class           Precision   Recall   F1-score   Support
0 (Dropout)     0.84        0.76     0.80       316
1 (Enrolled)    0.53        0.28     0.37       151
2 (Graduate)    0.76        0.95     0.84       418
Accuracy                             0.767      885
Macro avg.      0.71        0.66     0.67       885
Weighted avg.   0.75        0.77     0.75       885

The overall accuracy score of the RandomForestClassifier is 0.767, indicating that
approximately 76.7% of the predictions made by the model are correct. This level of
accuracy demonstrates the model’s potential as a tool for predicting student retention
in higher education. However, there is room for improvement, particularly in enhanc-
ing the model’s ability to correctly classify Enrolled students. The results from the
confusion matrix, ROC curve, and classification report collectively demonstrate that
the RandomForestClassifier is effective in predicting student retention, with notable
strengths in identifying Dropouts and Graduates. However, the model’s performance is
less robust for the Enrolled category, suggesting the need for further refinement and
potential integration of additional features or alternative machine learning techniques to
improve classification accuracy. The visualizations provide a clear and comprehensive
understanding of the model’s performance and areas for enhancement.

5 Conclusion
This study demonstrates the potential of using a RandomForestClassifier to predict stu-
dent retention in higher education. By leveraging a comprehensive dataset that includes
demographic, academic, and socio-economic factors, we were able to build a model that
achieved an overall accuracy of 76.72%. The model’s performance, particularly in accu-
rately identifying Dropouts and Graduates, underscores its robustness and applicability
in real-world educational settings.

The detailed analysis provided by the confusion matrix
revealed the model’s strengths and areas for improvement. While the model performed
well in classifying Dropouts and Graduates, it faced challenges in accurately identifying
Enrolled students. The ROC curves and corresponding AUC values further highlighted
these findings, showing strong performance for Dropouts and Graduates but a lower
AUC for the Enrolled category. This indicates that while the model is effective in certain
areas, there is room for refinement.

The implications of these findings are significant
for higher education institutions. Integrating predictive models like the
RandomForestClassifier into student support systems can enable institutions to proactively identify
and assist at-risk students. This proactive approach aligns with Tinto’s framework [1], which
emphasizes the importance of both academic and social integration in improving stu-
dent retention. By identifying students who are likely to drop out, institutions can deploy
targeted interventions such as academic support, counseling, and financial aid, thereby
enhancing student success and retention rates.

Furthermore, the insights gained from this
study can inform the development of more effective retention strategies. Understanding
the factors that contribute to student retention allows institutions to design policies and
programs that address these issues directly. This can lead to more holistic support systems
that not only help students academically but also address their social and personal needs.

Despite its strengths, this study has several limitations. The model’s lower performance
in predicting Enrolled students suggests that additional features or alternative models
may be needed to improve accuracy. The dataset, while comprehensive, may not capture
all relevant factors influencing student retention. Future research could incorporate qual-
itative data, such as student surveys or engagement metrics, to provide a more holistic
view of the factors affecting retention. Another limitation is the potential for bias in the
dataset. If certain groups of students are underrepresented, the model’s predictions may
not be equally accurate across all demographics. Ensuring a diverse and representative
dataset is crucial for developing fair and equitable predictive models [15]. Additionally,
the study’s reliance on historical data means that the model’s predictions are based on
past trends, which may not fully capture future changes in student behavior or insti-
tutional policies.

There are several avenues for future research that could build on the
findings of this study. Firstly, incorporating additional features, such as behavioral data
from learning management systems or psychological factors, could enhance the model’s
predictive accuracy. Exploring other machine learning models, such as neural networks
or gradient boosting machines, may also yield better performance, particularly for the
Enrolled category [17]. Moreover, future studies should consider longitudinal data to
assess the model’s performance over time and its ability to adapt to changing student
populations and institutional contexts. Longitudinal studies would provide insights into
how retention predictors evolve and whether early interventions have long-term impacts
on student success. Additionally, developing adaptive and personalized models that can
provide individualized predictions and recommendations would be a valuable direc-
tion. Such models could continuously learn from new data and adjust predictions and
interventions in real-time, offering more dynamic and responsive support to students.
Finally, collaboration between researchers and educational institutions is essential to
ensure that predictive models are practically applicable and aligned with the needs of
the institutions. Engaging with stakeholders, including educators, administrators, and
students, can provide valuable feedback and ensure that the models are used ethically
and effectively.

In conclusion, this study demonstrates the significant potential of using
machine learning techniques to predict student retention in higher education. The
RandomForestClassifier, with its robust performance and detailed insights, offers a valuable
tool for institutions aiming to improve student retention rates. By continuing to refine and
enhance these predictive models, higher education institutions can better understand and
support their students, ultimately leading to improved academic outcomes and reduced
dropout rates. The journey towards more effective retention strategies is ongoing, and
this study provides a solid foundation for future research and practical applications.

References
1. Tinto, V.: Leaving College: Rethinking the Causes and Cures of Student Attrition. University
of Chicago Press, Chicago (2012)
2. Astin, A.W.: What Matters in College: Four Critical Years Examined. Jossey-Bass, San
Francisco (1997)
3. Tinto, V.: Leaving College: Rethinking the Causes and Cures of Student Attrition. ERIC (1987)
4. Braxton, J.M.: Leaving college: rethinking the causes and cures of student attrition by Vincent
Tinto. J. Coll. Stud. Dev. 60, 129–134 (2019)
5. Herzog, S.: Estimating student retention and degree-completion time: decision trees and
neural networks vis-à-vis regression. New Dir. Inst. Res. 131, 17–33 (2006)
6. Aulck, L., Velagapudi, N., Blumenstock, J., West, J.: Predicting student dropout in higher
education. arXiv Preprint arXiv:1606.06364 (2016)
7. Rahul, Katarya, R.: A systematic review on predicting the performance of students in higher
education in offline mode using machine learning techniques. Wireless Pers. Commun. 133(1),
1–32 (2024). https://doi.org/10.1007/s11277-023-10838-x
8. Braxton, J.M., Doyle, W.R., Hartley, H.V., III., et al.: Rethinking College Student Retention.
Wiley (2013)
9. Predict students’ dropout and academic success. In: Kaggle (2023). https://www.kaggle.com/datasets/thedevastator/higher-education-predictors-of-student-retention
10. Pedregosa, F., Varoquaux, G., Gramfort, A., et al.: Scikit-learn: machine learning in Python.
J. Mach. Learn. Res. 12, 2825–2830 (2011)
11. Bishop, C.M., Nasrabadi, N.M.: Pattern Recognition and Machine Learning. Springer (2006)
12. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
13. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn.
Res. 3, 1157–1182 (2003)
14. Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. J. Mach. Learn.
Res. 13(2) (2012)
15. Sokolova, M., Lapalme, G.: A systematic analysis of performance measures for classification
tasks. Inf. Process. Manag. 45, 427–437 (2009)
16. Fawcett, T.: An introduction to ROC analysis. Pattern Recognit. Lett. 27, 861–874 (2006)
17. Akmeşe, Ö.F., Kör, H., Erbay, H.: Use of machine learning techniques for the forecast of
student achievement in higher education. Inf. Technol. Learn. Tools 82, 297–311 (2021)
