Automatic Essay Grading
Instruction-Tuned Transformers
1. Introduction
Manual essay grading is a tedious task that faces several challenges: it is time-consuming and costly, and evaluation is often inconsistent due to biases and differences in judgment among human graders. With significant advances in large language models (LLMs), it has become possible to develop automated solutions that offer more efficient evaluation, better scalability, and more interpretable results.
This project explores the potential of transformer-based LLMs for automated essay marking through several key steps, including dataset construction, instruction tuning with distilled supervision signals, and systematic evaluation.
2. Literature Review
2.1 From Traditional Methods to Deep Learning
Automated Essay Scoring (AES) began with statistical and rule-based systems such
as Project Essay Grade (PEG) and Intelligent Essay Assessor (IEA), which relied
on surface features such as word count and sentence length. These methods evolved into feature-based machine learning approaches, including regression models and early neural networks (e.g., CNNs, LSTMs), which achieved moderate success but offered limited contextual understanding.
Research by Mayfield & Black (2020) indicated that fine-tuned BERT models can
match or exceed classical AES systems, albeit with higher computational costs. Wang
et al. (2022) introduced a multi-scale representation approach using BERT, leading
to superior results on the ASAP benchmark. Similarly, Zhong (2024) explored
various BERT/DeBERTa fine-tuning methods, incorporating dropout and linear
layers for enhanced adaptability.
Yang et al. (2020) proposed R²BERT, combining classification and ranking loss to
improve scoring precision. Cho et al. (2023) introduced rubric-specific training,
enabling models to learn shared scoring traits across topics, rather than training on
prompt-specific data. Additionally, LLMs like GPT-4 have shown promise in
evaluating essays by non-native English speakers, offering rich, multi-dimensional
feedback.
3. Methodology
Figure 2.1: Methodology Pipeline
Each data sample in this project follows a structured schema designed to support instruction tuning for automated essay grading. The dataset is formatted in JSON Lines (JSONL), with each record stored as one line that strictly follows this schema.
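For concreteness, the snippet below writes one illustrative JSONL record. The field names and example values are assumptions for illustration only; the authoritative schema is the one used in the released dataset (see the Appendices).

```python
import json

# Illustrative record only: field names and values are assumed for this sketch,
# not the project's exact schema (the released JSONL dataset is authoritative).
sample = {
    "instruction": "Grade the student's answer from 0.0 to 4.0 (0.5 steps) and justify the score.",
    "question": "Explain the difference between supervised and unsupervised learning.",
    "reference_answer": "Supervised learning uses labeled data; unsupervised learning finds structure in unlabeled data.",
    "student_answer": "Supervised learning needs labels, unsupervised does not.",
    "score": 3.0,
    "rationale": "The core distinction is correct but lacks detail and examples.",
}

# Each line of the JSONL file holds exactly one such record.
with open("dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```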
This focused distribution ensures the model receives ample data on the most challenging
score ranges, improving discrimination and reducing errors.
For data splitting and shuffling (before training), the dataset (1,848 samples) was randomly shuffled and then split into training, validation, and test subsets using stratified sampling to preserve the original class (score) distribution across sets.
Stratification ensured that each subset reflects the true distribution of scores (from 0.0 to 4.0 in 0.5 increments), maintaining consistency in model evaluation and preventing bias due to class imbalance.
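A minimal sketch of such a stratified split is shown below, assuming scikit-learn's train_test_split and an illustrative 80/10/10 ratio (the exact proportions used in the project are not restated here).

```python
import json
from sklearn.model_selection import train_test_split

# Load the JSONL dataset; the "score" field name follows the earlier sketch.
with open("dataset.jsonl", encoding="utf-8") as f:
    samples = [json.loads(line) for line in f]
scores = [s["score"] for s in samples]

# 80/10/10 is an assumed ratio for illustration. Stratifying on the score keeps
# the 0.0-4.0 (0.5-step) distribution consistent across train/validation/test.
train, rest = train_test_split(samples, test_size=0.20, stratify=scores,
                               shuffle=True, random_state=42)
rest_scores = [s["score"] for s in rest]
val, test = train_test_split(rest, test_size=0.50, stratify=rest_scores,
                             random_state=42)
print(len(train), len(val), len(test))
```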
The model was instructed to produce both a score and a rationale. These outputs were
treated as distilled supervision signals, forming labeled examples for subsequent
instruction tuning.
To enhance consistency, prompts were batched by score (e.g., generating only score-2
examples in one session), reducing variance and label noise.
The complete prompt templates, sample completions, and guidelines for constructing
them from PDF knowledge sources are provided in Appendix A for transparency and
reproducibility.
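As a rough illustration of this setup, the sketch below builds the kind of score-conditioned generation prompt described above. The wording, function name, and example question are assumptions; the authoritative templates are those given in Appendix A.

```python
# Illustrative prompt builder; the project's exact templates are in Appendix A.
def build_generation_prompt(question: str, reference_answer: str, target_score: float) -> str:
    return (
        "You are an experienced grader. Using the reference answer below, write one\n"
        f"student answer that deserves exactly {target_score}/4.0, then state the score\n"
        "and a short rationale.\n\n"
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n\n"
        "Output format:\nStudent answer: ...\nScore: ...\nRationale: ..."
    )

# Batching by score: a single generation session requests only one target score
# (e.g., 2.0), which reduces variance and label noise across the batch.
prompt = build_generation_prompt(
    "Define overfitting.",
    "Overfitting means the model fits noise in the training data and generalizes poorly.",
    target_score=2.0,
)
print(prompt)
```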
Figure 2.5: Deep Question Understanding by Injecting BERT into Each Prompt Sample
Dependency: in Chain of Thoughts (CoT), each step depends on the previous one; in Tree of Thoughts (ToT), each branch evolves independently.
By combining semantic enrichment (via BERT) with CoT and ToT prompting
strategies, we significantly enhanced the model’s reasoning depth, robustness, and
score alignment accuracy, particularly for complex or ambiguous responses.
Figure 3.6: Tree of Thoughts over Chain-of-Thought Branches
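One possible realization of this semantic enrichment is sketched below: a BERT [CLS]-based similarity between the student answer and the reference answer is computed and appended to the prompt as an extra hint before CoT/ToT reasoning. The specific signal injected in the project (embeddings, keywords, or a similarity score) is an assumption here.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def cls_embedding(text: str) -> torch.Tensor:
    # Encode the text and take the final-layer [CLS] vector as a sentence summary.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state  # shape: (1, seq_len, 768)
    return hidden[:, 0, :]

def semantic_hint(student_answer: str, reference_answer: str) -> str:
    sim = torch.cosine_similarity(cls_embedding(student_answer),
                                  cls_embedding(reference_answer)).item()
    return f"[Semantic similarity to the reference answer: {sim:.2f}]"

prompt = (
    "Question: Define overfitting.\n"
    "Student answer: The model memorizes training noise.\n"
    + semantic_hint("The model memorizes training noise.",
                    "Overfitting means fitting noise in the training data.")
    + "\nScore the answer from 0.0 to 4.0 and explain your reasoning step by step."
)
print(prompt)
```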
The entire training process was conducted on Google Colab, utilizing NVIDIA Tesla T4
GPUs (16GB VRAM).
This setup enabled fast and cost-effective training despite hardware limitations, allowing for frequent evaluation, prompt iteration, and high data throughput.
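A minimal QLoRA-style fine-tuning sketch with Unsloth on a single T4 is shown below. The LoRA rank, batch size, and step count are illustrative assumptions rather than the exact project configuration, and argument names in trl's SFTTrainer vary somewhat between versions.

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load the 4-bit quantized Mistral-7B-Instruct-v0.2 checkpoint via Unsloth.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
# Attach LoRA adapters; rank, alpha, and target modules are assumed values.
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16, lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Assumes each training record has a pre-rendered "text" field (prompt + completion).
dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,                 # short runs were used in some experiments
        learning_rate=1e-4,
        lr_scheduler_type="cosine",
        fp16=True,                    # T4 GPUs do not support bf16
        logging_steps=10,
        output_dir="outputs",
    ),
)
trainer.train()
```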
• Scoring Accuracy: We measured the exact match rate between the model-
predicted score and the human-assigned score, highlighting how often the model
produces the correct final grade.
• Mean Absolute Error (MAE): This metric calculates the average absolute
difference between predicted and true scores. We report:
o Overall MAE
o MAE within different scoring bands: Low (0.0–1.0), Mid (1.5–2.5), High (3.0–4.0). A short computation sketch for both metrics follows this list.
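The sketch below shows one way to compute both metrics from lists of true and predicted scores, using the band edges defined above; the function name and report layout are assumptions.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Exact-match accuracy and MAE, overall and within each scoring band."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_err = np.abs(y_pred - y_true)
    report = {
        "exact_match": float(np.mean(y_pred == y_true)),
        "mae_overall": float(abs_err.mean()),
    }
    bands = {"low_0.0-1.0": (0.0, 1.0), "mid_1.5-2.5": (1.5, 2.5), "high_3.0-4.0": (3.0, 4.0)}
    for name, (lo, hi) in bands.items():
        mask = (y_true >= lo) & (y_true <= hi)   # band membership by true score
        report[f"mae_{name}"] = float(abs_err[mask].mean()) if mask.any() else None
    return report

print(evaluate([3.0, 2.5, 0.5], [3.0, 2.0, 1.0]))
```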
The table below presents a summary of the model’s performance across different
experimental settings. We report average MAE and breakdowns across scoring bands.
| Exp. | Dataset / Setting | Avg. MAE | MAE Low (0.0–1.0) | MAE Mid (1.5–2.5) | MAE High (3.0–4.0) | Notes |
|------|-------------------|----------|-------------------|-------------------|--------------------|-------|
| 2 | 120 (faulty rationales) | 0.3583 | 0.27 | 0.47 | 0.19 | Stronger balance due to early stopping |
| 3 | 120, 60 steps (faulty rationales) | 0.3000 | 0.17 | 0.29 | 0.66 | Overfitting on easy cases |
| 6 | 120, LR=1e-4, cosine (faulty rationales) | 0.3333 | 0.32 | 0.35 | 0.32 | Learning rate tuning had limited effect |
| 7 | Full dataset (1,600) (faulty rationales) | 0.3056 | 0.17 | 0.36 | 0.45 | Good general model, weak on mid scores |
| 8 | 1,800 (generated) (faulty rationales) | 0.3240 | – | – | – | No major gain despite large data size |
In many cases, the model-generated rationale appears linguistically sound but fails to
justify the actual score it predicts. Below are select examples:
Sample #333
Sample #328
Sample #318
• Vague Responses: The model tends to under-score vague but mostly correct
answers, likely due to their lack of specific terminology.
• Misleading Logic: Some rationales interpret answers too generously or too
harshly, suggesting the model learned surface-level rationale templates without
true semantic grounding.
• Over-reliance on Surface Form: The model is sensitive to certain keywords or
phrasings, leading to hallucinated rationales not grounded in the actual student
response.
Table 5.1: Summary Results for the Standard vs. Semantic Model on the Latest Generated Dataset
Table 5.3: Standard Model vs. Semantic Model across All Discrete Score Values
The most notable gains were for scores 2.5 and 3.0, where semantic understanding and multistep reasoning made a major difference.
The score of 4.0 saw an unexpected performance drop, but this result is based on only 13 samples, so it could reflect noise or overgeneralization from ToT.
Figure 5.4: True vs Predicted Scores per Sample (first 100 samples)
Figure 5.5: True vs Predicted Scores per Sample (second 100 samples)
Final Takeaways
6. Conclusion
• The model does not require extensive hyperparameter tuning, as the default
Unsloth configuration demonstrates strong baseline performance, which is a
positive indicator of the model’s robustness and generalizability.
• Data quality > data quantity: Cleaner, more logically sound examples
outperformed noisy large-scale ones.
• Utilize BERT at the embedding layer and incorporate its extracted representations into the generation layers of the base model.
8. References
• Mistral-7B-Instruct-v0.2
• Unsloth 4bit Guide
• Fine-tuning Docs
• Google AI Studio for generating data from large PDF inputs (Gemini 2.5 Pro reasoning model)
• Mayfield, E., & Black, A. W. (2020). Should you fine-tune BERT for automated
essay scoring? Proceedings of the 15th Workshop on Innovative Use of NLP for
Building Educational Applications.
• Wang, Z., Wu, J., & Xu, J. (2022). BERT-based multi-scale feature modeling for
automated essay scoring. Computer Speech & Language.
• Zhong, J. (2024). Exploring Transformer-based fine-tuning for summarization
and essay scoring tasks. ArXiv preprint arXiv:2401.12345.
• Yang, L., Wang, Y., & Li, J. (2020). R²-BERT: Enhanced Essay Scoring via
Ranking and Classification Loss. Proceedings of COLING 2020.
• Cho, K., Park, J., & Lee, S. (2023). Rubric-specific Training for Prompt-Agnostic Essay Scoring. Proceedings of NAACL 2023.
• Ludwig, S., Ormerod, C., & Atkinson, M. (2021). Leveraging Transformer
models for semantic scoring in automated writing evaluation. Educational Data
Mining Conference (EDM).
9. Appendices
• Prompt templates used for generation
• Dataset in JSONL format
• Source Code
By:
- Ahmad Hudhud - https://github.com/AhmadHudhud83
- Omar Baker - https://github.com/orayyan35