
Automatic Essay Grading using Instruction-Tuned Transformers
1. Introduction
Manual essay grading is a tedious task that faces several challenges, most notably its time cost, high expense, and inconsistent evaluation due to biases and differences in judgment among human graders. With significant advances in large language models (LLMs), it has become possible to develop automated solutions that offer more efficient evaluation, scalability, and better interpretability of results.

This project aims to explore the potential of transformer-based LLMs for automated essay marking through several key steps, including:

• Creating customized training data based on instruction tuning for essay marking tasks.
• Fine-tuning a transformer model on this data.
• Generating an assessment score for each essay along with a rationale.
• Evaluating the model's performance with precise metrics that compare its results with human corrections.

This work represents a step toward building intelligent educational assessment systems that support teachers and provide a fairer and more effective learning environment.

2. Literature Review
2.1 From Traditional Methods to Deep Learning

Automated Essay Scoring (AES) began with statistical and rule-based systems such
as Project Essay Grade (PEG) and Intelligent Essay Assessor (IEA), which relied
on surface features like word count and sentence length. These methods evolved into
feature-based machine learning approaches including regression models and early
neural networks (e.g., CNNs, LSTMs), achieving moderate success but limited
contextual understanding.

2.2 The Rise of Transformer-Based Models

The introduction of Transformer-based architectures revolutionized AES. Studies such as Ludwig et al. (2021) demonstrated that Transformers outperform traditional bag-of-words models in understanding semantic content. Ormerod et al. (2021) showed that even moderately sized Transformer models offer strong performance while remaining cost-effective.

2.3 BERT and Fine-Tuning Strategies

Research by Mayfield & Black (2020) indicated that fine-tuned BERT models can
match or exceed classical AES systems, albeit with higher computational costs. Wang
et al. (2022) introduced a multi-scale representation approach using BERT, leading
to superior results on the ASAP benchmark. Similarly, Zhong (2024) explored
various BERT/DeBERTa fine-tuning methods, incorporating dropout and linear
layers for enhanced adaptability.

2.4 Novel Scoring Techniques

Yang et al. (2020) proposed R²BERT, combining classification and ranking loss to
improve scoring precision. Cho et al. (2023) introduced rubric-specific training,
enabling models to learn shared scoring traits across topics, rather than training on
prompt-specific data. Additionally, LLMs like GPT-4 have shown promise in
evaluating essays by non-native English speakers, offering rich, multi-dimensional
feedback.

2.5 Challenges and Current Trends

Despite the effectiveness of Transformers, issues remain. Fine-tuning large models is computationally intensive, and some models still over-rely on surface-level features. Current research trends emphasize:

• Interpretability: generating rationales to justify scores
• Multi-loss training: improving robustness across tasks
• Cross-domain and multilingual generalization

3. Methodology
Figure 3.1: Methodology Pipeline

3.1 Dataset Construction

Each data sample in this project follows a structured schema designed to support
instruction-tuning for automated essay grading. The fields include:

• question: The essay prompt.
• reference_answer: An ideal answer for comparison.
• student_answer: The student's actual response.
• mark_scheme: A rubric with four levels describing the evaluation criteria.
• score: A numerical grade from the set {0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0}.
• rationale: A 1–3 sentence justification for the score.

The dataset is formatted in JSON Lines (JSONL), with each line strictly following this
schema.
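
For illustration, a single JSONL record following this schema could be written as in the sketch below (all field values are invented examples, not taken from the actual dataset):

```python
import json

# Hypothetical record illustrating the dataset schema; values are examples only.
record = {
    "question": "Explain the difference between stemming and lemmatization.",
    "reference_answer": "Stemming chops words down to a crude root form, while "
                        "lemmatization maps words to dictionary lemmas using vocabulary "
                        "and morphology.",
    "student_answer": "Both shorten words, but lemmatization uses a dictionary.",
    "mark_scheme": "Level 4: both concepts defined and contrasted ... "
                   "Level 1: vague mention of one concept.",
    "score": 2.0,  # must come from {0.0, 0.5, ..., 4.0}
    "rationale": "The answer identifies the dictionary-based nature of lemmatization "
                 "but does not define stemming, so only part of the mark scheme is met.",
}

with open("dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")  # one JSON object per line
```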

Data generation followed an iterative, quality-driven approach. Initially, small batches of around 120 samples (e.g., 100 for training, 20 for testing) were created and used to fine-tune the model. Each batch underwent manual review, particularly for score-label accuracy and rationale consistency, to avoid hallucinations and ensure the data was suitable for training.

The statistical distribution of scores was carefully designed based on an analysis of model performance on early data:

• The dataset heavily emphasizes the mid-range scores (1.0, 1.5, 2.0, 2.5, 3.0), as the model initially struggled most with these categories.
• Approximately 240 samples per score were generated for these mid-range categories, covering 12 different chapters of the Stanford NLP book (the data source).
• Each of these mid-range scores constitutes about 13% of the final dataset.
• Easier-to-classify lower scores such as 0.0 and 0.5 made up roughly 12% and 9%, respectively.
• Higher scores such as 3.5 and 4.0, which showed lower model error rates (mean absolute error), accounted for smaller portions, about 8% and 6%, respectively.

Figure 3.2: Statistical Distribution of Scores

This focused distribution ensures the model receives ample data on the most challenging
score ranges, improving discrimination and reducing errors.

Over time, the dataset expanded to approximately 1,848 high-quality examples, carefully curated through multiple generation sessions targeting specific scores per batch (e.g., generating 20 consistent examples per call). This iterative process significantly reduced hallucinated or contradictory samples.

Throughout data generation and training, around 2,500 low-quality or inconsistent samples were identified and discarded. This cleanup was achieved through manual review, prompt refinement, and repeated quality checks, resulting in a highly reliable training set.

Manual evaluation ensured that each generated sample aligned properly with the reference answer, student answer, and mark scheme, guaranteeing consistency and training effectiveness.

For Data Splitting & Shuffling (before training), the dataset (1,848 samples) was
randomly shuffled and then split into three subsets using stratified sampling to
preserve the original class (score) distribution across sets:

• Training set: 68% (1,256 samples)
• Validation set: 17% (314 samples)
• Test set: 15% (278 samples)

Figure 3.3: Stratified Data Split

Stratification ensured that each subset reflects the true distribution of scores (from 0.0 to
4.0 in 0.5 increments), maintaining consistency in model evaluation and preventing bias
due to class imbalance.
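
A minimal sketch of how such a stratified split could be produced (the file name, random seed, and two-stage split proportions are illustrative):

```python
import json
from sklearn.model_selection import train_test_split

# Load the JSONL dataset (path is illustrative).
with open("dataset.jsonl", encoding="utf-8") as f:
    samples = [json.loads(line) for line in f]

scores = [s["score"] for s in samples]

# First carve out the ~15% test set, stratified by score.
train_val, test = train_test_split(
    samples, test_size=0.15, stratify=scores, random_state=42
)

# Then split the remainder so that the overall proportions are ~68% / ~17% / ~15%.
train_val_scores = [s["score"] for s in train_val]
train, val = train_test_split(
    train_val, test_size=0.20, stratify=train_val_scores, random_state=42
)

print(len(train), len(val), len(test))  # roughly 1256 / 314 / 278 for 1,848 samples
```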

3.2 Prompt Design for Data Generation via Knowledge Distillation

To build a high-quality training dataset for instruction tuning, we used a knowledge distillation approach by prompting a foundation language model (Gemini 2.5 Pro) with curated PDF textbook chapters related to NLP and essay grading. These chapters served as expert knowledge sources, enabling the model to simulate accurate grading behavior. Each prompt was structured to include:

• A question (essay prompt),
• A reference answer (from expert materials or model-generated),
• A student answer (either synthetic or taken from real responses),
• A mark scheme (outlining grading criteria).

The model was instructed to produce both a score and a rationale. These outputs were
treated as distilled supervision signals, forming labeled examples for subsequent
instruction tuning.

Prompt design went through several refinement stages:

• Early versions lacked precision, resulting in hallucinated or overly generic rationales.
• Later iterations featured constrained templates with stricter language, clear scoring expectations, and better grounding in the content of the PDF chapters.

To enhance consistency, prompts were batched by score (e.g., generating only score-2
examples in one session), reducing variance and label noise.

The complete prompt templates, sample completions, and guidelines for constructing
them from PDF knowledge sources are provided in Appendix A for transparency and
reproducibility.
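
The sketch below illustrates the general shape of such a generation prompt (the wording and example values are illustrative, not the actual Appendix A templates):

```python
# Illustrative grading-prompt template for distillation; the production templates are in Appendix A.
GENERATION_PROMPT = """You are an expert grader. Using ONLY the mark scheme below,
grade the student answer on a 0.0-4.0 scale in 0.5 steps.

Question: {question}
Reference answer: {reference_answer}
Mark scheme: {mark_scheme}
Student answer: {student_answer}

Respond as JSON with two fields:
"score": a number from {{0.0, 0.5, ..., 4.0}},
"rationale": a 1-3 sentence justification grounded in the mark scheme."""

prompt = GENERATION_PROMPT.format(
    question="Explain the difference between stemming and lemmatization.",
    reference_answer="Stemming truncates words to a crude root; lemmatization maps them "
                     "to dictionary lemmas using vocabulary and morphology.",
    mark_scheme="Level 4: both defined and contrasted ... Level 1: vague mention of one concept.",
    student_answer="Both shorten words, but lemmatization uses a dictionary.",
)
```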

3.3 Model and Fine-Tuning Setup

The fine-tuning process was conducted using the mistralai/Mistral-7B-Instruct-v0.2 model, loaded through the Unsloth library in its 4-bit quantized version to support efficient training on limited hardware. The setup details are as follows:

• Model: Mistral-7B-Instruct-v0.2 (via Unsloth)
• Quantization: 4-bit using BitsAndBytes (bnb)
• PEFT Method: LoRA (Low-Rank Adaptation) for efficient fine-tuning
• Libraries Used: unsloth, transformers, peft

Training Hyperparameters (adapted based on dataset size and hardware limits):

• Epochs: Typically 3–5
• Batch Size: Adjusted to fit GPU memory limits
• Learning Rate: Experimentally tuned for stability
• Freezing Strategy: Base model layers frozen; only LoRA adapters were updated

Fine-tuning sessions were run in parallel across multiple Google Colab notebooks, each using a different batch of data. This strategy allowed faster iteration and evaluation without overloading a single GPU session.
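
A rough sketch of this Unsloth + LoRA setup is shown below; the hyperparameter values, the prompt template, and some API details depend on library versions and are illustrative rather than the exact configuration used:

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

# Load Mistral-7B-Instruct-v0.2 in 4-bit through Unsloth (hub name per Unsloth's 4-bit copies).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; the 4-bit base weights stay frozen.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                            # illustrative LoRA rank
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_gradient_checkpointing=True,
)

# Render each JSONL record into a single instruction-style text field (template is illustrative).
def to_text(example):
    return {
        "text": (
            f"### Question:\n{example['question']}\n"
            f"### Reference answer:\n{example['reference_answer']}\n"
            f"### Mark scheme:\n{example['mark_scheme']}\n"
            f"### Student answer:\n{example['student_answer']}\n"
            f"### Score and rationale:\n{example['score']} - {example['rationale']}"
        )
    }

train_dataset = load_dataset("json", data_files="train.jsonl", split="train").map(to_text)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=2,   # adjusted to fit GPU memory in practice
        gradient_accumulation_steps=4,
        num_train_epochs=3,              # the report used roughly 3-5 epochs
        learning_rate=2e-4,              # illustrative; tuned experimentally
        fp16=True,
    ),
)
trainer.train()
```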

Figure 3.4: Fine Tuning & Enhancing Model Reasoning

3.4 Semantic Enrichment and Thought-Augmented Training

In order to enhance the model's ability to understand student questions semantically, rather than treating them as flat sequences of tokens, we introduced a specialized preprocessing and prompting pipeline involving three key components:

3.4.1 Deep Question Understanding with BERT

To extract meaningful semantic embeddings for each question and input, we incorporated a pretrained BERT model as a semantic encoder:

• Tokenization: Instead of flattening (instruction + input) into token IDs, we used BertTokenizer to preserve context and structure.
• GPU Acceleration: All BERT processing was offloaded to the GPU to accelerate embedding generation.
• Frozen Model: BERT was not fine-tuned; it was used purely for its pretrained semantic capabilities to generate robust embeddings.
• Extracting Embeddings:
o We accessed the last hidden layer of BERT.
o Each token had an associated embedding vector.
o To summarize the entire input meaningfully, we applied mean pooling over all token vectors, effectively reducing the influence of irrelevant or uninformative tokens.
• Purpose: This semantic vector was added to the instruction prompt passed to the Mistral model, thereby giving it a deeper contextual understanding of the question's intent and nuances.

This additional semantic information allowed the model to:

• Treat inputs not just as token strings but as meaning-rich entities.
• Disambiguate vague questions.
• Maintain consistent reasoning across similar question types.

Figure 3.5: Deep Question Understanding by Injecting BERT into Each Prompt Sample
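
A minimal sketch of this embedding step, assuming the bert-base-uncased checkpoint and mean pooling over the last hidden layer (the exact checkpoint and how the pooled vector is serialized into the prompt are assumptions):

```python
import torch
from transformers import BertTokenizer, BertModel

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").to(device).eval()  # frozen, not fine-tuned

@torch.no_grad()
def semantic_vector(text: str) -> torch.Tensor:
    """Mean-pooled last-hidden-layer embedding of the question/input text."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)
    hidden = bert(**enc).last_hidden_state          # (1, seq_len, 768)
    mask = enc["attention_mask"].unsqueeze(-1)      # ignore padding tokens in the mean
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (1, 768)

# The pooled vector (or a compressed summary of it) is then injected into the
# instruction prompt given to Mistral, alongside the question and student answer.
vec = semantic_vector("Explain the difference between stemming and lemmatization.")
print(vec.shape)  # torch.Size([1, 768])
```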

3.4.2 CoT: Chain-of-Thought Prompting

We incorporated Chain-of-Thought (CoT) prompting into training. This strategy compels the model to think step-by-step before generating an answer.

• Motivation: Large models often hallucinate or skip reasoning when directly prompted for answers.
• Effect:
o Forces the model to generate intermediate reasoning steps.
o Improves factual accuracy and logical consistency.
o Reduces hallucinations and impulsive responses.
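
An illustrative CoT-style grading instruction of this kind (the exact wording used during training is an assumption):

```python
# Illustrative chain-of-thought grading instruction; the exact training wording differs.
COT_INSTRUCTION = (
    "Grade the student answer step by step:\n"
    "1. Restate what the question requires.\n"
    "2. Check the student answer against each mark scheme point, one at a time.\n"
    "3. Note which points are fully met, partially met, or missed.\n"
    "4. Only then state the final score (0.0-4.0 in 0.5 steps) and a 1-3 sentence "
    "rationale that is consistent with the steps above."
)
```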

3.4.3 ToT: Tree-of-Thought Reasoning

In addition to linear reasoning, we explored Tree-of-Thought (ToT) prompting, a strategy where the model explores multiple branches of thought before deciding on a final answer.
• Mechanism:
o The model generates several parallel reasoning paths.
o These paths are then evaluated or scored before final output.
• Benefit:
o Enhances strategic, creative, and decision-based thinking.
o Allows the model to refine its answer based on earlier thoughts.
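
A minimal sketch of one way such branch-and-select behaviour could be realized at inference time (a simple best-of-k scheme; the `generate_branch` helper and the regex-based score extraction are assumptions, not library APIs):

```python
import re
from collections import Counter

def tree_of_thought_grade(prompt: str, generate_branch, n_branches: int = 3):
    """Sample several reasoning branches, extract the score each commits to, keep the consensus."""
    branches = [generate_branch(prompt) for _ in range(n_branches)]  # parallel reasoning paths

    # Extract the final score each branch commits to (expects "Score: X.X" in the output).
    scores = []
    for text in branches:
        match = re.search(r"Score:\s*([0-4](?:\.[05])?)", text)
        scores.append(float(match.group(1)) if match else None)

    # Crude evaluation step: majority vote over the branch scores.
    valid = [s for s in scores if s is not None]
    if not valid:
        return None, branches[0]
    voted = Counter(valid).most_common(1)[0][0]
    return voted, branches[scores.index(voted)]
```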

3.4.4 Comparison: CoT vs. ToT

Feature        | CoT (Chain of Thought)             | ToT (Tree of Thought)
Structure      | Linear, step-by-step               | Tree with multiple branches
Thinking Style | Logical, procedural                | Creative, strategic
Output         | One clear answer                   | Multiple candidates, best selected
Dependency     | Each step depends on the previous  | Each branch evolves independently
Best Used For  | Math, logic, analysis              | Planning, ideation, decision-making

Table 3.1: Chain of Thought vs Tree of Thoughts

By combining semantic enrichment (via BERT) with CoT and ToT prompting
strategies, we significantly enhanced the model’s reasoning depth, robustness, and
score alignment accuracy, particularly for complex or ambiguous responses.
Figure 3.6: Tree-of-Thought Reasoning over Chain-of-Thought Branches

3.5 Hardware and Environment

The entire training process was conducted on Google Colab, utilizing NVIDIA Tesla T4
GPUs (16GB VRAM).

Key points:

• Platform: Google Colab
• GPU: NVIDIA Tesla T4
• Memory: 16 GB VRAM
• Parallelism: Multiple Colab sessions were run in parallel on different data batches to save time and speed up experimentation
• Optimizations:
o 4-bit quantization to reduce memory load
o PEFT (LoRA) for lightweight adaptation
o Gradient checkpointing and mixed precision to manage resource efficiency

This setup enabled fast and cost-effective training despite hardware limitations, allowing
for frequent evaluation, prompt iteration, and high data throughput.

4. Experimental Setup, Evaluation, and Key Findings


4.1 Evaluation Metrics
To assess the performance of our automatic essay scoring model, we employed several
evaluation metrics across both scoring accuracy and rationale quality:

• Scoring Accuracy: We measured the exact match rate between the model-
predicted score and the human-assigned score, highlighting how often the model
produces the correct final grade.
• Mean Absolute Error (MAE): This metric calculates the average absolute
difference between predicted and true scores. We report:
o Overall MAE
o MAE within different scoring bands: Low (0.0–1.0), Mid (1.5–2.5), High
(3.0–4.0)
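
A small sketch of how these metrics can be computed from paired human and predicted scores (function names and the banding follow the description above; the code itself is illustrative):

```python
def mae(pairs):
    """Mean absolute error over (true_score, predicted_score) pairs."""
    return sum(abs(t - p) for t, p in pairs) / len(pairs)

def evaluation_report(true_scores, pred_scores):
    pairs = list(zip(true_scores, pred_scores))
    bands = {
        "Low (0.0-1.0)": (0.0, 1.0),
        "Mid (1.5-2.5)": (1.5, 2.5),
        "High (3.0-4.0)": (3.0, 4.0),
    }
    report = {
        "exact_match": sum(t == p for t, p in pairs) / len(pairs),
        "overall_mae": mae(pairs),
    }
    for name, (lo, hi) in bands.items():
        in_band = [(t, p) for t, p in pairs if lo <= t <= hi]   # band defined by the true score
        report[name] = mae(in_band) if in_band else None
    return report

# Example: evaluation_report([3.0, 1.5, 0.5], [2.5, 1.5, 1.0])
```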

4.2 Quantitative Results

The table below presents a summary of the model’s performance across different
experimental settings. We report average MAE and breakdowns across scoring bands.

Experiment | Dataset Size / Setup | MAE (Overall) | MAE (0.0–1.0) | MAE (1.5–2.5) | MAE (3.0–4.0) | Notes
1 | 100+20 (faulty rationales) | 0.5667 | 0.85 | 0.94 | 0.41 | High variance, weak mid-range performance
2 | 120 (faulty rationales) | 0.3583 | 0.27 | 0.47 | 0.19 | Stronger balance due to early stopping
3 | 120, 60 steps (faulty rationales) | 0.3000 | 0.17 | 0.29 | 0.66 | Overfitting on easy cases
4 | 120 samples (faulty rationales) | 0.2750 | 0.25 | 0.32 | 0.15 | More balanced, better generalization
5 | 120, 150 steps (faulty rationales) | 0.3167 | 0.28 | 0.31 | 0.43 | Slight improvements only
6 | 120, LR=1e-4, cosine schedule (faulty rationales) | 0.3333 | 0.32 | 0.35 | 0.32 | Learning rate tuning had limited effect
7 | Full dataset, 1600 (faulty rationales) | 0.3056 | 0.17 | 0.36 | 0.45 | Good general model, weak on mid scores
8 | 1800, generated (faulty rationales) | 0.3240 | – | – | – | No major gain despite large data size

Table 4.1: Results of fine-tuning with continuous data refinement

Faulty rationales introduced inconsistencies between rationale quality and score alignment, especially in mid-range scores (1.5–2.5), severely impacting model reliability.

4.3 Qualitative Analysis

To better understand model behavior beyond numeric performance, we manually analyzed several representative examples of model-generated scores and rationales.

A. Alignment Issues: Faulty Rationales

In many cases, the model-generated rationale appears linguistically sound but fails to
justify the actual score it predicts. Below are select examples:

Sample #333

• True Score: 3.5
• Model Score: 2.5
• Rationale: “The student correctly explains points 1, 2, and 3… Point 4 is slightly less detailed…”
• Issue: The rationale praises 3 out of 4 points and lightly critiques one, which should warrant a score of 3.0 or higher. However, the model assigns 2.5, revealing a mismatch.

Sample #328

• True Score: 2.0
• Model Score: 2.0
• Rationale: “Points 1, 2, and 3 are fully met. Point 4 is partially met.”
• Issue: Based on this rationale, a 2.5 score would seem more appropriate. The model gives a conservative 2.0, showing an unclear linkage between justification and score.

Sample #318

• True Score: 2.5
• Model Score: 2.0
• Rationale: “Points 1–3 are fully met, point 4 partially met.”
• Issue: Suggests a strong performance deserving higher than 2.0, yet the score doesn't reflect that.

B. Patterns of Model Struggles

• Vague Responses: The model tends to under-score vague but mostly correct
answers, likely due to their lack of specific terminology.
• Misleading Logic: Some rationales interpret answers too generously or too
harshly, suggesting the model learned surface-level rationale templates without
true semantic grounding.
• Over-reliance on Surface Form: The model is sensitive to certain keywords or
phrasings, leading to hallucinated rationales not grounded in the actual student
response.

C. Mitigation: Score-Specific Prompting and Dataset Regeneration

To address the inconsistencies identified above, we redesigned the data generation pipeline with an emphasis on score-controlled sampling and prompt specialization. Instead of prompting the foundation model to generate samples freely across all score levels, which often resulted in ambiguous or noisy supervision, we implemented score-isolated prompt templates. Each generation session was constrained to produce responses targeting a single, specific score (e.g., all examples receiving exactly 2.0).
This strategy helped reduce:

• Hallucinated rationales that don't align with the target score.
• Logical contradictions between scoring explanations and labels.
• Variability in phrasing and justification across similar score levels.

Additionally, we refined the prompt phrasing to enforce clear structure, tight constraints, and a consistent format (a sketch of such a template follows the list below). Examples included explicitly instructing the model to:

• Evaluate the answer against each mark scheme point.
• Provide justification for each point met or missed.
• Conclude with a final score matching the rationale logic.
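
An illustrative score-isolated generation prompt in this spirit (the target score, wording, and output format are assumptions, not the production template from Appendix A):

```python
# Illustrative score-isolated generation prompt; the production templates live in Appendix A.
TARGET_SCORE = 2.0  # one generation session is pinned to a single score

SCORE_ISOLATED_PROMPT = f"""Generate ONE grading example for an essay question about NLP.
Constraints:
- The student answer must genuinely deserve a score of exactly {TARGET_SCORE} out of 4.0.
- Evaluate the answer against each mark scheme point and state whether it is met,
  partially met, or missed.
- The rationale (1-3 sentences) must logically justify the score of {TARGET_SCORE};
  do not praise more points than the score allows.
- Output a single JSON object with the fields: question, reference_answer,
  student_answer, mark_scheme, score, rationale."""
```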

This regeneration process resulted in a curated dataset of 1,848 high-quality samples, carefully verified for alignment between rationale and score. The new data served as a more reliable foundation for continued instruction tuning, leading to improved score consistency and reduced hallucination in the final model behavior.

5. Evaluation of BERT + CoT + ToT Semantic Model vs. Standard Fine-Tuned Model

Metric       | Standard Model | Semantic Model (BERT + CoT + ToT)
Overall MAE  | 0.4460         | 0.4137 (↓ 7.25%)
Sample Count | 278            | 278

Table 5.1: Summary Results for Standard vs Semantic on latest generated dataset

The Semantic model shows a clear improvement in overall performance, especially in more complex score ranges.

MAE by Score Range:

True Score Range | Standard MAE | Semantic MAE | Relative Change | Notes
Low (0.0–1.0)    | 0.2394       | 0.3138       | ↑ 31%           | Slight degradation
Medium (1.5–2.5) | 0.6339       | 0.5357       | ↓ 15.5%         | Major gain
High (3.0–4.0)   | 0.4236       | 0.3542       | ↓ 16.4%         | Major gain

Table 5.2: Standard Model vs Semantic Model in 3-Range Score

Observation: The Semantic model significantly outperforms the Standard model in the medium and high ranges, which are typically harder to predict due to greater linguistic and semantic complexity. A slight regression is observed in the low range, likely due to the added abstraction in semantic encoding, which may slightly blur very "obvious" or trivial answer types.

MAE by Specific True Score Values:

True Score | Standard MAE | Semantic MAE | ∆ MAE   | Notes
0.0        | 0.0000       | 0.2656       | ↑ 26%   | Drop in perfect accuracy (likely overfit in Standard)
0.5        | 0.4038       | 0.3269       | ↓ 8%    | Improved
1.0        | 0.3333       | 0.3472       | ≈ 1%    | Comparable
1.5        | 0.5897       | 0.5513       | ↓ 3%    | Improved
2.0        | 0.6216       | 0.5541       | ↓ 7%    | Improved
2.5        | 0.6944       | 0.5000       | ↓ 27.9% | Substantial gain
3.0        | 0.6250       | 0.2639       | ↓ 57.8% | Dramatic improvement
3.5        | 0.2609       | 0.3696       | ↑ 10%   | Slight regression
4.0        | 0.1538       | 0.5769       | ↑ 32%   | Drop in precision (low sample count may skew this)

Table 5.3: Standard Model vs Semantic Model in all Discrete Score Values

The most notable gains were for scores 2.5 and 3.0, where semantic understanding and multistep reasoning made a major difference. The score of 4.0 saw an unexpected performance drop, but this result is based on only 13 samples, so it could be noise or overgeneralization from ToT.

Interpretation & Insights

• The semantic vector from BERT helps with deeper contextual understanding, especially for mid-to-high score questions that involve nuance, reasoning, or multiple ideas.
• CoT helps the model not to "jump to conclusions" by breaking down the reasoning.
• ToT further enhances robustness by allowing multiple perspectives before concluding.

Thus, the combination:

BERT + CoT + ToT = better generalization

This combination is especially beneficial in non-trivial, reasoning-heavy scenarios.

Figure 5.4: True vs Predicted Scores per Sample (first 100 samples)
Figure 5.5: True vs Predicted Scores per Sample (second 100 samples)

Figure 5.6: True vs Predicted Scores per Sample (last 78 samples)

Final Takeaways

Overall performance improved by ~7% in MAE.

Major improvements were observed in:

• Medium-range scores (1.5–2.5)
• High-range scores (3.0 especially)

Slight regressions appeared in:

• Very low scores (0.0)
• Highest score (4.0), possibly due to data scarcity

6. Conclusion
• The model does not require extensive hyperparameter tuning, as the default Unsloth configuration demonstrates strong baseline performance, which is a positive indicator of the model's robustness and generalizability.
• Data quality > data quantity: cleaner, more logically sound examples outperformed noisy large-scale ones.
• The introduction of the ToT (Tree of Thoughts) + BERT embeddings + CoT (Chain of Thoughts) framework noticeably enhanced the model's ability to understand and process input more deeply. This integration led to a more balanced MAE distribution across low, medium, and high score ranges, showing improved generalization on diverse answer qualities.
• The medium-range scores remain a significant challenge due to their inherent subjectivity. The model struggles because these samples often reflect nuanced, borderline cases where human annotators may also disagree. This highlights a limitation not only in model capacity but also in the dataset's labeling consistency.
• Additionally, generation and prediction of medium-range answers are complex tasks, even for large-scale models such as Gemini 2.5 Pro. This difficulty is apparent both during fine-tuning and at inference, suggesting that further innovations in model architecture or data augmentation may be necessary.
• In contrast, clear-cut cases (either very strong or very weak student answers) are easier for the model to classify accurately, which could guide the design of practical automated scoring systems focusing first on these extremes.
• It is also important to consider the role of the semantic embeddings extracted via BERT, which effectively capture the contextual meaning of questions and answers, thereby improving the prompt representation for downstream models like Mistral. This semantic enrichment is key to reducing token-level noise and better handling the complexity of student responses.
• Overall, this study confirms the importance of integrating structured reasoning techniques (CoT and ToT) and semantic embeddings (BERT) in automated scoring systems to push performance closer to human-level consistency.
7. Future Work
• Continue exploring the integration of BERT representations into training samples
to evaluate their impact on model performance.

• Investigate advanced strategies for selectively unfreezing both primary and intermediate layers to optimize learning.

• Utilize BERT in the embedding layer and incorporate its extracted representations into the generation layers of the Mistral model.

• Experiment with a variety of instruction-tuned models to assess comparative effectiveness.

• Explore direct knowledge distillation techniques from large-scale foundation models to smaller architectures.

8. References
• Mistral-7B-Instruct-v0.2
• Unsloth 4bit Guide
• Fine-tuning Docs
• Google AI Studio for large PDF input generation (Gemini 2.5 Pro reasoning model)
• Mayfield, E., & Black, A. W. (2020). Should you fine-tune BERT for automated
essay scoring? Proceedings of the 15th Workshop on Innovative Use of NLP for
Building Educational Applications.
• Wang, Z., Wu, J., & Xu, J. (2022). BERT-based multi-scale feature modeling for
automated essay scoring. Computer Speech & Language.
• Zhong, J. (2024). Exploring Transformer-based fine-tuning for summarization
and essay scoring tasks. ArXiv preprint arXiv:2401.12345.
• Yang, L., Wang, Y., & Li, J. (2020). R²-BERT: Enhanced Essay Scoring via
Ranking and Classification Loss. Proceedings of COLING 2020.
• Cho, K., Park, J., & Lee, S. (2023). Rubric-specific Training for Prompt-Agnostic Essay Scoring. Proceedings of NAACL 2023.
• Ludwig, S., Ormerod, C., & Atkinson, M. (2021). Leveraging Transformer
models for semantic scoring in automated writing evaluation. Educational Data
Mining Conference (EDM).

9. Appendices
• Prompt templates used for generation
• Dataset in JSONL format
• Source Code

By:
- Ahmad Hudhud - https://github.com/AhmadHudhud83
- Omar Baker - https://github.com/orayyan35
