
From GPT to BERT:

Benchmarking Large Language Models for Automated Quiz Generation

Yetunde Folajimi
School of Computing and Data Science
Wentworth Institute of Technology
Boston, Massachuses, USA
folajimiy@wit.edu

Abstract
This study evaluates the effectiveness of four leading large language models (LLMs), GPT-3, GPT-4, GPT-4o, and BERT, in generating quiz questions for Java and Python programming courses. We aim to understand how LLMs can effectively produce educationally valuable questions that meet specific pedagogical criteria, including technical precision, relevance to course objectives, linguistic clarity, and pedagogical appropriateness. Each model was prompted to generate 200 Java and 200 Python quiz questions, totaling 1600 unique questions. These questions are currently being evaluated based on both quantitative and qualitative assessments by a team of computer science educators. Preliminary findings suggest that GPT-4 outperforms BERT in terms of technical precision. Further analysis is ongoing to assess the performance of the models in generating contextually appropriate and educationally useful questions, offering insights into their potential integration into computer science curricula. This work seeks to contribute to the broader discourse on the utility of LLMs in educational settings, specifically within the scope of automated content creation to enhance teaching and assessment methodologies in computer science education.

CCS Concepts
• Applied computing~Education~E-learning • Computing methodologies~Artificial intelligence • Applied computing~Education~Interactive learning environments

Keywords
Formative Assessment, Automated Assessment, Computer Science Education, Personalized Quizzes, Quiz Questions Generation, Large Language Models

ACM Reference format:
Yetunde Folajimi. 2024. From GPT to BERT: Benchmarking Large Language Models for Automated Quiz Generation. In Proceedings of the 2024 ACM Virtual Global Computing Education Conference V. 2 (SIGCSE Virtual 2024), December 5-8, 2024, Virtual Event, NC, USA. ACM, New York, NY, USA, 2 pages. https://doi.org/10.1145/3649409.3691090

1 Introduction
Formative assessment is a critical educational practice that facilitates learning by providing timely and actionable feedback to students, thereby enhancing their understanding and retention of subject matter. Unlike summative assessments, which aim to evaluate student learning at the end of an instructional period, formative assessments are integrated throughout the educational process, offering continuous information on student progress and areas that need improvement [1]. This is especially vital in disciplines like computer science, where concepts can be abstract and complex, requiring iterative practice and reinforcement.
Recent advances in artificial intelligence (AI), particularly in natural language processing, have opened new avenues to enhance educational practices. Large language models (LLMs), such as OpenAI's GPT-3 and GPT-4, and Google's BERT, have demonstrated remarkable capabilities to generate coherent and contextually relevant text. These models are trained on diverse datasets comprising a vast array of internet text, enabling them to perform a variety of tasks that include content creation, question answering, and even tutoring [2].
The potential of LLMs to revolutionize educational tools by automating the creation of diverse and adaptive educational content is significant. Specifically, in the realm of formative assessments, LLMs can be used to generate a wide range of quiz questions tailored to individual learning trajectories, supporting educators in delivering personalized learning experiences and reducing their workload in content preparation. This study explores the integration of LLMs into the process of generating formative assessment questions in computer science education to identify efficient ways to enhance pedagogical support and student engagement through technology.
This work builds on existing research indicating the benefits of AI in educational settings, where AI tools have been shown to improve engagement and outcomes by providing learning experiences that are adaptive to student needs [5]. By integrating LLMs into formative assessment practices, this study seeks to further understand and harness AI's potential to transform educational methodologies in computer science.

2 Background
LLMs, such as those developed by OpenAI, represent a significant advancement in natural language processing (NLP) technologies. These models, including GPT-3 and GPT-4, are a type of neural network architecture known as transformers, which have been trained on diverse datasets that encompass a vast swath of human knowledge expressed in text form. The key to their success lies in their ability to generate text that is not only contextually relevant but also remarkably coherent [2].


LLMs have been increasingly incorporated into various applications beyond traditional NLP tasks. In the educational sector, these models have demonstrated potential as tools for content generation, tutoring systems, and automated assessment creation. The ability of LLMs to generate text based on given prompts makes them particularly suitable for creating educational materials and assessments that are both diverse and adaptive [5].
Previous research has shown that the use of AI in education can lead to more personalized learning experiences. LLMs, for example, have been used to automatically generate learning content adapted to individual student needs [1]. In addition, these models have been used to generate assessments that can adapt to the level of knowledge of the students, providing immediate feedback and guiding their learning journey more effectively [3].
Despite their advantages, integrating LLMs into educational practices presents challenges, primarily concerning the accuracy and relevance of the content generated, especially in technical subjects such as computer science. This study aims to address these challenges by exploring the capabilities of various LLMs in generating formative assessment questions that are both pedagogically sound and technically accurate.

3 Methodology

3.1 Selection of Large Language Models
This study evaluates four large language models (LLMs). We selected GPT-3, GPT-4, GPT-4o, and BERT due to their differing architectures and widespread use in NLP. GPT models are known for strong generative capabilities, while BERT excels in contextual understanding. This mix offers a comprehensive comparison for quiz question generation in programming courses, where both accuracy and clarity are key. GPT models are autoregressive, designed to predict the next word in a sequence, making them well suited for generative tasks. In contrast, BERT is bidirectional, primarily designed to understand the context of words in a sentence, and was therefore adapted for this study to generate content based on prompts [4].

3.2 Data Preparation
The study uses a dataset of topics in computer science, specifically Java and Python programming, to maintain focus and relevance. The topics were selected based on the core curriculum requirements for introductory and intermediate computer science courses. Each model received personalized prompts that included a brief description of the topic, the difficulty level, and the specific format desired for the question (e.g., multiple choice, fill-in-the-blank).

3.3 Question Generation Process
Each model was asked to generate 200 unique questions for Java and 200 for Python, totaling 800 questions per programming language across the four models and designed to test a range of cognitive skills. This approach allows a direct evaluation of each model's performance, eliminating prompt variation as a factor; an illustrative generation call is sketched after the example prompts below.
• Prompt example for GPT models: "Generate a multiple-choice question about object-oriented principles in Java focusing on class inheritance."
• Prompt example for BERT: "Create a multiple-choice question about the use of lists in Python programming."
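As an illustration of how such prompts can be issued programmatically, the following minimal sketch shows one generation call for the GPT family. It assumes the openai Python package (v1.x) and an API key in the environment; the model name, temperature, and the added answer-format instruction are illustrative choices rather than the study's exact configuration.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Generate a multiple-choice question about object-oriented principles "
    "in Java focusing on class inheritance. Provide four options and mark "
    "the correct answer."
)

response = client.chat.completions.create(
    model="gpt-4",            # swapped for a GPT-3 model or "gpt-4o" per experimental condition
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,          # illustrative value
)

print(response.choices[0].message.content)

In such a setup, the same prompt text would be reused across models so that only the model identifier varies, consistent with eliminating prompt variation as a factor.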
3.4 Ongoing Studies
The generated questions are currently being evaluated through a two-stage process:
• An initial script checks for obvious errors, such as syntax errors or irrelevant content, filtering out unsuitable questions (a minimal sketch of such a filter follows this list).
• Remaining questions undergo a rigorous review by a team of computer science educators. Each question is rated on accuracy, relevance, clarity, and educational value to ensure the content's suitability for educational use. Each criterion is scored on a 1–5 scale, ensuring that only high-quality questions are considered for use in assessments.
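The sketch below shows what the first-stage filter might look like. It assumes each generated question is stored as a record with its text, its prompted topic, and any embedded Python code snippet; the keyword lists and the two-keyword relevance threshold are illustrative assumptions, not the study's actual implementation, and Java snippets would need a separate syntax check (e.g., invoking a compiler).

import ast
import re

# Illustrative topic keyword lists; the study's actual relevance criteria are not specified.
TOPIC_KEYWORDS = {
    "java_inheritance": {"class", "extends", "inheritance", "superclass", "subclass"},
    "python_lists": {"list", "append", "index", "slice", "element"},
}

def python_snippet_is_valid(snippet: str) -> bool:
    # A Python snippet passes if it parses without raising a SyntaxError.
    try:
        ast.parse(snippet)
        return True
    except SyntaxError:
        return False

def is_relevant(question_text: str, topic: str) -> bool:
    # Keep a question only if it mentions at least two keywords for its topic.
    words = set(re.findall(r"[a-z_]+", question_text.lower()))
    return len(words & TOPIC_KEYWORDS[topic]) >= 2

def filter_questions(records: list[dict]) -> list[dict]:
    # Each record is assumed to hold the question text, its prompted topic,
    # and any embedded Python code snippet ("" if none).
    kept = []
    for rec in records:
        snippet_ok = not rec["snippet"] or python_snippet_is_valid(rec["snippet"])
        if snippet_ok and is_relevant(rec["text"], rec["topic"]):
            kept.append(rec)
    return kept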
4 Future Plans

4.1 Analytical Approach
Statistical analysis will be used to compare the performance of each model based on expert reviews. Metrics such as the number of acceptable questions, the average ratings in clarity and relevance, and the diversity of topics covered will be analyzed. Preliminary statistical methods include descriptive statistics and variance analysis to identify significant differences in model performance. Qualitative feedback from reviewers will highlight strengths and weaknesses, providing deeper insight into each model's educational value.
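A minimal sketch of the planned descriptive statistics and variance analysis is shown below, assuming the expert ratings are exported to a CSV file with one row per rated question; the file name and column names are illustrative.

import pandas as pd
from scipy import stats

# Expert ratings, one row per question: model, clarity, relevance, accuracy, value (each 1-5).
ratings = pd.read_csv("expert_ratings.csv")

# Descriptive statistics: mean rating per model and criterion.
print(ratings.groupby("model")[["clarity", "relevance", "accuracy", "value"]].mean())

# Variance analysis: do the four models differ significantly on clarity?
groups = [grp["clarity"].values for _, grp in ratings.groupby("model")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"One-way ANOVA on clarity: F = {f_stat:.2f}, p = {p_value:.4f}")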
4.2 Tools and Technologies
The study leverages various software tools and platforms:
• Python is used for scripting the initial filtering process and for statistical analysis, utilizing libraries such as Pandas and SciPy.
• Each model is accessed via its respective API: OpenAI for the GPT models and Hugging Face's Transformers library for BERT (a sketch of the BERT access path follows this list).
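Because BERT is a masked language model rather than an autoregressive generator, it must be adapted for generation. One plausible adaptation, sketched below with Hugging Face's fill-mask pipeline, turns a factual statement into a fill-in-the-blank item by masking a key term and using BERT's top completions as the answer and distractors; this is an illustrative approach, not necessarily the adaptation used in the study, and the model name and statement are examples.

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

statement = "In Python, the [MASK] method adds an element to the end of a list."
masked = statement.replace("[MASK]", fill_mask.tokenizer.mask_token)
candidates = fill_mask(masked, top_k=4)  # top prediction as answer, the rest as distractors

print("Fill-in-the-blank item:", statement.replace("[MASK]", "_____"))
for c in candidates:
    print("option:", c["token_str"], f"(score {c['score']:.3f})")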
5 Conclusion
Our methodology provides a comprehensive framework for assessing the potential of different LLMs to effectively contribute to the generation of educational content. The results of this study will inform best practices for integrating AI into the quiz generation process in educational settings. While this study provides valuable insights into the capabilities of different LLMs for quiz generation, a potential limitation is the possible mismatch between the LLMs' pre-training data and programming-specific content. Future work could explore fine-tuning models on domain-specific data to improve accuracy and relevance for educational use.

References
[1] Paul Black and Dylan Wiliam. 2009. Developing the theory of formative assessment. Educational Assessment, Evaluation and Accountability 21, 1 (February 2009), 5–31. https://doi.org/10.1007/s11092-008-9068-5
[2] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020). https://arxiv.org/abs/2005.14165
[3] Sarah Cohen, Werner Nutt, and Yehoshua Sagiv. 2007. Deciding equivalences among conjunctive aggregate queries. Journal of the ACM 54, 2 (April 2007), 5–50. https://doi.org/10.1145/1219092.1219093
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019. https://www.aclweb.org/anthology/N19-1423
[5] Rosemary Luckin, Wayne Holmes, Mark Griffiths, and Laurie B. Forcier. 2016. Intelligence Unleashed: An Argument for AI in Education. Pearson Education, London.
