
From GPT to BERT:

Benchmarking Large Language Models for Automated Quiz Generation

Yetunde Folajimi
School of Computing and Data Science
Wentworth Institute of Technology
Boston, Massachuses, USA
folajimiy@wit.edu

Abstract
This study evaluates the effectiveness of four leading large language models (LLMs), GPT-3, GPT-4, GPT-4o, and BERT, in generating quiz questions for Java and Python programming courses. We aim to understand how LLMs can effectively produce educationally valuable questions that meet specific pedagogical criteria, including technical precision, relevance to course objectives, linguistic clarity, and pedagogical appropriateness. Each model was prompted to generate 200 Java and 200 Python quiz questions, totaling 1600 unique questions. These questions are currently being evaluated based on both quantitative and qualitative assessments by a team of computer science educators. Preliminary findings suggest that GPT-4 outperforms BERT in terms of technical precision. Further analysis is ongoing to assess the performance of the models in generating contextually appropriate and educationally useful questions, offering insights into their potential integration into computer science curricula. This work seeks to contribute to the broader discourse on the utility of LLMs in educational settings, specifically within the scope of automated content creation to enhance teaching and assessment methodologies in computer science education.

CCS Concepts
• Applied computing~Education~E-learning • Computing methodologies~Artificial intelligence • Applied computing~Education~Interactive learning environments

Keywords
Formative Assessment, Automated Assessment, Computer Science Education, Personalized Quizzes, Quiz Questions Generation, Large Language Models

ACM Reference format:
Yetunde Folajimi. 2024. From GPT to BERT: Benchmarking Large Language Models for Automated Quiz Generation. In Proceedings of the 2024 ACM Virtual Global Computing Education Conference V. 2 (SIGCSE Virtual 2024), December 5-8, 2024, Virtual Event, NC, USA. ACM, New York, NY, USA, 2 pages. https://doi.org/10.1145/3649409.3691090

1 Introduction
Formative assessment is a critical educational practice that facilitates learning by providing timely and actionable feedback to students, thereby enhancing their understanding and retention of subject matter. Unlike summative assessments, which aim to evaluate student learning at the end of an instructional period, formative assessments are integrated throughout the educational process, offering continuous information on student progress and areas that need improvement [1]. This is especially vital in disciplines like computer science, where concepts can be abstract and complex, requiring iterative practice and reinforcement.
Recent advances in artificial intelligence (AI), particularly in natural language processing, have opened new avenues to enhance educational practices. Large language models (LLMs), such as OpenAI's GPT-3 and GPT-4, and Google's BERT, have demonstrated remarkable capabilities to generate coherent and contextually relevant text. These models are trained on diverse datasets comprising a vast array of internet text, enabling them to perform a variety of tasks that include content creation, question answering, and even tutoring [2].
The potential of LLMs to revolutionize educational tools by automating the creation of diverse and adaptive educational content is significant. Specifically, in the realm of formative assessments, LLMs can be used to generate a wide range of quiz questions tailored to individual learning trajectories, supporting educators in delivering personalized learning experiences and reducing their workload in content preparation. This study explores the integration of LLMs into the process of generating formative assessment questions in computer science education to identify efficient ways to enhance pedagogical support and student engagement through technology.
This work builds on existing research indicating the benefits of AI in educational settings, where AI tools have been shown to improve engagement and outcomes by providing learning experiences that are adaptive to student needs [5]. By integrating LLMs into formative assessment practices, this study seeks to further understand and harness AI's potential to transform educational methodologies in computer science.

2 Background
LLMs, such as those developed by OpenAI, represent a significant advancement in natural language processing (NLP) technologies. These models, including GPT-3 and GPT-4, are a type of neural network architecture known as transformers, which have been trained on diverse datasets that encompass a vast swath of human knowledge expressed in text form. The key to their success lies in their ability to generate text that is not only contextually relevant but also remarkably coherent [2].


LLMs have been increasingly incorporated into various applications beyond traditional NLP tasks. In the educational sector, these models have demonstrated potential as tools for content generation, tutoring systems, and automated assessment creation. The ability of LLMs to generate text based on given prompts makes them particularly suitable for creating educational materials and assessments that are both diverse and adaptive [5].
Previous research has shown that the use of AI in education can lead to more personalized learning experiences. LLMs, for example, have been used to automatically generate learning content adapted to individual student needs [1]. In addition, these models have been used to generate assessments that can adapt to the level of knowledge of the students, providing immediate feedback and guiding their learning journey more effectively [3].
Despite their advantages, integrating LLMs into educational practices presents challenges, primarily concerning the accuracy and relevance of the content generated, especially in technical subjects such as computer science. This study aims to address these challenges by exploring the capabilities of various LLMs in generating formative assessment questions that are both pedagogically sound and technically accurate.

3 Methodology

3.1 Selection of Large Language Models
This study evaluates four large language models (LLMs). We selected GPT-3, GPT-4, GPT-4o, and BERT due to their differing architectures and widespread use in NLP. GPT models are known for strong generative capabilities, while BERT excels in contextual understanding. This mix offers a comprehensive comparison for quiz question generation in programming courses, where both accuracy and clarity are key. GPT models are autoregressive, designed to predict the next word in a sequence, making them well suited for generative tasks. In contrast, BERT is bidirectional, primarily designed to understand the context of words in a sentence, and was therefore adapted for this study to generate content based on prompts [4].

3.2 Data Preparation
The study uses a dataset of topics in computer science, specifically Java and Python programming, to maintain focus and relevance. The topics were selected based on the core curriculum requirements for introductory and intermediate computer science courses. Each model received personalized prompts that included a brief description of the topic, the difficulty level, and the specific format desired for the question (e.g., multiple choice, fill-in-the-blank).

3.3 Question Generation Process
Each model was asked to generate 200 unique questions for Java and 200 for Python, totaling 800 questions per programming language across the four models and designed to test a range of cognitive skills. This approach allows a direct evaluation of each model's performance, eliminating prompt variation as a factor; an illustrative generation call is sketched after the example prompts below.
• Prompt example for GPT models: "Generate a multiple-choice question about object-oriented principles in Java focusing on class inheritance."
• Prompt example for BERT: "Create a multiple-choice question about the use of lists in Python programming."
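As an illustration of how such prompts can be issued programmatically, the following minimal sketch shows one generation call for the GPT family. It assumes the openai Python package (v1.x) and an API key in the environment; the model name, temperature, and the added answer-format instruction are illustrative choices rather than the study's exact configuration.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Generate a multiple-choice question about object-oriented principles "
    "in Java focusing on class inheritance. Provide four options and mark "
    "the correct answer."
)

response = client.chat.completions.create(
    model="gpt-4",            # swapped for a GPT-3 model or "gpt-4o" per experimental condition
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,          # illustrative value
)

print(response.choices[0].message.content)

In such a setup, the same prompt text would be reused across models so that only the model identifier varies, consistent with eliminating prompt variation as a factor.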
3.4 Ongoing Studies
The generated questions are currently being evaluated through a two-stage process:
• An initial script checks for obvious errors, such as syntax errors or irrelevant content, filtering out unsuitable questions (a minimal sketch of such a filter follows this list).
• Remaining questions undergo a rigorous review by a team of computer science educators. Each question is rated on accuracy, relevance, clarity, and educational value to ensure the content's suitability for educational use. Each criterion is scored on a 1–5 scale, ensuring that only high-quality questions are considered for use in assessments.
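The sketch below shows what the first-stage filter might look like. It assumes each generated question is stored as a record with its text, its prompted topic, and any embedded Python code snippet; the keyword lists and the two-keyword relevance threshold are illustrative assumptions, not the study's actual implementation, and Java snippets would need a separate syntax check (e.g., invoking a compiler).

import ast
import re

# Illustrative topic keyword lists; the study's actual relevance criteria are not specified.
TOPIC_KEYWORDS = {
    "java_inheritance": {"class", "extends", "inheritance", "superclass", "subclass"},
    "python_lists": {"list", "append", "index", "slice", "element"},
}

def python_snippet_is_valid(snippet: str) -> bool:
    # A Python snippet passes if it parses without raising a SyntaxError.
    try:
        ast.parse(snippet)
        return True
    except SyntaxError:
        return False

def is_relevant(question_text: str, topic: str) -> bool:
    # Keep a question only if it mentions at least two keywords for its topic.
    words = set(re.findall(r"[a-z_]+", question_text.lower()))
    return len(words & TOPIC_KEYWORDS[topic]) >= 2

def filter_questions(records: list[dict]) -> list[dict]:
    # Each record is assumed to hold the question text, its prompted topic,
    # and any embedded Python code snippet ("" if none).
    kept = []
    for rec in records:
        snippet_ok = not rec["snippet"] or python_snippet_is_valid(rec["snippet"])
        if snippet_ok and is_relevant(rec["text"], rec["topic"]):
            kept.append(rec)
    return kept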
4 Future Plans

4.1 Analytical Approach
Statistical analysis will be used to compare the performance of each model based on expert reviews. Metrics such as the number of acceptable questions, the average ratings in clarity and relevance, and the diversity of topics covered will be analyzed. Preliminary statistical methods include descriptive statistics and variance analysis to identify significant differences in model performance. Qualitative feedback from reviewers will highlight strengths and weaknesses, providing deeper insight into each model's educational value.
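A minimal sketch of the planned descriptive statistics and variance analysis is shown below, assuming the expert ratings are exported to a CSV file with one row per rated question; the file name and column names are illustrative.

import pandas as pd
from scipy import stats

# Expert ratings, one row per question: model, clarity, relevance, accuracy, value (each 1-5).
ratings = pd.read_csv("expert_ratings.csv")

# Descriptive statistics: mean rating per model and criterion.
print(ratings.groupby("model")[["clarity", "relevance", "accuracy", "value"]].mean())

# Variance analysis: do the four models differ significantly on clarity?
groups = [grp["clarity"].values for _, grp in ratings.groupby("model")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"One-way ANOVA on clarity: F = {f_stat:.2f}, p = {p_value:.4f}")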
4.2 Tools and Technologies
The study leverages various software tools and platforms:
• Python is used for scripting the initial filtering process and for statistical analysis, utilizing libraries such as Pandas and SciPy.
• Each model is accessed via its respective API: OpenAI for the GPT models and Hugging Face's Transformers library for BERT (a sketch of the BERT access path follows this list).
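Because BERT is a masked language model rather than an autoregressive generator, it must be adapted for generation. One plausible adaptation, sketched below with Hugging Face's fill-mask pipeline, turns a factual statement into a fill-in-the-blank item by masking a key term and using BERT's top completions as the answer and distractors; this is an illustrative approach, not necessarily the adaptation used in the study, and the model name and statement are examples.

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

statement = "In Python, the [MASK] method adds an element to the end of a list."
masked = statement.replace("[MASK]", fill_mask.tokenizer.mask_token)
candidates = fill_mask(masked, top_k=4)  # top prediction as answer, the rest as distractors

print("Fill-in-the-blank item:", statement.replace("[MASK]", "_____"))
for c in candidates:
    print("option:", c["token_str"], f"(score {c['score']:.3f})")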
5 Conclusion
Our methodology provides a comprehensive framework for assessing the potential of different LLMs to effectively contribute to the generation of educational content. The results of this study will inform best practices for integrating AI into the quiz generation process in educational settings. While this study provides valuable insights into the capabilities of different LLMs for quiz generation, a potential limitation is the possible mismatch between the LLMs' pre-training data and programming-specific content. Future work could explore fine-tuning models on domain-specific data to improve accuracy and relevance for educational use.

References
[1] Paul Black and Dylan Wiliam. 2009. Developing the theory of formative assessment. Educational Assessment, Evaluation and Accountability 21, 1 (February 2009), 5–31. https://doi.org/10.1007/s11092-008-9068-5
[2] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020). https://arxiv.org/abs/2005.14165
[3] Sarah Cohen, Werner Nutt, and Yehoshua Sagiv. 2007. Deciding equivalences among conjunctive aggregate queries. Journal of the ACM 54, 2 (April 2007), 5–50. https://doi.org/10.1145/1219092.1219093
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019. https://www.aclweb.org/anthology/N19-1423
[5] Rosemary Luckin, Wayne Holmes, Mark Griffiths, and Laurie B. Forcier. 2016. Intelligence Unleashed: An Argument for AI in Education. Pearson Education, London.
