A Survey on Evaluation of Large Language Models
Abstract—Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their
unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their
evaluation becomes increasingly critical, not only at the task level, but also at the societal level, for a better understanding of their potential
risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a
comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate,
and how to evaluate. Firstly, we provide an overview from the perspective of evaluation tasks, encompassing general natural language
processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas.
Secondly, we answer the ‘where’ and ‘how’ questions by diving into the evaluation methods and benchmarks, which serve as crucial
components in assessing the performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally,
we shed light on several future challenges that lie ahead in LLMs evaluation. Our aim is to offer invaluable insights to researchers in the
realm of LLMs evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated
as an essential discipline to better assist the development of LLMs. We consistently maintain the related open-source materials at:
https://github.com/MLGroupJLU/LLM-eval-survey.
Fig. 1. Structure of the survey: existing work on LLMs evaluation, organized by what to evaluate (Sec. 3), where to evaluate (Sec. 4), how to evaluate (Sec. 5), summary (Sec. 6), and challenges (Sec. 7).

LLMs evaluation
- What to evaluate (Sec. 3)
  - Reasoning: (Bian et al., 2023) / (Qin et al., 2023) / (Orrù et al., 2023) / (Bang et al., 2023) / (Saparov et al., 2023) / (Fu et al., 2023) / (Liévin et al., 2022) / (Xu et al., 2023a) / (Liu et al., 2023a)
  - Multilingual: (Lai et al., 2023) / (Bang et al., 2023) / (Abdelali et al., 2023) / (Zhang et al., 2023c)
  - Factuality: (Honovich et al., 2022) / (Pezeshkpour, 2023) / (Gekhman et al., 2023) / (Manakul et al., 2023a) / (Min et al., 2023) / (Wang et al., 2023b)
  - Robustness / Ethics and biases / Trustworthiness
    - Robustness: (Zhao et al., 2023b) / (Wang et al., 2022) / (Wang et al., 2023c) / (Zhuo et al., 2023b) / (Yang et al., 2022) / (Li et al., 2023b) / (Zhu et al., 2023)
    - Ethics and biases: (Gehman et al., 2020) / (Sheng et al., 2021) / (Zhuo et al., 2023a) / (Dhamala et al., 2021) / (Parrish et al., 2022) / (Deshpande et al., 2023) / (Ferrara, 2023) / (Rutinowski et al., 2023) / (Hartmann et al., 2023) / (Simmons, 2022) / (Cao et al., 2023) / (Wang et al., 2023e)
  - Social science: (Wu et al., 2023a) / (Ziems et al., 2023) / (Deroy et al., 2023) / (Nay et al., 2023) / (Frank, 2023)
  - Natural science & engineering
    - Mathematics: (Yuan et al., 2023) / (Arora et al., 2023) / (Wu et al., 2023b) / (Collins et al., 2023) / (Dao and Le, 2023) / (Wei et al., 2023) / (Bubeck et al., 2023)
    - Science: (Castro Nascimento and Pimentel, 2023) / (Arora et al., 2023)
    - Engineering: (Liu et al., 2023b) / (Sridhara et al., 2023) / (Valmeekam et al., 2022) / (Valmeekam et al., 2023) / (Pallagani et al., 2023) / (Bubeck et al., 2023)
  - Medical applications
    - Medical QA: (Thirunavukarasu et al., 2023) / (Duong and Solomon, 2023) / (Samaan et al., 2023) / (Holmes et al., 2023) / (Johnson et al., 2023) / (Chervenak et al., 2023) / (Hamidi and Roberts, 2023) / (Jahan et al., 2023)
    - Medical assistants: (Lahat et al., 2023) / (Wang et al., 2023h) / (Khan et al., 2023)
  - Agent applications: (Huang et al., 2023a) / (Karpas et al., 2022) / (?) / (Schick et al., 2023) / (Shen et al., 2023)
  - Other applications
    - Education: (Dai et al., 2023) / (de Winter, 2023) / (Wang and Demszky, 2023) / (Hellas et al., 2023) / (Wei et al., 2023)
    - Search and recommendation: (Sun et al., 2023) / (Fan et al., 2023) / (Zhang et al., 2023a) / (Xu et al., 2023b)
    - Personality testing: (Song et al., 2023) / (Jentzsch and Kersting, 2023) / (Safdari et al., 2023)
    - Other tasks: (Lanzi and Loiacono, 2023) / (Wang et al., 2023g) / (Le and Zhang, 2023)
- Where to evaluate (Sec. 4)
  - General benchmarks: AlpacaEval (Li et al., 2023c) / KoLA (Yu et al., 2023) / DynaBench (Kiela et al., 2021) / AGIEval (Zhong et al., 2023) / PromptBench (Zhu et al., 2023) / PandaLM (Wang et al., 2023g) / OpenLLM (HuggingFace, 2023) / HELM (Liang et al., 2022) / GLUE-X (Yang et al., 2022) / Big-Bench (Srivastava et al., 2022) / C-Eval (Huang et al., 2023b) / MT-Bench (Zheng et al., 2023) / Chatbot Arena (LMSYS, 2023)
  - Specific benchmarks: MultiMedQA (Singhal et al., 2022) / M3Exam (Zhang et al., 2023c) / SOCKET (Choi et al., 2023) / API-Bank (Li et al., 2023a) / ToolBench (ToolBench, 2023)
- How to evaluate (Sec. 5): evaluation criterion
  - Automatic evaluation: (Qin et al., 2023) / (Bang et al., 2023) / (Lin and Chen, 2023) / (Wang et al., 2023g)
  - Human evaluation: (Liang et al., 2022) / (Bang et al., 2023) / (Bubeck et al., 2023) / (Ziems et al., 2023)
- Summary (Sec. 6): benchmarks and evaluations
  - Human-in-the-loop: AdaVision (Gao et al., 2022) / AdaTest (Ribeiro and Lundberg, 2022)
  - Crowd-sourcing testing: DynaBench (Kiela et al., 2021) / DynaBoard (Ma et al., 2021) / DynaTask (Thrush et al., 2022) / DynamicTempLAMA (Margatina et al., 2023)
  - More challenging tasks: DeepTest (Tian et al., 2018) / CheckList (Ribeiro et al., 2020) / AdaFilter (Phang et al., 2021) / HELM (Liang et al., 2022) / Big-Bench (Srivastava et al., 2022) / PromptBench (Zhu et al., 2023)
- Challenges (Sec. 7): grand challenges: (1) Designing AGI benchmarks; (2) Complete behavioral evaluation; (3) Robustness evaluation; (4) Dynamic and evolving evaluation; (5) Principled and trustworthy evaluation; (6) Unified evaluation that supports all LLMs tasks; (7) Beyond evaluation: LLMs enhancement
Evaluation is of paramount importance to the success of LLMs for several reasons. First, evaluating LLMs helps us better understand their strengths and weaknesses. For instance, the PromptBench (Zhu et al., 2023) benchmark illustrates that current LLMs are sensitive to adversarial prompts, so careful prompt engineering is necessary for better performance. Second, better evaluations can provide better guidance for human-LLMs interaction, which could inspire future interaction design and implementation. Third, the broad applicability of LLMs underscores the importance of ensuring their safety and reliability, particularly in safety-sensitive sectors such as financial institutions and healthcare facilities. Finally, as LLMs are becoming larger with more emergent abilities, existing evaluation protocols may not be enough to evaluate their capabilities and potential risks. Therefore, we aim to raise the community's awareness of the importance of LLMs evaluation by reviewing the current evaluation protocols and, most importantly, to shed light on future research about designing new LLMs evaluation protocols.

With the introduction of ChatGPT (OpenAI, 2023a) and GPT-4 (OpenAI, 2023b), there have been a number of research efforts aiming at evaluating ChatGPT and other LLMs from different aspects (Fig. 2), encompassing a range of factors such as natural language tasks, reasoning, robustness, trustworthiness, medical applications, and ethical considerations. Despite these efforts, a comprehensive overview capturing the entire gamut of evaluations is still lacking.
While some argue that GPT-4 can be seen as sparks of AGI, others contest this claim due to the human-crafted nature of its evaluation approach.

Fig. 2. Trend of LLMs evaluation papers over time (2020 - Jun. 2023, including Jul. 2023).

This paper serves as the first comprehensive survey on the evaluation of large language models. As depicted in Fig. 1, we explore existing work in three dimensions: 1) what to evaluate, 2) where to evaluate, and 3) how to evaluate. Specifically, "what to evaluate" encapsulates existing evaluation tasks for LLMs, "where to evaluate" involves selecting appropriate datasets and benchmarks for evaluation, while "how to evaluate" is concerned with the evaluation process given appropriate tasks and datasets. These three dimensions are integral to the evaluation of LLMs. We subsequently discuss potential future challenges in the realm of LLMs evaluation.

The contributions of this paper are as follows:
1) We provide a comprehensive overview of LLMs evaluations from three aspects: what to evaluate, where to evaluate, and how to evaluate. Our categorization is general and encompasses the entire life cycle of LLMs evaluation.
2) Regarding what to evaluate, we summarize existing tasks in various areas and obtain insightful conclusions on the success and failure cases of LLMs (Sec. 6), providing experience for future research.
3) As for where to evaluate, we summarize evaluation metrics, datasets, and benchmarks to provide a profound understanding of current LLMs evaluations. In terms of how to evaluate, we explore current protocols and summarize novel evaluation approaches.
4) We further discuss future challenges in evaluating LLMs. We open-source and maintain the related materials of LLMs evaluation at https://github.com/MLGroupJLU/LLM-eval-survey to foster a collaborative community for better evaluations.

The paper is organized as follows. In Sec. 2, we provide basic information on LLMs and AI model evaluation. Then, Sec. 3 reviews existing work from the aspect of "what to evaluate". After that, Sec. 4 covers "where to evaluate", summarizing existing datasets and benchmarks. Sec. 5 discusses how to perform the evaluation. In Sec. 6, we summarize the key findings of this paper. We discuss grand future challenges in Sec. 7, and Sec. 8 concludes the paper.

2 BACKGROUND

2.1 Large Language Models

Language models (LMs) (Devlin et al., 2018; Gao and Lin, 2004; Kombrink et al., 2011) are computational models that have the capability to understand and generate human language. LMs have the transformative ability to predict the likelihood of word sequences or generate new text based on a given input. N-gram models (Brown et al., 1992), the most common type of LM, estimate word probabilities based on the preceding context. However, LMs also face challenges, such as the issue of rare or unseen words, the problem of overfitting, and the difficulty in capturing complex linguistic phenomena. Researchers are continuously working on improving LM architectures and training methods to address these challenges.

Large Language Models (LLMs) (Chen et al., 2021; Kasneci et al., 2023; Zhao et al., 2023a) are advanced language models with massive parameter sizes and exceptional learning capabilities. The core module behind many LLMs such as GPT-3 (Floridi and Chiriatti, 2020), InstructGPT (Ouyang et al., 2022), and GPT-4 (OpenAI, 2023b) is the self-attention module of the Transformer (Vaswani et al., 2017), which serves as the fundamental building block for language modeling tasks. Transformers have revolutionized the field of NLP with their ability to handle sequential data efficiently, allowing for parallelization and capturing long-range dependencies in text. One key feature of LLMs is in-context learning (Brown et al., 2020), where the model is trained to generate text based on a given context or prompt. This enables LLMs to generate more coherent and contextually relevant responses, making them suitable for interactive and conversational applications. Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017; Ziegler et al., 2019) is another crucial aspect of LLMs. This technique involves fine-tuning the model using human-generated responses as rewards, allowing the model to learn from its mistakes and improve its performance over time.

In an autoregressive language model, such as GPT-3 (Floridi and Chiriatti, 2020) and PaLM (Chowdhery et al., 2022), given a context sequence X, the LM aims to predict the next token y. The model is trained by maximizing the probability of the given token sequence conditioned on the context, i.e., P(y|X) = P(y|x_1, x_2, ..., x_{t-1}), where x_1, x_2, ..., x_{t-1} are the tokens in the context sequence and t is the current position. By using the chain rule, the probability of a sequence can be decomposed into a product of per-token conditional probabilities:

P(X) = ∏_{t=1}^{T} P(x_t | x_1, x_2, ..., x_{t-1}),

where T is the sequence length. In this way, the model predicts each token at each position in an autoregressive manner, generating a complete text sequence.
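To make the factorization above concrete, here is a minimal, self-contained sketch of autoregressive generation that uses a toy bigram table in place of a real LLM; the vocabulary, probabilities, and function names are illustrative assumptions rather than anything from the surveyed systems.

    import math
    import random

    # Toy "model": bigram conditionals standing in for P(x_t | x_1, ..., x_{t-1}).
    # A real LLM would condition on the full context via self-attention.
    BIGRAM = {
        "<bos>": {"the": 0.9, "model": 0.1},
        "the": {"model": 0.8, "tokens": 0.2},
        "model": {"predicts": 1.0},
        "predicts": {"tokens": 0.7, "<eos>": 0.3},
        "tokens": {"<eos>": 1.0},
    }

    def next_token_distribution(context):
        """Return the conditional distribution over the next token."""
        return BIGRAM.get(context[-1], {"<eos>": 1.0})

    def generate(max_len=10, seed=0):
        """Sample a sequence token by token; log P(X) = sum_t log P(x_t | x_<t)."""
        random.seed(seed)
        tokens, log_prob = ["<bos>"], 0.0
        while len(tokens) < max_len and tokens[-1] != "<eos>":
            dist = next_token_distribution(tokens)
            choices, weights = zip(*dist.items())
            nxt = random.choices(choices, weights=weights, k=1)[0]
            log_prob += math.log(dist[nxt])
            tokens.append(nxt)
        return tokens, log_prob

    print(generate())

Swapping the bigram table for a neural network that scores the entire context recovers exactly the training objective and decoding loop used by LLMs.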
One common approach to interacting with LLMs is prompt engineering (Clavié et al., 2023; White et al., 2023; Zhou et al., 2022), where users design and provide specific prompt texts to guide LLMs in generating desired responses or completing specific tasks. This is widely adopted in existing evaluation efforts. People can also engage in question-and-answer interactions (Jansson et al., 2021), where they pose questions to the model and receive answers, or engage in dialogue interactions, having natural language conversations with LLMs. In conclusion, LLMs, with their Transformer architecture, in-context learning, and RLHF capabilities, have revolutionized NLP and hold promise in various applications. TABLE 1 provides a brief comparison of traditional ML, deep learning, and LLMs.

2.2 AI Model Evaluation

AI model evaluation is an essential step in assessing the performance of a model. There are several standard model evaluation protocols, including k-fold cross-validation, holdout validation, leave-one-out cross-validation (LOOCV), bootstrap, and reduced set (Berrar, 2019; Kohavi et al., 1995). For instance, k-fold cross-validation divides the dataset into k parts, with one part used as a test set and the rest as training sets, which can reduce the loss of training data and yield a relatively more accurate estimate of model performance (Fushiki, 2011). Holdout validation divides the dataset into a training and a test set, which is computationally cheaper but potentially more biased. LOOCV is a special case of k-fold cross-validation in which only one data point is used as the test set (Wong, 2015). Reduced set trains the model on one dataset and tests it on the remaining data, which is computationally simple, but its applicability is limited. The appropriate evaluation method should be chosen according to the specific problem and data characteristics to obtain reliable performance indicators.

Fig. 3 illustrates the evaluation process of AI models, including LLMs. Some evaluation protocols may not be feasible for evaluating deep learning models due to the extensive training size. Thus, evaluation on a static validation set has long been the standard choice for deep learning models. For instance, computer vision models leverage static test sets such as ImageNet (Deng et al., 2009) and MS COCO (Lin et al., 2014).

3 WHAT TO EVALUATE

On what tasks should we evaluate LLMs to show their performance? On what tasks can we claim the strengths and weaknesses of LLMs? In this section, we divide existing tasks into the following categories: natural language processing tasks, ethics and biases, medical applications, social sciences, natural science and engineering tasks, agent applications (using LLMs as agents), and others. [1]

[1] Note that LLMs are evaluated on various tasks, and the categorization in this paper is only one possible way to classify these works. There are certainly other taxonomies.

3.1 Natural Language Processing Tasks

The initial objective behind the development of language models, particularly large language models, was to enhance performance on natural language processing tasks, encompassing both understanding and generation. Consequently, the majority of evaluation research has been primarily focused on natural language tasks. TABLE 2 summarizes the evaluation aspects of existing research, and we mainly highlight their conclusions in the following. [2]

[2] Several NLP areas have intersections, and thus our categorization of these areas is only one possible way to categorize them.

3.1.1 Natural language understanding

Natural language understanding represents a wide spectrum of tasks that aim to obtain a better understanding of the input sequence. We summarize recent efforts in LLMs evaluation from several aspects.

Sentiment analysis is a task that analyzes and interprets the text to determine the emotional inclination. It is typically a binary (positive and negative) or triple (positive, neutral, and negative) class classification problem. Evaluating sentiment analysis tasks is a popular direction. Liang et al. (2022); Zeng et al. (2022) showed that model performance on this task is often high. ChatGPT's sentiment analysis prediction performance is superior to traditional sentiment analysis methods (Lopez-Lira and Tang, 2023) and comes close to that of GPT-3.5 (Qin et al., 2023). In fine-grained sentiment and emotion cause analysis, ChatGPT also exhibits exceptional performance (Wang et al., 2023i). In low-resource learning environments, LLMs exhibit significant advantages over small language models (Zhang et al., 2023d), but the ability of ChatGPT to understand low-resource languages is limited (Bang et al., 2023).
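As an illustration of how such zero-shot sentiment-analysis evaluations are typically set up, the sketch below formats a prompt per example and scores exact-match accuracy; the query_llm stub, prompt wording, and mini test set are hypothetical placeholders, not an interface from any paper surveyed here.

    def query_llm(prompt: str) -> str:
        """Hypothetical stand-in for an LLM API call.
        Here it returns a canned answer so the sketch runs end to end;
        in practice this would send `prompt` to a hosted or local model."""
        return "positive" if "love" in prompt or "great" in prompt else "negative"

    def build_prompt(text: str) -> str:
        # Zero-shot instruction: no labeled examples are shown to the model.
        return (
            "Classify the sentiment of the following review as 'positive' or "
            f"'negative'.\nReview: {text}\nSentiment:"
        )

    test_set = [
        ("I love this movie, the acting was great.", "positive"),
        ("The plot was dull and the ending made no sense.", "negative"),
    ]

    correct = 0
    for text, gold in test_set:
        prediction = query_llm(build_prompt(text)).strip().lower()
        correct += int(prediction == gold)

    print(f"zero-shot accuracy: {correct / len(test_set):.2f}")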
TABLE 2
Summary of evaluation on natural language processing tasks: NLU (Natural Language Understanding, including SA (Sentiment Analysis), TC (Text Classification), NLI (Natural Language Inference), and other NLU tasks), Rng. (Reasoning), NLG (Natural Language Generation, including Summ. (Summarization), Dlg. (Dialogue), Tran. (Translation), QA (Question Answering), and other NLG tasks), and Mul. (Multilingual tasks) (ordered by the name of the first author).
Reference | SA | TC | NLI | Others (NLU) | Rng. | Summ. | Dlg. | Tran. | QA | Others (NLG) | Mul.
(Abdelali et al., 2023) ✓
(Bian et al., 2023) ✓ ✓
(Bang et al., 2023) ✓ ✓ ✓ ✓ ✓ ✓ ✓
(Bai et al., 2023) ✓
(Chen et al., 2023) ✓
(Choi et al., 2023) ✓
(Chia et al., 2023) ✓
(Fu et al., 2023) ✓
(Gekhman et al., 2023) ✓
(Honovich et al., 2022) ✓ ✓ ✓ ✓
(Lai et al., 2023) ✓
(Laskar et al., 2023) ✓ ✓ ✓ ✓ ✓
(Lopez-Lira and Tang, 2023) ✓
(Liang et al., 2022) ✓ ✓ ✓ ✓
(Lee et al., 2023) ✓
(Lin and Chen, 2023) ✓
(Liévin et al., 2022) ✓
(Liu et al., 2023a) ✓
(Lyu et al., 2023a) ✓
(Manakul et al., 2023a) ✓ ✓
(Min et al., 2023) ✓
(Orrù et al., 2023) ✓
(Peña et al., 2023) ✓
(Pu and Demberg, 2023) ✓ ✓
(Pezeshkpour, 2023) ✓
(Qin et al., 2023) ✓ ✓ ✓ ✓ ✓ ✓
(Riccardi and Desai, 2023) ✓
(Saparov et al., 2023) ✓
(Tao et al., 2023) ✓
(Wang et al., 2023d) ✓
(Wang et al., 2023i) ✓
(Wang et al., 2023b) ✓ ✓
(Xu et al., 2023a) ✓
(Yang and Menczer, 2023) ✓
(Zhang et al., 2023d) ✓
(Zhang et al., 2023c) ✓
In conclusion, LLMs have demonstrated commendable performance in sentiment analysis tasks. Future work should focus on enhancing their capability to understand emotions in under-resourced languages.

Text classification and sentiment analysis are related fields; text classification not only focuses on sentiment but also includes the processing of all kinds of texts and tasks. Liang et al. (2022) showed that GLM-130B is the best-performing model, with an overall accuracy of 85.8% for miscellaneous text classification. Yang and Menczer (2023) found that ChatGPT can produce credibility ratings for a wide range of news outlets, and these ratings have a moderate correlation with those from human experts. Furthermore, ChatGPT achieves acceptable accuracy in a binary classification scenario (AUC = 0.89). Peña et al. (2023) discussed the problem of topic classification for public affairs documents and showed that using an LLM backbone in combination with SVM classifiers is a useful strategy for multi-label topic classification in the public affairs domain, with accuracies over 85%. Overall, LLMs perform well on text classification and can even handle text classification tasks in unconventional problem settings.

Natural language inference (NLI) is the task of determining whether the given "hypothesis" logically follows from the "premise". Qin et al. (2023) showed that ChatGPT outperforms GPT-3.5 on NLI tasks. They also found that ChatGPT excels in handling factual input, which could be attributed to its RLHF training process favoring human feedback. However, Lee et al. (2023) observed that LLMs perform poorly in the scope of NLI and further fail to represent human disagreement, which indicates that LLMs still have large room for improvement in this field.

Semantic understanding refers to the meaning or understanding of language and its associated concepts. It involves the interpretation and comprehension of words, phrases, sentences, and the relationships between them. Semantic processing goes beyond the surface level and focuses on understanding the underlying meaning and intent. Tao et al. (2023) comprehensively evaluated the event semantic processing abilities of LLMs, covering understanding, reasoning, and prediction about event semantics. The results indicated that LLMs possess an understanding of individual events, but their capacity to perceive the semantic similarity among events is constrained. In reasoning tasks, LLMs exhibit robust reasoning abilities for causal and intentional relations, yet their performance on other relation types is comparatively weaker. In prediction tasks, LLMs exhibit enhanced predictive capabilities for future events with increased contextual information.
Riccardi and Desai (2023) explored the semantic proficiency of LLMs and showed that these models perform poorly in evaluating basic phrases. Furthermore, GPT-3.5 and Bard cannot distinguish between meaningful and nonsense phrases, consistently classifying highly nonsensical phrases as meaningful. GPT-4 shows significant improvements, but its performance is still significantly lower than that of humans. In summary, the performance of LLMs in semantic understanding tasks is poor. Future work can start from this aspect and focus on improving performance on such applications.

In the field of social knowledge understanding, Choi et al. (2023) evaluated how well models perform at learning and recognizing concepts of social knowledge. The results reveal that, despite being much smaller in the number of parameters, fine-tuning supervised models such as BERT leads to much better performance than zero-shot use of state-of-the-art LLMs, such as GPT (Radford et al., 2018) and GPT-J-6B (Wang and Komatsuzaki, 2021). This shows that supervised models significantly outperform zero-shot models and that more parameters do not guarantee more social knowledge in this setting.

3.1.2 Reasoning

From TABLE 2, it can be found that evaluating the reasoning ability of LLMs is a popular direction, and more and more articles focus on exploring their reasoning ability. The reasoning task is a very challenging task for an intelligent AI model. It requires the model not only to understand the given information but also to reason and infer from the existing context in the absence of direct answers. At present, the evaluation of reasoning tasks can be roughly classified into mathematical reasoning, commonsense reasoning, logical reasoning, professional-field reasoning, etc.

ChatGPT exhibits a strong capability for arithmetic reasoning, outperforming GPT-3.5 in the majority of tasks (Qin et al., 2023). However, its proficiency in mathematical reasoning still requires improvement (Bang et al., 2023; Frieder et al., 2023; Zhuang et al., 2023). On symbolic reasoning tasks, ChatGPT is mostly worse than GPT-3.5, which may be because ChatGPT is prone to uncertain responses, leading to poor performance (Bang et al., 2023). Through the poor performance of LLMs on counterfactual task variants, Wu et al. (2023c) showed that current LLMs have certain limitations in abstract reasoning ability. In logical reasoning, Liu et al. (2023a) indicated that ChatGPT and GPT-4 outperformed traditional fine-tuning methods on most logical reasoning benchmarks, demonstrating their superiority in logical reasoning. However, both models face challenges when handling new and out-of-distribution data. ChatGPT does not perform as well as other LLMs, including GPT-3.5 and BARD (Qin et al., 2023; Xu et al., 2023a). This is because ChatGPT is designed explicitly for chatting, so it does an excellent job of maintaining rationality. FLAN-T5, LLaMA, GPT-3.5, and PaLM perform well in general deductive reasoning tasks (Saparov et al., 2023). GPT-3.5 is not good at staying oriented during reasoning in the inductive setting (Xu et al., 2023a). For multi-step reasoning, Fu et al. (2023) showed that PaLM and Claude2 are the only two model families that achieve performance similar to (but still worse than) the GPT model family. Moreover, LLaMA-65B is the most robust open-source LLM to date, performing close to code-davinci-002. Some papers separately evaluate the performance of ChatGPT on some reasoning tasks: ChatGPT generally performs poorly on commonsense reasoning tasks, but relatively better than on non-text semantic reasoning (Bang et al., 2023). Meanwhile, ChatGPT also lacks spatial reasoning ability, but exhibits better temporal reasoning. Finally, while the performance of ChatGPT is acceptable on causal and analogical reasoning, it performs poorly on multi-hop reasoning, which is similar to the weakness of other LLMs on complex reasoning (Ott et al., 2023). In professional-domain reasoning tasks, zero-shot InstructGPT and Codex are capable of complex medical reasoning tasks but still need to be further improved (Liévin et al., 2022). In terms of language insight issues, Orrù et al. (2023) demonstrated the potential of ChatGPT for solving verbal insight problems, as ChatGPT's performance was comparable to that of human participants. It should be noted that most of the above conclusions are obtained on specific datasets. Overall, LLMs show great potential in reasoning and a continuous improvement trend, but they still face many challenges and limitations, requiring more in-depth research and optimization.

3.1.3 Natural language generation

Natural language generation (NLG) evaluates the capabilities of LLMs in generating specific texts, which consists of several tasks, including summarization, dialogue generation, machine translation, question answering, and other open-ended generation applications.

Summarization is a generation task that aims to learn a concise abstract for a given text. In this line of evaluation, Liang et al. (2022) showed that TNLG v2 (530B) (Smith et al., 2022) had the highest score in both scenarios, and OPT (175B) (Zhang et al., 2022) ranked second. It is disappointing that ChatGPT sometimes generates a summary longer than the input document (Bang et al., 2023). The fine-tuned BART (Lewis et al., 2019) is still better than zero-shot ChatGPT. Specifically, ChatGPT has zero-shot performance similar to text-davinci-002 (Bang et al., 2023), but performs worse than GPT-3.5 (Qin et al., 2023). In controllable text summarization, Pu and Demberg (2023) showed that ChatGPT summaries are slightly more extractive (i.e., containing more content copied directly from the source) compared to human summaries. The above shows that LLMs, especially ChatGPT, show only average performance on summarization tasks, and their summarization and generalization ability still needs to be improved.

Evaluating the performance of LLMs on dialogue tasks is crucial to the development of dialogue systems and to improving human-computer interaction. Through such evaluation, the natural language processing, context understanding, and generation abilities of the model can be improved, so as to realize more intelligent and natural dialogue systems. Both Claude and ChatGPT generally achieve better performance across all dimensions when compared to GPT-3.5 (Lin and Chen, 2023; Qin et al., 2023). When comparing the Claude and ChatGPT models, both demonstrate competitive performance across different evaluation dimensions, with Claude slightly outperforming ChatGPT in specific configurations.
Bang et al. (2023) tested ChatGPT's response generation in various dialogue settings: 1) knowledge-grounded open-domain dialogue and 2) task-oriented dialogue. The automatic evaluation results showed that the performance of ChatGPT is relatively low compared to GPT-2 fine-tuned on the dataset for knowledge-grounded open-domain dialogue. In task-oriented dialogue, the performance of ChatGPT is acceptable, but it is prone to errors when the following problems occur: long-term multi-turn dependency, fundamental reasoning failure, and extrinsic hallucination.

While LLMs are not trained explicitly for translation tasks, they can indeed show strong performance. Wang et al. (2023d) showed that ChatGPT and GPT-4 demonstrated superior performance compared to commercial machine translation (MT) systems in terms of human evaluation and outperformed most document-level NMT methods in terms of sacreBLEU. When comparing ChatGPT to traditional translation models during contrastive testing, it exhibits lower accuracy. On the other hand, GPT-4 showcases a robust capability in explaining discourse knowledge, despite the possibility of selecting incorrect translation candidates. The results in (Bang et al., 2023) suggested that ChatGPT can perform X → Eng translation well, but it still lacks the ability to perform Eng → X translation. Lyu et al. (2023a) explored several research directions in machine translation using LLMs. This work contributes to the advancement of MT research and underscores the potential of LLMs in enhancing translation capabilities. In summary, while LLMs perform satisfactorily on several translation tasks, there is still room for improvement, e.g., enhancing the translation capability from English to non-English languages.

Question answering is one of the key technologies in the field of human-computer interaction, and it has been widely used in application scenarios such as search engines, intelligent customer service, and intelligent question answering. Measuring the accuracy and efficiency of QA models will have important implications for these applications. Liang et al. (2022) showed that, among all the evaluated models, InstructGPT davinci v2 (175B) performed best in terms of accuracy, robustness, and fairness across the 9 question answering scenarios. GPT-3.5 and ChatGPT achieve significant improvements over GPT-3 on the task of answering general knowledge questions. ChatGPT outperforms GPT-3.5 by over 2% in most domains (Bian et al., 2023; Qin et al., 2023). However, ChatGPT falls slightly behind GPT-3.5 on CommonsenseQA and Social IQA. This is because ChatGPT is likely to be cautious, refusing to give an answer when there is not enough information. Fine-tuned models, including Vicuna and ChatGPT, demonstrate near-perfect performance in terms of their scores, far outperforming models without supervised fine-tuning (Bai et al., 2023; Bang et al., 2023). Laskar et al. (2023) evaluated the effectiveness of ChatGPT on a range of academic datasets, covering various tasks such as answering questions, summarizing text, generating code, reasoning with common sense, solving math problems, translating languages, detecting bias, and addressing ethical issues. Overall, LLMs perform very well on QA tasks and can further improve their handling of social, event, and temporal commonsense knowledge in the future.

There are also other generation tasks. In the field of sentence style transfer, Pu and Demberg (2023) showed that ChatGPT outperformed the previous supervised SOTA model by training on the same subset for few-shot learning, as evident from the higher BLEU score. However, in terms of controlling the formality of sentence style, ChatGPT's performance still exhibits significant differences compared to human behavior. In writing tasks, Chia et al. (2023) found that LLMs perform consistently across writing-based tasks, including informative, professional, argumentative, and creative writing categories, showing that their writing capabilities are general. In text generation quality, Chen et al. (2023) showed that ChatGPT was able to effectively evaluate text quality from various perspectives in the absence of reference texts and outperformed most existing automated metrics. Using ChatGPT to generate numerical scores for text quality was considered the most reliable and effective method among the various testing methods.

3.1.4 Multilingual tasks

Many LLMs are trained on mixed-language training data. While English is the predominant language, the combination of multilingual data indeed helps LLMs gain the ability to process inputs and generate responses in different languages, making them widely adopted and accepted across the globe. However, given the relatively recent emergence of this technology, LLMs are primarily evaluated on English data, and evaluating their multilingual performance is an important aspect that cannot be ignored. Several articles have provided comprehensive, open, and independent evaluations of LLMs' performance on various NLP tasks in different non-English languages, offering appropriate perspectives for future research and applications.

Abdelali et al. (2023) evaluated the performance of ChatGPT on standard Arabic NLP tasks and found that ChatGPT had lower performance compared to SOTA models in the zero-shot setting for most tasks. Bang et al. (2023); Lai et al. (2023); Zhang et al. (2023c) used more languages on more datasets, covered more tasks, and conducted a more comprehensive evaluation of LLMs. The results showed that LLMs (including BLOOM, Vicuna, Claude, ChatGPT, and GPT-4) performed worse for non-Latin languages as well as low-resource languages. Even when the languages are resource-rich, Bang et al. (2023) highlighted that ChatGPT faces a limitation in translating sentences written in non-Latin script languages. The above demonstrates that there are numerous challenges and ample opportunities for enhancement in multilingual tasks for LLMs. Future research should pay attention to multilingual balance and strive to solve the problems of non-Latin languages and low-resource languages to better support users around the world. At the same time, attention should be paid to the impartiality and neutrality of the language to avoid the impact of the model's English bias or other biases on multilingual applications.

3.1.5 Factuality

Factuality in the context of LLMs refers to the extent to which the information or answers provided by the model align with real-world truths and verifiable facts. Factuality in LLMs significantly impacts a variety of tasks and downstream applications, such as question answering systems, information extraction, text summarization, dialogue systems, and automated fact-checking, where incorrect or fabricated information can be especially harmful.
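Several of the machine-translation results in Sec. 3.1.3 above (e.g., Wang et al., 2023d) are reported in sacreBLEU; as a reference point, here is a minimal sketch of computing a corpus-level score with the sacrebleu Python package, using made-up hypothesis and reference sentences (the exact API surface may differ across sacrebleu versions).

    import sacrebleu  # assumes the `sacrebleu` package is installed

    # Hypothetical system outputs and one set of reference translations.
    hypotheses = [
        "The cat sat on the mat.",
        "LLMs can translate many languages.",
    ]
    references = [
        "The cat is sitting on the mat.",
        "LLMs are able to translate many languages.",
    ]

    # corpus_bleu takes the system stream and a list of reference streams.
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    print(f"sacreBLEU: {bleu.score:.2f}")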
3.2.1 Robustness

Contemporary LLMs are vulnerable to adversarial prompts, highlighting the importance of the models' robustness when facing adversarial inputs. As for new adversarial datasets, Wang et al. (2023a) introduced the AdvGLUE++ benchmark data for assessing adversarial robustness and implemented a new evaluation protocol to scrutinize machine ethics via jailbreaking system prompts.

3.2.2 Ethics and bias

LLMs have been found to internalize, spread, and potentially magnify harmful information existing in the crawled training corpora, usually toxic language, such as offensiveness, hate speech, and insults (Gehman et al., 2020), as well as social biases like stereotypes towards people with a particular demographic identity (e.g., gender, race, religion, occupation, and ideology) (Sheng et al., 2021). More recently, Zhuo et al. (2023a) used conventional testing sets and metrics (Dhamala et al., 2021; Gehman et al., 2020; Parrish et al., 2022) to perform a systematic evaluation of ChatGPT's toxicity and social bias, finding that it still exhibits noxious content to some extent. Taking a further step, Deshpande et al. (2023) introduced role-playing into the model and observed an increase in generated toxicity of up to 6x. Furthermore, such role-playing also caused biased toxicity towards specific entities. Rather than simply measuring social biases, Ferrara (2023) investigated the sources, underlying mechanisms, and corresponding ethical consequences of the biases potentially produced by ChatGPT. Beyond social biases, LLMs have also been assessed for political tendency and personality traits (Hartmann et al., 2023; Rutinowski et al., 2023) using questionnaires such as the Political Compass Test and the MBTI test, demonstrating a propensity for progressive views and an ENFJ personality type. In addition, LLMs like GPT-3 were found to have moral biases (Simmons, 2022) in terms of Moral Foundations Theory (Graham et al., 2013), and a systematic bias was found in the GPT-4 alignment assessment (Wang et al., 2023e): by controlling the order of candidate responses, a significant effect on the ranking results could be observed. ChatGPT was also observed to exhibit some bias on cultural values (Cao et al., 2023). Wang et al. (2023a) also incorporated an evaluation dataset specifically aimed at gauging stereotype bias, using both targeted and untargeted system prompts. All these ethical issues might elicit serious risks, impeding the deployment of LLMs and having a profound negative impact on society.

3.2.3 Trustworthiness

Some work focuses on other trustworthiness problems in addition to robustness and ethics. [3] In their 2023 study, DecodingTrust, Wang et al. (2023a) offered a multifaceted exploration of trustworthiness vulnerabilities in the GPT models, especially GPT-3.5 and GPT-4. Their evaluation expanded beyond the typical trustworthiness concerns to include eight critical aspects: toxicity, stereotype bias, adversarial and out-of-distribution robustness, robustness to adversarial demonstrations, privacy, machine ethics, and fairness. DecodingTrust's investigation employs an array of newly constructed scenarios, tasks, and metrics. They revealed that while GPT-4 often showcases improved trustworthiness over GPT-3.5 in standard evaluations, it is simultaneously more susceptible to attacks.

[3] The term 'trustworthiness' in this section refers to other work that covers more than robustness and ethics.

In another study, Hagendorff and Fabi (2023) evaluated LLMs with enhanced cognitive abilities. They found that these models can avoid common human intuitions and cognitive errors, demonstrating super-rational performance. By utilizing cognitive reflection tests and semantic illusion experiments, the researchers gained insights into the psychological aspects of LLMs. This method offers new perspectives for evaluating model biases and ethical issues that may not have been previously identified.

3.3 Social Science

Social science involves the study of human society and individual behavior, including economics, sociology, political science, law, and other disciplines. Evaluating the performance of LLMs in social science is important for academic research, policy formulation, and social problem-solving. Such evaluations can help improve the applicability and quality of models in the social sciences, increasing understanding of human societies and promoting social progress. Wu et al. (2023a) evaluated the potential use of LLMs in addressing scaling and measurement issues in social science and found that LLMs could generate meaningful responses regarding political ideology and significantly improve text-as-data methods in social science.

In computational social science (CSS) tasks, Ziems et al. (2023) presented a comprehensive evaluation of LLMs on several CSS tasks. In classification tasks, LLMs exhibit the lowest absolute performance on event argument extraction, character tropes, implicit hate, and empathy classification, achieving accuracy below 40%. These tasks either involve complex structures (event arguments) or subjective expert taxonomies with semantics that differ from those learned during LLM pretraining. Conversely, LLMs achieve the best performance on misinformation, stance, and emotion classification. When it comes to generation tasks, LLMs often produce explanations that surpass the quality of the gold references provided by crowdworkers. In summary, while LLMs can greatly enhance the traditional CSS research pipeline, they cannot completely replace it.

Some articles also evaluate LLMs on legal tasks. The zero-shot performance of LLMs is mediocre in legal case judgment summarization. LLMs have several problems, including incomplete sentences and words, meaningless sentence merging, and more serious errors such as inconsistent and hallucinated information (Deroy et al., 2023). The results show that further improvement is necessary for LLMs to be useful for case judgment summarization by legal experts. Nay et al. (2023) indicated that LLMs, particularly when combined with prompting enhancements and the correct legal texts, could perform better, but not yet at the level of expert tax lawyers.

Lastly, within the realm of psychology, Frank (2023) adopts an interdisciplinary approach and draws insights from developmental psychology and comparative psychology to explore alternative methods for evaluating the capabilities of large language models (LLMs). By integrating different perspectives, researchers can deepen their understanding of the essence of cognition.
3.5 Medical Applications

The application of LLMs in the medical field has recently gained significant attention. In this section, we review existing efforts in applying LLMs to medical applications. Specifically, we categorize them into four aspects, as shown in TABLE 5: medical QA, medical examination, medical assessment, and medical education.

3.5.1 Medical QA

TABLE 5 shows that, in medical applications, most evaluations of LLMs concern medical question answering. The reason for this trend may be the widespread application and the need for accurate and reliable answers in the medical field. Due to the strong natural language processing and reasoning capabilities of LLMs, they have been widely used in medical QA systems to provide accurate and timely medical information.

Several studies have evaluated the performance of ChatGPT in medical QA, demonstrating its abilities with human respondents (Duong and Solomon, 2023), in QA with bariatric surgery patients (Samaan et al., 2023), for medical physicists (Holmes et al., 2023), in biomedical applications (Jahan et al., 2023), and in many other QA situations (Hamidi and Roberts, 2023; Johnson et al., 2023). As for the limitations, Thirunavukarasu et al. (2023) assessed its performance in primary care and found that ChatGPT's average score in the student comprehensive assessment falls below the passing score, indicating room for improvement. Chervenak et al. (2023) highlight that while ChatGPT can generate responses similar to existing sources for fertility-related clinical prompts, its limitations in reliably citing sources and its potential for fabricating information restrict its clinical utility.

3.5.2 Medical examination

Gilson et al. (2023); Kung et al. (2023); Sharma et al. (2023) evaluate the performance of LLMs in medical exam assessment to explore their potential applications in the USMLE. [4]

[4] https://www.usmle.org/

In (Gilson et al., 2023), ChatGPT's performance in answering USMLE Step 1 and Step 2 exam questions was assessed using novel multiple-choice question sets. The results indicated that ChatGPT achieved varying accuracies across different datasets. However, the presence of out-of-context information was found to be lower compared to the correct answer in the NBME-Free-Step1 and NBME-Free-Step2 datasets. Kung et al. (2023) showed that ChatGPT achieved or approached the passing threshold in these exams with no tailored training. The model demonstrated high consistency and insight, indicating its potential to assist in medical education and clinical decision-making. ChatGPT can be used as a tool to answer medical questions, provide explanations, and support decision-making processes. This offers additional resources and support for medical students and clinicians in their educational and clinical practices. Sharma et al. (2023) indicate that answers generated by ChatGPT are more context-aware, with better deductive reasoning abilities, compared to Google search results.

3.5.3 Medical education

Several studies have evaluated the performance and feasibility of ChatGPT in the medical education field. In the study by Oh et al. (2023), ChatGPT, specifically the GPT-3.5 and GPT-4 models, was evaluated in terms of its understanding of surgical clinical information and its potential impact on surgical education and training. The results indicate an overall accuracy of 46.8% for GPT-3.5 and 76.4% for GPT-4, demonstrating a significant performance difference between the two models. Notably, GPT-4 consistently performs well across different subspecialties, suggesting its capability to comprehend complex clinical information and enhance surgical education and training. Another study by Lyu et al. (2023b) explores the feasibility of utilizing ChatGPT in clinical education, particularly in translating radiology reports into easily understandable language. The findings demonstrate that ChatGPT effectively translates radiology reports into accessible language and provides general recommendations. Furthermore, the quality of ChatGPT has shown improvement compared to GPT-4. These findings suggest that employing large-scale language models in clinical education is feasible, although further efforts are needed to address limitations and unlock their full potential.
There are also limitations and challenges, such as lack of originality, high input requirements, resource constraints, and uncertainty in answers.

3.6 Agent Applications

Instead of focusing solely on general language tasks, LLMs can be utilized as powerful tools in various domains. Equipping LLMs with external tools can greatly expand the capabilities of the model. Huang et al. (2023a) introduce KOSMOS-1, which is capable of understanding general patterns, following instructions, and learning based on context. Karpas et al. (2022) emphasize that knowing when and how to use these external symbolic tools is crucial, and this knowledge is determined by the LLMs' capabilities, especially when these tools can reliably function. In addition, two other studies, Toolformer (Schick et al., 2023) and TALM (Parisi et al., 2022), explore the utilization of tools to enhance language models. Toolformer employs a training approach to determine the optimal usage of specific APIs and integrates the obtained results into subsequent token predictions. TALM, on the other hand, combines indistinguishable tools with text-based methods to augment language models and employs an iterative technique known as "self-play," guided by minimal tool demonstrations. Shen et al. (2023) propose the HuggingGPT framework, which leverages LLMs to connect various artificial intelligence models within the machine learning community (such as Hugging Face), aiming to address artificial intelligence tasks.

3.7 Other Applications

In addition to the categories mentioned above, there have been evaluations of LLMs in various other domains, including education, search and recommendation, personality testing, and specific applications.

3.7.1 Education

LLMs have shown promise in revolutionizing the field of education. They have the potential to contribute significantly to several areas, such as assisting students in improving their writing skills, facilitating better comprehension of complex concepts, expediting the delivery of information, and providing personalized feedback to enhance student engagement. These applications aim to create more efficient and interactive learning experiences, offering students a wider range of educational opportunities. However, to fully harness the potential of LLMs in education, extensive research and ongoing refinement are necessary.

(1) Educational assistant: The evaluation of LLMs for educational assistance aims to investigate and assess their potential contributions to the field of education. Such evaluations can be conducted from various perspectives. According to Dai et al. (2023), ChatGPT demonstrates the ability to generate detailed, fluent, and coherent feedback that surpasses that of human teachers. It can accurately assess student assignments and provide feedback on task completion, thereby assisting in the development of student skills. However, as mentioned by Wang and Demszky (2023), ChatGPT's responses may lack novelty or insightful perspectives regarding teaching improvement. Additionally, the study conducted by Hellas et al. (2023) revealed that LLMs can successfully identify at least one actual problem in student code, although instances of misjudgment were also observed. In conclusion, the utilization of LLMs shows promise in addressing program logic issues, although challenges remain in achieving proficiency in output formatting. It is important to note that while these models can provide valuable insights, they may still generate errors similar to those made by students.

(2) Academic exam: In educational testing, researchers aim to evaluate the effectiveness of LLMs in educational assessments, including automatic scoring, question generation, and learning guidance. de Winter (2023) showed that ChatGPT achieved an average of 71.8% correctness, which is comparable to the average score of all participating students. Subsequently, the evaluation was conducted using GPT-4, and it achieved a score of 8.33. Furthermore, this evaluation showed the effectiveness of leveraging bootstrapping that combines randomness via the "temperature" parameter in diagnosing incorrect answers. Zhang et al. (2023b) claimed that GPT-3.5 can solve MIT math and EECS exams, with GPT-4 achieving better performance. However, this turned out to be unfair, since the correct answers were accidentally included in the prompts.

3.7.2 Search and recommendation

The assessment of LLMs in search and recommendation can be broadly categorized into two areas. Firstly, in the domain of information retrieval, Sun et al. (2023) investigate the effectiveness of generative ranking algorithms, such as ChatGPT and GPT-4, for information retrieval tasks. Experimental results demonstrate that guided ChatGPT and GPT-4 exhibit competitive performance on popular benchmark tests, even outperforming supervised methods. Additionally, the extraction of ChatGPT's ranking functionality into a specialized model shows superior performance when trained on 10K ChatGPT-generated data
compared to training on 400K annotated MS MARCO data on the BEIR dataset (Thakur et al., 2021). Secondly, in the domain of recommendation systems (Fan et al., 2023), LLMs play a crucial role by leveraging natural language processing capabilities to understand user preferences, item descriptions, and contextual information. Incorporating LLMs into recommendation pipelines enables systems to provide more accurate and personalized recommendations, thereby enhancing user experience and improving overall recommendation quality. Zhang et al. (2023a) highlight the potential risks of using ChatGPT for recommendations, as it has been found to produce unfair recommendations. This underscores the importance of evaluating fairness when employing LLMs for recommendation purposes. Furthermore, Xu et al. (2023b) conducted a randomized online experiment to test the behavioral differences of users performing information retrieval tasks via search engine and chatbot tools. Participants were divided into two groups: one using a tool similar to ChatGPT and the other using a tool similar to Google Search. The results show that the ChatGPT group spent less time on all tasks, while the differences in outcomes between the two groups were not significant.

3.7.3 Personality testing

Personality testing aims to measure individuals' personality traits and behavioral tendencies, and LLMs, as powerful natural language processing models, have been widely applied to such tasks. Research conducted by Bodroza et al. (2023) investigated the personality features of Davinci-003 used as a chatbot and found variations in the consistency of its answers, despite its exhibiting prosocial characteristics. However, there remains uncertainty regarding whether the chatbot's responses are driven by conscious self-reflection or algorithmic processes. Song et al. (2023) examined the manifestation of personality in language models and discovered that many models perform unreliably in self-assessment tests and exhibit inherent biases. Therefore, it is necessary to develop specific machine personality measurement tools to enhance reliability. These studies offer vital insights for better understanding LLMs in personality testing. Safdari et al. (2023) proposed a comprehensive approach to conducting effective psychometric testing for the personality traits in text generated by LLMs.

Jentzsch and Kersting (2023) discussed the challenges of incorporating humor into LLMs, particularly ChatGPT. They found that while ChatGPT demonstrates impressive capabilities in NLP tasks, it falls short in generating humorous responses. This study emphasizes the importance of humor in human communication and the difficulties that LLMs face in capturing the subtleties and context-dependent nature of humor. It discusses the limitations of current approaches and highlights the need for further research to develop more sophisticated models that can effectively understand and generate humor.

3.7.4 Specific applications

Furthermore, several studies have investigated the application and evaluation of large language models across diverse tasks, such as game design (Lanzi and Loiacono, 2023), model performance assessment (Wang et al., 2023g), and log parsing (Le and Zhang, 2023). Collectively, these findings contribute to our understanding of the practical implications of employing large language models in various tasks, highlighting their potential and limitations, and offering valuable guidance for improving model performance.

4 WHERE TO EVALUATE: DATASETS AND BENCHMARKS

LLMs evaluation datasets are used to test and compare the performance of different language models on various tasks, as depicted in Sec. 3. These datasets, such as GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019), aim to simulate real-world language processing scenarios and cover diverse tasks such as text classification, machine translation, reading comprehension, and dialogue generation. This section does not discuss any single dataset for language models, but rather benchmarks for LLMs.

As benchmarks for LLMs are evolving, we list 19 popular benchmarks in TABLE 7. [5] Each benchmark focuses on different aspects and evaluation criteria, providing valuable contributions to their respective domains. For a better summarization, we divide these benchmarks into two categories: benchmarks for general language tasks and benchmarks for specific downstream tasks.

[5] Note that as the evaluation of LLMs is a hot research area, it is very likely that we cannot cover all benchmarks. We welcome suggestions and comments to make this list complete.

4.1 Benchmarks for General Tasks

LLMs are designed to solve a vast majority of tasks. To this end, existing benchmarks tend to evaluate performance across different tasks.

Chatbot Arena (LMSYS, 2023) and MT-Bench (Zheng et al., 2023) are two significant benchmarks that contribute to the evaluation and advancement of chatbot models and LLMs in different contexts. Chatbot Arena is a pioneering evaluation benchmark that offers a distinctive and competitive platform to assess and compare the effectiveness of diverse chatbot models. Users can engage with anonymous models and express their preferences via voting. The platform gathers a significant volume of votes, facilitating the evaluation of models' performance in realistic scenarios. Chatbot Arena provides valuable insights into the strengths and limitations of chatbot models, thereby contributing to the progress of chatbot research and advancement.

MT-Bench is a dedicated benchmark for evaluating the performance of LLMs in multi-turn conversation scenarios. It provides a comprehensive set of questions specifically designed for assessing the capabilities of models in handling multi-turn dialogues. MT-Bench possesses several distinguishing features that differentiate it from conventional evaluation methodologies. Notably, it excels in simulating dialogue scenarios representative of real-world settings, thereby facilitating a more precise evaluation of a model's practical performance. Moreover, MT-Bench effectively overcomes the limitations of traditional evaluation approaches, particularly in gauging a model's competence in handling intricate multi-turn dialogue inquiries.

Instead of focusing on specific tasks and evaluation metrics, HELM (Liang et al., 2022) provides a comprehensive assessment of LLMs.
PREPRINT 14
TABLE 7
Summary of existing LLMs evaluation benchmarks (ordered by the name of the first author).
across various aspects such as language understanding, serves as a dedicated evaluation framework for assessing
generation, coherence, context sensitivity, common-sense the performance of foundation models in the domain of
reasoning, and domain-specific knowledge. HELM aims to human-centric standardized exams. OpenLLM (Hugging-
holistically evaluate the performance of language models Face, 2023) serves as an evaluation benchmark by offering a
across different tasks and domains. Big-Bench (Srivastava public competition platform for comparing and assessing
et al., 2022) introduces a diverse collection of 204 challeng- different LLM models’ performance on various tasks. It
ing tasks contributed by 450 authors from 132 institutions. encourages researchers to submit their models and compete
These tasks cover various domains such as math, childhood on different tasks, driving progress and competition in the
development, linguistics, biology, common-sense reasoning, field of LLM research.
social bias, physics, software development etc. The primary As for tasks beyond standard performance, there are
objective of Big-Bench is to evaluate tasks that go beyond benchmarks designed for OOD, adversarial robustness, and
the capabilities of existing language models. fine-tuning. GLUE-X (Yang et al., 2022) is a novel attempt
KoLA (Yu et al., 2023), a Knowledge-Oriented LLMs to create a unified benchmark aimed at evaluating the ro-
Evaluation Benchmark, is specially designed to evaluate the bustness of NLP models in OOD scenarios. This benchmark
language understanding and reasoning abilities of LLMs. It emphasizes the significance of robustness in NLP and pro-
emphasizes the comprehension and utilization of semantic vides insights into measuring and enhancing the robustness
knowledge and inference. KoLA serves as a crucial platform of models. PromptBench (Zhu et al., 2023) centers on the
for researchers to assess the depth of LLMs’ understanding importance of prompt engineering in fine-tuning LLMs. It
and reasoning, thereby propelling progress in language provides a standardized evaluation framework to compare
comprehension models. To allow for crowd-sourcing eval- different prompt engineering techniques and assess their
uations in language tasks, DynaBench (Kiela et al., 2021) impact on model performance. PromptBench facilitates the
was designed for conducting dynamic benchmark testing. It enhancement and optimization of fine-tuning methods for
explores exciting new research directions, such as the impact LLMs. To ensure impartial and equitable evaluation, Pan-
of integration within a loop, characteristics of distributional daLM (Wang et al., 2023g) is introduced as a discriminative
shifts, exploring annotator efficiency, studying the influ- large-scale language model specifically designed to dif-
ence of expert annotators, and enhancing model robustness ferentiate among multiple high-proficiency LLMs through
against targeted adversarial attacks in interactive environ- training. In contrast to conventional evaluation datasets that
ments. Additionally, it contributes to advancing research on predominantly emphasize objective correctness, PandaLM
dynamic data collection and conducting cross-task analysis incorporates crucial subjective elements, including relative
in the domain of general human-computer interaction. conciseness, clarity, adherence to instructions, comprehen-
siveness, and formality.
The main goal of MMLU (Hendrycks et al., 2020) was
to develop a comprehensive test for evaluating the perfor-
mance of text models in multi-task contexts. AlpacaEval (Li
4.2 Benchmarks for Specific Downstream Tasks
et al., 2023c) stands as an automated evaluation benchmark,
which places its focus on assessing the performance of LLMs Other than benchmarks for general tasks, there exist bench-
across various natural language processing tasks. It pro- marks specifically designed for certain downstream tasks.
vides a range of metrics, robustness measures, and diversity MultiMedQA (Singhal et al., 2022) is a medical QA
evaluations to gauge the capabilities of LLMs. AlpacaEval benchmark that focuses on medical examinations, medical
has significantly contributed to advancing LLMs in diverse research, and consumer healthcare questions. It consists of
domains and promoting a deeper understanding of their seven datasets related to medical QA, including six existing
performance. AGIEval, introduced in (Zhong et al., 2023), datasets and one new dataset. The goal of this benchmark
PREPRINT 15
is to evaluate the performance of LLMs in terms of clinical Lin and Chen (2023) proposed LLM-EVAL, a unified multi-
knowledge and QA abilities. dimensional automatic evaluation method for open-domain
Other specific benchmarks include C-Eval (Huang et al., conversations with LLMs. PandaLM (Wang et al., 2023g)
2023b), which is the first extensive benchmark to assess the can achieve reproducible and automated language model
advanced knowledge and reasoning capabilities of founda- assessment by training an LLM that serves as the “judge” to
tion models in Chinese. M3Exam (Zhang et al., 2023c) pro- evaluate different models.
vides a unique and comprehensive evaluation framework Due to the large volume of automatic evaluation papers,
that incorporates multiple languages, modalities, and levels we will not introduce them in detail. The principle of
to test the general capabilities of LLMs in diverse contexts. automatic evaluation is in fact the same as other AI model
SOCKET (Choi et al., 2023) serves as an NLP benchmark evaluation process: we just use some standard metrics to
designed to evaluate the performance of LLMs in learning compute certain values under these metrics, which serves
and recognizing social knowledge concepts. It consists of as indicators for model performance.
several tasks and case studies to assess the limitations of
LLMs in social capabilities.
5.2 Human Evaluation
In addition to existing evaluation benchmarks, there is a
research gap in assessing the effectiveness of utilizing tools The increasingly strengthened capabilities of LLMs have
for LLMs. To address this gap, the API-Bank benchmark (Li certainly gone beyond standard evaluation metrics on gen-
et al., 2023a) is introduced as the first benchmark explicitly eral natural language tasks. Therefore, human evaluation
designed for tool-augmented LLMs. It comprises a compre- becomes a natural choice in some non-standard cases where
hensive Tool-Augmented LLM workflow, encompassing 53 automatic evaluation is not suitable. For instance, in open
commonly used API tools and 264 annotated dialogues, generation tasks where embedded similarity metrics (such
encompassing a total of 568 API calls. Furthermore, the as BERTScore) are not enough, human evaluation is more
ToolBench project (ToolBench, 2023) aims to empower the reliable (Novikova et al., 2017). While some generation tasks
development of large language models that effectively lever- can adopt certain automatic evaluation protocols, human
age the capabilities of general-purpose tools. By providing evaluation in these tasks is more favorable as generation
a platform for creating optimized instruction datasets, the can always go better than standard answers.
ToolBench project seeks to drive progress in language mod- Human evaluation of LLMs is a way to evaluate the
els and enhance their practical applications. quality and accuracy of model-generated results through
human participation. Compared with automatic evaluation,
manual evaluation is closer to the actual application sce-
5 H OW TO E VALUATE nario and can provide more comprehensive and accurate
In this section, we introduce two common evaluation meth- feedback. In the manual evaluation of LLMs, evaluators
ods: automatic evaluation and human evaluation. In fact, (such as experts, researchers, or ordinary users) are usually
the taxonomy of “how to evaluate” is also not definite. invited to evaluate the results generated by the model.
Our categorization is based on whether or not the eval- For example, Ziems et al. (2023) used the annotations
uation criterion can be automatically computed. If it can from experts for generation. By human evaluation, (Liang
be automatically calculated, we categorize it into automatic et al., 2022) performed human evaluation on summarization
evaluation; otherwise, it falls into human evaluation. and disinformation scenarios on 6 models and Bang et al.
(2023) evaluated analogical reasoning tasks. The seminal
evaluation work by Bubeck et al. (2023) did a series of
5.1 Automatic Evaluation human-crafted tests using GPT-4 and they found that GPT-4
Automated evaluation of LLMs is a common and perhaps performs close to or even exceeds human performance on
the most popular evaluation method that usually uses stan- multiple tasks. This evaluation requires human evaluators
dard metrics or indicators and evaluation tools to assess the to actually test and compare the performance of the models,
performance of models, such as accuracy, BLEU (Papineni not just evaluate the models through automated evaluation
et al., 2002), ROUGE (Lin, 2004), BERTScore (Zhang et al., metrics. Note that even human evaluations can have high
2019), to name a few. For instance, we can use BLEU variance and instability, which could be due to cultural
score to quantify the similarity and quality between the and individual differences (Peng et al., 1997). In practical
model-generated text and the reference text in a machine applications, these two evaluation methods are considered
translation task. In fact, most of the existing evaluation and weighed in combination with the actual situation.
efforts adopt this evaluation protocol due to its subjectivity,
automatic computing, and simplicity. Thus, most of the
deterministic tasks, such as natural language understanding 6 S UMMARY
and math problems, often adopt this evaluation protocol. In this section, we summarize the key findings based on our
Compared with human evaluation, automatic evaluation review in sections 3, 4, and 5.
does not require human participation, which saves eval- First of all, we would like to highlight that despite all the
uation costs and takes less time. For example, both (Qin efforts spent on summarizing existing works on evaluation,
et al., 2023) and Bang et al. (2023) use automated evaluation there is no evidence to explicitly show that one certain
methods to evaluate a large number of tasks. Recently, evaluation protocol or benchmark is the most useful and
with the development of LLMs, some advanced automatic successful, but with different characteristics and focuses.
evaluation techniques are also designed to help evaluate. This also demonstrates that not a single model can perform
PREPRINT 16
TABLE 8
Summary of new LLMs evaluation protocols.
Method References
Human-in-the-loop AdaVision (Gao et al., 2022), AdaTest (Ribeiro and Lundberg, 2022)
Crowd-sourcing testing DynaBench (Kiela et al., 2021), DynaBoard (Ma et al., 2021), DynaTask (Thrush et al., 2022), DynamicTempLAMA (Margatina et al., 2023)
More challenging tests DeepTest (Tian et al., 2018), CheckList (Ribeiro et al., 2020), AdaFilter (Phang et al., 2021), HELM (Liang et al., 2022), Big-Bench (Srivastava et al., 2022)
best in all kinds of tasks. The purpose of this survey is to 6.2 Benchmark and Evaluation Protocol
go beyond simply determining the “best” benchmark or With the rapid development and widespread use of LLMs,
evaluation protocol. By summarizing and analyzing existing the importance of evaluating them in practical applications
efforts on LLMs evaluation, we may identify the current and research has become crucial. This evaluation process
success and failure cases of LLMs, derive new trend for should include not only task-level evaluation but also a
evaluation protocols, and most importantly, propose new deep understanding of the potential risks they pose from a
challenges and opportunities for future research. societal perspective. In this section, we summarize existing
benchmark and evaluation protocols in TABLE 8.
6.1 Task: Success and Failure Cases of LLMs
First, a shift from objective calculation to human-in-the-
We now summarize the success and failure cases of LLMs loop testing, allowing for greater human feedback during
in different tasks. Note that all the following conclusions are the evaluation process. AdaVision (Gao et al., 2022), an
made based on existing evaluation efforts and the results are interactive process for testing vision models, enables users
only dependent on specific datasets. to label a small amount of data for model correctness,
which helps users identify and fix coherent failure modes.
6.1.1 What can LLMs do well?
In AdaTest (Ribeiro and Lundberg, 2022), the user filters test
• LLMs demonstrate proficiency in generating text by samples by only selecting high quality tests and organizing
producing fluent and precise linguistic expressions. them into semantically related topics.
• LLMs obtain impressive performance in tasks in- Second, a move from static to crowd-sourcing test sets
volving language understanding, such as sentiment is becoming more common. Tools like DynaBench (Kiela
analysis, and text classification. et al., 2021), DynaBoard (Ma et al., 2021), and DynaTask
• LLMs exhibit robust contextual comprehension, en- (Thrush et al., 2022) rely on crowdworkers to create and
abling them to generate coherent responses that align test hard samples. Additionally, DynamicTempLAMA (Mar-
with the given input. gatina et al., 2023) allows for dynamically constructed time-
• LLMs achieve satisfying performance across several related tests.
natural language processing tasks, including ma- Third, a shift from a unified to a challenging setting in
chine translation, text generation, and question an- evaluating machine learning models. While unified settings
swering. involve a test set with no preference for any specific task,
challenging settings create test sets for specific tasks. Tools
6.1.2 When can LLMs fail?
like DeepTest (Tian et al., 2018) use seeds to generate input
• LLMs may exhibit biases and inaccuracies during transformations for testing, CheckList (Ribeiro et al., 2020)
the generation process, resulting in the production builds test sets based on templates, and AdaFilter (Phang
of biased outputs. et al., 2021) adversarially constructs tests. However, it is
• LLMs have limited abilities in comprehending com- worth noting that AdaFilter may not be entirely fair as it
plex logic and reasoning tasks, often experiencing relies on adversarial examples. HELM (Liang et al., 2022)
confusion or making errors in intricate contexts. evaluates LLMs from different aspects, while the Big-Bench
• LLMs face constraints in handling extensive datasets (Srivastava et al., 2022) platform is used to design hard tasks
and long-term memory, which can pose challenges for machine learning models to tackle. PromptBench (Zhu
in processing lengthy texts and tasks involving long- et al., 2023) aims to evaluate the adversarial robustness
term dependencies. of LLMs by creating adversarial prompts, which is more
• LLMs have limitations in incorporating real-time or challenging and the results demonstrated that current LLMs
dynamic information, making them less suitable for are not robust to adversarial prompts.
tasks that require up-to-date knowledge or rapid
adaptation to changing contexts.
• LLMs is sensitive to prompts, especially adversarial 7 G RAND C HALLENGES AND O PPORTUNITIES
prompts, which triggers new evaluations and algo-
FOR F UTURE R ESEARCH
rithms to improve its robustness.
• In the domain of text summarization, it is observed Evaluation as a new discipline: Our summarization in-
that large models might demonstrate subpar perfor- spires us to redesign a wide spectrum of aspects related to
mance on particular evaluation metrics, which can evaluation in the era of LLMs. In this section, we present
potentially be attributed to inherent limitations or several grand challenges. Our key point is that evaluation
inadequacies within those specific metrics. should be treated as an essential discipline to drive the
• LLMs do not achieve satisfying performance in coun- success of LLMs and other AI models. Existing protocols
terfactual tasks. are not enough to thoroughly evaluate the true capabilities
PREPRINT 17
of LLMs, which poses grand challenges and triggers new unable to accurately assess the evolving abilities of LLMs,
opportunities for future research on LLMs evaluation. given their rapid rate of development. The capabilities of
LLMs may enhance over time which cannot be consistently
evaluated by existing static benchmarks. On the other hand,
7.1 Designing AGI Benchmarks
as LLMs grow increasingly powerful with larger model sizes
As we discussed earlier, while all tasks can potentially and training set sizes, static and public benchmarks are
serve as evaluation tools for LLMs, the question remains likely to be memorized by LLMs, resulting in potential train-
as to which can truly measure AGI capabilities. As we ing data contamination. Therefore, developing dynamic and
expect LLMs to demonstrate AGI abilities, a comprehensive evolving evaluation systems is the key to providing a fair
understanding of the differences between human and AGI evaluation of LLMs.
capacities becomes crucial in the creation of AGI bench-
marks. The prevailing trend seems to conceptualize AGI
as a superhuman entity, thereby utilizing cross-disciplinary 7.5 Principled and Trustworthy Evaluation
knowledge from fields such as education, psychology, and When introducing an evaluation system, it is crucial to
social sciences to design innovative benchmarks. Nonethe- ascertain its integrity and trustworthiness. Therefore, the
less, there remains a plethora of unresolved issues. For in- necessity for trustworthy computing extends to the require-
stance, does it make sense to use human values as a starting ment for reliable evaluation systems as well. This poses a
point for test construction, or should alternative perspec- challenging research question that intertwines with mea-
tives be considered? The process of developing suitable surement theory, probability, and numerous other domains.
AGI benchmarks presents many open questions demanding For instance, how can we ensure that dynamic testing truly
further exploration. generates out-of-distribution examples? There is a scarcity
of research in this domain, and it is hoped that future
work will aim to scrutinize not only the algorithms but the
7.2 Complete Behavioral Evaluation
evaluation system itself.
An idea AGI evaluation should contain not only standard
benchmarks on common tasks, but also evaluations on open
7.6 Unified Evaluation that Supports All LLMs Tasks
tasks such as complete behavioral tests. By behavioral test,
we mean that AGI models should also be evaluated in There are many other research areas of LLMs and we
an open environment. For instance, by treating LLMs as need to develop evaluation systems that can support all
the central controller, we can construct evaluations on a kinds of tasks such as value alignment, safety, verifica-
robot manipulated by LLMs to test its behaviors in real tion, interdisciplinary research, fine-tuning, and others. For
situations. By treating LLMs as a completely intelligent ma- instance, PandaLM (Wang et al., 2023g) is an evaluation
chine, the evaluations of its multi-modal dimensions should system that assists LLMs fine-tuning by providing an open-
also be considered. In fact, complete behavioral evaluations source evaluation model, which can automatically assess the
are complementary to standard AGI benchmarks and they performance of fine-tuning. We expect that more evaluation
should work together for better testing. systems are becoming more general and can be used as
assistance in certain LLMs tasks.
7.3 Robustness Evaluation
7.7 Beyond Evaluation: LLMs Enhancement
Beyond general tasks, it is crucial for LLMs to maintain ro-
bustness against a wide variety of inputs in order to perform Ultimately, evaluation is not the end goal but rather the
optimally for end-users, given their extensive integration starting point. Following the evaluation, there are undoubt-
into daily life. For instance, the same prompts but with dif- edly conclusions to be drawn regarding performance, ro-
ferent grammars and expressions could lead ChatGPT and bustness, stability, and other factors. A proficient evaluation
other LLMs to generate diverse results, indicating that cur- system should not only offer benchmark results but should
rent LLMs are not robust to the inputs. While there are some also deliver an insightful analysis, recommendations, and
prior work on robustness evaluation (Wang et al., 2023c; Zhu guidance for future research and development. For instance,
et al., 2023), there are much room for advancement, such PromptBench (Zhu et al., 2023) provides not only robust-
as including more diverse evaluation sets, examining more ness evaluation results on adversarial prompts but also
evaluation aspects, and developing more efficient evalua- a comprehensive analysis through attention visualization,
tions to generate robustness tasks. Concurrently, the concept elucidating how adversarial texts can result in erroneous
and definition of robustness are constantly evolving. It is responses. The system further offers a word frequency anal-
thus vital to consider updating the evaluation system to ysis to identify robust and non-robust words in the test sets,
better align with emerging requirements related to ethics thus providing prompt engineering guidance for end users.
and bias. Subsequent research can leverage these findings to enhance
LLMs. Another example is that Wang et al. (2023f) first
explored the performance of large vision-language models
7.4 Dynamic and Evolving Evaluation on imbalanced (long-tailed) tasks, which demonstrates the
Existing evaluation protocols for most AI tasks rely on static limitation of current large models. Then, they explored dif-
and public benchmarks, i.e., the evaluation datasets and ferent methodologies to enhance the performance on these
protocols are often publicly available. While this facilitates tasks. In summary, enhancement after evaluation helps to
rapid and convenient evaluation within the community, it is build better LLMs and much can be done in the future.
PREPRINT 18
8 C ONCLUSION Bian, N., Han, X., Sun, L., Lin, H., Lu, Y., and He, B. (2023).
Chatgpt is a knowledgeable but inexperienced solver: An
Evaluation carries profound significance, becoming imper-
investigation of commonsense problem in large language
ative in the advancement of AI models, especially within
models. arXiv preprint arXiv:2303.16421.
the context of large language models. This paper presents
Bodroza, B., Dinic, B. M., and Bojic, L. (2023). Personality
the first survey to give an comprehensive overview of the
testing of gpt-3: Limited temporal reliability, but high-
evaluation on LLMs from three aspects: what to evaluate,
lighted social desirability of gpt-3’s personality instru-
how to evaluate, and where to evaluate. By encapsulating
ments results. arXiv preprint arXiv:2306.04308.
evaluation tasks, protocols, and benchmarks, our aim is to
Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora,
augment understanding of the current status of LLMs, elu-
S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A.,
cidate their strengths and limitations, and furnish insights
Brunskill, E., et al. (2021). On the opportunities and risks
for future LLMs progression.
of foundation models. arXiv preprint arXiv:2108.07258.
Our survey reveals that current LLMs exhibit certain
Brody, N. (1999). What is intelligence? International Review
limitations in numerous tasks, notably reasoning and ro-
of Psychiatry, 11(1):19–25.
bustness tasks. Concurrently, the need for contemporary
Brown, P. F., Della Pietra, V. J., Desouza, P. V., Lai, J. C.,
evaluation systems to adapt and evolve remains evident,
and Mercer, R. L. (1992). Class-based n-gram models of
ensuring the accurate assessment of LLMs’ inherent capa-
natural language. Computational linguistics, 18(4):467–480.
bilities and limitations. We identify several grand challenges
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D.,
that future research should address, with the aspiration that
Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,
LLMs can progressively enhance their service to humanity.
Askell, A., et al. (2020). Language models are few-shot
learners. Advances in neural information processing systems,
D ISCLAIMER 33:1877–1901.
Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J.,
The goal of this paper is mainly to summarize and dis- Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lund-
cuss existing evaluation efforts on large language models. berg, S., et al. (2023). Sparks of artificial general in-
Results and conclusions in each paper are original con- telligence: Early experiments with gpt-4. arXiv preprint
tributions of their corresponding authors, particularly for arXiv:2303.12712.
potential issues in ethics and biases. This paper may discuss Cao, Y., Zhou, L., Lee, S., Cabello, L., Chen, M., and
some side effects of LLMs and the only intention is to foster Hershcovich, D. (2023). Assessing cross-cultural align-
a better understanding of large language models. ment between chatgpt and human societies: An empirical
Additionally, due to the evolution of LLMs especially study. In Proceedings of the First Workshop on Cross-Cultural
online services such as Claude and ChatGPT, it is very likely Considerations in NLP (C3NLP), pages 53–67.
that they become stronger and some of their limitations de- Castro Nascimento, C. M. and Pimentel, A. S. (2023). Do
scribed in this paper are mitigated (and new limitations may large language models understand chemistry? a conver-
arise). We encourage interested readers to take this survey as sation with chatgpt. Journal of Chemical Information and
a reference for future research and conduct real experiments Modeling, 63(6):1649–1655.
in current systems when performing evaluations. Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O.,
Finally, the evaluation of LLMs is continuously develop- Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman,
ing, thus we may miss some new papers or benchmarks. We G., et al. (2021). Evaluating large language models trained
welcome all constructive feedback and suggestions to help on code. arXiv preprint arXiv:2107.03374.
make this survey better. Chen, Y., Wang, R., Jiang, H., Shi, S., and Xu, R. (2023). Ex-
ploring the use of large language models for reference-free
text quality evaluation: A preliminary empirical study.
R EFERENCES arXiv preprint arXiv:2304.00723.
Abdelali, A., Mubarak, H., Chowdhury, S. A., Hasanain, M., Chervenak, J., Lieman, H., Blanco-Breindel, M., and Jindal,
Mousi, B., Boughorbel, S., Kheir, Y. E., Izham, D., Dalvi, S. (2023). The promise and peril of using a large language
F., Hawasly, M., et al. (2023). Benchmarking arabic ai with model to obtain clinical information: Chatgpt performs
large language models. arXiv preprint arXiv:2305.14982. strongly as a fertility counseling tool with limitations.
Arora, D., Singh, H. G., et al. (2023). Have llms advanced Fertility and Sterility.
enough? a challenging problem solving benchmark for Chia, Y. K., Hong, P., Bing, L., and Poria, S. (2023). Instructe-
large language models. arXiv preprint arXiv:2305.15074. val: Towards holistic evaluation of instruction-tuned large
Bai, Y., Ying, J., Cao, Y., Lv, X., He, Y., Wang, X., Yu, J., language models. arXiv preprint arXiv:2306.04757.
Zeng, K., Xiao, Y., Lyu, H., et al. (2023). Benchmarking Choi, M., Pei, J., Kumar, S., Shu, C., and Jurgens, D. (2023).
foundation models with language-model-as-an-examiner. Do llms understand social knowledge? evaluating the so-
arXiv preprint arXiv:2306.04181. ciability of large language models with socket benchmark.
Bang, Y., Cahyawijaya, S., Lee, N., Dai, W., Su, D., Wilie, B., arXiv preprint arXiv:2305.14938.
Lovenia, H., Ji, Z., Yu, T., Chung, W., et al. (2023). A mul- Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra,
titask, multilingual, multimodal evaluation of chatgpt on G., Roberts, A., Barham, P., Chung, H. W., Sutton, C.,
reasoning, hallucination, and interactivity. arXiv preprint Gehrmann, S., et al. (2022). Palm: Scaling language mod-
arXiv:2302.04023. eling with pathways. arXiv preprint arXiv:2204.02311.
Berrar, D. (2019). Cross-validation. Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S.,
PREPRINT 19
and Amodei, D. (2017). Deep reinforcement learning Frank, M. C. (2023). Baby steps in evaluating the capacities
from human preferences. Advances in neural information of large language models. Nature Reviews Psychology,
processing systems, 30. pages 1–2.
Clavié, B., Ciceu, A., Naylor, F., Soulié, G., and Brightwell, Frieder, S., Pinchetti, L., Griffiths, R.-R., Salvatori, T.,
T. (2023). Large language models in the workplace: A case Lukasiewicz, T., Petersen, P. C., Chevalier, A., and Berner,
study on prompt engineering for job type classification. In J. (2023). Mathematical capabilities of chatgpt. arXiv
International Conference on Applications of Natural Language preprint arXiv:2301.13867.
to Information Systems, pages 3–17. Springer. Fu, Y., Ou, L., Chen, M., Wan, Y., Peng, H., and Khot,
Collins, K. M., Jiang, A. Q., Frieder, S., Wong, L., Zilka, T. (2023). Chain-of-thought hub: A continuous effort to
M., Bhatt, U., Lukasiewicz, T., Wu, Y., Tenenbaum, J. B., measure large language models’ reasoning performance.
Hart, W., et al. (2023). Evaluating language models arXiv preprint arXiv:2305.17306.
for mathematics through interactions. arXiv preprint Fushiki, T. (2011). Estimation of prediction error by using k-
arXiv:2306.01694. fold cross-validation. Statistics and Computing, 21:137–146.
Cortes, C. and Vapnik, V. (1995). Support-vector networks. Gallant, S. I. et al. (1990). Perceptron-based learning algo-
Machine learning, 20:273–297. rithms. IEEE Transactions on neural networks, 1(2):179–191.
Dai, W., Lin, J., Jin, F., Li, T., Tsai, Y.-S., Gasevic, D., and Gao, I., Ilharco, G., Lundberg, S., and Ribeiro, M. T. (2022).
Chen, G. (2023). Can large language models provide Adaptive testing of computer vision models. arXiv
feedback to students? a case study on chatgpt. preprint arXiv:2212.02774.
Dao, X.-Q. and Le, N.-B. (2023). Investigating the ef- Gao, J. and Lin, C.-Y. (2004). Introduction to the special issue
fectiveness of chatgpt in mathematical reasoning and on statistical language modeling.
problem solving: Evidence from the vietnamese national Gehman, S., Gururangan, S., Sap, M., Choi, Y., and Smith,
high school graduation examination. arXiv preprint N. A. (2020). Realtoxicityprompts: Evaluating neural
arXiv:2306.06331. toxic degeneration in language models. In Findings of
de Winter, J. C. (2023). Can chatgpt pass high school the Association for Computational Linguistics: EMNLP 2020,
exams on english language comprehension. Researchgate. pages 3356–3369.
Preprint. Gekhman, Z., Herzig, J., Aharoni, R., Elkind, C., and Szpek-
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei- tor, I. (2023). Trueteacher: Learning factual consistency
Fei, L. (2009). Imagenet: A large-scale hierarchical image evaluation with large language models. arXiv preprint
database. In 2009 IEEE conference on computer vision and arXiv:2305.11171.
pattern recognition, pages 248–255. Ieee. Gilson, A., Safranek, C. W., Huang, T., Socrates, V., Chi,
Deroy, A., Ghosh, K., and Ghosh, S. (2023). How L., Taylor, R. A., Chartash, D., et al. (2023). How does
ready are pre-trained abstractive models and llms for chatgpt perform on the united states medical licensing
legal case judgement summarization? arXiv preprint examination? the implications of large language models
arXiv:2306.01248. for medical education and knowledge assessment. JMIR
Deshpande, A., Murahari, V., Rajpurohit, T., Kalyan, A., Medical Education, 9(1):e45312.
and Narasimhan, K. (2023). Toxicity in chatgpt: Ana- Graham, J., Haidt, J., Koleva, S., Motyl, M., Iyer, R., Wojcik,
lyzing persona-assigned language models. arXiv preprint S. P., and Ditto, P. H. (2013). Moral foundations theory:
arXiv:2304.05335. The pragmatic validity of moral pluralism. In Advances
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). in experimental social psychology, volume 47, pages 55–130.
Bert: Pre-training of deep bidirectional transformers for Elsevier.
language understanding. arXiv preprint arXiv:1810.04805. Hagendorff, T. and Fabi, S. (2023). Human-like intuitive be-
Dhamala, J., Sun, T., Kumar, V., Krishna, S., Pruksachatkun, havior and reasoning biases emerged in language models
Y., Chang, K.-W., and Gupta, R. (2021). Bold: Dataset – and disappeared in gpt-4.
and metrics for measuring biases in open-ended language Hamidi, A. and Roberts, K. (2023). Evaluation of ai chat-
generation. In Proceedings of the 2021 ACM conference on bots for patient-specific ehr questions. arXiv preprint
fairness, accountability, and transparency, pages 862–872. arXiv:2306.02549.
Duong, D. and Solomon, B. D. (2023). Analysis of large- Hartmann, J., Schwenzow, J., and Witte, M. (2023). The po-
language model versus human performance for genetics litical ideology of conversational ai: Converging evidence
questions. European Journal of Human Genetics, pages 1–3. on chatgpt’s pro-environmental, left-libertarian orienta-
Fan, W., Zhao, Z., Li, J., Liu, Y., Mei, X., Wang, Y., Tang, J., tion. arXiv preprint arXiv:2301.01768.
and Li, Q. (2023). Recommender systems in the era of Hellas, A., Leinonen, J., Sarsa, S., Koutcheme, C., Kujanpää,
large language models (llms). L., and Sorva, J. (2023). Exploring the responses of
Fansi Tchango, A., Goel, R., Wen, Z., Martel, J., and Ghosn, J. large language models to beginner programmers’ help
(2022). Ddxplus: A new dataset for automatic medical di- requests. arXiv preprint arXiv:2306.05715.
agnosis. Advances in Neural Information Processing Systems, Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika,
35:31306–31318. M., Song, D., and Steinhardt, J. (2020). Measuring mas-
Ferrara, E. (2023). Should chatgpt be biased? challenges sive multitask language understanding. arXiv preprint
and risks of bias in large language models. arXiv preprint arXiv:2009.03300.
arXiv:2304.03738. Holmes, J., Liu, Z., Zhang, L., Ding, Y., Sio, T. T., McGee,
Floridi, L. and Chiriatti, M. (2020). Gpt-3: Its nature, scope, L. A., Ashman, J. B., Li, X., Liu, T., Shen, J., et al.
limits, and consequences. Minds and Machines, 30:681–694. (2023). Evaluating large language models on a highly-
PREPRINT 20
specialized topic, radiation oncology physics. arXiv Khan, Y. A., Hokia, C., Xu, J., and Ehlert, B. (2023). covllm:
preprint arXiv:2304.01938. Large language models for covid-19 biomedical literature.
Honovich, O., Aharoni, R., Herzig, J., Taitelbaum, H., Kuk- arXiv preprint arXiv:2306.04926.
liansy, D., Cohen, V., Scialom, T., Szpektor, I., Hassidim, Kiela, D., Bartolo, M., Nie, Y., Kaushik, D., Geiger, A., Wu,
A., and Matias, Y. (2022). True: Re-evaluating factual Z., Vidgen, B., Prasad, G., Singh, A., Ringshia, P., et al.
consistency evaluation. arXiv preprint arXiv:2204.04991. (2021). Dynabench: Rethinking benchmarking in nlp.
Huang, S., Dong, L., Wang, W., Hao, Y., Singhal, S., Ma, S., arXiv preprint arXiv:2104.14337.
Lv, T., Cui, L., Mohammed, O. K., Liu, Q., et al. (2023a). Kohavi, R. et al. (1995). A study of cross-validation and
Language is not all you need: Aligning perception with bootstrap for accuracy estimation and model selection. In
language models. arXiv preprint arXiv:2302.14045. Ijcai, volume 14, pages 1137–1145. Montreal, Canada.
Huang, Y., Bai, Y., Zhu, Z., Zhang, J., Zhang, J., Su, T., Liu, Kombrink, S., Mikolov, T., Karafiát, M., and Burget, L.
J., Lv, C., Zhang, Y., Lei, J., et al. (2023b). C-eval: A (2011). Recurrent neural network based language mod-
multi-level multi-discipline chinese evaluation suite for eling in meeting recognition. In Interspeech, volume 11,
foundation models. arXiv preprint arXiv:2305.08322. pages 2877–2880.
HuggingFace (2023). Open-source large language Kung, T. H., Cheatham, M., Medenilla, A., Sillos, C.,
models leaderboard. https://huggingface.co/spaces/ De Leon, L., Elepaño, C., Madriaga, M., Aggabao, R.,
HuggingFaceH4/open llm leaderboard. Diaz-Candido, G., Maningo, J., et al. (2023). Performance
Jahan, I., Laskar, M. T. R., Peng, C., and Huang, J. (2023). of chatgpt on usmle: Potential for ai-assisted medical ed-
Evaluation of chatgpt on biomedical tasks: A zero- ucation using large language models. PLoS digital health,
shot comparison with fine-tuned generative transformers. 2(2):e0000198.
arXiv preprint arXiv:2306.04504. Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M.,
Jansson, M., Hrastinski, S., Stenbom, S., and Enoksson, F. Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Kelcey,
(2021). Online question and answer sessions: How stu- M., Devlin, J., Lee, K., Toutanova, K. N., Jones, L., Chang,
dents support their own and other students’ processes of M.-W., Dai, A., Uszkoreit, J., Le, Q., and Petrov, S. (2019).
inquiry in a text-based learning environment. The Internet Natural questions: a benchmark for question answering
and Higher Education, 51:100817. research. Transactions of the Association of Computational
Jentzsch, S. and Kersting, K. (2023). Chatgpt is fun, but it Linguistics.
is not funny! humor is still challenging large language Lahat, A., Shachar, E., Avidan, B., Shatz, Z., Glicksberg,
models. arXiv preprint arXiv:2306.04563. B. S., and Klang, E. (2023). Evaluating the use of large
Johnson, D., Goodman, R., Patrinely, J., Stone, C., Zimmer- language model in identifying top research questions in
man, E., Donald, R., Chang, S., Berkowitz, S., Finn, A., gastroenterology. Scientific reports, 13(1):4164.
Jahangir, E., et al. (2023). Assessing the accuracy and re- Lai, V. D., Ngo, N. T., Veyseh, A. P. B., Man, H., Dernoncourt,
liability of ai-generated medical responses: an evaluation F., Bui, T., and Nguyen, T. H. (2023). Chatgpt beyond
of the chat-gpt model. english: Towards a comprehensive evaluation of large
Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. (2017). language models in multilingual learning. arXiv preprint
Triviaqa: A large scale distantly supervised challenge arXiv:2304.05613.
dataset for reading comprehension. In Proceedings of the Lanzi, P. L. and Loiacono, D. (2023). Chatgpt and other
55th Annual Meeting of the Association for Computational large language models as evolutionary engines for on-
Linguistics, Vancouver, Canada. Association for Compu- line interactive collaborative game design. arXiv preprint
tational Linguistics. arXiv:2303.02155.
Kadavath, S., Conerly, T., Askell, A., Henighan, T. J., Drain, Laskar, M. T. R., Bari, M. S., Rahman, M., Bhuiyan, M. A. H.,
D., Perez, E., Schiefer, N., Dodds, Z., DasSarma, N., Tran- Joty, S., and Huang, J. X. (2023). A systematic study
Johnson, E., Johnston, S., El-Showk, S., Jones, A., Elhage, and comprehensive evaluation of chatgpt on benchmark
N., Hume, T., Chen, A., Bai, Y., Bowman, S., Fort, S., Gan- datasets. arXiv preprint arXiv:2305.18486.
guli, D., Hernandez, D., Jacobson, J., Kernion, J., Kravec, Le, V.-H. and Zhang, H. (2023). An evaluation of log parsing
S., Lovitt, L., Ndousse, K., Olsson, C., Ringer, S., Amodei, with chatgpt. arXiv preprint arXiv:2306.01590.
D., Brown, T. B., Clark, J., Joseph, N., Mann, B., McCan- LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning.
dlish, S., Olah, C., and Kaplan, J. (2022). Language models nature, 521(7553):436–444.
(mostly) know what they know. ArXiv, abs/2207.05221. Lee, N., An, N. M., and Thorne, J. (2023). Can large language
Karpas, E., Abend, O., Belinkov, Y., Lenz, B., Lieber, O., models infer and disagree like humans? arXiv preprint
Ratner, N., Shoham, Y., Bata, H., Levine, Y., Leyton-Brown, arXiv:2305.13788.
K., et al. (2022). Mrkl systems: A modular, neuro-symbolic Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed,
architecture that combines large language models, ex- A., Levy, O., Stoyanov, V., and Zettlemoyer, L. (2019). Bart:
ternal knowledge sources and discrete reasoning. arXiv Denoising sequence-to-sequence pre-training for natu-
preprint arXiv:2205.00445. ral language generation, translation, and comprehension.
Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Demen- arXiv preprint arXiv:1910.13461.
tieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, Li, M., Song, F., Yu, B., Yu, H., Li, Z., Huang, F., and Li, Y.
S., Hüllermeier, E., et al. (2023). Chatgpt for good? on (2023a). Api-bank: A benchmark for tool-augmented llms.
opportunities and challenges of large language models for Li, X., Liu, M., Gao, S., and Buntine, W. (2023b). A survey
education. Learning and Individual Differences, 103:102274. on out-of-distribution evaluation of neural nlp models.
Khalfa, J. (1994). What is intelligence? Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin,
PREPRINT 21
C., Liang, P., and Hashimoto, T. B. (2023c). Alpacaeval: multiple views. arXiv preprint arXiv:2302.12297.
An automatic evaluator of instruction-following models. McCarthy, J. (2007). What is artificial intelligence.
https://github.com/tatsu-lab/alpaca eval. Microsoft (2023). Bing chat. https://www.bing.com/new.
Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Min, S., Krishna, K., Lyu, X., Lewis, M., Yih, W.-t., Koh,
Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, P. W., Iyyer, M., Zettlemoyer, L., and Hajishirzi, H.
A., et al. (2022). Holistic evaluation of language models. (2023). Factscore: Fine-grained atomic evaluation of fac-
arXiv preprint arXiv:2211.09110. tual precision in long form text generation. arXiv preprint
Liévin, V., Hother, C. E., and Winther, O. (2022). Can large arXiv:2305.14251.
language models reason about medical questions? arXiv Nay, J. J., Karamardian, D., Lawsky, S. B., Tao, W., Bhat, M.,
preprint arXiv:2207.08143. Jain, R., Lee, A. T., Choi, J. H., and Kasai, J. (2023). Large
Lin, C.-Y. (2004). Rouge: A package for automatic evaluation language models as tax attorneys: A case study in legal
of summaries. In Text summarization branches out, pages capabilities emergence. arXiv preprint arXiv:2306.07075.
74–81. Nie, Y., Williams, A., Dinan, E., Bansal, M., Weston, J.,
Lin, S., Hilton, J., and Evans, O. (2021). Truthfulqa: Measur- and Kiela, D. (2019). Adversarial nli: A new bench-
ing how models mimic human falsehoods. arXiv preprint mark for natural language understanding. arXiv preprint
arXiv:2109.07958. arXiv:1910.14599.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ra- Novikova, J., Dušek, O., Curry, A. C., and Rieser, V. (2017).
manan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft Why we need new evaluation metrics for nlg. arXiv
coco: Common objects in context. In Computer Vision– preprint arXiv:1707.06875.
ECCV 2014: 13th European Conference, Zurich, Switzerland, Oh, N., Choi, G.-S., and Lee, W. Y. (2023). Chatgpt goes to
September 6-12, 2014, Proceedings, Part V 13, pages 740–755. the operating room: evaluating gpt-4 performance and its
Springer. potential in surgical education and training in the era of
Lin, Y.-T. and Chen, Y.-N. (2023). Llm-eval: Unified multi- large language models. Annals of Surgical Treatment and
dimensional automatic evaluation for open-domain con- Research, 104(5):269.
versations with large language models. arXiv preprint OpenAI (2023a). https://chat.openai.com.chat.
arXiv:2305.13711. OpenAI (2023b). Gpt-4 technical report.
Liu, H., Ning, R., Teng, Z., Liu, J., Zhou, Q., and Zhang, Y. Orrù, G., Piarulli, A., Conversano, C., and Gemignani, A.
(2023a). Evaluating the logical reasoning ability of chatgpt (2023). Human-like problem-solving abilities in large
and gpt-4. language models using chatgpt. Frontiers in Artificial
Liu, J., Xia, C. S., Wang, Y., and Zhang, L. (2023b). Is Intelligence, 6.
your code generated by chatgpt really correct? rigorous Ott, S., Hebenstreit, K., Liévin, V., Hother, C. E., Moradi, M.,
evaluation of large language models for code generation. Mayrhauser, M., Praas, R., Winther, O., and Samwald, M.
arXiv preprint arXiv:2305.01210. (2023). Thoughtsource: A central hub for large language
LMSYS (2023). Chatbot arena: Benchmarking llms in the model reasoning data. arXiv preprint arXiv:2301.11596.
wild with elo ratings. https://lmsys.org. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright,
Lopez-Lira, A. and Tang, Y. (2023). Can chatgpt forecast C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray,
stock price movements? return predictability and large A., et al. (2022). Training language models to follow
language models. arXiv preprint arXiv:2304.07619. instructions with human feedback. Advances in Neural
Lyu, C., Xu, J., and Wang, L. (2023a). New trends in machine Information Processing Systems, 35:27730–27744.
translation using large language models: Case examples Pallagani, V., Muppasani, B., Murugesan, K., Rossi, F.,
with chatgpt. arXiv preprint arXiv:2305.01181. Srivastava, B., Horesh, L., Fabiano, F., and Loreggia,
Lyu, Q., Tan, J., Zapadka, M. E., Ponnatapuram, J., Niu, C., A. (2023). Understanding the capabilities of large lan-
Wang, G., and Whitlow, C. T. (2023b). Translating radiol- guage models for automated planning. arXiv preprint
ogy reports into plain language using chatgpt and gpt-4 arXiv:2305.16151.
with prompt learning: Promising results, limitations, and Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002).
potential. arXiv preprint arXiv:2303.09038. Bleu: a method for automatic evaluation of machine trans-
Ma, Z., Ethayarajh, K., Thrush, T., Jain, S., Wu, L., Jia, lation. In Proceedings of the 40th annual meeting of the
R., Potts, C., Williams, A., and Kiela, D. (2021). Dyn- Association for Computational Linguistics, pages 311–318.
aboard: An evaluation-as-a-service platform for holistic Parisi, A., Zhao, Y., and Fiedel, N. (2022). Talm: Tool aug-
next-generation benchmarking. Advances in Neural Infor- mented language models. arXiv preprint arXiv:2205.12255.
mation Processing Systems, 34:10351–10367. Parrish, A., Chen, A., Nangia, N., Padmakumar, V., Phang,
Manakul, P., Liusie, A., and Gales, M. J. (2023a). Self- J., Thompson, J., Htut, P. M., and Bowman, S. (2022). Bbq:
checkgpt: Zero-resource black-box hallucination detection A hand-built bias benchmark for question answering. In
for generative large language models. arXiv preprint Findings of the Association for Computational Linguistics:
arXiv:2303.08896. ACL 2022, pages 2086–2105.
Manakul, P., Liusie, A., and Gales, M. J. F. (2023b). Mqag: Peña, A., Morales, A., Fierrez, J., Serna, I., Ortega-Garcia, J.,
Multiple-choice question answering and generation for Puente, I., Cordova, J., and Cordova, G. (2023). Leverag-
assessing information consistency in summarization. ing large language models for topic classification in the
Margatina, K., Wang, S., Vyas, Y., John, N. A., Benajiba, Y., domain of public affairs. arXiv preprint arXiv:2306.02864.
and Ballesteros, M. (2023). Dynamic benchmarking of Peng, K., Nisbett, R. E., and Wong, N. Y. (1997). Validity
masked language models on temporal concept drift with problems comparing values across cultures and possible
PREPRINT 22
language models–a critical investigation. arXiv preprint Wang, Z., Li, R., Dong, B., Wang, J., Li, X., Liu, N., Mao, C.,
arXiv:2305.15771. Zhang, W., Dong, L., Gao, J., et al. (2023h). Can llms like
Valmeekam, K., Olmo, A., Sreedharan, S., and Kambham- gpt-4 outperform traditional ai tools in dementia diagno-
pati, S. (2022). Large language models still can’t plan sis? maybe, but not today. arXiv preprint arXiv:2306.01499.
(a benchmark for llms on planning and reasoning about Wang, Z., Xie, Q., Ding, Z., Feng, Y., and Xia, R. (2023i). Is
change). arXiv preprint arXiv:2206.10498. chatgpt a good sentiment analyzer? a preliminary study.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B.,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler,
Attention is all you need. Advances in neural information D., et al. (2022a). Emergent abilities of large language
processing systems, 30. models. arXiv preprint arXiv:2206.07682.
Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B.,
J., Hill, F., Levy, O., and Bowman, S. (2019). Superglue: A Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler,
stickier benchmark for general-purpose language under- D., hsin Chi, E. H., Hashimoto, T., Vinyals, O., Liang, P.,
standing systems. Advances in neural information processing Dean, J., and Fedus, W. (2022b). Emergent abilities of large
systems, 32. language models. Trans. Mach. Learn. Res., 2022.
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Wei, T., Luan, J., Liu, W., Dong, S., and Wang, B. (2023).
Bowman, S. R. (2018). Glue: A multi-task benchmark and Cmath: Can your language model pass chinese elemen-
analysis platform for natural language understanding. tary school math test?
arXiv preprint arXiv:1804.07461. White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert,
Wang, B., Chen, W., Pei, H., Xie, C., Kang, M., Zhang, C., H., Elnashar, A., Spencer-Smith, J., and Schmidt, D. C.
Xu, C., Xiong, Z., Dutta, R., Schaeffer, R., Truong, S. T., (2023). A prompt pattern catalog to enhance prompt
Arora, S., Mazeika, M., Hendrycks, D., Lin, Z., Cheng, Y., engineering with chatgpt. arXiv preprint arXiv:2302.11382.
Koyejo, S., Song, D., and Li, B. (2023a). Decodingtrust: Wong, T.-T. (2015). Performance evaluation of classification
A comprehensive assessment of trustworthiness in gpt algorithms by k-fold and leave-one-out cross validation.
models. Pattern Recognition, 48(9):2839–2846.
Wang, B. and Komatsuzaki, A. (2021). Gpt-j-6b: A 6 billion Wu, P. Y., Tucker, J. A., Nagler, J., and Messing, S. (2023a).
parameter autoregressive language model. Large language models can be used to estimate the ide-
Wang, B., Xu, C., Wang, S., Gan, Z., Cheng, Y., Gao, J., ologies of politicians in a zero-shot learning setting. arXiv
Awadallah, A. H., and Li, B. (2021). Adversarial glue: preprint arXiv:2303.12057.
A multi-task benchmark for robustness evaluation of lan- Wu, Y., Jia, F., Zhang, S., Wu, Q., Li, H., Zhu, E., Wang, Y.,
guage models. arXiv preprint arXiv:2111.02840. Lee, Y. T., Peng, R., and Wang, C. (2023b). An empirical
Wang, C., Cheng, S., Xu, Z., Ding, B., Wang, Y., and Zhang, Y. study on challenging math problem solving with gpt-4.
(2023b). Evaluating open question answering evaluation. arXiv preprint arXiv:2306.01337.
arXiv preprint arXiv:2305.12421. Wu, Z., Qiu, L., Ross, A., Akyürek, E., Chen, B., Wang, B.,
Wang, J., Hu, X., Hou, W., Chen, H., Zheng, R., Wang, Kim, N., Andreas, J., and Kim, Y. (2023c). Reasoning
Y., Yang, L., Huang, H., Ye, W., Geng, X., et al. (2023c). or reciting? exploring the capabilities and limitations of
On the robustness of chatgpt: An adversarial and out-of- language models through counterfactual tasks. arXiv
distribution perspective. In ICLR workshop on Trustworthy preprint arXiv:2307.02477.
and Reliable Large-Scale Machine Learning Models. Xu, F., Lin, Q., Han, J., Zhao, T., Liu, J., and Cambria,
Wang, J., Lan, C., Liu, C., Ouyang, Y., Qin, T., Lu, W., E. (2023a). Are large language models really good
Chen, Y., Zeng, W., and Yu, P. (2022). Generalizing to logical reasoners? a comprehensive evaluation from de-
unseen domains: A survey on domain generalization. ductive, inductive and abductive views. arXiv preprint
IEEE Transactions on Knowledge and Data Engineering. arXiv:2306.09841.
Wang, L., Lyu, C., Ji, T., Zhang, Z., Yu, D., Shi, S., and Tu, Z. Xu, R., Feng, Y., and Chen, H. (2023b). Chatgpt vs. google:
(2023d). Document-level machine translation with large A comparative study of search performance and user
language models. arXiv preprint arXiv:2304.02210. experience. arXiv preprint arXiv:2307.01135.
Wang, P., Li, L., Chen, L., Zhu, D., Lin, B., Cao, Y., Liu, Q., Yang, K.-C. and Menczer, F. (2023). Large language
Liu, T., and Sui, Z. (2023e). Large language models are models can rate news outlet credibility. arXiv preprint
not fair evaluators. arXiv preprint arXiv:2305.17926. arXiv:2304.00228.
Wang, R. E. and Demszky, D. (2023). Is chatgpt a good Yang, L., Zhang, S., Qin, L., Li, Y., Wang, Y., Liu, H., Wang,
teacher coach? measuring zero-shot performance for scor- J., Xie, X., and Zhang, Y. (2022). Glue-x: Evaluating
ing and providing actionable insights on classroom in- natural language understanding models from an out-
struction. arXiv preprint arXiv:2306.03090. of-distribution generalization perspective. arXiv preprint
Wang, Y., Yu, Z., Wang, J., Heng, Q., Chen, H., Ye, W., arXiv:2211.08073.
Xie, R., Xie, X., and Zhang, S. (2023f). Exploring vision- Yu, J., Wang, X., Tu, S., Cao, S., Zhang-Li, D., Lv, X., Peng,
language models for imbalanced learning. arXiv preprint H., Yao, Z., Zhang, X., Li, H., et al. (2023). Kola: Care-
arXiv:2304.01457. fully benchmarking world knowledge of large language
Wang, Y., Yu, Z., Zeng, Z., Yang, L., Wang, C., Chen, H., models. arXiv preprint arXiv:2306.09296.
Jiang, C., Xie, R., Wang, J., Xie, X., et al. (2023g). Pandalm: Yuan, Z., Yuan, H., Tan, C., Wang, W., and Huang, S. (2023).
An automatic evaluation benchmark for llm instruction How well do large language models perform in arithmetic
tuning optimization. arXiv preprint arXiv:2306.05087. tasks? arXiv preprint arXiv:2304.02015.
PREPRINT 24
Zeng, A., Liu, X., Du, Z., Wang, Z., Lai, H., Ding, M., arXiv:2301.12868.
Yang, Z., Xu, Y., Zheng, W., Xia, X., et al. (2022). Glm- Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford,
130b: An open bilingual pre-trained model. arXiv preprint A., Amodei, D., Christiano, P., and Irving, G. (2019). Fine-
arXiv:2210.02414. tuning language models from human preferences. arXiv
Zhang, J., Bao, K., Zhang, Y., Wang, W., Feng, F., and He, X. preprint arXiv:1909.08593.
(2023a). Is chatgpt fair for recommendation? evaluating Ziems, C., Held, W., Shaikh, O., Chen, J., Zhang, Z.,
fairness in large language model recommendation. arXiv and Yang, D. (2023). Can large language models
preprint arXiv:2305.07609. transform computational social science? arXiv preprint
Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, arXiv:2305.03514.
S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. (2022). Opt:
Open pre-trained transformer language models. arXiv
preprint arXiv:2205.01068.
Zhang, S. J., Florin, S., Lee, A. N., Niknafs, E., Marginean,
A., Wang, A., Tyser, K., Chin, Z., Hicke, Y., Singh, N.,
et al. (2023b). Exploring the mit mathematics and eecs
curriculum using large language models. arXiv preprint
arXiv:2306.08997.
Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi,
Y. (2019). Bertscore: Evaluating text generation with bert.
arXiv preprint arXiv:1904.09675.
Zhang, W., Aljunied, S. M., Gao, C., Chia, Y. K., and Bing, L.
(2023c). M3exam: A multilingual, multimodal, multilevel
benchmark for examining large language models. arXiv
preprint arXiv:2306.05179.
Zhang, W., Deng, Y., Liu, B., Pan, S. J., and Bing, L. (2023d).
Sentiment analysis in the era of large language models: A
reality check. arXiv preprint arXiv:2305.15005.
Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min,
Y., Zhang, B., Zhang, J., Dong, Z., et al. (2023a). A survey
of large language models. arXiv preprint arXiv:2303.18223.
Zhao, Y., Pang, T., Du, C., Yang, X., Li, C., Cheung, N.-
M., and Lin, M. (2023b). On evaluating adversarial ro-
bustness of large vision-language models. arXiv preprint
arXiv:2305.16934.
Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z.,
Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H.,
Gonzalez, J. E., and Stoica, I. (2023). Judging llm-as-a-
judge with mt-bench and chatbot arena.
Zhong, W., Cui, R., Guo, Y., Liang, Y., Lu, S., Wang, Y.,
Saied, A., Chen, W., and Duan, N. (2023). Agieval:
A human-centric benchmark for evaluating foundation
models. arXiv preprint arXiv:2304.06364.
Zhou, Y., Muresanu, A. I., Han, Z., Paster, K., Pitis, S., Chan,
H., and Ba, J. (2022). Large language models are human-
level prompt engineers. arXiv preprint arXiv:2211.01910.
Zhu, K., Wang, J., Zhou, J., Wang, Z., Chen, H., Wang, Y.,
Yang, L., Ye, W., Gong, N. Z., Zhang, Y., et al. (2023).
Promptbench: Towards evaluating the robustness of large
language models on adversarial prompts. arXiv preprint
arXiv:2306.04528.
Zhuang, Y., Liu, Q., Ning, Y., Huang, W., Lv, R., Huang, Z.,
Zhao, G., Zhang, Z., Mao, Q., Wang, S., et al. (2023). Effi-
ciently measuring the cognitive ability of llms: An adap-
tive testing perspective. arXiv preprint arXiv:2306.10512.
Zhuo, T. Y., Huang, Y., Chen, C., and Xing, Z. (2023a).
Exploring ai ethics of chatgpt: A diagnostic analysis. arXiv
preprint arXiv:2301.12867.
Zhuo, T. Y., Li, Z., Huang, Y., Li, Y.-F., Wang, W., Haffari,
G., and Shiri, F. (2023b). On robustness of prompt-
based semantic parsing with large pre-trained language
model: An empirical study on codex. arXiv preprint