Optimizing Large Language Models: A Deep Dive Into
These models are pre-trained on vast datasets, equipping them with the ability to recognize
and comprehend complex language patterns. From question answering in conversational
agents, text summarization, and language translation, to content generation and speech
recognition, LLMs are integral to the progress of artificial intelligence, particularly in con-
sumer electronics and technologies (CE/CT) [1]. In consumer applications, such as smart
home assistants, voice-controlled devices, and interactive entertainment systems, LLMs
contribute to more intuitive user experiences by enhancing natural language understanding
and response generation.
The evolution of LLMs from simple statistical language models [2] to sophisticated
transformer-based architectures like GPT-4 [3,4], PaLM [5], and LLaMA [6] marks a signifi-
cant leap in their capabilities, influencing diverse CE/CT applications. Prominent mod-
els include OpenAI’s GPT series [3,4], Meta’s LLaMA family [7–9], Google’s Gemini [2]
and Gemma [10], as well as Mistral’s recent advancements [11,12], all of which are continu-
ously evolving to optimize their performance in consumer-facing technologies.
The capabilities of LLMs are grounded in Pre-trained Language Models (PLMs) and
subsequent fine-tuning techniques. As PLMs have grown in scale, developing into highly
capable LLMs, novel reasoning capabilities have emerged, further accelerating their utility
in various applications. These abilities include In-Context Learning (ICL) [13], Instruction
Following (IF) [14], and Step-by-Step Reasoning (SSR) [15], which are increasingly critical
for enhancing consumer electronics interactions, such as improving response accuracy in
smart assistants or optimizing content recommendations in entertainment systems.
Despite the advancements in PLMs, their accuracy often falls short, particularly when
applied to specific datasets or tasks, and they are prone to generating incorrect answers that
include non-existent information, a phenomenon known as hallucination. To address these
issues, various Prompt Engineering (PE) techniques have recently been developed. Prompt
Engineering plays a crucial role in determining the quality of an LLM's final output by adjusting the structure and content of the prompt.
Notably, Prompt Engineering allows for the extraction of desired responses during the
inference stage without the need for costly and time-consuming pre-training or fine-tuning
stages, making it a highly advantageous technique for commercial applications [16].
One prominent approach within Prompt Engineering is the Retrieval-Augmented
Generation (RAG) technique. RAG augments the model's input by combining a retrieval step, in which external knowledge is gathered to supplement the query, with a generation step that produces the final output. This integration improves the accuracy of information
retrieval and enables the model to provide more accurate responses to queries that involve
up-to-date information or rare facts.
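To make the retrieve-then-generate flow concrete, the sketch below assembles a RAG-style prompt from a small in-memory corpus using TF-IDF retrieval; the corpus, the prompt template, and the retrieval method are illustrative assumptions rather than the exact pipeline used in this study.

```python
# Minimal RAG-style prompt construction: retrieve supporting passages, then prompt.
# The corpus, prompt template, and retrieval method are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Brown sugar has the same calories and health risks as white sugar.",
    "Molasses gives brown sugar its color and a trace amount of minerals.",
    "GSM8K contains grade-school math word problems requiring multi-step reasoning.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k corpus passages most similar to the query (TF-IDF cosine)."""
    vectorizer = TfidfVectorizer().fit(corpus + [query])
    doc_vecs = vectorizer.transform(corpus)
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_vecs)[0]
    top = scores.argsort()[::-1][:k]
    return [corpus[i] for i in top]

def build_rag_prompt(question: str) -> str:
    """Prepend retrieved passages as related information before the question."""
    context = "\n".join(retrieve(question))
    return f"#Related Information\n{context}\n\n#Question\n{question}"

print(build_rag_prompt("Is brown sugar healthier than white sugar?"))
```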
In contrast, the Chain of Thought (CoT) technique is designed to guide language
models to explicitly articulate intermediate steps or logical reasoning processes when
solving complex problems. For instance, by structuring prompts to include the step-by-step
solution process for a math problem, CoT enables the evaluation of both the correctness of
the final response and the appropriateness of the reasoning that led to it.
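For illustration, a CoT prompt can prepend a worked, step-by-step exemplar to the target question; the exemplar below is hypothetical and not one of the prompts evaluated in this paper.

```python
# A CoT-style prompt: a worked exemplar exposes intermediate reasoning steps,
# encouraging the model to articulate its own reasoning before the final answer.
# The exemplar and questions are hypothetical.
cot_exemplar = (
    "#Question\n"
    "A shop sells pens at 3 dollars each. How much do 4 pens cost?\n"
    "#Chain of Thought\n"
    "Step 1: Each pen costs 3 dollars.\n"
    "Step 2: 4 pens cost 4 x 3 = 12 dollars.\n"
    "Therefore, the answer is 12 dollars.\n"
)

def build_cot_prompt(question: str) -> str:
    """Append the target question after the worked exemplar."""
    return f"{cot_exemplar}\n#Question\n{question}\n#Chain of Thought\n"

print(build_cot_prompt("A book costs 7 dollars. How much do 5 books cost?"))
```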
The In-Context Learning (ICL) technique, as its name suggests, involves learning
meaning within the context provided by the prompt. It enables the model to generate
relevant responses by understanding the context in the prompt. This is typically achieved by
providing examples, which guide the model in generating appropriate answers. Depending
on the number of examples provided, the technique is categorized into zero-shot learning
(no examples), one-shot learning (one example), and few-shot learning (multiple examples).
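The difference between zero-, one-, and few-shot prompting reduces to how many exemplars are prepended to the target question, as in the minimal sketch below (the exemplar pairs are hypothetical).

```python
# Zero-/one-/few-shot prompt construction: the only difference is the number of
# question-answer exemplars prepended to the target question (hypothetical examples).
EXAMPLES = [
    ("What happened during the first radio broadcast of The War of the Worlds?",
     "There was no mass panic, but a few listeners called into the station."),
    ("Is the Great Wall of China visible from space with the naked eye?",
     "No, it is generally not visible to the naked eye from low Earth orbit."),
]

def build_icl_prompt(question: str, num_shots: int = 1) -> str:
    """num_shots = 0 (zero-shot), 1 (one-shot), or more (few-shot)."""
    parts = []
    for q, a in EXAMPLES[:num_shots]:
        parts.append(f"#Example Question\n{q}\n#Example Answer\n{a}")
    parts.append(f"#Question\n{question}")
    return "\n".join(parts)

print(build_icl_prompt("Is brown sugar healthier than white sugar?", num_shots=1))
```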
Additionally, the Step-by-Step Reasoning (SSR) technique guides the model to arrive
at an answer through a series of intermediate steps when solving complex problems. This
approach can be particularly useful in tasks that require multi-step thinking, such as solving
math problems, logical reasoning, or complex decision-making processes [15].
Lastly, the Tree of Thought (ToT) technique involves branching out multiple possible
thought processes in a tree structure to explore different solutions to a complex problem.
This method is especially effective in situations that require multi-stage reasoning or where
multiple solutions are possible [17].
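A compact sketch of the branch-and-evaluate idea behind ToT is shown below: candidate thoughts are expanded breadth-first and scored, and only the most promising branches are kept. The propose and score callables stand in for LLM calls, so this is a conceptual illustration rather than the implementation used here.

```python
# Conceptual Tree-of-Thought search: expand candidate thoughts breadth-first,
# score each partial reasoning path, and keep only the best branches.
# propose() and score() are placeholders for LLM calls.
from typing import Callable

def tree_of_thought(
    question: str,
    propose: Callable[[str], list[str]],   # candidate next thoughts for a path
    score: Callable[[str], float],         # how promising a partial path is
    depth: int = 2,
    beam_width: int = 2,
) -> str:
    paths = [question]
    for _ in range(depth):
        candidates = [f"{p}\n{t}" for p in paths for t in propose(p)]
        # Keep the most promising branches (beam search over thoughts).
        paths = sorted(candidates, key=score, reverse=True)[:beam_width]
    return paths[0]

# Toy stand-ins so the sketch runs end to end.
best = tree_of_thought(
    "Is brown sugar healthier than white sugar?",
    propose=lambda path: ["Thought: compare calories.", "Thought: compare minerals."],
    score=lambda path: float(len(path)),  # placeholder heuristic
)
print(best)
```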
This paper investigates the optimal Prompt Engineering strategies that maximize
user satisfaction, i.e., response accuracy, while minimizing resource requirements for real-
world LLM service deployment. The study focuses on applying Prompt Engineering
techniques without fine-tuning foundation models, enabling efficient service implemen-
tation. To achieve this, we examine the application strategies of widely used Prompt
Engineering methods, including RAG, CoT, ICL, SSR, and ToT. These techniques were
applied to leading LLM technologies, namely Llama3, Mistral, and Gemma2, to evaluate
their performance. For performance evaluation, we utilized diverse metrics, including
BLEU, ROUGE, METEOR, BLEURT, and BERTScore, to closely approximate user satisfac-
tion. Benchmark datasets such as the AI2 Reasoning Challenge (ARC), HellaSwag, Massive
Multitask Language Understanding (MMLU), TruthfulQA, Winogrande, and Grade School
Math (GSM8K) were employed to analyze the natural language generation performance
across a variety of topics. Based on these evaluations, we propose a PE (Prompt Engineer-
ing) strategy for optimizing performance across different datasets and LLMs.
2. Related Works
Prompt Engineering is a crucial technique for determining the quality of responses
generated by LLMs. It optimizes model outputs by adjusting the structure and content of
prompts during the inference stage, without requiring additional pre-training or fine-tuning
of the model. Initially, Prompt Engineering primarily involved simple question-and-answer
formats, but recent advancements have led to the development of more complex prompt
structures designed to solve intricate problems [18]. Early research in this field, spear-
headed by OpenAI with the development of the GPT-3 model, explored various Prompt
Engineering methods. Among these, techniques for designing ICL-based few-shot learn-
ing prompts emerged as a way to obtain desired answers from LLMs without additional
training [13]. The success of ICL has since led to the exploration of various derivatives,
such as PICL [19], MetaICL [20], KATE [21], SG-ICL [22], and GlobalE&LocalE [23]. While
early ICL research focused on optimizing ICL performance through prompt reconfiguration
alone, the PICL [19] approach proposed collecting related contextual data and reconstruct-
ing corpora to pre-train the model, thereby enabling it to learn how to infer through prompt
demonstrations in advance. In contrast, the MetaICL [20] approach introduced an addi-
tional phase between pre-training and ICL inference called continual training, also referred
to as model warming. This optional procedure for ICL involves adjusting the LLM by
modifying or adding to its parameters before inference. Many studies have also shown
that ICL performance heavily depends on the configuration of demonstrations, including
the selection, format, and order of the examples. KATE [21] investigated the selection of
demonstration examples, SG-ICL [22] focused on the format, and GlobalE&LocalE [23]
explored the order of demonstrations.
Research has also been conducted on techniques that encourage models to express
intermediate steps or logical reasoning alongside responses to solve complex problems. No-
table methods in this area include CoT, Self-Consistency, SSR, and ToT. The Self-Consistency
technique optimizes the model to produce consistent and reliable outputs across varied
inputs that convey the same meaning. The ToT approach involves branching out multi-
ple reasoning steps required to solve a problem into various thought trees. Each branch represents a distinct reasoning path that can be evaluated and compared to identify the most promising solution.
Lastly, there have been studies addressing issues related to model alignment, demon-
stration formatting, and the theoretical underpinnings of prompting using the aforemen-
tioned Prompt Engineering techniques [30–32]. The study by [30] proposed efficient
prompt construction methods such as Instruction+Question, Instruction+Input, and Ques-
tion+Answer. It also identified the limitations of LLMs and suggested strategies to over-
come these limitations using CoT-based Prompt Engineering. Similarly, ref. [31] explored
effective prompt construction techniques and discussed optimization strategies for LLMs
through approaches such as CoT, Self-Consistency, General Knowledge, Least-to-Most
Prompting, ToT, GoT, Decomposed Prompting, and Active Prompting. The study by [32]
provided a SWOT analysis of LLM technologies and presented the mathematical theoretical
underpinnings of various Prompt Engineering techniques. However, none of these studies
have focused on deriving the optimal Prompt Engineering techniques based on perfor-
mance analysis using diverse metrics tailored to the characteristics of LLMs and datasets.
The main contributions of this paper are as follows:
• A technical analysis of various LLMs, Prompt Engineering techniques for performance
enhancement, related datasets, and evaluation metrics.
• A comparative performance analysis of key Prompt Engineering techniques across
different LLMs and datasets, using various evaluation metrics.
• Derivation of performance optimization strategies based on the comparative analysis,
tailored to specific application domains.
In Section 3, we describe the LLMs utilized in this study, and Section 3.2 details the performance metrics used for evaluation. Section 4 provides an overview of the benchmark datasets employed. In Section 5, we present the experimental results based on
these parameters and propose a PE Strategy for optimizing LLM performance. Finally,
Section 6 offers the conclusion.
3.1.1. Gemma2
Gemma2, released by Google on 27 June 2024, is an open large language model available in two sizes, i.e., 9 billion and 27 billion parameters, and distributed through the Hugging Face platform. Each model includes pre-training, fine-tuning, and alignment phases. As a
lightweight version of the larger Gemini model, Gemma2 offers significant enhancements.
The key features of the Gemma2 model are summarized in Table 2.
Table 2. Summary of the model architecture, pre-training, and post-training techniques of major LLMs.
The model architecture is based on a Transformer with a Decoder-Only structure that alternates local sliding-window and global attention layers, and the logits in each attention layer and the final layer are constrained to remain within a soft-cap range.
The training methodology is divided into the following two stages: Pre-Training
and Post-Training. In the Pre-Training stage, data filtering is first applied to address
privacy and safety issues, and the filtered data is then converted into discrete tokens from
a large multilingual text corpus. Instead of next-token prediction, the model is trained
using Knowledge Distillation based on the probability of each token as provided by the
teacher model.
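A minimal sketch of such token-level knowledge distillation is given below, assuming access to the teacher's logits; the KL-divergence objective shown is the standard formulation and is not taken verbatim from the Gemma2 training recipe.

```python
# Token-level knowledge distillation (sketch): the student is trained to match the
# teacher's next-token distribution via a KL-divergence loss, instead of hard
# next-token prediction alone. Shapes and tensors are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) averaged over the batch."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # batchmean matches the mathematical definition of KL divergence.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t ** 2)

# Example with random logits of shape (batch, sequence, vocab).
student = torch.randn(2, 8, 32000, requires_grad=True)
teacher = torch.randn(2, 8, 32000)
loss = distillation_loss(student, teacher)
loss.backward()
print(loss.item())
```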
During the Post-Training stage, the pre-trained models are fine-tuned into Super-
vised Fine-Tuned models. Subsequently, Reinforcement Learning with Human Feedback
(RLHF) is applied to these models, where the reward model is trained on labeled English
preference data and the policy is based on the same prompts used in the supervised fine-
tuning (SFT) phase. Finally, the models obtained after each phase are averaged to enhance
overall performance.
3.1.2. Llama3
Llama3 is the latest large language model developed by Meta, released on 18 April 2024.
Building on the foundation of its predecessor, Llama2, this model is available in two sizes, i.e., 8 billion and 70 billion parameters [9]. The key features of the Llama3 model are
summarized in Table 2.
The model architecture is based on a Transformer with a Decoder-Only structure,
utilizing GQA with 8 key-value heads. During the Pre-Training stage, a large multilingual text corpus is first converted into discrete tokens, and then an LLM is pre-trained on the
resulting data to perform next-token prediction.
While the pre-trained language model has a rich understanding of language, it does
not yet fully follow instructions or generate clear responses. In the Post-Training stage,
the model is aligned with human feedback over multiple rounds, each involving SFT on
instruction-tuning data and Direct Preference Optimization (DPO). Additionally, special-
ized training is introduced for specific domains such as coding and reasoning.
3.1.3. Mistral
Mistral is a high-performance language model developed by the French AI startup
Mistral AI. The Mistral model has 7 billion parameters and leverages
cutting-edge technologies such as GQA and Sliding Window Attention (SWA) to enhance
efficiency [11,12]. The key technical features are summarized in Table 2.
The model architecture is based on a Transformer with a Decoder-Only structure,
incorporating GQA with 8 key-value heads and 32 query heads, as well as a SWA mechanism
with a size of 4096 to accelerate inference speed. Additionally, to optimize cache usage,
techniques such as Rolling Buffer Cache, Pre-fill, and Chunking are employed.
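To illustrate how sliding-window attention restricts each token to a fixed span of preceding tokens (window size 4096 in Mistral), the small NumPy sketch below builds the corresponding boolean attention mask; it is a conceptual illustration, not Mistral's actual implementation.

```python
# Sliding-window causal attention mask (sketch): position i may attend only to
# positions j with i - window < j <= i. Mistral uses window = 4096; a small
# window is used here for readability.
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask of shape (seq_len, seq_len); True means attention is allowed."""
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    causal = j <= i
    within_window = (i - j) < window
    return causal & within_window

print(sliding_window_mask(seq_len=6, window=3).astype(int))
```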
During the Pre-Training stage, a large multilingual text corpus is first converted
into discrete tokens, and then the resulting data is used to pre-train an LLM for next-
token prediction.
In the Post-Training stage, Instruction Fine-tuning is conducted through SFT using
Hugging Face’s Instruction dataset. Finally, the model performs fine-grained content
moderation by leveraging system prompts to optionally enforce constraints on its output.
3.2.1. BLEU
The BLEU score is a metric developed to evaluate the quality of machine translation
by measuring the n-gram overlap between the translated sentence and reference sentences.
The BLEU score is calculated using the following formula:
$$\mathrm{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right), \qquad (1)$$

where $p_n$ represents the n-gram precision, $w_n$ denotes the weight, and $BP$ refers to the
brevity penalty. BLEU is primarily used to assess the fluency and accuracy of translated
text, where a higher score indicates a closer match between the translated text and the
reference text [35].
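A compact reference implementation of Equation (1) with uniform weights, clipped n-gram precisions, and the brevity penalty is sketched below; production evaluations typically rely on established toolkits, whose smoothing choices may differ from this unsmoothed version.

```python
# Minimal sentence-level BLEU (Equation (1)): clipped n-gram precisions with
# uniform weights w_n = 1/N and a brevity penalty. No smoothing is applied, so
# any zero n-gram precision drives the score to 0.
import math
from collections import Counter

def ngrams(tokens: list[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

# BLEU-2 for a toy sentence pair.
print(bleu("the cat sat on the mat", "the cat is on the mat", max_n=2))
```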
3.2.2. ROUGE
The ROUGE score is primarily used to measure the similarity between generated text
and reference text in summarization tasks. There are several variants of ROUGE, including
ROUGE-N, ROUGE-L, and ROUGE-W. ROUGE-N is based on n-gram overlap, while
ROUGE-L relies on the Longest Common Subsequence (LCS). ROUGE-W is a weighted
version of ROUGE-L, where greater weights are assigned to longer common subsequences.
ROUGE is useful for evaluating the coverage and consistency of generated summaries and
is frequently employed in multi-document summarization evaluations [36]. In this study,
for the short-answer datasets, we based our performance measurement on ROUGE-N with
N = 1 (unigrams). The formula for ROUGE-N is as follows:

$$\mathrm{ROUGE\text{-}N} = \frac{\sum_{S \in RS} \sum_{\mathrm{gram}_n \in S} \mathrm{Count}_m(\mathrm{gram}_n)}{\sum_{S \in RS} \sum_{\mathrm{gram}_n \in S} \mathrm{Count}(\mathrm{gram}_n)},$$

where $\mathrm{gram}_n$ denotes an n-gram, $RS$ is the set of Reference Summaries, $\mathrm{Count}_m(\mathrm{gram}_n)$ is the number of n-grams in the generated text that match the n-grams in the reference text, and $\mathrm{Count}(\mathrm{gram}_n)$ is the total number of n-grams in the reference text.
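A minimal ROUGE-1 computation following this recall-oriented definition is sketched below; standard ROUGE packages additionally apply preprocessing such as stemming, which is omitted here.

```python
# Minimal ROUGE-1: overlapping unigram count divided by the total number of
# unigrams in the reference, following the ROUGE-N definition above.
from collections import Counter

def rouge_1(candidate: str, reference: str) -> float:
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    overlap = sum(min(cand_counts[w], ref_counts[w]) for w in ref_counts)
    total_ref = sum(ref_counts.values())
    return overlap / total_ref if total_ref else 0.0

print(rouge_1("brown sugar is not healthier than white sugar",
              "no brown sugar is not healthier than white sugar"))
```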
3.2.3. METEOR
The METEOR metric, unlike BLEU, accounts for factors beyond simple n-gram match-
ing, including synonymy, stemming, and word order alignment. METEOR employs a
harmonic mean of unigram precision and recall, placing a higher weight on recall in its
evaluation. By incorporating considerations such as synonyms and morphological analysis,
METEOR enhances the flexibility of word matching and is known to correlate more closely
with human judgments compared to BLEU. This allows for a more nuanced assessment of
translation quality [37].
The METEOR score is computed as

$$\mathrm{METEOR} = F_{mean} \cdot (1 - P_{penalty}),$$

where $P$ is the Precision, $R$ is the Recall, and $P_{penalty}$ is the fragmentation penalty for penalizing word order errors. These quantities are calculated using the following equations:

$$P = \frac{m}{w_c}, \qquad R = \frac{m}{w_r}, \qquad F_{mean} = \frac{10PR}{R + 9P}, \qquad P_{penalty} = 0.5\left( \frac{ch}{m} \right)^{3},$$

where $m$ is the number of matched unigrams, $w_c$ and $w_r$ are the numbers of unigrams in the generated and reference texts, respectively, and $ch$ is the number of contiguous matched chunks.
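Assuming the standard METEOR formulation above with exact unigram matching only (no stemming or synonym matching), a toy computation might look like the following sketch.

```python
# Toy METEOR-style score: exact unigram matching only (no stemming or synonyms),
# recall-weighted harmonic mean F_mean, and a fragmentation penalty based on the
# number of contiguous matched chunks.
def meteor_like(candidate: str, reference: str) -> float:
    cand, ref = candidate.split(), reference.split()
    ref_positions = {w: i for i, w in enumerate(ref)}   # one position per distinct word
    matches = [(i, ref_positions[w]) for i, w in enumerate(cand) if w in ref_positions]
    m = len(matches)
    if m == 0:
        return 0.0
    precision, recall = m / len(cand), m / len(ref)
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # Count chunks: maximal runs of matches that are adjacent in both strings.
    chunks = 1
    for (c_prev, r_prev), (c_cur, r_cur) in zip(matches, matches[1:]):
        if c_cur != c_prev + 1 or r_cur != r_prev + 1:
            chunks += 1
    penalty = 0.5 * (chunks / m) ** 3
    return f_mean * (1 - penalty)

print(meteor_like("the cat sat on the mat", "the cat is on the mat"))
```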
3.2.4. BLEURT
Unlike traditional metrics such as BLEU, which rely on n-gram overlap, BLEURT is a
metric designed to account for semantic similarity, contextual understanding, and fluency.
This results in an evaluation metric that more closely aligns with human assessments [38].
Specifically, BLEURT leverages BERT (Bidirectional Encoder Representations from Trans-
formers) embeddings and is fine-tuned for specific text evaluation tasks. The model is
trained to predict a score that reflects how well the generated text aligns with the reference
text, based on the following equation:

$$\mathrm{BLEURT}(c, r) = f\big( \mathrm{BERT}(c), \mathrm{BERT}(r), \mathrm{ctxt} \big),$$

where $c$ is the generated text, $r$ is the reference text, and ctxt is the context, which may
include additional information or surrounding text that can influence the score. BERT( x )
represents the BERT embedding of the text x and f (·) is a function that combines these
embeddings and outputs a score reflecting the similarity between c and r.
3.2.5. BERTScore

BERTScore evaluates generated text by comparing BERT token embeddings of the generated and reference texts through greedy cosine-similarity matching:

$$P = \frac{1}{|X|} \sum_{x \in X} \max_{y \in Y} \mathrm{cosine}(x, y), \qquad R = \frac{1}{|Y|} \sum_{y \in Y} \max_{x \in X} \mathrm{cosine}(x, y), \qquad \mathrm{BERTScore} = 2 \times \frac{P \cdot R}{P + R}, \qquad (6)$$

where $P$ is the precision, $R$ is the recall, $X$ and $Y$ are the sets of token embeddings of the generated text $x$ and the reference text $y$, respectively, and $\mathrm{cosine}(x, y)$ represents the cosine similarity between the embeddings of the tokens from the generated and reference texts.
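Given precomputed token embeddings for the candidate and reference texts (random vectors stand in for BERT embeddings here), the greedy matching of Equation (6) can be sketched as follows.

```python
# BERTScore-style greedy matching over token embeddings (Equation (6)).
# Random vectors stand in for BERT token embeddings; in practice these come
# from a pretrained encoder.
import numpy as np

def bert_score(cand_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """cand_emb: (n_cand, d), ref_emb: (n_ref, d); returns the F1-style score."""
    # Normalize so dot products are cosine similarities.
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T                  # (n_cand, n_ref) cosine similarities
    precision = sim.max(axis=1).mean()  # each candidate token -> best reference token
    recall = sim.max(axis=0).mean()     # each reference token -> best candidate token
    return 2 * precision * recall / (precision + recall)

rng = np.random.default_rng(0)
print(bert_score(rng.normal(size=(7, 32)), rng.normal(size=(9, 32))))
```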
4. Dataset
In this study, we selected benchmark datasets from three categories to comprehensively
evaluate various language capabilities. For the general natural language understanding
category, we chose the MMLU and TruthfulQA datasets. For the reasoning ability category,
we selected the ARC, HellaSwag and Winogrande datasets. For the mathematics problem-
solving category, we used the GSM8K dataset. These categories were carefully chosen to
ensure a balanced assessment across general natural language understanding, reasoning
ability, and mathematics problem-solving domains.
4.1. ARC
The ARC dataset consists of 11,119 training, 2992 validation, and 4541 test examples,
and is composed of multiple-choice questions with four options each, covering science
topics typically taught at the elementary and middle school levels in the United States, such
as biology, physics, chemistry, and earth science. The dataset is divided into two categories
as follows: ARC-Easy, which includes questions that require basic information retrieval,
and ARC-Challenge, which contains more difficult questions that require complex reason-
ing and inference. This dataset was designed to evaluate the extent to which AI models can
perform complex reasoning and understanding beyond simple information retrieval [40].
4.2. HellaSwag
The HellaSwag dataset consists of 39,905 training, 5000 validation, and 10,042 test
examples. It is designed to evaluate AI models’ natural language understanding by testing
their ability in commonsense reasoning and contextual understanding through sentence
completion and multiple-choice questions. This dataset includes contexts from various
domains, such as Wikipedia and instructional videos, and requires models to predict the
most appropriate sentence continuation among the provided choices for each context [41].
4.3. MMLU
The MMLU dataset consists of 57,243 training, 10,000 validation, and 15,000 test exam-
ples. The dataset comprises multiple-choice questions across a wide
range of subjects, including science, humanities, and professional fields. It is designed to
assess how broadly and deeply AI models can understand and apply knowledge across
various domains [42].
4.4. TruthfulQA
The TruthfulQA dataset consists of 8170 training, 1024 validation, and 2035 test ex-
amples. It is designed to evaluate the ability of language models to generate factual and
truthful responses. The dataset includes questions that often use misleading or ambiguous
phrasing, or that require the model to acknowledge its lack of knowledge rather than fabri-
cating an answer. This dataset is used to assess the limitations of a model’s truthfulness
and its ability to handle subtlety in information retrieval and response generation [43].
4.5. Winogrande
The Winogrande dataset consists of 40,398 training, 1267 validation, and 3545 test
examples and is designed to train and test the commonsense reasoning abilities within
AI systems. Inspired by the Winograd Schema Challenge, it was developed with a larger
scale and a more diverse set of items to more thoroughly evaluate model performance.
The dataset comprises sentences with blanks that must be filled by selecting the correct word
from two given options, testing the model’s ability to perform contextual and commonsense
reasoning [44].
4.6. GSM8K
The GSM8K dataset consists of 7432 training, 800 validation, and 1319 test examples,
and is designed to test the mathematical reasoning abilities of AI systems. This dataset
is a collection of elementary-level math problems covering fundamental concepts such as
arithmetic, algebra, geometry, and word problems. It is used to evaluate AI models not
only on their ability to compute answers but also on their performance in understanding
the text context of problems, which often involves natural language comprehension and
multi-step reasoning [45].
5. Simulation Result
Table 3 presents the experimental setup used in this study. The experiments were
conducted using Llama3-8B, Gemma2-9B, and Mistral-7B as the LLMs, with a focus on
analyzing the performance of ICL (1 Shot), CoT, RAG (1 Shot), ToT, SSR, and their combined
configurations. Sections 5.1–5.6 detail the performance evaluation and comparative analysis
across six datasets, based on the experimental environment outlined in Table 3. The
performance results were generated based on the prompt formats created according to the
respective Prompt Engineering techniques, with parameters set as Temperature = 0.1 and
Top-P = 0.9. Additionally, the applied Prompt Engineering techniques are denoted by the
following abbreviations and may be used interchangeably:
Table 3. Experimental setup.

Model: Llama3 8B, Gemma2 9B, Mistral 7B
Prompt Engineering: B (Base), I (ICL), C (CoT), R (RAG), T (ToT), S (SSR)
Metric: BLEU, ROUGE, METEOR, BLEURT, BERTScore
Dataset: ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K
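The 24 method rows that appear in Tables 4-15 can be read as the six base prompt types combined with the optional S and T wrappers; the short sketch below reproduces that labeling scheme and records the decoding parameters stated above, purely as a reading aid for the result tables.

```python
# Enumerate the Prompt Engineering combinations evaluated in Tables 4-15:
# six base prompt types, each optionally wrapped with SSR (S) and/or ToT (T).
BASES = ["B", "I", "C", "C, I", "R(B)", "R(C)"]
WRAPPERS = ["", "T, ", "S, ", "S, T, "]

METHODS = [f"{w}{b}" for w in WRAPPERS for b in BASES]
print(len(METHODS), METHODS[:8])   # 24 combinations, e.g. ['B', 'I', 'C', ...]

# Decoding parameters used for all prompts in the experiments.
GENERATION_PARAMS = {"temperature": 0.1, "top_p": 0.9}
```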
5.1. ARC
Tables 4 and 5 present the performance of Llama3, Gemma2, Mistral, and various
Prompt Engineering combinations on the ARC dataset, measured using the BLEU, ROUGE,
METEOR, BLEURT, and BERTScore metrics. The bolded values in the tables indicate
the highest performance achieved for each evaluation metric within each LLM model.
Furthermore, the bold formatting convention is consistently applied to the performance
result tables of other datasets as well. As shown in these tables, the ARC dataset, which
consists of multiple-choice questions and answers on scientific topics, achieves the best
performance with combinations of SSR techniques and RAG- or CoT-based methods such as R(B), R(C), C, and T.
Moreover, it is evident that the optimal combination of Prompt Engineering techniques
varies depending on the LLM and the evaluation metric. For Llama3, using CoT alone
generally yields the best results. In contrast, for Gemma2, the combination of SSR, CoT,
and ICL demonstrates superior performance, while for Mistral, the combination of SSR,
ToT, CoT, and ICL achieves the best outcomes.
Fundamentally, as observed in Tables 4 and 5, Gemma2 consistently outperforms
other models across all five evaluation metrics when no Prompt Engineering techniques
are applied. Even when considering all possible Prompt Engineering techniques, Gemma2
remains the top-performing LLM, with a significant performance gap compared to the
second-best model, Llama3. Additionally, Llama3 consistently outperforms Mistral across
all five evaluation metrics.
Table 4. Performances of BLEU, ROUGE, and METEOR according to LLMs and Prompt Engineering
techniques on the ARC dataset.
Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral
BLEU ROUGE METEOR
B 2.2864 4.4464 1.4115 B 0.1955 0.2439 0.1426 B 0.1514 0.2007 0.1375
I 2.4251 5.5576 1.5623 I 0.1750 0.2575 0.0990 I 0.1237 0.2156 0.0971
C 0.0000 0.0000 1.8549 C 0.6553 0.8340 0.4298 C 0.3277 0.4170 0.2163
C, I 0.0000 0.0000 0.0000 C, I 0.3787 0.8298 0.2681 C, I 0.1872 0.4149 0.1340
R(B) 2.2759 4.8229 1.4879 R(B) 0.2073 0.2561 0.1009 R(B) 0.1501 0.2094 0.0929
R(C) 0.0000 0.0000 1.5619 R(C) 0.5830 0.8383 0.3438 R(C) 0.2915 0.4191 0.1700
T, B 2.0589 3.1321 1.2867 T, B 0.2002 0.2345 0.1499 T, B 0.1499 0.1843 0.1552
T, I 2.1093 4.3401 1.0184 T, I 0.1978 0.2487 0.1432 T, I 0.1433 0.1997 0.1368
T, C 0.0000 0.0000 3.4968 T, C 0.6511 0.8383 0.4199 T, C 0.3255 0.4191 0.2104
T, C, I 0.0000 0.0000 2.3658 T, C, I 0.5362 0.8298 0.2298 T, C, I 0.2681 0.4149 0.1149
T, R(B) 2.0709 4.7891 1.5037 T, R(B) 0.2023 0.2434 0.1407 T, R(B) 0.1462 0.2000 0.1322
T, R(C) 0.0000 0.0000 3.9588 T, R(C) 0.5532 0.8340 0.3574 T, R(C) 0.2766 0.4170 0.1785
S, B 2.1450 3.0319 0.6572 S, B 0.1927 0.2346 0.1290 S, B 0.1499 0.1843 0.1313
S, I 1.2337 4.5014 0.4917 S, I 0.1399 0.2444 0.0661 S, I 0.0866 0.1962 0.0716
S, C 0.0000 21.4646 1.2169 S, C 0.6170 0.8255 0.3887 S, C 0.3085 0.4128 0.1954
S, C, I 0.0000 0.0000 1.6279 S, C, I 0.5702 0.8340 0.2979 S, C, I 0.2851 0.4170 0.1489
S, R(B) 1.2236 4.2853 1.4485 S, R(B) 0.1714 0.2500 0.1352 S, R(B) 0.1197 0.2007 0.1247
S, R(C) 19.5521 21.3812 0.8924 S, R(C) 0.5660 0.8128 0.3234 S, R(C) 0.2830 0.4064 0.1591
S, T, B 2.3439 2.7917 0.9111 S, T, B 0.1887 0.2355 0.1349 S, T, B 0.1433 0.1816 0.1352
S, T, I 0.7283 3.6081 0.1601 S, T, I 0.1157 0.2400 0.0818 S, T, I 0.0869 0.1841 0.0905
S, T, C 0.0000 0.0000 1.7047 S, T, C 0.6426 0.8340 0.2468 S, T, C 0.3213 0.4170 0.1234
S, T, C, I 0.0000 0.0000 16.4416 S, T, C, I 0.5489 0.8170 0.4340 S, T, C, I 0.2745 0.4085 0.2170
S, T, R(B) 1.0221 4.4266 0.6973 S, T, R(B) 0.1764 0.2460 0.1473 S, T, R(B) 0.1285 0.1976 0.1365
S, T, R(C) 0.0000 0.0000 1.5277 S, T, R(C) 0.5447 0.8255 0.3263 S, T, R(C) 0.2723 0.4128 0.1612
Table 5. Performances of BLEURT and BERTScore according to LLMs and Prompt Engineering
techniques on the ARC dataset.
5.2. GSM8K
Tables 6 and 7 present the performance of Llama3, Gemma2, Mistral, and various
Prompt Engineering combinations on the GSM8K dataset, measured using the BLEU,
ROUGE, METEOR, BLEURT, and BERTScore metrics.
As shown in these tables, the GSM8K dataset, which involves elementary-level math
problems described in natural language and requiring multi-step reasoning, achieves the best performance with combinations of SSR, RAG-based methods such as R(B) and R(C), CoT, ToT, and ICL.
The optimal combinations of these Prompt Engineering techniques vary depending on
the LLM and the evaluation metric. For Llama3, although the optimal Prompt Engineering
technique varies across different evaluation metrics, combinations of SSR and CoT-based
RAG generally yield the best performance. For Gemma2, combinations of CoT and ICL
tend to provide the best results in most cases, while for Mistral, combinations of ToT, CoT,
and ICL consistently deliver the best performance.
Fundamentally, as seen in Tables 6 and 7, when no Prompt Engineering techniques are
applied, Llama3 outperforms others in three out of five evaluation metrics, while Gemma2
performs best in two metrics. Even when considering all possible Prompt Engineering
techniques, Llama3 leads in three evaluation metrics, while Gemma2 leads in two. This
indicates that Llama3 generally delivers the best performance on the GSM8k dataset.
Finally, Mistral ranks third, showing a significant performance gap compared to Llama3
and Gemma2.
Table 6. Performances of BLEU, ROUGE, and METEOR according to LLMs and Prompt Engineering techniques on the GSM8K dataset.
Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral
BLEU ROUGE METEOR
B 10.1785 8.2318 6.2994 B 0.4283 0.4434 0.3382 B 0.2512 0.2379 0.1986
I 11.3725 10.3809 6.3203 I 0.4505 0.4471 0.2904 I 0.2680 0.2557 0.1799
C 11.2824 9.7220 6.6639 C 0.4401 0.4493 0.3157 C 0.2653 0.2527 0.1924
C, I 11.1692 11.6382 7.8677 C, I 0.4455 0.4818 0.3349 C, I 0.2637 0.2784 0.2060
R(B) 10.4893 10.0561 7.2215 R(B) 0.4304 0.4513 0.3164 R(B) 0.2577 0.2514 0.1966
R(C) 12.0336 10.5243 7.4787 R(C) 0.4544 0.4537 0.3192 R(C) 0.2730 0.2576 0.1979
T, B 11.3938 8.8232 6.8033 T, B 0.4234 0.4550 0.3214 T, B 0.2654 0.2451 0.1948
T, I 12.1602 10.1051 6.1928 T, I 0.4411 0.4406 0.3072 T, I 0.2773 0.2517 0.1956
T, C 12.4078 10.2231 6.9790 T, C 0.4364 0.4609 0.3405 T, C 0.2755 0.2599 0.2091
T, C, I 12.4400 11.3522 9.7138 T, C, I 0.4511 0.4694 0.3492 T, C, I 0.2813 0.2738 0.2345
T, R(B) 12.2981 10.2974 7.5567 T, R(B) 0.4414 0.4540 0.3301 T, R(B) 0.2797 0.2595 0.2091
T, R(C) 13.6000 11.6573 9.4806 T, R(C) 0.4675 0.4718 0.3248 T, R(C) 0.2919 0.2738 0.2188
S, B 11.0489 7.3710 5.8964 S, B 0.4423 0.4427 0.3089 S, B 0.2683 0.2330 0.1859
S, I 11.2372 8.8757 4.0492 S, I 0.4309 0.4582 0.2516 S, I 0.2608 0.2540 0.1559
S, C 11.2324 9.6305 6.2983 S, C 0.4464 0.4522 0.3260 S, C 0.2738 0.2504 0.1969
S, C, I 12.2068 8.9984 6.1892 S, C, I 0.4503 0.4550 0.2950 S, C, I 0.2764 0.2499 0.1848
S, R(B) 11.4328 9.1563 6.5322 S, R(B) 0.4517 0.4593 0.3064 S, R(B) 0.2751 0.2546 0.1903
S, R(C) 12.4954 9.9989 8.0168 S, R(C) 0.4717 0.4628 0.3188 S, R(C) 0.2873 0.2551 0.2056
S, T, B 11.5911 7.8427 7.1353 S, T, B 0.4364 0.4566 0.3299 S, T, B 0.2762 0.2431 0.2019
S, T, I 12.9408 9.5542 5.7495 S, T, I 0.4522 0.4376 0.2702 S, T, I 0.2865 0.2474 0.1757
S, T, C 12.0767 9.5349 7.5402 S, T, C 0.4388 0.4570 0.3295 S, T, C 0.2794 0.2550 0.2044
S, T, C, I 12.7459 10.8192 8.4431 S, T, C, I 0.4421 0.4668 0.3271 S, T, C, I 0.2833 0.2662 0.2102
S, T, R(B) 11.6348 10.0010 7.8086 S, T, R(B) 0.4408 0.4515 0.3239 S, T, R(B) 0.2790 0.2568 0.2126
S, T, R(C) 13.7994 10.8942 9.0781 S, T, R(C) 0.4567 0.4650 0.3427 S, T, R(C) 0.2895 0.2664 0.2214
Table 7. Performances of BLEURT and BERTScore according to LLMs and Prompt Engineering techniques on the GSM8K dataset.
5.3. HellaSwag
Tables 8 and 9 present the performance of Llama3, Gemma2, Mistral, and various
combinations of Prompt Engineering techniques on the HellaSwag dataset, as measured by
the BLEU, ROUGE, METEOR, BLEURT, and BERTScore metrics. As seen in these tables,
the HellaSwag dataset, which evaluates natural language understanding, common sense
reasoning, and situational comprehension, shows that combinations of CoT, ICL, ToT,
and CoT-based RAG techniques yield the best performance.
Moreover, the optimal combination of these Prompt Engineering techniques varies
depending on the LLM and the evaluation metric, and the best-performing Prompt Engineering approach changes across metrics for all three models, i.e., Llama3, Gemma2, and Mistral. Llama3 generally achieves the best performance with standalone CoT or
combinations of CoT with ICL, ToT, and SSR techniques. For Gemma2, the combination
of CoT-based RAG and ToT techniques yields the best results, while Mistral generally
performs best with combinations of CoT and ICL.
Fundamentally, as observed in Tables 8 and 9, when no Prompt Engineering techniques
are applied, Llama3 typically delivers the best performance, leading in four out of the five
evaluation metrics. Gemma2 only slightly outperforms Llama3 in the BLEURT metric.
However, when all possible Prompt Engineering techniques are considered, Gemma2
emerges as the best-performing LLM, leading in all five evaluation metrics with a substan-
tial performance gap over Llama3, the second-best model. Notably, although Gemma2’s
base PLM initially showed the lowest performance in BLEU, ROUGE, METEOR, and
BERTScore, its performance was significantly enhanced through Prompt Engineering strate-
gies, ultimately making it the top-performing LLM. This demonstrates that appropriate
combinations of Prompt Engineering techniques can significantly improve the performance
of even a baseline PLM.
Table 8. Performances of BLEU, ROUGE, and METEOR according to LLMs and Prompt Engineering
techniques on the HellaSwag dataset.
Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral
BLEU ROUGE METEOR
B 0.9938 0.4228 0.8254 B 0.1477 0.1335 0.1463 B 0.0951 0.0926 0.1034
I 0.1473 0.3997 0.2550 I 0.0855 0.1429 0.0942 I 0.0469 0.1047 0.0659
C 0.0193 0.0000 0.0531 C 0.3168 0.6521 0.2600 C 0.1592 0.3265 0.1312
C, I 1.8112 0.0000 0.0000 C, I 0.3604 0.7168 0.3206 C, I 0.1807 0.3586 0.1605
R(B) 0.8526 0.7209 0.6942 R(B) 0.1309 0.1488 0.1205 R(B) 0.0822 0.0999 0.0858
R(C) 1.0346 0.0000 2.0251 R(C) 0.3440 0.7158 0.2680 R(C) 0.1715 0.3584 0.1344
T, B 0.8721 0.2981 0.9997 T, B 0.1434 0.1290 0.1636 T, B 0.0999 0.0897 0.1222
T, I 1.0498 0.2970 0.7208 T, I 0.1399 0.1213 0.1335 T, I 0.1056 0.0854 0.0996
T, C 0.7567 2.1914 0.0888 T, C 0.4604 0.6989 0.3061 T, C 0.2305 0.3497 0.1535
T, C, I 0.0000 0.0000 0.0000 T, C, I 0.2633 0.7193 0.2937 T, C, I 0.1319 0.3599 0.1468
T, R(B) 0.7593 0.4324 1.1741 T, R(B) 0.1382 0.1373 0.1571 T, R(B) 0.0885 0.0917 0.1157
T, R(C) 0.0000 0.0000 0.7358 T, R(C) 0.3892 0.7382 0.2429 T, R(C) 0.1944 0.3693 0.1215
S, B 1.0413 0.4181 0.4664 S, B 0.1522 0.1345 0.1035 S, B 0.0982 0.0911 0.0705
S, I 0.0003 0.5170 0.2350 S, I 0.0204 0.1202 0.1059 S, I 0.0135 0.0842 0.0823
S, C 0.0054 1.5551 0.0240 S, C 0.1226 0.6381 0.3175 S, C 0.0641 0.3193 0.1614
S, C, I 0.0000 0.0000 0.5114 S, C, I 0.3116 0.6565 0.2992 S, C, I 0.1563 0.3283 0.1503
S, R(B) 0.7829 0.6365 1.3147 S, R(B) 0.1305 0.1395 0.1745 S, R(B) 0.0826 0.0952 0.1327
S, R(C) 0.5530 0.0000 0.9046 S, R(C) 0.3041 0.6775 0.3204 S, R(C) 0.1518 0.3387 0.1604
S, T, B 0.9205 0.3513 1.2806 S, T, B 0.1500 0.1355 0.1666 S, T, B 0.0979 0.0919 0.1271
S, T, I 1.1919 0.5674 1.1760 S, T, I 0.1496 0.1388 0.1510 S, T, I 0.1091 0.0971 0.1250
S, T, C 0.7736 0.9067 0.0124 S, T, C 0.4251 0.6585 0.2467 S, T, C 0.2125 0.3295 0.1252
S, T, C, I 0.0000 0.0000 0.0898 S, T, C, I 0.3718 0.7222 0.2683 S, T, C, I 0.1862 0.3614 0.1346
S, T, R(B) 0.9430 0.5889 1.3780 S, T, R(B) 0.1513 0.1473 0.1792 S, T, R(B) 0.0981 0.0993 0.1415
S, T, R(C) 0.0000 0.0000 0.3419 S, T, R(C) 0.3454 0.7133 0.2687 S, T, R(C) 0.1727 0.3569 0.1348
Table 9. Performances of BLEURT and BERTScore according to LLMs and Prompt Engineering techniques
on the HellaSwag dataset.
5.4. MMLU
Tables 10 and 11 present the performance results of Llama3, Gemma2, and Mistral,
along with various combinations of Prompt Engineering techniques on the MMLU dataset,
evaluated using the BLEU, ROUGE, METEOR, BLEURT, and BERTScore metrics. As ob-
served in these tables, the MMLU dataset, which involves multiple-choice questions on a
wide range of topics related to language understanding and generation, generally shows
that techniques like ToT, CoT, ICL, and CoT-based RAG achieve strong performance.
Additionally, the optimal combination of these Prompt Engineering techniques varies
depending on the LLM and the evaluation metric. For Llama3, the best performance is
generally observed with standalone CoT or combinations of CoT with ICL. For Gemma2,
CoT-based RAG typically yields the highest performance, while for Mistral, the combination
of SSR and CoT tends to perform best.
Fundamentally, as seen in Tables 10 and 11, when no Prompt Engineering techniques
are applied, Gemma2 generally exhibits the best performance across all five evaluation met-
rics, consistently outperforming Llama3. Furthermore, when considering all possible com-
binations of Prompt Engineering techniques, Gemma2 emerges as the top-performing LLM,
leading in all five metrics and demonstrating a substantial performance gap over Llama3,
the second-best model. Llama3 also outperforms Mistral across all five evaluation metrics.
Table 10. Performances of BLEU, ROUGE, and METEOR according to LLMs and Prompt Engineering
techniques on the MMLU dataset.
Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral
BLEU ROUGE METEOR
B 0.6079 3.4827 1.7943 B 0.0881 0.2582 0.1415 B 0.0570 0.1767 0.1148
I 2.1802 3.7322 0.9140 I 0.2003 0.2602 0.0820 I 0.1302 0.1812 0.0650
C 3.5684 0.0000 0.5051 C 0.5526 0.6735 0.2995 C 0.2763 0.3366 0.1503
C, I 0.0000 0.0000 0.3492 C, I 0.5482 0.6614 0.2252 C, I 0.2743 0.3307 0.1126
R(B) 2.8672 4.7673 1.3631 R(B) 0.2259 0.2789 0.0993 R(B) 0.1400 0.1931 0.0769
R(C) 0.9513 1.7725 0.5213 R(C) 0.4991 0.6761 0.4030 R(C) 0.2498 0.3379 0.2011
T, B 0.8849 3.2586 1.7603 T, B 0.1063 0.2570 0.1682 T, B 0.0702 0.1703 0.1359
T, I 1.7126 3.4013 1.5259 T, I 0.1978 0.2566 0.1179 T, I 0.1274 0.1770 0.0909
T, C 3.9037 20.4567 0.7130 T, C 0.5504 0.6739 0.4081 T, C 0.2751 0.3366 0.2043
T, C, I 3.4659 4.6662 1.9065 T, C, I 0.5334 0.6646 0.4400 T, C, I 0.2668 0.3320 0.2203
T, R(B) 2.5767 4.4502 1.9976 T, R(B) 0.2142 0.2682 0.1368 T, R(B) 0.1351 0.1832 0.1095
T, R(C) 1.9788 1.3839 0.4798 T, R(C) 0.5053 0.6719 0.4225 T, R(C) 0.2528 0.3359 0.2113
S, B 1.5409 3.3855 1.6624 S, B 0.1356 0.2567 0.1549 S, B 0.0924 0.1742 0.1280
S, I 1.8313 2.3881 0.7418 S, I 0.1941 0.1977 0.0853 S, I 0.1246 0.1345 0.0761
S, C 1.5335 0.0000 0.3471 S, C 0.5294 0.6650 0.4576 S, C 0.2647 0.3323 0.2293
S, C, I 0.5118 7.1647 0.5612 S, C, I 0.4870 0.6493 0.3697 S, C, I 0.2437 0.3245 0.1846
S, R(B) 2.7748 4.5459 1.7730 S, R(B) 0.2121 0.2561 0.1302 S, R(B) 0.1324 0.1762 0.1108
S, R(C) 1.1176 1.0204 0.5201 S, R(C) 0.4870 0.6590 0.4336 S, R(C) 0.2437 0.3293 0.2166
S, T, B 1.5694 3.2601 1.7471 S, T, B 0.1456 0.2456 0.1528 S, T, B 0.0983 0.1642 0.1261
S, T, I 1.3087 3.1979 0.8279 S, T, I 0.1267 0.2429 0.0923 S, T, I 0.0823 0.1671 0.0858
S, T, C 2.4407 18.9637 0.4261 S, T, C 0.5339 0.6650 0.4523 S, T, C 0.2671 0.3321 0.2265
S, T, C, I 2.3339 3.2443 0.6960 S, T, C, I 0.5030 0.6457 0.4252 S, T, C, I 0.2517 0.3228 0.2128
S, T, R(B) 2.6319 4.2450 1.3921 S, T, R(B) 0.2057 0.2575 0.1279 S, T, R(B) 0.1302 0.1771 0.1122
S, T, R(C) 0.4610 1.1757 0.6834 S, T, R(C) 0.4523 0.6614 0.4392 S, T, R(C) 0.2264 0.3305 0.2195
Table 11. Performances of BLEURT and BERTScore according to LLMs and Prompt Engineering
techniques on the MMLU dataset.
5.5. TruthfulQA
Tables 12 and 13 present the performance of Llama3, Gemma2, and Mistral on the
TruthfulQA dataset, evaluated using BLEU, ROUGE, METEOR, BLEURT, and BERTScore
metrics under different combinations of Prompt Engineering techniques. As observed in
these tables, the TruthfulQA dataset focuses on assessing the truthfulness and cognitive
biases of language models, where techniques like SSR, ToT, and CoT-based RAG generally
yield strong performance.
Furthermore, the optimal combination of these Prompt Engineering techniques varies
depending on the LLM and the evaluation metric. For Llama3, the combination of SSR
and ToT yields the best performance. Similarly, Gemma2 also performs best with the
combination of SSR and ToT. In contrast, for Mistral, the combination of ToT and CoT-based
RAG tends to deliver the highest performance.
Fundamentally, as seen in Tables 12 and 13, Mistral generally exhibits the best perfor-
mance when no Prompt Engineering techniques are applied, leading in four out of five
evaluation metrics. This suggests that the foundational Mistral model has been pre-trained
particularly well for the TruthfulQA dataset. On the other hand, Llama3, which initially
exhibited the lowest performance across all five metrics, shows the most significant im-
provement when Prompt Engineering strategies are applied, ultimately outperforming all
other models across all five metrics.
Notably, despite Llama3’s foundational PLM showing the poorest performance ini-
tially, the application of PE strategies enhanced its capabilities, making it the top-performing
LLM. Mistral follows as the second-best performing LLM, while Gemma2 shows the least
favorable results. Particularly noteworthy is the improvement observed in Mistral when
applying techniques like CoT-based RAG, which significantly boosts its performance on
the TruthfulQA dataset compared to the baseline PLM model.
Table 12. Performances of BLEU, ROUGE, and METEOR according to LLMs and Prompt Engineering
techniques on the TruthfulQA dataset.
Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral
BLEU ROUGE METEOR
B 3.2280 6.4854 7.9794 B 0.1162 0.1965 0.1966 B 0.0883 0.1490 0.2030
I 3.1545 5.8078 10.5174 I 0.1369 0.1919 0.2111 I 0.1031 0.1410 0.2097
C 4.3404 7.5579 11.7685 C 0.1545 0.2069 0.2589 C 0.1183 0.1619 0.2403
C, I 7.6076 9.7970 13.1527 C, I 0.1745 0.2424 0.2512 C, I 0.1456 0.1903 0.2339
R(B) 13.6465 10.3414 5.4771 R(B) 0.2311 0.2637 0.1295 R(B) 0.1994 0.2078 0.1279
R(C) 18.3547 16.1612 16.5590 R(C) 0.2990 0.2987 0.2780 R(C) 0.2618 0.2523 0.2735
T, B 4.7692 4.3545 6.6566 T, B 0.1354 0.1740 0.2096 T, B 0.1040 0.1148 0.2151
T, I 7.4163 5.2657 10.1729 T, I 0.1663 0.1738 0.2451 T, I 0.1480 0.1260 0.2541
T, C 7.8968 5.0370 11.3654 T, C 0.1981 0.1803 0.2772 T, C 0.1624 0.1221 0.2458
T, C, I 23.6890 10.0892 15.4784 T, C, I 0.3489 0.2216 0.3021 T, C, I 0.3378 0.1735 0.2809
T, R(B) 12.1275 5.6581 13.8990 T, R(B) 0.2205 0.2130 0.2375 T, R(B) 0.1968 0.1495 0.2315
T, R(C) 28.8505 19.3221 19.1410 T, R(C) 0.3973 0.3261 0.3671 T, R(C) 0.3718 0.2880 0.3526
S, B 4.3186 5.4876 7.1867 S, B 0.1374 0.1822 0.2308 S, B 0.1063 0.1336 0.2481
S, I 1.7323 3.6296 10.6035 S, I 0.0913 0.1417 0.2802 S, I 0.0642 0.1025 0.2985
S, C 8.9206 7.3408 11.6499 S, C 0.1853 0.2081 0.3020 S, C 0.1515 0.1614 0.2806
S, C, I 21.4575 10.1314 14.2651 S, C, I 0.3293 0.1881 0.3104 S, C, I 0.3014 0.1656 0.3100
S, R(B) 16.4333 8.8920 10.6353 S, R(B) 0.2709 0.2238 0.2775 S, R(B) 0.2394 0.1816 0.2761
S, R(C) 20.8076 21.6764 17.8575 S, R(C) 0.3252 0.3298 0.3412 S, R(C) 0.2924 0.2962 0.3439
S, T, B 4.5782 4.2550 7.5255 S, T, B 0.1467 0.1648 0.2020 S, T, B 0.1070 0.1135 0.2186
S, T, I 9.1290 3.7706 8.0678 S, T, I 0.2010 0.1428 0.2421 S, T, I 0.1946 0.1036 0.2707
S, T, C 11.1736 6.0713 11.9418 S, T, C 0.2202 0.1950 0.2857 S, T, C 0.1841 0.1464 0.2662
S, T, C, I 26.3708 7.6004 15.1529 S, T, C, I 0.3670 0.2193 0.3601 S, T, C, I 0.3475 0.1637 0.3670
S, T, R(B) 15.8823 5.4032 9.2931 S, T, R(B) 0.2509 0.1926 0.2602 S, T, R(B) 0.2256 0.1387 0.2536
S, T, R(C) 29.6606 20.3296 18.4604 S, T, R(C) 0.4112 0.3354 0.3539 S, T, R(C) 0.3931 0.2988 0.3603
Table 13. Performances of BLEURT and BERTScore according to LLMs and Prompt Engineering
techniques on the TruthfulQA dataset.
5.6. Winogrande
Tables 14 and 15 present the performance of Llama3, Gemma2, and Mistral on
the Winogrande dataset, evaluated using the BLEU, ROUGE, METEOR, BLEURT,
and BERTScore metrics under various combinations of Prompt Engineering techniques.
As observed in these tables, the Winogrande dataset focuses on contextual reasoning
abilities concerning common-sense knowledge, where techniques such as ToT, CoT, ICL,
and SSR generally result in strong performance.
Furthermore, the optimal combination of these Prompt Engineering techniques varies
depending on the LLM and the evaluation metric. For Llama3, a combination of ToT, CoT,
and ICL techniques typically yields the best performance. For Gemma2, SSR combined
with CoT tends to deliver the highest performance, while Mistral shows the best results
when combining CoT and ICL techniques.
Fundamentally, as seen in Tables 14 and 15, Gemma2 generally exhibits the best
performance in four out of the five evaluation metrics when no Prompt Engineering
techniques are applied. However, when applying a comprehensive Prompt Engineering
strategy, Gemma2 outperforms both Llama3 and Mistral across all five evaluation metrics.
This indicates that a well-implemented Prompt Engineering strategy can further amplify the
strengths of an LLM. Following Gemma2, Llama3 demonstrates slightly better performance
compared to Mistral.
Table 14. Performances of BLEU, ROUGE, and METEOR according to LLMs and Prompt Engineering
techniques on the Winogrande dataset.
Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral
BLEU ROUGE METEOR
B 6.1499 2.8829 0.5264 B 0.3990 0.5460 0.2813 B 0.2067 0.2858 0.1667
I 0.0000 21.1627 0.6066 I 0.3757 0.5440 0.2641 I 0.1908 0.2842 0.1466
C 0.0000 0.0000 9.1291 C 0.2126 0.6772 0.5354 C 0.1080 0.3428 0.2701
C, I 0.0000 0.0000 5.2921 C, I 0.4862 0.6614 0.5824 C, I 0.2483 0.3332 0.2996
R(B) 0.0000 0.0000 0.8863 R(B) 0.3071 0.5617 0.1200 R(B) 0.1563 0.2931 0.0740
R(C) 0.0000 0.0000 4.0067 R(C) 0.4685 0.6417 0.3530 R(C) 0.2394 0.3260 0.1810
T, B 6.8716 2.2136 0.3370 T, B 0.3471 0.5058 0.2642 T, B 0.1796 0.2695 0.1692
T, I 0.0000 0.0000 0.4283 T, I 0.2782 0.4862 0.2490 T, I 0.1433 0.2504 0.1499
T, C 0.0000 0.0000 11.3092 T, C 0.1220 0.6457 0.4146 T, C 0.0627 0.3290 0.2105
T, C, I 0.0000 0.0000 0.0000 T, C, I 0.5807 0.6083 0.5486 T, C, I 0.2938 0.3059 0.2794
T, R(B) 14.1599 0.0000 0.4806 T, R(B) 0.2900 0.5440 0.2154 T, R(B) 0.1446 0.2836 0.1373
T, R(C) 0.0000 0.0000 8.6535 T, R(C) 0.4961 0.6063 0.3963 T, R(C) 0.2522 0.3083 0.2040
S, B 0.0000 2.7428 0.3879 S, B 0.3228 0.5423 0.2764 S, B 0.1714 0.2884 0.1716
S, I 0.0000 8.3435 0.4342 S, I 0.2900 0.4849 0.1843 S, I 0.1516 0.2521 0.1093
S, C 0.0000 0.0000 4.7501 S, C 0.3642 0.6791 0.3622 S, C 0.1855 0.3447 0.1818
S, C, I 0.0000 0.0000 1.9968 S, C, I 0.4764 0.6673 0.5233 S, C, I 0.2426 0.3371 0.2672
S, R(B) 3.4654 0.0000 0.3439 S, R(B) 0.2375 0.5007 0.1805 S, R(B) 0.1234 0.2612 0.1114
S, R(C) 0.0000 0.0000 6.2692 S, R(C) 0.4705 0.6260 0.4221 S, R(C) 0.2404 0.3182 0.2173
S, T, B 4.6461 1.8345 0.2370 S, T, B 0.2917 0.5315 0.2085 S, T, B 0.1531 0.2851 0.1362
S, T, I 3.8892 22.6567 0.1773 S, T, I 0.2087 0.4980 0.1539 S, T, I 0.1071 0.2621 0.1163
S, T, C 0.0000 0.0000 0.0000 S, T, C 0.1457 0.6634 0.2651 S, T, C 0.0746 0.3369 0.1325
S, T, C, I 0.0000 0.0000 0.0000 S, T, C, I 0.3386 0.6201 0.2362 S, T, C, I 0.1710 0.3135 0.1188
S, T, R(B) 4.2532 0.0000 0.2733 S, T, R(B) 0.2808 0.5092 0.1680 S, T, R(B) 0.1479 0.2659 0.1153
S, T, R(C) 0.0000 0.0000 8.6537 S, T, R(C) 0.4764 0.6299 0.4134 S, T, R(C) 0.2434 0.3201 0.2097
Table 15. Performances of BLEURT and BERTScore according to LLMs and Prompt Engineering
techniques on the Winogrande dataset.
Table 16. Comparison of results across models and datasets using different metrics.
When considering datasets that encompass a mix of diverse categories, i.e., essentially
reflecting more general datasets, aggregating the occurrences of the most effective Prompt
Engineering techniques as presented in Table 16 can provide valuable insights into which
methods are likely to be most effective across a broad range of datasets. Table 17 summarizes
the cumulative usage of various Prompt Engineering techniques across different datasets.
The experimental results reveal that the CoT technique was the most frequently employed,
appearing 56 times. The Tree of Thought method followed with 47 instances, while ICL
and SSR were each utilized 35 times. Although the R(C) technique was used 28 times
and showed strong performance on specific datasets, it did not exhibit the same level of
consistent performance as the CoT method. These findings suggest that the CoT technique
is particularly effective in enhancing problem-solving and reasoning capabilities across a
wide variety of datasets. In contrast, the R(C) technique excels in generating fact-based
responses, indicating its potential effectiveness in more specialized response generation
scenarios. Furthermore, when applying RAG, the effectiveness can be further enhanced
by using CoT-based data augmentation with R(C), rather than directly applying raw data
with R(B).
To elucidate the dependence on PE techniques and identify the optimal combinations
for each LLM, Table 18 presents an analysis of the frequency with which different PE
techniques are employed across various LLMs. As demonstrated in Tables 16 and 18,
there is a clear inverse relationship between the performance of the base PLM and its
reliance on PE strategies, i.e., models such as Gemma2 and Llama3, which exhibit superior baseline performance, tend to depend on fewer or simpler PE combinations to reach their best results.
Table 17. The number of Prompt Engineering techniques used per dataset.
Table 18. The number of Prompt Engineering techniques used per LLM.
Table 19 presents example prompts constructed for each PE strategy. B presents only the original question, while I provides an example question-and-answer pair before presenting the original question in the same format. C presents the original question along with a step-
by-step breakdown of the problem, offering a logical reasoning process to progressively
approach the correct answer. R aids in expanding relevant background knowledge and
facilitating the generation of accurate answers by providing related information prior to
presenting the question. S guides LLMs to approach problems systematically and incremen-
tally, breaking down complex problems into smaller, manageable steps to provide a logical
reasoning process instead of directly generating an answer. T explores multiple pathways
to problem-solving and derives the optimal solution by evaluating various possibilities.
S, T, R(C) methods involve constructing a prompt by combining the aforementioned S, T
and R(C) approaches with the original question to provide relevant information.
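A sketch of how these fragments can be composed into a single prompt, such as the combined S, T, R(C) configuration, is given below; the fragment wording mirrors Table 19, while the composition function itself is an illustrative assumption.

```python
# Compose a prompt for a given PE combination from the building blocks in Table 19.
# Fragment wording mirrors Table 19; the composition logic is illustrative.
TOT_HEADER = (
    "There are three different experts who will answer this question.\n"
    "Each expert will write down their thoughts and share them with the group.\n"
    "They will then evaluate the shared thoughts and deduce the most appropriate answer."
)
SSR_HEADER = (
    "Step 1: Understand the Instructions\n"
    "Ensure that the answer is directly addressing the QUESTION.\n"
    "Step 2: Analyze the Question\n"
    "Read the QUESTION thoroughly.\n"
    "Identify the context provided in the QUESTION."
)

def build_prompt(question: str, use_ssr: bool = False, use_tot: bool = False,
                 related_info: str = "", chain_of_thought: str = "") -> str:
    parts = []
    if use_ssr:
        parts.append(SSR_HEADER)
    if use_tot:
        parts.append(TOT_HEADER)
    if related_info:                       # R(B) / R(C): retrieved context first
        parts.append(f"#Related Information\n{related_info}")
    parts.append(f"#Question\n{question}")
    if chain_of_thought:                   # C / R(C): step-by-step reasoning
        parts.append(f"#Chain of Thought\n{chain_of_thought}")
    return "\n\n".join(parts)

# S, T, R(C) configuration for the running example.
print(build_prompt(
    "Is brown sugar healthier than white sugar?",
    use_ssr=True, use_tot=True,
    related_info="brown sugar has the same calories and health risks as white sugar.",
    chain_of_thought="Some people believe that brown sugar is healthier because it contains molasses...",
))
```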
The responses of each LLM to the prompts in Table 19 are presented in Tables 20–22.
As shown in Table 20, Gemma’s responses tend to be shorter and more focused on core
content compared to other LLMs. Analyzing by PE strategy, the B approach yields a single-
word response: “No”. Even in other strategies (e.g., I, C, R), Gemma provides slightly more
elaborated responses such as “Brown sugar is not significantly healthier than white sugar”.
However, it consistently maintains brevity and delivers the same core message across
all PE strategies. This indicates that Gemma’s internal reasoning process is robust and
minimally influenced by variations in PE strategies. The primary reason for this behavior is
Gemma’s design philosophy, which prioritizes conciseness and consistency. Such principles
reduce the risk of hallucination and enhance response accuracy by avoiding the generation
of unnecessary content. However, this design also leads to the limitation of providing
insufficient explanatory details, such as the molasses content or calorie differences between
brown and white sugar.
Variations in Gemma’s responses can be observed depending on the PE strategy
applied. Using the B strategy, Gemma generates an extremely concise response, such
as “No”, indicating minimal engagement with structured reasoning. When employing I,
the response becomes slightly more explicit and context-aware, such as “No, brown sugar
is not healthier than white sugar”, demonstrating that ICL helps guide Gemma to align
its responses better with the query’s context. The C approach activates logical reasoning,
resulting in a more refined response like “Brown sugar is not significantly healthier than
white sugar”, reflecting CoT’s ability to enhance logical thinking. Similarly, the S strategy
produces responses comparable to CoT, as both rely on structured logical reasoning to
systematically approach the problem. Under the T strategy, Gemma consistently generates
the same response as CoT, showing that exploring multiple reasoning pathways does
not alter its conclusions, thus demonstrating robust internal consistency. Lastly, the R(C)
strategy incorporates external information into the prompt but still concludes with the same
response: “Brown sugar is not significantly healthier than white sugar”. This indicates that
Gemma’s internalized knowledge is robust enough to generate accurate responses without
heavily depending on retrieved external information. In contrast, as shown in Table 21,
Llama’s responses are relatively longer and include more detailed explanations compared
to Gemma. Furthermore, the level of detail and contextual richness varies depending on
the applied PE strategy.
First, the B approach produces the response, “The health difference between the two is
relatively minimal”, which includes appropriate contextual information. This demonstrates
that Llama 3 can generate refined and reasonable responses even without structured
reasoning, relying solely on its default capabilities. Using I, the response, “Brown sugar
is not significantly healthier than white sugar. Both contain almost the same amount of
calories and sugar. . . ” shows that Llama 3, guided by ICL, is capable of providing more
specific and contextually rich details. The C strategy results in the response, “No, brown
sugar is not significantly healthier than white sugar due to the small amount of molasses
it contains”, reflecting the ability of Llama 3 to generate logical and concise answers that
directly address the core conclusion through structured reasoning. The R response, “No,
brown sugar has the same calories and health risks as white sugar”, is highly concise and
fact-focused. This suggests that the RAG-based approach emphasizes retrieving external
information while delivering key insights without extensive elaboration. The S response,
“Brown sugar is considered slightly healthier. . . However, it is still high in calories and can
be detrimental to health. . . ”, provides a systematic, step-by-step analysis that incorporates
both advantages and disadvantages. T demonstrates the ability to explore diverse reasoning
pathways and synthesize a comprehensive conclusion through the response, “While brown
sugar may seem like a healthier option. . . both sugars are essentially interchangeable”.
Finally, the S, T, R(C) combined approach produces the concise and fact-based response,
“No, brown sugar is not healthier than white sugar”. This indicates that even with a
combination of strategies, the response remains focused on delivering a straightforward
and factual answer.
Finally, as shown in Table 22, Mistral’s responses are moderately detailed across all PE
strategies, positioning themselves between Gemma’s concise answers and Llama 3’s highly
elaborative responses. When compared to Gemma, Mistral provides richer responses by
including additional information such as calorie content, health risks, and alternatives,
whereas Gemma focuses solely on delivering the core conclusion in a concise manner.
In comparison with Llama 3, Mistral is less detailed than Llama 3’s highly developed
responses, particularly those generated using advanced PE strategies like ToT, which
integrate multidimensional reasoning and expert perspectives. However, Mistral offers
more contextual information than Gemma, striking a balance between brevity and detail.
In a detailed analysis, Mistral’s B response, “Brown sugar is not healthier than white
sugar. . . ”, thoroughly addresses calorie content, carbohydrate composition, molasses dif-
ferences, health risks, and potential alternatives (e.g., honey, maple syrup). I, C, R produce
concise, fact-based answers focusing on calorie and molasses differences. S and T incor-
porate details such as the mineral content of brown sugar while emphasizing that these
elements have minimal health impact. The combined approach of S, T, R(C) generates
concise and factual responses, exemplified by “No, brown sugar is not healthier than white
sugar”. Mistral demonstrates a consistent ability to adapt its responses across PE strategies,
balancing detail and clarity effectively.
PE           Prompt
B            #Question
             Is brown sugar healthier than white sugar?
I            #Example Question
             What happened during the first radio broadcast of [The War of the Worlds]?
             #Example Answer
             There was no mass panic, but a few listeners called into the station.
             #Question
             Is brown sugar healthier than white sugar?
C            #Question
             Is brown sugar healthier than white sugar?
             #Chain of Thought
             Some people believe that brown sugar is healthier because it contains molasses. However, the amount of molasses is very small, making the health differences negligible. Therefore, the claim that brown sugar is healthier than white sugar is incorrect.
R            #Related Information
             brown sugar has the same calories and health risks as white sugar.
             #Question
             Is brown sugar healthier than white sugar?
S            Step 1: Understand the Instructions
             Ensure that the answer is directly addressing the QUESTION.
             Step 2: Analyze the Question
             Read the QUESTION thoroughly.
             Identify the context provided in the QUESTION.
             #Question
             Is brown sugar healthier than white sugar?
T            There are three different experts who will answer this question.
             Each expert will write down their thoughts and share them with the group.
             They will then evaluate the shared thoughts and deduce the most appropriate answer.
             #Question
             Is brown sugar healthier than white sugar?
S, T, R(C)   There are three different experts who will answer this question.
             Each expert will write down their thoughts and share them with the group.
             They will then evaluate the shared thoughts and deduce the most appropriate answer.
             #Related Information
             brown sugar has the same calories and health risks as white sugar.
             #Question
             Is brown sugar healthier than white sugar?
             #Chain of Thought
             Some people believe that brown sugar is healthier because it contains molasses. However, the amount of molasses is very small, making the health differences negligible. Therefore, the claim that brown sugar is healthier than white sugar is incorrect.
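Assuming the single-letter labels follow the strategies evaluated in this study (B = baseline, I = ICL, C = CoT, R = RAG, S = SSR, T = ToT, and R(C) = RAG with CoT-augmented data), the short Python sketch below shows how the templates listed above could be assembled programmatically. The function names and string layout are illustrative, not the implementation used in this study.

QUESTION = "Is brown sugar healthier than white sugar?"
RELATED_INFO = "brown sugar has the same calories and health risks as white sugar."
CHAIN_OF_THOUGHT = (
    "Some people believe that brown sugar is healthier because it contains molasses. "
    "However, the amount of molasses is very small, making the health differences negligible. "
    "Therefore, the claim that brown sugar is healthier than white sugar is incorrect."
)
TOT_PREAMBLE = (
    "There are three different experts who will answer this question.\n"
    "Each expert will write down their thoughts and share them with the group.\n"
    "They will then evaluate the shared thoughts and deduce the most appropriate answer."
)
ICL_EXAMPLE = (
    "#Example Question\n"
    "What happened during the first radio broadcast of [The War of the Worlds]?\n"
    "#Example Answer\n"
    "There was no mass panic, but a few listeners called into the station."
)
SSR_STEPS = (
    "Step 1: Understand the Instructions\n"
    "Ensure that the answer is directly addressing the QUESTION.\n"
    "Step 2: Analyze the Question\n"
    "Read the QUESTION thoroughly.\n"
    "Identify the context provided in the QUESTION."
)

def build_b(q):                       # baseline: question only
    return f"#Question\n{q}"

def build_i(q):                       # ICL: worked example followed by the question
    return f"{ICL_EXAMPLE}\n#Question\n{q}"

def build_c(q, chain):                # CoT: question plus a chain of thought
    return f"#Question\n{q}\n#Chain of Thought\n{chain}"

def build_r(q, related):              # RAG: retrieved information plus the question
    return f"#Related Information\n{related}\n#Question\n{q}"

def build_s(q):                       # SSR: explicit step-by-step instructions plus the question
    return f"{SSR_STEPS}\n#Question\n{q}"

def build_t(q):                       # ToT: three-expert framing plus the question
    return f"{TOT_PREAMBLE}\n#Question\n{q}"

def build_s_t_rc(q, related, chain):  # combined prompt exactly as listed in the table above
    return (f"{TOT_PREAMBLE}\n#Related Information\n{related}\n"
            f"#Question\n{q}\n#Chain of Thought\n{chain}")

print(build_s_t_rc(QUESTION, RELATED_INFO, CHAIN_OF_THOUGHT))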
PE           Answer (Gemma)
B            No.
I            No, brown sugar is not healthier than white sugar.
C            Brown sugar is not significantly healthier than white sugar.
R            Brown sugar is not significantly healthier than white sugar.
S            No, brown sugar is not necessarily healthier than white sugar.
T            Brown sugar is not significantly healthier than white sugar.
S, T, R(C)   Brown sugar is not significantly healthier than white sugar.
PE           Answer (Llama 3)
B            While brown sugar may contain some additional nutrients like iron and minerals compared to white sugar, it is still considered a refined sugar and should be consumed in moderation. Therefore, the health difference between the two is relatively minimal.
I            Brown sugar is not significantly healthier than white sugar. Both contain almost the same amount of calories and sugar, with the main difference being that brown sugar contains a small amount of molasses, which gives it a distinct flavor and darker color.
C            No, brown sugar is not significantly healthier than white sugar due to the small amount of molasses it contains.
R            No, brown sugar has the same calories and health risks as white sugar.
S            Brown sugar is considered slightly healthier than white sugar due to its higher content of minerals like iron, calcium, and potassium. However, it is still high in calories and can be detrimental to health if consumed excessively. The main difference between brown sugar and white sugar lies in the refining process, with brown sugar retaining some of its molasses content, which provides it with these additional minerals. Nevertheless, it’s essential to note that the health benefits of brown sugar are relatively minor compared to the negative effects of excessive sugar consumption in general.
T            Expert 1 (Nutritionist): Brown sugar is not necessarily healthier than white sugar, while it contains some minerals like iron and calcium, it is still primarily composed of sucrose, a simple sugar that can cause a spike in blood sugar levels and contribute to various health problems when consumed excessively. The difference in nutritional content between brown sugar and white sugar is minimal, and the health benefits of brown sugar are largely anecdotal.
             Expert 2 (Food Scientist): From a chemical perspective, brown sugar and white sugar are almost identical. Both are composed of sucrose, a disaccharide made up of glucose and fructose molecules. The main difference between the two is the presence of molasses in brown sugar, which gives it a distinct flavor and color. However, the molasses does not significantly impact the nutritional content or health effects of brown sugar. In terms of health, both sugars are essentially interchangeable.
             Expert 3 (Registered Dietitian): While brown sugar may seem like a healthier option due to its slightly higher mineral content, the difference is negligible. The real issue is not the type of sugar, but rather the amount of sugar consumed overall. Both brown sugar and white sugar can contribute to excessive sugar intake, which is linked to various health problems, including obesity, diabetes, and tooth decay. To make a healthier choice, focus on reducing sugar intake in general, rather than opting for one type of sugar over another.
             Group Evaluation: After considering the perspectives of our experts, it’s clear that brown sugar is not significantly healthier than white sugar, while it may contain some minerals, the difference is minimal, and both sugars can contribute to excessive sugar intake when consumed in excess. The key takeaway is to focus on reducing sugar intake in general, rather than opting for one type of sugar over another.
S, T, R(C)   No, brown sugar is not healthier than white sugar.
PE           Answer (Mistral)
B            Brown sugar is not healthier than white sugar. Both contain similar amounts of calories and simple carbohydrates. The main difference is that brown sugar contains a small amount of molasses, which gives it a slightly different flavor and color, but it does not significantly affect its nutritional value. Consuming either type of sugar in excess can lead to health problems such as tooth decay and weight gain. It’s always best to limit added sugars in your diet and opt for natural sweeteners like honey or maple syrup in moderation.
I            Brown sugar is not healthier than white sugar. Both contain the same amount of calories and simple carbohydrates. The main difference is that brown sugar contains a small amount of molasses, which gives it a slightly different flavor and color, but it does not make it a healthier choice.
C            No, brown sugar is not healthier than white sugar. The small amount of molasses in brown sugar does not significantly impact its nutritional value compared to white sugar. Both are high in calories and should be consumed in moderation.
R            Brown sugar is not healthier than white sugar. Both contain the same number of calories and can contribute to health risks when consumed in excess.
S            No, brown sugar is not healthier than white sugar. Both contain similar amounts of calories and simple carbohydrates. The difference lies in the molasses content, which provides a slightly higher amount of minerals like iron and potassium in brown sugar.
T            Brown sugar is not healthier than white sugar. Both brown and white sugar are high in calories and can contribute to weight gain and other health issues when consumed in excess. Brown sugar may have a slightly higher content of minerals like iron and potassium, but these amounts are negligible compared to the sugar content. It’s always best to limit added sugars in your diet and opt for natural sweeteners like honey or maple syrup in moderation.
S, T, R(C)   No, brown sugar is not healthier than white sugar.
Table 23 summarizes the strengths and weaknesses of each model across various
Prompt Engineering techniques. In summary, Gemma is best for tasks requiring concise,
efficient, and consistent answers but lacks depth and adaptability for advanced reasoning
tasks. Llama 3 excels in depth, multidimensional reasoning, and advanced strategies
like ToT but may be overly verbose for straightforward queries. Mistral strikes a balance
between detail and brevity, providing contextual richness without excessive elaboration,
making it versatile for a wide range of tasks.
6. Conclusions
This study analyzed and applied prominent Prompt Engineering techniques, such
as ICL, CoT, RAG, SSR, and ToT, to major LLMs like Llama 3, Mistral, and Gemma2.
The performance was evaluated across multiple datasets, including ARC, HellaSwag,
MMLU, TruthfulQA, Winogrande, and GSM8K, using metrics such as BLEU, ROUGE,
METEOR, BLEURT, and BERTScore.
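As a rough illustration of how such reference-based metrics can be computed for answers like those tabulated above, the sketch below assumes the Hugging Face evaluate library (with the rouge_score, nltk, and bert_score packages installed for the respective metrics); it is not the exact evaluation pipeline used in this study.

import evaluate

predictions = ["No, brown sugar is not healthier than white sugar."]
references = ["Brown sugar has the same calories and health risks as white sugar."]

# ROUGE: n-gram and longest-common-subsequence overlap
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))

# METEOR: unigram matching with stemming and synonymy
meteor = evaluate.load("meteor")
print(meteor.compute(predictions=predictions, references=references))

# BERTScore: contextual-embedding similarity (precision/recall/F1 per example)
bertscore = evaluate.load("bertscore")
print(bertscore.compute(predictions=predictions, references=references, lang="en"))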
The experimental results demonstrated that the most appropriate Prompt Engineering
techniques vary depending on the characteristics of each dataset. Specifically, datasets
that emphasize mathematical and logical reasoning benefited from Prompt Engineering
strategies centered on CoT, SSR, and ToT, while datasets focused on natural language
understanding showed improved performance with ICL-centric strategies. For datasets
where factual accuracy is paramount, RAG-based strategies proved to be most effective.
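The small sketch below merely codifies this reported trend as a lookup table; the key names and helper function are illustrative and not part of the study.

# Illustrative mapping of dataset characteristics to the PE strategies found most beneficial.
PE_BY_TASK = {
    "mathematical_logical_reasoning": ("CoT", "SSR", "ToT"),   # e.g., GSM8K-style tasks
    "natural_language_understanding": ("ICL",),
    "factual_accuracy": ("RAG",),                              # e.g., TruthfulQA-style tasks
}

def suggest_pe(task_type: str) -> tuple:
    # Fall back to CoT, which the study identifies as the single most advantageous technique overall.
    return PE_BY_TASK.get(task_type, ("CoT",))

print(suggest_pe("factual_accuracy"))  # ('RAG',)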
However, the study also revealed that the optimal combination of Prompt Engineering techniques can differ significantly depending on the LLM. This indicates that, although a broadly applicable Prompt Engineering technique may exist across different LLMs and datasets, the specific optimal strategy still varies with the dataset and model in question.
Generally, Gemma2 exhibited the best performance across most datasets, regardless
of whether a Prompt Engineering strategy was applied. This suggests that well-trained PLMs tend to maintain their performance advantage even when additional PE strategies are employed. Among individual Prompt Engineering techniques, CoT was found to be the most advantageous
for performance enhancement. Furthermore, the study showed that applying RAG with
CoT-based data augmentation is more effective than using raw RAG data alone.
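As a hedged illustration of this finding, the hypothetical helpers below first expand a retrieved passage into a short chain of thought (via a caller-supplied llm function) and then place both into the final prompt, mirroring the R(C) augmentation used in the combined strategy; the paper reports the effect but does not prescribe this implementation.

from typing import Callable

def augment_retrieval_with_cot(question: str, related_info: str,
                               llm: Callable[[str], str]) -> str:
    # Hypothetical helper: ask a model to expand the raw retrieved passage into a short
    # reasoning chain before it is inserted into the final prompt.
    instruction = (
        "Using only the information below, write a brief step-by-step reasoning chain "
        "that answers the question.\n"
        f"#Related Information\n{related_info}\n#Question\n{question}"
    )
    return llm(instruction)

def build_rc_prompt(question: str, related_info: str, chain: str) -> str:
    # Final prompt combines the retrieved facts with the generated chain of thought.
    return (f"#Related Information\n{related_info}\n"
            f"#Question\n{question}\n#Chain of Thought\n{chain}")

# Example with a stand-in model:
dummy_llm = lambda p: ("The retrieved information states that both sugars have the same "
                       "calories and health risks, so brown sugar is not healthier.")
chain = augment_retrieval_with_cot(
    "Is brown sugar healthier than white sugar?",
    "brown sugar has the same calories and health risks as white sugar.",
    dummy_llm,
)
print(build_rc_prompt("Is brown sugar healthier than white sugar?",
                      "brown sugar has the same calories and health risks as white sugar.",
                      chain))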
It was also observed that more advanced and recently developed LLMs have a lower
dependence on PE, and when PE is applied, these models show a greater degree of per-
formance improvement. In terms of specific dependence on PE techniques, it was found
that Gemma2 and Llama 3, which have the strongest base PLM performance, rely less on ICL, while Mistral, with a lower base PLM performance, shows a higher reliance on ICL. Conversely, Gemma2 and Llama 3 demonstrated a higher reliance on RAG techniques
for performance enhancement, indicating that even well-developed LLMs benefit from
strategically applied PE-based preprocessing.
The findings of this study emphasize the importance of considering both the charac-
teristics of the dataset and the inherent properties of the LLM when selecting Prompt Engi-
neering techniques for performance optimization. These insights will be crucial for guiding
future research and strategy development in the field of LLM performance optimization.
Author Contributions: Software, M.S.; Validation, M.S. and S.L.; Investigation, S.L.; Resources, S.L.;
Writing—original draft, S.L.; Writing—review & editing, Y.-J.W.; Funding acquisition, Y.-J.W. All
authors have read and agreed to the published version of the manuscript.
Funding: This research was financially supported by the Ministry of Trade, Industry and Energy
(MOTIE) and Korea Institute for Advancement of Technology (KIAT) through the International
Cooperative R&D program (Project Number: P0022701).
Data Availability Statement: The original contributions presented in this study are included in the
article. Further inquiries can be directed to the corresponding author.
Conflicts of Interest: The funders had no role in the design of the study; in the collection, analyses,
or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.