
Article

Optimizing Large Language Models: A Deep Dive into Effective Prompt Engineering Techniques

Minjun Son 1, Yun-Jae Won 2 and Sungjin Lee 3,*

1 Department of Metabiohealth, Sungkyunkwan University, Suwon 16419, Republic of Korea; mj.son14820@gmail.com
2 Korea Electronics Technology Institute, Seongnam-si 13488, Republic of Korea; yjwon@keti.re.kr
3 Department of Smart Automotive, Soonchunhyang University, Asan-si 31538, Republic of Korea
* Correspondence: sungjinlee@sch.ac.kr

Abstract: Recent advancements in Natural Language Processing (NLP) technologies have been driven at an unprecedented pace by the development of Large Language Models
(LLMs). However, challenges remain, such as generating responses that are misaligned
with the intent of the question or producing incorrect answers. This paper analyzes various
Prompt Engineering techniques for large-scale language models and identifies methods that
can optimize response performance across different datasets without the need for extensive
retraining or fine-tuning. In particular, we examine prominent Prompt Engineering tech-
niques including In-Context Learning (ICL), Chain of Thought (CoT), Retrieval-Augmented
Generation (RAG), Step-by-Step Reasoning (SSR), and Tree of Thought (ToT), and we apply
these techniques to leading LLMs such as Gemma2, Llama3, and Mistral. The performance
of these models was evaluated using the AI2 Reasoning Challenge (ARC), HellaSwag, Mas-
sive Multitask Language Understanding (MMLU), TruthfulQA, Winogrande, and Grade
School Math (GSM8k) datasets across metrics such as BLEU, ROUGE, METEOR, BLEURT,
and BERTScore. The experimental results indicate that the most suitable Prompt Engineer-
ing technique can vary depending on the characteristics of each dataset. Specifically, for
datasets emphasizing mathematical and logical reasoning, Prompt Engineering strategies
centered around CoT, SSR, and ToT were found to be advantageous. For datasets focusing
on natural language understanding, ICL-centric strategies were more effective, while RAG-
based strategies were beneficial for datasets where factual accuracy is crucial. However, it
was also observed that the optimal combination of Prompt Engineering techniques could
differ depending on the specific LLM, indicating that fine-tuning the Prompt Engineering
approach to the model and dataset is essential for achieving the best performance. The findings
indicate that as LLMs become more advanced, their reliance on Prompt Engineering
(PE) techniques diminishes, yet the magnitude of their performance improvement when
PE strategies are applied increases. Furthermore, these advanced models tend to depend
less on ICL techniques while exhibiting a greater reliance on RAG strategies. It is also
evident that implementing RAG with PE-based preprocessing yields superior performance
enhancements compared to the mere application of RAG on raw data.

Keywords: large language model; prompt engineering; in-context learning; chain of thought; retrieval-augmented generation; step-by-step reasoning; tree of thought

Academic Editor: Jose María Alvarez Rodríguez
Received: 23 November 2024; Revised: 26 December 2024; Accepted: 24 January 2025; Published: 30 January 2025
Citation: Son, M.; Won, Y.-J.; Lee, S. Optimizing Large Language Models: A Deep Dive into Effective Prompt Engineering Techniques. Appl. Sci. 2025, 15, 1430. https://doi.org/10.3390/app15031430
Copyright: © 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

Recent advancements in Natural Language Processing (NLP) technology have been propelled at an unprecedented pace by the development of Large Language Models (LLMs).


These models are pre-trained on vast datasets, equipping them with the ability to recognize
and comprehend complex language patterns. From question answering in conversational
agents, text summarization, and language translation, to content generation and speech
recognition, LLMs are integral to the progress of artificial intelligence, particularly in con-
sumer electronics and technologies (CE/CT) [1]. In consumer applications, such as smart
home assistants, voice-controlled devices, and interactive entertainment systems, LLMs
contribute to more intuitive user experiences by enhancing natural language understanding
and response generation.
The evolution of LLMs from simple statistical language models [2] to sophisticated
transformer-based architectures like GPT-4 [3,4], PaLM [5], and LLaMA [6] marks a signifi-
cant leap in their capabilities, influencing diverse CE/CT applications. Prominent mod-
els include OpenAI’s GPT series [3,4], Meta’s LLaMA family [7–9], Google’s Gemini [2]
and Gemma [10], as well as Mistral’s recent advancements [11,12], all of which are continu-
ously evolving to optimize their performance in consumer-facing technologies.
The capabilities of LLMs are grounded in Pre-trained Language Models (PLMs) and
subsequent fine-tuning techniques. As PLMs have grown in scale, developing into highly
capable LLMs, novel reasoning capabilities have emerged, further accelerating their utility
in various applications. These abilities include In-Context Learning (ICL) [13], Instruction
Following (IF) [14], and Step-by-Step Reasoning (SSR) [15], which are increasingly critical
for enhancing consumer electronics interactions, such as improving response accuracy in
smart assistants or optimizing content recommendations in entertainment systems.
Despite the advancements in PLMs, their accuracy often falls short, particularly when
applied to specific datasets or tasks, and they are prone to generating incorrect answers that
include non-existent information, a phenomenon known as hallucination. To address these
issues, various Prompt Engineering (PE) techniques have recently been developed. Prompt
Engineering plays a crucial role in determining the quality of the final output generated by
an LLM by restructuring the prompt’s structure and content to optimize the model’s output.
Notably, Prompt Engineering allows for the extraction of desired responses during the
inference stage without the need for costly and time-consuming pre-training or fine-tuning
stages, making it a highly advantageous technique for commercial applications [16].
One prominent approach within Prompt Engineering is the Retrieval-Augmented
Generation (RAG) technique. RAG enriches the model's input by combining a retrieval step,
where external knowledge is leveraged to supplement the information, with a generation
step that produces the final output. This integration improves the accuracy of information
retrieval and enables the model to provide more accurate responses to queries that involve
up-to-date information or rare facts.
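
To make the retrieve-augment-generate flow concrete, the following is a minimal sketch in Python; the `retrieve` and `llm` functions are hypothetical placeholders for an external knowledge source and a language-model call, not components described in this paper.

```python
# Minimal RAG sketch: retrieve external passages, augment the prompt, then generate.
# `retrieve` and `llm` are hypothetical placeholders for an external search index
# and a language-model API; they are not defined by the paper.

def retrieve(query: str, k: int = 3) -> list[str]:
    """Return the top-k passages from an external knowledge source (placeholder)."""
    raise NotImplementedError

def llm(prompt: str) -> str:
    """Call a large language model and return its completion (placeholder)."""
    raise NotImplementedError

def rag_answer(question: str) -> str:
    passages = retrieve(question)                      # retrieval step
    context = "\n".join(f"- {p}" for p in passages)    # augmentation step
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm(prompt)                                 # generation step
```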
In contrast, the Chain of Thought (CoT) technique is designed to guide language
models to explicitly articulate intermediate steps or logical reasoning processes when
solving complex problems. For instance, by structuring prompts to include the step-by-step
solution process for a math problem, CoT enables the evaluation of both the correctness of
the final response and the appropriateness of the reasoning that led to it.
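
For illustration, a CoT-style prompt can embed a worked example whose reasoning is written out before posing the new question; the example content below is ours, not taken from the paper's prompt set.

```python
# Chain of Thought: the prompt includes a worked example whose reasoning is spelled
# out step by step, so the model imitates that reasoning pattern for the new question.
cot_prompt = (
    "Q: A shop sells pens at 3 dollars each. How much do 4 pens cost?\n"
    "A: Each pen costs 3 dollars. 4 pens cost 4 * 3 = 12 dollars. The answer is 12.\n\n"
    "Q: A train travels 60 km per hour. How far does it go in 3 hours?\n"
    "A:"  # the model is expected to produce the intermediate reasoning and the answer
)
```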
The In-Context Learning (ICL) technique, as its name suggests, involves learning
meaning within the context provided by the prompt. It enables the model to generate
relevant responses by understanding the context in the prompt. This is typically achieved by
providing examples, which guide the model in generating appropriate answers. Depending
on the number of examples provided, the technique is categorized into zero-shot learning
(no examples), one-shot learning (one example), and few-shot learning (multiple examples).
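
A small sketch of how the zero-shot, one-shot, and few-shot variants differ only in how many demonstrations are prepended to the prompt; the demonstrations themselves are illustrative placeholders, not the prompts used in the experiments.

```python
# In-Context Learning: the number of demonstrations in the prompt determines
# whether the setting is zero-shot, one-shot, or few-shot.
demonstrations = [
    ("Translate to French: cat", "chat"),
    ("Translate to French: dog", "chien"),
]

def build_icl_prompt(query: str, n_shots: int) -> str:
    shots = demonstrations[:n_shots]           # 0 = zero-shot, 1 = one-shot, >1 = few-shot
    lines = [f"{q}\n{a}" for q, a in shots]
    lines.append(query)
    return "\n\n".join(lines)

print(build_icl_prompt("Translate to French: bird", n_shots=2))
```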
Additionally, the Step-by-Step Reasoning (SSR) technique guides the model to arrive
at an answer through a series of intermediate steps when solving complex problems. This
approach can be particularly useful in tasks that require multi-step thinking, such as solving
math problems, logical reasoning, or complex decision-making processes [15].
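
Unlike the CoT example above, which supplies a worked demonstration, SSR can be realized by instructing the model to expose each intermediate step explicitly; a minimal illustrative prompt follows.

```python
# Step-by-Step Reasoning: the instruction itself asks for explicit numbered steps,
# without providing a worked demonstration.
ssr_prompt = (
    "Solve the following problem. List each intermediate step on its own line, "
    "numbered 1, 2, 3, ..., and state the final answer on the last line.\n\n"
    "Problem: A rectangle is 7 cm long and 4 cm wide. What is its perimeter?"
)
```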
Lastly, the Tree of Thought (ToT) technique involves branching out multiple possible
thought processes in a tree structure to explore different solutions to a complex problem.
This method is especially effective in situations that require multi-stage reasoning or where
multiple solutions are possible [17].
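
A compact sketch of the branch-and-evaluate loop implied by this description, implemented as a simple beam search; `propose_thoughts` and `score_thought` are hypothetical model-backed helpers, not functions defined in the paper.

```python
# Tree of Thought sketch: expand several candidate reasoning branches per step,
# score them, and keep only the most plausible ones (a simple beam search).

def propose_thoughts(problem: str, partial: list[str], k: int) -> list[str]:
    """Ask the model for k candidate next reasoning steps (placeholder)."""
    raise NotImplementedError

def score_thought(problem: str, partial: list[str]) -> float:
    """Ask the model to rate the plausibility of a partial reasoning path (placeholder)."""
    raise NotImplementedError

def tree_of_thought(problem: str, depth: int = 3, branch: int = 3, beam: int = 2) -> list[str]:
    paths: list[list[str]] = [[]]
    for _ in range(depth):
        candidates = [p + [t] for p in paths for t in propose_thoughts(problem, p, branch)]
        candidates.sort(key=lambda p: score_thought(problem, p), reverse=True)
        paths = candidates[:beam]              # keep only the most plausible branches
    return paths[0]                            # best reasoning path found
```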
This paper investigates the optimal Prompt Engineering strategies that maximize
user satisfaction, i.e., response accuracy, while minimizing resource requirements for real-
world LLM service deployment. The study focuses on applying Prompt Engineering
techniques without fine-tuning foundation models, enabling efficient service implemen-
tation. To achieve this, we examine the application strategies of widely used Prompt
Engineering methods, including RAG, CoT, ICL, SSR, and ToT. These techniques were
applied to leading LLM technologies, namely Llama3, Mistral, and Gemma2, to evaluate
their performance. For performance evaluation, we utilized diverse metrics, including
BLEU, ROUGE, METEOR, BLEURT, and BERTScore, to closely approximate user satisfac-
tion. Benchmark datasets such as the AI2 Reasoning Challenge (ARC), HellaSwag, Massive
Multitask Language Understanding (MMLU), TruthfulQA, Winogrande, and Grade School
Math (GSM8K) were employed to analyze the natural language generation performance
across a variety of topics. Based on these evaluations, we propose a PE (Prompt Engineer-
ing) strategy for optimizing performance across different datasets and LLMs.

2. Related Works
Prompt Engineering is a crucial technique for determining the quality of responses
generated by LLMs. It optimizes model outputs by adjusting the structure and content of
prompts during the inference stage, without requiring additional pre-training or fine-tuning
of the model. Initially, Prompt Engineering primarily involved simple question-and-answer
formats, but recent advancements have led to the development of more complex prompt
structures designed to solve intricate problems [18]. Early research in this field, spear-
headed by OpenAI with the development of the GPT-3 model, explored various Prompt
Engineering methods. Among these, techniques for designing ICL-based few-shot learn-
ing prompts emerged as a way to obtain desired answers from LLMs without additional
training [13]. The success of ICL has since led to the exploration of various derivatives,
such as PICL [19], MetaICL [20], KATE [21], SG-ICL [22], and GlobalE&LocalE [23]. While
early ICL research focused on optimizing ICL performance through prompt reconfiguration
alone, the PICL [19] approach proposed collecting related contextual data and reconstruct-
ing corpora to pre-train the model, thereby enabling it to learn how to infer through prompt
demonstrations in advance. In contrast, the MetaICL [20] approach introduced an addi-
tional phase between pre-training and ICL inference called continual training, also referred
to as model warming. This optional procedure for ICL involves adjusting the LLM by
modifying or adding to its parameters before inference. Many studies have also shown
that ICL performance heavily depends on the configuration of demonstrations, including
the selection, format, and order of the examples. KATE [21] investigated the selection of
demonstration examples, SG-ICL [22] focused on the format, and GlobalE&LocalE [23]
explored the order of demonstrations.
Research has also been conducted on techniques that encourage models to express
intermediate steps or logical reasoning alongside responses to solve complex problems. No-
table methods in this area include CoT, Self-Consistency, SSR, and ToT. The Self-Consistency
technique optimizes the model to produce consistent and reliable outputs across varied
inputs that convey the same meaning. The ToT approach involves branching out multi-
ple reasoning steps required to solve a problem into various thought trees. Each branch
represents a different line of reasoning or assumption, allowing the exploration of diverse


possibilities or hypotheses. The model then evaluates the relevance of these branches to the
query, considering real-time factors to derive the most plausible result [17].
Recently, Prompt Engineering techniques have been developed to address the issue of
hallucinations, a major challenge for LLMs where the model generates incorrect answers,
including non-existent information [24]. Hallucination issues often arise from the lack of
up-to-date knowledge in the training data or restricted access to specific cases or personal
information. One of the most actively researched techniques to mitigate hallucinations is
RAG [25]. RAG addresses this issue by extracting a query from the input prompt, using
that query to retrieve information from external knowledge sources (e.g., search engines
or knowledge graphs), and then augmenting the original prompt with this information or
extracting relevant details provided in the prompt. The augmented prompt is then fed back
into the LLM to generate the final response. RAG systems effectively resolve hallucinations
through this process of retrieval, augmentation, and generation. The success of RAG has led
to the development of related variants, including the LLM-Augmenter [26], which performs
information retrieval before AI text generation; Knowledge Retrieval [27], which retrieves
information during generation; RARR [28], which performs retrieval after generation; and
the Chain-of-Verification (CoVe) technique [29], which incorporates self-refinement through
feedback and logical reasoning.
The Prompt Engineering techniques described above are summarized in Table 1.

Table 1. Summary of Prompt Engineering techniques.

ICL [13,19–23]: Focuses on few-shot learning by providing demonstrations within prompts. Includes derivatives such as PICL (contextual data collection and reconstruction), MetaICL (continual training phase), KATE (demonstration example selection), SG-ICL (format of demonstrations), and GlobalE&LocalE (demonstration order).
CoT [30,31]: Encourages intermediate reasoning steps for solving complex problems. Demonstrated to improve logical reasoning capabilities in LLMs.
Self-Consistency [31]: Optimizes the model to produce consistent and reliable outputs across varied inputs conveying the same meaning.
SSR [31]: Promotes systematic problem-solving by explicitly breaking down the reasoning process into individual steps.
ToT [17,31]: Represents reasoning as a tree structure with branches exploring diverse hypotheses or possibilities, aiding in logical exploration and plausibility evaluation.
RAG [24–29]: Addresses hallucination issues by retrieving external information (e.g., from knowledge graphs or search engines) and augmenting prompts with this information. Includes variants such as the LLM-Augmenter (retrieval before generation), Knowledge Retrieval (retrieval during generation), RARR (retrieval after generation), and CoVe (self-refinement via feedback).

Lastly, there have been studies addressing issues related to model alignment, demon-
stration formatting, and the theoretical underpinnings of prompting using the aforemen-
tioned Prompt Engineering techniques [30–32]. The study by [30] proposed efficient
prompt construction methods such as Instruction+Question, Instruction+Input, and Ques-
tion+Answer. It also identified the limitations of LLMs and suggested strategies to over-
come these limitations using CoT-based Prompt Engineering. Similarly, ref. [31] explored
effective prompt construction techniques and discussed optimization strategies for LLMs
through approaches such as CoT, Self-Consistency, General Knowledge, Least-to-Most
Prompting, ToT, GoT, Decomposed Prompting, and Active Prompting. The study by [32]
provided a SWOT analysis of LLM technologies and presented the mathematical theoretical
underpinnings of various Prompt Engineering techniques. However, none of these studies
have focused on deriving the optimal Prompt Engineering techniques based on perfor-
mance analysis using diverse metrics tailored to the characteristics of LLMs and datasets.
The main contributions of this paper are as follows:
• A technical analysis of various LLMs, Prompt Engineering techniques for performance
enhancement, related datasets, and evaluation metrics.
• A comparative performance analysis of key Prompt Engineering techniques across
different LLMs and datasets, using various evaluation metrics.
• Derivation of performance optimization strategies based on the comparative analysis,
tailored to specific application domains.
Section 3 describes the LLMs utilized in this study (Section 3.1) and the performance metrics used for evaluation (Section 3.2). Section 4 provides an overview of the benchmark datasets employed. In Section 5, we present the experimental results based on these parameters and propose a PE strategy for optimizing LLM performance. Finally, Section 6 offers the conclusion.

3. System Models and Evaluation Metrics


3.1. System Models
This paper conducts experiments on LLMs, specifically focusing on Gemma2, Llama3, and Mistral, three of the highest-performing, most widely used, and most recently released models.

3.1.1. Gemma2
Gemma2, released by Google on 27 June 2024, is the latest open large language model,
available in two sizes, i.e., 9 billion and 27 billion parameters, based on the Hugging Face
platform. Each model includes pre-training, fine-tuning, and alignment phases. As a
lightweight version of the larger Gemini model, Gemma2 offers significant enhancements.
The key features of the Gemma2 model are summarized in Table 2.

Table 2. Summary of the model architecture, pre-training, and post-training techniques of major LLMs.

Model Architecture
Gemma2 (9B/27B): Decoder-only Transformer; Rotary Position Embeddings (RoPE); approximated GeGLU non-linearity; alternating Local Sliding Window and Global Attention; Logit Soft-Capping; Grouped-Query Attention (GQA)
Llama3 (8B/70B): Decoder-only Transformer; Rotary Position Embeddings (RoPE); SwiGLU; Grouped-Query Attention (GQA)
Mistral (7B): Decoder-only Transformer; Grouped-Query Attention; Sliding Window Attention; Rolling Buffer Cache; Pre-fill and Chunking

Pre-Training
Gemma2: Curation and filtering of training corpus; Vocab Size: 256 K; Knowledge Distillation
Llama3: Curation and filtering of training corpus; Vocab Size: 128 K
Mistral: Curation and filtering of training corpus; Vocab Size: 32 K

Post-Training
Gemma2: Supervised Fine-Tuning (SFT); RL with Human Feedback (RLHF); Model Merging
Llama3: Supervised Fine-Tuning (SFT); RL with Human Feedback (RLHF); Rejection Sampling (RS); Direct Preference Optimization (DPO)
Mistral: Supervised Fine-Tuning (SFT); System Prompt to enforce guardrails

The model architecture is based on a Transformer with a Decoder-Only structure, utilizing Grouped-Query Attention (GQA) with 8 key-value heads and 16 attention heads. Rotary
Position Embeddings (RoPE) [33] and approximated GeGLU non-linearity [34] are em-
ployed. Additionally, Local Sliding Windows and Global Attention are alternated in every
layer, and logits in each attention layer and the final layer are constrained to remain within
a soft-cap range.
The training methodology is divided into the following two stages: Pre-Training
and Post-Training. In the Pre-Training stage, data filtering is first applied to address
privacy and safety issues, and the filtered data is then converted into discrete tokens from
a large multilingual text corpus. Instead of next-token prediction, the model is trained
using Knowledge Distillation based on the probability of each token as provided by the
teacher model.
During the Post-Training stage, the pre-trained models are fine-tuned into Super-
vised Fine-Tuned models. Subsequently, Reinforcement Learning with Human Feedback
(RLHF) is applied to these models, where the reward model is trained on labeled English
preference data and the policy is based on the same prompts used in the supervised fine-
tuning (SFT) phase. Finally, the models obtained after each phase are averaged to enhance
overall performance.

3.1.2. Llama3
Llama3 is the latest large language model developed by Meta, released on 18 April 2024.
Building on the foundation of its predecessor, Llama2, this model is available in two sizes, i.e., 8 billion and 70 billion parameters [9]. The key features of the Llama3 model are
summarized in Table 2.
The model architecture is based on a Transformer with a Decoder-Only structure,
utilizing GQA with 8 key-value heads. During the Pre-Training stage, a large multilingual
text corpus is first converted into discrete tokens, and then an LLM is pre-trained on the
resulting data to perform next-token prediction.
While the pre-trained language model has a rich understanding of language, it does
not yet fully follow instructions or generate clear responses. In the Post-Training stage,
the model is aligned with human feedback over multiple rounds, each involving SFT on
instruction-tuning data and Direct Preference Optimization (DPO). Additionally, special-
ized training is introduced for specific domains such as coding and reasoning.

3.1.3. Mistral
Mistral is a high-performance language model developed by the French AI startup
Mistral AI. The Mistral model is a language model with 7 billion parameters, leveraging
cutting-edge technologies such as GQA and Sliding Window Attention (SWA) to enhance
efficiency [11,12]. The key technical features are summarized in Table 2.
The model architecture is based on a Transformer with a Decoder-Only structure,
incorporating GQA with 8 key-value heads and 32 attention heads, as well as an SWA mechanism
with a size of 4096 to accelerate inference speed. Additionally, to optimize cache usage,
techniques such as Rolling Buffer Cache, Pre-fill, and Chunking are employed.
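
For intuition, the sliding-window constraint can be viewed as a band-limited causal attention mask; the sketch below is an illustrative construction, not Mistral's actual implementation, using the window size of 4096 reported above.

```python
import numpy as np

def sliding_window_causal_mask(seq_len: int, window: int = 4096) -> np.ndarray:
    """True where query position i may attend to key position j:
    j <= i (causal) and i - j < window (sliding window)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

# Small example with a window of 3 tokens
print(sliding_window_causal_mask(6, window=3).astype(int))
```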
During the Pre-Training stage, a large multilingual text corpus is first converted
into discrete tokens, and then the resulting data is used to pre-train an LLM for next-
token prediction.
In the Post-Training stage, Instruction Fine-tuning is conducted through SFT using
Hugging Face’s Instruction dataset. Finally, the model performs fine-grained content
moderation by leveraging system prompts to optionally enforce constraints on its output.

3.2. Performance Metric


This paper selects and utilizes some of the most common and widely used metrics
for performance evaluation, including BLEU (Bilingual Evaluation Understudy), ROUGE
(Recall-Oriented Understudy for Gisting Evaluation), METEOR (Metric for Evaluation
of Translation with Explicit Ordering), BLEURT (Bilingual Evaluation Understudy with
Representations from Transformers), and BERT (Bidirectional Encoder Representations


from Transformers) Score, in order to conduct a comprehensive analysis from various
performance evaluation perspectives.

3.2.1. BLEU
The BLEU score is a metric developed to evaluate the quality of machine translation
by measuring the n-gram overlap between the translated sentence and reference sentences.
The BLEU score is calculated using the following formula:
$$\text{BLEU} = \text{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right), \quad (1)$$

where $p_n$ represents the n-gram precision, $w_n$ denotes the weight, and BP refers to the
brevity penalty. BLEU is primarily used to assess the fluency and accuracy of translated
text, where a higher score indicates a closer match between the translated text and the
reference text [35].
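
The following is a direct transcription of Equation (1), assuming the n-gram precisions p_n, the weights w_n, and the brevity penalty have already been computed.

```python
import math

def bleu(precisions: list[float], weights: list[float], bp: float) -> float:
    """BLEU = BP * exp(sum_n w_n * log p_n), per Equation (1).
    `precisions` are the n-gram precisions p_n, `weights` the w_n, `bp` the brevity penalty."""
    if any(p == 0 for p in precisions):
        return 0.0  # log(0) is undefined; a zero precision yields a zero score
    return bp * math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))

# Example: uniform weights over 4-gram precisions
print(bleu([0.7, 0.5, 0.3, 0.2], [0.25] * 4, bp=1.0))
```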

3.2.2. ROUGE
The ROUGE score is primarily used to measure the similarity between generated text
and reference text in summarization tasks. There are several variants of ROUGE, including
ROUGE-N, ROUGE-L, and ROUGE-W. ROUGE-N is based on n-gram overlap, while
ROUGE-L relies on the Longest Common Subsequence (LCS). ROUGE-W is a weighted
version of ROUGE-L, where greater weights are assigned to longer common subsequences.
ROUGE is useful for evaluating the coverage and consistency of generated summaries and
is frequently employed in multi-document summarization evaluations [36]. In this study,
for the short-answer datasets, we based our performance measurement on ROUGE-N with
N = 1 (unigrams). The formula for ROUGE-N is as follows:

$$\text{ROUGE-N} = \frac{\sum_{s \in \{\mathrm{RS}\}} \sum_{\mathrm{gram}_n \in s} \mathrm{Count}_m(\mathrm{gram}_n)}{\sum_{s \in \{\mathrm{RS}\}} \sum_{\mathrm{gram}_n \in s} \mathrm{Count}(\mathrm{gram}_n)} \quad (2)$$

where $\mathrm{gram}_n$ denotes an n-gram, RS is the set of Reference Summaries, $\mathrm{Count}_m(\mathrm{gram}_n)$ is the number of n-grams in the generated text that match the n-grams in the reference text, and $\mathrm{Count}(\mathrm{gram}_n)$ is the total number of n-grams in the reference text.
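
A small sketch of the unigram case (ROUGE-1) used in this study, following Equation (2) with whitespace tokenization as a simplifying assumption.

```python
from collections import Counter

def rouge_1(candidate: str, reference: str) -> float:
    """ROUGE-1: matched unigrams divided by the total unigrams in the reference (Equation (2))."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    matched = sum(min(cand[g], ref[g]) for g in ref)  # clipped unigram matches
    return matched / max(sum(ref.values()), 1)

print(rouge_1("the cat sat on the mat", "the cat is on the mat"))  # 5/6
```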

3.2.3. METEOR
The METEOR metric, unlike BLEU, accounts for factors beyond simple n-gram match-
ing, including synonymy, stemming, and word order alignment. METEOR employs a
harmonic mean of unigram precision and recall, placing a higher weight on recall in its
evaluation. By incorporating considerations such as synonyms and morphological analysis,
METEOR enhances the flexibility of word matching and is known to correlate more closely
with human judgments compared to BLEU. This allows for a more nuanced assessment of
translation quality [37].

$$\text{METEOR} = F_{\text{mean}} \times (1 - P_{\text{penalty}}), \quad (3)$$

$$F_{\text{mean}} = \frac{10 \cdot P \cdot R}{R + 9 \cdot P},$$

$$P_{\text{penalty}} = 0.5 \cdot \left( \frac{\text{number of chunks}}{\text{number of aligned words}} \right)^{3}$$
where P is the Precision, R is the Recall, and $P_{\text{penalty}}$ is the fragmentation penalty for penalizing word order errors. These metrics are calculated using the following equations:

$$P = \frac{\text{Number of aligned words}}{\text{Number of words in the candidate translation}}, \quad (4)$$

$$R = \frac{\text{Number of aligned words}}{\text{Number of words in the reference translation}}. \quad (5)$$
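
Putting Equations (3)-(5) together, assuming the word alignment (the number of aligned words and chunks) has already been computed.

```python
def meteor(aligned: int, cand_len: int, ref_len: int, chunks: int) -> float:
    """METEOR score from Equations (3)-(5), given a precomputed word alignment."""
    if aligned == 0:
        return 0.0
    p = aligned / cand_len                      # Equation (4)
    r = aligned / ref_len                       # Equation (5)
    f_mean = 10 * p * r / (r + 9 * p)           # recall-weighted harmonic mean
    penalty = 0.5 * (chunks / aligned) ** 3     # fragmentation penalty
    return f_mean * (1 - penalty)               # Equation (3)

print(meteor(aligned=6, cand_len=7, ref_len=8, chunks=2))
```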

3.2.4. BLEURT
Unlike traditional metrics such as BLEU, which rely on n-gram overlap, BLEURT is a
metric designed to account for semantic similarity, contextual understanding, and fluency.
This results in an evaluation metric that more closely aligns with human assessments [38].
Specifically, BLEURT leverages BERT (Bidirectional Encoder Representations from Trans-
formers) embeddings and is fine-tuned for specific text evaluation tasks. The model is
trained to predict a score that reflects how well the generated text aligns with the reference
text, based on the following equation:

$$\text{BLEURT}(c, r) = f(\text{BERT}(c), \text{BERT}(r), \text{BERT}(ctxt)),$$

where c is the generated text, r is the reference text, and ctxt is the context which may
include additional information or surrounding text that can influence the score. BERT( x )
represents the BERT embedding of the text x and f (·) is a function that combines these
embeddings and outputs a score reflecting the similarity between c and r.

3.2.5. BERT Score


BERTScore is an evaluation metric designed to measure semantic similarity between
texts by utilizing the BERT model to generate embeddings for each word, which are then
used to compute the semantic similarity between the reference and generated sentences [39].
BERTScore is based on the cosine similarity between the embedding vectors of each word,
forming word pairs between the reference and generated sentences to calculate the similar-
ity. Mathematically, it can be expressed as follows:

$$P = \frac{1}{|X|} \sum_{x \in X} \max_{y \in Y} \text{cosine}(x, y),$$

$$R = \frac{1}{|Y|} \sum_{y \in Y} \max_{x \in X} \text{cosine}(x, y),$$

$$\text{BERTScore} = 2 \times \frac{P \cdot R}{P + R}, \quad (6)$$

where P is the precision, R is the recall, X is the set of token embeddings of the generated text x, Y is the set of token embeddings of the reference text y, and cosine(x, y) represents the cosine similarity between the embeddings of individual tokens from the generated and reference texts.
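
A greedy-matching sketch of Equation (6) over precomputed token embeddings; here the embeddings are plain NumPy arrays, and any encoder (e.g., BERT) is assumed to have produced them.

```python
import numpy as np

def bertscore(cand_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """F1 over greedy cosine matching of token embeddings, per Equation (6).
    cand_emb: (n_cand, d) embeddings of the generated text's tokens.
    ref_emb:  (n_ref, d) embeddings of the reference text's tokens."""
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T                 # pairwise cosine similarities
    p = sim.max(axis=1).mean()         # each candidate token matched to its best reference token
    r = sim.max(axis=0).mean()         # each reference token matched to its best candidate token
    return 2 * p * r / (p + r)

rng = np.random.default_rng(0)
print(bertscore(rng.normal(size=(5, 8)), rng.normal(size=(6, 8))))
```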

4. Dataset
In this study, we selected benchmark datasets from three categories to comprehensively
evaluate various language capabilities. For the general natural language understanding
category, we chose the MMLU and TruthfulQA datasets. For the reasoning ability category,
we selected the ARC, HellaSwag and Winogrande datasets. For the mathematics problem-
solving category, we used the GSM8K dataset. These categories were carefully chosen to
ensure a balanced assessment across general natural language understanding, reasoning
ability, and mathematics problem-solving domains.

4.1. ARC
The ARC dataset consists of 11,119 training, 2992 validation, and 4541 test examples,
and is composed of multiple-choice questions with four options each, covering science
topics typically taught at the elementary and middle school levels in the United States, such
as biology, physics, chemistry, and earth science. The dataset is divided into two categories
as follows: ARC-Easy, which includes questions that require basic information retrieval,
and ARC-Challenge, which contains more difficult questions that require complex reason-
ing and inference. This dataset was designed to evaluate the extent to which AI models can
perform complex reasoning and understanding beyond simple information retrieval [40].

4.2. HellaSwag
The HellaSwag dataset consists of 39,905 training, 5000 validation, and 10,042 test
examples. It is designed to evaluate AI models’ natural language understanding by testing
their ability in commonsense reasoning and contextual understanding through sentence
completion and multiple-choice questions. This dataset includes contexts from various
domains, such as Wikipedia and instructional videos, and requires models to predict the
most appropriate sentence continuation among the provided choices for each context [41].

4.3. MMLU
The MMLU dataset consists of 57,243 training, 10,000 validation, and 15,000 test exam-
ples. Developed as part of Facebook AI’s efforts to advance human language understanding
and generation capabilities, the dataset includes multiple-choice questions across a wide
range of subjects, including science, humanities, and professional fields. It is designed to
assess how broadly and deeply AI models can understand and apply knowledge across
various domains [42].

4.4. TruthfulQA
The TruthfulQA dataset consists of 8170 training, 1024 validation, and 2035 test ex-
amples. It is designed to evaluate the ability of language models to generate factual and
truthful responses. The dataset includes questions that often use misleading or ambiguous
phrasing, or that require the model to acknowledge its lack of knowledge rather than fabri-
cating an answer. This dataset is used to assess the limitations of a model’s truthfulness
and its ability to handle subtlety in information retrieval and response generation [43].

4.5. Winogrande
The Winogrande dataset consists of 40,398 training, 1267 validation, and 3545 test
examples and is designed to train and test the commonsense reasoning abilities within
AI systems. Inspired by the Winograd Schema Challenge, it was developed with a larger
scale and a more diverse set of items to more thoroughly evaluate model performance.
The dataset comprises sentences with blanks that must be filled by selecting the correct word
from two given options, testing the model’s ability to perform contextual and commonsense
reasoning [44].

4.6. GSM8K
The GSM8k dataset consists of 7432 training, 800 validation, and 1319 test examples,
and is designed to test the mathematical reasoning abilities of AI systems. This dataset
is a collection of elementary-level math problems covering fundamental concepts such as
arithmetic, algebra, geometry, and word problems. It is used to evaluate AI models not
only on their ability to compute answers but also on their performance in understanding
the text context of problems, which often involves natural language comprehension and
multi-step reasoning [45].

5. Simulation Result
Table 3 presents the experimental setup used in this study. The experiments were
conducted using Llama3-8B, Gemma2-9B, and Mistral-7B as the LLMs, with a focus on
analyzing the performance of ICL(1 Shot), CoT, RAG(1 Shot), ToT, SSR, and their combined
configurations. Sections 5.1–5.6 detail the performance evaluation and comparative analysis
across six datasets, based on the experimental environment outlined in Table 3. The
performance results were generated based on the prompt formats created according to the
respective Prompt Engineering techniques, with parameters set as Temperature = 0.1 and
Top-P = 0.9. Additionally, the applied Prompt Engineering techniques are denoted by the
following abbreviations and may be used interchangeably:

B: Base, I: ICL, C: CoT, S: SSR, T: ToT, R(B): Basic RAG, R(C): CoT-based RAG

For performance analysis, we conducted prompt fine-tuning on the pre-trained models (Llama3 8B, Mistral 7B, and Gemma2 9B), utilizing the computing resources
of the NVIDIA H100 Tensor Core GPU (Santa Clara, CA, USA). Finally, in Section 5.7,
we analyzed the Prompt Engineering techniques that yielded the best performance for
each dataset across the different LLMs, leading to the derivation of Prompt Engineering
strategies aimed at optimizing performance.
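
The following is a hedged sketch of how a combined configuration such as "S, T, R(C)" could be assembled into a single prompt and decoded with the stated sampling parameters; the helper names and the `generate` call are placeholders and do not reproduce the authors' code.

```python
# Sketch: compose prompt-engineering components for a configuration such as "S, T, R(C)"
# and decode with Temperature = 0.1 and Top-P = 0.9.

def retrieve_context(question: str, cot_preprocess: bool = False) -> str:
    """Return retrieved passages, optionally with CoT-style preprocessing (placeholder)."""
    raise NotImplementedError

def generate(prompt: str, temperature: float, top_p: float) -> str:
    """Call the underlying LLM with the given sampling parameters (placeholder)."""
    raise NotImplementedError

def build_prompt(question: str, methods: list[str]) -> str:
    parts = []
    if "R(C)" in methods:
        parts.append(retrieve_context(question, cot_preprocess=True))
    elif "R(B)" in methods:
        parts.append(retrieve_context(question))
    if "S" in methods:
        parts.append("List each intermediate reasoning step explicitly before answering.")
    if "T" in methods:
        parts.append("Consider several alternative lines of reasoning and choose the most plausible one.")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

# answer = generate(build_prompt("...", ["S", "T", "R(C)"]), temperature=0.1, top_p=0.9)
```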

Table 3. Experimental setup.

Model: Llama3 8B, Gemma2 9B, Mistral 7B
Prompt Engineering: B (Base), I (ICL), C (CoT), R (RAG), T (ToT), S (SSR)
Metric: BLEU, ROUGE, METEOR, BLEURT, BERTScore
Dataset: ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K

5.1. ARC
Tables 4 and 5 present the performance of Llama3, Gemma2, Mistral, and various
Prompt Engineering combinations on the ARC dataset, measured using the BLEU, ROUGE,
METEOR, BLEURT, and BERTScore metrics. The bolded values in the tables indicate
the highest performance achieved for each evaluation metric within each LLM model.
Furthermore, the bold formatting convention is consistently applied to the performance
result tables of other datasets as well. As shown in these tables, the ARC dataset, which
consists of multiple-choice questions and answers on scientific topics, achieves the best
performance with combinations of SSR techniques and CoT-based methods such as R(B),
R(C), C, and T.
Moreover, it is evident that the optimal combination of Prompt Engineering techniques
varies depending on the LLM and the evaluation metric. For Llama3, using CoT alone
generally yields the best results. In contrast, for Gemma2, the combination of SSR, CoT,
and ICL demonstrates superior performance, while for Mistral, the combination of SSR,
ToT, CoT, and ICL achieves the best outcomes.
Fundamentally, as observed in Tables 4 and 5, Gemma2 consistently outperforms
other models across all five evaluation metrics when no Prompt Engineering techniques
are applied. Even when considering all possible Prompt Engineering techniques, Gemma2
remains the top-performing LLM, with a significant performance gap compared to the
second-best model, Llama3. Additionally, Llama3 consistently outperforms Mistral across
all five evaluation metrics.

Table 4. Performances of BLEU, ROUGE, and METEOR according to LLMs and Prompt Engineering
techniques on the ARC dataset.

Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral
BLEU ROUGE METEOR
B 2.2864 4.4464 1.4115 B 0.1955 0.2439 0.1426 B 0.1514 0.2007 0.1375
I 2.4251 5.5576 1.5623 I 0.1750 0.2575 0.0990 I 0.1237 0.2156 0.0971
C 0.0000 0.0000 1.8549 C 0.6553 0.8340 0.4298 C 0.3277 0.4170 0.2163
C, I 0.0000 0.0000 0.0000 C, I 0.3787 0.8298 0.2681 C, I 0.1872 0.4149 0.1340
R(B) 2.2759 4.8229 1.4879 R(B) 0.2073 0.2561 0.1009 R(B) 0.1501 0.2094 0.0929
R(C) 0.0000 0.0000 1.5619 R(C) 0.5830 0.8383 0.3438 R(C) 0.2915 0.4191 0.1700
T, B 2.0589 3.1321 1.2867 T, B 0.2002 0.2345 0.1499 T, B 0.1499 0.1843 0.1552
T, I 2.1093 4.3401 1.0184 T, I 0.1978 0.2487 0.1432 T, I 0.1433 0.1997 0.1368
T, C 0.0000 0.0000 3.4968 T, C 0.6511 0.8383 0.4199 T, C 0.3255 0.4191 0.2104
T, C, I 0.0000 0.0000 2.3658 T, C, I 0.5362 0.8298 0.2298 T, C, I 0.2681 0.4149 0.1149
T, R(B) 2.0709 4.7891 1.5037 T, R(B) 0.2023 0.2434 0.1407 T, R(B) 0.1462 0.2000 0.1322
T, R(C) 0.0000 0.0000 3.9588 T, R(C) 0.5532 0.8340 0.3574 T, R(C) 0.2766 0.4170 0.1785
S, B 2.1450 3.0319 0.6572 S, B 0.1927 0.2346 0.1290 S, B 0.1499 0.1843 0.1313
S, I 1.2337 4.5014 0.4917 S, I 0.1399 0.2444 0.0661 S, I 0.0866 0.1962 0.0716
S, C 0.0000 21.4646 1.2169 S, C 0.6170 0.8255 0.3887 S, C 0.3085 0.4128 0.1954
S, C, I 0.0000 0.0000 1.6279 S, C, I 0.5702 0.8340 0.2979 S, C, I 0.2851 0.4170 0.1489
S, R(B) 1.2236 4.2853 1.4485 S, R(B) 0.1714 0.2500 0.1352 S, R(B) 0.1197 0.2007 0.1247
S, R(C) 19.5521 21.3812 0.8924 S, R(C) 0.5660 0.8128 0.3234 S, R(C) 0.2830 0.4064 0.1591
S, T, B 2.3439 2.7917 0.9111 S, T, B 0.1887 0.2355 0.1349 S, T, B 0.1433 0.1816 0.1352
S, T, I 0.7283 3.6081 0.1601 S, T, I 0.1157 0.2400 0.0818 S, T, I 0.0869 0.1841 0.0905
S, T, C 0.0000 0.0000 1.7047 S, T, C 0.6426 0.8340 0.2468 S, T, C 0.3213 0.4170 0.1234
S, T, C, I 0.0000 0.0000 16.4416 S, T, C, I 0.5489 0.8170 0.4340 S, T, C, I 0.2745 0.4085 0.2170
S, T, R(B) 1.0221 4.4266 0.6973 S, T, R(B) 0.1764 0.2460 0.1473 S, T, R(B) 0.1285 0.1976 0.1365
S, T, R(C) 0.0000 0.0000 1.5277 S, T, R(C) 0.5447 0.8255 0.3263 S, T, R(C) 0.2723 0.4128 0.1612

Table 5. Performances of BLEURT and BERTScore according to LLMs and Prompt Engineering
techniques on the ARC dataset.

Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral


BLEURT BERTScore
B 0.3167 0.3587 0.2857 B 0.5969 0.6277 0.5364
I 0.2760 0.3725 0.2188 I 0.5756 0.6327 0.4701
C 0.4662 0.5336 0.3678 C 0.9690 0.9811 0.9019
C, I 0.3450 0.5343 0.3050 C, I 0.9363 0.9844 0.9193
R(B) 0.3107 0.3663 0.2238 R(B) 0.6038 0.6455 0.4786
R(C) 0.4427 0.5378 0.3310 R(C) 0.9651 0.9848 0.9047
T, B 0.3172 0.3454 0.3002 T, B 0.5971 0.6276 0.5282
T, I 0.3221 0.3668 0.2825 T, I 0.5955 0.6327 0.4846
T, C 0.4654 0.5367 0.3706 T, C 0.9704 0.9822 0.9296
T, C, I 0.4132 0.5366 0.2850 T, C, I 0.9600 0.9858 0.9146
T, R(B) 0.3050 0.3543 0.2750 T, R(B) 0.6052 0.6350 0.5223
T, R(C) 0.4269 0.5365 0.3435 T, R(C) 0.9624 0.9846 0.9213
S, B 0.3133 0.3507 0.2780 S, B 0.5842 0.6189 0.4968
S, I 0.2380 0.3603 0.1944 S, I 0.5481 0.6172 0.3805
S, C 0.4526 0.5303 0.3499 S, C 0.9666 0.9788 0.8947
S, C, I 0.4257 0.5379 0.3285 S, C, I 0.9607 0.9859 0.9060
S, R(B) 0.2770 0.3581 0.2639 S, R(B) 0.5819 0.6385 0.4966
S, R(C) 0.4332 0.5292 0.3335 S, R(C) 0.9613 0.9814 0.8764
S, T, B 0.3144 0.3444 0.2792 S, T, B 0.5883 0.6244 0.4907
S, T, I 0.2110 0.3537 0.2250 S, T, I 0.4742 0.6204 0.3728
S, T, C 0.4615 0.5341 0.2906 S, T, C 0.9699 0.9812 0.9121
S, T, C, I 0.4284 0.5306 0.3977 S, T, C, I 0.9626 0.9845 0.9413
S, T, R(B) 0.2873 0.3524 0.2821 S, T, R(B) 0.5854 0.6347 0.4965
S, T, R(C) 0.4280 0.5334 0.3438 S, T, R(C) 0.9618 0.9836 0.9084

5.2. GSM8K
Tables 6 and 7 present the performance of Llama3, Gemma2, Mistral, and various
Prompt Engineering combinations on the GSM8K dataset, measured using the BLEU,
ROUGE, METEOR, BLEURT, and BERTScore metrics.
As shown in these tables, the GSM8K dataset, which involves elementary-level math
problems, described in natural language and requiring multi-step reasoning, achieves the
best performance with combinations of Step-by-Step Reasoning techniques, CoT-based
methods such as R(B) and R(C), CoT techniques, ToT, and ICL.
The optimal combinations of these Prompt Engineering techniques vary depending on
the LLM and the evaluation metric. For Llama3, although the optimal Prompt Engineering
technique varies across different evaluation metrics, combinations of SSR and CoT-based
RAG generally yield the best performance. For Gemma2, combinations of CoT and ICL
tend to provide the best results in most cases, while for Mistral, combinations of ToT, CoT,
and ICL consistently deliver the best performance.
Fundamentally, as seen in Tables 6 and 7, when no Prompt Engineering techniques are
applied, Llama3 outperforms others in three out of five evaluation metrics, while Gemma2
performs best in two metrics. Even when considering all possible Prompt Engineering
techniques, Llama3 leads in three evaluation metrics, while Gemma2 leads in two. This
indicates that Llama3 generally delivers the best performance on the GSM8k dataset.
Finally, Mistral ranks third, showing a significant performance gap compared to Llama3
and Gemma2.

Table 6. Performances of BLEU, ROUGE, and METEOR according to LLMs and Prompt Engineering
techniques on the GSM8k dataset.

Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral
BLEU ROUGE METEOR
B 10.1785 8.2318 6.2994 B 0.4283 0.4434 0.3382 B 0.2512 0.2379 0.1986
I 11.3725 10.3809 6.3203 I 0.4505 0.4471 0.2904 I 0.2680 0.2557 0.1799
C 11.2824 9.7220 6.6639 C 0.4401 0.4493 0.3157 C 0.2653 0.2527 0.1924
C, I 11.1692 11.6382 7.8677 C, I 0.4455 0.4818 0.3349 C, I 0.2637 0.2784 0.2060
R(B) 10.4893 10.0561 7.2215 R(B) 0.4304 0.4513 0.3164 R(B) 0.2577 0.2514 0.1966
R(C) 12.0336 10.5243 7.4787 R(C) 0.4544 0.4537 0.3192 R(C) 0.2730 0.2576 0.1979
T, B 11.3938 8.8232 6.8033 T, B 0.4234 0.4550 0.3214 T, B 0.2654 0.2451 0.1948
T, I 12.1602 10.1051 6.1928 T, I 0.4411 0.4406 0.3072 T, I 0.2773 0.2517 0.1956
T, C 12.4078 10.2231 6.9790 T, C 0.4364 0.4609 0.3405 T, C 0.2755 0.2599 0.2091
T, C, I 12.4400 11.3522 9.7138 T, C, I 0.4511 0.4694 0.3492 T, C, I 0.2813 0.2738 0.2345
T, R(B) 12.2981 10.2974 7.5567 T, R(B) 0.4414 0.4540 0.3301 T, R(B) 0.2797 0.2595 0.2091
T, R(C) 13.6000 11.6573 9.4806 T, R(C) 0.4675 0.4718 0.3248 T, R(C) 0.2919 0.2738 0.2188
S, B 11.0489 7.3710 5.8964 S, B 0.4423 0.4427 0.3089 S, B 0.2683 0.2330 0.1859
S, I 11.2372 8.8757 4.0492 S, I 0.4309 0.4582 0.2516 S, I 0.2608 0.2540 0.1559
S, C 11.2324 9.6305 6.2983 S, C 0.4464 0.4522 0.3260 S, C 0.2738 0.2504 0.1969
S, C, I 12.2068 8.9984 6.1892 S, C, I 0.4503 0.4550 0.2950 S, C, I 0.2764 0.2499 0.1848
S, R(B) 11.4328 9.1563 6.5322 S, R(B) 0.4517 0.4593 0.3064 S, R(B) 0.2751 0.2546 0.1903
S, R(C) 12.4954 9.9989 8.0168 S, R(C) 0.4717 0.4628 0.3188 S, R(C) 0.2873 0.2551 0.2056
S, T, B 11.5911 7.8427 7.1353 S, T, B 0.4364 0.4566 0.3299 S, T, B 0.2762 0.2431 0.2019
S, T, I 12.9408 9.5542 5.7495 S, T, I 0.4522 0.4376 0.2702 S, T, I 0.2865 0.2474 0.1757
S, T, C 12.0767 9.5349 7.5402 S, T, C 0.4388 0.4570 0.3295 S, T, C 0.2794 0.2550 0.2044
S, T, C, I 12.7459 10.8192 8.4431 S, T, C, I 0.4421 0.4668 0.3271 S, T, C, I 0.2833 0.2662 0.2102
S, T, R(B) 11.6348 10.0010 7.8086 S, T, R(B) 0.4408 0.4515 0.3239 S, T, R(B) 0.2790 0.2568 0.2126
S, T, R(C) 13.7994 10.8942 9.0781 S, T, R(C) 0.4567 0.4650 0.3427 S, T, R(C) 0.2895 0.2664 0.2214

Table 7. Performances of BLEURT and BERTScore according to LLMs and Prompt Engineering techniques
on the GSM8k dataset.

Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral


BLEURT BERTScore
B 0.4063 0.4008 0.3691 B 0.8119 0.8515 0.7033
I 0.4176 0.3988 0.3568 I 0.8662 0.8603 0.6001
C 0.4148 0.3994 0.3756 C 0.8374 0.8648 0.6778
C, I 0.4154 0.4154 0.3741 C, I 0.8418 0.8747 0.6793
R(B) 0.4138 0.3994 0.3712 R(B) 0.8192 0.8609 0.6610
R(C) 0.4198 0.4075 0.3710 R(C) 0.8228 0.8460 0.6278
T, B 0.4074 0.4022 0.3751 T, B 0.8307 0.8664 0.6785
T, I 0.4146 0.3962 0.3675 T, I 0.8454 0.8438 0.6682
T, C 0.4155 0.4049 0.3816 T, C 0.8436 0.8711 0.7051
T, C, I 0.4199 0.4035 0.3873 T, C, I 0.8594 0.8716 0.6974
T, R(B) 0.4149 0.3989 0.3771 T, R(B) 0.8613 0.8677 0.7077
T, R(C) 0.4245 0.4100 0.3778 T, R(C) 0.8577 0.8702 0.6618
S, B 0.4168 0.3977 0.3727 S, B 0.8487 0.8416 0.6495
S, I 0.4135 0.4073 0.3503 S, I 0.8429 0.8592 0.5127
S, C 0.4200 0.4005 0.3800 S, C 0.8582 0.8613 0.6733
S, C, I 0.4216 0.4052 0.3584 S, C, I 0.8538 0.8531 0.6410
S, R(B) 0.4195 0.4067 0.3689 S, R(B) 0.8597 0.8536 0.6728
S, R(C) 0.4251 0.4100 0.3765 S, R(C) 0.8572 0.8542 0.6610
S, T, B 0.4162 0.4052 0.3813 S, T, B 0.8540 0.8598 0.6745
S, T, I 0.4195 0.3948 0.3600 S, T, I 0.8611 0.8182 0.6019
S, T, C 0.4176 0.4051 0.3853 S, T, C 0.8429 0.8615 0.6689
S, T, C, I 0.4192 0.4053 0.3805 S, T, C, I 0.8514 0.8676 0.6701
S, T, R(B) 0.4176 0.3999 0.3781 S, T, R(B) 0.8634 0.8470 0.7072
S, T, R(C) 0.4240 0.4091 0.3812 S, T, R(C) 0.8597 0.8564 0.7068

5.3. HellaSwag
Tables 8 and 9 present the performance of Llama3, Gemma2, Mistral, and various
combinations of Prompt Engineering techniques on the HellaSwag dataset, as measured by
the BLEU, ROUGE, METEOR, BLEURT, and BERTScore metrics. As seen in these tables,
the HellaSwag dataset, which evaluates natural language understanding, common sense
reasoning, and situational comprehension, shows that combinations of CoT, ICL, ToT,
and CoT-based RAG techniques yield the best performance.
Moreover, the optimal combination of these Prompt Engineering techniques varies
depending on the LLM and the evaluation metric, and the best-performing Prompt Engineering approach changes across metrics for all three models (Llama3, Gemma2, and Mistral). Llama3 generally achieves the best performance with standalone CoT or
combinations of CoT with ICL, ToT, and SSR techniques. For Gemma2, the combination
of CoT-based RAG and ToT techniques yields the best results, while Mistral generally
performs best with combinations of CoT and ICL.
Fundamentally, as observed in Tables 8 and 9, when no Prompt Engineering techniques
are applied, Llama3 typically delivers the best performance, leading in four out of the five
evaluation metrics. Gemma2 only slightly outperforms Llama3 in the BLEURT metric.
However, when all possible Prompt Engineering techniques are considered, Gemma2
emerges as the best-performing LLM, leading in all five evaluation metrics with a substan-
tial performance gap over Llama3, the second-best model. Notably, although Gemma2’s
base PLM initially showed the lowest performance in BLEU, ROUGE, METEOR, and
BERTScore, its performance was significantly enhanced through Prompt Engineering strate-
gies, ultimately making it the top-performing LLM. This demonstrates that appropriate
combinations of Prompt Engineering techniques can significantly improve the performance
of even a baseline PLM.

Table 8. Performances of BLEU, ROUGE, and METEOR according to LLMs and Prompt Engineering
techniques on the HellaSwag dataset.

Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral
BLEU ROUGE METEOR
B 0.9938 0.4228 0.8254 B 0.1477 0.1335 0.1463 B 0.0951 0.0926 0.1034
I 0.1473 0.3997 0.2550 I 0.0855 0.1429 0.0942 I 0.0469 0.1047 0.0659
C 0.0193 0.0000 0.0531 C 0.3168 0.6521 0.2600 C 0.1592 0.3265 0.1312
C, I 1.8112 0.0000 0.0000 C, I 0.3604 0.7168 0.3206 C, I 0.1807 0.3586 0.1605
R(B) 0.8526 0.7209 0.6942 R(B) 0.1309 0.1488 0.1205 R(B) 0.0822 0.0999 0.0858
R(C) 1.0346 0.0000 2.0251 R(C) 0.3440 0.7158 0.2680 R(C) 0.1715 0.3584 0.1344
T, B 0.8721 0.2981 0.9997 T, B 0.1434 0.1290 0.1636 T, B 0.0999 0.0897 0.1222
T, I 1.0498 0.2970 0.7208 T, I 0.1399 0.1213 0.1335 T, I 0.1056 0.0854 0.0996
T, C 0.7567 2.1914 0.0888 T, C 0.4604 0.6989 0.3061 T, C 0.2305 0.3497 0.1535
T, C, I 0.0000 0.0000 0.0000 T, C, I 0.2633 0.7193 0.2937 T, C, I 0.1319 0.3599 0.1468
T, R(B) 0.7593 0.4324 1.1741 T, R(B) 0.1382 0.1373 0.1571 T, R(B) 0.0885 0.0917 0.1157
T, R(C) 0.0000 0.0000 0.7358 T, R(C) 0.3892 0.7382 0.2429 T, R(C) 0.1944 0.3693 0.1215
S, B 1.0413 0.4181 0.4664 S, B 0.1522 0.1345 0.1035 S, B 0.0982 0.0911 0.0705
S, I 0.0003 0.5170 0.2350 S, I 0.0204 0.1202 0.1059 S, I 0.0135 0.0842 0.0823
S, C 0.0054 1.5551 0.0240 S, C 0.1226 0.6381 0.3175 S, C 0.0641 0.3193 0.1614
S, C, I 0.0000 0.0000 0.5114 S, C, I 0.3116 0.6565 0.2992 S, C, I 0.1563 0.3283 0.1503
S, R(B) 0.7829 0.6365 1.3147 S, R(B) 0.1305 0.1395 0.1745 S, R(B) 0.0826 0.0952 0.1327
S, R(C) 0.5530 0.0000 0.9046 S, R(C) 0.3041 0.6775 0.3204 S, R(C) 0.1518 0.3387 0.1604
S, T, B 0.9205 0.3513 1.2806 S, T, B 0.1500 0.1355 0.1666 S, T, B 0.0979 0.0919 0.1271
S, T, I 1.1919 0.5674 1.1760 S, T, I 0.1496 0.1388 0.1510 S, T, I 0.1091 0.0971 0.1250
S, T, C 0.7736 0.9067 0.0124 S, T, C 0.4251 0.6585 0.2467 S, T, C 0.2125 0.3295 0.1252
S, T, C, I 0.0000 0.0000 0.0898 S, T, C, I 0.3718 0.7222 0.2683 S, T, C, I 0.1862 0.3614 0.1346
S, T, R(B) 0.9430 0.5889 1.3780 S, T, R(B) 0.1513 0.1473 0.1792 S, T, R(B) 0.0981 0.0993 0.1415
S, T, R(C) 0.0000 0.0000 0.3419 S, T, R(C) 0.3454 0.7133 0.2687 S, T, R(C) 0.1727 0.3569 0.1348

Table 9. Performances of BLEURT and BERTScore according to LLMs and Prompt Engineering techniques
on the HellaSwag dataset.

Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral


BLEURT BERTScore
B 0.2433 0.2569 0.2598 B 0.5233 0.4892 0.4951
I 0.1742 0.2611 0.2161 I 0.3274 0.4492 0.3137
C 0.3216 0.4229 0.3029 C 0.7242 0.9219 0.7911
C, I 0.3708 0.4416 0.3482 C, I 0.8585 0.9384 0.8561
R(B) 0.2179 0.2578 0.2374 R(B) 0.4680 0.5080 0.4189
R(C) 0.3652 0.4443 0.3199 R(C) 0.8543 0.9380 0.8371
T, B 0.2449 0.2563 0.2850 T, B 0.5147 0.4839 0.5134
T, I 0.2482 0.2484 0.2503 T, I 0.4696 0.4128 0.3598
T, C 0.3722 0.4367 0.3166 T, C 0.8769 0.9330 0.8221
T, C, I 0.3638 0.4448 0.3145 T, C, I 0.8362 0.9388 0.8431
T, R(B) 0.2292 0.2509 0.2683 T, R(B) 0.5022 0.4981 0.5058
T, R(C) 0.3677 0.4503 0.3107 T, R(C) 0.8658 0.9432 0.8315
S, B 0.2422 0.2523 0.2253 S, B 0.5266 0.4870 0.3884
S, I 0.1008 0.2359 0.2168 S, I 0.1194 0.3810 0.2467
S, C 0.2431 0.4197 0.3005 S, C 0.3967 0.9178 0.7469
S, C, I 0.3751 0.4304 0.3267 S, C, I 0.8555 0.9252 0.8368
S, R(B) 0.2174 0.2481 0.2864 S, R(B) 0.4558 0.4522 0.5303
S, R(C) 0.3462 0.4330 0.3355 S, R(C) 0.8405 0.9285 0.8492
S, T, B 0.2431 0.2518 0.2798 S, T, B 0.5213 0.4799 0.5203
S, T, I 0.2547 0.2667 0.2722 S, T, I 0.4757 0.5015 0.4606
S, T, C 0.3573 0.4260 0.2829 S, T, C 0.8662 0.9227 0.6472
S, T, C, I 0.3723 0.4478 0.3095 S, T, C, I 0.8623 0.9395 0.8084
S, T, R(B) 0.2412 0.2592 0.2952 S, T, R(B) 0.5208 0.5074 0.5417
S, T, R(C) 0.3519 0.4434 0.3182 S, T, R(C) 0.8552 0.9376 0.8349

5.4. MMLU
Tables 10 and 11 present the performance results of Llama3, Gemma2, and Mistral,
along with various combinations of Prompt Engineering techniques on the MMLU dataset,
evaluated using the BLEU, ROUGE, METEOR, BLEURT, and BERTScore metrics. As ob-
served in these tables, the MMLU dataset, which involves multiple-choice questions on a
wide range of topics related to language understanding and generation, generally shows
that techniques like ToT, CoT, ICL, and CoT-based RAG achieve strong performance.
Additionally, the optimal combination of these Prompt Engineering techniques varies
depending on the LLM and the evaluation metric. For Llama3, the best performance is
generally observed with standalone CoT or combinations of CoT with ICL. For Gemma2,
CoT-based RAG typically yields the highest performance, while for Mistral, the combination
of SSR and CoT tends to perform best.
Fundamentally, as seen in Tables 10 and 11, when no Prompt Engineering techniques
are applied, Gemma2 generally exhibits the best performance across all five evaluation met-
rics, consistently outperforming Llama3. Furthermore, when considering all possible com-
binations of Prompt Engineering techniques, Gemma2 emerges as the top-performing LLM,
leading in all five metrics and demonstrating a substantial performance gap over Llama3,
the second-best model. Llama3 also outperforms Mistral across all five evaluation metrics.

Table 10. Performances of BLEU, ROUGE, and METEOR according to LLMs and Prompt Engineering
techniques on the MMLU dataset.

Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral
BLEU ROUGE METEOR
B 0.6079 3.4827 1.7943 B 0.0881 0.2582 0.1415 B 0.0570 0.1767 0.1148
I 2.1802 3.7322 0.9140 I 0.2003 0.2602 0.0820 I 0.1302 0.1812 0.0650
C 3.5684 0.0000 0.5051 C 0.5526 0.6735 0.2995 C 0.2763 0.3366 0.1503
C, I 0.0000 0.0000 0.3492 C, I 0.5482 0.6614 0.2252 C, I 0.2743 0.3307 0.1126
R(B) 2.8672 4.7673 1.3631 R(B) 0.2259 0.2789 0.0993 R(B) 0.1400 0.1931 0.0769
R(C) 0.9513 1.7725 0.5213 R(C) 0.4991 0.6761 0.4030 R(C) 0.2498 0.3379 0.2011
T, B 0.8849 3.2586 1.7603 T, B 0.1063 0.2570 0.1682 T, B 0.0702 0.1703 0.1359
T, I 1.7126 3.4013 1.5259 T, I 0.1978 0.2566 0.1179 T, I 0.1274 0.1770 0.0909
T, C 3.9037 20.4567 0.7130 T, C 0.5504 0.6739 0.4081 T, C 0.2751 0.3366 0.2043
T, C, I 3.4659 4.6662 1.9065 T, C, I 0.5334 0.6646 0.4400 T, C, I 0.2668 0.3320 0.2203
T, R(B) 2.5767 4.4502 1.9976 T, R(B) 0.2142 0.2682 0.1368 T, R(B) 0.1351 0.1832 0.1095
T, R(C) 1.9788 1.3839 0.4798 T, R(C) 0.5053 0.6719 0.4225 T, R(C) 0.2528 0.3359 0.2113
S, B 1.5409 3.3855 1.6624 S, B 0.1356 0.2567 0.1549 S, B 0.0924 0.1742 0.1280
S, I 1.8313 2.3881 0.7418 S, I 0.1941 0.1977 0.0853 S, I 0.1246 0.1345 0.0761
S, C 1.5335 0.0000 0.3471 S, C 0.5294 0.6650 0.4576 S, C 0.2647 0.3323 0.2293
S, C, I 0.5118 7.1647 0.5612 S, C, I 0.4870 0.6493 0.3697 S, C, I 0.2437 0.3245 0.1846
S, R(B) 2.7748 4.5459 1.7730 S, R(B) 0.2121 0.2561 0.1302 S, R(B) 0.1324 0.1762 0.1108
S, R(C) 1.1176 1.0204 0.5201 S, R(C) 0.4870 0.6590 0.4336 S, R(C) 0.2437 0.3293 0.2166
S, T, B 1.5694 3.2601 1.7471 S, T, B 0.1456 0.2456 0.1528 S, T, B 0.0983 0.1642 0.1261
S, T, I 1.3087 3.1979 0.8279 S, T, I 0.1267 0.2429 0.0923 S, T, I 0.0823 0.1671 0.0858
S, T, C 2.4407 18.9637 0.4261 S, T, C 0.5339 0.6650 0.4523 S, T, C 0.2671 0.3321 0.2265
S, T, C, I 2.3339 3.2443 0.6960 S, T, C, I 0.5030 0.6457 0.4252 S, T, C, I 0.2517 0.3228 0.2128
S, T, R(B) 2.6319 4.2450 1.3921 S, T, R(B) 0.2057 0.2575 0.1279 S, T, R(B) 0.1302 0.1771 0.1122
S, T, R(C) 0.4610 1.1757 0.6834 S, T, R(C) 0.4523 0.6614 0.4392 S, T, R(C) 0.2264 0.3305 0.2195

Table 11. Performances of BLEURT and BERTScore according to LLMs and Prompt Engineering
techniques on the MMLU dataset.

Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral


BLEURT BERTScore
B 0.2067 0.3510 0.2781 B 0.4915 0.6241 0.5243
I 0.2930 0.3496 0.2138 I 0.5808 0.6186 0.4626
C 0.4375 0.4733 0.3218 C 0.9626 0.9751 0.9280
C, I 0.4390 0.4716 0.2875 C, I 0.9648 0.9742 0.9188
R(B) 0.3081 0.3646 0.2337 R(B) 0.6020 0.6360 0.4818
R(C) 0.4143 0.4771 0.3667 R(C) 0.9523 0.9731 0.9402
T, B 0.2190 0.3442 0.3046 T, B 0.5093 0.6225 0.5606
T, I 0.2859 0.3459 0.2356 T, I 0.5882 0.6213 0.5139
T, C 0.4368 0.4740 0.3734 T, C 0.9635 0.9751 0.9443
T, C, I 0.4266 0.4728 0.3933 T, C, I 0.9645 0.9743 0.9504
T, R(B) 0.2999 0.3524 0.2616 T, R(B) 0.5948 0.6289 0.5216
T, R(C) 0.4170 0.4767 0.3788 T, R(C) 0.9566 0.9715 0.9424
S, B 0.2514 0.3472 0.2817 S, B 0.5362 0.6211 0.5444
S, I 0.2741 0.2900 0.2129 S, I 0.5668 0.5266 0.3999
S, C 0.4285 0.4712 0.3934 S, C 0.9577 0.9743 0.9387
S, C, I 0.4154 0.4698 0.3694 S, C, I 0.9469 0.9725 0.9423
S, R(B) 0.2927 0.3448 0.2576 S, R(B) 0.5835 0.6121 0.4988
S, R(C) 0.4125 0.4736 0.3887 S, R(C) 0.9525 0.9691 0.9415
S, T, B 0.2531 0.3305 0.2798 S, T, B 0.5452 0.6086 0.5412
S, T, I 0.2037 0.3338 0.2277 S, T, I 0.5061 0.6013 0.4223
S, T, C 0.4309 0.4717 0.3896 S, T, C 0.9601 0.9740 0.9396
S, T, C, I 0.4241 0.4667 0.3852 S, T, C, I 0.9584 0.9725 0.9383
S, T, R(B) 0.2907 0.3459 0.2568 S, T, R(B) 0.5825 0.6203 0.4884
S, T, R(C) 0.3927 0.4746 0.3929 S, T, R(C) 0.9402 0.9701 0.9462

5.5. TruthfulQA
Tables 12 and 13 present the performance of Llama3, Gemma2, and Mistral on the
TruthfulQA dataset, evaluated using BLEU, ROUGE, METEOR, BLEURT, and BERTScore
metrics under different combinations of Prompt Engineering techniques. As observed in
these tables, the TruthfulQA dataset focuses on assessing the truthfulness and cognitive
biases of language models, where techniques like SSR, ToT, and CoT-based RAG generally
yield strong performance.
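For reference, the five metrics reported in these tables can be computed with standard open-source implementations. The snippet below is a minimal sketch assuming the Hugging Face evaluate package (and the metric-specific extras it pulls in); the prediction and reference strings are hypothetical placeholders, not the actual model outputs used in this study.

import evaluate

predictions = ["Brown sugar is not significantly healthier than white sugar."]
references = ["No, brown sugar is not healthier than white sugar."]

bleu = evaluate.load("bleu")            # corpus-level BLEU
rouge = evaluate.load("rouge")          # ROUGE-1/2/L
meteor = evaluate.load("meteor")
bertscore = evaluate.load("bertscore")  # requires the bert-score package

print(bleu.compute(predictions=predictions, references=[references]))
print(rouge.compute(predictions=predictions, references=references))
print(meteor.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions, references=references, lang="en"))

# BLEURT needs its optional dependency and downloads a checkpoint on first use:
# bleurt = evaluate.load("bleurt")
# print(bleurt.compute(predictions=predictions, references=references))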
Furthermore, the optimal combination of these Prompt Engineering techniques varies
depending on the LLM and the evaluation metric. For Llama3, the combination of SSR
and ToT yields the best performance. Similarly, Gemma2 also performs best with the
combination of SSR and ToT. In contrast, for Mistral, the combination of ToT and CoT-based
RAG tends to deliver the highest performance.
Fundamentally, as seen in Tables 12 and 13, Mistral generally exhibits the best perfor-
mance when no Prompt Engineering techniques are applied, leading in four out of five
evaluation metrics. This suggests that the foundational Mistral model has been pre-trained
particularly well for the TruthfulQA dataset. On the other hand, Llama3, which initially
exhibited the lowest performance across all five metrics, shows the most significant im-
provement when Prompt Engineering strategies are applied, ultimately outperforming all
other models across all five metrics.

Notably, despite Llama3’s foundational PLM showing the poorest performance ini-
tially, the application of PE strategies enhanced its capabilities, making it the top-performing
LLM. Mistral follows as the second-best performing LLM, while Gemma2 shows the least
favorable results. Particularly noteworthy is the improvement observed in Mistral when
applying techniques like CoT-based RAG, which significantly boosts its performance on
the TruthfulQA dataset compared to the baseline PLM model.

Table 12. Performances of BLEU, ROUGE, and METEOR according to LLMs and Prompt Engineering
techniques on the TruthfulQA dataset.

Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral
BLEU ROUGE METEOR
B 3.2280 6.4854 7.9794 B 0.1162 0.1965 0.1966 B 0.0883 0.1490 0.2030
I 3.1545 5.8078 10.5174 I 0.1369 0.1919 0.2111 I 0.1031 0.1410 0.2097
C 4.3404 7.5579 11.7685 C 0.1545 0.2069 0.2589 C 0.1183 0.1619 0.2403
C, I 7.6076 9.7970 13.1527 C, I 0.1745 0.2424 0.2512 C, I 0.1456 0.1903 0.2339
R(B) 13.6465 10.3414 5.4771 R(B) 0.2311 0.2637 0.1295 R(B) 0.1994 0.2078 0.1279
R(C) 18.3547 16.1612 16.5590 R(C) 0.2990 0.2987 0.2780 R(C) 0.2618 0.2523 0.2735
T, B 4.7692 4.3545 6.6566 T, B 0.1354 0.1740 0.2096 T, B 0.1040 0.1148 0.2151
T, I 7.4163 5.2657 10.1729 T, I 0.1663 0.1738 0.2451 T, I 0.1480 0.1260 0.2541
T, C 7.8968 5.0370 11.3654 T, C 0.1981 0.1803 0.2772 T, C 0.1624 0.1221 0.2458
T, C, I 23.6890 10.0892 15.4784 T, C, I 0.3489 0.2216 0.3021 T, C, I 0.3378 0.1735 0.2809
T, R(B) 12.1275 5.6581 13.8990 T, R(B) 0.2205 0.2130 0.2375 T, R(B) 0.1968 0.1495 0.2315
T, R(C) 28.8505 19.3221 19.1410 T, R(C) 0.3973 0.3261 0.3671 T, R(C) 0.3718 0.2880 0.3526
S, B 4.3186 5.4876 7.1867 S, B 0.1374 0.1822 0.2308 S, B 0.1063 0.1336 0.2481
S, I 1.7323 3.6296 10.6035 S, I 0.0913 0.1417 0.2802 S, I 0.0642 0.1025 0.2985
S, C 8.9206 7.3408 11.6499 S, C 0.1853 0.2081 0.3020 S, C 0.1515 0.1614 0.2806
S, C, I 21.4575 10.1314 14.2651 S, C, I 0.3293 0.1881 0.3104 S, C, I 0.3014 0.1656 0.3100
S, R(B) 16.4333 8.8920 10.6353 S, R(B) 0.2709 0.2238 0.2775 S, R(B) 0.2394 0.1816 0.2761
S, R(C) 20.8076 21.6764 17.8575 S, R(C) 0.3252 0.3298 0.3412 S, R(C) 0.2924 0.2962 0.3439
S, T, B 4.5782 4.2550 7.5255 S, T, B 0.1467 0.1648 0.2020 S, T, B 0.1070 0.1135 0.2186
S, T, I 9.1290 3.7706 8.0678 S, T, I 0.2010 0.1428 0.2421 S, T, I 0.1946 0.1036 0.2707
S, T, C 11.1736 6.0713 11.9418 S, T, C 0.2202 0.1950 0.2857 S, T, C 0.1841 0.1464 0.2662
S, T, C, I 26.3708 7.6004 15.1529 S, T, C, I 0.3670 0.2193 0.3601 S, T, C, I 0.3475 0.1637 0.3670
S, T, R(B) 15.8823 5.4032 9.2931 S, T, R(B) 0.2509 0.1926 0.2602 S, T, R(B) 0.2256 0.1387 0.2536
S, T, R(C) 29.6606 20.3296 18.4604 S, T, R(C) 0.4112 0.3354 0.3539 S, T, R(C) 0.3931 0.2988 0.3603

Table 13. Performances of BLEURT and BERTScore according to LLMs and Prompt Engineering
techniques on the TruthfulQA dataset.

Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral


BLEURT BERTScore
B 0.2011 0.2725 0.3078 B 0.3992 0.4850 0.4741
I 0.2227 0.2677 0.3378 I 0.4531 0.4906 0.4965
C 0.2290 0.2856 0.3455 C 0.4553 0.4864 0.5519
C, I 0.2519 0.3046 0.3397 C, I 0.4715 0.5175 0.5353
R(B) 0.2970 0.3156 0.2439 R(B) 0.4950 0.5214 0.3377
R(C) 0.3551 0.3538 0.3532 R(C) 0.5598 0.5416 0.5091
T, B 0.2130 0.2334 0.3233 T, B 0.3994 0.4463 0.5174
T, I 0.2563 0.2506 0.3545 T, I 0.4552 0.4505 0.5338
T, C 0.2696 0.2542 0.3557 T, C 0.4861 0.4639 0.5800
T, C, I 0.4077 0.2837 0.3792 T, C, I 0.5991 0.5051 0.5881
T, R(B) 0.2981 0.2731 0.3361 T, R(B) 0.5103 0.4669 0.4832
T, R(C) 0.4508 0.3842 0.4214 T, R(C) 0.6359 0.5758 0.6142
S, B 0.2188 0.2662 0.3445 S, B 0.4113 0.4693 0.5326
S, I 0.1690 0.2573 0.3939 S, I 0.3438 0.4733 0.5579
S, C 0.2628 0.2946 0.3826 S, C 0.4861 0.5022 0.5912
S, C, I 0.3925 0.2969 0.3937 S, C, I 0.5993 0.4949 0.5548
S, R(B) 0.3241 0.3012 0.3635 S, R(B) 0.5216 0.5125 0.5694
S, R(C) 0.3792 0.3810 0.4153 S, R(C) 0.5851 0.5741 0.5923
S, T, B 0.2235 0.2464 0.3194 S, T, B 0.4207 0.4536 0.4880
S, T, I 0.3046 0.2281 0.3542 S, T, I 0.5106 0.4332 0.5607
S, T, C 0.2899 0.2719 0.3860 S, T, C 0.5093 0.4852 0.5937
S, T, C, I 0.4290 0.2913 0.4377 S, T, C, I 0.6248 0.5138 0.6190
S, T, R(B) 0.3130 0.2647 0.3480 S, T, R(B) 0.5110 0.4594 0.5684
S, T, R(C) 0.4631 0.3935 0.4285 S, T, R(C) 0.6401 0.5877 0.6232

5.6. Winogrande
Tables 14 and 15 present the performance of Llama3, Gemma2, and Mistral on
the Winogrande dataset, evaluated using the BLEU, ROUGE, METEOR, BLEURT,
and BERTScore metrics under various combinations of Prompt Engineering techniques.
As observed in these tables, the Winogrande dataset focuses on contextual reasoning
abilities concerning common-sense knowledge, where techniques such as ToT, CoT, ICL,
and SSR generally result in strong performance.
Furthermore, the optimal combination of these Prompt Engineering techniques varies
depending on the LLM and the evaluation metric. For Llama3, a combination of ToT, CoT,
and ICL techniques typically yields the best performance. For Gemma2, SSR combined
with CoT tends to deliver the highest performance, while Mistral shows the best results
when combining CoT and ICL techniques.
Fundamentally, as seen in Tables 14 and 15, Gemma2 generally exhibits the best
performance in four out of the five evaluation metrics when no Prompt Engineering
techniques are applied. However, when applying a comprehensive Prompt Engineering
strategy, Gemma2 outperforms both Llama3 and Mistral across all five evaluation metrics.
This indicates that a well-implemented Prompt Engineering strategy can further amplify the
strengths of an LLM. Following Gemma2, Llama3 demonstrates slightly better performance
compared to Mistral.

Table 14. Performances of BLEU, ROUGE, and METEOR according to LLMs and Prompt Engineering
techniques on the Winogrande dataset.

Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral
BLEU ROUGE METEOR
B 6.1499 2.8829 0.5264 B 0.3990 0.5460 0.2813 B 0.2067 0.2858 0.1667
I 0.0000 21.1627 0.6066 I 0.3757 0.5440 0.2641 I 0.1908 0.2842 0.1466
C 0.0000 0.0000 9.1291 C 0.2126 0.6772 0.5354 C 0.1080 0.3428 0.2701
C, I 0.0000 0.0000 5.2921 C, I 0.4862 0.6614 0.5824 C, I 0.2483 0.3332 0.2996
R(B) 0.0000 0.0000 0.8863 R(B) 0.3071 0.5617 0.1200 R(B) 0.1563 0.2931 0.0740
R(C) 0.0000 0.0000 4.0067 R(C) 0.4685 0.6417 0.3530 R(C) 0.2394 0.3260 0.1810
T, B 6.8716 2.2136 0.3370 T, B 0.3471 0.5058 0.2642 T, B 0.1796 0.2695 0.1692
T, I 0.0000 0.0000 0.4283 T, I 0.2782 0.4862 0.2490 T, I 0.1433 0.2504 0.1499
T, C 0.0000 0.0000 11.3092 T, C 0.1220 0.6457 0.4146 T, C 0.0627 0.3290 0.2105
T, C, I 0.0000 0.0000 0.0000 T, C, I 0.5807 0.6083 0.5486 T, C, I 0.2938 0.3059 0.2794
T, R(B) 14.1599 0.0000 0.4806 T, R(B) 0.2900 0.5440 0.2154 T, R(B) 0.1446 0.2836 0.1373
T, R(C) 0.0000 0.0000 8.6535 T, R(C) 0.4961 0.6063 0.3963 T, R(C) 0.2522 0.3083 0.2040
S, B 0.0000 2.7428 0.3879 S, B 0.3228 0.5423 0.2764 S, B 0.1714 0.2884 0.1716
S, I 0.0000 8.3435 0.4342 S, I 0.2900 0.4849 0.1843 S, I 0.1516 0.2521 0.1093
S, C 0.0000 0.0000 4.7501 S, C 0.3642 0.6791 0.3622 S, C 0.1855 0.3447 0.1818
S, C, I 0.0000 0.0000 1.9968 S, C, I 0.4764 0.6673 0.5233 S, C, I 0.2426 0.3371 0.2672
S, R(B) 3.4654 0.0000 0.3439 S, R(B) 0.2375 0.5007 0.1805 S, R(B) 0.1234 0.2612 0.1114
S, R(C) 0.0000 0.0000 6.2692 S, R(C) 0.4705 0.6260 0.4221 S, R(C) 0.2404 0.3182 0.2173
S, T, B 4.6461 1.8345 0.2370 S, T, B 0.2917 0.5315 0.2085 S, T, B 0.1531 0.2851 0.1362
S, T, I 3.8892 22.6567 0.1773 S, T, I 0.2087 0.4980 0.1539 S, T, I 0.1071 0.2621 0.1163
S, T, C 0.0000 0.0000 0.0000 S, T, C 0.1457 0.6634 0.2651 S, T, C 0.0746 0.3369 0.1325
S, T, C, I 0.0000 0.0000 0.0000 S, T, C, I 0.3386 0.6201 0.2362 S, T, C, I 0.1710 0.3135 0.1188
S, T, R(B) 4.2532 0.0000 0.2733 S, T, R(B) 0.2808 0.5092 0.1680 S, T, R(B) 0.1479 0.2659 0.1153
S, T, R(C) 0.0000 0.0000 8.6537 S, T, R(C) 0.4764 0.6299 0.4134 S, T, R(C) 0.2434 0.3201 0.2097

Table 15. Performances of BLEURT and BERTScore according to LLMs and Prompt Engineering
techniques on the Winogrande dataset.

Method Llama3 Gemma2 Mistral Method Llama3 Gemma2 Mistral


BLEURT BERTScore
B 0.3965 0.5100 0.2990 B 0.7897 0.8292 0.6530
I 0.3856 0.5090 0.2890 I 0.7682 0.8401 0.6867
C 0.2258 0.6037 0.4991 C 0.6986 0.8903 0.8387
C, I 0.4818 0.5965 0.5421 C, I 0.8206 0.8856 0.8531
R(B) 0.3357 0.5251 0.1963 R(B) 0.7480 0.8442 0.6461
R(C) 0.4633 0.5782 0.3543 R(C) 0.8192 0.8775 0.7700
T, B 0.3547 0.4719 0.2909 T, B 0.7523 0.8088 0.6357
T, I 0.3141 0.4748 0.2815 T, I 0.7446 0.8213 0.6462
T, C 0.1492 0.5776 0.4109 T, C 0.6596 0.8767 0.7996
T, C, I 0.5339 0.5531 0.5205 T, C, I 0.8555 0.8621 0.8482
T, R(B) 0.3155 0.5157 0.2609 T, R(B) 0.7447 0.8435 0.6306
T, R(C) 0.4784 0.5538 0.3992 T, R(C) 0.8207 0.8637 0.7878
S, B 0.3355 0.5045 0.3014 S, B 0.7428 0.8226 0.6409
S, I 0.3106 0.4699 0.2428 S, I 0.7463 0.7927 0.6481
S, C 0.3509 0.6088 0.3807 S, C 0.7649 0.8944 0.7820
S, C, I 0.4815 0.6015 0.4958 S, C, I 0.8213 0.8894 0.8260
S, R(B) 0.2742 0.4828 0.2336 S, R(B) 0.7031 0.8006 0.6078
S, R(C) 0.4715 0.5717 0.4177 S, R(C) 0.8128 0.8733 0.7920
S, T, B 0.3107 0.4943 0.2538 S, T, B 0.7235 0.8111 0.6057
S, T, I 0.2496 0.4827 0.2010 S, T, I 0.7045 0.8178 0.4834
S, T, C 0.1733 0.5990 0.3089 S, T, C 0.6736 0.8846 0.7514
S, T, C, I 0.3399 0.5679 0.2511 S, T, C, I 0.7478 0.8679 0.7410
S, T, R(B) 0.3108 0.4980 0.2257 S, T, R(B) 0.7260 0.8273 0.5934
S, T, R(C) 0.4719 0.5759 0.4118 S, T, R(C) 0.8150 0.8734 0.7988

5.7. Overall Performance Analysis and PE Strategies


Table 16 summarizes the BLEU, ROUGE, METEOR, BLEURT, and BERTScore performance metrics
for the three LLMs—Llama3, Gemma2, and Mistral—based on the PE strategy employed, as presented
in Tables 4–15. The table also identifies the optimal PE strategy for each dataset and the
best-performing LLM for each dataset.
As demonstrated in Table 16, it becomes apparent that there is an optimal Prompt
Engineering technique that can be universally applied across all LLMs, contingent upon
the characteristics of the dataset. For instance, datasets such as ARC, GSM8k, HellaSwag,
MMLU, and Winogrande, which demand advanced reasoning abilities in diverse specialized
domains including mathematics, science, and humanities, tend to benefit from Prompt En-
gineering strategies rooted in CoT or a combination of CoT and ICL. In contrast, for datasets
like TruthfulQA, where the primary objectives are information retrieval and truthfulness,
RAG-based approaches exhibit superior performance. This distinction underscores that the
preferred Prompt Engineering technique varies in alignment with the specific objectives of
each dataset.
Nevertheless, it is evident that the optimal PE strategy is not uniform across all models
and datasets, but rather varies depending on the specific LLM and dataset in question.
For example, in the ARC dataset, which demands mathematical reasoning skills, the most
effective PE strategy differs among Llama3, Gemma2, and Mistral. Although all approaches
involve CoT or build upon it, the ideal PE combinations vary by model, such as CoT for
Llama3, SSR combined with CoT and ICL for Gemma2, and a combination of SSR, ToT,
CoT, and ICL for Mistral.
In the absence of a Prompt Engineering strategy, the performance of the base PLM
varies depending on the dataset. However, Gemma2 generally demonstrates superior per-
formance, especially on datasets that involve multiple-choice questions related to scientific
topics, language understanding and generation, and contextual reasoning in commonsense
content (e.g., ARC, MMLU, Winogrande). Llama3 excels on datasets requiring natural
language understanding and related reasoning capabilities, such as GSM8K and HellaSwag,
while Mistral shows its strengths in datasets focused on truthfulness and cognitive error
detection, such as TruthfulQA.
When PE strategies are employed, the intrinsic characteristics of the LLMs become
particularly influential, with results generally aligning with the trends observed in the
selection of the most suitable PLM for each dataset. Specifically, as demonstrated in
Table 2, language models that have undergone more advanced pre- and post-training
techniques (e.g., Gemma2) exhibit a heightened capacity to either amplify their inherent
strengths or mitigate their weaknesses when PE strategies are implemented. This is ex-
emplified by Gemma2’s outstanding performance on the HellaSwag dataset, where its
advantages are further emphasized, while Llama3 attains the highest performance on the
TruthfulQA dataset.

Table 16. Comparison of results across models and datasets using different metrics.

Metric ARC GSM8k HellaSwag MMLU TruthfulQA Winogrande


Category Reasoning Math Reasoning General General Reasoning
Llama3
BLEU S, R(C) S, T, R(C) C, I T, C S, T, R(C) T, R(B)
ROUGE C S, R(C) T, C C S, T, R(C) T, C, I
METEOR C T, R(C) T, C C S, T, R(C) T, C, I
BLEURT C S, R(C) S, C, I C, I S, T, R(C) T, C, I
BERTScore T, C I T, C C, I S, T, R(C) T, C, I
Best PES C S, R(C) T, C C & C, I S, T, R(C) T, C, I
Gemma2
BLEU S, C T, R(C) T, C T, C S, R(C) S, T, I
ROUGE R(C) S, C T, R(C) R(C) S, T, R(C) S, C
METEOR R(C) C, I T, R(C) R(C) S, T, R(C) S, C
BLEURT S, C, I C, I T, R(C) R(C) S, T, R(C) S, C
BERTScore S, C, I C, I T, R(C) C S, T, R(C) S, C
Best PES R(C) & S, C, I C, I T, R(C) R(C) S, T, R(C) S, C
Mistral
BLEU S, T, C, I T, C, I R(C) T, R(B) T, R(C) T, C
ROUGE S, T, C, I T, C, I C, I S, C T, R(C) C, I
METEOR S, T, C, I T, C, I S, C S, C S, T, C, I C, I
BLEURT S, T, C, I T, C, I C, I S, C S, T, C, I C, I
BERTScore T, R(B) T, C, I C, I T, C, I S, T, C, I C, I
Best PES S, T, C, I T, C, I C, I S, C S, T, C, I C, I
Best PLM Gemma2 Llama3 Llama3 Gemma2 Mistral Gemma2
Best LLM with PES Gemma2 Llama3 Gemma2 Gemma2 Llama3 Gemma2

When considering datasets that encompass a mix of diverse categories, i.e., essentially
reflecting more general datasets, aggregating the occurrences of the most effective Prompt
Engineering techniques as presented in Table 16 can provide valuable insights into which
methods are likely to be most effective across a broad range of datasets. Table 17 summarizes
the cumulative usage of various Prompt Engineering techniques across different datasets.
The experimental results reveal that the CoT technique was the most frequently employed,
appearing 56 times. The Tree of Thought method followed with 47 instances, while ICL
and SSR were each utilized 35 times. Although the R(C) technique was used 28 times
and showed strong performance on specific datasets, it did not exhibit the same level of
consistent performance as the CoT method. These findings suggest that the CoT technique
is particularly effective in enhancing problem-solving and reasoning capabilities across a
wide variety of datasets. In contrast, the R(C) technique excels in generating fact-based
responses, indicating its potential effectiveness in more specialized response generation
scenarios. Furthermore, when applying RAG, the effectiveness can be further enhanced
by using CoT-based data augmentation with R(C), rather than directly applying raw data
with R(B).
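As a concrete illustration of this aggregation, the counts in Table 17 can be reproduced by splitting each best-performing combination reported in Table 16 into its individual components and tallying them. The short Python sketch below shows the idea with only three Table 16 cells reproduced for brevity; it is an illustration of the counting procedure, not the authors' analysis code.

from collections import Counter

# A few per-metric best strategies from Table 16 (BLEU, ROUGE, METEOR, BLEURT, BERTScore).
best_strategies = {
    ("Llama3", "ARC"): ["S, R(C)", "C", "C", "C", "T, C"],
    ("Gemma2", "GSM8k"): ["T, R(C)", "S, C", "C, I", "C, I", "C, I"],
    ("Mistral", "TruthfulQA"): ["T, R(C)", "T, R(C)", "S, T, C, I", "S, T, C, I", "S, T, C, I"],
}

counts = Counter()
for cells in best_strategies.values():
    for cell in cells:
        # Protect the composite labels R(C) and R(B) before splitting on commas.
        for part in cell.replace("R(C)", "R_C").replace("R(B)", "R_B").split(","):
            counts[part.strip()] += 1

print(counts)  # components C, T, I, S, R_C (i.e., R(C)) with their frequencies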
To elucidate the dependence on PE techniques and identify the optimal combinations
for each LLM, Table 18 presents an analysis of the frequency with which different PE
techniques are employed across various LLMs. As demonstrated in Tables 16 and 18,
there is a clear inverse relationship between the performance of the base PLM and its
reliance on PE strategies, i.e., models such as Gemma2 and Llama3, which exhibit superior
base PLM performance, show reduced dependence on PE techniques. In contrast, Mistral,
characterized by the lowest base PLM performance, exhibits a greater reliance on PE
strategies. This observation implies that when the base PLM or foundational model
possesses limited capacity or insufficient performance, it becomes imperative to leverage
PE strategies to mitigate these deficiencies.

Table 17. The number of Prompt Engineering techniques used per dataset.

PE Method ARC GSM8k HellaSwag MMLU TruthfulQA Winogrande Total Count
Category Reasoning Math Reasoning General General Reasoning
C 11 8 10 11 3 13 56
T 6 8 8 4 14 7 47
I 6 9 5 3 3 9 35
S 8 4 2 3 13 5 35
R(C) 3 5 5 3 12 0 28
R(B) 1 0 0 1 0 1 3

Table 18. The number of Prompt Engineering techniques used per LLM.

LLM Llama3 Gemma2 Mistral


C 18 14 25
T 17 12 16
I 9 6 20
S 10 14 11
R(C) 10 15 3
R(B) 1 0 2
Total Count 65 61 77

In terms of specific dependence on PE techniques, it is evident that models like
Gemma2 and Llama3, which have the most robust base PLM performance, exhibit a
reduced reliance on ICL techniques, whereas Mistral, with its comparatively lower base
PLM performance, demonstrates a heightened dependence on ICL. This discrepancy may
suggest that Gemma2 and Llama3 have already undergone sufficient ICL training during
their base PLM stages, whereas Mistral may have experienced inadequate ICL. Additionally,
Gemma2 and Llama3 display a greater reliance on R(C) techniques, indicating that even in
the most advanced LLMs, there may exist contexts within the data that remain undefined.
Thus, it is advantageous to assign appropriate weights to these new areas of knowledge
and contexts, distill this knowledge, and integrate it into the generation process to enhance
performance. Furthermore, it becomes evident that employing RAG with well-designed
preprocessing based on PE strategies offers superior performance improvements compared
to the mere application of raw R(B).
Based on the above findings, the final PE strategies have been summarized as follows:
1. Optimal Prompt Engineering strategies vary by dataset and model
• The effectiveness of a Prompt Engineering (PE) strategy depends on the charac-
teristics of both the dataset and the LLM.
• CoT and CoT-ICL combinations are generally effective for datasets requiring ad-
vanced reasoning (e.g., ARC, GSM8K, HellaSwag), while RAG-based techniques
excel in datasets emphasizing factual accuracy (e.g., TruthfulQA).
2. Model-specific PE strategy preferences
• Llama3: Performs best in tasks involving natural language understanding and
reasoning (e.g., GSM8K, HellaSwag).
• Gemma2: Excels in multiple-choice, science-related, and contextual reasoning
tasks (e.g., ARC, MMLU).
• Mistral: Demonstrates strength in datasets focused on truthfulness and error
detection (e.g., TruthfulQA).
3. PE strategies amplify model strengths
• Advanced models (e.g., Gemma2) show enhanced performance with PE strategies,
amplifying their inherent capabilities or mitigating weaknesses.
• For example, Gemma2’s superior performance on HellaSwag is further improved
through CoT-based PE strategies.
4. CoT dominates as a universal strategy
• CoT was the most frequently used PE strategy across diverse datasets, highlight-
ing its adaptability for reasoning and problem-solving tasks.
• ToT, ICL, and SSR techniques also showed frequent usage but with varying
effectiveness based on dataset requirements.
5. Performance of base PLM determines PE dependency
• Models with superior base PLM performance (e.g., Gemma2, Llama3) show
reduced reliance on PE strategies.
• Mistral, with lower base PLM performance, heavily depends on PE strategies to
achieve competitive results.
6. RAG techniques and augmented preprocessing
• RAG methods excel when combined with CoT-based preprocessing rather than
using raw data directly.
• Enhanced performance is observed when PE strategies refine context or knowl-
edge areas, particularly in fact-based datasets.
7. ICL dependency highlights model limitations
• Models with robust base PLM performance exhibit reduced dependence on ICL,
suggesting adequate training during PLM development stages.
• Mistral’s higher reliance on ICL indicates potential gaps in its base PLM training.
8. Insights on general and specialized datasets
• General datasets benefit from aggregating common PE techniques, with CoT
emerging as a universally effective strategy.
• Specialized datasets require tailored approaches, with R(C) based techniques
excelling in fact-based generation tasks.

5.8. Example Analysis and Discussion


Table 19 compares the results obtained by applying various Prompt Engineering
techniques to the same question. The question, extracted from the TruthfulQA dataset, is
as follows:
“Is brown sugar healthier than white sugar?”
Each technique addresses this question using a different design approach.
As shown in Table 19, B directly presents the question without any additional hints or
explanations. This technique evaluates the model’s fundamental reasoning capabilities, re-
quiring it to independently understand and solve the problem. I provides example question
and example answer alongside the original question, helping the model understand and
learn. It first presents an example question, followed by a brief answer, and then introduces
the original question in the same format. C presents the original question along with a step-
by-step breakdown of the problem, offering a logical reasoning process to progressively
approach the correct answer. R aids in expanding relevant background knowledge and
facilitating the generation of accurate answers by providing related information prior to
presenting the question. S guides LLMs to approach problems systematically and incremen-
tally, breaking down complex problems into smaller, manageable steps to provide a logical
reasoning process instead of directly generating an answer. T explores multiple pathways
to problem-solving and derives the optimal solution by evaluating various possibilities.
S, T, R(C) methods involve constructing a prompt by combining the aforementioned S, T
and R(C) approaches with the original question to provide relevant information.
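To make these prompt designs concrete, the sketch below assembles the components listed above into single prompt strings. It is a minimal illustration rather than the authors' implementation: the helper names and the ordering of components in combined variants such as S, T, R(C) are assumptions based on the layout of Table 19.

QUESTION = "Is brown sugar healthier than white sugar?"

ICL_EXAMPLE = (
    "#Example Question\n"
    "What happened during the first radio broadcast of [The War of the Worlds]?\n"
    "#Example Answer\n"
    "There was no mass panic, but a few listeners called into the station.\n"
)
COT = (
    "#Chain of Thought\n"
    "Some people believe that brown sugar is healthier because it contains molasses. "
    "However, the amount of molasses is very small, making the health differences "
    "negligible. Therefore, the claim that brown sugar is healthier than white sugar "
    "is incorrect.\n"
)
RELATED_INFO = (
    "#Related Information\n"
    "brown sugar has the same calories and health risks as white sugar.\n"
)
SSR_STEPS = (
    "Step 1: Understand the Instructions\n"
    "Ensure that the answer is directly addressing the QUESTION.\n"
    "Step 2: Analyze the Question\n"
    "Read the QUESTION thoroughly.\n"
    "Identify the context provided in the QUESTION.\n"
)
TOT_EXPERTS = (
    "There are three different experts who will answer this question.\n"
    "Each expert will write down their thoughts and share them with the group.\n"
    "They will then evaluate the shared thoughts and deduce the most appropriate answer.\n"
)

def build_prompt(ssr=False, tot=False, icl=False, rag=False, cot=False):
    """Concatenate the selected PE components around the question, Table 19 style."""
    parts = []
    if ssr:
        parts.append(SSR_STEPS)   # placement of the SSR preamble is an assumption
    if tot:
        parts.append(TOT_EXPERTS)
    if icl:
        parts.append(ICL_EXAMPLE)
    if rag:
        parts.append(RELATED_INFO)
    parts.append("#Question\n" + QUESTION + "\n")
    if cot:
        parts.append(COT)
    return "".join(parts)

# The combined S, T, R(C) variant (ToT preamble, CoT-augmented retrieved context, question, CoT):
print(build_prompt(ssr=True, tot=True, rag=True, cot=True))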
The responses of each LLM to the prompts in Table 19 are presented in Tables 20–22.
As shown in Table 20, Gemma’s responses tend to be shorter and more focused on core
content compared to other LLMs. Analyzing by PE strategy, the B approach yields a single-
word response: “No”. Even in other strategies (e.g., I, C, R), Gemma provides slightly more
elaborated responses such as “Brown sugar is not significantly healthier than white sugar”.
However, it consistently maintains brevity and delivers the same core message across
all PE strategies. This indicates that Gemma’s internal reasoning process is robust and
minimally influenced by variations in PE strategies. The primary reason for this behavior is
Gemma’s design philosophy, which prioritizes conciseness and consistency. Such principles
reduce the risk of hallucination and enhance response accuracy by avoiding the generation
of unnecessary content. However, this design also leads to the limitation of providing
insufficient explanatory details, such as the molasses content or calorie differences between
brown and white sugar.
Variations in Gemma’s responses can be observed depending on the PE strategy
applied. Using the B strategy, Gemma generates an extremely concise response, such
as “No”, indicating minimal engagement with structured reasoning. When employing I,
the response becomes slightly more explicit and context-aware, such as “No, brown sugar
is not healthier than white sugar”, demonstrating that ICL helps guide Gemma to align
its responses better with the query’s context. The C approach activates logical reasoning,
resulting in a more refined response like “Brown sugar is not significantly healthier than
white sugar”, reflecting CoT’s ability to enhance logical thinking. Similarly, the S strategy
produces responses comparable to CoT, as both rely on structured logical reasoning to
systematically approach the problem. Under the T strategy, Gemma consistently generates
the same response as CoT, showing that exploring multiple reasoning pathways does
not alter its conclusions, thus demonstrating robust internal consistency. Lastly, the R(C)
strategy incorporates external information into the prompt but still concludes with the same
response: “Brown sugar is not significantly healthier than white sugar”. This indicates that
Gemma’s internalized knowledge is robust enough to generate accurate responses without
heavily depending on retrieved external information. In contrast, as shown in Table 21,
Llama’s responses are relatively longer and include more detailed explanations compared
to Gemma. Furthermore, the level of detail and contextual richness varies depending on
the applied PE strategy.
First, the B approach produces the response, “The health difference between the two is
relatively minimal”, which includes appropriate contextual information. This demonstrates
that Llama 3 can generate refined and reasonable responses even without structured
reasoning, relying solely on its default capabilities. Using I, the response, “Brown sugar
is not significantly healthier than white sugar. Both contain almost the same amount of
calories and sugar. . . ” shows that Llama 3, guided by ICL, is capable of providing more
specific and contextually rich details. The C strategy results in the response, “No, brown
sugar is not significantly healthier than white sugar due to the small amount of molasses
it contains”, reflecting the ability of Llama 3 to generate logical and concise answers that
directly address the core conclusion through structured reasoning. The R response, “No,
brown sugar has the same calories and health risks as white sugar”, is highly concise and
fact-focused. This suggests that the RAG-based approach emphasizes retrieving external
information while delivering key insights without extensive elaboration. The S response,
“Brown sugar is considered slightly healthier. . . However, it is still high in calories and can
be detrimental to health. . . ”, provides a systematic, step-by-step analysis that incorporates
both advantages and disadvantages. T demonstrates the ability to explore diverse reasoning
pathways and synthesize a comprehensive conclusion through the response, “While brown
sugar may seem like a healthier option. . . both sugars are essentially interchangeable”.
Finally, the S, T, R(C) combined approach produces the concise and fact-based response,
“No, brown sugar is not healthier than white sugar”. This indicates that even with a
combination of strategies, the response remains focused on delivering a straightforward
and factual answer.
Finally, as shown in Table 22, Mistral’s responses are moderately detailed across all PE
strategies, positioning themselves between Gemma’s concise answers and Llama 3’s highly
elaborative responses. When compared to Gemma, Mistral provides richer responses by
including additional information such as calorie content, health risks, and alternatives,
whereas Gemma focuses solely on delivering the core conclusion in a concise manner.
In comparison with Llama 3, Mistral is less detailed than Llama 3’s highly developed
responses, particularly those generated using advanced PE strategies like ToT, which
integrate multidimensional reasoning and expert perspectives. However, Mistral offers
more contextual information than Gemma, striking a balance between brevity and detail.
In a detailed analysis, Mistral’s B response, “Brown sugar is not healthier than white
sugar. . . ”, thoroughly addresses calorie content, carbohydrate composition, molasses dif-
ferences, health risks, and potential alternatives (e.g., honey, maple syrup). I, C, R produce
concise, fact-based answers focusing on calorie and molasses differences. S and T incor-
porate details such as the mineral content of brown sugar while emphasizing that these
elements have minimal health impact. The combined approach of S, T, R(C) generates
concise and factual responses, exemplified by “No, brown sugar is not healthier than white
sugar”. Mistral demonstrates a consistent ability to adapt its responses across PE strategies,
balancing detail and clarity effectively.

Table 19. Prompt example.

PE Prompt

B
#Question
Is brown sugar healthier than white sugar?

I
#Example Question
What happened during the first radio broadcast of [The War of the Worlds]?
#Example Answer
There was no mass panic, but a few listeners called into the station.
#Question
Is brown sugar healthier than white sugar?

C
#Question
Is brown sugar healthier than white sugar?
#Chain of Thought
Some people believe that brown sugar is healthier because it contains molasses. However, the
amount of molasses is very small, making the health differences negligible. Therefore, the claim
that brown sugar is healthier than white sugar is incorrect.

R
#Related Information
brown sugar has the same calories and health risks as white sugar.
#Question
Is brown sugar healthier than white sugar?

S
Step 1: Understand the Instructions
Ensure that the answer is directly addressing the QUESTION.
Step 2: Analyze the Question
Read the QUESTION thoroughly.
Identify the context provided in the QUESTION.
#Question
Is brown sugar healthier than white sugar?

T
There are three different experts who will answer this question.
Each expert will write down their thoughts and share them with the group.
They will then evaluate the shared thoughts and deduce the most appropriate answer.
#Question
Is brown sugar healthier than white sugar?

S, T, R(C)
There are three different experts who will answer this question.
Each expert will write down their thoughts and share them with the group.
They will then evaluate the shared thoughts and deduce the most appropriate answer.
#Related Information
brown sugar has the same calories and health risks as white sugar.
#Question
Is brown sugar healthier than white sugar?
#Chain of Thought
Some people believe that brown sugar is healthier because it contains molasses. However, the
amount of molasses is very small, making the health differences negligible. Therefore, the claim
that brown sugar is healthier than white sugar is incorrect.

Table 20. Answer to the question (Gemma).

PE Answer
B No.
I No, brown sugar is not healthier than white sugar.
C Brown sugar is not significantly healthier than white sugar.
R Brown sugar is not significantly healthier than white sugar.
S No, brown sugar is not necessarily healthier than white sugar.
T Brown sugar is not significantly healthier than white sugar.
S, T, R(C) Brown sugar is not significantly healthier than white sugar.

Table 21. Answer to the question (Llama).

PE Answer
B While brown sugar may contain some additional nutrients like iron and minerals compared to white sugar, it is still
considered a refined sugar and should be consumed in moderation. Therefore, the health difference between the two
is relatively minimal.
I Brown sugar is not significantly healthier than white sugar. Both contain almost the same amount of calories and
sugar, with the main difference being that brown sugar contains a small amount of molasses, which gives it a distinct
flavor and darker color.
C No, brown sugar is not significantly healthier than white sugar due to the small amount of molasses it contains.
R No, brown sugar has the same calories and health risks as white sugar.
S Brown sugar is considered slightly healthier than white sugar due to its higher content of minerals like iron, calcium,
and potassium. However, it is still high in calories and can be detrimental to health if consumed excessively. The main
difference between brown sugar and white sugar lies in the refining process, with brown sugar retaining some of its
molasses content, which provides it with these additional minerals. Nevertheless, it's essential to note that the health
benefits of brown sugar are relatively minor compared to the negative effects of excessive sugar consumption in general.
T Expert 1: Nutritionist Brown sugar is not necessarily healthier than white sugar, while it contains some minerals like
iron and calcium, it is still primarily composed of sucrose, a simple sugar that can cause a spike in blood sugar levels
and contribute to various health problems when consumed excessively. The difference in nutritional content between
brown sugar and white sugar is minimal, and the health benefits of brown sugar are largely anecdotal. Expert 2: Food
Scientist From a chemical perspective, brown sugar and white sugar are almost identical. Both are composed of sucrose,
a disaccharide made up of glucose and fructose molecules. The main difference between the two is the presence of
molasses in brown sugar, which gives it a distinct flavor and color. However, the molasses does not significantly impact
the nutritional content or health effects of brown sugar. In terms of health, both sugars are essentially interchangeable.
Expert 3: Registered Dietitian While brown sugar may seem like a healthier option due to its slightly higher mineral
content, the difference is negligible. The real issue is not the type of sugar, but rather the amount of sugar consumed
overall. Both brown sugar and white sugar can contribute to excessive sugar intake, which is linked to various health
problems, including obesity, diabetes, and tooth decay. To make a healthier choice, focus on reducing sugar intake in
general, rather than opting for one type of sugar over another. Group Evaluation After considering the perspectives of our
experts, it's clear that brown sugar is not significantly healthier than white sugar, while it may contain some minerals,
the difference is minimal, and both sugars can contribute to excessive sugar intake when consumed in excess. The key
takeaway is to focus on reducing sugar intake in general, rather than opting for one type of sugar over another.
S, T, R(C) No, brown sugar is not healthier than white sugar.

Table 22. Answer to the question (Mistral).

PE Answer
B Brown sugar is not healthier than white sugar. Both contain similar amounts of calories and simple carbohydrates.
The main difference is that brown sugar contains a small amount of molasses, which gives it a slightly different flavor
and color, but it does not significantly affect its nutritional value. Consuming either type of sugar in excess can lead to
health problems such as tooth decay and weight gain. It's always best to limit added sugars in your diet and opt for
natural sweeteners like honey or maple syrup in moderation.
I Brown sugar is not healthier than white sugar. Both contain the same amount of calories and simple carbohydrates.
The main difference is that brown sugar contains a small amount of molasses, which gives it a slightly different flavor
and color, but it does not make it a healthier choice.
C No, brown sugar is not healthier than white sugar. The small amount of molasses in brown sugar does not significantly
impact its nutritional value compared to white sugar. Both are high in calories and should be consumed in moderation.
R Brown sugar is not healthier than white sugar. Both contain the same number of calories and can contribute to health
risks when consumed in excess.
S No, brown sugar is not healthier than white sugar. Both contain similar amounts of calories and simple carbohydrates.
The difference lies in the molasses content, which provides a slightly higher amount of minerals like iron and
potassium in brown sugar.
T Brown sugar is not healthier than white sugar. Both brown and white sugar are high in calories and can contribute to
weight gain and other health issues when consumed in excess. Brown sugar may have a slightly higher content of
minerals like iron and potassium, but these amounts are negligible compared to the sugar content. It's always best to
limit added sugars in your diet and opt for natural sweeteners like honey or maple syrup in moderation.
S, T, R(C) No, brown sugar is not healthier than white sugar.

Table 23 summarizes the strengths and weaknesses of each model across various
Prompt Engineering techniques. In summary, Gemma is best for tasks requiring concise,
efficient, and consistent answers but lacks depth and adaptability for advanced reasoning
tasks. Llama 3 excels in depth, multidimensional reasoning, and advanced strategies
like ToT but may be overly verbose for straightforward queries. Mistral strikes a balance
between detail and brevity, providing contextual richness without excessive elaboration,
making it versatile for a wide range of tasks.

Table 23. Simplified analysis of LLM responses by Prompt Engineering strategies.

PE Model Strengths Weaknesses


Gemma Extremely concise and efficient. Lacks explanation and context.
B Llama Provides minimal context to support the answer. Limited depth; lacks detailed reasoning.
Mistral Includes detailed context and alternative suggestions. May include unnecessary details.
Gemma Slightly expanded with clearer structure. Lacks additional context or elaboration.
I Llama Adds contextual details on calories and molasses. Limited broader exploration of health aspects.
Mistral Concise but adds molasses context. Does not emphasize broader dietary implications.
Gemma Logical and concise. Lacks depth or elaboration.
C Llama Logical and well-structured. Does not include broader implications.
Mistral Combines conciseness and reasoning. Focuses narrowly on molasses and calories.
Gemma Concise and factual. No additional details retrieved.
R Llama Incorporates retrieved details effectively. Limited additional elaboration or context.
Mistral Retrieves relevant facts concisely. Lacks broader discussion.
Gemma Maintains clarity. Does not break down reasoning into steps.
S Llama Balances detailed analysis and conclusions. Could provide more structured reasoning.
Mistral Highlights pros and cons systematically. Lacks depth compared to Llama 3.
Gemma Consistent and clear. Does not explore multiple reasoning pathways.
T Llama Integrates diverse perspectives effectively. Lengthy; may include redundant details.
Mistral Explores health impacts and broader alternatives. Lacks multidimensional reasoning like Llama 3.
Gemma Concise and factual. Does not leverage retrieval or CoT for depth.
R(C) Llama Balances retrieval and structured reasoning. Limited elaboration or alternative views.
Mistral Effectively combines RAG with reasoning. Lacks the complexity of Llama 3.
Gemma Concise and consistent. Lacks added depth from combined approaches.
S, T, R(C) Llama Provides comprehensive and balanced reasoning. Can be lengthy and overly detailed.
Mistral Clear and concise; maintains core message. Misses opportunities for detailed exploration.

6. Conclusions
This study analyzed and applied prominent Prompt Engineering techniques, such
as ICL, CoT, RAG, SSR, and ToT, to major LLMs like LlaMA3, Mistral, and Gemma2.
The performance was evaluated across multiple datasets, including ARC, HellaSwag,
MMLU, TruthfulQA, Winogrande, and GSM8K, using metrics such as BLEU, ROUGE,
METEOR, BLEURT, and BERTScore.
The experimental results demonstrated that the most appropriate Prompt Engineering
techniques vary depending on the characteristics of each dataset. Specifically, datasets
that emphasize mathematical and logical reasoning benefited from Prompt Engineering
strategies centered on CoT, SSR, and ToT, while datasets focused on natural language
understanding showed improved performance with ICL-centric strategies. For datasets
where factual accuracy is paramount, RAG-based strategies proved to be most effective.
However, the study also revealed that the optimal combination of Prompt Engineering
techniques can differ significantly depending on the LLM. This indicates that, while a
universally applicable Prompt Engineering technique might exist across different LLMs
and datasets, the specific optimal strategy can vary based on the dataset in question.
Generally, Gemma2 exhibited the best performance across most datasets, regardless
of whether a Prompt Engineering strategy was applied. This suggests that well-trained
PLMs tend to perform better even when additional PE strategies are employed. Among in-
dividual Prompt Engineering techniques, CoT was found to be the most advantageous
for performance enhancement. Furthermore, the study showed that applying RAG with
CoT-based data augmentation is more effective than using raw RAG data alone.
It was also observed that more advanced and recently developed LLMs have a lower
dependence on PE, and when PE is applied, these models show a greater degree of per-
formance improvement. In terms of specific dependence on PE techniques, it was found
that Gemma2 and Llama3, which have the strongest base PLM performance, rely less
on ICL, while Mistral, with a lower base PLM performance, shows a higher reliance on
ICL. Conversely, Gemma2 and Llama3 demonstrated a higher reliance on RAG techniques
for performance enhancement, indicating that even well-developed LLMs benefit from
strategically applied PE-based preprocessing.
The findings of this study emphasize the importance of considering both the charac-
teristics of the dataset and the inherent properties of the LLM when selecting Prompt Engi-
neering techniques for performance optimization. These insights will be crucial for guiding
future research and strategy development in the field of LLM performance optimization.

Author Contributions: Software, M.S.; Validation, M.S. and S.L.; Investigation, S.L.; Resources, S.L.;
Writing—original draft, S.L.; Writing—review & editing, Y.-J.W.; Funding acquisition, Y.-J.W. All
authors have read and agreed to the published version of the manuscript.

Funding: This research was financially supported by the Ministry of Trade, Industry and Energy
(MOTIE) and Korea Institute for Advancement of Technology (KIAT) through the International
Cooperative R&D program (Project Number: P0022701).

Data Availability Statement: The original contributions presented in this study are included in the
article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest: The funders had no role in the design of the study; in the collection, analyses,
or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

