Improving Small-Scale Large Language Models Function Calling For Reasoning Tasks
Abstract—Recent advancements in Large Language Models (LLMs) have demonstrated exceptional capabilities in natural language understanding and generation. While these models excel in general complex reasoning tasks, they still face challenges in mathematical problem-solving and logical reasoning. To address these limitations, researchers have explored function calling abilities, allowing LLMs to execute provided functions and utilize their outputs for task completion. However, concentrating on specific tasks can make the use of large-scale LLMs very inefficient, because of the expensive training and inference stages they require in terms of computational resources. This study introduces a novel framework for training smaller language models in function calling, focusing on specific logical and mathematical reasoning tasks. The approach aims to improve the performance of small-scale models on these tasks through function calling, ensuring a high level of accuracy. Our framework employs an agent that, given a problem and a set of callable functions, queries the LLM by injecting a description and examples of the usable functions into the prompt and managing their calls in a step-by-step reasoning chain. This process is used to create a dataset of correct and incorrect reasoning chain chat completions from a large-scale LLM. This dataset is then used to train a smaller LLM using Reinforcement Learning from Human Feedback (RLHF), specifically employing the Direct Preference Optimization (DPO) technique. Experimental results demonstrate how the proposed approach balances the trade-off between model size and performance, improving the function calling ability of smaller models for reasoning tasks.

Index Terms—function calling, large language model, reasoning, logical reasoning, mathematical reasoning, reasoning task, first-order logic, LLM, RLHF, DPO, FOL, GSM8K

I. INTRODUCTION

Recent years have seen the rapid development of Large Language Models (LLMs), which have demonstrated exceptional natural language understanding and generation capabilities. Research has explored the unexpected abilities of LLMs beyond their primary training task of text prediction [1]. These models have shown promise in function calling for software APIs [2]–[6], boosted by the launch of GPT-4 plug-in features [7]. Integrated tools include web browsers, translation systems, Dialogue State Tracking (DST) [8] and robotics [9], [10]. Furthermore, while LLMs have shown promising results in general complex reasoning benchmarks, they still face challenges in mathematical problem-solving and logical capacities [11]. To address these limitations, researchers have proposed various techniques [12], [13], including the ability of function calling [14], which allows LLMs to execute provided functions and utilize their outputs to assist in task completion. These functions can vary from basic tools, like a calculator [15] that performs arithmetic operations, to more advanced methods. However, concentrating on specific tasks that use only a small portion of the available APIs underscores the inefficiency of depending solely on large models like GPT-4, which require significant computational resources for both the training and inference stages [7], [16]–[18]. This situation calls for the creation of smaller, task-specific LLMs that maintain core functionality while reducing operational costs [2], [19]. The trend towards smaller models, while promising, introduces new challenges. One significant concern is the increased likelihood of errors or "hallucinations," which can compromise the accuracy of output formatting [20]–[22]. Given that precise output formatting is essential for developing robust software applications, this issue becomes particularly critical. To address the drawbacks of oversized LLMs, which incur excessive training and inference costs, we introduce a novel framework to train smaller language models starting from the function calling abilities of large models, for specific logical and mathematical reasoning tasks. This framework involves the use of an agent that, given a problem and a set of possible functions useful for its solution, queries a large-scale LLM by injecting function descriptions and examples into the prompt and managing the proper function calls that the model needs to find the solution, all within a step-by-step reasoning chain. This procedure is used to create a dataset with correct and incorrect chat completions. The generated dataset is then used to train a smaller model using a Reinforcement Learning from Human Feedback (RLHF) [23]–[26] approach, known as Direct Preference Optimization (DPO) [27]. We present the methodology tested on two different types of reasoning tasks, First-Order Logic (FOL) and math. To this end, a set of FOL problems was built ad hoc, taking inspiration from the HuggingFace dataset SAGI-1/SYMBOLIC_DATA_PLUS_REASONING_DATA_V1 [28]. Examples of mathematical problems were drawn directly from the GSM8K [15], [29] dataset. In Section II, FOL and DPO are presented. In Section III, the pipeline and methodologies used to generate the dataset and train the small-scale model are shown. Finally, in Section IV we present the experimental results of our framework, where the performance of the trained model is compared to that of the original one.

G. A. Manduzio, F. A. Galatolo, M. G. C. A. Cimino, L. Cominelli and E. P. Scilingo are with the Dipartimento di Ingegneria dell'Informazione, Università di Pisa, 56122 Pisa, Italy (e-mail: grazianoalfredo.manduzio@phd.unipi.it; {federico.galatolo, mario.cimino, lorenzo.cominelli, enzo.scilingo}@unipi.it).
II. RELATED WORKS

A. First-Order Logic (FOL)

First-Order Logic (FOL), also known as First-Order Predicate Calculus or Predicate Logic, is a formal system that extends propositional logic to include variables, quantifiers, and predicates. This extension allows for greater expressive power in formalizing mathematical statements and reasoning [30], [31]. The syntax of FOL comprises both logical and non-logical symbols. Logical symbols include connectives (such as negation, conjunction, disjunction, implication, and biconditional), quantifiers (universal and existential), and parentheses. Non-logical symbols consist of constants, variables, function symbols, and predicate symbols. These components work together to create a rich language capable of expressing complex logical relationships and structures. The semantics of FOL provide a framework for interpreting formulas and determining their truth values [32]. A key feature of FOL is the use of quantifiers, which allow for statements about all or some elements in the domain [33]. The universal quantifier expresses that a property holds for all elements in the domain, while the existential quantifier expresses that a property holds for at least one element in the domain. These quantifiers significantly enhance the expressive power of FOL compared to propositional logic. FOL finds numerous applications across various fields [13], [34]. In mathematics, it is used for formalizing theories and proofs. In computer science, FOL is applied in the specification and verification of software and hardware systems. The field of artificial intelligence utilizes FOL for knowledge representation and automated reasoning. Additionally, linguistics employs FOL in the study of the formal semantics of natural languages.

B. FOL and AI

The field of artificial intelligence (AI) extensively utilizes First-Order Logic (FOL) for knowledge representation and automated reasoning. FOL's expressive power and formal semantics make it an ideal choice for capturing complex knowledge structures and facilitating inference in AI systems [35]. In knowledge representation, FOL allows for the formalization of domain-specific knowledge, enabling AI systems to reason about objects, their properties, and relationships in a structured manner [36]. This capability is crucial in expert systems, where domain knowledge is encoded as logical rules and facts, allowing the system to make informed decisions based on logical inference [37]. In the realm of automated reasoning, FOL serves as the foundation for many theorem-proving systems and logical inference engines [38]. These systems employ techniques such as resolution and unification to derive new knowledge from existing facts and rules, a process fundamental to various AI applications, including planning and decision-making [39]. Moreover, FOL has been instrumental in the development of answer set programming, a paradigm for declarative problem solving that has found applications in areas such as constraint satisfaction and automated planning [40]. The integration of FOL with probabilistic methods has led to the development of statistical relational learning and probabilistic logic programming, bridging the gap between logical and statistical AI approaches [41]. This fusion enables AI systems to reason with uncertainty while maintaining the structured representation offered by FOL. Additionally, FOL has played a significant role in the semantic web and ontology engineering, where it is used to define and reason about conceptual models of various domains [42].

C. Examples of First-Order Logic Statements

To better understand the components and structure of First-Order Logic (FOL), let's examine a specific example statement and break down its elements. Consider the following FOL statement:

∀x∃y(P(x) → (Q(x, y) ∧ R(y)))

This statement can be read in natural language as: "For every x, there exists a y such that if P(x) is true, then both Q(x, y) and R(y) are true."
The structure of this statement demonstrates several key features of FOL:
1) Quantification: The use of both universal (∀) and existential (∃) quantifiers allows us to make statements about all elements or about the existence of elements in our domain.
2) Variables: x and y are used to represent arbitrary elements in the domain, allowing for general statements about the relationships between elements.
3) Predicates: P, Q, and R represent properties or relations. P and R are unary predicates (they apply to one variable), while Q is a binary predicate (it applies to two variables), showing FOL's ability to express different types of relations.
4) Logical Structure: The statement uses implication (→) and conjunction (∧) to create a complex logical structure, demonstrating FOL's ability to express intricate logical relationships.

Constants in First-Order Logic: In addition to variables, FOL also includes constants, which represent specific, named individuals in the domain of discourse. Constants allow us to make statements about particular entities rather than arbitrary ones. To illustrate the difference between variables and constants, consider the following two FOL statements:

∃x(Movie(x) ∧ ActedIn(y, x))

∃x(Movie(x) ∧ ActedIn(seanconnery, x))

The first statement uses two variables, x and y. It can be read as: "There exists a movie x such that y acted in x." This statement asserts the existence of a movie and an actor, but does not specify who the actor is. The second statement uses a variable x and a constant seanconnery. It can be read as: "There exists a movie x such that Sean Connery acted in x." This statement is more specific, asserting the existence of a movie in which the particular individual Sean Connery acted. The use of the constant seanconnery allows us to make a claim about a specific person, whereas the variable y in the first statement could refer to any actor. The latter and other examples of FOL statements can be found in the HuggingFace dataset SAGI-1/SYMBOLIC_DATA_PLUS_REASONING_DATA_V1 [28].
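To make the semantics of the two movie statements concrete, the following minimal Python sketch evaluates them over a tiny hand-built domain. The domain, the relation data and the helper names are illustrative assumptions for this example only; they are not part of the paper or of its dataset.

    # Minimal model for the two example statements: a finite domain plus the
    # interpretations of the predicates Movie(x) and ActedIn(y, x).
    # All data below is made up purely for illustration.
    domain = {"goldfinger", "cast_away", "sean_connery", "tom_hanks"}
    movies = {"goldfinger", "cast_away"}
    acted_in = {("sean_connery", "goldfinger"), ("tom_hanks", "cast_away")}

    def Movie(x):                      # unary predicate
        return x in movies

    def ActedIn(y, x):                 # binary predicate: y acted in movie x
        return (y, x) in acted_in

    # First statement, with the free variable y bound to a chosen actor:
    # ∃x (Movie(x) ∧ ActedIn(y, x))
    def exists_movie_with_actor(y):
        return any(Movie(x) and ActedIn(y, x) for x in domain)

    # Second statement, with the constant seanconnery:
    # ∃x (Movie(x) ∧ ActedIn(seanconnery, x))
    second_statement = any(Movie(x) and ActedIn("sean_connery", x) for x in domain)

    print(exists_movie_with_actor("tom_hanks"))   # True
    print(second_statement)                       # True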
D. Direct Preference Optimization

While supervised fine-tuning is a common approach, alternative methods leveraging Reinforcement Learning from Human Feedback (RLHF) have gained prominence. One such method is Proximal Policy Optimization (PPO) [43], which integrates a reward model into the reinforcement learning framework for policy optimization. Despite its effectiveness, PPO's requirement for extensive human feedback to train the reward model makes it resource-intensive and time-consuming. A more efficient and equally effective alternative is Direct Preference Optimization (DPO) [27]. DPO distinguishes itself by enabling the model to learn a policy directly from user preference data, eliminating the need for an explicit reward function. Furthermore, DPO has demonstrated superior stability compared to PPO. The DPO process begins with gathering human feedback: assessors evaluate pairs of model-generated responses to identical prompts, creating a dataset of preference pairs. Unlike PPO, which trains a separate reward model, DPO incorporates these preferences directly into the training objective. Fig. 1 illustrates the key differences between these approaches. DPO's optimization algorithm iteratively adjusts the model parameters to maximize the likelihood of outputs aligning with the reviewers' preferences. This is achieved through a specialized loss function (Equation 1) that directly penalizes the generation of less-preferred outputs. For a training instance (x, y_w, y_l), comprising an input x, a preferred output y_w, and a non-preferred output y_l, the loss function compares the probabilities of the reference policy π_ref (the initial policy) with those of the new policy π_θ for both the preferred and the non-preferred outputs.

Figure 1: Comparison of the typical RLHF approach and Direct Preference Optimization (DPO), [27].
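For reference, the standard DPO objective introduced in [27], to which this description and the paper's Equation (1) refer, has the form

    L_DPO(π_θ; π_ref) = −E_(x, y_w, y_l)∼D [ log σ( β log(π_θ(y_w | x) / π_ref(y_w | x)) − β log(π_θ(y_l | x) / π_ref(y_l | x)) ) ],

where σ is the logistic function and β is a hyperparameter controlling the deviation from the reference policy; the exact expression used in the paper is not reproduced in this excerpt.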
• once tasks and problems are defined, a set of functions needs in turn to be defined for each problem. These functions serve for the LLM to solve the reasoning steps, control the chain flow, and verify the intermediate and final responses, working similarly to the process-supervised reward models (PRMs) or the outcome-supervised reward models (ORMs) presented in [44];
• choice of a pre-trained large-scale LLM to generate the dataset of right and wrong completions, using chain-of-thought prompting that forces the LLM to reason step by step. An agent queries the LLM, stores the response and calls the function, until the solution to the problem is obtained. Final solutions can be right or wrong. The chains of thought for each problem are recorded;
• fine-tuning a small-scale LLM using reinforcement learning on the given dataset. The DPO algorithm is performed.
The pipeline of the presented framework is shown in Fig. 2.

Figure 2: Pipeline of the presented framework (blocks: task and problem definition → function formulation → large-scale LLM dataset generation → small-scale LLM training).

B. Experimental setup

To test our framework, the following choices were made:
• we defined 6 FOL problems, 3 with a single predicate and 3 with double predicates; we then drew 9 mathematical problems from the GSM8K dataset, for a total of 15 problems overall;
• we defined a set of callable functions for each problem. For example, for the mathematical problems we defined the fundamental arithmetic operations. Furthermore, we defined the verifier CheckCorrectChain() and the chain flow controller Stop();
• we created a specific dataset for the DPO algorithm, using
assistant role, the agent calls the function the LLM needs to solve the reasoning step and proceeds to the next one, recording the output as associated to the user role, and so on, until the final solution is obtained. The CheckCorrectChain() function works as a verifier that checks whether or not the intermediate and final responses are called in the right order. The same workflow is implemented for a GSM8K dataset problem, but the verifier checks only the correctness of the final solution. Hence, at each iteration the agent records the entire chain of LLM responses/function outputs as a set of assistant/user role chat contents, presented in the format commonly used for LLM chat completions [47], [48]. Therefore the LLM bases its next response on the previous ones recorded in the chain. A block diagram of the microchain library workflow is shown in Fig. 3.

Figure 3: Schematic of the agent-based library microchain, used to generate the dataset (blocks: prompt, set of callable functions, agent, LLM).

D. Dataset Creation Process

Our dataset D∗ ≡ {P, Dw, Dl} was generated using Llama3-70B through microchain, producing, given a set of prompts (denoted by P), a range of correct and incorrect chain completions (whose sets are denoted by Dw and Dl, respectively). The generation process encompassed various scenarios, differentiated by the type of the previously discussed reasoning tasks (GSM8K or FOL) and by the maximum number of allowed iterations n in the chain of thought (nmax = 10 or 20), where each iteration is a couple of assistant/user role chat contents. For each problem in the dataset, we recorded a prompt, a set of correct completions (right completions), and a set of incorrect completions (wrong completions). For a better understanding of the dataset disposition, see Fig. 4 and Fig. 5, where the indices i_t ∈ B, i_n ∈ B, i_p ∈ I_p ≜ {0, 1, 2, . . . , 8}, and i ∈ I ⊆ {B × B × I_p}, with B ≜ {0, 1}, identify the type of task (mathematical or logical), the maximum chain iteration number (10 or 20), the problem within the specific task (ranging from 0 to 5 for the FOL problem set and from 0 to 8 for the GSM8K problem set), and the global index, equal to (i_t, i_n, i_p), respectively. An example of a right completion yw^{i,j} for x^i is shown in Tab. IV, where j ∈ J^i ≜ {0, 1, . . . , n^i − 1} and n^i is the number of all the right completions related to the i-th prompt. Tab. V shows another example of right completion, in which the chain initially presents a syntax error in the function calling but the program then stops correctly in n ≤ nmax = 10 iterations. Finally, Tab. VI shows an example of wrong completion yl^{i,k} for x^i, where k ∈ K^i ≜ {0, 1, . . . , n̄^i − 1} and n̄^i is the number of all the wrong completions related to the i-th prompt. The chain presents a syntax error and the Stop() function is prematurely called before the chain is correctly executed. For each problem prompt, we generated n_c^i = n^i + n̄^i = 1000 samples, for a total amount of |D∗| = 15000 samples.

E. Data augmentation of the DPO dataset

Given the original dataset D∗ of all tuples (x^i, yw^{i,j}, yl^{i,k}), ∀i ∈ I, ∀j ∈ J^i, ∀k ∈ K^i, we built an augmented dataset Da in which, for a given prompt x^i, a combination of all the correct completions yw^{i,j} with all the wrong completions yl^{i,k} was made, extending the cardinality of the dataset from |D∗| = |P| = |Dw| + |Dl| to |Da| = |Dw| × |Dl|. The final DPO dataset D in (1) is a subset of uniform random samples from Da, such that D ⊂ Da, where the cardinality of D is |D| = n_s, i.e. the number of samples belonging to D. In particular, we drew a total count of n_s = 40000 samples. This approach allows an augmentation of the original dataset. A sample of the final dataset can be written as

(x = x^i ∼ P, y_w = yw^{i,j} ∼ Dw, y_l = yl^{i,k} ∼ Dl).
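A minimal sketch of this pairing-and-subsampling step is given below, assuming the right and wrong completions are stored per prompt as plain Python lists; the function name, the storage layout and the field names are illustrative assumptions (the prompt/chosen/rejected fields simply follow the preference-pair format commonly used for DPO training data).

    import itertools
    import random

    def build_dpo_dataset(per_prompt, n_s, seed=0):
        """per_prompt maps each prompt x^i to a pair (right_completions, wrong_completions),
        each a list of chat-completion strings."""
        augmented = []
        for prompt, (right, wrong) in per_prompt.items():
            # Cartesian product: every right completion paired with every wrong one (Da).
            for y_w, y_l in itertools.product(right, wrong):
                augmented.append({"prompt": prompt, "chosen": y_w, "rejected": y_l})
        # Uniform random subsample of n_s preference pairs (D ⊂ Da).
        random.seed(seed)
        return random.sample(augmented, min(n_s, len(augmented)))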
We split the dataset into a training set and a test set. Specifically, data related to 4 FOL problems and 5 GSM8K problems were used for training, while the remaining data (related to the remaining 6 problems) were used for the model inference test. Further details are given in Tab. I and Tab. II.

Dataset        FOL     GSM8K   Overall
Training set   17823   14045   31868
Test set       7403    10729   18132

Table I: Sample counts for the training and test sets, for the 2 different problem tasks.

i_p    Train (FOL)   Test (FOL)   Train (GSM8K)   Test (GSM8K)
0      4421          -            1582            -
1      2957          -            2941            -
2      5214          -            2581            -
3      5231          -            5039            -
4      -             5247         1902            -
5      -             2156         -               3381
6      -             -            -               1673
7      -             -            -               386
8      -             -            -               5289
tot.   17823         7403         14045           10729

Table II: Sample counts for the training and test sets, for each of the 15 problems.
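As described above, each recorded chain is a list of assistant/user messages in the usual chat-completion format [47], [48]. The snippet below illustrates one possible recorded iteration; the message contents are made up for illustration and are not taken from the generated dataset.

    # One illustrative chain fragment in chat-completion message format.
    chain = [
        {"role": "user", "content": "Verify if Tom Hanks acted in the movie Cast Away."},
        # assistant turn: the function call chosen by the LLM for this reasoning step
        {"role": "assistant", "content": 'ActsIn(actor="Tom Hanks", movie_title="Cast Away")'},
        # user turn: the executed function's output, fed back to the LLM
        {"role": "user", "content": "True"},
        # ...further assistant/user pairs, until CheckCorrectChain() and Stop() are called
    ]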
three libraries specific for the training optimization process, namely accelerate [51], deepspeed [52] and peft [53]. The training parameters used for DPO are specified in Tab. VII. Tab. VIII and IX summarize the hardware and the software specifications used, respectively. Note that, although the machine specifications report 4 GPUs, only one GPU was used for the training process.
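As an illustration of how such a DPO fine-tuning run can be wired together with the libraries mentioned above, the sketch below uses trl's DPOTrainer with a LoRA configuration from peft. Only the optimizer, learning rate, scheduler and warmup steps are taken from Tab. VII, and the model is the one named in the paper; the dataset path, LoRA settings, batch size, number of epochs and β are illustrative assumptions, not the paper's exact configuration.

    from datasets import load_dataset
    from peft import LoraConfig
    from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
    from trl import DPOTrainer

    model_name = "mistralai/Mistral-7B-Instruct-v0.2"      # small-scale model used in the paper
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Preference pairs with "prompt", "chosen" and "rejected" text fields (hypothetical file).
    dataset = load_dataset("json", data_files="dpo_pairs.jsonl", split="train")

    training_args = TrainingArguments(
        output_dir="dpo-mistral-7b",
        per_device_train_batch_size=1,      # illustrative
        num_train_epochs=1,                 # illustrative
        optim="adamw_bnb_8bit",             # Tab. VII
        learning_rate=5e-8,                 # Tab. VII
        lr_scheduler_type="cosine",         # Tab. VII
        warmup_steps=10,                    # Tab. VII
        # deepspeed="zero_stage3.json",     # ZeRO stage 3 would be enabled via a DeepSpeed config
    )

    peft_config = LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32)   # illustrative

    trainer = DPOTrainer(
        model,
        ref_model=None,        # with a peft_config, the frozen base weights act as the reference policy
        args=training_args,
        beta=0.1,              # illustrative
        train_dataset=dataset,
        tokenizer=tokenizer,
        peft_config=peft_config,
    )
    trainer.train()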
Prompt

Act as a movie expert.
You can use the following functions:

Reasoning(reasoning: str)
Use this function for your internal reasoning.
Example:
Reasoning(reasoning="The next step to take is...")

Actor(name: str)
Predicate to check if a given name is an actor.
Example:
Actor(name="Sean Connery")

Movie(x: str)
Predicate that queries IMDb to determine if the argument is a movie.
Example:
Movie(x="Goldfinger")

ActsIn(actor: str, movie_title: str)
Check if a specific actor acted in a given movie.
Example:
ActsIn(actor="Sean Connery", movie_title="Goldfinger")

CheckCorrectChain()
Check if the labels are correct.
Example:
CheckCorrectChain()

Stop()
Use this function to stop the program.
Example:
Stop()

Verify if Tom Hanks acted in the movie Cast Away.

Table III: An example of a prompt x^i of a sample of the dataset D∗.
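The following minimal Python sketch shows how the agent side of such a prompt could execute the textual function calls emitted by the model. The parsing approach, the hard-coded lookup standing in for an IMDb query, and the omission of the chain-control functions are illustrative assumptions; this is not the actual API of the microchain library [46].

    import ast

    # Toy implementations of the callable functions listed in the prompt above.
    KNOWN_MOVIES = {"Goldfinger", "Cast Away"}
    KNOWN_CAST = {("Sean Connery", "Goldfinger"), ("Tom Hanks", "Cast Away")}

    def Reasoning(reasoning: str):
        return "ok"                                   # internal reasoning returns no result

    def Actor(name: str) -> bool:
        return any(name == a for a, _ in KNOWN_CAST)

    def Movie(x: str) -> bool:
        return x in KNOWN_MOVIES

    def ActsIn(actor: str, movie_title: str) -> bool:
        return (actor, movie_title) in KNOWN_CAST

    FUNCTIONS = {"Reasoning": Reasoning, "Actor": Actor, "Movie": Movie, "ActsIn": ActsIn}

    def execute_call(text: str):
        """Parse a call such as ActsIn(actor="Tom Hanks", movie_title="Cast Away")
        emitted by the LLM and dispatch it to the registered Python function."""
        node = ast.parse(text.strip(), mode="eval").body
        if (not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name)
                or node.func.id not in FUNCTIONS):
            raise ValueError(f"not a known function call: {text!r}")
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        return FUNCTIONS[node.func.id](**kwargs)

    print(execute_call('ActsIn(actor="Tom Hanks", movie_title="Cast Away")'))   # True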
G. Performance metrics

After the training of the small-scale LLM, we tested the inference performance of both the original and the trained small-scale LLM, producing 2 new different datasets D∗_O and D∗_T, respectively, with 1000 generated samples for each task problem. The metric used to compare model performances is the accuracy, defined for the i-th problem prompt as a^i ≜ |J^i| / (|J^i| + |K^i|) = n^i / n_c^i. The task average accuracy is defined as ā_it ≜ Σ_{i ∈ I_t} a^i / |I_t|, where I_t ⊂ I is the problem index subset related to the t-th task. The average accuracy computed on subsets of I_t related to different nmax values and types of data subsets (training and test data subsets) is, for the sake of brevity, still denoted with ā_it. Finally, the overall average accuracy is defined as ā ≜ Σ_{i ∈ I} a^i / |I|.

IV. EXPERIMENTAL RESULTS

The percentage accuracy a_O^i × 100 for the i-th problem and the dataset D∗_O is compared to a_T^i × 100 for the dataset D∗_T, as shown in Fig. 6 and 7. We can see how the FOL task performance is completely improved, except for problem 3 with nmax = 20. Model performance is also improved on the GSM8K data subset, where on the whole it is better than that of the original model, apart from some cases. Note that, as expected, performance with nmax = 20 is better than with nmax = 10, for both the original and the fine-tuned models. Overall performance is thus improved, but the fine-tuned model struggles more with the GSM8K problems. The same trend is observed in Tab. X, where the task average percentage accuracy ā_it × 100 and the overall average percentage accuracy ā × 100 are reported for different models, types of data subsets and nmax values. Note that the performances on the training and test sets, taken individually, are also improved after the fine-tuning process for all the configurations of nmax values and tasks. The trend is also confirmed in Fig. 8, where a comparison of the original and the fine-tuned model task average percentage accuracies ā_it × 100 with the overall average percentage accuracy ā × 100 is presented. Furthermore, the results are statistically evaluated with a Wilcoxon signed-rank test, where p-values are computed comparing the original and the fine-tuned model accuracies, for each task data subset and for the whole dataset. The values are all lower than 0.05, confirming that the difference is statistically significant, as shown in Tab. XI, where W represents the Wilcoxon signed-rank test statistic. Finally, the loss metric in (1) computed at each global step k during the training process is shown in Fig. 9, where the evaluation on the test set is executed at the end of each epoch. A global step represents a single update of the model parameters; it is incremented every time the optimizer performs a backpropagation operation and updates the model weights. Each epoch implies that the model has processed all the preference pairs (right and wrong completions) present in the dataset. The loss metric is computed using the wandb library [54].

V. CONCLUSIONS AND FUTURE WORKS

In this study, we introduced a novel framework for improving the function calling abilities of small-scale LLMs, focusing on specific logical and mathematical reasoning tasks. Our approach addresses the inefficiencies and high computational costs associated with relying solely on large-scale LLMs by leveraging the capabilities of small-scale models through RLHF. We employed an agent-based system that interacts with a large-scale LLM to generate a dataset comprising correct and incorrect step-by-step reasoning chain chat completions in the domains of First-Order Logic (FOL) and mathematical reasoning tasks drawn from the GSM8K dataset. Utilizing this dataset, we trained a smaller model, Mistral-7B-Instruct-v0.2, employing RLHF with the DPO technique. Our experimental results demonstrate significant improvements in the performance of the small-scale model on FOL tasks, achieving near-perfect accuracy in most cases. While the improvements on the GSM8K mathematical problems are more modest, the trained model still outperforms the original model in overall accuracy. These findings suggest that our framework effectively improves the function calling abilities of smaller models, enhancing their capabilities in the use of external tools (the callable functions) and their abilities in the given reasoning tasks. By successfully improving the integration of
Table IV: An example of a right completion yw^{i,j} of a sample of the dataset D∗. The chain presents the functions called in the correct order.
Table V: An example of a right completion yw^{i,j} of a sample of the dataset D∗. At the beginning the chain presents a syntax error in the function calling, but then the completion is correctly executed.
Table VI: An example of a wrong completion yl^{i,k} of a sample of the dataset D∗. The chain presents a syntax error, and the function Stop() is called before the chain is correctly executed.
Figure 4: Schematic of the generated original dataset D∗ disposition (tree: root project → dataset → task i_t ∈ B = {0, 1} → nmax i_n ∈ B = {0, 1} → problem i_p ∈ I_p = {0, 1, . . . , 8}; leaves: x^i ∈ P, yw^{i,j} ∈ Dw with j ∈ J^i ≜ {0, 1, . . . , n^i − 1}, yl^{i,k} ∈ Dl with k ∈ K^i ≜ {0, 1, . . . , n̄^i − 1}; i ∈ I ⊆ {B × B × I_p}).
Optimizer        adamw_bnb_8bit
Learning Rate    5e-8 (with decay)
LR Scheduler     cosine
Zero Stage       3
Warmup Steps     10

Table VII: Parameters for model training with DPO.

OS               Linux Ubuntu 20.04.6 LTS
Python           v3.8.10
torch            v2.3.0
transformers     v4.42.3
trl              v0.8.6
accelerate       v0.31.0
deepspeed        v0.14.2
peft             v0.11.1
wandb            v0.17.0

Table IX: Software environment.
Figure 5: Extended schematic of the generated original dataset D∗ disposition (tree: root project → dataset → {GSM8K, FOL}; leaves: x^i ∈ P, yw^{i,j} ∈ Dw with j = 0, 1, . . . , n^i − 1, yl^{i,k} ∈ Dl with k = 0, 1, . . . , n̄^i − 1; i ∈ I ⊆ {B × B × I_p}).
nmax = 10
Model        Dataset        FOL       GSM8K     Overall
original     Training set   76.47%    12.82%    44.65%
original     Test set       90.19%    2.56%     46.38%
original     Whole set      81.04%    8.26%     44.65%
fine-tuned   Training set   88.75%    16.81%    52.78%
fine-tuned   Test set       99.75%    2.77%     51.26%
fine-tuned   Whole set      92.42%    10.57%    51.50%

nmax = 20
Model        Dataset        FOL       GSM8K     Overall
original     Training set   83.18%    20.22%    51.70%
original     Test set       94.14%    7.22%     50.68%
original     Whole set      86.84%    14.45%    50.65%
fine-tuned   Training set   90.50%    26.36%    58.43%
fine-tuned   Test set       99.75%    8.18%     53.97%
fine-tuned   Whole set      93.58%    18.28%    55.93%

nmax = 10 and 20
Model        Dataset        FOL       GSM8K     Overall
original     Training set   79.83%    16.52%    48.18%
original     Test set       92.16%    4.89%     48.53%
original     Whole set      83.94%    11.35%    47.65%
fine-tuned   Training set   89.62%    21.58%    55.60%
fine-tuned   Test set       99.75%    5.47%     52.61%
fine-tuned   Whole set      93.00%    14.42%    53.71%

Table X: Task average percentage accuracy ā_it × 100 and overall average percentage accuracy ā × 100 reported for different models, types of data subsets and nmax values.

Figure 8: Comparison of the task average percentage accuracies ā_it × 100 and the overall average percentage accuracy ā × 100 calculated for the original and the fine-tuned model, for different task data subsets and the whole dataset.

Dataset   W     p-value
FOL       77    4.88e-4 < α
GSM8K     137   1.18e-2 < α
Overall   416   2.50e-5 < α

Table XI: Wilcoxon signed-rank test results comparing original and fine-tuned model accuracies for different datasets. Significant p-values (p < α = 0.05) are highlighted in bold. W is the Wilcoxon test statistic.
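As an illustration of how the per-problem accuracies and the paired comparison behind Tab. XI can be computed, the sketch below uses scipy.stats.wilcoxon on paired per-problem accuracies of the original and fine-tuned models; the numeric arrays are placeholders, not the paper's measured values.

    from scipy.stats import wilcoxon

    def accuracy(n_right: int, n_wrong: int) -> float:
        """Per-problem accuracy a^i = |J^i| / (|J^i| + |K^i|)."""
        return n_right / (n_right + n_wrong)

    # Paired per-problem accuracies (placeholder values, one entry per problem/configuration).
    acc_original  = [0.76, 0.90, 0.13, 0.03, 0.83, 0.94, 0.20, 0.07]
    acc_finetuned = [0.89, 1.00, 0.17, 0.04, 0.91, 1.00, 0.26, 0.08]

    w_statistic, p_value = wilcoxon(acc_original, acc_finetuned)
    print(f"W = {w_statistic}, p = {p_value:.3g}")   # significant if p < 0.05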
[7] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., "Gpt-4 technical report," arXiv preprint arXiv:2303.08774, 2023.
[8] Z. Li, Z. Z. Chen, M. Ross, P. Huber, S. Moon, Z. Lin, X. L. Dong, A. Sagar, X. Yan, and P. A. Crook, "Large language models as zero-shot dialogue state tracker through function calling," arXiv preprint arXiv:2402.10466, 2024.
[9] S. Vemprala, R. Bonatti, A. Bucker, and A. Kapoor, "Chatgpt for robotics: Design principles and model abilities. 2023," Published by Microsoft, 2023.
[10] C. Wang, S. Hasler, D. Tanneberg, F. Ocker, F. Joublin, A. Ceravola, J. Deigmoeller, and M. Gienger, "Lami: Large language models for multi-modal human-robot interaction," in Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, 2024, pp. 1–10.
[11] J. Sun, C. Zheng, E. Xie, Z. Liu, R. Chu, J. Qiu, J. Xu, M. Ding, H. Li, M. Geng et al., "A survey of reasoning with foundation models," arXiv preprint arXiv:2312.11562, 2023.
[12] S. Qiao, Y. Ou, N. Zhang, X. Chen, Y. Yao, S. Deng, C. Tan, F. Huang, and H. Chen, "Reasoning with language model prompting: A survey," arXiv preprint arXiv:2212.09597, 2022.
[13] M. Huth, Logic in Computer Science: Modelling and reasoning about systems. Cambridge University Press, 2004.
[14] S. Kim, S. Moon, R. Tabrizi, N. Lee, M. W. Mahoney, K. Keutzer, and A. Gholami, "An llm compiler for parallel function calling," arXiv preprint arXiv:2312.04511, 2023.
[15] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano et al., "Training verifiers to solve math word problems," arXiv preprint arXiv:2110.14168, 2021.
[16] T. B. Brown, "Language models are few-shot learners," arXiv preprint arXiv:2005.14165, 2020.
[17] A. Radford, "Improving language understanding by generative pre-training," 2018.
[18] Y. Wu, F. Jia, S. Zhang, H. Li, E. Zhu, Y. Wang, Y. T. Lee, R. Peng, Q. Wu, and C. Wang, "An empirical study on challenging math problem solving with gpt-4," arXiv preprint arXiv:2306.01337, 2023.
[19] V. Pallagani, B. C. Muppasani, K. Roy, F. Fabiano, A. Loreggia, K. Murugesan, B. Srivastava, F. Rossi, L. Horesh, and A. Sheth, "On the prospects of incorporating large language models (llms) in automated planning and scheduling (aps)," in Proceedings of the International Conference on Automated Planning and Scheduling, vol. 34, 2024, pp. 432–444.
[20] J.-Y. Yao, K.-P. Ning, Z.-H. Liu, M.-N. Ning, and L. Yuan, "Llm lies: Hallucinations are not bugs, but features as adversarial examples," arXiv preprint arXiv:2310.01469, 2023.
[21] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen et al., "Siren's song in the ai ocean: a survey on hallucination in large language models," arXiv preprint arXiv:2309.01219, 2023.
[22] Z. Ji, T. Yu, Y. Xu, N. Lee, E. Ishii, and P. Fung, "Towards mitigating llm hallucination via self reflection," in Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 1827–1843.
[23] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, "Deep reinforcement learning from human preferences," Advances in Neural Information Processing Systems, vol. 30, 2017.
[24] N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. F. Christiano, "Learning to summarize with human feedback," Advances in Neural Information Processing Systems, vol. 33, pp. 3008–3021, 2020.
[25] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving, "Fine-tuning language models from human preferences," arXiv preprint arXiv:1909.08593, 2019.
[26] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., "Training language models to follow instructions with human feedback," Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
[27] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, "Direct preference optimization: Your language model is secretly a reward model," 2023.
[28] SAGI-1, "Symbolic data plus reasoning data v1," https://huggingface.co/datasets/SAGI-1/SYMBOLIC_DATA_PLUS_REASONING_DATA_V1, 2023, accessed: [insert access date].
[29] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman, "Training verifiers to solve math word problems," arXiv preprint arXiv:2110.14168, 2021.
[30] H. B. Enderton, A mathematical introduction to logic. Elsevier, 2001.
[31] E. Mendelson, "Introduction to mathematical logic," 2015.
[32] D. van Dalen, Logic and Structure. Springer Science & Business Media, 2012.
[33] R. Smullyan, "First-order logic. dover publications inc," 1995.
[34] M. Fitting, First-order logic and automated theorem proving. Springer Science & Business Media, 2012.
[35] S. J. Russell and P. Norvig, Artificial intelligence: a modern approach. Pearson, 2016.
[36] R. J. Brachman, Knowledge Representation and Reasoning. Morgan Kaufman/Elsevier, 2004.
[37] J. C. Giarratano and G. Riley, Expert systems: principles and programming. Brooks/Cole Publishing Co., 1989.
[38] A. J. Robinson and A. Voronkov, Handbook of automated reasoning. Elsevier, 2001, vol. 1.
[39] R. Kowalski, Logic for problem solving. Department of Computational Logic, Edinburgh University, 1974, vol. 75.
[40] V. Lifschitz, Answer set programming. Springer Heidelberg, 2019, vol. 3.
[41] L. De Raedt and K. Kersting, "Probabilistic inductive logic programming," in Probabilistic inductive logic programming: theory and applications. Springer, 2008, pp. 1–27.
[42] S. Staab and R. Studer, Handbook on ontologies. Springer Science & Business Media, 2013.
[43] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017.
[44] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe, "Let's verify step by step," arXiv preprint arXiv:2305.20050, 2023.
[45] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., "The llama 3 herd of models," arXiv preprint arXiv:2407.21783, 2024.
[46] F. Galatolo, "Microchain," https://github.com/galatolofederico/microchain, 2023, accessed: 2024-09-11.
[47] H. Face, "Chat templating in transformers," 2024, accessed: 2024-09-10. [Online]. Available: https://huggingface.co/docs/transformers/chat_templating
[48] OpenAI, "Chat completions guide," 2024, accessed: 2024-09-10. [Online]. Available: https://platform.openai.com/docs/guides/chat-completions
[49] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica, "Efficient memory management for large language model serving with pagedattention," in Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
[50] L. von Werra, Y. Belkada, L. Tunstall, E. Beeching, T. Thrush, N. Lambert, and S. Huang, "Trl: Transformer reinforcement learning," https://github.com/huggingface/trl, 2020.
[51] S. Gugger, L. Debut, T. Wolf, P. Schmid, Z. Mueller, S. Mangrulkar, M. Sun, and B. Bossan, "Accelerate: Training and inference at scale made simple, efficient and adaptable," https://github.com/huggingface/accelerate, 2022.
[52] J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, "Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters," in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 3505–3506.
[53] S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada, S. Paul, and B. Bossan, "Peft: State-of-the-art parameter-efficient fine-tuning methods," https://github.com/huggingface/peft, 2022.
[54] L. Biewald, "Experiment tracking with weights and biases," 2020, software available from wandb.com. [Online]. Available: https://www.wandb.com/
[55] L. Yang and A. Shami, "On hyperparameter optimization of machine learning algorithms: Theory and practice," Neurocomputing, vol. 415, pp. 295–316, 2020.
[56] G. A. Manduzio, F. Galatolo, M. G. Cimino, M. Bruscia, L. Cominelli, and E. P. Scilingo, "Advanced control of humanoid facial robotics: A deep learning approach to inverse kinematics," Authorea Preprints, 2024.
[57] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, "Quantization and training of neural networks for efficient integer-arithmetic-only inference," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2704–2713.
[58] G. Hinton, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.