
Are Retrials All You Need? Enhancing Large Language Model Reasoning Without Verbalized Feedback

Nearchos Potamitis, Aarhus University (nearchos.potamitis@cs.au.dk)
Akhil Arora, Aarhus University (akhil.arora@cs.au.dk)

arXiv:2504.12951v1 [cs.CL] 17 Apr 2025

Abstract

Recent advancements in large language models (LLMs) have catalyzed the development of general-purpose autonomous agents, demonstrating remarkable performance in complex reasoning tasks across various domains. This surge has spurred the evolution of a plethora of prompt-based reasoning frameworks. A recent focus has been on iterative reasoning strategies that refine outputs through self-evaluation and verbalized feedback. However, these strategies require additional computational complexity to enable models to recognize and correct their mistakes, leading to a significant increase in their cost. In this work, we introduce the concept of “retrials without feedback”, an embarrassingly simple yet powerful mechanism for enhancing reasoning frameworks by allowing LLMs to retry problem-solving attempts upon identifying incorrect answers. Unlike conventional iterative refinement methods, our method does not require explicit self-reflection or verbalized feedback, simplifying the refinement process. Our findings indicate that simpler retrial-based approaches often outperform more sophisticated reasoning frameworks, suggesting that the benefits of complex methods may not always justify their computational costs. By challenging the prevailing assumption that more intricate reasoning strategies inherently lead to better performance, our work offers new insights into how simpler, more efficient approaches can achieve optimal results. So, are retrials all you need?

1 Introduction

With strong reasoning and problem-solving abilities, large language models (LLMs) (Brown et al., 2020) such as GPT-4 (Achiam et al., 2024; OpenAI and et al., 2024), LLaMA (Touvron et al., 2023a,b; Grattafiori et al., 2024), and PaLM (Anil et al., 2023), have sparked a new-found interest in building general-purpose autonomous agents. LLM-based agents have portrayed excellent performance on reasoning (Cobbe et al., 2021) and knowledge-intensive tasks (West et al., 2009), often requiring interactions with complex environments, such as playing complex video games (Fan et al., 2022), performing web navigation (Yao et al., 2022), or enabling tool-use (Schick et al., 2023). Naturally, the rise of LLM-based agents has contributed to the prosperity of prompt-based reasoning frameworks (Wei et al., 2022; Wang et al., 2023; Besta et al., 2024; Sel et al., 2024; Yang et al., 2024; Yao et al., 2024; Zhou et al., 2024; Shinn et al., 2023; Yao et al., 2023; Potamitis et al., 2024) that further enhance the problem-solving and reasoning abilities of LLMs.

Iterative refinement and its challenges. A burgeoning area of research in this field involves the exploration of iterative reasoning strategies that refine and improve responses through self-evaluation. Methods such as Self-Refine (Madaan et al., 2023) and Reflexion (Shinn et al., 2023) employ explicit verbalized feedback to guide an LLM in correcting its mistakes and refining its outputs. While effective, these approaches add an additional layer of complexity, requiring the model to both recognize its own errors and articulate useful self-correction strategies. Moreover, owing to an incremental increase in the context window size, verbalized feedback monotonically increases the cost for every subsequent iteration of the overall reasoning pipeline, thereby rendering iterative refinement-based methods exorbitantly expensive. For instance, with GPT-4 as the base model, RAFA (Liu et al., 2024), the state-of-the-art refinement strategy, requires a whopping ≃ $600 on the Game of 24 benchmark task (Yao et al., 2024) comprising just 100 examples in the test set.
Present work. In this paper, we introduce the concept of “retrials without feedback”, an embarrassingly simple yet effective mechanism that enhances reasoning frameworks by allowing them to retry a problem-solving attempt whenever an incorrect answer is identified. Unlike Reflexion (Shinn et al., 2023), which relies on explicit self-generated feedback, retrials operate without requiring verbalized introspection: an incorrect answer simply triggers another attempt until the correct solution is found or a predefined computational budget is exhausted.

Our results show that under the retrial mechanism, simpler methods such as chain-of-thoughts (Wei et al., 2022) often outperform more sophisticated reasoning frameworks such as tree-of-thoughts (Yao et al., 2024) or Reflexion (Shinn et al., 2023). This suggests that, when given the opportunity to retry, the added complexity of sophisticated reasoning frameworks and self-reflective approaches may not always justify their computational cost. This raises a fundamental question: Are retrials all you need? By reframing the evaluation of reasoning strategies through the lens of cost efficiency, our work challenges the assumption that more intricate frameworks necessarily lead to better performance and highlights the importance of reconsidering optimization strategies for reasoning with LLMs.
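
To make the mechanism concrete, the following minimal sketch (illustrative only, not the exact implementation used in our experiments) shows a retrial loop for a single problem. Here, generate_answer and is_correct are hypothetical placeholders for the underlying prompting strategy and the task's deterministic verifier, and max_attempts stands in for the computational budget.

    # Minimal sketch of "retrials without feedback" (illustrative only).
    # generate_answer() and is_correct() are hypothetical placeholders for the
    # underlying reasoning strategy (e.g., IO or CoT prompting) and the task's
    # deterministic verifier; no verbalized feedback is passed between attempts.
    def solve_with_retrials(problem, generate_answer, is_correct, max_attempts=10):
        for attempt in range(1, max_attempts + 1):
            answer = generate_answer(problem)   # independent attempt, no reflection
            if is_correct(problem, answer):     # trivial deterministic check
                return answer, attempt
        return None, max_attempts               # budget exhausted, unsolved
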
2 Related Work

In this section, we review works that overlap closely with our study.

Prompt-based reasoning. Recent research focuses on developing strategies to enhance the reasoning capabilities of LLMs. Few-shot prompting employs demonstrations of high-quality input/output samples to exploit the eagerness of the LLMs to imitate patterns seen in their context window (Brown et al., 2020). Algorithm of thoughts (AoT) (Sel et al., 2024) goes a step further by including algorithmic examples within the prompt to propel the LLM through algorithmic reasoning pathways. Chain-of-Thought (CoT) prompting (Nye et al., 2021; Wei et al., 2022; Kojima et al., 2022) as well as other variants such as Decomposed Prompting (Khot et al., 2023) and Least-to-Most (Zhou et al., 2023) guide LLMs to decompose a complex question into a sequence of thoughts and then synthesize an answer by resolving them methodically. It has been shown that Self-Consistency (CoT-SC) (Wang et al., 2022) can be used to augment such methods by generating multiple thought sequences and then selecting the most accurate answer through majority voting. Recent meta-prompting techniques (Suzgun and Kalai, 2024) employ a uniform, task-independent prompting framework across multiple tasks, enabling a single LLM to iteratively refine its responses and dynamically adapt to diverse input queries. The Buffer of Thoughts (BoT) (Yang et al., 2024) framework extracts task-specific information, uses it to retrieve relevant thought templates from its meta-buffer, and then instantiates them with more task-specific reasoning structures before continuing with the reasoning process.

Refinement. Closed-loop approaches that allow an LLM to interact with an external environment can help in choosing and potentially revising an action. Notable examples are ReAct (Yao et al., 2023), REFINER (Paul et al., 2023) and Self-Refine (Madaan et al., 2023). Reflexion (Shinn et al., 2023) provides further linguistic feedback based on previous attempts, while AdaPlanner (Sun et al., 2023) also incorporates positive and negative feedback of an individual trajectory. Reason for future, act for now (RAFA) (Liu et al., 2024) develops further by planning a trajectory, gathering feedback for the potential planned actions, and then revising the trajectory based on the feedback.

Tree search. Thoughts are individual ideas or steps in reasoning, and when connected together, they can be modeled as a tree data structure. Tree search algorithms can then be used to explore a tree of thoughts and optimize the search for a final answer. In “Tree of Thoughts” (ToT), the authors utilize a value function that compares different branches to describe both DFS and BFS flavors of a guided tree-search (Yao et al., 2024). The closely related “Graph of Thoughts” (GoT) approach relaxes the assumption of a strict tree structure (Besta et al., 2024). Reasoning via Planning (RAP) (Hao et al., 2023) augments LLMs with a world model and employs Monte Carlo Tree Search (MCTS)-based planning to reduce the search complexity. Language Agent Tree Search (LATS) (Zhou et al., 2024) extends this concept by leveraging environment interactions, thereby eliminating the need for a world model.

3 Experiments

In this section, we provide details on the benchmarks and the different types of analyses that we performed for our experiments. Please refer to Appendix A for more information regarding the reasoning strategies used in this study. For additional results, please see Appendix B.
Figure 1: Comparing the cost-quality trade-off of IO, CoT, ToT, and Reflexion using GPT-4o-mini as the base
model. Within the indicated budget, simpler methods outperform more complex ones while remaining cost-efficient.
3.1 Benchmark tasks

Game of 24. Game of 24 is a mathematical puzzle, where four numbers are given, and the objective is to form an arithmetic expression that equals 24 using each number exactly once. The benchmark data consists of 1362 puzzles. Following ToT (Yao et al., 2024), we use the puzzles indexed 901-1000 as the test set. To evaluate the quality of the methods we use success rate, that is, the percentage of solved puzzles. For efficiency, we use cost (in US$).
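
For illustration, the snippet below sketches the kind of trivial deterministic verifier this task admits (an illustrative example, not code released with this work): it checks that a candidate expression contains only arithmetic symbols, uses each of the four given numbers exactly once, and evaluates to 24.

    # Illustrative Game of 24 verifier sketch: the trivial deterministic check
    # that decides whether a retrial is needed for this task.
    import re
    from collections import Counter

    def is_valid_24(expression: str, numbers: list[int]) -> bool:
        if not re.fullmatch(r"[\d\s+\-*/()]+", expression):   # arithmetic symbols only
            return False
        used = [int(tok) for tok in re.findall(r"\d+", expression)]
        if Counter(used) != Counter(numbers):                  # each number exactly once
            return False
        try:
            return abs(eval(expression) - 24) < 1e-6           # input sanitized above
        except (SyntaxError, ZeroDivisionError):
            return False

    # Example: is_valid_24("(10 - 4) * (13 - 9)", [4, 9, 10, 13]) returns True.
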
HumanEval. HumanEval is a programming puzzle that measures functional correctness for synthesizing programs from natural language docstrings. Following Reflexion (Shinn et al., 2023), we evaluate the methods on 161 Python programs. We use the pass@1 accuracy metric to measure quality. For efficiency, we use cost (in US$).

HotpotQA. HotpotQA (Zhilin et al., 2018) is a large-scale question-answering dataset designed to evaluate multi-hop reasoning. Multi-step methods, such as ToT, are allowed to use an interactive API environment which allows the agent to search for documents and look up specific information within them. Following previous methods (Zhou et al., 2024; Shinn et al., 2023), we evaluate on 100 randomly selected samples of their choice. We measure the quality of the answer based on whether there is an exact match (EM), given an oracle answer. For efficiency, we use cost (in US$).

3.2 Experiment setup

Baselines. We analyze the impact of retrials on four prompting strategies: (1) Standard IO prompting, (2) CoT (Wei et al., 2022), (3) ToT (Yao et al., 2024), and (4) Reflexion (Shinn et al., 2023).

Base model. We use GPT-4o-mini (OpenAI and et al., 2024) and LLaMA-3.3-70B (Grattafiori et al., 2024) as the base models.

3.3 Analysis

Cost and Number of Retrials Analysis. This experiment aims to evaluate the effectiveness of different methods under a constrained budget. Each method follows an iterative approach, attempting to solve all of the samples in the first trial. Unsolved samples are re-tried in subsequent trials until either the budget is exhausted or all of them are solved. If the budget runs out during an iteration, the method halts immediately. To investigate cost-effectiveness, we present the quality and cost for GPT-4o-mini in Fig. 1. The corresponding plot for Llama-3.3-70B can be found in the appendix (Fig. 4). To also display sample efficiency, we plot quality against the number of trials.

Temperature Analysis. To better understand the cost-effectiveness of each method, we repeated the previous experiments using different temperature values. Due to budget and time restrictions, we only conduct this analysis for Game of 24 across the CoT and ToT methods. The results can be found in Fig. 2 for GPT-4o-mini and Fig. 3 for Llama-3.3-70B.
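
For concreteness, the following simplified sketch (illustrative only, not the exact implementation behind these experiments) mirrors the budget-constrained retrial protocol described above; attempt and verify are hypothetical placeholders, and attempt is assumed to return both an answer and the dollar cost of that single call.

    # Simplified sketch of the budget-constrained retrial protocol: every sample is
    # attempted once per trial, unsolved samples are retried, and the run halts as
    # soon as the budget is spent, even in the middle of a trial.
    def run_retrial_experiment(samples, attempt, verify, budget_usd):
        spent, trials = 0.0, 0
        solved = set()
        while spent < budget_usd and len(solved) < len(samples):
            trials += 1
            for idx, sample in enumerate(samples):
                if idx in solved:
                    continue
                answer, call_cost = attempt(sample)   # one LLM attempt plus its cost
                spent += call_cost
                if verify(sample, answer):
                    solved.add(idx)
                if spent >= budget_usd:               # halt immediately mid-trial
                    break
        return len(solved) / len(samples), spent, trials
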
Figure 2: Comparing the cost-quality trade-off of CoT and ToT across different temperature levels using GPT-4o-mini as the base model. For CoT, the success rate increases strictly with temperature; the same holds for ToT, though not strictly.

4 Main results

Cost efficiency. Our results demonstrate that simple methods such as IO and CoT prompting are significantly more efficient than complex reasoning approaches such as ToT (Yao et al., 2024) or Reflexion (Shinn et al., 2023). Specifically, across all benchmarks and models, CoT prompting consistently outperforms alternative methods, often by a considerable margin. In fact, for Game of 24, CoT achieved a 94% success rate: a result that methods such as (Liu et al., 2024; Yao et al., 2024) would need multiple hundreds of dollars (≃ $600) to achieve on GPT-4.

Cost efficiency appears to be influenced not only by the method employed but also by the specific task. For instance, in the HumanEval and HotpotQA tasks, both CoT and IO methods reach a performance plateau earlier, whereas, in the Game of 24 task, the efficiency peak is more gradual. In addition, the base model plays a crucial role in performance disparities. This is particularly evident in the quality gap between GPT-4o-mini and Llama-3.3-70B in the Game of 24 and HotpotQA benchmarks. While the two methods exhibit comparable performance with the former model, CoT substantially outperforms IO in the latter. These findings suggest that base models with stronger inherent reasoning capabilities (e.g., GPT-4o-mini) can simultaneously enhance the cost-quality trade-off of simple methods such as IO prompting.

Temperature analysis. For GPT-4o-mini (Fig. 2), we can clearly see that a higher temperature value results in a better success rate. This is clearly visible for CoT, as the gap between each temperature result is significantly wide. Furthermore, we can even see that for a temperature value of 1.0 the experiment already achieves 100% accuracy at half of the allocated budget. For ToT, this is also predominantly true. A possible reason why this is not as evident in the latter case is that multi-query methods such as ToT often introduce complex prompting schemes that are affected in arbitrary ways by an increase in the temperature values.

5 Discussion and Concluding Insights

Summary of Findings. Our results in Figs. 1 and 4 have shown that methods such as Chain of Thought are more cost-efficient than complex reasoning strategies on a variety of models and tasks. In fact, for some tasks, we have achieved state-of-the-art quality performance with minimal resources in terms of cost and model capabilities. Based on the results, we have also shown that the cost-efficiency of a method is affected not only by the task but by the model as well. Finally, we have shown that the performance of the retrial concept can be improved even further by properly tuning the model's temperature.

Future work. In the future, we would like to extend our results by showcasing the behavior of the methods when allowing an even bigger budget. For the case of HumanEval, it is already apparent that Reflexion and Tree of Thoughts will surpass the other methods if allowed more budget. Even though this does not go against our claims regarding cost-efficiency, we would like to investigate further and showcase the full picture. Additionally, we aim to explore methods that leverage the occurrence of retrials to further optimize reasoning processes. Specifically, we seek to develop techniques that can exploit the iterative nature of retrials to improve efficiency, potentially reducing the number of attempts needed to reach a correct solution. Finally, our method uses trivial deterministic verifiers to decide whether an answer has been found and, in this case, no retry is needed. We aim to extend our method to tasks where there is no trivial deterministic verifier to validate the answer.
Overall, we hope that our work will inspire further research into the role of retrials in cost-efficient reasoning and the broader optimization of reasoning frameworks for LLM-based problem-solving.

Limitations

The main limitation of our work is that it cannot be applied to tasks where the answer cannot be directly verified. For example, in the Game of 24, the goal is to find a mathematical formula that evaluates to 24 given some input numbers. Once a formula is found, checking whether it equals 24 is straightforward. Similarly, in HumanEval, an answer is verified based on whether the generated function passes specific tests. This allows us to stop retrying a puzzle once a correct answer has been found. However, in cases similar to HotpotQA, the correct answer is hidden and only used for evaluation, making direct verification impossible during the solving process. As discussed in our future work, we aim to extend our method to tasks where answers cannot be deterministically verified in a trivial manner.

Acknowledgements

We thank Chris Schwiegelshohn and Niket Tandon for insightful discussions. Arora's lab is partly supported by grants from the Novo Nordisk Foundation (NNF24OC0099109), the Pioneer Centre for AI, and EU Horizon 2020 (101168951).

References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2024. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. PaLM 2 technical report. arXiv preprint arXiv:2305.10403.

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. 2024. Graph of thoughts: Solving elaborate problems with large language models. In AAAI, pages 17682–17690.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In NeurIPS, volume 33, pages 1877–1901.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. 2022. MineDojo: Building open-ended embodied agents with internet-scale knowledge. In NeurIPS: Datasets and Benchmarks Track.

Aaron Grattafiori et al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. 2023. Reasoning with language model is planning with world model. In EMNLP.

Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2023. Decomposed prompting: A modular approach for solving complex tasks. In ICLR.

Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In NeurIPS, pages 22199–22213.

Zhihan Liu, Hao Hu, Shenao Zhang, Hongyi Guo, Shuqi Ke, Boyi Liu, and Zhaoran Wang. 2024. Reason for future, act for now: A principled architecture for autonomous LLM agents. In ICML.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2023. Self-Refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651.

Maxwell I. Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. 2021. Show your work: Scratchpads for intermediate computation with language models. CoRR, abs/2112.00114.

OpenAI and Hurst et al. 2024. GPT-4o system card. Preprint, arXiv:2410.21276.

Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert West, and Boi Faltings. 2023. REFINER: Reasoning feedback on intermediate representations. arXiv preprint arXiv:2304.01904.

Nearchos Potamitis, Lars Klein, Roland Aydin, Caglar Gulcehre, Robert West, and Akhil Arora. 2024. Fleet of agents: Coordinated problem solving with large language models. Preprint, arXiv:2405.06691.
Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. In NeurIPS.

Bilgehan Sel, Ahmad Tawaha, Vanshaj Khattar, Ruoxi Jia, and Ming Jin. 2024. Algorithm of thoughts: Enhancing exploration of ideas in large language models. In ICML.

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. In NeurIPS, pages 8634–8652.

Haotian Sun, Yuchen Zhuang, Lingkai Kong, Bo Dai, and Chao Zhang. 2023. AdaPlanner: Adaptive planning from feedback with language models. In NeurIPS, volume 36, pages 58202–58245.

Mirac Suzgun and Adam Tauman Kalai. 2024. Meta-prompting: Enhancing language models with task-agnostic scaffolding. arXiv preprint arXiv:2401.12954.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. LLaMA: Open and efficient foundation language models. ArXiv, abs/2302.13971.

Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought reasoning in language models. In ICLR.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, pages 24824–24837.

Robert West, Joelle Pineau, and Doina Precup. 2009. Wikispeedia: An online game for inferring semantic distances between concepts. In IJCAI, pages 1598–1603.

Ling Yang, Zhaochen Yu, Tianjun Zhang, Shiyi Cao, Minkai Xu, Wentao Zhang, Joseph E. Gonzalez, and Bin Cui. 2024. Buffer of thoughts: Thought-augmented reasoning with large language models. In NeurIPS.

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. 2022. WebShop: Towards scalable real-world web interaction with grounded language agents. In NeurIPS, pages 20744–20757.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024. Tree of thoughts: Deliberate problem solving with large language models. In NeurIPS, volume 36.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models. In ICLR.

Yang Zhilin, Qi Peng, Zhang Saizheng, Bengio Yoshua, Cohen William, Salakhutdinov Ruslan, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP.

Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. 2024. Language agent tree search unifies reasoning, acting, and planning in language models. In ICML.

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le, and Ed H. Chi. 2023. Least-to-most prompting enables complex reasoning in large language models. In ICLR.

A Methods

Input-Output (IO). The simplest prompting style, which uses the LLM to directly generate an output, with no intermediate steps.

Chain-of-Thought (CoT). Solves the problem step by step by decomposing it into a sequence of thoughts (Wei et al., 2022).

Reflexion. Generates linguistic feedback that is utilized during subsequent runs (Shinn et al., 2023).

Tree-of-Thoughts (ToT). Decomposes the problem into multiple chains of thought, organized in a tree structure. Thought evaluation and search traversal algorithms are utilized to solve the problem (Yao et al., 2024).

B Additional results

Cost analysis. In Fig. 4, we assess the cost-effectiveness of various prompting methods using the LLaMA 3.3 70B model. Similar to the results observed with GPT-4o-mini, both IO and CoT prompting strategies demonstrate significantly higher cost-efficiency. Notably, CoT maintains a substantial lead in tasks such as Game of 24 and HumanEval. However, for HotpotQA, the ToT approach slightly outperforms CoT.

Retrial analysis. In Figs. 5 and 6, we evaluate the performance of each method as a function of the number of re-trials, under a fixed budget constraint. For Game of 24 and HotpotQA, ToT and Reflexion exhibit greater sample efficiency, achieving strong performance with relatively few re-trials. However, IO and CoT ultimately outperform them as the number of re-trials increases. In contrast, for HumanEval, IO and CoT are already more sample-efficient from the outset.

Temperature analysis. In Fig. 3, we present the cost-effectiveness of CoT and ToT prompting for LLaMA 3.3 70B across varying temperature settings, under the same constrained budget as before. Unlike the results observed with GPT-4o-mini (Fig. 2), the performance trends here have not yet plateaued. We hypothesize that this is due to LLaMA 3.3 70B being approximately three times more expensive than GPT-4o-mini, placing the experiment in its early stages and limiting the conclusiveness of the results. In future work, we plan to increase the budget across all experiments to further investigate this behavior.

C Implementation Details

Platforms. GPT models were accessed through the OpenAI API, while the Llama models were accessed via the TogetherAI API.
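
For illustration only, a single retrial attempt could be issued roughly as follows with the OpenAI Python client; the model name matches Table 1, while prompt construction and verification are omitted. To our understanding, TogetherAI offers an OpenAI-compatible endpoint, so a similar call with a different client configuration could cover the Llama runs, but that is an assumption of this sketch rather than a detail stated here.

    # Illustrative sketch of one retrial attempt via the OpenAI Python SDK (v1.x).
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def single_attempt(prompt: str, temperature: float = 1.0) -> str:
        # One independent attempt; no feedback from earlier attempts is included.
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        return response.choices[0].message.content
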
Model checkpoints and prices. To compute the costs of our experiments, we used the current model prices indicated by OpenAI and TogetherAI for the respective models. The specific model snapshots we used, along with their respective prices, are presented in Table 1.

Figure 3: Comparing the cost-quality trade-off of CoT and ToT, using Llama-3.3-70B as the base model, across different temperature levels.
Model            US$ per 1M prompt tokens    US$ per 1M completion tokens
gpt-4o-mini      0.15                        0.60
LLaMA-3.3-70B    0.88                        0.88

Table 1: Model snapshot prices. OpenAI and TogetherAI prices for each model used during the implementation of the project.
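
As a sanity check on how the reported costs follow from Table 1, the short sketch below converts token usage into a per-call dollar cost; the token counts in the example are made-up values, not measurements from our runs.

    # Illustrative cost calculation from token usage and the Table 1 prices
    # (expressed in US$ per 1M tokens).
    PRICES = {
        "gpt-4o-mini":   {"prompt": 0.15, "completion": 0.60},
        "LLaMA-3.3-70B": {"prompt": 0.88, "completion": 0.88},
    }

    def call_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
        p = PRICES[model]
        return (prompt_tokens * p["prompt"] + completion_tokens * p["completion"]) / 1_000_000

    # Example with made-up token counts: 1,200 prompt + 300 completion tokens on
    # gpt-4o-mini costs 1200 * 0.15/1e6 + 300 * 0.60/1e6 = 0.00036 US$.
    print(call_cost("gpt-4o-mini", 1200, 300))
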

Figure 4: Comparing the cost-quality trade-off of IO, CoT, ToT, and Reflexion using Llama-3.3-70B as the base model. Within the indicated budget, simpler methods have similar or better performance than complex ones while remaining cost-efficient.

Figure 5: Comparing the sample-quality trade-off of IO, CoT, ToT, and Reflexion using GPT-4o-mini as the base model. Within the indicated budget, simpler methods outperform more complex ones, though they are more sample-efficient only in the case of the HumanEval task.

Figure 6: Comparing the sample-quality trade-off of IO, CoT, ToT, and Reflexion using Llama-3.3-70B as the base model. Within the indicated budget, simpler methods perform similarly to or better than complex ones, though they are more sample-efficient only in the case of the HumanEval task.
