
Are Retrials All You Need? Enhancing Large Language Model Reasoning Without Verbalized Feedback

Nearchos Potamitis, Aarhus University (nearchos.potamitis@cs.au.dk)
Akhil Arora, Aarhus University (akhil.arora@cs.au.dk)

arXiv:2504.12951v1 [cs.CL] 17 Apr 2025

Abstract

Recent advancements in large language models (LLMs) have catalyzed the development of general-purpose autonomous agents, demonstrating remarkable performance in complex reasoning tasks across various domains. This surge has spurred the evolution of a plethora of prompt-based reasoning frameworks. A recent focus has been on iterative reasoning strategies that refine outputs through self-evaluation and verbalized feedback. However, these strategies require additional computational complexity to enable models to recognize and correct their mistakes, leading to a significant increase in their cost. In this work, we introduce the concept of “retrials without feedback”, an embarrassingly simple yet powerful mechanism for enhancing reasoning frameworks by allowing LLMs to retry problem-solving attempts upon identifying incorrect answers. Unlike conventional iterative refinement methods, our method does not require explicit self-reflection or verbalized feedback, simplifying the refinement process. Our findings indicate that simpler retrial-based approaches often outperform more sophisticated reasoning frameworks, suggesting that the benefits of complex methods may not always justify their computational costs. By challenging the prevailing assumption that more intricate reasoning strategies inherently lead to better performance, our work offers new insights into how simpler, more efficient approaches can achieve optimal results. So, are retrials all you need?

1 Introduction

With strong reasoning and problem-solving abilities, large language models (LLMs) (Brown et al., 2020) such as GPT-4 (Achiam et al., 2024; OpenAI and et al., 2024), LLaMA (Touvron et al., 2023a,b; Grattafiori et al., 2024), and PaLM (Anil et al., 2023), have sparked a new-found interest in building general-purpose autonomous agents. LLM-based agents have portrayed excellent performance on reasoning (Cobbe et al., 2021) and knowledge-intensive tasks (West et al., 2009), often requiring interactions with complex environments, such as playing complex video games (Fan et al., 2022), performing web navigation (Yao et al., 2022), or enabling tool-use (Schick et al., 2023). Naturally, the rise of LLM-based agents has contributed to the prosperity of prompt-based reasoning frameworks (Wei et al., 2022; Wang et al., 2023; Besta et al., 2024; Sel et al., 2024; Yang et al., 2024; Yao et al., 2024; Zhou et al., 2024; Shinn et al., 2023; Yao et al., 2023; Potamitis et al., 2024) that further enhance the problem-solving and reasoning abilities of LLMs.

Iterative refinement and its challenges. A burgeoning area of research in this field involves the exploration of iterative reasoning strategies that refine and improve responses through self-evaluation. Methods such as Self-Refine (Madaan et al., 2023) and Reflexion (Shinn et al., 2023) employ explicit verbalized feedback to guide an LLM in correcting its mistakes and refining its outputs. While effective, these approaches add an additional layer of complexity, requiring the model to both recognize its own errors and articulate useful self-correction strategies. Moreover, owing to an incremental increase in the context window size, verbalized feedback monotonically increases the cost for every subsequent iteration of the overall reasoning pipeline, thereby rendering iterative refinement-based methods exorbitantly expensive. For instance, with GPT-4 as the base model, RAFA (Liu et al., 2024), the state-of-the-art refinement strategy, requires a whopping ≃ $600 on the Game of 24 benchmark task (Yao et al., 2024) comprising just 100 examples in the test set.
Present work. In this paper, we introduce the concept of “retrials without feedback”, an embarrassingly simple yet effective mechanism that enhances reasoning frameworks by allowing them to retry a problem-solving attempt whenever an incorrect answer is identified. Unlike Reflexion (Shinn et al., 2023), which relies on explicit self-generated feedback, retrials operate without requiring verbalized introspection: an incorrect answer simply triggers another attempt until the correct solution is found or a predefined computational budget is exhausted.

Our results show that under the retrial mechanism, simpler methods such as chain-of-thoughts (Wei et al., 2022) often outperform more sophisticated reasoning frameworks such as tree-of-thoughts (Yao et al., 2024) or Reflexion (Shinn et al., 2023). This suggests that, when given the opportunity to retry, the added complexity of sophisticated reasoning frameworks and self-reflective approaches may not always justify their computational cost. This raises a fundamental question: Are retrials all you need? By reframing the evaluation of reasoning strategies through the lens of cost efficiency, our work challenges the assumption that more intricate frameworks necessarily lead to better performance and highlights the importance of reconsidering optimization strategies for reasoning with LLMs.
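
To make the mechanism concrete, the following minimal sketch (illustrative only, not the exact implementation used in our experiments) shows a retrial loop for a single problem. Here, generate_answer and is_correct are hypothetical placeholders for the underlying prompting strategy and the task's deterministic verifier, and max_attempts stands in for the computational budget.

    # Minimal sketch of "retrials without feedback" (illustrative only).
    # generate_answer() and is_correct() are hypothetical placeholders for the
    # underlying reasoning strategy (e.g., IO or CoT prompting) and the task's
    # deterministic verifier; no verbalized feedback is passed between attempts.
    def solve_with_retrials(problem, generate_answer, is_correct, max_attempts=10):
        for attempt in range(1, max_attempts + 1):
            answer = generate_answer(problem)   # independent attempt, no reflection
            if is_correct(problem, answer):     # trivial deterministic check
                return answer, attempt
        return None, max_attempts               # budget exhausted, unsolved
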
2 Related Work

In this section, we review works that overlap closely with our study.

Prompt-based reasoning. Recent research focuses on developing strategies to enhance the reasoning capabilities of LLMs. Few-shot prompting employs demonstrations of high-quality input/output samples to exploit the eagerness of the LLMs to imitate patterns seen in their context window (Brown et al., 2020). Algorithm of thoughts (AoT) (Sel et al., 2024) goes a step further by including algorithmic examples within the prompt to propel the LLM through algorithmic reasoning pathways. Chain-of-Thought (CoT) prompting (Nye et al., 2021; Wei et al., 2022; Kojima et al., 2022) as well as other variants such as Decomposed Prompting (Khot et al., 2023) and Least-to-Most (Zhou et al., 2023) guide LLMs to decompose a complex question into a sequence of thoughts and then synthesize an answer by resolving them methodically. It has been shown that Self-Consistency (CoT-SC) (Wang et al., 2022) can be used to augment such methods by generating multiple thought sequences and then selecting the most accurate answer through majority voting. Recent meta-prompting techniques (Suzgun and Kalai, 2024) employ a uniform, task-independent prompting framework across multiple tasks, enabling a single LLM to iteratively refine its responses and dynamically adapt to diverse input queries. The Buffer of Thoughts (BoT) (Yang et al., 2024) framework extracts task-specific information, uses it to retrieve relevant thought templates from its meta-buffer, and then instantiates them with more task-specific reasoning structures before continuing with the reasoning process.

Refinement. Closed-loop approaches that allow an LLM to interact with an external environment can help in choosing and potentially revising an action. Notable examples are ReAct (Yao et al., 2023), REFINER (Paul et al., 2023) and Self-Refine (Madaan et al., 2023). Reflexion (Shinn et al., 2023) provides further linguistic feedback based on previous attempts, while AdaPlanner (Sun et al., 2023) also incorporates positive and negative feedback of an individual trajectory. Reason for future, act for now (RAFA) (Liu et al., 2024) develops further by planning a trajectory, gathering feedback for the potential planned actions, and then revising the trajectory based on the feedback.

Tree search. Thoughts are individual ideas or steps in reasoning, and when connected together, they can be modeled as a tree data structure. Tree search algorithms can then be used to explore a tree of thoughts and optimize the search for a final answer. In “Tree of Thoughts” (ToT), the authors utilize a value function that compares different branches to describe both DFS and BFS flavors of a guided tree-search (Yao et al., 2024). The closely related “Graph of Thoughts” (GoT) approach relaxes the assumption of a strict tree structure (Besta et al., 2024). Reasoning via Planning (RAP) (Hao et al., 2023) augments LLMs with a world model and employs Monte Carlo Tree Search (MCTS)-based planning to reduce the search complexity. Language Agent Tree Search (LATS) (Zhou et al., 2024) extends this concept by leveraging environment interactions, thereby eliminating the need for a world model.

3 Experiments

In this section, we provide details on the benchmarks and the different types of analyses that we performed for our experiments. Please refer to Appendix A for more information regarding the reasoning strategies used in this study. For additional results, please see Appendix B.
Figure 1: Comparing the cost-quality trade-off of IO, CoT, ToT, and Reflexion using GPT-4o-mini as the base
model. Within the indicated budget, simpler methods outperform more complex ones while remaining cost-efficient.
3.1 Benchmark tasks

Game of 24. Game of 24 is a mathematical puzzle, where four numbers are given, and the objective is to form an arithmetic expression that equals 24 using each number exactly once. The benchmark data consists of 1362 puzzles. Following ToT (Yao et al., 2024), we use the puzzles indexed 901-1000 as the test set. To evaluate the quality of the methods we use success rate, that is, the percentage of solved puzzles. For efficiency, we use cost (in US$).
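
For illustration, the snippet below sketches the kind of trivial deterministic verifier this task admits (an illustrative example, not code released with this work): it checks that a candidate expression contains only arithmetic symbols, uses each of the four given numbers exactly once, and evaluates to 24.

    # Illustrative Game of 24 verifier sketch: the trivial deterministic check
    # that decides whether a retrial is needed for this task.
    import re
    from collections import Counter

    def is_valid_24(expression: str, numbers: list[int]) -> bool:
        if not re.fullmatch(r"[\d\s+\-*/()]+", expression):   # arithmetic symbols only
            return False
        used = [int(tok) for tok in re.findall(r"\d+", expression)]
        if Counter(used) != Counter(numbers):                  # each number exactly once
            return False
        try:
            return abs(eval(expression) - 24) < 1e-6           # input sanitized above
        except (SyntaxError, ZeroDivisionError):
            return False

    # Example: is_valid_24("(10 - 4) * (13 - 9)", [4, 9, 10, 13]) returns True.
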
HumanEval. HumanEval is a programming puzzle that measures functional correctness for synthesizing programs from natural language docstrings. Following Reflexion (Shinn et al., 2023), we evaluate the methods on 161 Python programs. We use the pass@1 accuracy metric to measure quality. For efficiency, we use cost (in US$).

HotpotQA. HotpotQA (Zhilin et al., 2018) is a large-scale question-answering dataset designed to evaluate multi-hop reasoning. Multi-step methods, such as ToT, are allowed to use an interactive API environment which allows the agent to search for documents and look up specific information within them. Following previous methods (Zhou et al., 2024; Shinn et al., 2023), we evaluate on 100 randomly selected samples of their choice. We measure the quality of the answer based on whether there is an exact match (EM), given an oracle answer. For efficiency, we use cost (in US$).

3.2 Experiment setup

Baselines. We analyze the impact of retrials on four prompting strategies: (1) Standard IO prompting, (2) CoT (Wei et al., 2022), (3) ToT (Yao et al., 2024), and (4) Reflexion (Shinn et al., 2023).

Base model. We use GPT-4o-mini (OpenAI and et al., 2024) and LLaMA-3.3-70B (Grattafiori et al., 2024) as the base models.

3.3 Analysis

Cost and Number of Retrials Analysis. This experiment aims to evaluate the effectiveness of different methods under a constrained budget. Each method follows an iterative approach, attempting to solve all of the samples in the first trial. Unsolved samples are re-tried in subsequent trials until either the budget is exhausted or all of them are solved. If the budget runs out during an iteration, the method halts immediately. To investigate cost-effectiveness, we present the quality and cost for GPT-4o-mini in Fig. 1. The corresponding plot for Llama-3.3-70B can be found in the appendix (Fig. 4). To also display sample efficiency, we plot quality against the number of trials.

Temperature Analysis. To better understand the cost-effectiveness of each method, we repeated the previous experiments using different temperature values. Due to budget and time restrictions, we only conduct this analysis for Game of 24 across the CoT and ToT methods. The results can be found in Fig. 2 for GPT-4o-mini and Fig. 3 for Llama-3.3-70B.
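
For concreteness, the following simplified sketch (illustrative only, not the exact implementation behind these experiments) mirrors the budget-constrained retrial protocol described above; attempt and verify are hypothetical placeholders, and attempt is assumed to return both an answer and the dollar cost of that single call.

    # Simplified sketch of the budget-constrained retrial protocol: every sample is
    # attempted once per trial, unsolved samples are retried, and the run halts as
    # soon as the budget is spent, even in the middle of a trial.
    def run_retrial_experiment(samples, attempt, verify, budget_usd):
        spent, trials = 0.0, 0
        solved = set()
        while spent < budget_usd and len(solved) < len(samples):
            trials += 1
            for idx, sample in enumerate(samples):
                if idx in solved:
                    continue
                answer, call_cost = attempt(sample)   # one LLM attempt plus its cost
                spent += call_cost
                if verify(sample, answer):
                    solved.add(idx)
                if spent >= budget_usd:               # halt immediately mid-trial
                    break
        return len(solved) / len(samples), spent, trials
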
Figure 2: Comparing the cost-quality trade-off of CoT and ToT across different temperature levels using GPT-4o-mini as the base model. For CoT, the success rate increases strictly with temperature; the same holds for ToT, though not strictly.

4 Main results

Cost efficiency. Our results demonstrate that simple methods such as IO and CoT prompting are significantly more efficient than complex reasoning approaches such as ToT (Yao et al., 2024) or Reflexion (Shinn et al., 2023). Specifically, across all benchmarks and models, CoT prompting consistently outperforms alternative methods, often by a considerable margin. In fact, for Game of 24, CoT achieved a 94% success rate: a result that methods such as (Liu et al., 2024; Yao et al., 2024) would need multiple hundreds of dollars (≃ $600) to achieve on GPT-4.

Cost efficiency appears to be influenced not only by the method employed but also by the specific task. For instance, in the HumanEval and HotpotQA tasks, both CoT and IO methods reach a performance plateau earlier, whereas, in the Game of 24 task, the efficiency peak is more gradual. In addition, the base model plays a crucial role in performance disparities. This is particularly evident in the quality gap between GPT-4o-mini and Llama-3.3-70B in the Game of 24 and HotpotQA benchmarks. While the two methods exhibit comparable performance with the former model, CoT substantially outperforms IO in the latter. These findings suggest that base models with stronger inherent reasoning capabilities (e.g., GPT-4o-mini) can simultaneously enhance the cost-quality trade-off of simple methods such as IO prompting.

Temperature analysis. For GPT-4o-mini (Fig. 2), we can clearly see that a higher temperature value results in a better success rate. This is clearly visible for CoT, as the gap between each temperature result is significantly wide. Furthermore, we can even see that for a temperature value of 1.0 the experiment already achieves 100% accuracy at half of the allocated budget. For ToT, this is also predominantly true. A possible reason why this is not as evident in the latter case is that multi-query methods such as ToT often introduce complex prompting schemes that are affected in arbitrary ways by an increase in the temperature values.

5 Discussion and Concluding Insights

Summary of Findings. Our results in Figs. 1 and 4 have shown that methods such as Chain of Thought are more cost-efficient than complex reasoning strategies on a variety of models and tasks. In fact, for some tasks, we have achieved state-of-the-art quality performance with minimal resources in terms of cost and model capabilities. Based on the results, we have also shown that the cost-efficiency of a method is affected not only by the task but by the model as well. Finally, we have shown that the performance of the retrial concept can be improved even further by properly tuning the model's temperature.

Future work. In the future, we would like to extend our results by showcasing the behavior of the methods when allowing an even bigger budget. For the case of HumanEval, it is already apparent that Reflexion and Tree of Thoughts will surpass the other methods if allowed more budget. Even though this does not go against our claims regarding cost-efficiency, we would like to investigate further and showcase the full picture. Additionally, we aim to explore methods that leverage the occurrence of retrials to further optimize reasoning processes. Specifically, we seek to develop techniques that can exploit the iterative nature of retrials to improve efficiency, potentially reducing the number of attempts needed to reach a correct solution. Finally, our method uses trivial deterministic verifiers to decide whether an answer has been found and, in this case, no retry is needed. We aim to extend our method to tasks where there is no trivial deterministic verifier to validate the answer.
Overall, we hope that our work will inspire further research into the role of retrials in cost-efficient reasoning and the broader optimization of reasoning frameworks for LLM-based problem-solving.

Limitations

The main limitation of our work is that it cannot be applied to tasks where the answer cannot be directly verified. For example, in the Game of 24, the goal is to find a mathematical formula that evaluates to 24 given some input numbers. Once a formula is found, checking whether it equals 24 is straightforward. Similarly, in HumanEval, an answer is verified based on whether the generated function passes specific tests. This allows us to stop retrying a puzzle once a correct answer has been found. However, in cases similar to HotpotQA, the correct answer is hidden and only used for evaluation, making direct verification impossible during the solving process. As discussed in our future work, we aim to extend our method to tasks where answers cannot be deterministically verified in a trivial manner.

Acknowledgements

We thank Chris Schwiegelshohn and Niket Tandon for insightful discussions. Arora's lab is partly supported by grants from the Novo Nordisk Foundation (NNF24OC0099109), the Pioneer Centre for AI, and EU Horizon 2020 (101168951).

References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2024. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. PaLM 2 technical report. arXiv preprint arXiv:2305.10403.

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. 2024. Graph of thoughts: Solving elaborate problems with large language models. In AAAI, pages 17682–17690.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In NeurIPS, volume 33, pages 1877–1901.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. 2022. MineDojo: Building open-ended embodied agents with internet-scale knowledge. In NeurIPS: Datasets and Benchmarks Track.

Aaron Grattafiori et al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. 2023. Reasoning with language model is planning with world model. In EMNLP.

Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2023. Decomposed prompting: A modular approach for solving complex tasks. In ICLR.

Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In NeurIPS, pages 22199–22213.

Zhihan Liu, Hao Hu, Shenao Zhang, Hongyi Guo, Shuqi Ke, Boyi Liu, and Zhaoran Wang. 2024. Reason for future, act for now: A principled architecture for autonomous LLM agents. In ICML.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2023. Self-Refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651.

Maxwell I. Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. 2021. Show your work: Scratchpads for intermediate computation with language models. CoRR, abs/2112.00114.

OpenAI and Hurst et al. 2024. GPT-4o system card. Preprint, arXiv:2410.21276.

Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert West, and Boi Faltings. 2023. REFINER: Reasoning feedback on intermediate representations. arXiv preprint arXiv:2304.01904.

Nearchos Potamitis, Lars Klein, Roland Aydin, Caglar Gulcehre, Robert West, and Akhil Arora. 2024. Fleet of agents: Coordinated problem solving with large language models. Preprint, arXiv:2405.06691.
Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. In NeurIPS.

Bilgehan Sel, Ahmad Tawaha, Vanshaj Khattar, Ruoxi Jia, and Ming Jin. 2024. Algorithm of thoughts: Enhancing exploration of ideas in large language models. In ICML.

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. In NeurIPS, pages 8634–8652.

Haotian Sun, Yuchen Zhuang, Lingkai Kong, Bo Dai, and Chao Zhang. 2023. AdaPlanner: Adaptive planning from feedback with language models. In NeurIPS, volume 36, pages 58202–58245.

Mirac Suzgun and Adam Tauman Kalai. 2024. Meta-prompting: Enhancing language models with task-agnostic scaffolding. arXiv preprint arXiv:2401.12954.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. LLaMA: Open and efficient foundation language models. ArXiv, abs/2302.13971.

Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought reasoning in language models. In ICLR.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, pages 24824–24837.

Robert West, Joelle Pineau, and Doina Precup. 2009. Wikispeedia: An online game for inferring semantic distances between concepts. In IJCAI, pages 1598–1603.

Ling Yang, Zhaochen Yu, Tianjun Zhang, Shiyi Cao, Minkai Xu, Wentao Zhang, Joseph E. Gonzalez, and Bin Cui. 2024. Buffer of thoughts: Thought-augmented reasoning with large language models. In NeurIPS.

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. 2022. WebShop: Towards scalable real-world web interaction with grounded language agents. In NeurIPS, pages 20744–20757.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024. Tree of thoughts: Deliberate problem solving with large language models. In NeurIPS, volume 36.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models. In ICLR.

Yang Zhilin, Qi Peng, Zhang Saizheng, Bengio Yoshua, Cohen William, Salakhutdinov Ruslan, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP.

Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. 2024. Language agent tree search unifies reasoning, acting, and planning in language models. In ICML.

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le, and Ed H. Chi. 2023. Least-to-most prompting enables complex reasoning in large language models. In ICLR.

A Methods

Input-Output (IO). The simplest prompting style, which uses the LLM to directly generate an output, with no intermediate steps.

Chain-of-Thought (CoT). Solves the problem step by step by decomposing it into a sequence of thoughts (Wei et al., 2022).

Reflexion. Generates linguistic feedback that is utilized during subsequent runs (Shinn et al., 2023).

Tree-of-Thoughts (ToT). Decomposes the problem into multiple chains of thought, organized in a tree structure. Thought evaluation and search traversal algorithms are utilized to solve the problem (Yao et al., 2024).

B Additional results

Cost analysis. In Fig. 4, we assess the cost-effectiveness of various prompting methods using the LLaMA 3.3 70B model. Similar to the results observed with GPT-4o-mini, both IO and CoT prompting strategies demonstrate significantly higher cost-efficiency. Notably, CoT maintains a substantial lead in tasks such as Game of 24 and HumanEval. However, for HotpotQA, the ToT approach slightly outperforms CoT.

Retrial analysis. In Figs. 5 and 6, we evaluate the performance of each method as a function of the number of re-trials, under a fixed budget constraint. For Game of 24 and HotpotQA, ToT and Reflexion exhibit greater sample efficiency, achieving strong performance with relatively few re-trials. However, IO and CoT ultimately outperform them as the number of re-trials increases. In contrast, for HumanEval, IO and CoT are already more sample-efficient from the outset.

Temperature analysis. In Fig. 3, we present the cost-effectiveness of CoT and ToT prompting for LLaMA 3.3 70B across varying temperature settings, under the same constrained budget as before. Unlike the results observed with GPT-4o-mini (Fig. 2), the performance trends here have not yet plateaued. We hypothesize that this is due to LLaMA 3.3 70B being approximately three times more expensive than GPT-4o-mini, placing the experiment in its early stages and limiting the conclusiveness of the results. In future work, we plan to increase the budget across all experiments to further investigate this behavior.

C Implementation Details

Platforms. GPT models were accessed through the OpenAI API, while the Llama models were accessed via the TogetherAI API.
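
For illustration only, a single retrial attempt could be issued roughly as follows with the OpenAI Python client; the model name matches Table 1, while prompt construction and verification are omitted. To our understanding, TogetherAI offers an OpenAI-compatible endpoint, so a similar call with a different client configuration could cover the Llama runs, but that is an assumption of this sketch rather than a detail stated here.

    # Illustrative sketch of one retrial attempt via the OpenAI Python SDK (v1.x).
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def single_attempt(prompt: str, temperature: float = 1.0) -> str:
        # One independent attempt; no feedback from earlier attempts is included.
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        return response.choices[0].message.content
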
Model checkpoints and prices. To compute the costs of our experiments, we used the current model prices indicated by OpenAI and TogetherAI for the respective models. The specific model snapshots we used, along with their respective prices, are presented in Table 1.

Figure 3: Comparing the cost-quality trade-off of CoT and ToT, using Llama-3.3-70B as the base model, across different temperature levels.
Model            US$ per 1M prompt tokens    US$ per 1M completion tokens
gpt-4o-mini      0.15                        0.60
LLaMA-3.3-70B    0.88                        0.88

Table 1: Model snapshot prices. OpenAI and TogetherAI prices for each model used during the implementation of the project.
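
As a sanity check on how the reported costs follow from Table 1, the short sketch below converts token usage into a per-call dollar cost; the token counts in the example are made-up values, not measurements from our runs.

    # Illustrative cost calculation from token usage and the Table 1 prices
    # (expressed in US$ per 1M tokens).
    PRICES = {
        "gpt-4o-mini":   {"prompt": 0.15, "completion": 0.60},
        "LLaMA-3.3-70B": {"prompt": 0.88, "completion": 0.88},
    }

    def call_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
        p = PRICES[model]
        return (prompt_tokens * p["prompt"] + completion_tokens * p["completion"]) / 1_000_000

    # Example with made-up token counts: 1,200 prompt + 300 completion tokens on
    # gpt-4o-mini costs 1200 * 0.15/1e6 + 300 * 0.60/1e6 = 0.00036 US$.
    print(call_cost("gpt-4o-mini", 1200, 300))
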

Figure 4: Comparing the cost-quality trade-off of IO, CoT, ToT, and Reflexion using Llama-3.3-70B as the base model. Within the indicated budget, simpler methods have similar or better performance than complex ones while remaining cost-efficient.

Figure 5: Comparing the sample-quality trade-off of IO, CoT, ToT, and Reflexion using GPT-4o-mini as the base model. Within the indicated budget, simpler methods outperform more complex ones, though they are more sample-efficient only in the case of the HumanEval task.

Figure 6: Comparing the sample-quality trade-off of IO, CoT, ToT, and Reflexion using Llama-3.3-70B as the base model. Within the indicated budget, simpler methods perform similarly to or better than complex ones, though they are more sample-efficient only in the case of the HumanEval task.
