
Thinking Machines: A Survey of LLM based Reasoning Strategies

Dibyanayan Bandyopadhyay 1, Soham Bhattacharjee 1, Asif Ekbal 1,2

1 Department of Computer Science and Engineering, IIT Patna
2 School of AI and Data Science, IIT Jodhpur
dibyanayan_2321cs14@iitp.ac.in, sohambhattacharjeenghss@gmail.com, asif@iitj.ac.in

Abstract

Large Language Models (LLMs) are highly proficient in language-based tasks. Their language capabilities have positioned them at the forefront of the future AGI (Artificial General Intelligence) race. However, on closer inspection, Valmeekam et al. (2024); Zečević et al. (2023); Wu et al. (2024) highlight a significant gap between their language proficiency and reasoning abilities. Reasoning in LLMs and Vision Language Models (VLMs) aims to bridge this gap by enabling these models to think and re-evaluate their actions and responses. Reasoning is an essential capability for complex problem-solving and a necessary step toward establishing trust in Artificial Intelligence (AI). This will make AI suitable for deployment in sensitive domains such as healthcare, banking, law, defense, and security. In recent times, with the advent of powerful reasoning models like OpenAI o1 and DeepSeek R1, reasoning endowment has become a critical research topic in LLMs. In this paper, we provide a detailed overview and comparison of existing reasoning techniques and present a systematic survey of reasoning-imbued language models. We also study current challenges and present our findings.

"Thinking is the hardest work there is, which is probably why so few engage in it." ~Henry Ford

1 Introduction

According to epistemology—the philosophy of knowledge—"reasoning" is the ability to draw inferences from evidence or premises (Steup and Neta, 2024). This capacity enables humans to think, acquire knowledge, and make informed decisions. While reasoning comes naturally to humans, emulating genuine reasoning in language models remains challenging. GPT-3 (Brown et al., 2020) first exhibited low-level reasoning and instruction following, which was later advanced in ChatGPT (GPT-3.5) and GPT-4, achieving state-of-the-art performance across many tasks. The introduction of chain-of-thought prompting (Wei et al., 2023)—which decomposes complex problems into manageable steps—further enhanced these models, prompting claims that GPT-4 shows "sparks of AGI" (Bubeck et al., 2023).

However, research has shown that although LLMs can emulate reasoning on structured benchmarks, they do not truly reason as humans do (Valmeekam et al., 2024; Wei et al., 2023). For example, solving a complex mathematical proof requires decomposing the problem into subproblems and iteratively refining a solution—an ability that LLMs often struggle with without additional guidance. This insight has spurred efforts to evoke genuine reasoning in LLMs using techniques beyond simple scaling. Notably, OpenAI's o1 models—despite having the same parameter count as GPT-3.5—outperform GPT-4 (10 times larger) on mathematical and reasoning tasks, indicating that scaling pre-training has diminishing returns. Consequently, researchers are turning to innovative test-time techniques that optimize pre-trained models.

Classical methods, such as Monte Carlo Tree Search (MCTS) and reward modeling, as used in AlphaGo (Silver et al., 2016), are now re-imagined for LLMs. For instance, Reflexion (Shinn et al., 2023) allows an LLM to generate multiple chains of thought, evaluates them with a reward model, and iteratively refines its reasoning. This approach not only searches the model's internal thought space for better conclusions but also enables self-training techniques that use generated reasoning traces as extra training data. These advances mark a shift from brute-force scaling toward leveraging innovative inference-time strategies for superior reasoning.
In this paper, we present a comprehensive review of current strategies for logical and critical thinking in LLMs. We subdivide the field with respect to three key paradigms—reinforcement learning, test-time computation, and self-training—applied during both training and inference. Our work stands out because:

1. Unlike previous surveys like (Qiao et al., 2023; Huang and Chang, 2023), our work is up-to-date with the latest advancements in LLM reasoning.

2. It improves upon existing surveys like (Kumar et al., 2025; Li et al., 2025b) by providing a clear taxonomy that spans the entire field of reasoning.

3. Furthermore, our work is beginner-friendly and offers a top-down view of the key ideas alongside visual representations and detailed discussion of specific papers.

This survey not only organizes diverse approaches into one coherent framework but also explains why advanced LLM reasoning is essential for complex problem solving, as demonstrated by tasks like intricate mathematical proofs.

2 Preliminary Concepts

This section provides an overview of the fundamental concepts underlying our work, including language model (LM) objectives, reinforcement learning (RL) principles, and Monte Carlo Tree Search (MCTS).

2.1 Language Model Objectives and Sampling

Language models generate text by predicting the next token given a prompt. Formally, given a prompt x = (x_1, x_2, . . . , x_n) (where x_i is the token at the i-th index), a causal language model estimates the probability of the next token x_{n+1} as

P_θ(x_{n+1} | x_1, x_2, . . . , x_n),

where θ denotes the parameters of the language model. These models are trained using the causal language modeling loss:

L_LM(θ) = − Σ_{t=1}^{T} log P_θ(x_t | x_1, . . . , x_{t−1}),

where T is the total number of tokens in a sequence. During inference, given a prompt x, the model generates new tokens by sampling

x_{n+1} ∼ P_θ(· | x_1, . . . , x_n).
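To make the loss and the sampling step concrete, the following minimal sketch (ours, not from any particular LLM codebase) evaluates the causal language modeling loss and performs ancestral sampling over a toy hand-written bigram table standing in for P_θ:

```python
import math
import random

# Toy next-token distribution P(x_t | x_{t-1}); a real LLM conditions on the
# whole prefix x_1..x_{t-1} with a neural network parameterized by theta.
P = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.2, "ran": 0.8},
    "sat": {"<eos>": 1.0},
    "ran": {"<eos>": 1.0},
}

def causal_lm_loss(tokens):
    """L_LM = - sum_t log P(x_t | x_{<t}) over the sequence."""
    loss = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        loss -= math.log(P[prev][cur])
    return loss

def sample_continuation(prompt, max_new_tokens=5):
    """x_{n+1} ~ P(. | x_1..x_n): ancestral sampling, one token at a time."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        dist = P[out[-1]]
        token = random.choices(list(dist), weights=dist.values())[0]
        if token == "<eos>":
            break
        out.append(token)
    return out

print(causal_lm_loss(["the", "cat", "sat"]))   # negative log-likelihood
print(sample_continuation(["the"]))            # e.g. ['the', 'dog', 'ran']
```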
2.2 Basics of Reinforcement Learning for LLMs

Reinforcement learning (RL) trains an agent to interact with an environment to maximize a cumulative reward. In our framework, the agent is the LLM and the environment can be an external tool, a trained model, or the same LLM. Sometimes the environment is replaced with a world model, which is the model's internal representation of the environment that enables the agent to predict future states and rewards.

At time step t, the agent observes a state s_t as a collection of previously generated tokens x_1, . . . , x_{t−1} or reasoning steps, takes an action, which can be the next token or the next reasoning step x_t, and the environment in turn returns a reward r_t. This agent-environment interaction is depicted in Figure 1.

[Figure 1: Basic reinforcement learning setup consisting of an environment and an LLM.]

2.2.1 Policy Optimization

The LLM token output distribution is called a policy π_θ(a | s) (parameterized by θ), and the objective of RL is to learn a policy that maximizes the following expected cumulative reward:

J(π_θ) = E_{π_θ}[ Σ_{t=0}^{∞} γ^t r_t ],

where γ ∈ [0, 1) is the discount factor.

A simple approach to optimize the policy is the vanilla policy gradient:

∇_θ J(π_θ) = E_{π_θ}[ ∇_θ log π_θ(a_t | s_t) G_t ],
with the return G_t = Σ_{t′=t}^{∞} γ^{t′−t} r_{t′}. More advanced methods such as Proximal Policy Optimization (PPO) (Schulman et al., 2017) and GRPO (Shao et al., 2024) build upon this by incorporating constraints to ensure stable and efficient updates. The goal of RL is to obtain the optimal policy π_θ* that achieves the highest return by iteratively optimizing J(π_θ).

Value Function. Value functions come in two kinds: i) the state value function V(s), which measures the expected return when starting from state s and acting according to the optimal policy; and ii) the action value function Q(s, a), which measures the expected return when starting from state s, taking action a (which can be any action), and then following the optimal policy. In summary, a value function is a function of a state (and can thus be parameterized by a neural network) that outputs the importance of that state. The rule of thumb is to select the state with the highest value.
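As a concrete illustration of the vanilla policy gradient above, the following toy sketch (our construction, with an invented two-action environment) samples episodes, computes the discounted returns G_t, and nudges the logits of a softmax policy along ∇_θ log π_θ(a_t | s_t) G_t:

```python
import math
import random

GAMMA = 0.95  # discount factor
LR = 0.1      # learning rate

theta = [0.0, 0.0]   # logits of a state-independent softmax policy over 2 actions

def policy():
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [p / s for p in z]

def run_episode(T=3):
    """Toy environment: action 1 yields reward 1, action 0 yields 0."""
    actions, rewards = [], []
    for _ in range(T):
        a = random.choices([0, 1], weights=policy())[0]
        actions.append(a)
        rewards.append(float(a == 1))
    return actions, rewards

def returns(rewards):
    """G_t = sum_{t' >= t} gamma^(t'-t) r_{t'}."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + GAMMA * G
        out.append(G)
    return list(reversed(out))

for _ in range(500):                       # vanilla policy gradient loop
    actions, rewards = run_episode()
    Gs = returns(rewards)
    probs = policy()                       # probabilities used when sampling
    for a, G in zip(actions, Gs):
        for k in range(2):                 # grad_theta log pi(a) = 1[k=a] - pi(k)
            theta[k] += LR * ((1.0 if k == a else 0.0) - probs[k]) * G

print(policy())   # probability of the rewarded action approaches 1
```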
2.3 Preference Optimization

Preference Optimization (PO) directly aligns model behavior with preference data. Given an input x and two candidate outputs y+ and y− (where y+ is preferred over y−), DPO (Rafailov et al., 2024) adjusts the model so that the probability of generating y+ exceeds that of y−. The objective (based on the Bradley-Terry model) is defined as:

L_PO(θ) = − Σ_{(x, y+, y−)} e^{r(y+, x)} / ( e^{r(y+, x)} + e^{r(y−, x)} ),

where r(y, x) denotes the reward assigned to output y with x as the prompt. This loss encourages the model to assign higher likelihood to preferred outputs, effectively refining its reasoning process based on observed preferences.
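The sketch below computes the objective exactly as written above, i.e. a sum of negated Bradley-Terry preference probabilities (DPO itself works with the log of this probability, with rewards derived from policy and reference log-probabilities); the keyword-based reward function is a toy stand-in of ours:

```python
import math

def bt_preference_loss(triples, reward):
    """L_PO = - sum over (x, y+, y-) of e^{r(y+,x)} / (e^{r(y+,x)} + e^{r(y-,x)}),
    the (negated) Bradley-Terry probability that y+ is preferred over y-."""
    loss = 0.0
    for x, y_pos, y_neg in triples:
        rp, rn = reward(y_pos, x), reward(y_neg, x)
        loss -= math.exp(rp) / (math.exp(rp) + math.exp(rn))
    return loss

def toy_reward(y, x):
    # Stand-in reward purely for illustration; not a learned reward model.
    return 1.0 if "therefore" in y else -1.0

data = [("2+2?", "step by step, therefore 4", "it is 5")]
print(bt_preference_loss(data, toy_reward))
```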
2.4 Monte Carlo Tree Search (MCTS)

MCTS is a search algorithm that uses random sampling to explore a decision tree and guide action selection. The key steps of MCTS are:

1. Selection: Starting at the root node, recursively select child nodes based on a policy (e.g., Upper Confidence Bound for Trees (UCT), maximum action-value Q, or a combination thereof) until reaching a leaf node.

2. Expansion: If the leaf node is non-terminal, expand it by adding one or more child nodes. The associated position of the newly expanded node is evaluated using a value network.

3. Simulation: Perform a rollout from the new node using random or policy-guided sampling to estimate the outcome.

4. Backpropagation: Update the statistics (e.g., visit count, average reward, or Q-values) of nodes along the path based on the simulation result.

MCTS efficiently explores the search space by concentrating computational resources on the most promising branches.
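The four phases can be summarized in a short, self-contained sketch; this is a generic toy MCTS over a tiny action-sequence task of our own construction, with a random rollout standing in for the value network:

```python
import math
import random

TARGET = [1, 1, 1]          # toy task: reach this action sequence
ACTIONS = [0, 1]
C = 1.4                     # UCT exploration constant

class Node:
    def __init__(self, state, parent=None):
        self.state = state          # sequence of actions taken so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0            # sum of simulation rewards

def uct(node):
    # Upper Confidence Bound for Trees: exploitation + exploration term
    return node.value / node.visits + C * math.sqrt(math.log(node.parent.visits) / node.visits)

def select(node):
    # 1. Selection: descend while the node is non-terminal and fully expanded
    while len(node.state) < len(TARGET) and len(node.children) == len(ACTIONS):
        node = max(node.children, key=uct)
    return node

def expand(node):
    # 2. Expansion: add one unexplored child (if non-terminal)
    if len(node.state) == len(TARGET):
        return node
    tried = {c.state[-1] for c in node.children}
    action = random.choice([a for a in ACTIONS if a not in tried])
    child = Node(node.state + [action], parent=node)
    node.children.append(child)
    return child

def simulate(node):
    # 3. Simulation: random rollout to a terminal state, return its reward
    state = list(node.state)
    while len(state) < len(TARGET):
        state.append(random.choice(ACTIONS))
    return 1.0 if state == TARGET else 0.0

def backpropagate(node, reward):
    # 4. Backpropagation: update visit counts and values along the path
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent

root = Node([])
for _ in range(200):
    leaf = expand(select(root))
    backpropagate(leaf, simulate(leaf))

best = max(root.children, key=lambda c: c.visits)
print(best.state)   # most-visited first action, typically [1]
```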
3 Methods

The overall landscape of methods for achieving reasoning in large language models (LLMs) is broken down into three families: i) Reinforcement Learning, ii) Test Time Compute, and iii) Self-Training Methods. Each of them is described in detail below.

[Figure 2: Taxonomy of methods for reasoning: Reinforcement Learning (RL), Test Time Compute (TTC), and Self-Training.]

3.1 Reinforcement Learning

In any reasoning-based system where the objective is to reach a goal from a stated starting point, the optimal policy tends to select the path with the highest expected reward. This path consists of intermediate steps—commonly referred to as reasoning steps. For instance, in AlphaGo (Silver et al., 2016) the optimal next moves are considered as reasoning steps, whereas in solving a math problem, the intermediate steps constitute the reasoning process. Consequently, reinforcement learning (RL) can be a powerful strategy for eliciting reasoning in language models.

These strategies can be broadly divided into three categories: i) Verbal Reinforcement, ii) Reward-based Reinforcement, which can be further subdivided into process supervision and outcome supervision, and iii) Search/Planning.

In recent years, hybrid approaches that combine two or more of these components have also been explored. For example, verbal RL can be combined with tree search as in RAP (Hao et al., 2023), or process supervision can be combined with Monte Carlo tree search as demonstrated in rStar-Math (Guan et al., 2025).

[Figure 3: Taxonomy of Reasoning with Reinforcement Learning. Verbal Reinforcement: ReAct (Yao et al., 2023b); Reflexion (Shinn et al., 2023). Reward Based: Process Supervision, with value-based tuning (DVO (Zhang et al., 2025); Rewarding Progress (Setlur et al., 2024)) and preference tuning (Step-DPO (Lai et al., 2024); SVPO (Chen et al., 2024b)), and Outcome Supervision (DeepSeek-R1 (DeepSeek-AI et al., 2025); Chu et al. (2025); GLoRe (Havrilla et al., 2024)). Search/Planning: World Model (LLM-MCTS (Zhao et al., 2023); AlphaLLM (Tian et al., 2024); RAP (Hao et al., 2023)) and MCTS (SC-MCTS* (Gao et al., 2024); LLaMA-Berry (Zhang et al., 2024a)).]
i) Verbal Reinforcement: In verbal reinforcement, a base language model (LM) connected to a memory (either in-context or external) produces an action trajectory given a prompt. This action (i.e., a sequence of natural language tokens) is fed to an environment, where two LM-based modules—referred to as the Evaluator and the Self-Reflector—provide feedback in natural language on the generated reasoning trace. This feedback is stored in the memory and subsequently used to guide further generation. An overall diagram of this approach is shown in Figure 4. For instance, (Yao et al., 2023b) introduced a strategy that employs a thought-action-observation triplet to enhance reasoning. Similarly, the Reflexion framework (Shinn et al., 2023) explores the use of verbal feedback in language agents.

[Figure 4: Verbal reinforcement objective that relies on generation followed by NL feedback. Often the feedback consists of just observing the impact of the actor LLM's own action on the environment, as in (Yao et al., 2023b).]
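A minimal sketch of this actor/evaluator/self-reflector loop is shown below; the llm() stub, the prompt templates, and the stopping test are placeholders of ours and not the actual ReAct or Reflexion implementation:

```python
def llm(prompt: str) -> str:
    """Stand-in for a language model call; replace with a real API client."""
    return "draft answer to: " + prompt.splitlines()[-1]

def evaluator(task: str, attempt: str) -> str:
    # LM-based evaluator: critiques the attempt in natural language
    return llm(f"Task: {task}\nAttempt: {attempt}\nGive a short critique:")

def self_reflect(task: str, attempt: str, critique: str) -> str:
    # LM-based self-reflector: turns the critique into a lesson for the next try
    return llm(f"Task: {task}\nAttempt: {attempt}\nCritique: {critique}\n"
               "What should be done differently next time?")

def verbal_reinforcement(task: str, max_trials: int = 3) -> str:
    memory: list[str] = []          # verbal feedback accumulated across trials
    answer = ""
    for _ in range(max_trials):
        context = "\n".join(memory)
        answer = llm(f"Reflections so far:\n{context}\nTask: {task}\nAnswer:")
        critique = evaluator(task, answer)
        if "correct" in critique.lower():        # naive stopping criterion
            break
        memory.append(self_reflect(task, answer, critique))
    return answer

print(verbal_reinforcement("What is 17 * 24?"))
```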
ii) Reward based Reinforcement: These methods can be broadly divided into two categories:

1. Process Supervision: In this approach, each reasoning step (note that the reasoning step can either be the next token or the next step) is individually rewarded based on its relevance to forming an optimal reasoning chain. Since the intermediate steps are rewarded instead of only the final outcome, the model moves towards a more coherent final answer.

In DialCoT (Han et al., 2023), PPO was first employed to invoke chain-of-thought reasoning in language models. Although PPO typically relies on outcome supervision (it updates the policy only after seeing a final outcome), DialCoT divides each question into multiple smaller dialogue-based sub-questions that each receive rewards, effectively transforming PPO into a process-supervised method. Building on this idea, VinePPO (Kazemnejad et al., 2024) replaces the value-based network in PPO with Monte Carlo simulations of intermediate steps, making the algorithm more computationally efficient and less error-prone. In Chen et al. (2024a), the authors demonstrate that process supervision can be achieved without explicit preference data by training a value network to predict the value of intermediate states using Monte Carlo estimates of the final reward. Likewise, (Luo et al., 2024) also combines process supervision with MCTS for search.
Acquiring process-supervised data is often challenging, and training a value-based network is computationally expensive. To address this, researchers introduced the concept of Direct Preference Optimization (DPO) in Rafailov et al. (2024). DPO employs an outcome-supervised strategy, where positive and negative preference data pairs are directly used to update the model's policy without relying on an external value or reward model. While DPO gained popularity for its simplicity and effectiveness on chat benchmarks, it struggled with reasoning-based tasks due to the lack of granular process supervision, as noted in Lai et al. (2024). Researchers have successfully tackled this problem by combining process supervision with preference optimization in Lai et al. (2024), where step-level preference data is used for policy updates. The quality of these self-generated pairs can be enhanced by performing MCTS rollouts from the root node s_0 and constructing intermediate-state preference pairs (s_i ≻ s_j), where s_i is preferred over s_j if it leads to the correct answer more frequently. Using MCTS also gets rid of the expensive human or language-model supervision in annotating preference pairs. These ideas are further explored in Jiao et al. (2024); Xie et al. (2024); Chen et al. (2024b), with an excellent summary provided in Li et al. (2025a). Due to the labor-intensive nature of human supervision for constructing preference data, Direct Value Optimization (DVO) (Zhang et al., 2025) has been introduced. Much of the current research focuses on enhancing process rewards through better credit assignment strategies (Setlur et al., 2024; Cui et al., 2025; Zhang et al., 2024b). Figure 7 illustrates classic DPO and its improvements for handling step-level granular preference pairs, which aid in better reasoning.

[Figure 7: DPO and its variants which use step-wise preference data for granular reasoning tasks. (a) DPO (Rafailov et al., 2024) directly optimizes the policy based on instance-level preferences; as a result it misses out on granular step-level preferences. (b) Step-DPO (Lai et al., 2024) uses step-level reasoning preferences to improve upon DPO. (c) MCTS combined with DPO (Xie et al., 2024) also uses step-level preferences, with the preferences assigned from actions estimated by MCTS.]
(MCTS) and its variants—to improve decision-
2. Outcome Supervision: While (Lightman et al., making. This approach was pioneered in systems
2023) originally showed that process super- like AlphaGo and its subsequent variants, where
vision outperforms outcome supervision, this MCTS is integrated with a value and policy net-
view is challenged by superior performance of work (the LLM) to explore potential moves effi-
models such as Deepseek-R1 (DeepSeek-AI ciently (Gao et al., 2024). The same underlying
et al., 2025), which rely solely on outcome- idea—using search to simulate future outcomes
based rewards. This is also supported by and guide action selection—has been extended to
the GRPO (Group Relative Policy Optimiza- domains requiring verifiable results, such as mathe-
tion) algorithm (Shao et al., 2024), which matics. The value network that estimates the value
improves upon PPO by replacing the expen- of potential moves in tree search and can be trained
sive value-based network with group relative using: i) preference data collected from MCTS roll-

iii) Search: Search-based techniques in reinforcement learning leverage tree search methods—most notably Monte Carlo Tree Search (MCTS) and its variants—to improve decision-making. This approach was pioneered in systems like AlphaGo and its subsequent variants, where MCTS is integrated with a value and policy network (the LLM) to explore potential moves efficiently (Gao et al., 2024). The same underlying idea—using search to simulate future outcomes and guide action selection—has been extended to domains requiring verifiable results, such as mathematics. The value network that estimates the value of potential moves in tree search can be trained using: i) preference data collected from MCTS rollouts, where, concretely, s_i is considered preferred over s_j (denoted by s_i ≻ s_j) if s_i leads to the correct answer more often than s_j (Zhang et al., 2024a; Ma et al., 2025); or ii) a Monte Carlo estimate of the overall reward achieved from the current state (Chen et al., 2024a).

More recently, generic problem solving has benefited from frameworks that treat large language models (LLMs) as world models. In these setups, an LLM generates predictions about the environment, and this world model is augmented with MCTS or similar search strategies to systematically explore potential solutions (Zhao et al., 2023; Feng et al., 2024; Tian et al., 2024; Zhou et al., 2024). Figure 5 explains the overall details, and a sketch of the rollout-based preference construction follows below.

[Figure 5: Left: Monte Carlo tree search where the state values are evaluated using a value estimator network, which is trained using RL via preference data or a Monte Carlo estimate of the reward. Right: World model as a feedback mechanism for tree search.]
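The sketch below makes the rollout-based preference construction concrete: estimate each candidate step's success rate with Monte Carlo rollouts and prefer the step that reaches the correct answer more often. The rollout function is a stand-in of ours for continuing a partial reasoning trace with the policy and checking the final answer:

```python
import random

def rollout_is_correct(state: str) -> bool:
    """Stand-in rollout: continue from a partial reasoning state to a final
    answer and check it. States containing 'good' succeed 80% of the time,
    the rest 20%, purely for illustration."""
    return random.random() < (0.8 if "good" in state else 0.2)

def success_rate(state: str, n_rollouts: int = 32) -> float:
    # Monte Carlo estimate of the value of an intermediate state
    return sum(rollout_is_correct(state) for _ in range(n_rollouts)) / n_rollouts

def preference_pairs(prompt: str, candidate_steps: list[str]):
    """Return (preferred, rejected) pairs: s_i > s_j if s_i leads to the
    correct answer more often than s_j across the rollouts."""
    scored = [(s, success_rate(prompt + " " + s)) for s in candidate_steps]
    pairs = []
    for si, vi in scored:
        for sj, vj in scored:
            if vi > vj:
                pairs.append((si, sj))
    return pairs

steps = ["good step: factor the equation", "bad step: guess randomly"]
print(preference_pairs("Solve x^2 - 5x + 6 = 0.", steps))
```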
[Figure 6: Improvement of GRPO over PPO. Top: DialCoT (Han et al., 2023) uses PPO for policy updates. Bottom: In Shao et al. (2024), GRPO foregoes the value model and calculates the final reward from group scores of multiple outputs, significantly reducing training resources.]

3.2 Test Time Compute

Modern LLMs have demonstrated impressive reasoning abilities; however, their fixed parameters and static inference can limit their capacity to adapt to new or complex tasks on the fly. Test time compute is motivated by the need to overcome these limitations by allowing models to refine or extend their reasoning capabilities during deployment. In the test time compute setting, the pretrained LLM is never trained by either supervised or reinforcement learning-based techniques.

Test-time capabilities can be broadly categorized into:

1. Feedback Guided Improvement: In this category, the generation from the model is refined based on external feedback provided by various critique modules (e.g., program execution modules, external knowledge bases, or trained models).

2. Scaling/Adaptation: In this approach, the compute allocated at test time is scaled up to enhance the quality of the output.

[Figure 8: Taxonomy of Reasoning with Test Time Compute (TTC). Feedback Guided Improvement, with generation-time feedback (Step Feedback: GRACE (Khalifa et al., 2023); CoRE (Zhu et al., 2023); Outcome Feedback: LEVER (Ni et al., 2023); CodeT (Chen et al., 2022)) and post-hoc feedback (LLM based Debate: Debate (Du et al., 2023); CMD (Wang et al., 2024a); LMvLM (Cohen et al., 2023); ReConcile (Chen et al., 2024c); Trained Model Feedback: DeCRIM (Ferraz et al., 2024); Re3 (Yang et al., 2022)); Scaling/Adaptation, with compute scaling (Chain-of-Thought (Wei et al., 2023); Forest-of-Thought (Bi et al., 2025); Graph-of-Thought (Besta et al., 2024); Buffer-of-Thought (Yang et al., 2024)) and self feedback (score-based search: Deductive Beam Search (Zhu et al., 2024); AlphaLLM (Tian et al., 2024); natural language: Self-Refine (Madaan et al., 2023); Weng et al. (2023)).]

3.2.1 Feedback Guided Improvement

Depending on the timing of the feedback, this paradigm can be further grouped into two subdomains:

1. Generation Time Feedback: In this case, the LLM receives feedback during the generation process, which can be in the form of numerical scores assessing either partial or complete outputs.

(a) Step-Feedback (SF): Here, beam search or Monte Carlo Tree Search (MCTS) is employed, in which a critique module provides feedback at each step of the reasoning process. Only the highest scoring reasoning chains are retained. Figure 9 illustrates how an external verifier scores intermediate states in both beam search and MCTS.

(b) Outcome-Feedback (OF): In this approach, multiple generations are produced and then scored by an external critique module. Only the generations with the highest scores are retained. Figure 10 depicts a feedback mechanism where the input prompt is first enriched with feedback from an external knowledge base or verifier, and then multiple branches are explored, with the highest scoring branch being selected.

Depending on the application, multiple feedback-based models have been proposed.

Code Execution as Critique: For example, CodeT (Chen et al., 2022) employs the same language model to generate multiple code candidates along with test cases, selecting the final code only if all test cases are passed. Similarly, LEVER (Ni et al., 2023) uses a program executor to verify the correctness of the generated code and provide feedback. Both methods operate as outcome-feedback based critiques.
For using various kinds of tools as feedback, (Gou et al., 2024) proposes interactive tool use as a critiquing method. Similarly, Chen et al. (2023) demonstrates that, using the results of generated code and a few in-context examples as feedback, a large language model learns to debug its predicted program.
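A minimal sketch of this execution-based, outcome-feedback critique is shown below; the candidate programs and test cases are hard-coded stand-ins for LLM samples, so it illustrates the selection logic rather than CodeT's or LEVER's actual pipelines:

```python
# Candidate programs for "return the absolute value of x", standing in for
# multiple LLM samples; one of them is wrong on purpose.
candidates = [
    "def f(x):\n    return x if x >= 0 else -x",
    "def f(x):\n    return x",                     # buggy candidate
]

# Generated test cases acting as the critique module.
tests = [("f(3)", 3), ("f(-4)", 4), ("f(0)", 0)]

def passes_all_tests(src: str) -> bool:
    env: dict = {}
    try:
        exec(src, env)                       # define f in an isolated namespace
        return all(eval(call, env) == expected for call, expected in tests)
    except Exception:
        return False                         # execution errors count as failure

selected = [src for src in candidates if passes_all_tests(src)]
print(selected[0] if selected else "no candidate survived the critique")
```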
External Knowledge as Critique: In this setup, an external knowledge base K is consulted before generating an answer. The feedback from K, denoted as K(x), may be provided as a natural language sentence or as discrete scores. For example, MemPrompt (Madaan et al., 2022) utilizes a pool of prior user feedback (serving as a knowledge base) to guide text generation based on the current prompt. In another example, Varshney et al. leverage self-inquiry and web-search to retrieve relevant information to correct hallucinated statements, where the knowledge base comprises both the language model and external web data.

Trained Model as Critique: A trained model can be used as a critique mechanism in either a step-feedback or outcome-feedback configuration. GRACE (Khalifa et al., 2023) employs beam search with step-feedback, guiding the chain-of-thought (CoT) process using a discriminator. CoRE (Zhu et al., 2023) uses MCTS for math problem solving, where the selection operation at each node is informed by a predicted reward from a trained model critique. DIVERSE (Li et al., 2023) uses binary-valued outcome-feedback based on a trained DeBERTa model, significantly improving the problem-solving rate on GSM8K from 17.9% to 58.1%. Additionally, (Zhang et al., 2023) proposes a transformer decoding mechanism based on MCTS for code generation, with reward signals obtained from executing generated code on public test cases. Similarly, (Hong et al., 2023) employs MCTS for common-sense reasoning by constructing an entailment tree.

2. Post-hoc Feedback: After the base LLM generates its outputs, these outputs can be refined in a post-hoc manner using separate models.
(a) LLM-Based Debate: Two LLMs—framed as an examinee and an examiner—engage in a multi-turn debate where the original answer is discussed and refined. Originally proposed in (Du et al., 2023), this approach has been adopted in several subsequent works (Cohen et al., 2023; Chen et al., 2024c). Although promising, this approach has also faced criticism (Wang et al., 2024a).

(b) Trained Model Feedback and Refinement: In this method, feedback in the form of either scalar values or natural language is used to revise the generated response in an inference-only setting (Yang et al., 2022). Recently, generate-critique-refine pipelines have gained popularity (Ferraz et al., 2024; Wadhwa et al., 2024).

Both of these post-hoc feedback mechanisms are described comprehensively in the survey of (Pan et al., 2024) (see Sections 3.2 and 3.3).

[Figure 9: Step-feedback illustration using beam search and MCTS. In beam search, an external verifier scores intermediate nodes to guide selection (Khalifa et al., 2023). In MCTS, a node is expanded and a rollout begins from that node; here, the verifier assigns it a value of 0.8 (Zhu et al., 2023; Zhang et al., 2023; Hong et al., 2023).]

[Figure 10: Left: The input prompt is passed to a knowledge base that provides feedback, which is then fed back into the LM for further output. Right: Multiple branches are explored from a start node (generated by the LM), and only the branch with the highest estimated verifier feedback score is selected.]
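To make the step-feedback pattern of Figure 9 concrete, the following sketch runs a verifier-guided beam search over reasoning chains; the step proposer and verifier are toy stand-ins of ours rather than the GRACE discriminator:

```python
def propose_steps(chain: list[str]) -> list[str]:
    """Stand-in for an LLM proposing candidate next reasoning steps."""
    i = len(chain)
    return [f"step {i}: correct move", f"step {i}: dubious move"]

def verifier_score(chain: list[str]) -> float:
    """Stand-in external verifier: fraction of steps it judges sound."""
    if not chain:
        return 0.0
    return sum("correct" in s for s in chain) / len(chain)

def step_feedback_beam_search(depth: int = 4, beam_width: int = 2):
    beams = [[]]                                     # start from the empty chain
    for _ in range(depth):
        expanded = [chain + [s] for chain in beams for s in propose_steps(chain)]
        expanded.sort(key=verifier_score, reverse=True)
        beams = expanded[:beam_width]                # keep highest-scoring chains
    return beams[0]

print(step_feedback_beam_search())
```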
3.2.2 Scaling Test-Time Computation

Recent works have shown that scaling test-time computation—such as best-of-N sampling—can outperform scaling during training. In this section, we focus on strategies that increase reasoning capabilities at test time by investing more computation, either by scaling up token-level compute, employing test-time training, or using self-feedback. Note that self-feedback differs from self-teaching: in self-feedback, the original model output is refined without further training of the base model. Scaling up token compute and self-feedback are the two mainstream strategies.

Scaling Up Token Compute: Scaling up token-level compute involves generating multiple intermediate token outputs (i.e., reasoning steps) in parallel, allowing the exploration of various plausible reasoning pathways. This strategy is typically implemented via chain-of-thought (CoT) prompting and its subsequent variants. For instance, CoT (Wei et al., 2023) generates reasoning in a linear, step-by-step manner, whereas tree-of-thought (ToT) (Yao et al., 2023a) generalizes this by exploring multiple reasoning branches simultaneously and selecting the best one based on an LM-generated heuristic score. Due to the inherent noise in LM-generated scores, pairwise comparisons have been proposed (Zhang et al., 2024c) to more reliably identify promising intermediate thoughts.
Further generalizations include forest-of-thought (FoT) (Bi et al., 2025), which employs sparse activation to select the most relevant reasoning paths among multiple trees, and graph-of-thought (GoT) (Besta et al., 2024), which uses graph-based operations to aggregate and refine thoughts. Recently, buffer-of-thoughts (Yang et al., 2024) has been introduced, where a meta-buffer stores high-level meta-thoughts that are dynamically instantiated during problem solving.

An alternative to pure scaling is integrating search with CoT. For example, CoT with search (Zhu et al., 2024) uses a verifier at each reasoning step to check deducibility and mitigate error accumulation. In inductive reasoning problems, another approach constructs a set of natural language hypotheses that are then implemented as verifiable Python programs, as explored in hypothesis search (Wang et al., 2024b).
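The simplest instantiation of scaling up token compute is to sample several reasoning chains in parallel and aggregate their final answers, for example by majority vote; the sketch below is ours, with a stubbed sampler in place of an LLM:

```python
import random
from collections import Counter

def sample_chain_of_thought(question: str) -> str:
    """Stand-in for sampling one CoT answer from an LLM at temperature > 0:
    returns the right answer 60% of the time, otherwise a distractor."""
    return "12" if random.random() < 0.6 else random.choice(["10", "14"])

def scaled_inference(question: str, n_samples: int = 16) -> str:
    answers = [sample_chain_of_thought(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]    # majority vote over answers

print(scaled_inference("A dozen eggs is how many eggs?"))
```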
Self-Feedback: Self-feedback can be utilized in two forms:

1. Natural Language Self-Feedback: Self-refine (Madaan et al., 2023) is a pedagogical work that demonstrates how natural language feedback generated by the same LLM can be used to refine its reasoning process. This approach follows three steps: i) Answer Generation, ii) Feedback Generation, and iii) Refinement. All steps are executed in a loop using the same LLM. As an extension, Weng et al. (2023) proposed a scoring-based approach where multiple sampled answers are scored by the same LLM, and the answer with the highest score is selected as the final answer. In this method, separate conclusions are generated for each answer, followed by backward verification on those conclusions. Another strategy (Chen et al., 2025) iteratively samples and applies self-verification until all sampled answers are confirmed as correct by the model; the final answer is then determined via majority voting among the verified responses. Additionally, self-collaboration has been introduced by repurposing a single LLM into a cognitive synergist with multiple personas (Wang et al., 2024c). Here, multiple personas engage in a brainstorming session by generating separate answers, which are subsequently refined through inter-persona collaboration. The overall ideas of these approaches are summarized in Figure 11.

2. Self-Feedback as a Score for Tree-Search: Self-feedback can also be leveraged as a reward or scoring signal within search-based techniques. For instance, Zhu et al. (2024) propose a decoding algorithm that employs beam search for reasoning based on self-feedback. Other studies have demonstrated that MCTS combined with LLM-guided self-feedback can enhance the reasoning process (Tian et al., 2024; Hao et al., 2023). Furthermore, some methods reframe the LLM as a world model to steer an MCTS-based reasoning process (Zhao et al., 2023). The underlying technical framework in these self-feedback search techniques is analogous to that shown in Figure 9, with the distinction that the external step-feedback is replaced by self-generated feedback.

[Figure 11: From left to right, we elaborate four main papers based on the self-refinement objective (Self-Refine (Madaan et al., 2023), the self-verification approach of Weng et al. (2023), SETS (Chen et al., 2025), and multi-persona collaboration (Wang et al., 2024c)). a_ij refers to the i-th iteration and the j-th sampled answer; green denotes an answer that is self-verified as correct.]

ReVISE (Lee et al., 2025) exemplifies a hybrid approach by using a structured, curriculum-based preference learning method to both self-teach the base model and integrate test-time scaling via self-verification and correction. This approach is further enhanced by a confidence-aware decoding mechanism that leverages self-feedback.

There are, however, criticisms suggesting that LLMs may not effectively aid in planning (Valmeekam et al., 2023) or self-correct their reasoning (Huang et al., 2024).
LLM. As an extension, Weng et al. (2023) pro-
posed a scoring-based approach where multi- 3.3 Self-Training Methods
ple sampled answers are scored by the same Pre-trained language models have revolutionized
LLM, and the answer with the highest score natural language processing; however, their fixed
is selected as the final answer. In this method, weights can become a bottleneck when faced
separate conclusions are generated for each with the dynamic nature of real-world data. Self-
answer, followed by backward verification on training methods overcome this limitation by fine-
those conclusions. Another strategy (Chen tuning the pre-trained LLMs using curated self-
et al., 2025) iteratively samples and applies generated reasoning traces, thereby updating the
self-verification until all sampled answers are model weights—unlike test-time methods where
confirmed as correct by the model; the final reasoning is induced via in-context learning with-
answer is then determined via majority vot- out weight updates. This approach, first illustrated
ing among the verified responses. Addition- in Deepseek-R1 and fundamentally pioneered by
ally, self-collaboration has been introduced STaR, has shown significant improvements in rea-
by repurposing a single LLM into a cognitive soning performance.
synergist with multiple personas (Wang et al., Assume that for a given problem, we have input-
2024c). Here, multiple personas engage in output pairs (xi , yi ), where xi is a prompt and yi
a brainstorming session by generating sepa- is the correct answer, and let MP T denote the pre-
rate answers, which are subsequently refined trained language model. The self-training process
through inter-persona collaboration. The over- typically involves an arbitrary combination of the
all ideas of these approaches are summarized following techniques:
in Figure 11.
1. Supervised Fine-Tuning: Fine-tune MP T
2. Self-Feedback as a Score for Tree-Search: on the (xi , yi ) pairs using the causal language

9
1. Supervised Fine-Tuning: Fine-tune M_PT on the (x_i, y_i) pairs using the causal language modeling loss to obtain M_SFT.

2. Rejection Fine-Tuning: For each prompt x_i, generate a rationale (i.e., a reasoning trace) r̂_i and then an answer ŷ_i using either M_PT or M_SFT. The generated triple (x_i, r̂_i, ŷ_i) is used for further fine-tuning only if ŷ_i = y_i.

3. Preference Tuning: Construct a preference dataset by generating two pairs (x_i, r̂_i, ŷ_i) and (x_i, r̂_j, ŷ_j) from the same model on input x_i. If ŷ_j = y_i but ŷ_i ≠ y_i, then the pair (x_i, r̂_j, ŷ_j) is preferred. This preference data is then used—often via Direct Preference Optimization (DPO)—to fine-tune the model so that it favors generating reasoning traces that lead to the correct answer.

4. Reinforcement Learning: Build a dataset of tuples (x_i, r̂_i, ŷ_i, R_i), where the reward R_i is 1 if ŷ_i = y_i and 0 or −1 otherwise. The model is then updated using RL techniques to maximize the expected cumulative reward.
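Rejection fine-tuning (technique 2 above), which STaR builds on, amounts to a simple data filter over self-generated traces; the sketch below is ours, with the sampling and fine-tuning steps stubbed out:

```python
import random

def generate_rationale_and_answer(prompt: str) -> tuple[str, str]:
    """Stand-in for sampling (rationale, answer) from M_PT or M_SFT."""
    if random.random() < 0.5:
        return ("2 + 2 equals 4 because ...", "4")
    return ("2 + 2 equals 5 because ...", "5")      # incorrect sample

def rejection_finetuning_dataset(pairs, samples_per_prompt=8):
    """Keep (x, rationale, answer) only when the sampled answer is correct."""
    kept = []
    for x, y in pairs:
        for _ in range(samples_per_prompt):
            rationale, y_hat = generate_rationale_and_answer(x)
            if y_hat == y:                           # rejection step
                kept.append((x, rationale, y_hat))
    return kept

train_pairs = [("What is 2 + 2?", "4")]
dataset = rejection_finetuning_dataset(train_pairs)
print(len(dataset), "accepted traces")               # fed back into fine-tuning
```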
STaR (Zelikman et al., 2022) uses rejection fine-tuning on M_PT to produce a reasoning model M_Reason. However, as rejection fine-tuning alone ignores negative training pairs, v-STaR (Hosseini et al., 2024) first obtains a model M_RFT via rejection fine-tuning and then applies DPO to yield M_Reason. Other approaches, such as ReFT (Luong et al., 2024), use reinforcement learning on M_SFT (obtained from M_PT via supervised fine-tuning), while the Self-Explore (Hwang et al., 2024) method begins with supervised fine-tuning to generate M_SFT and subsequently refines it through rejection-tuning and preference-tuning to obtain M_Reason. Figure 12 shows the overall training process of these four representative techniques for self-training.

[Figure 12: Four representative self-training methods (STaR, v-STaR, ReFT, Self-Explore) with training details.]
4 Challenges

1. Challenges regarding Automation of Process Supervision Signals: Developing PRMs (Process Reward Models) is hampered by the need for detailed process supervision signals—labels for every reasoning step—that currently depend on costly, labor-intensive human annotation (Lightman et al., 2023). Automating this labeling process is essential for scalability and efficiency.

2. Computational Overhead and Overthinking: While MCTS addresses issues with value-based networks in process supervision, it struggles with a vast search space and often overthinks, leading to unnecessary computational complexity and inefficiency (Luo et al., 2024).

3. Expensive Step-Level Preference Optimization: Step-level preference learning solves the issues DPO (Rafailov et al., 2024) faces with reasoning tasks. However, it also presents significant challenges. Step-level preference annotation is much more expensive than instance-level annotation. It also requires fine-grained judgements that can lead to inconsistent and subjective labels.

4. Test-Time Compute Scaling Depends on Robust Pre-training: Test-time compute unlocks a model's best performance. This enables smaller models to outperform bigger models on easier questions (Snell et al., 2024). However, a bottleneck arises for more difficult questions: if the base model lacks robust pre-training, additional inference compute may not compensate for its deficiencies.

5. Test-Time Scaling Limitations: Test-time scaling techniques such as Chain-of-Thought (Wei et al., 2023) gained popularity because of their effectiveness and interpretable nature. However, Wei et al. (2023) also showed that CoT proved to be significantly effective only for LLMs with more than 100B parameters. For smaller language models with fewer than 10B parameters, it even proved detrimental in certain cases.

5 Conclusion

This survey provides a bird's-eye view of the key techniques used to elicit reasoning in language models. We deliberately kept the discussion accessible by omitting extensive mathematical derivations and detailed benchmark comparisons, which can often overwhelm newcomers to the field. Unlike many long surveys that are densely packed with technical details, our paper is relatively brief and focused, making it a less intimidating entry point for researchers new to the area. Our focus has been on presenting only the most essential information related to reasoning in LLMs. For those interested in a deeper exploration of specific aspects, we recommend consulting more comprehensive surveys such as (Kumar et al., 2025).
References

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. 2024. Graph of thoughts: Solving elaborate problems with large language models. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17682–17690.
Zhenni Bi, Kai Han, Chuanjian Liu, Yehui Tang, and Yunhe Wang. 2025. Forest-of-thought: Scaling test-time compute for enhancing llm reasoning. Preprint, arXiv:2412.09078.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. Preprint, arXiv:2005.14165.
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. Preprint, arXiv:2303.12712.
Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2022. Codet: Code generation with generated tests. Preprint, arXiv:2207.10397.
Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. 2024a. Alphamath almost zero: Process supervision without process. Preprint, arXiv:2405.03553.
Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. 2024b. Step-level value preference optimization for mathematical reasoning. Preprint, arXiv:2406.10858.
Jiefeng Chen, Jie Ren, Xinyun Chen, Chengrun Yang, Ruoxi Sun, and Sercan Ö Arık. 2025. Sets: Leveraging self-verification and self-correction for improved test-time scaling. Preprint, arXiv:2501.19306.
Justin Chih-Yao Chen, Swarnadeep Saha, and Mohit Bansal. 2024c. Reconcile: Round-table conference improves reasoning via consensus among diverse llms. Preprint, arXiv:2309.13007.
Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching large language models to self-debug. Preprint, arXiv:2304.05128.
Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, and Yi Ma. 2025. Sft memorizes, rl generalizes: A comparative study of foundation model post-training. Preprint, arXiv:2501.17161.
Roi Cohen, May Hamri, Mor Geva, and Amir Globerson. 2023. Lm vs lm: Detecting factual errors via cross examination. Preprint, arXiv:2305.13281.
Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, and Ning Ding. 2025. Process reinforcement through implicit rewards. Preprint, arXiv:2502.01456.
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. Preprint, arXiv:2501.12948.
Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. 2023. Improving factuality and reasoning in language models through multiagent debate. Preprint, arXiv:2305.14325.
Xidong Feng, Ziyu Wan, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. 2024. Alphazero-like tree-search can guide large language model decoding and training. Preprint, arXiv:2309.17179.
Thomas Palmeira Ferraz, Kartik Mehta, Yu-Hsiang Lin, Haw-Shiuan Chang, Shereen Oraby, Sijia Liu, Vivek Subramanian, Tagyoung Chung, Mohit Bansal, and Nanyun Peng. 2024. Llm self-correction with decrim: Decompose, critique, and refine for enhanced following of instructions with multiple constraints. Preprint, arXiv:2410.06458.
Zitian Gao, Boye Niu, Xuzheng He, Haotian Xu, Hongzhang Liu, Aiwei Liu, Xuming Hu, and Lijie Wen. 2024. Interpretable contrastive monte carlo tree search reasoning. Preprint, arXiv:2410.01707.
Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. 2024. Critic: Large language models can self-correct with tool-interactive critiquing. Preprint, arXiv:2305.11738.
Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. 2025. rstar-math: Small llms can master math reasoning with self-evolved deep thinking. Preprint, arXiv:2501.04519.
Chengcheng Han, Xiaowei Du, Che Zhang, Yixin Lian, Xiang Li, Ming Gao, and Baoyuan Wang. 2023. DialCoT meets PPO: Decomposing and exploring reasoning paths in smaller language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8055–8068, Singapore. Association for Computational Linguistics.
Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. 2023. Reasoning with language model is planning with world model. Preprint, arXiv:2305.14992.
Alex Havrilla, Sharath Raparthy, Christoforus Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, and Roberta Raileanu. 2024. Glore: When, where, and how to improve llm reasoning via global and local refinements. Preprint, arXiv:2402.10963.
Ruixin Hong, Hongming Zhang, Hong Zhao, Dong Yu, and Changshui Zhang. 2023. Faithful question answering with monte-carlo planning. Preprint, arXiv:2305.02556.
Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, and Rishabh Agarwal. 2024. V-star: Training verifiers for self-taught reasoners. Preprint, arXiv:2402.06457.
Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards reasoning in large language models: A survey. Preprint, arXiv:2212.10403.
Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. 2024. Large language models cannot self-correct reasoning yet. Preprint, arXiv:2310.01798.
Hyeonbin Hwang, Doyoung Kim, Seungone Kim, Seonghyeon Ye, and Minjoon Seo. 2024. Self-explore: Enhancing mathematical reasoning in language models with fine-grained rewards. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1444–1466, Miami, Florida, USA. Association for Computational Linguistics.
Fangkai Jiao, Chengwei Qin, Zhengyuan Liu, Nancy F. Chen, and Shafiq Joty. 2024. Learning planning-based reasoning by trajectories collection and process reward synthesizing. Preprint, arXiv:2402.00658.
Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. 2024. Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment. Preprint, arXiv:2410.01679.
Muhammad Khalifa, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, and Lu Wang. 2023. Grace: Discriminator-guided chain-of-thought reasoning. Preprint, arXiv:2305.14934.
Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip H. S. Torr, Salman Khan, and Fahad Shahbaz Khan. 2025. Llm post-training: A deep dive into reasoning large language models. Preprint, arXiv:2502.21321.
Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, and Jiaya Jia. 2024. Step-dpo: Step-wise preference optimization for long-chain reasoning of llms. Preprint, arXiv:2406.18629.
Hyunseok Lee, Seunghyuk Oh, Jaehyung Kim, Jinwoo Shin, and Jihoon Tack. 2025. Revise: Learning to refine at test-time via intrinsic self-verification. Preprint, arXiv:2502.14565.
Shuangtao Li, Shuaihao Dong, Kexin Luan, Xinhan Di, and Chaofan Ding. 2025a. Enhancing reasoning through process supervision with monte carlo tree search. Preprint, arXiv:2501.01478.
Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. 2023. Making language models better reasoners with step-aware verifier. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5315–5333, Toronto, Canada. Association for Computational Linguistics.
Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, Yingying Zhang, Fei Yin, Jiahua Dong, Zhijiang Guo, Le Song, and Cheng-Lin Liu. 2025b. From system 1 to system 2: A survey of reasoning large language models. Preprint, arXiv:2502.17419.
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let's verify step by step. Preprint, arXiv:2305.20050.
Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, and Abhinav Rastogi. 2024. Improve mathematical reasoning in language models by automated process supervision. Preprint, arXiv:2406.06592.
Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. 2024. Reft: Reasoning with reinforced fine-tuning. Preprint, arXiv:2401.08967.
Yiran Ma, Zui Chen, Tianqiao Liu, Mi Tian, Zhuo Liu, Zitao Liu, and Weiqi Luo. 2025. What are step-level reward models rewarding? counterintuitive findings from mcts-boosted mathematical reasoning. Preprint, arXiv:2412.15904.
Aman Madaan, Niket Tandon, Peter Clark, and Yiming Yang. 2022. Memory-assisted prompt editing to improve GPT-3 after deployment. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2833–2861, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-refine: Iterative refinement with self-feedback. Preprint, arXiv:2303.17651.
Ansong Ni, Srini Iyer, Dragomir Radev, Ves Stoyanov, Wen-tau Yih, Sida I. Wang, and Xi Victoria Lin. 2023. Lever: Learning to verify language-to-code generation with execution. Preprint, arXiv:2302.08468.
Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, and William Yang Wang. 2024. Automatically correcting large language models: Surveying the landscape of diverse automated correction strategies. Transactions of the Association for Computational Linguistics, 12:484–506.
Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. 2023. Reasoning with language model prompting: A survey. Preprint, arXiv:2212.09597.
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2024. Direct preference optimization: Your language model is secretly a reward model. Preprint, arXiv:2305.18290.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. Preprint, arXiv:1707.06347.
Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar. 2024. Rewarding progress: Scaling automated process verifiers for llm reasoning. Preprint, arXiv:2410.08146.
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. Preprint, arXiv:2402.03300.
Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. Preprint, arXiv:2303.11366.
David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.
Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2024. Scaling llm test-time compute optimally can be more effective than scaling model parameters. Preprint, arXiv:2408.03314.
Matthias Steup and Ram Neta. 2024. Epistemology. In Edward N. Zalta and Uri Nodelman, editors, The Stanford Encyclopedia of Philosophy, Winter 2024 edition. Metaphysics Research Lab, Stanford University.
Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Haitao Mi, and Dong Yu. 2024. Toward self-improvement of llms via imagination, searching, and criticizing. Preprint, arXiv:2404.12253.
Karthik Valmeekam, Sarath Sreedharan, Matthew Marquez, Alberto Olmo, and Subbarao Kambhampati. 2023. On the planning abilities of large language models (a critical investigation with a proposed benchmark). Preprint, arXiv:2302.06706.
Karthik Valmeekam, Kaya Stechly, and Subbarao Kambhampati. 2024. Llms still can't plan; can lrms? a preliminary evaluation of openai's o1 on planbench. Preprint, arXiv:2409.13373.
Manya Wadhwa, Xinyu Zhao, Junyi Jessy Li, and Greg Durrett. 2024. Learning to refine with fine-grained natural language feedback. Preprint, arXiv:2407.02397.
Qineng Wang, Zihao Wang, Ying Su, Hanghang Tong, and Yangqiu Song. 2024a. Rethinking the bounds of llm reasoning: Are multi-agent discussions the key? Preprint, arXiv:2402.18272.
Ruocheng Wang, Eric Zelikman, Gabriel Poesia, Yewen Pu, Nick Haber, and Noah D. Goodman. 2024b. Hypothesis search: Inductive reasoning with language models. Preprint, arXiv:2309.05660.
Zhenhailong Wang, Shaoguang Mao, Wenshan Wu, Tao Ge, Furu Wei, and Heng Ji. 2024c. Unleashing the emergent cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration. Preprint, arXiv:2307.05300.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-thought prompting elicits reasoning in large language models. Preprint, arXiv:2201.11903.
Yixuan Weng, Minjun Zhu, Fei Xia, Bin Li, Shizhu He, Shengping Liu, Bin Sun, Kang Liu, and Jun Zhao. 2023. Large language models are better reasoners with self-verification. Preprint, arXiv:2212.09561.
Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, and Yoon Kim. 2024. Reasoning or reciting? exploring the capabilities and limitations of language models through counterfactual tasks. Preprint, arXiv:2307.02477.
Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P. Lillicrap, Kenji Kawaguchi, and Michael Shieh. 2024. Monte carlo tree search boosts reasoning via iterative preference learning. Preprint, arXiv:2405.00451.
Kevin Yang, Yuandong Tian, Nanyun Peng, and Dan Klein. 2022. Re3: Generating longer stories with recursive reprompting and revision. Preprint, arXiv:2210.06774.
Ling Yang, Zhaochen Yu, Tianjun Zhang, Shiyi Cao, Minkai Xu, Wentao Zhang, Joseph E. Gonzalez, and Bin Cui. 2024. Buffer of thoughts: Thought-augmented reasoning with large language models. Preprint, arXiv:2406.04271.
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023a. Tree of thoughts: Deliberate problem solving with large language models. Preprint, arXiv:2305.10601.
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023b. React: Synergizing reasoning and acting in language models. Preprint, arXiv:2210.03629.
Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. 2022. Star: Bootstrapping reasoning with reasoning. Preprint, arXiv:2203.14465.
Matej Zečević, Moritz Willig, Devendra Singh Dhami, and Kristian Kersting. 2023. Causal parrots: Large language models may talk causality but are not causal. Preprint, arXiv:2308.13067.
Di Zhang, Jianbo Wu, Jingdi Lei, Tong Che, Jiatong Li, Tong Xie, Xiaoshui Huang, Shufei Zhang, Marco Pavone, Yuqiang Li, Wanli Ouyang, and Dongzhan Zhou. 2024a. Llama-berry: Pairwise optimization for o1-like olympiad-level mathematical reasoning. Preprint, arXiv:2410.02884.
Hanning Zhang, Pengcheng Wang, Shizhe Diao, Yong Lin, Rui Pan, Hanze Dong, Dylan Zhang, Pavlo Molchanov, and Tong Zhang. 2024b. Entropy-regularized process reward model. Preprint, arXiv:2412.11006.
Hongbo Zhang, Han Cui, Guangsheng Bao, Linyi Yang, Jun Wang, and Yue Zhang. 2025. Direct value optimization: Improving chain-of-thought reasoning in llms with refined values. Preprint, arXiv:2502.13723.
Shun Zhang, Zhenfang Chen, Yikang Shen, Mingyu Ding, Joshua B. Tenenbaum, and Chuang Gan. 2023. Planning with large language models for code generation. Preprint, arXiv:2303.05510.
Zhen-Yu Zhang, Siwei Han, Huaxiu Yao, Gang Niu, and Masashi Sugiyama. 2024c. Generating chain-of-thoughts with a pairwise-comparison approach to searching for the most promising intermediate thought. Preprint, arXiv:2402.06918.
Zirui Zhao, Wee Sun Lee, and David Hsu. 2023. Large language models as commonsense knowledge for large-scale task planning. Preprint, arXiv:2305.14078.
Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. 2024. Language agent tree search unifies reasoning acting and planning in language models. Preprint, arXiv:2310.04406.
Tinghui Zhu, Kai Zhang, Jian Xie, and Yu Su. 2024. Deductive beam search: Decoding deducible rationale for chain-of-thought reasoning. Preprint, arXiv:2401.17686.
Xinyu Zhu, Junjie Wang, Lin Zhang, Yuxiang Zhang, Yongfeng Huang, Ruyi Gan, Jiaxing Zhang, and Yujiu Yang. 2023. Solving math word problems via cooperative reasoning induced language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4471–4485, Toronto, Canada. Association for Computational Linguistics.