
The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey

Tula Masterman* (Neudesic, an IBM Company), tula.masterman@neudesic.com
Sandi Besen* (IBM), sandi.besen@ibm.com
Mason Sawtell* (Neudesic, an IBM Company), mason.sawtell@neudesic.com
Alex Chao (Microsoft), achao@microsoft.com

* Denotes Equal Contribution

arXiv:2404.11584v1 [cs.AI] 17 Apr 2024

Abstract
This survey paper examines the recent advancements in AI agent implementations, with a focus on
their ability to achieve complex goals that require enhanced reasoning, planning, and tool execution
capabilities. The primary objectives of this work are to a) communicate the current capabilities and
limitations of existing AI agent implementations, b) share insights gained from our observations
of these systems in action, and c) suggest important considerations for future developments in AI
agent design. We achieve this by providing overviews of single-agent and multi-agent architectures,
identifying key patterns and divergences in design choices, and evaluating their overall impact on
accomplishing a provided goal. Our contribution outlines key themes when selecting an agentic
architecture, the impact of leadership on agent systems, agent communication styles, and key phases
for planning, execution, and reflection that enable robust AI agent systems.

Keywords AI Agent · Agent Architecture · AI Reasoning · Planning · Tool Calling · Single Agent · Multi Agent ·
Agent Survey · LLM Agent · Autonomous Agent

1 Introduction
Since the launch of ChatGPT, many of the first wave of generative AI applications have been variations of chat over
a corpus of documents using the Retrieval Augmented Generation (RAG) pattern. While there is significant activity in
making RAG systems more robust, various groups are starting to build the next generation of AI applications,
converging on a common theme: agents.
Beginning with investigations into recent foundation models like GPT-4 and popularized through open-source projects
like AutoGPT and BabyAGI, the research community has experimented with building autonomous agent-based systems
[19, 1].
As opposed to zero-shot prompting of a large language model where a user types into an open-ended text field and gets
a result without additional input, agents allow for more complex interaction and orchestration. In particular, agentic
systems have a notion of planning, loops, reflection and other control structures that heavily leverage the model’s
inherent reasoning capabilities to accomplish a task end-to-end. Paired with the ability to use tools, plugins, and
function calling, agents are empowered to do more general-purpose work.
Among the community, there is an ongoing debate about whether single-agent or multi-agent systems are best suited for
solving complex tasks. While single-agent architectures excel when problems are well-defined and feedback from other
agent personas or the user is not needed, multi-agent architectures tend to thrive when collaboration and multiple
distinct execution paths are required.

(The opinions expressed in this paper are solely those of the authors and do not necessarily reflect the views or
policies of their respective employers.)

Figure 1: A visualization of single and multi-agent architectures with their underlying features and abilities

1.1 Taxonomy

Agents. AI agents are language model-powered entities able to plan and take actions to execute goals over multiple
iterations. AI agent architectures are either comprised of a single agent or multiple agents working together to solve a
problem.
Typically, each agent is given a persona and access to a variety of tools that will help them accomplish their job either
independently or as part of a team. Some agents also contain a memory component, where they can save and load
information outside of their messages and prompts. In this paper, we follow the definition of an agent as consisting of
a “brain, perception, and action” [31]. These components satisfy the minimum requirements for agents to understand,
reason, and act on the environment around them.
Agent Persona. An agent persona describes the role or personality that the agent should take on, including any other
instructions specific to that agent. Personas also contain descriptions of any tools the agent has access to. They make
the agent aware of their role, the purpose of their tools, and how to leverage them effectively. Researchers have found
that “shaped personality verifiably influences Large Language Model (LLM) behavior in common downstream (i.e.
subsequent) tasks, such as writing social media posts” [21]. Solutions that use multiple agent personas to solve problems
also show significant improvements compared to Chain-of-Thought (CoT) prompting where the model is asked to break
down its plans step by step [28, 29].
Tools. In the context of AI agents, tools represent any functions that the model can call. They allow the agent to interact
with external data sources by pulling or pushing information to that source. An example of an agent persona and
associated tools is a professional contract writer. The writer is given a persona explaining their role and the types of
tasks it must accomplish. It is also given tools related to adding notes to a document, reading an existing document, or
sending an email with a final draft.
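
To make this concrete, the sketch below shows one plausible way to encode the contract writer's persona and tools. The schema is purely illustrative; it loosely mirrors common function-calling formats rather than any specific framework's API, and all names are hypothetical.

```python
# Hypothetical persona and tool definitions for the contract-writer example.
# The dictionary schema loosely follows common LLM function-calling formats;
# none of these names correspond to a real library's API.
persona = (
    "You are a professional contract writer. You draft, review, and "
    "finalize contracts. Use your tools to read documents, leave notes, "
    "and email final drafts."
)

tools = [
    {"name": "read_document",
     "description": "Return the full text of a document by its id.",
     "parameters": {"doc_id": "string"}},
    {"name": "add_note",
     "description": "Attach a review note to a section of a document.",
     "parameters": {"doc_id": "string", "section": "string", "note": "string"}},
    {"name": "send_email",
     "description": "Email the final draft to a recipient.",
     "parameters": {"to": "string", "subject": "string", "body": "string"}},
]

# At runtime, the persona becomes the system prompt and the tool descriptions
# are surfaced to the model so it knows when and how to call each tool.
```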
Single Agent Architectures. These architectures are powered by a single language model that performs all of the
reasoning, planning, and tool execution on its own. The agent is given a system prompt and any tools required to
complete its task. In single-agent patterns there is no feedback mechanism from other AI agents; however, there may be
options for humans to provide feedback that guides the agent.
Multi-Agent Architectures. These architectures involve two or more agents, where each agent can utilize the same
language model or a set of different language models. The agents may have access to the same tools or different tools.
Each agent typically has their own persona.
Multi-agent architectures can have a wide variety of organizations at any level of complexity. In this paper, we divide
them into two primary categories: vertical and horizontal. It is important to keep in mind that these categories represent
two ends of a spectrum, where most existing architectures fall somewhere between these two extremes.
Vertical Architectures. In this structure, one agent acts as a leader and has other agents report directly to them.
Depending on the architecture, reporting agents may communicate exclusively with the lead agent. Alternatively, a
leader may be defined with a shared conversation between all agents. The defining features of vertical architectures
include having a lead agent and a clear division of labor between the collaborating agents.
Horizontal Architectures. In this structure, all the agents are treated as equals and are part of one group discussion
about the task. Communication between agents occurs in a shared thread where each agent can see all messages from
the others. Agents can also volunteer to complete certain tasks or call tools, meaning tasks do not need to be assigned
by a leading agent. Horizontal architectures are generally used for tasks where collaboration, feedback, and group
discussion are key to the overall success of the task [2].

2 Key Considerations for Effective Agents


2.1 Overview

Agents are designed to extend language model capabilities to solve real-world challenges. Successful implementations
require robust problem-solving capabilities enabling agents to perform well on novel tasks. To solve real-world problems
effectively, agents require the ability to reason and plan as well as call tools that interact with an external environment.
In this section we explore why reasoning, planning, and tool calling are critical to agent success.

2.2 The Importance of Reasoning and Planning

Reasoning is a fundamental building block of human cognition, enabling people to make decisions, solve problems, and
understand the world around them. AI agents need a strong ability to reason if they are to effectively interact with complex
environments, make autonomous decisions, and assist humans in a wide range of tasks. This tight synergy between
“acting” and “reasoning” allows new tasks to be learned quickly and enables robust decision making or reasoning, even
under previously unseen circumstances or information uncertainties [32]. Additionally, agents need reasoning to adjust
their plans based on new feedback or information learned.
Agents that lack reasoning skills may misinterpret even a straightforward query, generate a response based on a literal
understanding of the request, or fail to consider multi-step implications.
Planning, which requires strong reasoning abilities, commonly falls into one of five major approaches: task decomposi-
tion, multi-plan selection, external module-aided planning, reflection and refinement, and memory-augmented planning
[12]. These approaches allow the model to break the task down into subtasks, select one plan from many
generated options, leverage a preexisting external plan, revise previous plans based on new information, or leverage
external information to improve the plan.
Most agent patterns have a dedicated planning step which invokes one or more of these techniques to create a plan
before any actions are executed. For example, Plan Like a Graph (PLaG) is an approach that represents plans as directed
graphs, with multiple steps being executed in parallel [15, 33]. This can provide a significant performance increase over
other methods on tasks that contain many independent subtasks that benefit from asynchronous execution.
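
To illustrate the plan-as-graph idea, the following sketch executes a small dependency graph of subtasks, running all subtasks whose dependencies are satisfied concurrently. This is a minimal sketch, not the PLaG authors' implementation; `run_subtask` is a hypothetical stand-in for an LLM or tool call.

```python
import asyncio

async def run_subtask(name: str) -> str:
    await asyncio.sleep(0.1)  # placeholder for an LLM or tool invocation
    return f"result of {name}"

async def execute_plan(dag: dict[str, list[str]]) -> dict[str, str]:
    """dag maps each subtask to the list of subtasks it depends on."""
    results: dict[str, str] = {}
    pending = dict(dag)
    while pending:
        # Every subtask whose dependencies are all complete runs in parallel.
        ready = [t for t, deps in pending.items()
                 if all(d in results for d in deps)]
        if not ready:
            raise ValueError("cycle detected in plan graph")
        outputs = await asyncio.gather(*(run_subtask(t) for t in ready))
        for task, output in zip(ready, outputs):
            results[task] = output
            del pending[task]
    return results

plan = {"gather_data": [], "draft_a": ["gather_data"],
        "draft_b": ["gather_data"], "merge": ["draft_a", "draft_b"]}
print(asyncio.run(execute_plan(plan)))  # draft_a and draft_b run concurrently
```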

2.3 The Importance of Effective Tool Calling

One key benefit of the agent abstraction over prompting base language models is the agents’ ability to solve complex
problems by calling multiple tools. These tools enable the agent to interact with external data sources, send or retrieve
information from existing APIs, and more. Problems that require extensive tool calling often go hand in hand with
those that require complex reasoning.
Both single-agent and multi-agent architectures can be used to solve challenging tasks by employing reasoning and tool
calling steps. Many methods use multiple iterations of reasoning, memory, and reflection to effectively and accurately

complete problems [16, 23, 32]. They often do this by breaking a larger problem into smaller subproblems, and then
solving each one with the appropriate tools in sequence.
Other works focused on advancing agent patterns highlight that while breaking a larger problem into smaller subproblems
can be effective for solving complex tasks, single-agent patterns often struggle to complete the long sequence of steps
required [22, 6].
Multi-agent patterns can address the issues of parallel tasks and robustness since individual agents can work on
individual subproblems. Many multi-agent patterns start by taking a complex problem and breaking it down into several
smaller tasks. Then, each agent works independently on solving its assigned task using its own independent set of tools.

3 Single Agent Architectures

3.1 Overview

In this section, we highlight some notable single-agent methods such as ReAct, RAISE, Reflexion, AutoGPT+P, and
LATS. Each of these methods contains a dedicated stage for reasoning about the problem before any action is taken to
advance the goal. We selected these methods based on their contributions to the reasoning and tool calling capabilities
of agents.

3.2 Key Themes

We find that successful goal execution by agents is contingent upon proper planning and self-correction [32, 16, 23, 1].
Without the ability to self-evaluate and create effective plans, single agents may get stuck in an endless execution loop
and never accomplish a given task, or they may return a result that does not meet user expectations [32]. We find that
single-agent architectures are especially useful when the task requires straightforward function calling and does not need feedback
from another agent [22].

3.3 Examples

ReAct. In the ReAct (Reason + Act) method, an agent first writes a thought about the given task. It then performs
an action based on that thought, and the output is observed. This cycle can repeat until the task is complete [32].
When applied to a diverse set of language and decision-making tasks, the ReAct method demonstrates improved
effectiveness compared to zero-shot prompting on the same tasks. It also provides improved human interpretability and
trustworthiness because the entire thought process of the model is recorded. When evaluated on the HotpotQA dataset,
the ReAct method hallucinated only 6% of the time, compared to 14% using the chain-of-thought (CoT) method [29,
32].
However, the ReAct method is not without its limitations. While intertwining reasoning, observation, and action
improves trustworthiness, the model can repetitively generate the same thoughts and actions, failing to create new
thoughts that would finish the task and exit the ReAct loop. Incorporating human feedback during the execution
of the task would likely increase its effectiveness and applicability in real-world scenarios.
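
The core thought-action-observation cycle can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: `llm` stands in for any completion function, `tools` for any mapping of tool names to callables, and the "tool[input]" action format is an assumption borrowed from the paper's examples.

```python
def react_loop(task: str, llm, tools: dict, max_steps: int = 10) -> str:
    """Minimal ReAct-style loop: thought -> action -> observation, repeated."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        # 1. Thought: the model reasons in text about what to do next.
        thought = llm(transcript + "Thought:")
        transcript += f"Thought: {thought}\n"
        # 2. Action: the model emits a tool call such as "search[query]".
        action = llm(transcript + "Action:")
        transcript += f"Action: {action}\n"
        if action.startswith("finish"):
            return action.removeprefix("finish").strip("[] ")
        name, arg = action.split("[", 1)
        # 3. Observation: the tool's output is appended and the cycle repeats.
        observation = tools[name](arg.rstrip("]"))
        transcript += f"Observation: {observation}\n"
    return "max steps reached without a final answer"
```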
RAISE. The RAISE method is built upon the ReAct method, with the addition of a memory mechanism that mirrors
human short-term and long-term memory [16]. It does this by using a scratchpad for short-term storage and a dataset of
similar previous examples for long-term storage.
By adding these components, RAISE improves upon the agent’s ability to retain context in longer conversations. The
paper also highlights how fine-tuning the model results in the best performance for their task, even when using a smaller
model. They also showed that RAISE outperforms ReAct in both efficiency and output quality.
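
The sketch below illustrates the dual-memory idea in simplified form. It is an assumption-laden reconstruction, not the RAISE codebase: `llm` is any completion function and `similarity` is any text-similarity measure (for example, cosine similarity over embeddings).

```python
class RaiseStyleAgent:
    """Toy agent with a short-term scratchpad and long-term example pool."""

    def __init__(self, llm, similarity, example_pool: list[tuple[str, str]]):
        self.llm = llm
        self.similarity = similarity
        self.example_pool = example_pool   # long-term: prior (Q, A) pairs
        self.scratchpad: list[str] = []    # short-term: working notes

    def retrieve_examples(self, query: str, k: int = 3):
        # Long-term memory lookup: the k most similar previous examples.
        ranked = sorted(self.example_pool,
                        key=lambda qa: self.similarity(query, qa[0]),
                        reverse=True)
        return ranked[:k]

    def respond(self, query: str) -> str:
        examples = self.retrieve_examples(query)
        prompt = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
        prompt += "\nNotes so far: " + "; ".join(self.scratchpad)
        prompt += f"\nQ: {query}\nA:"
        answer = self.llm(prompt)
        self.scratchpad.append(f"{query} -> {answer}")  # update working memory
        return answer
```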
While RAISE significantly improves upon existing methods in some respects, the researchers also highlighted several
issues. First, RAISE struggles to understand complex logic, limiting its usefulness in many scenarios. Additionally,
RAISE agents often hallucinated with respect to their roles or knowledge. For example, a sales agent without a clearly
defined role might retain the ability to code in Python, which may lead it to start writing Python code instead of
focusing on its sales tasks. These agents might also give the user misleading or incorrect information. This problem
was addressed by fine-tuning the model, but the researchers still highlighted hallucination as a limitation in the RAISE
implementation.
Reflexion. Reflexion is a single-agent pattern that uses self-reflection through linguistic feedback [23]. By utilizing
metrics such as success state, current trajectory, and persistent memory, this method uses an LLM evaluator to provide

Figure 2: An example of the ReAct method compared to other methods [32]

[Diagram: the RAISE agent loop routes a user query through example retrieval and memory updates, drawing on LLMs (API-based such as GPT-4, GPT-3.5, and Claude, or open-source such as Llama, Qwen, and Baichuan), a working memory (system prompt, profile, task instruction, conversation history, scratchpad, retrieved examples, task trajectory), a tool pool (database access, scripting and programming tools, knowledge bases, AI and machine learning tools), and an example pool of (Q, A) pairs.]

Figure 3: A diagram showing the RAISE method [16]

specific and relevant feedback to the agent. This results in an improved success rate as well as reduced hallucination
compared to Chain-of-Thought and ReAct.
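
In outline, the Reflexion loop can be captured as follows. This is a schematic sketch under assumed helpers, not the paper's implementation: `actor`, `evaluator`, and `reflect` stand in for the LLM calls that attempt the task, score the trajectory, and produce a verbal lesson.

```python
def reflexion_loop(task: str, actor, evaluator, reflect, max_trials: int = 5):
    """Attempt a task repeatedly, feeding verbal self-reflections forward."""
    reflections: list[str] = []           # persistent linguistic memory
    trajectory = None
    for _ in range(max_trials):
        trajectory = actor(task, reflections)         # attempt the task
        success, feedback = evaluator(task, trajectory)
        if success:
            return trajectory
        # Convert the evaluator's signal into a lesson for the next attempt.
        reflections.append(reflect(task, trajectory, feedback))
    return trajectory                      # best effort after max trials
```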
Despite these advancements, the Reflexion authors identify various limitations of the pattern. Primarily, Reflexion
is susceptible to “non-optimal local minima solutions”. It also uses a sliding window for long-term memory, rather
than a database. This means that the volume of long-term memory is limited by the token limit of the language model.
Finally, the researchers identify that while Reflexion surpasses other single-agent patterns, there are still opportunities
to improve performance on tasks that require a significant amount of diversity, exploration, and reasoning.
AutoGPT+P. AutoGPT+P (Planning) is a method that addresses reasoning limitations for agents that command
robots in natural language [1]. AutoGPT+P combines object detection and Object Affordance Mapping (OAM) with
a planning system driven by an LLM. This allows the agent to explore the environment for missing objects, propose
alternatives, or ask the user for assistance with reaching its goal.

AutoGPT+P starts by using an image of a scene to detect the objects present. A language model then uses those objects
to select which tool to use, from four options: Plan Tool, Partial Plan Tool, Suggest Alternative Tool, and Explore Tool.
These tools allow the robot to not only generate a full plan to complete the goal, but also to explore the environment,
make assumptions, and create partial plans.
However, the language model does not generate the plan entirely on its own. Instead, it generates goals and steps to
work alongside a classical planner, which executes the plan using the Planning Domain Definition Language (PDDL). The
paper found that “LLMs currently lack the ability to directly translate a natural language instruction into a plan for
executing robotic tasks, primarily due to their constrained reasoning capabilities” [1]. By combining the LLM planning
capabilities with a classical planner, their approach significantly improves upon other purely language model-based
approaches to robotic planning.
As with most first-of-their-kind approaches, AutoGPT+P is not without its drawbacks. Accuracy of tool selection varies,
with certain tools being called inappropriately or getting stuck in loops. In scenarios where exploration is required,
the tool selection sometimes leads to illogical exploration decisions, like looking for objects in the wrong place. The
framework is also limited in terms of human interaction, with the agent being unable to seek clarification and the user
being unable to modify or terminate the plan during execution.

Figure 4: A diagram of the AutoGPT+P method [1]

LATS. Language Agent Tree Search (LATS) is a single-agent method that synergizes planning, acting, and reasoning
by using trees [36]. This technique, inspired by Monte Carlo Tree Search, represents a state as a node and taking an
action as traversing between nodes. It uses LM-based heuristics to search for possible options, then selects an action
using a state evaluator.
When compared to other tree-based methods, LATS implements a self-reflection reasoning step that dramatically
improves performance. When an action is taken, both environmental feedback and feedback from a language
model are used to determine whether there are any errors in reasoning and to propose alternatives. This ability to self-reflect,
combined with a powerful search algorithm, makes LATS perform extremely well on various tasks.
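
The skeleton below shows the Monte Carlo Tree Search structure that LATS builds on. It is a toy reconstruction under stated assumptions: `expand` stands in for LM-based proposal of successor states and `evaluate` for the state evaluator (including reflection-based feedback); the paper's prompting and reflection details are omitted.

```python
import math

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children: list["Node"] = []
        self.visits, self.value = 0, 0.0

def uct(node: Node, c: float = 1.4) -> float:
    # Unvisited nodes are tried first; otherwise balance average value
    # against an exploration bonus, as in standard UCT.
    if node.visits == 0:
        return float("inf")
    return (node.value / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def tree_search(root_state, expand, evaluate, iterations: int = 25):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        while node.children:                     # selection: descend by UCT
            node = max(node.children, key=uct)
        for next_state in expand(node.state):    # expansion via the LM
            node.children.append(Node(next_state, parent=node))
        reward = evaluate(node.state)            # evaluation (+ reflection)
        while node is not None:                  # backpropagation
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda n: n.visits).state
```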
However, due to the complexity of the algorithm and the reflection steps involved, LATS often uses more computational
resources and takes more time to complete than other single-agent methods [36]. The paper also uses relatively simple
question answering benchmarks and has not been tested on more robust scenarios involving tool calling or
complex reasoning.

4 Multi Agent Architectures


4.1 Overview

In this section, we examine a few key studies and sample frameworks with multi-agent architectures, such as Embodied
LLM Agents Learn to Cooperate in Organized Teams, DyLAN, AgentVerse, and MetaGPT. We highlight how these
implementations facilitate goal execution through inter-agent communication and collaborative plan execution. This is
not intended to be an exhaustive list of all agent frameworks; rather, our goal is to provide broad coverage of key themes
and examples related to multi-agent patterns.

4.2 Key Themes

Multi-agent architectures create an opportunity for both the intelligent division of labor based on skill and helpful
feedback from a variety of agent personas. Many multi-agent architectures work in stages where teams of agents are

created and reorganized dynamically for each planning, execution, and evaluation phase [2, 9, 18]. This reorganization
provides superior results because specialized agents are employed for certain tasks, and removed when they are no
longer needed. By matching agents roles and skills to the task at hand, agent teams can achieve greater accuracy
and decrease time to meet the goal. Key features of effective multi-agent architectures include clear leadership in
agent teams, dynamic team construction, and effective information sharing between team members so that important
information does not get lost in superfluous chatter.

4.3 Examples

Embodied LLM Agents Learn to Cooperate in Organized Teams. Research by Guo et al. demonstrates the impact
of a lead agent on the overall effectiveness of the agent team [9]. This architecture contains a vertical component
through the leader agent, as well as a horizontal component from the ability for agents to converse with other agents
besides the leader. The results of their study demonstrate that agent teams with an organized leader complete their tasks
nearly 10% faster than teams without a leader.
Furthermore, they discovered that in teams without a designated leader, agents spent most of their time giving orders
to one another (~50% of communication), splitting their remaining time between sharing information and requesting
guidance. Conversely, in teams with a designated leader, 60% of the leader’s communication involved giving directions,
prompting other members to focus more on exchanging and requesting information. Their results demonstrate that
agent teams are most effective when the leader is a human.

Figure 5: Agent teams with a designated leader achieve superior performance [9].

Beyond team structure, the paper emphasizes the importance of employing a “criticize-reflect” step for generating plans,
evaluating performance, providing feedback, and re-organizing the team [9]. Their results indicate that agents with a
dynamic team structure with rotating leadership provide the best results, with both the lowest time to task completion
and the lowest communication cost on average. Ultimately, leadership and dynamic team structures improve the overall
team’s ability to reason, plan, and perform tasks effectively.
DyLAN. The Dynamic LLM-Agent Network (DyLAN) framework creates a dynamic agent structure that focuses on
complex tasks like reasoning and code generation [18]. DyLAN has a specific step for determining how much each
agent has contributed in the last round of work and only moves top contributors to the next round of execution. This
method is horizontal in nature since agents can share information with each other and there is no defined leader. DyLAN
shows improved performance on a variety of benchmarks that measure arithmetic and general reasoning capabilities.
This highlights the impact of dynamic teams and demonstrates that by consistently re-evaluating and ranking agent
contributions, we can create agent teams that are better suited to complete a given task.
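
A stripped-down version of this contribution-ranking loop might look as follows. This is a hypothetical sketch, not DyLAN's actual algorithm: agents are assumed to expose a `step` method, and `score_contribution` stands in for the LLM-based rating of each agent's latest message.

```python
def run_dynamic_rounds(agents: list, task: str, score_contribution,
                       rounds: int = 3, keep: int = 3) -> str:
    """Each round, rank agents by contribution and keep only the top ones."""
    active, context = list(agents), ""
    for _ in range(rounds):
        messages = [(agent, agent.step(task, context)) for agent in active]
        # Rank messages by how much they advanced the task, then prune.
        messages.sort(key=lambda pair: score_contribution(task, pair[1]),
                      reverse=True)
        active = [agent for agent, _ in messages[:keep]]
        context = "\n".join(msg for _, msg in messages[:keep])
    return context
```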
AgentVerse. Multi-agent architectures like AgentVerse demonstrate how distinct phases for group planning can
improve an AI agent’s reasoning and problem-solving capabilities [2]. AgentVerse contains four primary stages for
task execution: recruitment, collaborative decision making, independent action execution, and evaluation. This can be
repeated until the overall goal is achieved. By strictly defining each phase, AgentVerse helps guide the set of agents to
reason, discuss, and execute more effectively.
As an example, the recruitment step allows agents to be removed or added based on the progress towards the goal. This
helps ensure that the right agents are participating at any given stage of problem solving. The researchers found that
horizontal teams are generally best suited for collaborative tasks like consulting, while vertical teams are better suited
for tasks that require clearer isolation of responsibilities for tool calling.
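
The four-stage cycle can be summarized schematically, as below. The helpers `recruit`, `group_decide`, and `evaluate` are hypothetical stand-ins for AgentVerse's LLM-driven stages; this is an outline of the control flow, not the framework's code.

```python
def agentverse_loop(goal: str, recruit, group_decide, evaluate,
                    max_rounds: int = 5):
    """Recruit -> decide -> act -> evaluate, repeated until the goal is met."""
    state = {"goal": goal, "progress": None}
    for _ in range(max_rounds):
        team = recruit(state)                          # 1. expert recruitment
        plan = group_decide(team, state)               # 2. collaborative decision-making
        actions = [agent.act(plan) for agent in team]  # 3. independent action execution
        state["progress"], done = evaluate(state, actions)  # 4. evaluation
        if done:
            break
    return state["progress"]
```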

[Diagram: over N rounds, AgentVerse cycles through expert recruitment, collaborative decision-making (over M turns), action execution, and evaluation; recruited agents include roles such as architect, designer, engineer, and worker.]

Figure 6: A diagram of the AgentVerse method [2]

MetaGPT. Many multi-agent architectures allow agents to converse with one another while collaborating on a common
problem. This conversational capability can lead to chatter between the agents that is superfluous and does not further
the team goal. MetaGPT addresses the issue of unproductive chatter amongst agents by requiring agents to generate
structured outputs like documents and diagrams instead of sharing unstructured chat messages [11].
Additionally, MetaGPT implements a “publish-subscribe” mechanism for information sharing. This allows all the
agents to share information in one place, but only read information relevant to their individual goals and tasks. This
streamlines the overall goal execution and reduces conversational noise between agents. When compared to single-agent
architectures on the HumanEval and MBPP benchmarks, MetaGPT’s multi-agent architecture demonstrates significantly
better results.
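
A minimal message pool illustrating this publish-subscribe filtering is sketched below. The class and topic names are illustrative and do not correspond to MetaGPT's actual interfaces; the point is that agents read only the topics they subscribed to rather than the full conversation.

```python
from collections import defaultdict

class MessagePool:
    """Shared pool: everyone publishes, but each agent reads only its topics."""

    def __init__(self):
        self.messages = defaultdict(list)        # topic -> structured outputs
        self.subscriptions = defaultdict(set)    # agent -> subscribed topics

    def subscribe(self, agent: str, *topics: str):
        self.subscriptions[agent].update(topics)

    def publish(self, topic: str, document: str):
        self.messages[topic].append(document)

    def read(self, agent: str) -> list[str]:
        # Agents see only information relevant to their role, not all chatter.
        return [doc for topic in self.subscriptions[agent]
                for doc in self.messages[topic]]

pool = MessagePool()
pool.subscribe("engineer", "design_doc", "api_spec")
pool.publish("design_doc", "Architecture: three services behind a gateway.")
pool.publish("marketing_notes", "Tagline brainstorm...")  # engineer never sees this
print(pool.read("engineer"))
```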

5 Discussion and Observations

5.1 Overview

In this section we discuss the key themes and impacts of the design choices exhibited in the previously outlined agent
patterns. These patterns serve as key examples of the growing body of research and implementation of AI agent
architectures. Both single and multi-agent architectures seek to enhance the capabilities of language models by giving
them the ability to execute goals on behalf of or alongside a human user. Most observed agent implementations broadly
follow the plan, act, and evaluate process to iteratively solve problems.
We find that both single and multi-agent architectures demonstrate compelling performance on complex goal execution.
We also find that across architectures clear feedback, task decomposition, iterative refinement, and role definition yield
improved agent performance.

5.2 Key Findings

Typical Conditions for Selecting a Single vs Multi-Agent Architecture. Based on the aforementioned agent patterns,
we find that single-agent patterns are generally best suited for tasks with a narrowly defined list of tools and where
processes are well-defined. Single agents are also typically easier to implement since only one agent and set of tools
needs to be defined. Additionally, single agent architectures do not face limitations like poor feedback from other agents
or distracting and unrelated chatter from other team members. However, they may get stuck in an execution loop and
fail to make progress towards their goal if their reasoning and refinement capabilities are not robust.

Multi-agent architectures are generally well-suited for tasks where feedback from multiple personas is beneficial in
accomplishing the task. For example, document generation may benefit from a multi-agent architecture where one
agent provides clear feedback to another on a written section of the document. Multi-agent systems are also useful
when parallelization across distinct tasks or workflows is required. Crucially, Wang et al. find that multi-agent patterns
perform better than single agents in scenarios where no examples are provided [26]. By nature, multi-agent systems are
more complex and often benefit from robust conversation management and clear leadership.
While single and multi-agent patterns have diverging capabilities in terms of scope, research finds that “multi-agent
discussion does not necessarily enhance reasoning when the prompt provided to an agent is sufficiently robust” [26].
This suggests that those implementing agent architectures should decide between single or multiple agents based on the
broader context of their use case, and not based on the reasoning capabilities required.
Agents and Asynchronous Task Execution. While a single agent can initiate multiple asynchronous calls simulta-
neously, its operational model does not inherently support the division of responsibilities across different execution
threads. This means that, although tasks are handled asynchronously, they are not truly parallel in the sense of being
autonomously managed by separate decision-making entities. Instead, the single agent must sequentially plan and
execute tasks, waiting for one batch of asynchronous operations to complete before it can evaluate and move on to the
next step. Conversely, in multi-agent architectures, each agent can operate independently, allowing for a more dynamic
division of labor. This structure not only facilitates simultaneous task execution across different domains or objectives
but also allows individual agents to proceed with their next steps without being hindered by the state of tasks handled
by others, embodying a more flexible and parallel approach to task management.
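
The difference can be made concrete with a small simulation. This sketch uses placeholder coroutines rather than real LLM calls: the single agent must await each batch of tool calls before planning its next step, while separate agents each run their own loops concurrently.

```python
import asyncio

async def tool_call(name: str) -> str:
    await asyncio.sleep(0.1)              # stand-in for a real tool or LLM call
    return f"{name} done"

async def single_agent() -> list[str]:
    # One decision-maker: fan out a batch, wait, then plan the next step.
    batch1 = await asyncio.gather(tool_call("search"), tool_call("fetch"))
    batch2 = await asyncio.gather(tool_call("summarize"))  # blocked on batch1
    return list(batch1) + list(batch2)

async def agent_loop(agent_id: int) -> list[str]:
    # An independent decision-maker working through its own plan-act cycle.
    return [await tool_call(f"agent{agent_id}:{step}")
            for step in ("plan", "act", "review")]

async def multi_agent() -> list[list[str]]:
    # Agents progress without waiting on the state of each other's tasks.
    return await asyncio.gather(*(agent_loop(i) for i in range(3)))

print(asyncio.run(single_agent()))
print(asyncio.run(multi_agent()))
```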
Impact of Feedback and Human Oversight on Agent Systems. When solving a complex problem, it is extremely
unlikely that a person provides a correct, robust solution on the first try. Instead, one might pose a potential solution before
criticizing and refining it. One could also consult with someone else and receive feedback from another perspective.
The same idea of iterative feedback and refinement is essential for helping agents solve complex problems.
This is partially because language models tend to commit to an answer early in their response, which can cause a
‘snowball effect’ of increasing divergence from their goal state [34]. By implementing feedback, agents are much more
likely to correct their course and reach their goal.
Additionally, the inclusion of human oversight improves the immediate outcome by aligning the agent’s responses more
closely with human expectations and mitigating the potential for agents to go down an inefficient or invalid path when
solving a task. As of today, including human validation and feedback in the agent architecture yields more reliable and
trustworthy results [4, 9].
Language models also exhibit sycophantic behavior, where they “tend to mirror the user’s stance, even if it means
forgoing the presentation of an impartial or balanced viewpoint” [20]. Specifically, the AgentVerse paper describes how
agents are susceptible to feedback from other agents, even if the feedback is not sound. This can lead the agent team to
generate a faulty plan which diverts them from their objective [2]. Robust prompting can help mitigate this, but those
developing agent applications should be aware of the risks when implementing user or agent feedback systems.
Challenges with Group Conversations and Information Sharing. One challenge with multi-agent architectures
lies in their ability to intelligently share messages between agents. Multi-agent patterns have a greater tendency to get
caught up in niceties and ask one another things like “how are you”, while single agent patterns tend to stay focused on
the task at hand since there is no team dynamic to manage. The extraneous dialogue in multi-agent systems can impair
the agents’ ability both to reason effectively and to execute the right tools, ultimately distracting them from the task
and decreasing team efficiency. This is especially true in a horizontal architecture, where agents typically share a group
chat and are privy to every agent’s message in a conversation. Message subscribing or filtering improves multi-agent
performance by ensuring agents only receive information relevant to their tasks.
In vertical architectures, tasks tend to be clearly divided by agent skill which helps reduce distractions in the team.
However, challenges arise when the leading agent fails to send critical information to their supporting agents and does
not realize the other agents aren’t privy to necessary information. This failure can lead to confusion in the team or
hallucination in the results. One approach to address this issue is to explicitly include information about access rights in
the system prompt so that the agents have contextually appropriate interactions.
Impact of Role Definition and Dynamic Teams. Clear role definition is critical for both single and multi-agent
architectures. In single-agent architectures role definition ensures that the agent stays focused on the provided task,
executes the proper tools, and minimizes hallucination of other capabilities. Similarly, role definition in multi-agent
architectures ensures each agent knows what it is responsible for in the overall team and does not take on tasks outside
of its described capabilities or scope. Beyond individual role definition, establishing a clear group leader also
improves the overall performance of multi-agent teams by streamlining task assignment. Furthermore, defining a clear

system prompt for each agent can minimize excess chatter by prompting the agents not to engage in unproductive
communication.
Dynamic teams where agents are brought in and out of the system based on need have also been shown to be effective.
This ensures that all agents participating in the planning or execution of tasks are fit for that round of work.

5.3 Summary

Both single and multi-agent patterns exhibit strong performance on a variety of complex tasks involving reasoning and
tool execution. Single agent patterns perform well when given a defined persona and set of tools, opportunities for
human feedback, and the ability to work iteratively towards their goal. When constructing an agent team that needs to
collaborate on complex goals, it is beneficial to deploy agents with at least one of these key elements: clear leader(s), a
defined planning phase and opportunities to refine the plan as new information is learned, intelligent message filtering,
and dynamic teams whose agents possess specific skills relevant to the current sub-task. An agent architecture that
employs at least one of these approaches is likely to achieve increased performance compared to a single-agent
architecture or a multi-agent architecture without these tactics.

6 Limitations of Current Research and Considerations for Future Research


6.1 Overview

In this section we examine some of the limitations of agent research today and identify potential areas for improving AI
agent systems. While agent architectures have significantly enhanced the capability of language models in many ways,
there are some major challenges around evaluations, overall reliability, and issues inherited from the language models
powering each agent.

6.2 Challenges with Agent Evaluation

While LLMs are evaluated on a standard set of benchmarks designed to gauge their general understanding and reasoning
capabilities, the benchmarks for agent evaluation vary greatly.
Many research teams introduce their own unique agent benchmarks alongside their agent implementation which makes
comparing multiple agent implementations on the same benchmark challenging. Additionally, many of these new
agent-specific benchmarks include a hand-crafted, highly complex, evaluation set where the results are manually scored
[2]. This can provide a high-quality assessment of a method’s capabilities, but it also lacks the robustness of a larger
dataset and risks introducing bias into the evaluation, since the ones developing the method are also the ones writing
and scoring the results. Agents can also have problems generating a consistent answer over multiple iterations, due
to variability in the models, environment, or problem state. This added randomness poses a much larger problem for
smaller, complex evaluation sets.

6.3 Impact of Data Contamination and Static Benchmarks

Some researchers evaluate their agent implementations on the typical LLM benchmarks. Emerging research indicates
that there is significant data contamination in the model’s training data, supported by the observation that a model’s
performance significantly worsens when benchmark questions are modified [8, 38, 37]. This raises doubts about the
authenticity of benchmark scores for both language models and language model-powered agents.
Furthermore, researchers have found that “As LLMs progress at a rapid pace, existing datasets usually fail to match the
models’ ever-evolving capabilities, because the complexity level of existing benchmarks is usually static and fixed”
[37]. To address this, work has been done to create dynamic benchmarks that are resistant to simple memorization [38,
37]. Researchers have also explored the idea of generating an entirely synthetic benchmark based on a user’s specific
environment or use case [14, 27]. While these techniques can help with contamination, decreasing the level of human
involvement can pose additional risks regarding correctness and the ability to solve problems.

6.4 Benchmark Scope and Transferability

Many language model benchmarks are designed to be solved in a single iteration, with no tool calls, such as MMLU
or GSM8K [3, 10]. While these are important for measuring the abilities of base language models, they are not good
proxies for agent capabilities because they do not account for agent systems’ ability to reason over multiple steps or
access outside information. StrategyQA improves upon this by assessing models’ reasoning abilities over multiple

steps, but the answers are limited to Yes/No responses [7]. As the industry continues to pivot towards agent-focused
use cases, additional measures will be needed to better assess the performance and generalizability of agents on tasks
involving tools that extend beyond their training data.
Some agent-specific benchmarks like AgentBench evaluate language model-based agents in a variety of different
environments, such as web browsing, command-line interfaces, and video games [17]. This provides a better indication
of how well agents can generalize to new environments by reasoning, planning, and calling tools to achieve a given
task. Benchmarks like AgentBench and SmartPlay introduce objective evaluation metrics designed to evaluate the
implementation’s success rate, output similarity to human responses, and overall efficiency [17, 30]. While these
objective metrics are important to understanding the overall reliability and accuracy of the implementation, it is also
important to consider more nuanced or subjective measures of performance. Metrics such as efficiency of tool use,
reliability, and robustness of planning are nearly as important as success rate but are much more difficult to measure.
Many of these metrics require evaluation by a human expert, which can be costly and time-consuming compared to
LLM-as-judge evaluations.

6.5 Real-world Applicability

Many of the existing benchmarks focus on the ability of agent systems to reason over logic puzzles or video games
[17]. While evaluating performance on these types of tasks can help gauge the reasoning capabilities of agent
systems, it is unclear whether performance on these benchmarks translates to real-world performance. Specifically,
real-world data can be noisy and covers a much wider breadth of topics than many common benchmarks capture.
One popular benchmark that uses real-world data is WildBench, which is sourced from the WildChat dataset of 570,000
real conversations with ChatGPT [35]. Because of this, it covers a huge breadth of tasks and prompts. While WildBench
covers a wide range of topics, most other real-world benchmarks focus on a specific task. For example, SWE-bench is a
benchmark that uses a set of real-world issues raised on GitHub for software engineering tasks in Python [13]. This
can be very helpful when evaluating agents designed to write Python code and provides a sense of how well agents
can reason about code-related problems; however, it is less informative when trying to understand agent capabilities
involving other programming languages.

6.6 Bias and Fairness in Agent Systems

Language models have been known to exhibit bias both in evaluation settings and in social or fairness terms [5].
Moreover, agents have specifically been shown to be “less robust, prone to more harmful behaviors, and capable of
generating stealthier content than LLMs, highlighting significant safety challenges” [25]. Other research has found “a
tendency for LLM agents to conform to the model’s inherent social biases despite being directed to debate from certain
political perspectives” [24]. This tendency can lead to faulty reasoning in any agent-based implementation.
As the complexity of tasks and agent involvement increases, more research is needed to identify and address biases
within these systems. This poses a very large challenge to researchers, since scalable and novel benchmarks often
involve some level of LLM involvement during creation. However, a truly robust benchmark for evaluating bias in
LLM-based agents must include human evaluation.

7 Conclusion and Future Directions


The AI agent implementations explored in this survey demonstrate the rapid enhancement of language model-powered
reasoning, planning, and tool calling. Single and multi-agent patterns both show the ability to tackle complex multi-step
problems that require advanced problem-solving skills. The key insights discussed in this paper suggest that the best
agent architecture varies based on use case. Regardless of the architecture selected, the best-performing agent systems
tend to incorporate at least one of the following approaches: well-defined system prompts, clear leadership and task
division, dedicated phases for planning, execution, and evaluation, dynamic team structures, human or agentic
feedback, and intelligent message filtering. Architectures that leverage these techniques are more effective across a
variety of benchmarks and problem types.
While the current state of AI-driven agents is promising, there are notable limitations and areas for future improvement.
Challenges around comprehensive agent benchmarks, real-world applicability, and the mitigation of harmful language
model biases will need to be addressed in the near term to enable reliable agents. By examining the progression from
static language models to more dynamic, autonomous agents, this survey aims to provide a holistic understanding of the
current AI agent landscape and offer insight for those building with existing agent architectures or developing custom
agent architectures.

References
[1] Timo Birr et al. AutoGPT+P: Affordance-based Task Planning with Large Language Models. arXiv:2402.10778
[cs] version: 1. Feb. 2024. URL: http://arxiv.org/abs/2402.10778.
[2] Weize Chen et al. AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors.
arXiv:2308.10848 [cs]. Oct. 2023. URL: http://arxiv.org/abs/2308.10848.
[3] Karl Cobbe et al. Training Verifiers to Solve Math Word Problems. arXiv:2110.14168 [cs]. Nov. 2021. URL:
http://arxiv.org/abs/2110.14168.
[4] Xueyang Feng et al. Large Language Model-based Human-Agent Collaboration for Complex Task Solving. 2024.
arXiv: 2402.12914 [cs.CL].
[5] Isabel O. Gallegos et al. Bias and Fairness in Large Language Models: A Survey. arXiv:2309.00770 [cs]. Mar.
2024. URL: http://arxiv.org/abs/2309.00770.
[6] Silin Gao et al. Efficient Tool Use with Chain-of-Abstraction Reasoning. arXiv:2401.17464 [cs]. Feb. 2024. URL:
http://arxiv.org/abs/2401.17464.
[7] Mor Geva et al. Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies.
arXiv:2101.02235 [cs]. Jan. 2021. URL: http://arxiv.org/abs/2101.02235.
[8] Shahriar Golchin and Mihai Surdeanu. Time Travel in LLMs: Tracing Data Contamination in Large Language
Models. arXiv:2308.08493 [cs] version: 3. Feb. 2024. URL: http://arxiv.org/abs/2308.08493.
[9] Xudong Guo et al. Embodied LLM Agents Learn to Cooperate in Organized Teams. 2024. arXiv: 2403.12482
[cs.AI].
[10] Dan Hendrycks et al. Measuring Massive Multitask Language Understanding. arXiv:2009.03300 [cs]. Jan. 2021.
URL: http://arxiv.org/abs/2009.03300.
[11] Sirui Hong et al. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. 2023. arXiv:
2308.00352 [cs.AI].
[12] Xu Huang et al. Understanding the planning of LLM agents: A survey. 2024. arXiv: 2402.02716 [cs.AI].
[13] Carlos E. Jimenez et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
arXiv:2310.06770 [cs]. Oct. 2023. URL: http://arxiv.org/abs/2310.06770.
[14] Fangyu Lei et al. S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models.
arXiv:2310.15147 [cs]. Oct. 2023. URL: http://arxiv.org/abs/2310.15147.
[15] Fangru Lin et al. Graph-enhanced Large Language Models in Asynchronous Plan Reasoning. arXiv:2402.02805
[cs]. Feb. 2024. URL: http://arxiv.org/abs/2402.02805.
[16] Na Liu et al. From LLM to Conversational Agent: A Memory Enhanced Architecture with Fine-Tuning of Large
Language Models. arXiv:2401.02777 [cs]. Jan. 2024. URL: http://arxiv.org/abs/2401.02777.
[17] Xiao Liu et al. AgentBench: Evaluating LLMs as Agents. arXiv:2308.03688 [cs]. Oct. 2023. URL: http://arxiv.org/abs/2308.03688.
[18] Zijun Liu et al. Dynamic LLM-Agent Network: An LLM-agent Collaboration Framework with Agent Team
Optimization. 2023. arXiv: 2310.02170 [cs.CL].
[19] Yohei Nakajima. yoheinakajima/babyagi. original-date: 2023-04-03T00:40:27Z. Apr. 2024. URL: https://
github.com/yoheinakajima/babyagi.
[20] Peter S. Park et al. AI Deception: A Survey of Examples, Risks, and Potential Solutions. arXiv:2308.14752 [cs].
Aug. 2023. URL: http://arxiv.org/abs/2308.14752.
[21] Greg Serapio-García et al. Personality Traits in Large Language Models. 2023. arXiv: 2307.00184 [cs.CL].
[22] Zhengliang Shi et al. Learning to Use Tools via Cooperative and Interactive Agents. arXiv:2403.03031 [cs]. Mar.
2024. URL: http://arxiv.org/abs/2403.03031.
[23] Noah Shinn et al. Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366 [cs]. Oct.
2023. URL: http://arxiv.org/abs/2303.11366.
[24] Amir Taubenfeld et al. Systematic Biases in LLM Simulations of Debates. arXiv:2402.04049 [cs]. Feb. 2024.
URL: http://arxiv.org/abs/2402.04049.
[25] Yu Tian et al. Evil Geniuses: Delving into the Safety of LLM-based Agents. arXiv:2311.11855 [cs]. Feb. 2024.
URL: http://arxiv.org/abs/2311.11855.
[26] Qineng Wang et al. Rethinking the Bounds of LLM Reasoning: Are Multi-Agent Discussions the Key?
arXiv:2402.18272 [cs]. Feb. 2024. URL: http://arxiv.org/abs/2402.18272.
[27] Siyuan Wang et al. Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation.
arXiv:2402.11443 [cs]. Feb. 2024. URL: http://arxiv.org/abs/2402.11443.

[28] Zhenhailong Wang et al. Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving
Agent through Multi-Persona Self-Collaboration. 2024. arXiv: 2307.05300 [cs.AI].
[29] Jason Wei et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903
[cs]. Jan. 2023. URL: http://arxiv.org/abs/2201.11903.
[30] Yue Wu et al. SmartPlay: A Benchmark for LLMs as Intelligent Agents. arXiv:2310.01557 [cs]. Mar. 2024. URL:
http://arxiv.org/abs/2310.01557.
[31] Zhiheng Xi et al. The Rise and Potential of Large Language Model Based Agents: A Survey. 2023. arXiv:
2309.07864 [cs.AI].
[32] Shunyu Yao et al. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629 [cs]. Mar.
2023. URL: http://arxiv.org/abs/2210.03629.
[33] Shunyu Yao et al. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601
[cs]. Dec. 2023. URL: http://arxiv.org/abs/2305.10601.
[34] Muru Zhang et al. How Language Model Hallucinations Can Snowball. arXiv:2305.13534 [cs]. May 2023. URL:
http://arxiv.org/abs/2305.13534.
[35] Wenting Zhao et al. “(InThe)WildChat: 570K ChatGPT Interaction Logs In The Wild”. In: The Twelfth In-
ternational Conference on Learning Representations. 2024. URL: https://openreview.net/forum?id=
Bl8u7ZRlbM.
[36] Andy Zhou et al. Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models.
arXiv:2310.04406 [cs]. Dec. 2023. URL: http://arxiv.org/abs/2310.04406.
[37] Kaijie Zhu et al. DyVal 2: Dynamic Evaluation of Large Language Models by Meta Probing Agents.
arXiv:2402.14865 [cs]. Feb. 2024. URL: http://arxiv.org/abs/2402.14865.
[38] Kaijie Zhu et al. DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks. arXiv:2309.17167
[cs]. Mar. 2024. URL: http://arxiv.org/abs/2309.17167.

