Title
Large Language Models for Collective Problem-Solving: Insights into Group Consensus
Decision-Making
Permalink
https://escholarship.org/uc/item/6s060914
Journal
Proceedings of the Annual Meeting of the Cognitive Science Society, 46(0)
Authors
Du, Yinuo
Rajivan, Prashanth
Gonzalez, Cleotilde
Publication Date
2024
Peer reviewed
Prashanth Rajivan (prajivan@uw.edu)
Department of Industrial and Systems Engineering, University of Washington, 1410 NE Campus Pkwy, Seattle, WA 98195 USA

Cleotilde Gonzalez (coty@cmu.edu)
Department of Social and Decision Sciences, Carnegie Mellon University, 4815 Frew Street, Pittsburgh, PA 15213 USA
Abstract

Large Language Models (LLMs) exhibit human-like proficiency in various tasks such as translation, question answering, essay writing, and programming. Emerging research explores the use of LLMs in collective problem-solving endeavors, such as tasks where groups try to uncover clues through discussions. Although prior work has investigated individual problem-solving tasks, leveraging LLM-powered agents for group consensus and decision-making remains largely unexplored. This research addresses this gap by (1) proposing an algorithm to enable free-form conversation in groups of LLM agents, (2) creating metrics to evaluate the human-likeness of the generated dialogue and problem-solving performance, and (3) evaluating LLM agent groups against human groups using an open source dataset. Our results reveal that LLM groups outperform human groups in problem-solving tasks. LLM groups also show a greater improvement in scores after participating in free discussions. In particular, analyses indicate that LLM agent groups exhibit more disagreements, complex statements, and a propensity for positive statements compared to human groups. The results shed light on the potential of LLMs to facilitate collective reasoning and provide insight into the dynamics of group interactions involving synthetic LLM agents.

Keywords: Small Group, Language Model, Simulation

Introduction

Large Language Models (LLMs) are gaining widespread adoption due to their seemingly remarkable reasoning power and emergent generalization ability, which make them promising building blocks for intelligent agents and have driven recent advances in a variety of human language tasks (Ouyang et al., 2022; Wei et al., 2022), including web surfing (Nakano et al., 2021; Yao et al., 2022), complex video games (Y. Chang et al., 2023), and other applications (Ahn et al., 2022). In recent work, Ziems et al. (2023) found that LLM agents can achieve a fair level of performance on tasks involved in computational social science research, for example achieving sufficient agreement with human annotators and providing explanations that surpass those generated by crowd workers. Despite these achievements, the current focus of LLM research mainly revolves around individual tasks, leaving the potential of these models in collective problem-solving tasks largely understudied.

Small groups play an important role in connecting people within larger social systems and in fostering social cohesion (Fine, 2014). These groups serve as platforms for individual interactions, including virtual meetings (Karl, Peluchette, & Aghakhani, 2022), workplace discussions (Forsell, Forslund Frykedal, & Hammar Chiriac, 2020), recreational activities (Vernham, Granhag, & Mac Giolla, 2016), and educational settings (Liu & Tsai, 2008; Yadgarovna & Husenovich, 2020). A deeper understanding of human conversational dynamics within small groups is essential to improve teamwork, resolve conflicts, and foster effective problem-solving. However, the limited availability of group corpora (J. P. Chang et al., 2020) poses a significant challenge to advancing research in this area. LLMs, trained on human datasets, offer a promising way to address this data scarcity (Bommasani et al., 2021).

Recent research has explored the potential of LLMs to emulate human-like behavior at the group level. Aher, Arriaga, and Kalai (2023) examined LLMs in the context of human studies and showed that LLMs can replicate various experiments spanning the domains of economic, psycholinguistic, and social psychology. Other recent contributions from social simulation used a prompt chain methodology to generate concise natural language descriptions of personas and their behaviors (Park et al., 2023). Additionally, Zhou et al. (2023) introduced an open-ended environment designed to simulate social interactions between language agents, evaluating their ability to achieve social objectives.

However, existing research has predominantly focused on evaluating individual agent performance, neglecting the emergent behavior that arises from interaction between agents within small groups. Our research addresses this gap by introducing a model designed to emulate free-form conversation for problem-solving within small groups. This algorithm, integrated with LLMs, generates group discussions aimed at solving complex tasks. Using the publicly available Winter Survival Task dataset (Humphreys, Johnson, & Johnson, 1982), developed to understand the dynamics of team building and group problem solving, we propose a mechanism that enables free-form discussions among an arbitrary number of agents without imposing predefined interaction rules. We conducted a comparative analysis between the synthetic corpus generated by our model and the human corpus collected
by Braley and Murray (2018), focusing on metrics related to performance and efficiency, affect and satisfaction, and group action and airtime. Our findings reveal that LLM groups outperform human groups in the Winter Survival Task, and that, compared with human groups, they engage in more disagreements, more complex statements, and more positive rather than negative statements.

Related Work

Dialogue Systems. Dialogue systems are widely applied in various big data domains, including computer vision and recommender systems (Chen, Liu, Yin, & Tang, 2017). Existing dialogue systems fall into two categories: task-oriented systems and conversational agents. Task-oriented dialogue systems are characterized by clearly defined goals, structured dialogue behavior, closed domains, and a focus on efficiency (Raux, 2008; Cole et al., 2018). They operate by tracking dialogue states and generating responses based on them, with their performance assessed mainly by task success rates and user ratings (Cuayáhuitl, Keizer, & Lemon, 2015; Schmitt & Ultes, 2015). In contrast, conversational agents are designed for unstructured, open-domain conversations with users (Tulshan & Dhage, 2019). Evaluating conversational dialogue systems remains a challenge (Deriu et al., 2021); evaluation typically relies on metrics such as response appropriateness (e.g., coherence, relevance) and human likeness, measured by the ability to mimic humans convincingly. However, these metrics focus on individual conversational properties. We propose a novel approach for evaluating conversational agents by assessing their human likeness in terms of group behavior.

Conversation Analysis. Communication or conversation analysis involves studying socially organized human interaction, aiming to understand the shared procedures guiding participants in producing and recognizing meaningful actions (Liddicoat, 2021). Human discourse is studied as a dynamic interplay driven by informational and relational motives (Yeomans, Schweitzer, & Brooks, 2022). At the core of this process lies turn-taking, which marks transitions between speakers (Seuren, Wherton, Greenhalgh, & Shaw, 2021). Past work has found that turn-taking is challenging to analyze because transitions can happen with or without gaps, turn order can vary, and the relative distribution of turn allocation cannot be pre-determined or modeled (Sacks, Schegloff, & Jefferson, 1978).

On the basis of these findings, we propose a novel, free-form conversation algorithm capable of generating locally organized and interactionally managed dialogues. In our approach, the "next speaker" is self-selected, contingent upon each agent's individual decision to contribute to the conversation. Since it is impossible to find a decontextualized set of linguistic forms of turns, our algorithm empowers LLM agents to autonomously determine speech turns within the conversational flow.

Method

We utilized an existing dataset collected from an experiment conducted using the winter survival task paradigm (Humphreys et al., 1982). The dataset was used to model and analyze LLMs' performance in emulating conversations in small groups. First, we briefly describe the winter survival task (WST) and the dataset from the experiment conducted using the WST. We then describe the algorithm used to construct LLM agents that model and emulate the conversations observed within human teams in the experiment.

Task and Human Corpus

Winter Survival Task. The winter survival task (Humphreys et al., 1982) is a group decision-making exercise built around a hypothetical plane-crash scenario. Participants in experiments using this paradigm are told they are stranded in a remote place and must survive using 15 items supposedly salvaged from the plane in which they traveled. Examples include a compress kit, a fluid-free cigarette lighter, a compass, and a family-sized chocolate bar. Participants are presented with these 15 items and must work in small groups to discuss and rank each item according to its importance for their survival in that situation. Participants are instructed to independently rank the 15 items before the group discussion begins. Following the individual rankings, each group is given a maximum of 15 minutes to collectively deliberate and reach a consensus on the final ranking of the items. The groups' conversations and deliberations during this task were recorded. Rankings submitted by the individuals and groups are scored against a human expert ranking (https://ed.fnal.gov/arise/guides/bio/1-Scientific%20Method/1b-WinterSurvivalExercise.pdf). Finally, the participants answered a questionnaire on five-point Likert scales indicating how strongly they agreed with statements concerning the meeting.

Human Corpus. The human dataset comprises 28 groups with a total of 84 participants. Group sizes range from two to four members: 6 groups had 2 members, 16 groups had 3 members, and 6 groups had 4 members. Speaker-level data include demographics and answers to the post-experiment questionnaire. Utterance-level data include the text transcription, timestamp, sentiment annotation (positive, negative), and decision annotation. The decision annotation denotes a group decision process, and its possible values are proposal, agreement, disagreement, and confirmation. More details of the dataset can be found in Braley and Murray (2018). Each person belongs to one group, and each group has one conversation.
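To make the structure of these records concrete, the following Python sketch shows one way the utterance-level entries could be represented; the class and field names are illustrative and are not taken from the released dataset.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Utterance:
    """One utterance-level record from a group conversation."""
    speaker_id: str                  # which group member spoke
    text: str                        # transcribed utterance
    timestamp: float                 # seconds from the start of the meeting
    sentiment: Optional[str] = None  # "positive" or "negative", if annotated
    decisions: List[str] = field(default_factory=list)  # e.g., ["proposal", "agreement"]

@dataclass
class GroupConversation:
    group_id: str
    utterances: List[Utterance] = field(default_factory=list)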
LLM-simulated Corpus

Language Agent. Figure 1 illustrates the architecture of the language agent. The rest of this section describes the components presented in the architecture: Utterance, Conversation History, Reflection, and Actions (speak/silent). OpenAI's API querying system (Achiam et al., 2023) is used to drive interactions between agents. To ensure the validity and reproducibility of our evaluations, we use fixed versions of these models in our experiments. Specifically, we utilized the 0613 version of GPT-4-32k.
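As a minimal illustration of how a turn could be generated against a pinned model snapshot, the snippet below uses the OpenAI Python client; the prompt contents and the helper name are placeholders rather than the study's actual prompts.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_agent(system_prompt: str, user_prompt: str) -> str:
    """Query a fixed model snapshot so repeated runs see the same weights."""
    response = client.chat.completions.create(
        model="gpt-4-32k-0613",  # pinned snapshot, as described above
        messages=[
            {"role": "system", "content": system_prompt},  # e.g., the agent identity
            {"role": "user", "content": user_prompt},      # task description or action prompt
        ],
    )
    return response.choices[0].message.content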
Figure 1: Language Agent

The architecture is developed to emulate conversations in a small group. Each agent in the group observes utterances made by other agents during the conversation and remembers the conversation using a conversation history. This conversation history is then used to reflect and make decisions on whether to interject or remain silent after each utterance made by other agents in the group.

Utterance and Conversation History. An utterance within the architecture represents text uttered by the agent currently speaking and observed by the other agents in the group. We assume the agents can remember the entire conversation because the duration of the emulated group discussions is relatively short (a 15-minute discussion). Thus, each agent is programmed to store and maintain the conversation history as a data structure that persists across the calls/prompts made to the LLM to choose an action during the conversation, akin to working memory. Specifically, this conversation history is maintained as a list consisting of a series of (speaker id, text) pairs in the order they appeared during the conversation.

At each prompt to the LLM, the conversation history is provided as input for decision-making and utterance generation. The agents have no episodic memory, since they only "participate" in this group task once, which does not require the storage of experience from multiple group decision cycles. We rely on the implicit knowledge stored in the LLM weights and do not initialize the agents with external semantic knowledge support.
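A minimal sketch of this working memory, assuming the (speaker id, text) representation described above, is shown below; the class and method names are ours, not the authors'.

from typing import List, Tuple

class ConversationHistory:
    """Ordered record of everything said so far, kept by each agent."""

    def __init__(self) -> None:
        self._turns: List[Tuple[str, str]] = []  # (speaker_id, text) in utterance order

    def add(self, speaker_id: str, text: str) -> None:
        self._turns.append((speaker_id, text))

    def as_prompt(self) -> str:
        """Serialize the history so it can be prepended to the next LLM prompt."""
        return "\n".join(f"{speaker}: {text}" for speaker, text in self._turns)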
Actions. Actions can be divided into two types: "reasoning actions" and "statement actions". The reasoning action consists of two sub-actions performed in sequence: ranking update and floor action. In this sequence, the agents are first prompted to update their ranking of the 15 items (ranking update) at the end of each utterance they observe. The agents then synthesize their ranking and the conversation history to make a floor action decision: grab the conversation floor or release the floor. If the agent decides to talk, the statement action talk is triggered to generate natural language conveying its opinion.

Four prompts are involved in the simulation. The Task Description prompt is identical to the one used in the human experiment to describe the task to the LLM agents, abbreviated for the sake of space. The ranking update prompt asks the agents to consider the propositions made by other agents during the conversation, integrate them, and update their ranking of the 15 items. The floor action prompt reflects humans' decisions and actions on the conversation floor during the discussion, e.g., interject or remain silent. Finally, the talk prompt is used to generate text when the agent decides to speak up. The word limit is empirically set to 40 (the maximum utterance length in the human corpus) to avoid lengthy sentences. An auxiliary replyTo attribute is included to improve the coherence of the conversation and to explicitly show whether the speaker is specifically addressing one of the other agents or broadcasting to the entire group.

The prompts were designed to follow a role-based system that differentiates between system roles and user roles (Oren, Sagawa, Hashimoto, & Liang, 2019). The system roles are used to configure the LLM identity (i.e., survivor x). We leave the customization of the model's tone, style, and persona for future exploration. The user roles are used to configure the task description and task prompts. The reasoning attribute is an auxiliary attribute that explicitly exposes the LLM's deduction process, an approach that has been widely used to improve performance in a variety of applications such as knowledge-intensive tasks (Yao et al., 2022) and decision-making tasks (Shinn, Labash, & Gopinath, 2023).
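The per-utterance reasoning sequence can be sketched as a short prompt chain. The prompt wording, JSON fields, and llm() helper below are illustrative stand-ins for the study's actual prompts, with the system role configuring the agent identity as described above.

import json
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4-32k-0613"  # pinned snapshot used throughout

def llm(system: str, user: str) -> str:
    """Single chat-completion call: system role = agent identity, user role = prompt."""
    out = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return out.choices[0].message.content

def reasoning_action(agent_id: str, history: str, ranking: list) -> dict:
    """Ranking update followed by a floor action decision after each observed utterance."""
    system = f"You are survivor {agent_id} in the winter survival task."
    new_ranking = llm(system, f"Conversation so far:\n{history}\n"
                              f"Your current ranking: {ranking}\n"
                              "Integrate the other survivors' propositions and return your updated ranking.")
    floor = llm(system, f"Conversation so far:\n{history}\n"
                        "Do you want to grab the conversation floor or remain silent? "
                        "Answer GRAB or SILENT with a short reasoning.")
    return {"ranking": new_ranking, "grab_floor": "GRAB" in floor.upper()}

def statement_action(agent_id: str, history: str) -> dict:
    """Talk prompt: an utterance of at most 40 words with an auxiliary replyTo field."""
    system = f"You are survivor {agent_id} in the winter survival task."
    reply = llm(system, f"Conversation so far:\n{history}\n"
                        "Speak up in at most 40 words. Return JSON with keys 'text', 'replyTo', 'reasoning'.")
    return json.loads(reply)  # assumes the model returns valid JSON as instructed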
Free-form Conversation Algorithm

Following the same procedure as in the human-subject experiment, each agent is first instructed to complete the WST individually. The agents are then assigned to groups of 2, 3, and 4 members to complete the WST together. The agents are instructed to collaborate, discuss their individual rankings, and come to a consensus on a group ranking. Then, each agent is individually prompted to complete the post-task questionnaires.

Figure 2: Flow diagram describing the process that agents follow to generate free-form conversations

Figure 2 illustrates the execution loop that allows free-form conversation (FFC) among agents. The speaker who utters the first sentence initiates the conversation and takes possession of the "conversation floor." The remaining agents in the group observe what is being said by the speaking agent. The speaker keeps the floor until another agent tries to claim it. Meanwhile, the listening agents monitor the conversation history, periodically deciding whether to claim the floor or remain silent. If no one attempts to claim the floor, the speaker keeps talking until it decides to release the floor to others. If more than one agent attempts to claim the floor, one of them is randomly chosen as the next speaker. When the conversation floor is free and a consensus has not yet been reached, the agents are repeatedly prompted to reassess the situation and decide whether to speak up. If none of the agents recognizes the obligation to speak up and continue the discussion, the conversation ceases and the group task ends in failure.
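The execution loop can be summarized in code. This is a simplified sketch of the floor-management logic described above, not the authors' implementation; the agent interface (reasoning_action, statement_action) follows the illustrative helpers introduced earlier.

import random

def free_form_conversation(agents, history, consensus_reached, max_idle_rounds=3):
    """Simplified sketch of the FFC floor-management loop.

    `agents` are assumed to expose reasoning_action(history) -> {"grab_floor": bool}
    and statement_action(history) -> {"text": str}; `consensus_reached(agents)` is a
    caller-supplied check on the agents' current rankings.
    """
    speaker = agents[0]  # the first speaker takes the conversation floor
    while not consensus_reached(agents):
        utterance = speaker.statement_action(history)
        history.add(speaker.agent_id, utterance["text"])

        # Listeners update their rankings and decide whether to claim the floor.
        claimants = [a for a in agents if a is not speaker
                     and a.reasoning_action(history)["grab_floor"]]
        if claimants:
            speaker = random.choice(claimants)  # ties broken at random
            continue

        # Nobody claimed the floor: the current speaker may keep or release it.
        if speaker.reasoning_action(history)["grab_floor"]:
            continue  # speaker keeps talking

        # Floor is free: repeatedly re-prompt the group until someone volunteers.
        for _ in range(max_idle_rounds):
            volunteers = [a for a in agents if a.reasoning_action(history)["grab_floor"]]
            if volunteers:
                speaker = random.choice(volunteers)
                break
        else:
            return False  # sustained silence: the group task ends in failure
    return True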
Group Decision-Making Annotations. The synthetic conversation corpus generated from the LLM simulation was annotated with the same four group decision-making annotations: Proposal, Agreement, Disagreement, and Confirmation. The annotation process was automated using the ProSeqo (Kozareva & Ravi, 2019) method, which currently ranks first on the leaderboard for dialogue act classification on the Switchboard Dialog Act Corpus (Jurafsky, Shriberg, & Biasca, 1997). We fine-tuned the network on 60% of the annotated human corpus and achieved 72.4% agreement with the human annotation on the remaining 40% of the human corpus; 72.4% is a satisfactory kappa value.

Sentiment Annotations. The synthetic corpus was also annotated for sentiment, following the same annotation scheme used to annotate the human corpus. The annotation process for sentiment was likewise automated, using DistilBERT (Sanh, Debut, Chaumond, & Wolf, 2019; Wolf et al., 2019). We first fine-tuned the network on 60% of the annotated human corpus and achieved 81.3% agreement with the human annotation on the remaining 40% of the human corpus; 81.3% is a satisfactory kappa value.

Corpus Evaluation Metrics

To systematically evaluate the human likeness of the language agents' behavior and the potential to use them in group research, we propose to evaluate the agents on the following metrics.

Score and Meeting Length. We measure task performance at both the individual and group levels using the task score. The AIS (Absolute Individual Score) is calculated from the differences between the individual's ranking and the human expert ranking of each item: AIS = 100 - Sum over items i of |Rank_Individual(i) - Rank_Expert(i)|. The AGS (Absolute Group Score) is calculated analogously from the group's ranking: AGS = 100 - Sum over items i of |Rank_Group(i) - Rank_Expert(i)|. All members of groups with AGS >= 50 can survive. With a score in (40, 49], one might get frostbite. At most 3 members can survive with AGS in (30, 39]. Groups with AGS <= 30 are in serious danger. As a baseline, the distribution of random performance was modeled with a Gaussian (normal) distribution; the fitted mean was estimated to be mu = 15.34, with a standard deviation of sigma = 12.71 (R^2 = 0.95).

To analyze the efficiency of meetings, we measured the Meeting Length in terms of the number of words used during the conversations rather than elapsed time, since the LLM agents are not embedded in the real world and can output text as fast as their CPU/GPU allows. For a fair comparison between verbal conversation among humans and text-based interaction among agents, back channels (e.g., coughs, nods, or unclear utterances like "uh") in the human corpus were excluded from the analysis.

Affect and Satisfaction. We measure the affect of the groups of agents based on the sentiment of each utterance and the peer evaluation in the post-experiment questionnaire. Positivity and Negativity are the numbers of utterances annotated as positive or negative. Satisfaction is the group average of each agent's Overall Satisfaction, which is in turn the average of the five Likert-scale items in the post-experiment questionnaire.

Group Action and Airtime. Decision-making behavior is measured in terms of both high-level speech acts and low-level turn-taking. Group action proportions are the numbers of utterances labeled as proposal, agreement, disagreement, and confirmation, each divided by the total number of utterances in the group conversation. Airtime proportion is the number of words uttered by each speaker divided by the total word count of the conversation.
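These metrics reduce to simple counts over the rankings and transcripts; a sketch under the definitions stated above follows (function names are ours).

from collections import Counter
from typing import Dict, List, Sequence, Tuple

def absolute_score(ranking: Dict[str, int], expert: Dict[str, int]) -> int:
    """AIS/AGS: 100 minus the sum of absolute rank differences over the 15 items."""
    return 100 - sum(abs(ranking[item] - expert[item]) for item in expert)

def meeting_length(utterances: Sequence[Tuple[str, str]]) -> int:
    """Meeting length as total word count (back channels assumed already removed)."""
    return sum(len(text.split()) for _, text in utterances)

def airtime_proportion(utterances: Sequence[Tuple[str, str]]) -> Dict[str, float]:
    """Share of the conversation's words uttered by each speaker."""
    words = Counter()
    for speaker, text in utterances:
        words[speaker] += len(text.split())
    total = sum(words.values())
    return {speaker: n / total for speaker, n in words.items()}

def group_action_proportions(labels: Sequence[List[str]]) -> Dict[str, float]:
    """Proportion of utterances carrying each decision-making label (labels may overlap)."""
    counts = Counter(label for utterance_labels in labels for label in utterance_labels)
    return {action: counts[action] / len(labels)
            for action in ("proposal", "agreement", "disagreement", "confirmation")}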
Results

Performance

Figure 3 illustrates the group score and meeting length of human and agent groups. A one-way between-subjects ANOVA with human or agent as the main factor shows that the groups of language agents perform significantly better than human groups across the three group sizes [F(1, 147) = 5.121, p < 0.05], while the length of agent meetings is significantly shorter than that of human meetings [F(1, 147) = 355.7, p < 0.0001]. Post hoc comparisons using the Tukey HSD test indicated that the meeting length for human groups with 4 members (M = 2209.00, SD = 860.28) was significantly higher than for groups with 3 (M = 1484.06, SD = 698.81) or 2 (M = 1349.83, SD = 88.85) members. There is no significant difference in meeting length for agent groups of different sizes.

Figure 4 further demonstrates the efficiency with which the agent groups deliberated compared to human groups. One-way between-subjects ANOVA was conducted to compare

Affect

Table 1 shows the descriptive statistics of conversation sentiment and post-task peer evaluation scores. Both human and agent conversations have more utterances labeled positive than negative. The negativity of agent group conversations is significantly lower than that of human group conversations [F(1, 147) = 9.29, p < 0.01]. As for peer evaluation, the agents report significantly lower scores in terms of Time Management ("Our group used its time wisely") [F(1, 147) = 25.41, p < 0.001] and Efficiency ("Our group struggled to work together efficiently on this task") [F(1, 147) = 23.08, p < 0.001], and significantly higher scores in terms of Time Expectation ("This task took longer than expected to complete") [F(1, 147) = 31.52, p < 0.001], Worked Well Together ("Our group worked well together") [F(1, 147) = 5.966, p < 0.05], and Quality of Work ("Overall, our group did a good job on this task") [F(1, 147) = 19.72, p < 0.001]. In summary, agents show more positive affect toward peers and "care" more about efficiency.

Figure 5: Distribution of group decision-making actions. Note that one utterance can be labeled with more than one action; e.g., "I agree with the shortening over ski poles. Shall we rank flashlight next?" is labeled as Proposal and Agreement.

Figure 6 shows the distribution of airtime proportion in agent and human groups of various sizes. The distributions of human and agent groups differ most in groups of size 4 (Agent: M = 0.25, SD = 0.097; Human: M = 0.26, SD = 0.169). In groups of size 4, more than 40% of agents occupied 20% to 30% of the airtime, while only 33% of humans did, which indicates that agents participated in the discussion more equally than humans. As an example, Figure 7 shows the timeline of a 4-human group conversation, which is significantly dominated by one group member.
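The between-group comparisons reported in this section are standard one-way ANOVAs with post hoc Tukey tests; a sketch of how such a comparison could be run on per-group measures follows (the data frame and its values are placeholders, not results from the study).

import pandas as pd
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Illustrative layout: one row per group, with its source and outcome measure.
df = pd.DataFrame({
    "source": ["human"] * 3 + ["agent"] * 3,  # human vs. LLM-agent groups
    "size":   [2, 3, 4, 2, 3, 4],
    "score":  [47, 52, 49, 58, 61, 60],       # placeholder AGS values
})

# One-way between-subjects ANOVA with source (human or agent) as the factor.
human = df.loc[df["source"] == "human", "score"]
agent = df.loc[df["source"] == "agent", "score"]
f_stat, p_value = f_oneway(human, agent)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")

# Post hoc Tukey HSD, e.g., comparing an outcome across group sizes.
tukey = pairwise_tukeyhsd(endog=df["score"], groups=df["size"])
print(tukey.summary())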
Figure 6: Distribution of airtime proportion

Figure 7: Conversation timeline of a 4-human group vs. a 4-agent group

a publicly available human data set, evaluating them based on ranking scores, meeting length, and the change in ranking scores after group discussions. Our results indicate that LLM agent groups outperform human groups by achieving higher scores in shorter time frames. Furthermore, LLM agents improve their scores after free-form group discussions more than human groups do. Analyses of post-task questionnaires and conversation dynamics indicate that agents are dissatisfied with their time management, perceiving tasks as taking longer than expected. Agents also exhibit a tendency to make positive remarks rather than negative ones, contrasting with human groups. These differences result from the underlying design philosophy used to build LLMs. It is possible that LLMs are intentionally designed to exhibit politeness and humility to please human users, potentially mitigating displeasure or frustration.

Analyses of LLM agent discussions reveal greater disagreement among agents within a group than among human group members. Agent groups also tend to craft more intricate statements that combine agreement and disagreement. However, agent discussions progress from one item to the next faster than human discussions. Agents achieve this by quickly proposing subsequent steps after an agreement, disagreement, or confirmation. Agents also engaged in turn-taking without requiring a predefined order. In contrast, human groups often have a dominant speaker, reducing some members to passive observers of the conversation. A possible explanation is that the human groups consist of different people with different background knowledge, biases, and preferences, while the agent groups can be less diverse.