Title
Large Language Models for Collective Problem-Solving: Insights into Group Consensus
Decision-Making
Permalink
https://escholarship.org/uc/item/6s060914
Journal
Proceedings of the Annual Meeting of the Cognitive Science Society, 46(0)
Authors
Du, Yinuo
Rajivan, Prashanth
Gonzalez, Cleotilde
Publication Date
2024
Peer reviewed
Prashanth Rajivan (prajivan@uw.edu)
Department of Industrial and Systems Engineering, University of Washington, 1410 NE Campus Pkwy, Seattle, WA 98195 USA

Cleotilde Gonzalez (coty@cmu.edu)
Department of Social and Decision Sciences, Carnegie Mellon University, 4815 Frew Street, Pittsburgh, PA 15213 USA
Abstract

Large Language Models (LLMs) exhibit human-like proficiency in various tasks such as translation, question answering, essay writing, and programming. Emerging research explores the use of LLMs in collective problem-solving endeavors, such as tasks where groups try to uncover clues through discussions. Although prior work has investigated individual problem-solving tasks, leveraging LLM-powered agents for group consensus and decision-making remains largely unexplored. This research addresses this gap by (1) proposing an algorithm to enable free-form conversation in groups of LLM agents, (2) creating metrics to evaluate the human-likeness of the generated dialogue and problem-solving performance, and (3) evaluating LLM agent groups against human groups using an open source dataset. Our results reveal that LLM groups outperform human groups in problem-solving tasks. LLM groups also show a greater improvement in scores after participating in free discussions. In particular, analyses indicate that LLM agent groups exhibit more disagreements, complex statements, and a propensity for positive statements compared to human groups. The results shed light on the potential of LLMs to facilitate collective reasoning and provide insight into the dynamics of group interactions involving synthetic LLM agents.

Keywords: Small Group, Language Model, Simulation

Introduction

Large Language Models (LLMs) are gaining widespread adoption due to their seemingly remarkable reasoning power and emergent generalization ability, which make them promising building blocks for intelligent agents and have driven recent advances in a variety of human language tasks (Ouyang et al., 2022; Wei et al., 2022), including web surfing (Nakano et al., 2021; Yao et al., 2022), complex video games (Y. Chang et al., 2023), and other applications (Ahn et al., 2022). In recent work, Ziems et al. (2023) found that LLM agents can achieve a fair level of performance on tasks involved in computational social science research, for example achieving sufficient agreement with human annotators and providing explanations that surpass those generated by crowd workers. Despite these achievements, the current focus of LLM research mainly revolves around individual tasks, leaving the potential of these models in collective problem-solving tasks largely understudied.

Small groups play an important role in connecting people within larger social systems and in fostering social cohesion (Fine, 2014). These groups serve as platforms for individual interactions, including virtual meetings (Karl, Peluchette, & Aghakhani, 2022), workplace discussions (Forsell, Forslund Frykedal, & Hammar Chiriac, 2020), recreational activities (Vernham, Granhag, & Mac Giolla, 2016), and educational settings (Liu & Tsai, 2008; Yadgarovna & Husenovich, 2020). A deeper understanding of human conversational dynamics within small groups is essential to improve teamwork, resolve conflicts, and foster effective problem-solving. However, the limited availability of group corpora (J. P. Chang et al., 2020) poses a significant challenge to advancing research in this area. LLMs, trained on human datasets, offer a promising way to address this data scarcity (Bommasani et al., 2021).

Recent research has explored the potential of LLMs to emulate human-like behavior at the group level. Aher, Arriaga, and Kalai (2023) examined LLMs in the context of human studies and showed that LLMs can replicate various experiments spanning the domains of economic, psycholinguistic, and social psychology. Other recent contributions from social simulation used a prompt chain methodology to generate concise natural language descriptions of personas and their behaviors (Park et al., 2023). Additionally, Zhou et al. (2023) introduced an open-ended environment designed to simulate social interactions between language agents, evaluating their ability to achieve social objectives.

However, existing research has predominantly focused on evaluating individual agent performance, neglecting the emergent behavior that arises from interaction between agents within small groups. Our research addresses this gap by introducing a model designed to emulate free-form conversation for problem-solving within small groups. This algorithm, integrated with LLMs, generates group discussions aimed at solving complex tasks. Using the publicly available Winter Survival Task dataset (Humphreys, Johnson, & Johnson, 1982), developed to understand the dynamics of team building and group problem solving, we propose a mechanism that enables free-form discussions among an arbitrary number of agents without imposing predefined interaction rules. We conducted a comparative analysis between the synthetic corpus generated by our model and the human corpus collected
by Braley and Murray (2018), focusing on metrics related to performance and efficiency, affect and satisfaction, and group action and airtime. Our findings reveal that LLM groups outperform human groups in the Winter Survival Task, and that, compared with human groups, they engage in more disagreements, more complex statements, and more positive rather than negative statements.

Related Work

Dialogue Systems. Dialogue systems are widely applied in various big data domains, including computer vision and recommender systems (Chen, Liu, Yin, & Tang, 2017). Existing dialogue systems fall into two categories: task-oriented systems and conversational agents. Task-oriented dialogue systems are characterized by clearly defined goals, structured dialogue behavior, closed domains, and a focus on efficiency (Raux, 2008; Cole et al., 2018). They operate by tracking dialogue states and generating responses based on them, with their performance assessed mainly by task success rates and user ratings (Cuayáhuitl, Keizer, & Lemon, 2015; Schmitt & Ultes, 2015). In contrast, conversational agents are designed for unstructured, open-domain conversations with users (Tulshan & Dhage, 2019). Evaluating conversational dialogue systems remains a challenge (Deriu et al., 2021); evaluation typically relies on metrics such as response appropriateness (e.g., coherence, relevance) and human likeness, measured by the ability to mimic humans convincingly. However, these metrics focus on individual conversational properties. We propose a novel approach for evaluating conversational agents by assessing their human likeness in terms of group behavior.

Conversation Analysis. Communication or conversation analysis involves studying socially organized human interaction, aiming to understand the shared procedures guiding participants in producing and recognizing meaningful actions (Liddicoat, 2021). Human discourse is studied as a dynamic interplay driven by informational and relational motives (Yeomans, Schweitzer, & Brooks, 2022). At the core of this process lies turn-taking, which marks transitions between speakers (Seuren, Wherton, Greenhalgh, & Shaw, 2021). Past work has found that turn-taking is challenging to analyze because transitions can happen with or without gaps, turn order can vary, and the relative distribution of turn allocation cannot be pre-determined or modeled (Sacks, Schegloff, & Jefferson, 1978).

On the basis of these findings, we propose a novel, free-form conversation algorithm capable of generating locally organized and interactionally managed dialogues. In our approach, the "next speaker" is self-selected, contingent upon each agent's individual decision to contribute to the conversation. Since it is impossible to find a decontextualized set of linguistic forms of turns, our algorithm empowers LLM agents to autonomously determine speech turns within the conversational flow.

Method

We utilized an existing dataset collected from an experiment conducted using the winter survival task paradigm (Humphreys et al., 1982). The dataset was used to model and analyze LLMs' performance in emulating conversations in small groups. First, we briefly describe the winter survival task (WST) and the dataset from the experiment conducted using the WST. We then describe the algorithm used to construct LLM agents that model and emulate the conversations observed within human teams in the experiment.

Task and Human Corpus

Winter Survival Task. The winter survival task (Humphreys et al., 1982) is a group decision-making exercise built around a hypothetical plane-crash scenario. Participants in experiments using this paradigm are told they are stranded in a remote place and must survive using 15 items supposedly salvaged from the plane in which they traveled. Examples include a compress kit, a fluid-free cigarette lighter, a compass, and a family-sized chocolate bar. Participants are presented with these 15 items and must work in small groups to discuss and rank each item according to its importance for their survival in that situation. Participants are instructed to independently rank the 15 items before the group discussion begins. Following the individual rankings, each group is given a maximum of 15 minutes to collectively deliberate and reach a consensus on the final ranking of the items. The groups' conversations and deliberations during this task were recorded. Rankings submitted by the individuals and groups are scored against a human expert ranking (https://ed.fnal.gov/arise/guides/bio/1-Scientific%20Method/1b-WinterSurvivalExercise.pdf). Finally, the participants answered a questionnaire on five-point Likert scales indicating how strongly they agreed with statements concerning the meeting.

Human Corpus. The human dataset comprises 28 groups with a total of 84 participants. Group sizes range from two to four members: 6 groups had 2 members, 16 groups had 3 members, and 6 groups had 4 members. Speaker-level data include demographics and answers to the post-experiment questionnaire. Utterance-level data include the text transcription, timestamp, sentiment annotation (positive, negative), and decision annotation. The decision annotation denotes a group decision process, and its possible values are proposal, agreement, disagreement, and confirmation. More details of the dataset can be found in Braley and Murray (2018). Each person belongs to one group, and each group has one conversation.
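To make the structure of these records concrete, the following Python sketch shows one way the utterance-level entries could be represented; the class and field names are illustrative and are not taken from the released dataset.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Utterance:
    """One utterance-level record from a group conversation."""
    speaker_id: str                  # which group member spoke
    text: str                        # transcribed utterance
    timestamp: float                 # seconds from the start of the meeting
    sentiment: Optional[str] = None  # "positive" or "negative", if annotated
    decisions: List[str] = field(default_factory=list)  # e.g., ["proposal", "agreement"]

@dataclass
class GroupConversation:
    group_id: str
    utterances: List[Utterance] = field(default_factory=list)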
LLM-simulated Corpus

Language Agent. Figure 1 illustrates the architecture of the language agent. The rest of this section describes the components presented in the architecture: Utterance, Conversation History, Reflection, and Actions (speak/silent). OpenAI's API querying system (Achiam et al., 2023) is used to drive interactions between agents. To ensure the validity and reproducibility of our evaluations, we use fixed versions of these models in our experiments. Specifically, we utilized the 0613 version of GPT-4-32k.
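As a minimal illustration of how a turn could be generated against a pinned model snapshot, the snippet below uses the OpenAI Python client; the prompt contents and the helper name are placeholders rather than the study's actual prompts.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_agent(system_prompt: str, user_prompt: str) -> str:
    """Query a fixed model snapshot so repeated runs see the same weights."""
    response = client.chat.completions.create(
        model="gpt-4-32k-0613",  # pinned snapshot, as described above
        messages=[
            {"role": "system", "content": system_prompt},  # e.g., the agent identity
            {"role": "user", "content": user_prompt},      # task description or action prompt
        ],
    )
    return response.choices[0].message.content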
Figure 1: Language Agent

The architecture is developed to emulate conversations in a small group. Each agent in the group observes utterances made by other agents during the conversation and remembers the conversation using a conversation history. This conversation history is then used to reflect and make decisions on whether to interject or remain silent after each utterance made by other agents in the group.

Utterance and Conversation History. An utterance within the architecture represents text uttered by the agent currently speaking and observed by the other agents in the group. We assume the agents can remember the entire conversation because the duration of the emulated group discussions is relatively short (a 15-minute discussion). Thus, each agent is programmed to store and maintain the conversation history as a data structure that persists across the calls/prompts made to the LLM to choose an action during the conversation, akin to working memory. Specifically, this conversation history is maintained as a list consisting of a series of (speaker id, text) pairs in the order they appeared during the conversation.

At each prompt to the LLM, the conversation history is provided as input for decision-making and utterance generation. The agents have no episodic memory, since they only "participate" in this group task once, which does not require the storage of experience from multiple group decision cycles. We rely on the implicit knowledge stored in the LLM weights and do not initialize the agents with external semantic knowledge support.
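A minimal sketch of this working memory, assuming the (speaker id, text) representation described above, is shown below; the class and method names are ours, not the authors'.

from typing import List, Tuple

class ConversationHistory:
    """Ordered record of everything said so far, kept by each agent."""

    def __init__(self) -> None:
        self._turns: List[Tuple[str, str]] = []  # (speaker_id, text) in utterance order

    def add(self, speaker_id: str, text: str) -> None:
        self._turns.append((speaker_id, text))

    def as_prompt(self) -> str:
        """Serialize the history so it can be prepended to the next LLM prompt."""
        return "\n".join(f"{speaker}: {text}" for speaker, text in self._turns)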
Actions. Actions can be divided into two types: "reasoning actions" and "statement actions". The reasoning action consists of two sub-actions performed in sequence: ranking update and floor action. In this sequence, the agents are first prompted to update their ranking of the 15 items (ranking update) at the end of each utterance they observe. The agents then synthesize their ranking and the conversation history to make a floor action decision: grab the conversation floor or release the floor. If the agent decides to talk, the statement action talk is triggered to generate natural language conveying its opinion.

Four prompts are involved in the simulation. The Task Description prompt is identical to the one used in the human experiment to describe the task to the LLM agents, abbreviated for the sake of space. The ranking update prompt asks the agents to consider the propositions made by other agents during the conversation, integrate them, and update their ranking of the 15 items. The floor action prompt reflects humans' decisions and actions on the conversation floor during the discussion, e.g., interject or remain silent. Finally, the talk prompt is used to generate text when the agent decides to speak up. The word limit is empirically set to 40 (the maximum utterance length in the human corpus) to avoid lengthy sentences. An auxiliary replyTo attribute is included to improve the coherence of the conversation and to explicitly show whether the speaker is specifically addressing one of the other agents or broadcasting to the entire group.

The prompts were designed to follow a role-based system that differentiates between system roles and user roles (Oren, Sagawa, Hashimoto, & Liang, 2019). The system roles are used to configure the LLM identity (i.e., survivor x). We leave the customization of the model's tone, style, and persona for future exploration. The user roles are used to configure the task description and task prompts. The reasoning attribute is an auxiliary attribute that explicitly exposes the LLM's deduction process, an approach that has been widely used to improve performance in a variety of applications such as knowledge-intensive tasks (Yao et al., 2022) and decision-making tasks (Shinn, Labash, & Gopinath, 2023).
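The per-utterance reasoning sequence can be sketched as a short prompt chain. The prompt wording, JSON fields, and llm() helper below are illustrative stand-ins for the study's actual prompts, with the system role configuring the agent identity as described above.

import json
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4-32k-0613"  # pinned snapshot used throughout

def llm(system: str, user: str) -> str:
    """Single chat-completion call: system role = agent identity, user role = prompt."""
    out = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return out.choices[0].message.content

def reasoning_action(agent_id: str, history: str, ranking: list) -> dict:
    """Ranking update followed by a floor action decision after each observed utterance."""
    system = f"You are survivor {agent_id} in the winter survival task."
    new_ranking = llm(system, f"Conversation so far:\n{history}\n"
                              f"Your current ranking: {ranking}\n"
                              "Integrate the other survivors' propositions and return your updated ranking.")
    floor = llm(system, f"Conversation so far:\n{history}\n"
                        "Do you want to grab the conversation floor or remain silent? "
                        "Answer GRAB or SILENT with a short reasoning.")
    return {"ranking": new_ranking, "grab_floor": "GRAB" in floor.upper()}

def statement_action(agent_id: str, history: str) -> dict:
    """Talk prompt: an utterance of at most 40 words with an auxiliary replyTo field."""
    system = f"You are survivor {agent_id} in the winter survival task."
    reply = llm(system, f"Conversation so far:\n{history}\n"
                        "Speak up in at most 40 words. Return JSON with keys 'text', 'replyTo', 'reasoning'.")
    return json.loads(reply)  # assumes the model returns valid JSON as instructed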
Free-form Conversation Algorithm

Following the same procedure as in the human-subject experiment, each agent is first instructed to complete the WST individually. The agents are then assigned to groups of 2, 3, and 4 members to complete the WST together. The agents are instructed to collaborate, discuss their individual rankings, and come to a consensus on a group ranking. Then, each agent is individually prompted to complete the post-task questionnaires.

Figure 2: Flow diagram describing the process that agents follow to generate free-form conversations

Figure 2 illustrates the execution loop that allows free-form conversation (FFC) among agents. The speaker who utters the first sentence initiates the conversation and takes possession of the "conversation floor." The remaining agents in the group observe what is being said by the speaking agent. The speaker keeps the floor until another agent tries to claim it. Meanwhile, the listening agents monitor the conversation history, periodically deciding whether to claim the floor or remain silent. If no one attempts to claim the floor, the speaker keeps talking until it decides to release the floor to others. If more than one agent attempts to claim the floor, one of them is randomly chosen as the next speaker. When the conversation floor is free and a consensus has not yet been reached, the agents are repeatedly prompted to reassess the situation and decide whether to speak up. If none of the agents recognizes the obligation to speak up and continue the discussion, the conversation ceases and the group task ends in failure.
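The execution loop can be summarized in code. This is a simplified sketch of the floor-management logic described above, not the authors' implementation; the agent interface (reasoning_action, statement_action) follows the illustrative helpers introduced earlier.

import random

def free_form_conversation(agents, history, consensus_reached, max_idle_rounds=3):
    """Simplified sketch of the FFC floor-management loop.

    `agents` are assumed to expose reasoning_action(history) -> {"grab_floor": bool}
    and statement_action(history) -> {"text": str}; `consensus_reached(agents)` is a
    caller-supplied check on the agents' current rankings.
    """
    speaker = agents[0]  # the first speaker takes the conversation floor
    while not consensus_reached(agents):
        utterance = speaker.statement_action(history)
        history.add(speaker.agent_id, utterance["text"])

        # Listeners update their rankings and decide whether to claim the floor.
        claimants = [a for a in agents if a is not speaker
                     and a.reasoning_action(history)["grab_floor"]]
        if claimants:
            speaker = random.choice(claimants)  # ties broken at random
            continue

        # Nobody claimed the floor: the current speaker may keep or release it.
        if speaker.reasoning_action(history)["grab_floor"]:
            continue  # speaker keeps talking

        # Floor is free: repeatedly re-prompt the group until someone volunteers.
        for _ in range(max_idle_rounds):
            volunteers = [a for a in agents if a.reasoning_action(history)["grab_floor"]]
            if volunteers:
                speaker = random.choice(volunteers)
                break
        else:
            return False  # sustained silence: the group task ends in failure
    return True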
Group Decision-Making Annotations. The synthetic conversation corpus generated from the LLM simulation was annotated with the same four group decision-making annotations: Proposal, Agreement, Disagreement, and Confirmation. The annotation process was automated using the ProSeqo (Kozareva & Ravi, 2019) method, which currently ranks first on the leaderboard for dialogue act classification on the Switchboard Dialog Act Corpus (Jurafsky, Shriberg, & Biasca, 1997). We fine-tuned the network on 60% of the annotated human corpus and achieved 72.4% agreement with the human annotation on the remaining 40% of the human corpus; 72.4% is a satisfactory kappa value.

Sentiment Annotations. The synthetic corpus was also annotated for sentiment, following the same annotation scheme used to annotate the human corpus. The annotation process for sentiment was likewise automated, using DistilBERT (Sanh, Debut, Chaumond, & Wolf, 2019; Wolf et al., 2019). We first fine-tuned the network on 60% of the annotated human corpus and achieved 81.3% agreement with the human annotation on the remaining 40% of the human corpus; 81.3% is a satisfactory kappa value.

Corpus Evaluation Metrics

To systematically evaluate the human likeness of the language agents' behavior and the potential to use them in group research, we propose to evaluate the agents on the following metrics.

Score and Meeting Length. We measure task performance at both the individual and group levels using the task score. The AIS (Absolute Individual Score) is calculated from the differences between the individual's ranking and the human expert ranking of each item: AIS = 100 - Sum over items i of |Rank_Individual(i) - Rank_Expert(i)|. The AGS (Absolute Group Score) is calculated analogously from the group's ranking: AGS = 100 - Sum over items i of |Rank_Group(i) - Rank_Expert(i)|. All members of groups with AGS >= 50 can survive. With a score in (40, 49], one might get frostbite. At most 3 members can survive with AGS in (30, 39]. Groups with AGS <= 30 are in serious danger. As a baseline, the distribution of random performance was modeled with a Gaussian (normal) distribution; the fitted mean was estimated to be mu = 15.34, with a standard deviation of sigma = 12.71 (R^2 = 0.95).

To analyze the efficiency of meetings, we measured the Meeting Length in terms of the number of words used during the conversations rather than elapsed time, since the LLM agents are not embedded in the real world and can output text as fast as their CPU/GPU allows. For a fair comparison between verbal conversation among humans and text-based interaction among agents, back channels (e.g., coughs, nods, or unclear utterances like "uh") in the human corpus were excluded from the analysis.

Affect and Satisfaction. We measure the affect of the groups of agents based on the sentiment of each utterance and the peer evaluation in the post-experiment questionnaire. Positivity and Negativity are the numbers of utterances annotated as positive or negative. Satisfaction is the group average of each agent's Overall Satisfaction, which is in turn the average of the five Likert-scale items in the post-experiment questionnaire.

Group Action and Airtime. Decision-making behavior is measured in terms of both high-level speech acts and low-level turn-taking. Group action proportions are the numbers of utterances labeled as proposal, agreement, disagreement, and confirmation, each divided by the total number of utterances in the group conversation. Airtime proportion is the number of words uttered by each speaker divided by the total word count of the conversation.
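These metrics reduce to simple counts over the rankings and transcripts; a sketch under the definitions stated above follows (function names are ours).

from collections import Counter
from typing import Dict, List, Sequence, Tuple

def absolute_score(ranking: Dict[str, int], expert: Dict[str, int]) -> int:
    """AIS/AGS: 100 minus the sum of absolute rank differences over the 15 items."""
    return 100 - sum(abs(ranking[item] - expert[item]) for item in expert)

def meeting_length(utterances: Sequence[Tuple[str, str]]) -> int:
    """Meeting length as total word count (back channels assumed already removed)."""
    return sum(len(text.split()) for _, text in utterances)

def airtime_proportion(utterances: Sequence[Tuple[str, str]]) -> Dict[str, float]:
    """Share of the conversation's words uttered by each speaker."""
    words = Counter()
    for speaker, text in utterances:
        words[speaker] += len(text.split())
    total = sum(words.values())
    return {speaker: n / total for speaker, n in words.items()}

def group_action_proportions(labels: Sequence[List[str]]) -> Dict[str, float]:
    """Proportion of utterances carrying each decision-making label (labels may overlap)."""
    counts = Counter(label for utterance_labels in labels for label in utterance_labels)
    return {action: counts[action] / len(labels)
            for action in ("proposal", "agreement", "disagreement", "confirmation")}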
Results

Performance

Figure 3 illustrates the group score and meeting length of human and agent groups. A one-way between-subjects ANOVA with human or agent as the main factor shows that the groups of language agents perform significantly better than human groups across the three group sizes [F(1, 147) = 5.121, p < 0.05], while the length of agent meetings is significantly shorter than that of human meetings [F(1, 147) = 355.7, p < 0.0001]. Post hoc comparisons using the Tukey HSD test indicated that the meeting length for human groups with 4 members (M = 2209.00, SD = 860.28) was significantly higher than for groups with 3 (M = 1484.06, SD = 698.81) or 2 (M = 1349.83, SD = 88.85) members. There is no significant difference in meeting length for agent groups of different sizes.

Figure 4 further demonstrates the efficiency with which the agent groups deliberated compared to human groups. One-way between-subjects ANOVA was conducted to compare

Affect

Table 1 shows the descriptive statistics of conversation sentiment and post-task peer evaluation scores. Both human and agent conversations have more utterances labeled positive than negative. The negativity of agent group conversations is significantly lower than that of human group conversations [F(1, 147) = 9.29, p < 0.01]. As for peer evaluation, the agents report significantly lower scores in terms of Time Management ("Our group used its time wisely") [F(1, 147) = 25.41, p < 0.001] and Efficiency ("Our group struggled to work together efficiently on this task") [F(1, 147) = 23.08, p < 0.001], and significantly higher scores in terms of Time Expectation ("This task took longer than expected to complete") [F(1, 147) = 31.52, p < 0.001], Worked Well Together ("Our group worked well together") [F(1, 147) = 5.966, p < 0.05], and Quality of Work ("Overall, our group did a good job on this task") [F(1, 147) = 19.72, p < 0.001]. In summary, agents show more positive affect toward peers and "care" more about efficiency.

Figure 5: Distribution of group decision-making actions. Note that one utterance can be labeled with more than one action; e.g., "I agree with the shortening over ski poles. Shall we rank flashlight next?" is labeled as Proposal and Agreement.

Figure 6 shows the distribution of airtime proportion in agent and human groups of various sizes. The distributions of human and agent groups differ most in groups of size 4 (Agent: M = 0.25, SD = 0.097; Human: M = 0.26, SD = 0.169). In groups of size 4, more than 40% of agents occupied 20% to 30% of the airtime, while only 33% of humans did, which indicates that agents participated in the discussion more equally than humans. As an example, Figure 7 shows the timeline of a 4-human group conversation, which is significantly dominated by one group member.
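The between-group comparisons reported in this section are standard one-way ANOVAs with post hoc Tukey tests; a sketch of how such a comparison could be run on per-group measures follows (the data frame and its values are placeholders, not results from the study).

import pandas as pd
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Illustrative layout: one row per group, with its source and outcome measure.
df = pd.DataFrame({
    "source": ["human"] * 3 + ["agent"] * 3,  # human vs. LLM-agent groups
    "size":   [2, 3, 4, 2, 3, 4],
    "score":  [47, 52, 49, 58, 61, 60],       # placeholder AGS values
})

# One-way between-subjects ANOVA with source (human or agent) as the factor.
human = df.loc[df["source"] == "human", "score"]
agent = df.loc[df["source"] == "agent", "score"]
f_stat, p_value = f_oneway(human, agent)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")

# Post hoc Tukey HSD, e.g., comparing an outcome across group sizes.
tukey = pairwise_tukeyhsd(endog=df["score"], groups=df["size"])
print(tukey.summary())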
Figure 6: Distribution of airtime proportion

Figure 7: Conversation timeline of a 4-human group vs. a 4-agent group

a publicly available human data set, evaluating them based on ranking scores, meeting length, and the change in ranking scores after group discussions. Our results indicate that LLM agent groups outperform human groups by achieving higher scores in shorter time frames. Furthermore, LLM agents improve their scores after free-form group discussions more than human groups do. Analyses of post-task questionnaires and conversation dynamics indicate that agents are dissatisfied with their time management, perceiving tasks as taking longer than expected. Agents also exhibit a tendency to make positive remarks rather than negative ones, contrasting with human groups. These differences result from the underlying design philosophy used to build LLMs. It is possible that LLMs are intentionally designed to exhibit politeness and humility to please human users, potentially mitigating displeasure or frustration.

Analyses of LLM agent discussions reveal greater disagreement among agents within a group than among human group members. Agent groups also tend to craft more intricate statements that combine agreement and disagreement. However, agent discussions progress from one item to the next faster than human discussions. Agents achieve this by quickly proposing subsequent steps after an agreement, disagreement, or confirmation. Agents also engaged in turn-taking without requiring a predefined order. In contrast, human groups often have a dominant speaker, reducing some members to passive observers of the conversation. A possible explanation is that the human groups consist of different people with different background knowledge, biases, and preferences, while the agent groups can be less diverse.