Multi-Agent Conversational Online Learning
xuchuangwang@umass.edu, huanyuhello@zju.edu.cn
Zhuohua Li is the corresponding author.
Abstract—The remarkable generative capability of large language models (LLMs) has sparked a growing interest in automatically generating responses for different applications. Given the dynamic nature of user preferences and the uncertainty of LLM response performance, it is crucial to design efficient online learning algorithms to identify optimal LLM responses (i.e., high-quality responses that also meet user preferences). Most existing online algorithms adopt a centralized approach and fail to leverage explicit user preferences for more efficient and personalized LLM response identification. In contrast, this paper introduces MACO (Multi-Agent Conversational Online Learning for Adaptive LLM Response Identification): 1) the online LLM response identification process is accelerated by multiple local agents (such as smartphones), while enhancing data privacy; 2) a novel conversational mechanism is proposed to adaptively conduct conversations for soliciting user preferences (e.g., a preference for a humorous tone over a serious one in generated responses), so as to minimize uncertainty in preference estimation. Our theoretical analysis demonstrates that MACO is near-optimal regarding cumulative regret. Additionally, MACO offers reduced communication costs and computational complexity by eliminating the traditional, computing-intensive "G-optimal design" found in previous works. Extensive experiments with the open LLM Llama, coupled with two different embedding models from Google and OpenAI for text vector representation, demonstrate that MACO significantly outperforms the current state-of-the-art in online LLM response identification.

I. INTRODUCTION

Large language models (LLMs) have swiftly transformed the technological landscape of our society [1], [2]. A significant line of research is the exploration of prompts to identify optimal responses from LLMs [3]. This approach is compelling since it does not need to alter the internal parameters of an LLM, and it can align well with human conversational patterns. Consequently, there is a growing interest in automatically identifying LLM responses, e.g., through prompt engineering methods [4], [5], [6]. These efforts aim to enhance LLMs' capability to produce more accurate and relevant responses, collectively referred to as "LLM response identification". Note that these prompt engineering methods are done offline and only provide an "initiatory set of relatively good responses" via pre-specified prompt instructions. However, considering the diversity of responses generated by LLMs and the uncertainty in LLM performance, identifying the most suitable LLM response is inherently challenging [7], [8], as suitable responses are usually unknown in advance and context-dependent. Therefore, continuous online response adaptation is necessary [9], especially in scenarios such as medical diagnosis where highly accurate answers are required. Note that the online response identification approach can enhance the initiatory set of offline-generated responses so as to match the specific context.

Furthermore, previous research has often overlooked the need to address diverse user preferences. It is crucial not only to ensure the quality of responses generated by LLMs, but also to tailor them to meet the specific preferences and expectations of different users. For instance, some users may prefer LLM-generated responses to be humorous, while others might prefer a more formal tone. Although [10] considers the optimization of preferences for LLMs, it only addresses the binary case of users' likes and dislikes. LLM response identification must address the growing demand to cater to diverse user preferences. To address such needs, one can utilize cloud servers to continuously learn and refine LLM response identification by collecting feedback on the assessment of LLM responses. This feedback can be derived from users' direct input or from measurements of score functions [11], [12]. A response that not only meets quality standards but also aligns with user preferences is termed an "optimal LLM response."

A. Multi-Agent Conversational Properties

In the context of LLM response identification, we observe two significant properties in typical LLM application scenarios. These properties inform and motivate our proposed formulation.

First, in the utilization of LLMs, users commonly access LLM services across multiple devices, such as smartphones, tablets, and desktops, collectively referred to as "local agents." For example, the Poe AI chatting platform [13] handles user queries originating from various devices.
Leveraging this multi-agent framework, LLM response identification tailored to specific user preferences can be performed concurrently on each local agent, facilitating data aggregation and enhancing learning efficiency with respect to a user's preference. Moreover, this approach offers an added layer of privacy protection, as sensitive information remains localized and is neither transmitted nor stored on central servers.

Second, a key challenge for online LLM methods lies in addressing the "cold start" problem, where response identification may be inaccurate for new users with limited historical data. To address this, conversational recommendation [14], [15], [16] has been applied in LLM applications. In this approach, the cloud server can proactively query users with questions and obtain feedback, thereby quickly eliciting user preferences. For example, in OpenAI's design, when ChatGPT is tasked with computing factorials in Python, it may provide two "correct" implementations with different styles: one recursive, the other iterative. During the interaction, the user provides feedback on their preferred coding style. This "conversation" process allows ChatGPT to learn from the user's code preferences, enabling it to tailor its future responses more effectively to individual users.

B. Challenges and Our Contributions

To adaptively identify the appropriate LLM responses, which were generated from an initiatory set of responses produced through offline prompt engineering techniques, we propose to utilize online contextual bandit approaches, where a sequential decision-making cloud server selects LLM responses (i.e., an arm corresponds to a response) for users and receives feedback. Besides the arm-level feedback, the cloud server can occasionally prompt users with questions about key terms [17], [18]. For example, asking about the user's preference on a category: "Are you interested in news about basketball?", or asking about the user's preference on an entity: "Do you like to read news related to LeBron James?". The feedback on key terms like "basketball" and "LeBron James" can reflect user preferences, allowing the cloud server to accelerate the learning process. The objective is to develop an online adaptive strategy that maximizes user satisfaction over the long term. However, current work on conversational contextual bandit algorithms falls short of addressing the unique challenges of online adaptive LLM response identification:

❶ Firstly, existing bandit models that account for user preferences are predominantly employed in recommendation systems [18], [19], [20]. These models typically utilize Singular Value Decomposition (SVD) to extract feature vectors of comparatively lower dimensions. However, quantifying features from LLM text responses, which contain complex semantic information and lead to much higher-dimensional feature spaces, presents significant computational challenges.

❷ Secondly, previous conversational bandit works primarily follow the framework of [21], which addresses infinitely many arms. However, the number of LLM responses that need online identification from an initiatory set of responses generated via prompt engineering is typically finite. While elimination-based contextual bandit algorithms can handle this setting, they rely on the computationally intensive G-optimal design procedure [22], [23], [24] to calculate a distribution for arm selection, thus slowing down the online LLM response identification.

❸ Thirdly, existing studies on conversational bandits [19], [17] rely on predetermined functions to control conversation frequency, which typically follow a fixed sequence of engagements to initiate a specific number of conversations. This approach is not suitable for the dynamic nature of LLM response identification, as it imposes unnecessary restrictions and could degrade user experience.

❹ Finally, the existing literature on conversational bandits solely considers centralized scenarios, neglecting the inherent multi-agent nature of the data sources of LLM platforms. While there are works on distributed bandits with finite arms [22], [25], [26], they either require all local agents to upload user feedback to the cloud server or require them to share exactly the same arm set. These restrictive settings can leak sensitive information, reduce the flexibility of local agents, and increase communication costs.

This paper makes the following contributions:
• Model Formulation: We propose a distributed conversational bandit model for online LLM response identification. Complementing existing methods that rely on offline selection from a pre-generated pool of LLM responses, our model emphasizes "online identification" of the optimal LLM response from the pre-generated arm set with uncertain performance. This involves ensuring the quality of the generated response while considering user preferences.
• Algorithm Design: We propose the Conversational Adaptive Distributed Identifier (MACO), comprising MACO-A, which is executed by local agents, and MACO-S, which is executed by the cloud server. Unlike previous works with predetermined conversation frequencies, MACO adaptively decides when to engage in conversations based on the current context. Additionally, it enhances collaboration among local agents to improve the efficiency of LLM response identification.
• Theoretical Analysis: We establish the regret upper bound of MACO as Õ(√(dMT)), together with a lower bound of Ω(√(dMT)), indicating that MACO is near-optimal. Additionally, we leverage the conversational setting to enhance efficiency in both computation and communication, compared to existing work on distributed linear contextual bandits with finite arm sets. Specifically, we provide an upper bound on the communication cost of O(d²M log T). The development of distributed conversational bandits in MACO successfully avoids the computationally intensive G-optimal design, which is required in previous elimination-based linear bandits.
• Experimental Evaluation: We conduct extensive experiments using the open LLM Llama to generate responses, coupled with two different embedding models from Google and OpenAI for text vector representation. Testing under various conditions, including different arm pool sizes and numbers of local agents, our algorithm consistently outperforms state-of-the-art methods. Additionally, by eliminating the time-intensive G-optimal design procedure, our approach significantly reduces execution time. This reduction does not compromise performance, thanks to our conversational mechanism design, which enhances the speed of online LLM response identification and of the estimation of user preferences.
[Fig. 1 illustration: a preference-elicitation dialogue ("Which response do you prefer? Your choice will help make ChatGPT better.") comparing two offline-generated arm candidates, a recursive factorial implementation in Python (key term: Python) and one in C (key term: C); plus the framework diagram in which each local agent holds locally stored user data and interacts with the user through LLM responses and arm feedback, while the cloud server conducts conversations on key terms and collects the conversational feedback.]
Fig. 1: An adaptive multi-agent conversational bandit framework for identifying online LLM responses. Local agents handle response selection (arms), while a central server manages conversation flow through key term selection. The server aggregates interaction data across multiple agents to accelerate user preference learning.

B. Multi-Agent User-Personalized Bandits

We consider a multi-agent conversational bandit setting involving M agents and a cloud server. At each round t ∈ [T], a local agent m ∈ M selects an arm a_{m,t} ∈ A_m, which denotes one possible LLM response, and receives reward feedback r_{m,t} that reflects the corresponding performance. Eliciting user feedback is beyond the scope of this work; here, the term "feedback" broadly encompasses direct user input, data inferred from techniques that measure user behavior, and preference simulators [12]. The user's preference for LLM responses is represented by an "unknown" preference feature vector θ* ∈ R^d, which all local agents aim to learn. For a local agent m ∈ M, considering both the impact of the LLM response (i.e., arm a_{m,t} ∈ A_m) and the unknown user preference θ*, the reward can be expressed as a linear combination with a noise term η_{m,t}: r_{a_m,t} = ⟨x_{a_m,t}, θ*⟩ + η_{m,t}, where x_{a_m,t} ∈ R^d is the embedding feature vector of the corresponding arm, used to capture the textual information [1], [3]. We will demonstrate the generality of our model using two different open embedding approaches in Section V. Our objective is to design a policy that selects arms (i.e., LLM responses) in each round to minimize cumulative regret, defined as the difference between the cumulative rewards of our policy and of the best unknown policy across all local agents, tailored to personalized user preferences:

R_M(T) = Σ_{m=1}^{M} Σ_{t=1}^{T} ( x_{a*_m}^⊤ θ* − x_{a_{m,t}}^⊤ θ* ),   (1)

where a*_m ∈ arg max_{a∈A_m} x_a^⊤ θ* denotes the locally optimal arm with the highest expected reward at local agent m ∈ M. This regret definition follows prior works [21], [17], [18].
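To make the linear reward model and the regret in Eq. (1) concrete, here is a minimal simulation sketch. It is not part of the original paper: the dimensions, the Gaussian noise level, and the uniform-random placeholder policy are our own illustrative assumptions.

```python
import numpy as np

# Illustrative sketch of the reward model r = <x_a, theta*> + eta and Eq. (1).
rng = np.random.default_rng(0)
d, M, T, A = 8, 4, 1000, 20

theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)          # ||theta*|| <= 1, as assumed later

# Each local agent m holds its own arm set A_m of unit-norm embedding vectors.
arm_sets = []
for _ in range(M):
    X = rng.normal(size=(A, d))
    arm_sets.append(X / np.linalg.norm(X, axis=1, keepdims=True))

regret = 0.0
for m in range(M):
    X = arm_sets[m]
    expected = X @ theta_star                     # <x_a, theta*> for every arm
    best = expected.max()                         # locally optimal arm a*_m
    for t in range(T):
        a = rng.integers(A)                       # placeholder policy (uniform)
        reward = expected[a] + rng.normal(scale=0.1)   # noisy observed reward
        regret += best - expected[a]              # per-round term of Eq. (1)

print(f"cumulative regret R_M(T) of the uniform policy: {regret:.1f}")
```

A learning policy such as MACO replaces the uniform choice above and drives the per-round regret terms toward zero as the preference estimate improves.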
Formally, let K denote the finite set of key terms, with each element x̃_k ∈ R^d being the feature vector of the corresponding key term k ∈ K. Applying conversational bandits to our multi-agent framework, a user served by local agent m can be queried with a key term k_m ∈ K_m, where K_m ⊆ K is the subset of key terms at local agent m. Considering the user preference θ* and a noise term η̃_{m,t}, the conversational feedback is modeled as r̃_{k_m,t} = ⟨x̃_{k_m,t}, θ*⟩ + η̃_{m,t}. Note that our model diverges from previous conversational bandits [17], [18], [27], [28], which employ a fixed conversation function, typically linear or logarithmic in the round t, to regulate the frequency of conversations. These methods initiate conversations periodically.

Algorithm 1: MACO on Local Agent (MACO-A)
Input: Round horizon T, number of local agents M, input dimension d, arm set A_m, arm pool size A, confidence parameter δ ∈ (0, 1]
Initialization: Let p = 1, A_m^p = A_m
1: while T has not been reached do
2:   Calculate M_m^p = Σ_{a ∈ A_m^p} (1/|A_m^p|) x_a x_a^⊤
3:   Diagonalize M_m^p = Σ_{j=1}^{d} λ_{v_j} v_j v_j^⊤
4:   Upload eigenvector v_j if its corresponding eigenvalue satisfies λ_{v_j} < h_p := 3 / (4(1 − 2^{−2p}) d)
5:   Download K_m^p and {n_{m,k}^p}_{k ∈ K_m^p} from the cloud server
6:   foreach k ∈ K_m^p do   ▷ Conduct conversations
7:     Query key term k for n_{m,k}^p times
8:     Receive the conversational rewards {r̃_{k,t}}_{t ∈ T̃_{m,k}^p}
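As a rough illustration of Lines 2-4 of Algorithm 1 (a sketch under assumed dimensions, not the authors' released implementation), the snippet below builds the phase information matrix M_m^p from the active arms, diagonalizes it, and flags the eigenvectors whose eigenvalues fall below the phase threshold h_p = 3/(4(1 − 2^{−2p})d) for upload to the cloud server.

```python
import numpy as np

def weak_directions(active_arms: np.ndarray, p: int):
    """Return (h_p, [(eigenvalue, eigenvector), ...]) for the under-explored
    directions a local agent would upload in phase p (Lines 2-4 of MACO-A)."""
    n, d = active_arms.shape
    # Line 2: information matrix of the current active arm set.
    M_p = active_arms.T @ active_arms / n
    # Line 3: eigendecomposition (M_p is symmetric positive semi-definite).
    eigvals, eigvecs = np.linalg.eigh(M_p)
    # Line 4: upload directions whose eigenvalue is below the phase threshold.
    h_p = 3.0 / (4.0 * (1.0 - 2.0 ** (-2 * p)) * d)
    weak = [(lam, eigvecs[:, j]) for j, lam in enumerate(eigvals) if lam < h_p]
    return h_p, weak

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    d, n_arms, p = 16, 30, 1
    X = rng.normal(size=(n_arms, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm arm embeddings
    h_p, weak = weak_directions(X, p)
    print(f"h_p = {h_p:.4f}, directions to upload: {len(weak)}")
```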
The part of MACO executed on the local agents is referred to as MACO Agent (MACO-A); it is the online process of handling and updating the information for LLM responses. The local agent engages in conversations with the downloaded key terms while pulling arms the requisite number of times, to ensure robust exploration of the LLM responses. During this process, the local agent has the flexibility to intersperse the querying of key terms with arm pulls (Lines 6-12). Note that the procedures of conducting conversations and pulling arms are presented sequentially for clarity, but they can be executed in parallel or interleaved without strict ordering. The local agent then uploads the corresponding information of pulled arms, key terms, and observed rewards, which is stored in the matrices G_m^p and W_m^p (Line 13). Finally, the local agent downloads the updated preference parameter θ̂_p from the cloud server and revises its active arm set, eliminating less effective arms based on the updated user preference estimate (Line 15). This adaptive adjustment process allows each local agent to maintain high responsiveness and accuracy in LLM response identification, which caters to user-specific needs and preferences while preserving data privacy by sharing only aggregated data (G_m^p and W_m^p) with the cloud server.
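The aggregated quantities G_m^p and W_m^p mentioned above are the usual least-squares sufficient statistics. A minimal sketch (our own illustration, with hypothetical variable names) of how a local agent could accumulate them from its observed feature/reward pairs before uploading:

```python
import numpy as np

def phase_statistics(features: np.ndarray, rewards: np.ndarray):
    """Accumulate least-squares sufficient statistics for one phase:
    G_m^p = sum_t x_t x_t^T and W_m^p = sum_t r_t x_t (arm pulls and key-term
    queries contribute in the same way)."""
    G = features.T @ features            # d x d information matrix
    W = features.T @ rewards             # d-dimensional response vector
    return G, W

# Toy usage: 50 observations in a 16-dimensional feature space.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 16))
r = rng.normal(size=50)
G_mp, W_mp = phase_statistics(X, r)
print(G_mp.shape, W_mp.shape)            # (16, 16) (16,)
```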
B. MACO Algorithm on Cloud Server
Algorithm 2: MACO on Cloud Server (MACO-S)
Input: Key term set K, coverage parameter β in Condition 1
Initialization: Let p = 1, G = 0, W = 0
1: while T has not been reached do
2:   foreach m ∈ M do
3:     Receive all eigenvectors uploaded by local agent m, and denote this set as S_m
4:     Initialize the set of key terms at phase p as K_m^p = ∅
5:     foreach v_j ∈ S_m do
6:       k = arg max_{i ∈ K} x̃_i^⊤ v_j,  K_m^p = K_m^p ∪ {k}
7:       n_{m,k}^p = ⌈ (3/(2(1 − 2^{−2p})) − 2 d λ_{v_j}) · (β² 2^{−2p})^{−1} · log(2AM log T / δ) ⌉
8:     Send K_m^p and {n_{m,k}^p}_{k ∈ K_m^p} to local agent m
9:     Receive G_m^p and W_m^p from local agent m
10:  G = Σ_{p' ∈ [p]} Σ_{m ∈ M} G_m^{p'},  W = Σ_{p' ∈ [p]} Σ_{m ∈ M} W_m^{p'}
11:  Broadcast θ̂_p = G^{−1} W to all local agents
12:  p = p + 1
Next, we present the part of the MACO algorithm that is executed on the cloud server, called MACO Server (MACO-S). As mentioned in Section I, a significant challenge arises from the heterogeneity of local agents in the multi-agent conversational bandit model. This diversity can hinder effective data aggregation, potentially leading to suboptimal estimation of the user preference vector θ*. To address this issue, the cloud server employs a strategic approach using key terms to probe and enrich the information in underrepresented directions of the feature space, thereby enhancing the overall accuracy of the estimation process.

As detailed in Algorithm 2, the cloud server first receives, from each local agent, eigenvectors representing directions with insufficient information about the LLM response space (Line 7). Utilizing these insights, the cloud server identifies and selects key terms by calculating the closest match, in terms of the inner product, with the underexplored directions. The chosen key term k ∈ K, along with the designated repetition times n_{m,k}^p, is then communicated back to the respective local agents (Line 8). This targeted intervention allows for focused exploration and refinement of LLM responses related to these key terms. Finally, the cloud server aggregates the enriched data from all local agents. This aggregated data is used to estimate the unknown preference parameter θ* via linear regression, effectively minimizing uncertainty and enhancing the model's ability to predict and adapt LLM responses tailored to user preferences (Lines 10-11). Moreover, G can also be initialized as an identity matrix to ensure invertibility, especially when the dimension d is large.

C. Comparative Analysis

Generally, as mentioned in Section I, the number of LLM responses needing online identification from an initial set generated by prompt engineering is typically finite. Therefore, we employ phase-elimination-based algorithms for linear bandits, referred to as PE-Lin, instead of the classical conversational bandit framework proposed by [17]. This choice is motivated by the better performance guarantees of PE-Lin under finite arm sets. Our work builds upon and improves the classical PE-Lin [23]. In PE-Lin, a learning agent always estimates the unknown preference vector θ* using an optimal least squares design. Specifically, the algorithm minimizes prediction variance by implementing the computing-intensive G-optimal design, a probability distribution over the arm feature vector set X ⊂ R^d (represented by a distribution policy π : X → [0, 1]), to ensure minimal variance g(π). The conditions are defined as [29]:

Σ_{x∈X} π(x) = 1,   M(π) = Σ_{x∈X} π(x) x x^⊤,   g(π) = max_{x∈X} ∥x∥²_{M(π)^{−1}} = d.   (2)

The learning agent then plays arms according to the policy π for local agent m at phase p, estimates the unknown parameter θ*, and eliminates inferior arms accordingly. As noted in [22], there is currently no efficient algorithm for computing the G-optimal design in the multi-agent scenario.

We avoid using the G-optimal design by leveraging the inherent multi-agent heterogeneity in LLM applications, combined with an adaptive conversational mechanism. MACO eliminates the need for the resource-intensive G-optimal design, thereby significantly reducing computation time and resources. Additionally, merely executing PE-Lin independently on each local agent with subsequent data aggregation by the cloud server may fail to minimize regret efficiently. This is because different agents may have distinct LLM response sets, resulting in a trivial regret bound of Õ(M√(dT)), which is equivalent to running PE-Lin on each agent without any direct communication. In contrast, our algorithm improves the regret upper bound to Õ(√(dMT)) by efficiently utilizing the conversations to aggregate the information from different local agents, which will be detailed in Section IV.
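The server-side steps just described can be summarized by the following compact sketch (Lines 6-7 and 10-11 of Algorithm 2). This is an illustrative reading of the reconstructed pseudocode rather than the released implementation; the repetition-count formula mirrors Line 7 and all variable names are our own.

```python
import numpy as np

def select_key_terms(key_term_feats, weak_dirs, weak_vals, p, d, A, M, T,
                     beta=1.0, delta=0.1):
    """For each uploaded weak direction, pick the key term with the largest
    inner product (Line 6) and its repetition count n_{m,k}^p (Line 7)."""
    chosen = {}
    log_term = np.log(2 * A * M * np.log(T) / delta)
    for v, lam in zip(weak_dirs, weak_vals):
        k = int(np.argmax(key_term_feats @ v))                # closest key term
        n = 3.0 / (2.0 * (1.0 - 2.0 ** (-2 * p))) - 2.0 * d * lam
        n = int(np.ceil(n / (beta ** 2 * 2.0 ** (-2 * p)) * log_term))
        chosen[k] = max(chosen.get(k, 0), n)
    return chosen

def estimate_preference(G_list, W_list):
    """Aggregate uploaded statistics and broadcast the estimate (Lines 10-11)."""
    G = sum(G_list)
    W = sum(W_list)
    return np.linalg.solve(G, W)          # theta_hat_p = G^{-1} W

# Toy usage: one weak direction, an orthonormal key-term set, and dummy statistics.
d = 8
key_terms = np.eye(d)
v = np.zeros(d); v[2] = 1.0
print(select_key_terms(key_terms, [v], [0.01], p=1, d=d, A=20, M=4, T=10_000))
print(estimate_preference([np.eye(d) * 5.0], [np.ones(d)]))
```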
IV. PERFORMANCE ANALYSIS

This section presents the theoretical results of MACO, including its cumulative regret, communication cost, and conversation frequency. In line with common practices in [21], [20], we assume that for any arm a and key term k, ∥x_a∥ = ∥x̃_k∥ = 1, that the length of the preference vector θ* is bounded by 1, and that the noise terms η_{m,t} and η̃_{m,t} are 1-subgaussian.

A. Main Results

We first present a new technical condition that addresses general issues related to feature space coverage.

Condition 1 (Feature Space Coverage). We say a key term set K is sufficiently rich for covering the feature space if, for any unit vector v ∈ R^d, there exists a key term k ∈ K such that its feature vector x̃_k satisfies x̃_k^⊤ v ≥ β, where β ∈ (0, 1] is a positive coverage parameter close to 1.

Remark 1. Condition 1 is crucial for ensuring the comprehensive distribution of key terms across the feature space, which can facilitate effective uncertainty minimization for each local agent. This condition is easily met if the key term set K includes an orthonormal basis of R^d. Condition 1 enables us to sidestep the G-optimal design procedure, which is typically employed in traditional elimination-based algorithms to minimize the maximum prediction variance, as described in [23].
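Condition 1 can also be checked numerically for a given key-term set. The short sketch below is our own illustration (not from the paper): it estimates the coverage parameter β for a key-term set consisting of the standard basis together with its negations, for which the guaranteed coverage is 1/√d.

```python
import numpy as np

def empirical_coverage(key_term_feats: np.ndarray, n_samples: int = 20_000, seed: int = 0):
    """Estimate beta = min over unit vectors v of max_k <x_k, v> by sampling
    random unit directions (Condition 1, Feature Space Coverage)."""
    rng = np.random.default_rng(seed)
    d = key_term_feats.shape[1]
    v = rng.normal(size=(n_samples, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    best_per_direction = (v @ key_term_feats.T).max(axis=1)
    return best_per_direction.min()

d = 16
basis = np.eye(d)
key_terms = np.vstack([basis, -basis])      # orthonormal basis plus its negations
beta_hat = empirical_coverage(key_terms)
print(f"empirical beta ~ {beta_hat:.3f} (guaranteed lower bound 1/sqrt(d) = {1/np.sqrt(d):.3f})")
```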
For sufficiently rich key term sets, based on Condition 1, we provide the following theorems.

Theorem 1 (Regret Bounds). For the cumulative regret defined in Eq. (1), we have the following upper and lower bounds:
1) Upper Bound: With probability at least 1 − δ, the regret is bounded above by O(√(dMT log(AM log T / δ))).
2) Lower Bound: For any policy that selects at most one key term per round, there exists an instance where the policy incurs an expected regret of at least Ω(√(dMT)).

Remark 2. The regret bounds established in Theorem 1 reveal important insights into the performance of our approach:
• When M = 1, the problem simplifies to single-agent conversational bandits, reducing the regret to Õ(√(dT)). This outperforms the previous regret upper bound of Õ(d√T) from studies such as [19], [17], by leveraging phase elimination on finite arm sets. The improvement is particularly significant for high-dimensional LLM response feature vectors.
• For multi-agent systems, our upper bound aligns with the nearly optimal results described in [22], [24], while eliminating the reliance on the computationally intensive G-optimal design, thereby speeding up the online process.
• Collectively, the regret upper and lower bounds indicate that MACO is minimax optimal up to a logarithmic factor [23], aligning closely with the theoretical regret bounds in multi-agent conversational bandit scenarios.

Theorem 2 (Communication Cost). The total communication cost of the MACO algorithm scales as O(d²M log T).

Remark 3. The communication cost of our algorithm MACO is notably independent of the arm pool size A, which can range into the thousands depending on the diversity of candidate LLM responses. This contrasts with the approach described in [22], where the communication cost scales as O(d²AM log T), reflecting a substantial increase with the number of arms. Our approach significantly reduces communication costs by eliminating the need for each local agent to upload its entire active arm set, whose cardinality is O(A). Instead, local agents independently process their data and transmit only aggregated results to the cloud server, which also enhances privacy by limiting external data sharing in LLM response adaptation.

Theorem 3 (Bound on Conversation Frequency). For any local agent m ∈ M during phase p, let γ = λ_min(M_m^p), where λ_min denotes the smallest eigenvalue. Then:
1) If γ ≥ h_p, no conversations will be initiated.
2) If γ < h_p, the fraction of conversations relative to the total phase length is capped at β^{−2}(3/(4(1 − 2^{−2p})) − dγ).

Remark 4. Our approach introduces an "adaptive" method that differs significantly from the common deterministic functions b(t), such as linear or logarithmic dependencies on the round t, widely employed in existing studies on conversational bandits [17], [19]. These traditional methods initiate conversations at fixed intervals, which can lead to inefficiencies, especially when user preferences are already well understood. In contrast, our model dynamically adjusts the conversation frequency based on the current gaps in user preference information, offering a more realistic and responsive interaction paradigm.

B. Technical Analysis

We now provide an analysis of the upper bound in Theorem 1. Proofs of the other theorems can be found in Appendices C to E. Below, we present two critical lemmas related to the design of our multi-agent conversational bandit algorithm. Lemma 1 guarantees that, for any local agent m, the smallest eigenvalue of the information matrix, adjusted for conversational feedback, remains above h_p; this supports the design of Line 4 in Algorithm 1. Lemma 2 ensures that the algorithm operates within established error limits, which is essential for reliable LLM response identification.

Lemma 1 (Stability of the Information Matrix). For any local agent m ∈ M during phase p, we have λ_min(M'^p_m) ≥ h_p, where M'^p_m := M_m^p + Σ_{k ∈ K_m^p} ((h_p − λ_k)/β²) x̃_k x̃_k^⊤.

Proof. Please refer to Appendix A for the proof.

Lemma 2 (Reliability of Estimation Error Bounds). Define the "bad" event E in which some local agent m at phase p has

E = { ∃ m ∈ M, a ∈ A_m^p : ⟨θ̂_p − θ*, x_a⟩ > 2^{−p}/√M }.

The probability of E is bounded by δ, i.e., Pr[E] ≤ δ.

Proof. See Appendix B for details.
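The following toy computation, which we add here for illustration (it is not from the paper), mirrors the mechanism behind Lemma 1: starting from an information matrix with one weak eigendirection, adding the conversational term ((h_p − λ)/β²) x̃ x̃^⊤ for a key term aligned with that direction lifts the smallest eigenvalue to at least h_p.

```python
import numpy as np

rng = np.random.default_rng(4)
d, p, beta = 8, 1, 1.0
h_p = 3.0 / (4.0 * (1.0 - 2.0 ** (-2 * p)) * d)

# An information matrix M_m^p that is well explored except in one direction.
M = np.eye(d) * 0.5
M[0, 0] = 0.01                                     # weak eigenvalue lambda < h_p
lam, vecs = np.linalg.eigh(M)
weak_val, weak_vec = lam[0], vecs[:, 0]

# Key term chosen by the server: perfectly aligned with the weak direction
# (Condition 1 guarantees alignment of at least beta).
x_key = weak_vec
M_prime = M + ((h_p - weak_val) / beta ** 2) * np.outer(x_key, x_key)

print(f"h_p = {h_p:.4f}")
print(f"lambda_min before conversations: {np.linalg.eigvalsh(M)[0]:.4f}")
print(f"lambda_min after  conversations: {np.linalg.eigvalsh(M_prime)[0]:.4f}")
```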
Now, consider the "good" event E^c for agent m at phase p. Lemma 2 confirms that the discrepancy for any arm a in A_m^p satisfies ⟨x_a − x_{a*_m}, θ̂_p⟩ ≤ 2^{−p+1}/√M. This, combined with Line 15 in Algorithm 1, supports the following lemma on arm preservation and the performance bound under the good event E^c.

Lemma 3 (Properties Under Good Event). Under event E^c, for any local agent m at phase p, two key properties are ensured:
1) The locally optimal arm a*_m remains within the active arm set A_m^p, ensuring it is never eliminated.
2) The performance gap of any arm a ∈ A_m^p, defined as ∆_{m,a} ≜ ⟨θ*, x_{a*_m} − x_a⟩, is bounded by 2^{−p+3}/√M.

Finally, with probability at least 1 − δ, the cumulative regret R_M(T) = Σ_{m=1}^{M} Σ_{t=1}^{T} ⟨θ*, x_{a*_m} − x_{a_{m,t}}⟩ is bounded by Σ_{m=1}^{M} Σ_{p=1}^{P} Σ_{a ∈ A_m^p} n_{m,a}^p · (2^{−p+3}/√M), where P denotes the total number of phases. Given that Σ_{a ∈ A_m^p} n_{m,a}^p ≤ 2^{2p+1} d log(2AM log T / δ) + |A_m^p|, we derive that R_M(T) ≤ O( d √M log(AM log T / δ) · 2^P ). Furthermore, T ≥ Σ_{p=1}^{P} Σ_{a ∈ A_m^p} n_{m,a}^p ≥ Σ_{p=1}^{P} 2^{2p+1} d log(2AM log T / δ), which simplifies to T ≥ 2d · 2^{2P} log(AM log T / δ). Thus, R_M(T) ≤ O( √( dMT log(AM log T / δ) ) ).

V. PERFORMANCE EVALUATION

In this section, we conduct extensive experiments to demonstrate the effectiveness of our algorithm.¹ The code is accessible at the following link: Code Repository.

¹Our experimental setup does not assume any prior knowledge of user preferences or reward distributions, thus requiring more trial rounds. Although practical scenarios often have pre-existing information that could reduce initial exploration, our study focuses on the performance of online learning algorithms without this offline information.

A. Experimental Settings

Embedding Models. We demonstrate our framework's generalization capabilities using two open embedding models: Google's text-embedding-preview-0409 and OpenAI's Text-embedding-3-large, which generate the embedding feature vector x_a ∈ R^d for the corresponding arm a (i.e., response) to capture the text information.
1) Text-embedding-preview-0409: Google's advanced embedding model, which streamlines synthetic training data creation by generating queries and task descriptions [30].
2) Text-embedding-3-large: OpenAI's new-generation embedding model, which surpasses its predecessor, though its technical details remain undisclosed [31].

Response Settings. We explore two response settings using the aforementioned embedding models, based on a real-world dataset and an open-source LLM.
1) Following the style classification by [32], we gather a comprehensive set of 13 keywords representing diverse styles such as "humorous" and "helpful", each representing a key term. These keyword styles generate 510 unique combinations, each forming an "arm", where each arm represents a potential style of LLM response. Users have varying priorities for different keyword combinations, and their preference vector θ has the highest cosine similarity with the feature vector x of their most favored keyword style (which is unknown to the algorithms in advance). To generate these feature vectors x for LLM responses and the user preference vectors θ on keywords, we utilize the two previously mentioned embedding models. We select the top d = 256 dimensions as the feature representation and normalize them into a more concise and efficient dimensional space. The reward is obtained from the cosine similarity between a specific user's preference vector and the feature vector of the selected arm, and the optimal LLM response is defined as the one with the largest reward, according to [33].
2) Prompt engineering is utilized to construct the initiatory set of responses offline. Following [34], we select a set of keyword styles (i.e., key terms) rich in personal identifiers to establish a diverse style collection, including terms like "helpful" and "creative use of emojis". Two keyword styles are jointly selected for each query, which forms a style-specific question to the LLM, ensuring focused and relevant responses. We utilize Llama-3-8B-Instruct [35] to generate the corresponding responses. Each prompt triggers a specific response from the LLM, with each user preference dictating a response styled according to their selected input. For example: User: "Tell me a joke." Arm: a variety of jokes under different styles. Key term: the different styles. By formulating responses to five different questions, each with two keyword styles, we construct a total arm set of |A| = 455 responses. This extensive collection allows for a comprehensive mapping of responses to specific user preferences, effectively forming a set of 455 user-preference pairs. Regarding the reward definition, the feature vector extraction, and the subsequent steps, we apply the same procedures described above.

Comparison Algorithms. The following online learning algorithms from existing studies are used as baselines, each executed individually on different local agents.
• TRIPLE-SH [8]: Selects optimal prompts for LLMs by adaptively eliminating arms with poor performance, where we directly set each arm as the corresponding LLM response.
• LinUCB [21]: Online arm selection and user preference estimation for infinite arm sets, without the conversational setting.
• Arm-Con [36]: Initiates conversations on user preferences about arms, and uses LinUCB for arm selection.
• ConUCB [17]: Queries key terms when conversations are allowed and utilizes the conversational feedback to accelerate learning.
• ConLinUCB [19]: A series of three algorithms: ConLinUCB-BS calculates the barycentric spanner for conducting conversations; ConLinUCB-MCR selects key terms with the largest confidence radius; ConLinUCB-UCB adopts a LinUCB-like method to choose key terms.

All results are averaged over five trials, conducted on a Linux Ubuntu machine (kernel 6.5.0) with a 5.40 GHz 13th Gen Intel(R) Core(TM) i7-13700KF CPU and 32 GB RAM. We set the coverage parameter β = 1 and the confidence parameter δ = 0.1, and conduct an ablation study to ensure robustness.
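To clarify the reward construction used in both response settings, here is a small sketch of our own, with a generic embedding stub standing in for the Google/OpenAI embedding APIs: it normalizes embedding vectors, computes the cosine-similarity reward between a user's preference vector and each arm, and marks the arm with the largest reward as the optimal LLM response.

```python
import numpy as np

def embed(texts, d=256, seed=0):
    """Stand-in for text-embedding-preview-0409 / Text-embedding-3-large:
    returns deterministic pseudo-embeddings truncated to the top d dimensions."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(len(texts), d))
    return X / np.linalg.norm(X, axis=1, keepdims=True)

responses = ["a humorous joke", "a formal explanation", "an emoji-rich reply"]
arm_feats = embed(responses)

user_pref = embed(["prefers humorous responses"], seed=1)[0]   # one user's theta

rewards = arm_feats @ user_pref                 # cosine similarity of unit vectors
best = int(np.argmax(rewards))
print(f"rewards = {np.round(rewards, 3)}, optimal LLM response = '{responses[best]}'")
```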
[Fig. 2 panels: cumulative regret (×10⁴) versus round (0 to 100,000) for (a) arm size 40 and (b) arm size 50 with text-embedding-preview-0409, and (c) arm size 40 and (d) arm size 50 with text-embedding-3-large; legend: MACO (ours), ConUCB, LinUCB, Arm-Con, ConLinUCB-UCB, ConLinUCB-MCR, ConLinUCB-BS, TRIPLE-SH.]
Fig. 2: Cumulative regret of Response Setting 1 on two embedding models from Google and OpenAI across different arm pool sizes A.
[Fig. 3 panels: cumulative regret (×10⁴) versus round (0 to 100,000) for (a) 8 and (b) 12 local agents with text-embedding-preview-0409, and (c) 8 and (d) 12 local agents with text-embedding-3-large; legend: MACO (ours), ConUCB, LinUCB, Arm-Con, ConLinUCB-UCB, ConLinUCB-MCR, ConLinUCB-BS, TRIPLE-SH.]
Fig. 3: Cumulative regret of Response Setting 2 on two embedding models from Google and OpenAI across different numbers of agents M .
B. Evaluation Results

Regret Across Different Arm Pool Sizes. We initially compare the cumulative regret of MACO against the seven baseline algorithms under Scenario Setting 1 with M = 4 local agents, employing the above two embedding models. We further explore the influence of varying arm pool sizes A, setting A = 40 and A = 50 under each embedding model respectively, and selecting the A arms at random from A for each local agent. Fig. 2 demonstrates that algorithms lacking a conversational mechanism (LinUCB and Arm-Con) exhibit the poorest performance. In contrast, our algorithm, MACO, significantly outperforms all competitors, achieving a minimum improvement of 8.29% over ConLinUCB-MCR, the best-performing baseline. This superior performance originates from the multi-agent framework employed by MACO, wherein the cloud server aggregates data from each local agent to more accurately estimate the unknown user preference. Notably, increasing the arm pool size A does not significantly increase the cumulative regret of MACO, confirming Theorem 1, which states that our algorithm's regret grows at a square-root logarithmic rate with respect to the arm pool size A.

We further evaluate different numbers of local agents, setting M = 8 and M = 12. We consider more agents here because, in practice, platforms often group users with similar labels to share learning, making M naturally larger; we therefore aim to explore our algorithm's performance with larger M for a comprehensive demonstration. Fig. 3 presents four subfigures that illustrate consistent trends: in the absence of a multi-agent framework, the cumulative regrets of all baseline algorithms increase linearly with the number of local agents, following an Õ(dM√T) pattern. Conversely, MACO capitalizes on the aggregated data from all local agents, managing to scale its regret according to Õ(√(dMT)). This scaling significantly dampens the increase in regret, demonstrating the effectiveness of our algorithm's multi-agent approach for online LLM response identification. A clearer depiction of this regret trend is shown in Fig. 4, where TRIPLE-SH is excluded due to its inferior performance, under Scenario Setting 1 with Google's model and T = 100000.
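To make the claimed scaling behaviors concrete, the following short calculation (illustrative constants only, not measured data) compares how an independent-agent Õ(dM√T) regret and MACO's aggregated Õ(√(dMT)) regret grow as the number of local agents M increases.

```python
import numpy as np

d, T = 256, 100_000
for M in (4, 8, 12):
    independent = d * M * np.sqrt(T)     # ~ running PE-Lin per agent, no sharing
    maco = np.sqrt(d * M * T)            # ~ MACO's aggregated-learning bound
    print(f"M={M:2d}:  O(dM*sqrt(T)) ~ {independent:,.0f}   O(sqrt(dMT)) ~ {maco:,.0f}")
```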
[Fig. 4: cumulative regret (×10⁵) trend across different numbers of local agents under Scenario Setting 1 with Google's embedding model.]

TABLE I: Execution time (s) (± standard deviation) on four settings.
Setting      | MACO (w/o G)   | MACO (w/G)      | ConLinUCB-BS
Setting (a)  | 2.576 ± 0.047  | 9.766 ± 2.709   | 18.124 ± 0.111
Setting (b)  | 2.546 ± 0.039  | 14.272 ± 7.107  | 18.056 ± 0.065
Setting (c)  | 2.576 ± 0.085  | 6.369 ± 2.832   | 17.926 ± 0.095
TABLE II: Average reward (± standard deviation) on four settings.
Setting      | MACO (w/o G)    | MACO (w/G)      | ConLinUCB-BS
Setting (a)  | 61.849 ± 0.558  | 61.847 ± 0.565  | 59.811 ± 0.610
Setting (b)  | 61.605 ± 0.642  | 61.591 ± 0.649  | 59.663 ± 0.671
Setting (c)  | 47.405 ± 0.977  | 47.381 ± 1.002  | 46.104 ± 0.962
Setting (d)  | 41.770 ± 0.349  | 41.858 ± 0.412  | 40.720 ± 0.349

[...] from multiple local agents to accelerate the learning process. Table I further illustrates that MACO w/o G exhibits the lowest deviation, since the information matrix M_m^p is no longer dependent on a continuously adjusted distribution policy (see Eq. (2)). Additionally, the results in Table II show that the average reward of MACO w/o G matches that of MACO w/G, demonstrating that our conversational approach maintains performance while replacing the traditional G-optimal design with a more practical, conversation-based design. This not only sustains robust performance, as supported by Theorem 1, but also enhances efficiency, representing an interesting finding.

Ablation Study. Table III reveals that the introduction of the coverage parameter β in our design has a minimal impact on the outcomes, in contrast with the significant influence exerted by the statistical confidence parameter δ, which is set by convention [23]. This observation underscores that our framework does not introduce new dependencies on parameters beyond those traditionally used in bandit algorithms.

TABLE III: Cumulative regret under T = 100000, A = 40, M = 4.
Parameter          | Setting (a) | Setting (b) | Setting (c) | Setting (d)
β = 1.0, δ = 0.1   | 20213.773   | 16277.413   | 15033.483   | 8261.335
β = 0.9, δ = 0.05  | 21439.795   | 17205.540   | 16039.654   | 8772.119
β = 0.8, δ = 0.05  | 21430.625   | 17215.402   | 16033.950   | 8770.108
β = 0.9, δ = 0.15  | 19495.106   | 15734.833   | 15092.586   | 7962.415
β = 0.8, δ = 0.15  | 19492.169   | 15738.395   | 15094.809   | 7961.321

[...] an online budget-limited LLM response optimization using various prompts. And [11] focuses on response identification over the coordination of multiple LLMs. Nevertheless, these studies ignore the impact of user preferences and the natural multi-agent setting in LLM response identification.

VII. CONCLUSION

This paper presents MACO, a multi-agent conversational online framework designed to identify optimal responses from LLMs while minimizing cumulative regret and aligning with user preferences. The framework consists of local agents (MACO-A) that adaptively manage conversations and response selection, and a cloud server (MACO-S) that aggregates data to learn user preferences efficiently. We have proved that MACO achieves near-optimal regret bounds, reduces conversations, and enhances computational efficiency. Our extensive evaluations, utilizing open LLMs like Llama and embedding models from Google and OpenAI, confirm that our approach significantly improves performance over traditional methods. Future work could explore clustering similar user preferences and extending beyond the linear reward model to further enhance the adaptability and effectiveness of the MACO framework.

APPENDIX

A. Proof of Lemma 1

Proof. Using the eigenvectors as an orthonormal basis, for any j ∈ [d], any key term's feature vector can be expressed as x̃_k = Σ_{i=1}^{d} c_i v_i = Σ_{i=1, i≠j}^{d} c_i v_i + c_j v_j, where x := Σ_{i=1, i≠j}^{d} c_i v_i is orthogonal to v_j. According to Line 7 of Algorithm 2 and Condition 1, we have x̃_k^⊤ v_j ≥ β for the selected key term k. Therefore, (Σ_{i=1}^{d} c_i v_i)^⊤ v_j = c_j ≥ β, and x̃_k x̃_k^⊤ = (c_j v_j + x)(c_j v_j + x)^⊤ = c_j² v_j v_j^⊤ + x x^⊤ + c_j (v_j x^⊤ + x v_j^⊤).
B. Proof of Lemma 2

Proof. [...] For any x ∈ R^d, with probability at least 1 − 2δ, we have ⟨θ̂_s − θ*, x⟩ ≤ √( 2 ∥x∥²_{G^{−1}} log(1/δ) ). Then, by the Courant-Fischer theorem, with probability at least 1 − δ/(AM log T), for any m ∈ M and all arms a ∈ A_m^p, we have ⟨θ̂_p − θ*, x_a⟩ ≤ √( 2 ∥x_a∥²_{G^{−1}} log(2AM log T / δ) ) ≤ √( (2/λ_min(G)) log(2AM log T / δ) ) ≤ 2^{−p}/√M. Finally, by the union bound, Pr[E] ≤ M · P · A · δ/(AM log T) ≤ δ is obtained with P ≤ log T (deduced from Section IV-B: T ≥ 2d · 2^{2P} log(AM log T / δ) ≥ 2^P).

C. Proof of Regret Lower Bound in Theorem 1

Proof. Define R^π_{M,θ}(T) as the expected cumulative regret of policy π with user preference θ over M local agents and time horizon T. Assume that for all local agents m, the arm vectors can span R^d, and that {x_a}_{a∈A_m} = {x_k}_{k∈K} = {e_1, e_2, …, e_d} ∪ {(A − d) arbitrary unit vectors}, where e_i is the i-th standard basis vector in R^d. Choose θ = (∆, 0, …, 0)^⊤ (with ∆ ∈ [0, 1/2] to be determined later). Let the random variables N_i(t) and Ñ_j(t) be the numbers of times the i-th arm and the j-th key term are selected by the end of round t. Define another user preference θ′ = (∆, 0, …, 2∆, …, 0)^⊤, where θ′_ℓ = 2∆ and ℓ = arg min_{j>1} max{ E_θ[N_j(MT)], E_θ[Ñ_j(MT)] }. Denote by N_{m,a}(t) the number of times the a-th arm is chosen by local agent m ∈ M by the end of round t. Given that the optimal arm for θ is arm 1, pulling any other arm increases the expected regret by ∆. Thus, by Lemma 4.5 in [23], R^π_{M,θ}(T) = Σ_{m=1}^{M} ∆ Σ_{a=2}^{A} E_θ[N_{m,a}(T)]. Using the inequalities E_θ[N_j(MT)] ≤ MT/(K − 1) and E_θ[Ñ_j(MT)] ≤ MT/(K − 1) and the Markov inequality, we get R^π_{M,θ}(T) ≥ ∆ · Pr_θ[ MT − Σ_{m=1}^{M} N_{m,1}(T) ≥ MT/2 ] · MT/2. For θ′, similarly, we have R^π_{M,θ′}(T) ≥ ∆ · Pr_{θ′}[ Σ_{m=1}^{M} N_{m,1}(T) > MT/2 ] · MT/2. Therefore, applying the Bretagnolle-Huber theorem (Theorem 14.2 in [23]), R^π_{M,θ}(T) + R^π_{M,θ′}(T) ≥ (∆MT/4) exp(−D(P_θ ∥ P_{θ′})). By the properties of the Kullback-Leibler (KL) divergence, with P ∼ N(µ_1, σ²) and Q ∼ N(µ_2, σ²) we have D(P ∥ Q) = (µ_1 − µ_2)²/(2σ²), so D(P_θ ∥ P_{θ′}) = E_θ[N_ℓ(MT) + Ñ_ℓ(MT)] · D(N(0, 1) ∥ N(2∆, 1)). Let ∆ = √((d − 1)/(MT)); then max{ R^π_{M,θ}(T), R^π_{M,θ′}(T) } ≥ ( R^π_{M,θ}(T) + R^π_{M,θ′}(T) ) / 2 ≥ (e^{−4}/8) √((d − 1)MT) = Ω(√(dMT)).

D. Proof of Theorem 2

Proof. At each phase p, each local agent m downloads the following: (a) the key term vector set, containing at most d feature vectors of dimension d; (b) the repetition counts n_{m,k}^p for each key term k ∈ K_m^p, totaling at most d integers; and (c) the estimated preference vector θ̂_p, a d-dimensional vector. On the other hand, the local agent uploads the following: (a) at most d eigenvalues and their corresponding eigenvectors; and (b) the matrices G_m^p and W_m^p, each of size d². Considering that the number of phases is at most log T, the upload and download costs are both O(d²M log T).

E. Proof of Theorem 3

Proof. Part 1) follows directly from Line 4 of Algorithm 1. For 2), in phase p, the number of arm pulls n_m^p of each local agent m is Σ_{a∈A_m^p} n_{m,a}^p = Σ_{a∈A_m^p} ⌈ 2^{2p+1} (d/|A_m^p|) log(2AM log T / δ) ⌉ ≥ 2^{2p+1} d log(2AM log T / δ). The number of key term pulls ñ_m^p of local agent m is given by Σ_{k∈K_m^p} n_{m,k}^p = Σ_{j: λ_j < h_p} (2d(h_p − λ_j)/(β² 2^{−2p})) log(2AM log T / δ) ≤ Σ_{j=1}^{d} (d(h_p − γ)/(β² 2^{−2p−1})) log(2AM log T / δ). Thus, the ratio between the number of key terms and arms for any m ∈ M is upper bounded by ñ_m^p / n_m^p ≤ (d h_p − dγ)/β² = (3/(4(1 − 2^{−2p})) − dγ)/β² ≤ (1 − dγ)/β².

REFERENCES

[1] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg et al., "Sparks of artificial general intelligence: Early experiments with gpt-4," arXiv preprint arXiv:2303.12712, 2023.
[2] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., "Llama 2: Open foundation and fine-tuned chat models," arXiv preprint arXiv:2307.09288, 2023.
[3] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, "Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing," ACM Computing Surveys, vol. 55, no. 9, pp. 1–35, 2023.
[4] Q. Guo, R. Wang, J. Guo, B. Li, K. Song, X. Tan, G. Liu, J. Bian, and Y. Yang, "Connecting large language models with evolutionary algorithms yields powerful prompt optimizers," arXiv preprint arXiv:2309.08532, 2023.
[5] R. Pan, S. Xing, S. Diao, X. Liu, K. Shum, J. Zhang, and T. Zhang, "Plum: Prompt learning using metaheuristic," arXiv preprint arXiv:2311.08364, 2023.
[6] R. Pryzant, D. Iter, J. Li, Y. T. Lee, C. Zhu, and M. Zeng, "Automatic prompt optimization with "gradient descent" and beam search," arXiv preprint arXiv:2305.03495, 2023.
[7] L. Chen, M. Zaharia, and J. Zou, "Frugalgpt: How to use large language models while reducing cost and improving performance," arXiv preprint arXiv:2305.05176, 2023.
[8] C. Shi, K. Yang, J. Yang, and C. Shen, "Best arm identification for prompt learning under a limited budget," arXiv preprint arXiv:2402.09723, 2024.
[9] K. Shuster, J. Xu, M. Komeili, D. Ju, E. M. Smith, S. Roller, M. Ung, M. Chen, K. Arora, J. Lane et al., "Blenderbot 3: a deployed conversational agent that continually learns to responsibly engage," arXiv preprint arXiv:2208.03188, 2022.
[10] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, "Direct preference optimization: Your language model is secretly a reward model," Advances in Neural Information Processing Systems, vol. 36, 2024.
[11] X. Dai, J. Li, X. Liu, A. Yu, and J. Lui, "Cost-effective online multi-llm selection with versatile reward models," arXiv preprint arXiv:2405.16587, 2024.
[12] V. Dwaracherla, S. M. Asghari, B. Hao, and B. Van Roy, "Efficient exploration for llms," arXiv preprint arXiv:2402.00396, 2024.
[13] Poe, https://poe.com/ChatGPT, 2024.03.
[14] C. Gao, W. Lei, X. He, M. de Rijke, and T.-S. Chua, "Advances and challenges in conversational recommender systems: A survey," AI Open, vol. 2, pp. 100–126, 2021.
[15] Y. Sun and Y. Zhang, "Conversational recommender system," in ACM SIGIR Conference, 2018, pp. 235–244.
[16] X. Dai, Z. Wang, J. Xie, X. Liu, and J. C. Lui, "Conversational recommendation with online learning and clustering on misspecified users," IEEE Transactions on Knowledge and Data Engineering, vol. 36, no. 12, pp. 7825–7838, 2024.
[17] X. Zhang, H. Xie, H. Li, and J. C. S. Lui, "Conversational contextual bandit: Algorithm and application," in Proceedings of The Web Conference, 2020, pp. 662–672.
[18] J. Wu, C. Zhao, T. Yu, J. Li, and S. Li, "Clustering of conversational bandits for user preference learning and elicitation," in Proceedings of the ACM CIKM, 2021, pp. 2129–2139.
[19] Z. Wang, X. Liu, S. Li, and J. C. S. Lui, “Efficient explorative key-term
selection strategies for conversational contextual bandits,” Proceedings
of the AAAI, pp. 10 288–10 295, 2023.
[20] X. Liu, H. Zhao, T. Yu, S. Li, and J. C. Lui, “Federated online clustering
of bandits,” in Proceedings of the UAI, 2022, pp. 1221–1231.
[21] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári, “Improved algorithms for
linear stochastic bandits,” in Proceedings of the NeurIPS, 2011.
[22] R. Huang, W. Wu, J. Yang, and C. Shen, “Federated linear contextual
bandits,” Advances in neural information processing systems, vol. 34, pp.
27 057–27 068, 2021.
[23] T. Lattimore and C. Szepesvári, Bandit Algorithms. Cambridge
University Press, 2020.
[24] Z. Li, M. Liu, and J. C. S. Lui, “Fedconpe: Efficient federated
conversational bandits with heterogeneous clients,” in Proceedings of
the Thirty-Third International Joint Conference on Artificial Intelligence,
IJCAI-24. International Joint Conferences on Artificial Intelligence
Organization, 8 2024, pp. 4533–4541.
[25] J. Lin and S. Moothedath, “Federated stochastic bandit learning with
unobserved context,” arXiv preprint arXiv:2303.17043, 2023.
[26] Y. Wang, J. Hu, X. Chen, and L. Wang, “Distributed bandit learning:
Near-optimal regret with efficient communication,” in Proceedings of
the ICLR, 2020.
[27] C. Zhao, T. Yu, Z. Xie, and S. Li, “Knowledge-aware conversational
preference elicitation with bandit feedback,” in Proceedings of the ACM
Web Conference 2022, 2022, p. 483–492.
[28] X. Dai, Z. Wang, J. Xie, T. Yu, and J. C. Lui, “Online learning and
detecting corrupted users for conversational recommendation systems,”
IEEE Transactions on Knowledge and Data Engineering, vol. 36, no. 12,
pp. 8939–8953, 2024.
[29] J. Kiefer and J. Wolfowitz, “The equivalence of two extremum problems,”
Canadian Journal of Mathematics, vol. 12, p. 363–366, 1960.
[30] J. Lee, Z. Dai, X. Ren, B. Chen, D. Cer, J. R. Cole, K. Hui, M. Boratko,
R. Kapadia, W. Ding, Y. Luan, S. M. K. Duddu, G. H. Abrego, W. Shi,
N. Gupta, A. Kusupati, P. Jain, S. R. Jonnalagadda, M.-W. Chang, and
I. Naim, “Gecko: Versatile text embeddings distilled from large language
models,” arXiv preprint arXiv:2403.20327, 2024.
[31] N. Muennighoff, N. Tazi, L. Magne, and N. Reimers, “Mteb: Massive
text embedding benchmark,” arXiv preprint arXiv:2210.07316, 2023.
[32] A. Köpf, Y. Kilcher, D. von Rütte, S. Anagnostidis, Z.-R. Tam, K. Stevens,
A. Barhoum, N. M. Duc, O. Stanley, R. Nagyfi, S. ES, S. Suri,
D. Glushkov, A. Dantuluri, A. Maguire, C. Schuhmann, H. Nguyen,
and A. Mattick, “Openassistant conversations – democratizing large
language model alignment,” arXiv preprint arXiv:2304.07327, 2023.
[33] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings
using Siamese BERT-networks,” in Proceedings of the EMNLP-IJCNLP,
2019, pp. 3982–3992.
[34] P. Sahoo, A. K. Singh, S. Saha, V. Jain, S. Mondal, and A. Chadha,
“A systematic survey of prompt engineering in large language models:
Techniques and applications,” arXiv preprint arXiv:2402.07927, 2024.
[35] Ollama, https://github.com/jmorganca/ollama/, 2024.06.
[36] K. Christakopoulou, F. Radlinski, and K. Hofmann, “Towards con-
versational recommender systems,” in Proceedings of ACM SIGKDD
International Conference, 2016, p. 815–824.
[37] Z. Zhang, S. Wang, W. Yu, Y. Xu, D. Iter, Q. Zeng, Y. Liu, C. Zhu, and
M. Jiang, “Auto-instruct: Automatic instruction generation and ranking
for black-box language models,” arXiv preprint arXiv:2310.13127, 2023.
[38] R. Bhardwaj, Z. Xia, G. Ananthanarayanan, J. Jiang, Y. Shu, N. Kari-
anakis, K. Hsieh, P. Bahl, and I. Stoica, “Ekya: Continuous learning of
video analytics models on edge compute servers,” in Proceedings of the
NSDI, 2022, pp. 119–135.
[39] Y. Xia, F. Kong, T. Yu, L. Guo, R. A. Rossi, S. Kim, and S. Li, “Which llm
to play? convergence-aware online model selection with time-increasing
bandits,” arXiv preprint arXiv:2403.07213, 2024.