Multi-Agent Conversational Online Learning
xuchuangwang@umass.edu, huanyuhello@zju.edu.cn
Zhuohua Li is the corresponding author.
Abstract—The remarkable generative capability of large language models (LLMs) has sparked a growing interest in automatically generating responses for different applications. Given the dynamic nature of user preferences and the uncertainty of LLM response performance, it is crucial to design efficient online learning algorithms to identify optimal LLM responses (i.e., high-quality responses that also meet user preferences). Most existing online algorithms adopt a centralized approach and fail to leverage explicit user preferences for more efficient and personalized LLM response identification. In contrast, this paper introduces MACO (Multi-Agent Conversational Online Learning for Adaptive LLM Response Identification): 1) the online LLM response identification process is accelerated by multiple local agents (such as smartphones), while enhancing data privacy; 2) a novel conversational mechanism is proposed to adaptively conduct conversations for soliciting user preferences (e.g., a preference for a humorous tone over a serious one in generated responses), so as to minimize uncertainty in preference estimation. Our theoretical analysis demonstrates that MACO is near-optimal regarding cumulative regret. Additionally, MACO offers reduced communication costs and computational complexity by eliminating the traditional, computing-intensive "G-optimal design" found in previous works. Extensive experiments with the open LLM Llama, coupled with two different embedding models from Google and OpenAI for text vector representation, demonstrate that MACO significantly outperforms the current state-of-the-art in online LLM response identification.

I. INTRODUCTION

Large language models (LLMs) have swiftly transformed the technological landscape of our society [1], [2]. A significant line of research is the exploration of prompts to identify optimal responses from LLMs [3]. This approach is compelling since it does not need to alter the internal parameters of an LLM, and it can align well with human conversational patterns. Consequently, there is a growing interest in automatically identifying LLM responses, e.g., through prompt engineering methods [4], [5], [6]. These efforts aim to enhance LLMs' capability to produce more accurate and relevant responses, collectively referred to as "LLM response identification". Note that these prompt engineering methods are done offline and only provide an "initiatory set of relatively good responses" via pre-specified prompt instructions. However, considering the diversity of responses generated by LLMs and the uncertainty in LLM performance, identifying the most suitable LLM response is inherently challenging [7], [8], as suitable responses are usually unknown in advance and context-dependent. Therefore, continuous online response adaptation is necessary [9], especially in scenarios such as medical diagnosis where highly accurate answers are required. Note that the online response identification approach can enhance the initiatory set of offline-generated responses so as to match the specific context.

Furthermore, previous research has often overlooked the need to address diverse user preferences. It is crucial not only to ensure the quality of responses generated by LLMs, but also to tailor them to meet the specific preferences and expectations of different users. For instance, some users may prefer LLM-generated responses to be humorous, while others might prefer a more formal tone. Although [10] considers the optimization of preferences for LLMs, it only addresses the binary case of users' likes and dislikes. LLM response identification must address the growing demand to cater to diverse user preferences. To address such needs, one can utilize cloud servers to continuously learn and refine LLM response identification by collecting feedback on the assessment of LLM responses. This feedback can be derived from users' direct input or from measurements of score functions [11], [12]. A response that not only meets quality standards but also aligns with user preferences is termed an "optimal LLM response."

A. Multi-Agent Conversational Properties

In the context of LLM response identification, we observe two significant properties in typical LLM application scenarios. These properties inform and motivate our proposed formulation.

First, in the utilization of LLMs, users commonly access LLM services across multiple devices, such as smartphones, tablets, and desktops, collectively referred to as "local agents." For example, the Poe AI chatting platform [13] handles user queries originating from various devices.
Leveraging this multi-agent framework, LLM response identification tailored to specific user preferences can be performed concurrently on each local agent, facilitating data aggregation and enhancing learning efficiency with respect to a user's preference. Moreover, this approach offers an added layer of privacy protection, as sensitive information remains localized and is neither transmitted nor stored on central servers.

Second, a key challenge for online LLM methods lies in addressing the "cold start" problem, where response identification may be inaccurate for new users with limited historical data. To address this, conversational recommendation [14], [15], [16] has been applied in LLM applications. In this approach, the cloud server can proactively query users with questions and obtain feedback, thereby quickly eliciting user preferences. For example, in OpenAI's design, when ChatGPT is tasked with computing factorials in Python, it may provide two "correct" implementations with different styles: one recursive, the other iterative. During the interaction, the user provides feedback on their preferred coding style. This "conversation" process allows ChatGPT to learn from the user's code preferences, enabling it to tailor its future responses more effectively to individual users.

B. Challenges and Our Contributions

To adaptively identify the appropriate LLM responses, which were generated from an initiatory set of responses produced through offline prompt engineering techniques, we propose to utilize online contextual bandit approaches, where a sequential decision-making cloud server selects LLM responses (i.e., an arm corresponds to a response) for users and receives feedback. Besides the arm-level feedback, the cloud server can occasionally prompt users with questions about key terms [17], [18]. For example, asking about the user's preference on a category: "Are you interested in news about basketball?", or asking about the user's preference on an entity: "Do you like to read news related to LeBron James?". The feedback on key terms like "basketball" and "LeBron James" can reflect user preferences, allowing the cloud server to accelerate the learning process. The objective is to develop an online adaptive strategy that maximizes user satisfaction over the long term. However, current work on conversational contextual bandit algorithms falls short of addressing the unique challenges of online adaptive LLM response identification:

❶ Firstly, existing bandit models that account for user preferences are predominantly employed in recommendation systems [18], [19], [20]. These models typically utilize Singular Value Decomposition (SVD) to extract feature vectors of comparatively lower dimensions. However, quantifying features from LLM text responses, which contain complex semantic information and lead to much higher-dimensional feature spaces, presents significant computational challenges.

❷ Secondly, previous conversational bandit works primarily follow the framework of [21], which addresses infinitely many arms. However, the number of LLM responses that need online identification from an initiatory set of responses generated via prompt engineering is typically finite. While elimination-based contextual bandit algorithms can handle this setting, they rely on the computationally intensive G-optimal design procedure [22], [23], [24] to calculate a distribution for arm selection, thus slowing down the online LLM response identification.

❸ Thirdly, existing studies on conversational bandits [19], [17] rely on predetermined functions to control conversation frequency, which typically follow a fixed sequence of engagements to initiate a specific number of conversations. This approach is not suitable for the dynamic nature of LLM response identification, as it imposes unnecessary restrictions and could degrade user experience.

❹ Finally, the existing literature on conversational bandits solely considers centralized scenarios, neglecting the inherent multi-agent nature of the data sources of LLM platforms. While there are works on distributed bandits with finite arms [22], [25], [26], they either require all local agents to upload user feedback to the cloud server or require them to share exactly the same arm set. These restrictive settings can leak sensitive information, reduce the flexibility of local agents, and increase communication costs.

This paper makes the following contributions:
• Model Formulation: We propose a distributed conversational bandit model for online LLM response identification. Complementing existing methods that rely on offline selection from a pre-generated pool of LLM responses, our model emphasizes "online identification" of the optimal LLM response from the pre-generated arm set with uncertain performance. This involves ensuring the quality of the generated response while considering user preferences.
• Algorithm Design: We propose the Conversational Adaptive Distributed Identifier (MACO), comprising MACO-A, which is executed by local agents, and MACO-S, which is executed by the cloud server. Unlike previous works with predetermined conversation frequencies, MACO adaptively decides when to engage in conversations based on the current context. Additionally, it enhances collaboration among local agents to improve the efficiency of LLM response identification.
• Theoretical Analysis: We establish the regret upper bound of MACO as Õ(√(dMT)), together with a lower bound of Ω(√(dMT)), indicating that MACO is near-optimal. Additionally, we leverage the conversational setting to enhance efficiency in both computation and communication, compared to existing work on distributed linear contextual bandits with finite arm sets. Specifically, we provide an upper bound on the communication cost of O(d²M log T). The development of distributed conversational bandits in MACO successfully avoids the computationally intensive G-optimal design, which is required in previous elimination-based linear bandits.
• Experimental Evaluation: We conduct extensive experiments using the open LLM Llama to generate responses, coupled with two different embedding models from Google and OpenAI for text vector representation. Testing under various conditions, including different arm pool sizes and numbers of local agents, our algorithm consistently outperforms state-of-the-art methods. Additionally, by eliminating the time-intensive G-optimal design procedure, our approach significantly reduces execution time. This reduction does not compromise performance, thanks to our conversational mechanism design, which enhances the speed of online LLM response identification and of the estimation of user preferences.
[Fig. 1 illustration: a preference-elicitation dialogue ("Which response do you prefer? Your choice will help make ChatGPT better.") comparing two offline-generated arm candidates, a recursive factorial implementation in Python (key term: Python) and one in C (key term: C); plus the framework diagram in which each local agent holds locally stored user data and interacts with the user through LLM responses and arm feedback, while the cloud server conducts conversations on key terms and collects the conversational feedback.]
Fig. 1: An adaptive multi-agent conversational bandit framework for identifying online LLM responses. Local agents handle response selection (arms), while a central server manages conversation flow through key term selection. The server aggregates interaction data across multiple agents to accelerate user preference learning.

B. Multi-Agent User-Personalized Bandits

We consider a multi-agent conversational bandit setting involving M agents and a cloud server. At each round t ∈ [T], a local agent m ∈ M selects an arm a_{m,t} ∈ A_m, which denotes one possible LLM response, and receives reward feedback r_{m,t} that reflects the corresponding performance. Eliciting user feedback is beyond the scope of this work; here, the term "feedback" broadly encompasses direct user input, data inferred from techniques that measure user behavior, and preference simulators [12]. The user's preference for LLM responses is represented by an "unknown" preference feature vector θ* ∈ R^d, which all local agents aim to learn. For a local agent m ∈ M, considering both the impact of the LLM response (i.e., arm a_{m,t} ∈ A_m) and the unknown user preference θ*, the reward can be expressed as a linear combination with a noise term η_{m,t}: r_{a_m,t} = ⟨x_{a_m,t}, θ*⟩ + η_{m,t}, where x_{a_m,t} ∈ R^d is the embedding feature vector of the corresponding arm, used to capture the textual information [1], [3]. We will demonstrate the generality of our model using two different open embedding approaches in Section V. Our objective is to design a policy that selects arms (i.e., LLM responses) in each round to minimize cumulative regret, defined as the difference between the cumulative rewards of our policy and of the best unknown policy across all local agents, tailored to personalized user preferences:

R_M(T) = Σ_{m=1}^{M} Σ_{t=1}^{T} ( x_{a*_m}^⊤ θ* − x_{a_{m,t}}^⊤ θ* ),   (1)

where a*_m ∈ arg max_{a∈A_m} x_a^⊤ θ* denotes the locally optimal arm with the highest expected reward at local agent m ∈ M. This regret definition follows prior works [21], [17], [18].
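To make the linear reward model and the regret in Eq. (1) concrete, here is a minimal simulation sketch. It is not part of the original paper: the dimensions, the Gaussian noise level, and the uniform-random placeholder policy are our own illustrative assumptions.

```python
import numpy as np

# Illustrative sketch of the reward model r = <x_a, theta*> + eta and Eq. (1).
rng = np.random.default_rng(0)
d, M, T, A = 8, 4, 1000, 20

theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)          # ||theta*|| <= 1, as assumed later

# Each local agent m holds its own arm set A_m of unit-norm embedding vectors.
arm_sets = []
for _ in range(M):
    X = rng.normal(size=(A, d))
    arm_sets.append(X / np.linalg.norm(X, axis=1, keepdims=True))

regret = 0.0
for m in range(M):
    X = arm_sets[m]
    expected = X @ theta_star                     # <x_a, theta*> for every arm
    best = expected.max()                         # locally optimal arm a*_m
    for t in range(T):
        a = rng.integers(A)                       # placeholder policy (uniform)
        reward = expected[a] + rng.normal(scale=0.1)   # noisy observed reward
        regret += best - expected[a]              # per-round term of Eq. (1)

print(f"cumulative regret R_M(T) of the uniform policy: {regret:.1f}")
```

A learning policy such as MACO replaces the uniform choice above and drives the per-round regret terms toward zero as the preference estimate improves.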
Formally, let K denote the finite set of key terms, with each element x̃_k ∈ R^d being the feature vector of the corresponding key term k ∈ K. Applying conversational bandits to our multi-agent framework, a user served by local agent m can be queried with a key term k_m ∈ K_m, where K_m ⊆ K is the subset of key terms at local agent m. Considering the user preference θ* and a noise term η̃_{m,t}, the conversational feedback is modeled as r̃_{k_m,t} = ⟨x̃_{k_m,t}, θ*⟩ + η̃_{m,t}. Note that our model diverges from previous conversational bandits [17], [18], [27], [28], which employ a fixed conversation function, typically linear or logarithmic in the round t, to regulate the frequency of conversations. These methods initiate conversations periodically.

Algorithm 1: MACO on Local Agent (MACO-A)
Input: Round horizon T, number of local agents M, input dimension d, arm set A_m, arm pool size A, confidence parameter δ ∈ (0, 1]
Initialization: Let p = 1, A_m^p = A_m
1: while T has not been reached do
2:   Calculate M_m^p = Σ_{a ∈ A_m^p} (1/|A_m^p|) x_a x_a^⊤
3:   Diagonalize M_m^p = Σ_{j=1}^{d} λ_{v_j} v_j v_j^⊤
4:   Upload eigenvector v_j if its corresponding eigenvalue satisfies λ_{v_j} < h_p := 3 / (4(1 − 2^{−2p}) d)
5:   Download K_m^p and {n_{m,k}^p}_{k ∈ K_m^p} from the cloud server
6:   foreach k ∈ K_m^p do   ▷ Conduct conversations
7:     Query key term k for n_{m,k}^p times
8:     Receive the conversational rewards {r̃_{k,t}}_{t ∈ T̃_{m,k}^p}
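As a rough illustration of Lines 2-4 of Algorithm 1 (a sketch under assumed dimensions, not the authors' released implementation), the snippet below builds the phase information matrix M_m^p from the active arms, diagonalizes it, and flags the eigenvectors whose eigenvalues fall below the phase threshold h_p = 3/(4(1 − 2^{−2p})d) for upload to the cloud server.

```python
import numpy as np

def weak_directions(active_arms: np.ndarray, p: int):
    """Return (h_p, [(eigenvalue, eigenvector), ...]) for the under-explored
    directions a local agent would upload in phase p (Lines 2-4 of MACO-A)."""
    n, d = active_arms.shape
    # Line 2: information matrix of the current active arm set.
    M_p = active_arms.T @ active_arms / n
    # Line 3: eigendecomposition (M_p is symmetric positive semi-definite).
    eigvals, eigvecs = np.linalg.eigh(M_p)
    # Line 4: upload directions whose eigenvalue is below the phase threshold.
    h_p = 3.0 / (4.0 * (1.0 - 2.0 ** (-2 * p)) * d)
    weak = [(lam, eigvecs[:, j]) for j, lam in enumerate(eigvals) if lam < h_p]
    return h_p, weak

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    d, n_arms, p = 16, 30, 1
    X = rng.normal(size=(n_arms, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm arm embeddings
    h_p, weak = weak_directions(X, p)
    print(f"h_p = {h_p:.4f}, directions to upload: {len(weak)}")
```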
The part of MACO executed on the local agents is referred to as MACO Agent (MACO-A); it is the online process of handling and updating the information for LLM responses. The local agent engages in conversations with the downloaded key terms while pulling arms the requisite number of times, to ensure robust exploration of the LLM responses. During this process, the local agent has the flexibility to intersperse the querying of key terms with arm pulls (Lines 6-12). Note that the procedures of conducting conversations and pulling arms are presented sequentially for clarity, but they can be executed in parallel or interleaved without strict ordering. The local agent then uploads the corresponding information of pulled arms, key terms, and observed rewards, which is stored in the matrices G_m^p and W_m^p (Line 13). Finally, the local agent downloads the updated preference parameter θ̂_p from the cloud server and revises its active arm set, eliminating less effective arms based on the updated user preference estimate (Line 15). This adaptive adjustment process allows each local agent to maintain high responsiveness and accuracy in LLM response identification, which caters to user-specific needs and preferences while preserving data privacy by sharing only aggregated data (G_m^p and W_m^p) with the cloud server.
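The aggregated quantities G_m^p and W_m^p mentioned above are the usual least-squares sufficient statistics. A minimal sketch (our own illustration, with hypothetical variable names) of how a local agent could accumulate them from its observed feature/reward pairs before uploading:

```python
import numpy as np

def phase_statistics(features: np.ndarray, rewards: np.ndarray):
    """Accumulate least-squares sufficient statistics for one phase:
    G_m^p = sum_t x_t x_t^T and W_m^p = sum_t r_t x_t (arm pulls and key-term
    queries contribute in the same way)."""
    G = features.T @ features            # d x d information matrix
    W = features.T @ rewards             # d-dimensional response vector
    return G, W

# Toy usage: 50 observations in a 16-dimensional feature space.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 16))
r = rng.normal(size=50)
G_mp, W_mp = phase_statistics(X, r)
print(G_mp.shape, W_mp.shape)            # (16, 16) (16,)
```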
B. MACO Algorithm on Cloud Server
Algorithm 2: MACO on Cloud Server (MACO-S)
Input: Key term set K, coverage parameter β in Condition 1
Initialization: Let p = 1, G = 0, W = 0
1: while T has not been reached do
2:   foreach m ∈ M do
3:     Receive all eigenvectors uploaded by local agent m, and denote this set as S_m
4:     Initialize the set of key terms at phase p as K_m^p = ∅
5:     foreach v_j ∈ S_m do
6:       k = arg max_{i ∈ K} x̃_i^⊤ v_j,  K_m^p = K_m^p ∪ {k}
7:       n_{m,k}^p = ⌈ (3/(2(1 − 2^{−2p})) − 2 d λ_{v_j}) · (β² 2^{−2p})^{−1} · log(2AM log T / δ) ⌉
8:     Send K_m^p and {n_{m,k}^p}_{k ∈ K_m^p} to local agent m
9:     Receive G_m^p and W_m^p from local agent m
10:  G = Σ_{p' ∈ [p]} Σ_{m ∈ M} G_m^{p'},  W = Σ_{p' ∈ [p]} Σ_{m ∈ M} W_m^{p'}
11:  Broadcast θ̂_p = G^{−1} W to all local agents
12:  p = p + 1
Next, we present the part of the MACO algorithm that is executed on the cloud server, called MACO Server (MACO-S). As mentioned in Section I, a significant challenge arises from the heterogeneity of local agents in the multi-agent conversational bandit model. This diversity can hinder effective data aggregation, potentially leading to suboptimal estimation of the user preference vector θ*. To address this issue, the cloud server employs a strategic approach using key terms to probe and enrich the information in underrepresented directions of the feature space, thereby enhancing the overall accuracy of the estimation process.

As detailed in Algorithm 2, the cloud server first receives, from each local agent, eigenvectors representing directions with insufficient information about the LLM response space (Line 7). Utilizing these insights, the cloud server identifies and selects key terms by calculating the closest match, in terms of the inner product, with the underexplored directions. The chosen key term k ∈ K, along with the designated repetition times n_{m,k}^p, is then communicated back to the respective local agents (Line 8). This targeted intervention allows for focused exploration and refinement of LLM responses related to these key terms. Finally, the cloud server aggregates the enriched data from all local agents. This aggregated data is used to estimate the unknown preference parameter θ* via linear regression, effectively minimizing uncertainty and enhancing the model's ability to predict and adapt LLM responses tailored to user preferences (Lines 10-11). Moreover, G can also be initialized as an identity matrix to ensure invertibility, especially when the dimension d is large.

C. Comparative Analysis

Generally, as mentioned in Section I, the number of LLM responses needing online identification from an initial set generated by prompt engineering is typically finite. Therefore, we employ phase-elimination-based algorithms for linear bandits, referred to as PE-Lin, instead of the classical conversational bandit framework proposed by [17]. This choice is motivated by the better performance guarantees of PE-Lin under finite arm sets. Our work builds upon and improves the classical PE-Lin [23]. In PE-Lin, a learning agent always estimates the unknown preference vector θ* using an optimal least squares design. Specifically, the algorithm minimizes prediction variance by implementing the computing-intensive G-optimal design, a probability distribution over the arm feature vector set X ⊂ R^d (represented by a distribution policy π : X → [0, 1]), to ensure minimal variance g(π). The conditions are defined as [29]:

Σ_{x∈X} π(x) = 1,   M(π) = Σ_{x∈X} π(x) x x^⊤,   g(π) = max_{x∈X} ∥x∥²_{M(π)^{−1}} = d.   (2)

The learning agent then plays arms according to the policy π for local agent m at phase p, estimates the unknown parameter θ*, and eliminates inferior arms accordingly. As noted in [22], there is currently no efficient algorithm for computing the G-optimal design in the multi-agent scenario.

We avoid using the G-optimal design by leveraging the inherent multi-agent heterogeneity in LLM applications, combined with an adaptive conversational mechanism. MACO eliminates the need for the resource-intensive G-optimal design, thereby significantly reducing computation time and resources. Additionally, merely executing PE-Lin independently on each local agent with subsequent data aggregation by the cloud server may fail to minimize regret efficiently. This is because different agents may have distinct LLM response sets, resulting in a trivial regret bound of Õ(M√(dT)), which is equivalent to running PE-Lin on each agent without any direct communication. In contrast, our algorithm improves the regret upper bound to Õ(√(dMT)) by efficiently utilizing the conversations to aggregate the information from different local agents, which will be detailed in Section IV.
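The server-side steps just described can be summarized by the following compact sketch (Lines 6-7 and 10-11 of Algorithm 2). This is an illustrative reading of the reconstructed pseudocode rather than the released implementation; the repetition-count formula mirrors Line 7 and all variable names are our own.

```python
import numpy as np

def select_key_terms(key_term_feats, weak_dirs, weak_vals, p, d, A, M, T,
                     beta=1.0, delta=0.1):
    """For each uploaded weak direction, pick the key term with the largest
    inner product (Line 6) and its repetition count n_{m,k}^p (Line 7)."""
    chosen = {}
    log_term = np.log(2 * A * M * np.log(T) / delta)
    for v, lam in zip(weak_dirs, weak_vals):
        k = int(np.argmax(key_term_feats @ v))                # closest key term
        n = 3.0 / (2.0 * (1.0 - 2.0 ** (-2 * p))) - 2.0 * d * lam
        n = int(np.ceil(n / (beta ** 2 * 2.0 ** (-2 * p)) * log_term))
        chosen[k] = max(chosen.get(k, 0), n)
    return chosen

def estimate_preference(G_list, W_list):
    """Aggregate uploaded statistics and broadcast the estimate (Lines 10-11)."""
    G = sum(G_list)
    W = sum(W_list)
    return np.linalg.solve(G, W)          # theta_hat_p = G^{-1} W

# Toy usage: one weak direction, an orthonormal key-term set, and dummy statistics.
d = 8
key_terms = np.eye(d)
v = np.zeros(d); v[2] = 1.0
print(select_key_terms(key_terms, [v], [0.01], p=1, d=d, A=20, M=4, T=10_000))
print(estimate_preference([np.eye(d) * 5.0], [np.ones(d)]))
```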
IV. PERFORMANCE ANALYSIS

This section presents the theoretical results of MACO, including its cumulative regret, communication cost, and conversation frequency. In line with common practices in [21], [20], we assume that for any arm a and key term k, ∥x_a∥ = ∥x̃_k∥ = 1, that the length of the preference vector θ* is bounded by 1, and that the noise terms η_{m,t} and η̃_{m,t} are 1-subgaussian.

A. Main Results

We first present a new technical condition that addresses general issues related to feature space coverage.

Condition 1 (Feature Space Coverage). We say a key term set K is sufficiently rich for covering the feature space if, for any unit vector v ∈ R^d, there exists a key term k ∈ K such that its feature vector x̃_k satisfies x̃_k^⊤ v ≥ β, where β ∈ (0, 1] is a positive coverage parameter close to 1.

Remark 1. Condition 1 is crucial for ensuring the comprehensive distribution of key terms across the feature space, which can facilitate effective uncertainty minimization for each local agent. This condition is easily met if the key term set K includes an orthonormal basis of R^d. Condition 1 enables us to sidestep the G-optimal design procedure, which is typically employed in traditional elimination-based algorithms to minimize the maximum prediction variance, as described in [23].
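Condition 1 can also be checked numerically for a given key-term set. The short sketch below is our own illustration (not from the paper): it estimates the coverage parameter β for a key-term set consisting of the standard basis together with its negations, for which the guaranteed coverage is 1/√d.

```python
import numpy as np

def empirical_coverage(key_term_feats: np.ndarray, n_samples: int = 20_000, seed: int = 0):
    """Estimate beta = min over unit vectors v of max_k <x_k, v> by sampling
    random unit directions (Condition 1, Feature Space Coverage)."""
    rng = np.random.default_rng(seed)
    d = key_term_feats.shape[1]
    v = rng.normal(size=(n_samples, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    best_per_direction = (v @ key_term_feats.T).max(axis=1)
    return best_per_direction.min()

d = 16
basis = np.eye(d)
key_terms = np.vstack([basis, -basis])      # orthonormal basis plus its negations
beta_hat = empirical_coverage(key_terms)
print(f"empirical beta ~ {beta_hat:.3f} (guaranteed lower bound 1/sqrt(d) = {1/np.sqrt(d):.3f})")
```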
For sufficiently rich key term sets, based on Condition 1, we provide the following theorems.

Theorem 1 (Regret Bounds). For the cumulative regret defined in Eq. (1), we have the following upper and lower bounds:
1) Upper Bound: With probability at least 1 − δ, the regret is bounded above by O(√(dMT log(AM log T / δ))).
2) Lower Bound: For any policy that selects at most one key term per round, there exists an instance where the policy incurs an expected regret of at least Ω(√(dMT)).

Remark 2. The regret bounds established in Theorem 1 reveal important insights into the performance of our approach:
• When M = 1, the problem simplifies to single-agent conversational bandits, reducing the regret to Õ(√(dT)). This outperforms the previous regret upper bound of Õ(d√T) from studies such as [19], [17], by leveraging phase elimination on finite arm sets. The improvement is particularly significant for high-dimensional LLM response feature vectors.
• For multi-agent systems, our upper bound aligns with the nearly optimal results described in [22], [24], while eliminating the reliance on the computationally intensive G-optimal design, thereby speeding up the online process.
• Collectively, the regret upper and lower bounds indicate that MACO is minimax optimal up to a logarithmic factor [23], aligning closely with the theoretical regret bounds in multi-agent conversational bandit scenarios.

Theorem 2 (Communication Cost). The total communication cost of the MACO algorithm scales as O(d²M log T).

Remark 3. The communication cost of our algorithm MACO is notably independent of the arm pool size A, which can range into the thousands depending on the diversity of candidate LLM responses. This contrasts with the approach described in [22], where the communication cost scales as O(d²AM log T), reflecting a substantial increase with the number of arms. Our approach significantly reduces communication costs by eliminating the need for each local agent to upload its entire active arm set, whose cardinality is O(A). Instead, local agents independently process their data and transmit only aggregated results to the cloud server, which also enhances privacy by limiting external data sharing in LLM response adaptation.

Theorem 3 (Bound on Conversation Frequency). For any local agent m ∈ M during phase p, let γ = λ_min(M_m^p), where λ_min denotes the smallest eigenvalue. Then:
1) If γ ≥ h_p, no conversations will be initiated.
2) If γ < h_p, the fraction of conversations relative to the total phase length is capped at β^{−2}(3/(4(1 − 2^{−2p})) − dγ).

Remark 4. Our approach introduces an "adaptive" method that differs significantly from the common deterministic functions b(t), such as linear or logarithmic dependencies on the round t, widely employed in existing studies on conversational bandits [17], [19]. These traditional methods initiate conversations at fixed intervals, which can lead to inefficiencies, especially when user preferences are already well understood. In contrast, our model dynamically adjusts the conversation frequency based on the current gaps in user preference information, offering a more realistic and responsive interaction paradigm.

B. Technical Analysis

We now provide an analysis of the upper bound in Theorem 1. Proofs of the other theorems can be found in Appendices C to E. Below, we present two critical lemmas related to the design of our multi-agent conversational bandit algorithm. Lemma 1 guarantees that, for any local agent m, the smallest eigenvalue of the information matrix, adjusted for conversational feedback, remains above h_p; this supports the design of Line 4 in Algorithm 1. Lemma 2 ensures that the algorithm operates within established error limits, which is essential for reliable LLM response identification.

Lemma 1 (Stability of the Information Matrix). For any local agent m ∈ M during phase p, we have λ_min(M'^p_m) ≥ h_p, where M'^p_m := M_m^p + Σ_{k ∈ K_m^p} ((h_p − λ_k)/β²) x̃_k x̃_k^⊤.

Proof. Please refer to Appendix A for the proof.

Lemma 2 (Reliability of Estimation Error Bounds). Define the "bad" event E in which some local agent m at phase p has

E = { ∃ m ∈ M, a ∈ A_m^p : ⟨θ̂_p − θ*, x_a⟩ > 2^{−p}/√M }.

The probability of E is bounded by δ, i.e., Pr[E] ≤ δ.

Proof. See Appendix B for details.
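The following toy computation, which we add here for illustration (it is not from the paper), mirrors the mechanism behind Lemma 1: starting from an information matrix with one weak eigendirection, adding the conversational term ((h_p − λ)/β²) x̃ x̃^⊤ for a key term aligned with that direction lifts the smallest eigenvalue to at least h_p.

```python
import numpy as np

rng = np.random.default_rng(4)
d, p, beta = 8, 1, 1.0
h_p = 3.0 / (4.0 * (1.0 - 2.0 ** (-2 * p)) * d)

# An information matrix M_m^p that is well explored except in one direction.
M = np.eye(d) * 0.5
M[0, 0] = 0.01                                     # weak eigenvalue lambda < h_p
lam, vecs = np.linalg.eigh(M)
weak_val, weak_vec = lam[0], vecs[:, 0]

# Key term chosen by the server: perfectly aligned with the weak direction
# (Condition 1 guarantees alignment of at least beta).
x_key = weak_vec
M_prime = M + ((h_p - weak_val) / beta ** 2) * np.outer(x_key, x_key)

print(f"h_p = {h_p:.4f}")
print(f"lambda_min before conversations: {np.linalg.eigvalsh(M)[0]:.4f}")
print(f"lambda_min after  conversations: {np.linalg.eigvalsh(M_prime)[0]:.4f}")
```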
Now, consider the "good" event E^c for agent m at phase p. Lemma 2 confirms that the discrepancy for any arm a in A_m^p satisfies ⟨x_a − x_{a*_m}, θ̂_p⟩ ≤ 2^{−p+1}/√M. This, combined with Line 15 in Algorithm 1, supports the following lemma on arm preservation and the performance bound under the good event E^c.

Lemma 3 (Properties Under Good Event). Under event E^c, for any local agent m at phase p, two key properties are ensured:
1) The locally optimal arm a*_m remains within the active arm set A_m^p, ensuring it is never eliminated.
2) The performance gap of any arm a ∈ A_m^p, defined as ∆_{m,a} ≜ ⟨θ*, x_{a*_m} − x_a⟩, is bounded by 2^{−p+3}/√M.

Finally, with probability at least 1 − δ, the cumulative regret R_M(T) = Σ_{m=1}^{M} Σ_{t=1}^{T} ⟨θ*, x_{a*_m} − x_{a_{m,t}}⟩ is bounded by Σ_{m=1}^{M} Σ_{p=1}^{P} Σ_{a ∈ A_m^p} n_{m,a}^p · (2^{−p+3}/√M), where P denotes the total number of phases. Given that Σ_{a ∈ A_m^p} n_{m,a}^p ≤ 2^{2p+1} d log(2AM log T / δ) + |A_m^p|, we derive that R_M(T) ≤ O( d √M log(AM log T / δ) · 2^P ). Furthermore, T ≥ Σ_{p=1}^{P} Σ_{a ∈ A_m^p} n_{m,a}^p ≥ Σ_{p=1}^{P} 2^{2p+1} d log(2AM log T / δ), which simplifies to T ≥ 2d · 2^{2P} log(AM log T / δ). Thus, R_M(T) ≤ O( √( dMT log(AM log T / δ) ) ).

V. PERFORMANCE EVALUATION

In this section, we conduct extensive experiments to demonstrate the effectiveness of our algorithm.¹ The code is accessible at the following link: Code Repository.

¹Our experimental setup does not assume any prior knowledge of user preferences or reward distributions, thus requiring more trial rounds. Although practical scenarios often have pre-existing information that could reduce initial exploration, our study focuses on the performance of online learning algorithms without this offline information.

A. Experimental Settings

Embedding Models. We demonstrate our framework's generalization capabilities using two open embedding models: Google's text-embedding-preview-0409 and OpenAI's Text-embedding-3-large, which generate the embedding feature vector x_a ∈ R^d for the corresponding arm a (i.e., response) to capture the text information.
1) Text-embedding-preview-0409: Google's advanced embedding model, which streamlines synthetic training data creation by generating queries and task descriptions [30].
2) Text-embedding-3-large: OpenAI's new-generation embedding model, which surpasses its predecessor, though its technical details remain undisclosed [31].

Response Settings. We explore two response settings using the aforementioned embedding models, based on a real-world dataset and an open-source LLM.
1) Following the style classification by [32], we gather a comprehensive set of 13 keywords representing diverse styles such as "humorous" and "helpful", each representing a key term. These keyword styles generate 510 unique combinations, each forming an "arm", where each arm represents a potential style of LLM response. Users have varying priorities for different keyword combinations, and their preference vector θ has the highest cosine similarity with the feature vector x of their most favored keyword style (which is unknown to the algorithms in advance). To generate these feature vectors x for LLM responses and the user preference vectors θ on keywords, we utilize the two previously mentioned embedding models. We select the top d = 256 dimensions as the feature representation and normalize them into a more concise and efficient dimensional space. The reward is obtained from the cosine similarity between a specific user's preference vector and the feature vector of the selected arm, and the optimal LLM response is defined as the one with the largest reward, according to [33].
2) Prompt engineering is utilized to construct the initiatory set of responses offline. Following [34], we select a set of keyword styles (i.e., key terms) rich in personal identifiers to establish a diverse style collection, including terms like "helpful" and "creative use of emojis". Two keyword styles are jointly selected for each query, which forms a style-specific question to the LLM, ensuring focused and relevant responses. We utilize Llama-3-8B-Instruct [35] to generate the corresponding responses. Each prompt triggers a specific response from the LLM, with each user preference dictating a response styled according to their selected input. For example: User: "Tell me a joke." Arm: a variety of jokes under different styles. Key term: the different styles. By formulating responses to five different questions, each with two keyword styles, we construct a total arm set of |A| = 455 responses. This extensive collection allows for a comprehensive mapping of responses to specific user preferences, effectively forming a set of 455 user-preference pairs. Regarding the reward definition, the feature vector extraction, and the subsequent steps, we apply the same procedures described above.

Comparison Algorithms. The following online learning algorithms from existing studies are used as baselines, each executed individually on different local agents.
• TRIPLE-SH [8]: Selects optimal prompts for LLMs by adaptively eliminating arms with poor performance, where we directly set each arm as the corresponding LLM response.
• LinUCB [21]: Online arm selection and user preference estimation for infinite arm sets, without the conversational setting.
• Arm-Con [36]: Initiates conversations on user preferences about arms, and uses LinUCB for arm selection.
• ConUCB [17]: Queries key terms when conversations are allowed and utilizes the conversational feedback to accelerate learning.
• ConLinUCB [19]: A series of three algorithms: ConLinUCB-BS calculates the barycentric spanner for conducting conversations; ConLinUCB-MCR selects key terms with the largest confidence radius; ConLinUCB-UCB adopts a LinUCB-like method to choose key terms.

All results are averaged over five trials, conducted on a Linux Ubuntu machine (kernel 6.5.0) with a 5.40 GHz 13th Gen Intel(R) Core(TM) i7-13700KF CPU and 32 GB RAM. We set the coverage parameter β = 1 and the confidence parameter δ = 0.1, and conduct an ablation study to ensure robustness.
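To clarify the reward construction used in both response settings, here is a small sketch of our own, with a generic embedding stub standing in for the Google/OpenAI embedding APIs: it normalizes embedding vectors, computes the cosine-similarity reward between a user's preference vector and each arm, and marks the arm with the largest reward as the optimal LLM response.

```python
import numpy as np

def embed(texts, d=256, seed=0):
    """Stand-in for text-embedding-preview-0409 / Text-embedding-3-large:
    returns deterministic pseudo-embeddings truncated to the top d dimensions."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(len(texts), d))
    return X / np.linalg.norm(X, axis=1, keepdims=True)

responses = ["a humorous joke", "a formal explanation", "an emoji-rich reply"]
arm_feats = embed(responses)

user_pref = embed(["prefers humorous responses"], seed=1)[0]   # one user's theta

rewards = arm_feats @ user_pref                 # cosine similarity of unit vectors
best = int(np.argmax(rewards))
print(f"rewards = {np.round(rewards, 3)}, optimal LLM response = '{responses[best]}'")
```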
[Fig. 2 panels: cumulative regret (×10⁴) versus round (0 to 100,000) for (a) arm size 40 and (b) arm size 50 with text-embedding-preview-0409, and (c) arm size 40 and (d) arm size 50 with text-embedding-3-large; legend: MACO (ours), ConUCB, LinUCB, Arm-Con, ConLinUCB-UCB, ConLinUCB-MCR, ConLinUCB-BS, TRIPLE-SH.]
Fig. 2: Cumulative regret of Response Setting 1 on two embedding models from Google and OpenAI across different arm pool sizes A.
[Fig. 3 panels: cumulative regret (×10⁴) versus round (0 to 100,000) for (a) 8 and (b) 12 local agents with text-embedding-preview-0409, and (c) 8 and (d) 12 local agents with text-embedding-3-large; legend: MACO (ours), ConUCB, LinUCB, Arm-Con, ConLinUCB-UCB, ConLinUCB-MCR, ConLinUCB-BS, TRIPLE-SH.]
Fig. 3: Cumulative regret of Response Setting 2 on two embedding models from Google and OpenAI across different numbers of agents M .
B. Evaluation Results

Regret Across Different Arm Pool Sizes. We initially compare the cumulative regret of MACO against the seven baseline algorithms under Scenario Setting 1 with M = 4 local agents, employing the above two embedding models. We further explore the influence of varying arm pool sizes A, setting A = 40 and A = 50 under each embedding model respectively, and selecting the A arms at random from A for each local agent. Fig. 2 demonstrates that algorithms lacking a conversational mechanism (LinUCB and Arm-Con) exhibit the poorest performance. In contrast, our algorithm, MACO, significantly outperforms all competitors, achieving a minimum improvement of 8.29% over ConLinUCB-MCR, the best-performing baseline. This superior performance originates from the multi-agent framework employed by MACO, wherein the cloud server aggregates data from each local agent to more accurately estimate the unknown user preference. Notably, increasing the arm pool size A does not significantly increase the cumulative regret of MACO, confirming Theorem 1, which states that our algorithm's regret grows at a square-root logarithmic rate with respect to the arm pool size A.

We further evaluate different numbers of local agents, setting M = 8 and M = 12. We consider more agents here because, in practice, platforms often group users with similar labels to share learning, making M naturally larger; we therefore aim to explore our algorithm's performance with larger M for a comprehensive demonstration. Fig. 3 presents four subfigures that illustrate consistent trends: in the absence of a multi-agent framework, the cumulative regrets of all baseline algorithms increase linearly with the number of local agents, following an Õ(dM√T) pattern. Conversely, MACO capitalizes on the aggregated data from all local agents, managing to scale its regret according to Õ(√(dMT)). This scaling significantly dampens the increase in regret, demonstrating the effectiveness of our algorithm's multi-agent approach for online LLM response identification. A clearer depiction of this regret trend is shown in Fig. 4, where TRIPLE-SH is excluded due to its inferior performance, under Scenario Setting 1 with Google's model and T = 100000.
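To make the claimed scaling behaviors concrete, the following short calculation (illustrative constants only, not measured data) compares how an independent-agent Õ(dM√T) regret and MACO's aggregated Õ(√(dMT)) regret grow as the number of local agents M increases.

```python
import numpy as np

d, T = 256, 100_000
for M in (4, 8, 12):
    independent = d * M * np.sqrt(T)     # ~ running PE-Lin per agent, no sharing
    maco = np.sqrt(d * M * T)            # ~ MACO's aggregated-learning bound
    print(f"M={M:2d}:  O(dM*sqrt(T)) ~ {independent:,.0f}   O(sqrt(dMT)) ~ {maco:,.0f}")
```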
[Fig. 4: cumulative regret (×10⁵) trend across different numbers of local agents under Scenario Setting 1 with Google's embedding model.]

TABLE I: Execution time (s) (± standard deviation) on four settings.
Setting      | MACO (w/o G)   | MACO (w/G)      | ConLinUCB-BS
Setting (a)  | 2.576 ± 0.047  | 9.766 ± 2.709   | 18.124 ± 0.111
Setting (b)  | 2.546 ± 0.039  | 14.272 ± 7.107  | 18.056 ± 0.065
Setting (c)  | 2.576 ± 0.085  | 6.369 ± 2.832   | 17.926 ± 0.095
TABLE II: Average reward (± standard deviation) on four settings.
Setting      | MACO (w/o G)    | MACO (w/G)      | ConLinUCB-BS
Setting (a)  | 61.849 ± 0.558  | 61.847 ± 0.565  | 59.811 ± 0.610
Setting (b)  | 61.605 ± 0.642  | 61.591 ± 0.649  | 59.663 ± 0.671
Setting (c)  | 47.405 ± 0.977  | 47.381 ± 1.002  | 46.104 ± 0.962
Setting (d)  | 41.770 ± 0.349  | 41.858 ± 0.412  | 40.720 ± 0.349

[...] from multiple local agents to accelerate the learning process. Table I further illustrates that MACO w/o G exhibits the lowest deviation, since the information matrix M_m^p is no longer dependent on a continuously adjusted distribution policy (see Eq. (2)). Additionally, the results in Table II show that the average reward of MACO w/o G matches that of MACO w/G, demonstrating that our conversational approach maintains performance while replacing the traditional G-optimal design with a more practical, conversation-based design. This not only sustains robust performance, as supported by Theorem 1, but also enhances efficiency, representing an interesting finding.

Ablation Study. Table III reveals that the introduction of the coverage parameter β in our design has a minimal impact on the outcomes, in contrast with the significant influence exerted by the statistical confidence parameter δ, which is set by convention [23]. This observation underscores that our framework does not introduce new dependencies on parameters beyond those traditionally used in bandit algorithms.

TABLE III: Cumulative regret under T = 100000, A = 40, M = 4.
Parameter          | Setting (a) | Setting (b) | Setting (c) | Setting (d)
β = 1.0, δ = 0.1   | 20213.773   | 16277.413   | 15033.483   | 8261.335
β = 0.9, δ = 0.05  | 21439.795   | 17205.540   | 16039.654   | 8772.119
β = 0.8, δ = 0.05  | 21430.625   | 17215.402   | 16033.950   | 8770.108
β = 0.9, δ = 0.15  | 19495.106   | 15734.833   | 15092.586   | 7962.415
β = 0.8, δ = 0.15  | 19492.169   | 15738.395   | 15094.809   | 7961.321

[...] an online budget-limited LLM response optimization using various prompts. And [11] focuses on response identification over the coordination of multiple LLMs. Nevertheless, these studies ignore the impact of user preferences and the natural multi-agent setting in LLM response identification.

VII. CONCLUSION

This paper presents MACO, a multi-agent conversational online framework designed to identify optimal responses from LLMs while minimizing cumulative regret and aligning with user preferences. The framework consists of local agents (MACO-A) that adaptively manage conversations and response selection, and a cloud server (MACO-S) that aggregates data to learn user preferences efficiently. We have proved that MACO achieves near-optimal regret bounds, reduces conversations, and enhances computational efficiency. Our extensive evaluations, utilizing open LLMs like Llama and embedding models from Google and OpenAI, confirm that our approach significantly improves performance over traditional methods. Future work could explore clustering similar user preferences and extending beyond the linear reward model to further enhance the adaptability and effectiveness of the MACO framework.

APPENDIX

A. Proof of Lemma 1

Proof. Using the eigenvectors as an orthonormal basis, for any j ∈ [d], any key term's feature vector can be expressed as x̃_k = Σ_{i=1}^{d} c_i v_i = Σ_{i=1, i≠j}^{d} c_i v_i + c_j v_j, where x := Σ_{i=1, i≠j}^{d} c_i v_i is orthogonal to v_j. According to Line 7 of Algorithm 2 and Condition 1, we have x̃_k^⊤ v_j ≥ β for the selected key term k. Therefore, (Σ_{i=1}^{d} c_i v_i)^⊤ v_j = c_j ≥ β, and x̃_k x̃_k^⊤ = (c_j v_j + x)(c_j v_j + x)^⊤ = c_j² v_j v_j^⊤ + x x^⊤ + c_j (v_j x^⊤ + x v_j^⊤).
B. Proof of Lemma 2

Proof. [...] For any x ∈ R^d, with probability at least 1 − 2δ, we have ⟨θ̂_s − θ*, x⟩ ≤ √( 2 ∥x∥²_{G^{−1}} log(1/δ) ). Then, by the Courant-Fischer theorem, with probability at least 1 − δ/(AM log T), for any m ∈ M and all arms a ∈ A_m^p, we have ⟨θ̂_p − θ*, x_a⟩ ≤ √( 2 ∥x_a∥²_{G^{−1}} log(2AM log T / δ) ) ≤ √( (2/λ_min(G)) log(2AM log T / δ) ) ≤ 2^{−p}/√M. Finally, by the union bound, Pr[E] ≤ M · P · A · δ/(AM log T) ≤ δ is obtained with P ≤ log T (deduced from Section IV-B: T ≥ 2d · 2^{2P} log(AM log T / δ) ≥ 2^P).

C. Proof of Regret Lower Bound in Theorem 1

Proof. Define R^π_{M,θ}(T) as the expected cumulative regret of policy π with user preference θ over M local agents and time horizon T. Assume that for all local agents m, the arm vectors can span R^d, and that {x_a}_{a∈A_m} = {x_k}_{k∈K} = {e_1, e_2, …, e_d} ∪ {(A − d) arbitrary unit vectors}, where e_i is the i-th standard basis vector in R^d. Choose θ = (∆, 0, …, 0)^⊤ (with ∆ ∈ [0, 1/2] to be determined later). Let the random variables N_i(t) and Ñ_j(t) be the numbers of times the i-th arm and the j-th key term are selected by the end of round t. Define another user preference θ′ = (∆, 0, …, 2∆, …, 0)^⊤, where θ′_ℓ = 2∆ and ℓ = arg min_{j>1} max{ E_θ[N_j(MT)], E_θ[Ñ_j(MT)] }. Denote by N_{m,a}(t) the number of times the a-th arm is chosen by local agent m ∈ M by the end of round t. Given that the optimal arm for θ is arm 1, pulling any other arm increases the expected regret by ∆. Thus, by Lemma 4.5 in [23], R^π_{M,θ}(T) = Σ_{m=1}^{M} ∆ Σ_{a=2}^{A} E_θ[N_{m,a}(T)]. Using the inequalities E_θ[N_j(MT)] ≤ MT/(K − 1) and E_θ[Ñ_j(MT)] ≤ MT/(K − 1) and the Markov inequality, we get R^π_{M,θ}(T) ≥ ∆ · Pr_θ[ MT − Σ_{m=1}^{M} N_{m,1}(T) ≥ MT/2 ] · MT/2. For θ′, similarly, we have R^π_{M,θ′}(T) ≥ ∆ · Pr_{θ′}[ Σ_{m=1}^{M} N_{m,1}(T) > MT/2 ] · MT/2. Therefore, applying the Bretagnolle-Huber theorem (Theorem 14.2 in [23]), R^π_{M,θ}(T) + R^π_{M,θ′}(T) ≥ (∆MT/4) exp(−D(P_θ ∥ P_{θ′})). By the properties of the Kullback-Leibler (KL) divergence, with P ∼ N(µ_1, σ²) and Q ∼ N(µ_2, σ²) we have D(P ∥ Q) = (µ_1 − µ_2)²/(2σ²), so D(P_θ ∥ P_{θ′}) = E_θ[N_ℓ(MT) + Ñ_ℓ(MT)] · D(N(0, 1) ∥ N(2∆, 1)). Let ∆ = √((d − 1)/(MT)); then max{ R^π_{M,θ}(T), R^π_{M,θ′}(T) } ≥ ( R^π_{M,θ}(T) + R^π_{M,θ′}(T) ) / 2 ≥ (e^{−4}/8) √((d − 1)MT) = Ω(√(dMT)).

D. Proof of Theorem 2

Proof. At each phase p, each local agent m downloads the following: (a) the key term vector set, containing at most d feature vectors of dimension d; (b) the repetition counts n_{m,k}^p for each key term k ∈ K_m^p, totaling at most d integers; and (c) the estimated preference vector θ̂_p, a d-dimensional vector. On the other hand, the local agent uploads the following: (a) at most d eigenvalues and their corresponding eigenvectors; and (b) the matrices G_m^p and W_m^p, each of size d². Considering that the number of phases is at most log T, the upload and download costs are both O(d²M log T).

E. Proof of Theorem 3

Proof. Part 1) follows directly from Line 4 of Algorithm 1. For 2), in phase p, the number of arm pulls n_m^p of each local agent m is Σ_{a∈A_m^p} n_{m,a}^p = Σ_{a∈A_m^p} ⌈ 2^{2p+1} (d/|A_m^p|) log(2AM log T / δ) ⌉ ≥ 2^{2p+1} d log(2AM log T / δ). The number of key term pulls ñ_m^p of local agent m is given by Σ_{k∈K_m^p} n_{m,k}^p = Σ_{j: λ_j < h_p} (2d(h_p − λ_j)/(β² 2^{−2p})) log(2AM log T / δ) ≤ Σ_{j=1}^{d} (d(h_p − γ)/(β² 2^{−2p−1})) log(2AM log T / δ). Thus, the ratio between the number of key terms and arms for any m ∈ M is upper bounded by ñ_m^p / n_m^p ≤ (d h_p − dγ)/β² = (3/(4(1 − 2^{−2p})) − dγ)/β² ≤ (1 − dγ)/β².

REFERENCES

[1] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg et al., "Sparks of artificial general intelligence: Early experiments with gpt-4," arXiv preprint arXiv:2303.12712, 2023.
[2] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., "Llama 2: Open foundation and fine-tuned chat models," arXiv preprint arXiv:2307.09288, 2023.
[3] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, "Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing," ACM Computing Surveys, vol. 55, no. 9, pp. 1–35, 2023.
[4] Q. Guo, R. Wang, J. Guo, B. Li, K. Song, X. Tan, G. Liu, J. Bian, and Y. Yang, "Connecting large language models with evolutionary algorithms yields powerful prompt optimizers," arXiv preprint arXiv:2309.08532, 2023.
[5] R. Pan, S. Xing, S. Diao, X. Liu, K. Shum, J. Zhang, and T. Zhang, "Plum: Prompt learning using metaheuristic," arXiv preprint arXiv:2311.08364, 2023.
[6] R. Pryzant, D. Iter, J. Li, Y. T. Lee, C. Zhu, and M. Zeng, "Automatic prompt optimization with "gradient descent" and beam search," arXiv preprint arXiv:2305.03495, 2023.
[7] L. Chen, M. Zaharia, and J. Zou, "Frugalgpt: How to use large language models while reducing cost and improving performance," arXiv preprint arXiv:2305.05176, 2023.
[8] C. Shi, K. Yang, J. Yang, and C. Shen, "Best arm identification for prompt learning under a limited budget," arXiv preprint arXiv:2402.09723, 2024.
[9] K. Shuster, J. Xu, M. Komeili, D. Ju, E. M. Smith, S. Roller, M. Ung, M. Chen, K. Arora, J. Lane et al., "Blenderbot 3: a deployed conversational agent that continually learns to responsibly engage," arXiv preprint arXiv:2208.03188, 2022.
[10] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, "Direct preference optimization: Your language model is secretly a reward model," Advances in Neural Information Processing Systems, vol. 36, 2024.
[11] X. Dai, J. Li, X. Liu, A. Yu, and J. Lui, "Cost-effective online multi-llm selection with versatile reward models," arXiv preprint arXiv:2405.16587, 2024.
[12] V. Dwaracherla, S. M. Asghari, B. Hao, and B. Van Roy, "Efficient exploration for llms," arXiv preprint arXiv:2402.00396, 2024.
[13] Poe, https://poe.com/ChatGPT, 2024.03.
[14] C. Gao, W. Lei, X. He, M. de Rijke, and T.-S. Chua, "Advances and challenges in conversational recommender systems: A survey," AI Open, vol. 2, pp. 100–126, 2021.
[15] Y. Sun and Y. Zhang, "Conversational recommender system," in ACM SIGIR Conference, 2018, pp. 235–244.
[16] X. Dai, Z. Wang, J. Xie, X. Liu, and J. C. Lui, "Conversational recommendation with online learning and clustering on misspecified users," IEEE Transactions on Knowledge and Data Engineering, vol. 36, no. 12, pp. 7825–7838, 2024.
[17] X. Zhang, H. Xie, H. Li, and J. C. S. Lui, "Conversational contextual bandit: Algorithm and application," in Proceedings of The Web Conference, 2020, pp. 662–672.
[18] J. Wu, C. Zhao, T. Yu, J. Li, and S. Li, "Clustering of conversational bandits for user preference learning and elicitation," in Proceedings of the ACM CIKM, 2021, pp. 2129–2139.
[19] Z. Wang, X. Liu, S. Li, and J. C. S. Lui, “Efficient explorative key-term
selection strategies for conversational contextual bandits,” Proceedings
of the AAAI, pp. 10 288–10 295, 2023.
[20] X. Liu, H. Zhao, T. Yu, S. Li, and J. C. Lui, “Federated online clustering
of bandits,” in Proceedings of the UAI, 2022, pp. 1221–1231.
[21] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári, “Improved algorithms for
linear stochastic bandits,” in Proceedings of the NeurIPS, 2011.
[22] R. Huang, W. Wu, J. Yang, and C. Shen, “Federated linear contextual
bandits,” Advances in neural information processing systems, vol. 34, pp.
27 057–27 068, 2021.
[23] T. Lattimore and C. Szepesvári, Bandit Algorithms. Cambridge
University Press, 2020.
[24] Z. Li, M. Liu, and J. C. S. Lui, “Fedconpe: Efficient federated
conversational bandits with heterogeneous clients,” in Proceedings of
the Thirty-Third International Joint Conference on Artificial Intelligence,
IJCAI-24. International Joint Conferences on Artificial Intelligence
Organization, 8 2024, pp. 4533–4541.
[25] J. Lin and S. Moothedath, “Federated stochastic bandit learning with
unobserved context,” arXiv preprint arXiv:2303.17043, 2023.
[26] Y. Wang, J. Hu, X. Chen, and L. Wang, “Distributed bandit learning:
Near-optimal regret with efficient communication,” in Proceedings of
the ICLR, 2020.
[27] C. Zhao, T. Yu, Z. Xie, and S. Li, “Knowledge-aware conversational
preference elicitation with bandit feedback,” in Proceedings of the ACM
Web Conference 2022, 2022, p. 483–492.
[28] X. Dai, Z. Wang, J. Xie, T. Yu, and J. C. Lui, “Online learning and
detecting corrupted users for conversational recommendation systems,”
IEEE Transactions on Knowledge and Data Engineering, vol. 36, no. 12,
pp. 8939–8953, 2024.
[29] J. Kiefer and J. Wolfowitz, “The equivalence of two extremum problems,”
Canadian Journal of Mathematics, vol. 12, p. 363–366, 1960.
[30] J. Lee, Z. Dai, X. Ren, B. Chen, D. Cer, J. R. Cole, K. Hui, M. Boratko,
R. Kapadia, W. Ding, Y. Luan, S. M. K. Duddu, G. H. Abrego, W. Shi,
N. Gupta, A. Kusupati, P. Jain, S. R. Jonnalagadda, M.-W. Chang, and
I. Naim, “Gecko: Versatile text embeddings distilled from large language
models,” arXiv preprint arXiv:2403.20327, 2024.
[31] N. Muennighoff, N. Tazi, L. Magne, and N. Reimers, “Mteb: Massive
text embedding benchmark,” arXiv preprint arXiv:2210.07316, 2023.
[32] A. Köpf, Y. Kilcher, D. von Rütte, S. Anagnostidis, Z.-R. Tam, K. Stevens,
A. Barhoum, N. M. Duc, O. Stanley, R. Nagyfi, S. ES, S. Suri,
D. Glushkov, A. Dantuluri, A. Maguire, C. Schuhmann, H. Nguyen,
and A. Mattick, “Openassistant conversations – democratizing large
language model alignment,” arXiv preprint arXiv:2304.07327, 2023.
[33] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings
using Siamese BERT-networks,” in Proceedings of the EMNLP-IJCNLP,
2019, pp. 3982–3992.
[34] P. Sahoo, A. K. Singh, S. Saha, V. Jain, S. Mondal, and A. Chadha,
“A systematic survey of prompt engineering in large language models:
Techniques and applications,” arXiv preprint arXiv:2402.07927, 2024.
[35] Ollama, https://github.com/jmorganca/ollama/, 2024.06.
[36] K. Christakopoulou, F. Radlinski, and K. Hofmann, “Towards con-
versational recommender systems,” in Proceedings of ACM SIGKDD
International Conference, 2016, p. 815–824.
[37] Z. Zhang, S. Wang, W. Yu, Y. Xu, D. Iter, Q. Zeng, Y. Liu, C. Zhu, and
M. Jiang, “Auto-instruct: Automatic instruction generation and ranking
for black-box language models,” arXiv preprint arXiv:2310.13127, 2023.
[38] R. Bhardwaj, Z. Xia, G. Ananthanarayanan, J. Jiang, Y. Shu, N. Kari-
anakis, K. Hsieh, P. Bahl, and I. Stoica, “Ekya: Continuous learning of
video analytics models on edge compute servers,” in Proceedings of the
NSDI, 2022, pp. 119–135.
[39] Y. Xia, F. Kong, T. Yu, L. Guo, R. A. Rossi, S. Kim, and S. Li, “Which llm
to play? convergence-aware online model selection with time-increasing
bandits,” arXiv preprint arXiv:2403.07213, 2024.