MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL
Bing Wang1 , Changyu Ren1 , Jian Yang1 , Xinnian Liang1 , Jiaqi Bai1 , Linzheng Chai1
Zhao Yan2 , Qian-Wen Zhang2 , Di Yin2 , Xing Sun2 , Zhoujun Li1 †
1 Beihang University    2 Tencent Youtu Lab
{bingwang,cyren,jiaya,xnliang,bjq,challenging,lizj}@buaa.edu.cn
{zhaoyan,cowenzhang,endymecyyin,winfredsun}@tencent.com
Abstract

Recent LLM-based Text-to-SQL methods usually suffer from significant performance degradation on "huge" databases and complex user questions that require multi-step reasoning. Moreover, most existing methods neglect the crucial significance of LLMs utilizing external tools and model collaboration. To address these challenges, we introduce MAC-SQL, a novel LLM-based multi-agent collaborative framework. Our framework comprises a core Decomposer agent for Text-to-SQL generation with few-shot chain-of-thought reasoning, accompanied by two auxiliary agents that utilize external tools or models to acquire smaller sub-databases and refine erroneous SQL queries. The Decomposer agent collaborates with the auxiliary agents, which are activated as needed and can be expanded to accommodate new features or tools for effective Text-to-SQL parsing. In our framework, we initially leverage GPT-4 as the strong backbone LLM for all agent tasks to determine the upper bound of our framework. We then fine-tune an open-source instruction-following model, SQL-Llama, by leveraging Code Llama 7B, to accomplish all tasks as GPT-4 does. Experiments show that SQL-Llama achieves a comparable execution accuracy of 43.94, compared to the baseline accuracy of 46.35 for vanilla GPT-4. At the time of writing, MAC-SQL+GPT-4 achieves an execution accuracy of 59.59 when evaluated on the BIRD benchmark, establishing a new state-of-the-art (SOTA) on its holdout test set.¹

¹https://github.com/wbbeyourself/MAC-SQL

Tables:
  frpm: CDSCode, County Code, School Code, Charter School(Y/N)
  satscores: cds, sname, AvgScrMath, NumTstTakr, NumGE1500, …
  schools: CDSCode, NCESDist, County, City, Zip, …
Evidence:
  SAT_Excellence_Rate = CAST(NumGE1500 AS REAL) / NumTstTakr
Gold SQL:
  SELECT ST.sname FROM frpm FR JOIN satscores ST
  ON FR.CDSCode = ST.cds WHERE FR.`Charter School (Y/N)` = 1
  AND SAT_Excellence_Rate > (
    SELECT AVG(SAT_Excellence_Rate) FROM frpm fr JOIN satscores st
    ON fr.CDSCode = st.cds WHERE fr.`Charter School (Y/N)` = 1 )

Figure 1: A complex example of Text-to-SQL. In the Gold SQL, we use SAT_Excellence_Rate to represent "CAST(NumGE1500 AS REAL)/NumTstTakr" for the sake of brevity.

1 Introduction

Text-to-SQL aims to automate the process of generating Structured Query Language (SQL) queries for databases from natural language text. This long-standing challenge is essential for improving database accessibility without requiring the expertise of SQL (Qin et al., 2022; Sun et al., 2023). Over the past decade, research in this field has progressed through three stages. In the initial phase, systems encoded the input sequence using pre-trained models, and SQL queries were decoded using either abstract syntax trees (Xu et al., 2017; Guo et al., 2019; Wang et al., 2021) or predefined sketches (He et al., 2019). More recent systems (Raffel et al., 2023; Xie et al., 2022; Scholak et al., 2021) have adopted sequence-to-sequence methodologies. The latest research (Ouyang et al., 2022; OpenAI, 2023; Rozière et al., 2023) has demonstrated the remarkable capabilities of Large Language Models (LLMs) in this task. The success of these models can be ascribed to their emergent abilities (Wei et al., 2023; Brown et al., 2020) and the robust reasoning capabilities inherent in LLMs.

Recent research on LLM-based Text-to-SQL (Dong et al., 2023; Pourreza and Rafiei, 2023; Gao et al., 2023) has mainly concentrated on in-context learning prompt strategies and supervised fine-tuning using data derived from the target domain.
[Figure 2 body omitted: the Selector takes the user question "List school names of charter schools with an SAT excellence rate over the average." together with the schemas of the tables schools, satscores, and frpm and keeps only the relevant sub-database; the Decomposer splits the question into Sub Q1 ("Get the average value of SAT excellence rate of charter schools.") and Sub Q2, producing SQL 1 and SQL 2 step by step; the Refiner executes the SQL with SQLite and refines it into the final SQL.]

Figure 2: The overview of our MAC-SQL framework, which comprises three agents: (i) the Selector, which decomposes a large database into a smaller sub-database to mitigate the interference of irrelevant information; (ii) the Decomposer, which breaks down a complex question into simpler sub-questions and resolves them progressively by chain-of-thought reasoning; and (iii) the Refiner, which uses an external tool for SQL execution, obtains feedback, and then refines faulty SQL queries.
However, these approaches usually suffer from significant performance degradation on "huge" databases and complex user questions that require multi-step reasoning, as demonstrated in Figure 1. Moreover, most existing methods neglect the crucial significance of LLMs utilizing external tools and model collaboration.

To alleviate the above challenges, we introduce MAC-SQL, a novel LLM-based multi-agent collaborative framework, which exploits LLMs as intelligent agents with different functionalities for effective Text-to-SQL parsing. Our framework comprises a core Decomposer agent for Text-to-SQL generation, accompanied by two auxiliary agents, the Selector and the Refiner, for tool usage and SQL refinement. Specifically, the Decomposer breaks down a complex question into simpler sub-questions and resolves them progressively by chain-of-thought reasoning. When necessary, the Selector decomposes a large database into a smaller sub-database to minimize the interference of irrelevant information, while the Refiner employs an external tool for SQL execution, obtains feedback, and refines erroneous SQL queries.

Furthermore, we have fine-tuned an instruction-following model, SQL-Llama, by leveraging Code Llama 7B, using agent instruction data from MAC-SQL, thus enabling capabilities in database simplification, question decomposition, SQL generation, and SQL correction.

In our experiments, we initially leverage GPT-4 as a strong backbone LLM for all agent tasks to determine the upper bound of our MAC-SQL framework on the widely used BIRD and Spider datasets. Experimental results demonstrate that MAC-SQL+GPT-4 achieves an execution accuracy of 59.59 on the holdout test set of BIRD, establishing a new state-of-the-art (SOTA) at the time of writing. Furthermore, we utilize SQL-Llama(7B) to accomplish all tasks as GPT-4 does. Surprisingly, despite SQL-Llama having an order of magnitude fewer parameters than GPT-4, its execution accuracy reaches 43.94, which is remarkably close to the accuracy of GPT-4 (46.35).

Contributions  Our main contributions and results are summarized as follows:

1. We propose MAC-SQL, a novel multi-agent collaborative framework for Text-to-SQL, which integrates external tools and facilitates model collaboration to address intricate scenarios.

2. We introduce an instruction-tuned model, named SQL-Llama, to fill the gap in open-source agent-instruction-following models for the task of Text-to-SQL.

3. Experimental results demonstrate that MAC-SQL achieves a state-of-the-art execution accuracy of 59.59% on the BIRD test set at the time of writing.
2 Preliminaries

2.1 Problem Definition of Text-to-SQL

Given a triple X = (Q, S, K), where Q, S, and K are the natural language question, the database schema, and external knowledge (optional), the database schema S is defined as {T, C}, where T represents multiple tables {T_1, T_2, ..., T_|T|} and C represents columns {C_1, C_2, ..., C_|C|}. The purpose of the Text-to-SQL task is to generate the correct SQL Y corresponding to the question Q.

2.2 Large Language Model for Text-to-SQL

The task of Text-to-SQL has recently been formulated as a generation task (Dong et al., 2023; Pourreza and Rafiei, 2023), designing appropriate prompts to guide a large language model M to generate SQL queries token by token. The generation process can be formulated as follows:

    P(Y | X; M) = ∏_{i=1}^{|Y|} P(y_i | y_{<i}, Q, S, K; M)    (1)

3 MAC-SQL Framework

3.1 Overview

In Figure 2, we introduce MAC-SQL, a novel LLM-based multi-agent collaborative framework, which exploits LLMs as intelligent agents with different functionalities for effective Text-to-SQL parsing. MAC-SQL comprises a core Decomposer agent for Text-to-SQL generation, accompanied by two auxiliary agents, the Selector and the Refiner, for tool usage and SQL refinement. Algorithm 1 presents the collaboration process of the three agents in MAC-SQL. In the following sections, we present a detailed introduction of the three agents.

Algorithm 1 The algorithm of MAC-SQL
Input: question q, database db, knowledge kg
Output: sql
 1: if db needs simplification then
 2:     db = LLM_Selector(q, db, kg)
 3: end if
 4: dbDesc = getDbRepresentation(db, kg)
 5: subQs, subSQLs = LLM_Decomposer(q, dbDesc)
 6: sql = subSQLs[-1]
 7: count = 0
 8: while count < maxTryTimes do
 9:     ok, err = executeAndAnalyze(sql, db)
10:     if ok then
11:         return sql
12:     else
13:         sql = LLM_Refiner(q, dbDesc, sql, err)
14:     end if
15:     count = count + 1
16: end while
17: return sql

3.2 Selector

Given an input triple X = (Q, S, K), where the database schema is S = {T, C}, the Selector agent aims to locate the minimal schema S′ = {T′, C′}, where T′ ⊆ T and C′ ⊆ C, that suffices to answer the question Q with knowledge K. The function of the Selector agent can be described as:

    S′ = f_selector(Q, S, K | M)    (2)

where f_selector(·|M) denotes the function of the Selector implemented by prompting the LLM M. The motivation behind designing the Selector primarily involves two key factors. Firstly, introducing too many irrelevant schema items in the prompt increases the likelihood of the LLM generating irrelevant schema items in the output SQL. Secondly, using the complete database schema results in excessive text length, leading to unnecessary API costs, and may exceed the maximum context length of the LLM. It is important to note that the Selector is only activated when the length of the database schema prompt exceeds a length threshold; otherwise, the original database schema S is used for the subsequent process. The complete prompt of the Selector agent is shown in Figure 6.
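To make the collaboration in Algorithm 1 concrete, the following is a minimal Python sketch of the loop. The selector, decomposer, and refiner callables stand in for prompting an LLM with the corresponding agent prompts, and the schema-length threshold value is illustrative only; this is not the released implementation.

```python
# Minimal sketch of the MAC-SQL collaboration loop (Algorithm 1), not the
# released implementation. The selector/decomposer/refiner callables stand in
# for prompting an LLM with the corresponding agent prompts.
import sqlite3
from typing import Callable, List, Tuple

MAX_TRY_TIMES = 3
SCHEMA_LENGTH_THRESHOLD = 25_000  # characters; illustrative value only


def mac_sql(question: str,
            db_path: str,
            schema_desc: str,
            knowledge: str,
            selector: Callable[[str, str, str], str],
            decomposer: Callable[[str, str, str], Tuple[List[str], List[str]]],
            refiner: Callable[[str, str, str, str], str]) -> str:
    # Selector: prune the schema only when its textual description is too long.
    if len(schema_desc) > SCHEMA_LENGTH_THRESHOLD:
        schema_desc = selector(question, schema_desc, knowledge)

    # Decomposer: chain-of-thought decomposition; the last sub-SQL answers the question.
    _sub_questions, sub_sqls = decomposer(question, schema_desc, knowledge)
    sql = sub_sqls[-1]

    # Refiner: execute, collect error feedback, and retry a bounded number of times.
    for _ in range(MAX_TRY_TIMES):
        conn = sqlite3.connect(db_path)
        try:
            conn.execute(sql).fetchall()
            return sql  # executable SQL: accept it
        except sqlite3.Error as err:
            sql = refiner(question, schema_desc, sql, str(err))
        finally:
            conn.close()
    return sql
```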
3.3 Decomposer
4.2 Multi-task Supervised Fine-tuning

Our research has been primarily focused on the development of open-source models within the MAC-SQL framework, to achieve performance levels comparable to closed-source models like GPT-4. To achieve this, we have put significant effort into preparing the data for model training and have open-sourced SQL-Llama, a model fine-tuned on the instruction data of the three intelligent agents. The SQL-Llama model, based on Code Llama 7B, has undergone supervised fine-tuning using agent instruction data from MAC-SQL, which has enhanced its capabilities in database simplification, question decomposition, SQL generation, and SQL correction.

Given the Agent-Instruct dataset with N (N=3) instruction tasks, D = {D_i}_{i=1}^{N}, the LLM trained on D can learn from these tasks and complete the agent tasks. The supervised fine-tuning process can be described as:

    L = − Σ_{i=1}^{N} E_{(Q, S^i, K, Y^i) ∼ D} [ log P(Y^i | Q, S^i, K; M) ]    (5)

where L is the training objective over the N tasks, and S^i and Y^i are the selected database schema and intermediate SQL query of the i-th task.

One of the key challenges we encountered during the model training process was balancing model complexity with performance. We had to carefully optimize the model architecture and parameters to ensure that it could effectively handle the complexities of database-related tasks while still maintaining high performance. Additionally, ensuring the quality and relevance of the instruction dataset for training was crucial, as it directly impacted the model's performance.
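To make the objective concrete, the following is a minimal sketch of how Eq. (5) could be computed for one mixed batch drawn from the three agent-instruction datasets. The Hugging Face-style model and tokenizer interface and the batch field names ('prompt' holding the rendered (Q, S^i, K) and 'target' holding Y^i) are assumptions, not the released training code.

```python
# A minimal sketch of the multi-task SFT objective in Eq. (5), assuming a
# Hugging Face-style causal LM and tokenizer; the batch layout is an assumption.
import torch
import torch.nn.functional as F


def sft_loss(model, tokenizer, batch):
    """Average negative log-likelihood of the targets, with prompt tokens masked out."""
    losses = []
    for example in batch:
        prompt_ids = tokenizer(example["prompt"], return_tensors="pt").input_ids
        target_ids = tokenizer(example["target"], return_tensors="pt").input_ids
        input_ids = torch.cat([prompt_ids, target_ids], dim=1)

        logits = model(input_ids=input_ids).logits        # (1, L, vocab)
        shift_logits = logits[:, :-1, :]                  # token t predicts token t+1
        shift_labels = input_ids[:, 1:].clone()
        shift_labels[:, : prompt_ids.size(1) - 1] = -100  # only Y^i tokens contribute

        losses.append(F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_labels.reshape(-1),
            ignore_index=-100,
        ))
    return torch.stack(losses).mean()
```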
Method                    | Dev EX | Dev VES | Test EX | Test VES
Palm-2                    | 27.38  | -       | 33.04   | -
ChatGPT + CoT             | 36.64  | 42.30   | 40.08   | 56.56
Claude-2                  | 42.70  | -       | 49.02   | -
GPT-4                     | 46.35  | 49.77   | 54.89   | 60.77
DIN-SQL + GPT-4           | 50.72  | 58.79   | 55.90   | 59.44
DAIL-SQL + GPT-4          | 54.76  | 56.08   | 57.41   | 61.95
SQL-Llama(7B)             | 32.87  | 55.67   | -       | -
MAC-SQL + SQL-Llama(7B)   | 43.94  | 57.36   | -       | -
  + Oracle Schema         | 51.43  | 58.24   | -       | -
MAC-SQL + GPT-3.5-Turbo   | 50.56  | 61.25   | -       | -
  + Oracle Schema         | 65.78  | 60.62   | -       | -
MAC-SQL + GPT-4           | 59.39  | 66.39   | 59.59   | 67.68
  + Oracle Schema         | 70.28  | 62.63   | -       | -

Table 1: Execution accuracy (EX) and valid efficiency score (VES) on both the dev and test sets of the BIRD dataset. The term "Oracle Schema" refers to the utilization of a ground-truth sub-database as the input for the Decomposer, rather than employing the results obtained from the Selector.

Method                    | EX (Dev) | EX (Test)
C3 + ChatGPT              | 81.80    | 82.30
DIN-SQL + GPT-4           | 82.80    | 85.30
DAIL-SQL + GPT-4          | 84.40    | 86.60
SQL-Llama(7B)             | 65.48    | 61.63
MAC-SQL + SQL-Llama(7B)   | 76.25    | 70.58
MAC-SQL + GPT-3.5-Turbo   | 80.56    | 75.53
MAC-SQL + GPT-4           | 86.75    | 82.80

Table 2: Execution accuracy (EX) on both the dev and test sets of Spider.

Datasets  The Spider (Yu et al., 2018) dataset is frequently employed for assessing the performance of text-to-SQL parsing across multiple databases, necessitating models to demonstrate adaptability to unfamiliar database structures. The dataset comprises 7,000 question-query pairs in the training set and 1,034 pairs in the development set, encompassing 200 distinct databases and 138 domains. In this study, we assess the efficacy of our framework on the Spider development set, as the test set is not accessible.

The BIRD (Li et al., 2023) dataset released by Alibaba DAMO Academy is a new benchmark for large-scale real databases, containing 95 large-scale databases and high-quality Text-to-SQL pairs, with a data storage volume of up to 33.4 GB spanning 37 professional domains. Unlike Spider, BIRD focuses on massive and real database content, external knowledge reasoning between natural language questions and database content, and new challenges in SQL efficiency when dealing with large databases.

Evaluation Metrics  Following BIRD (Li et al., 2023) and Test-suite (Zhong et al., 2020), we consider three metrics, exact match accuracy (EM), execution accuracy (EX), and valid efficiency score (VES), to evaluate text-to-SQL models confronted with real-world scenarios with large database contents. Exact Match Accuracy (EM) treats each clause as a set and compares the prediction for each clause to its corresponding clause in the reference query. A predicted SQL query is considered correct only if all of its components match the ground truth; this metric does not take values into account. Execution Accuracy (EX) is defined as the proportion of questions in the evaluation set for which the execution results of both the predicted and ground-truth queries are identical, relative to the overall number of queries. Valid Efficiency Score (VES) is designed to measure the efficiency of valid SQLs generated by models, where "valid SQLs" refers to predicted SQL queries whose result sets align with those of the ground-truth SQLs.
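As an illustration of the EX metric defined above, the sketch below executes the predicted and ground-truth SQL against the same SQLite database and compares their result sets. Treating results as order-insensitive multisets and the example field layout are simplifying assumptions; the official evaluation scripts should be used for reported numbers.

```python
# Sketch of Execution Accuracy (EX): a prediction counts as correct when its
# execution result matches the gold query's result on the same database.
import sqlite3
from typing import List, Tuple


def execute(db_path: str, sql: str) -> List[Tuple]:
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


def execution_accuracy(examples: List[dict]) -> float:
    """examples: dicts with 'db_path', 'pred_sql', and 'gold_sql' keys (assumed layout)."""
    correct = 0
    for ex in examples:
        try:
            pred = execute(ex["db_path"], ex["pred_sql"])
            gold = execute(ex["db_path"], ex["gold_sql"])
            if sorted(map(repr, pred)) == sorted(map(repr, gold)):
                correct += 1
        except sqlite3.Error:
            pass  # a prediction that fails to execute counts as wrong
    return correct / len(examples)
```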
Baselines  We conduct experiments on both the BIRD and Spider datasets and compare our method with the following baselines:

• GPT-4 (OpenAI, 2023) uses a simple zero-shot text-to-SQL prompt for SQL generation.

• DIN-SQL (Pourreza and Rafiei, 2023) decomposes the text-to-SQL task into smaller subtasks and designs different prompts for each subtask to instruct GPT-4 to complete each subtask and obtain the final SQL.

• DAIL-SQL (Gao et al., 2023) encodes structure knowledge as SQL statements, selects few-shot demonstrations based on their skeleton similarities, and removes cross-domain knowledge from examples for token efficiency.

• C3-SQL (Dong et al., 2023) first performs schema linking filtering and then directs GPT-4 with a calibration bias prompt designed for Spider using a self-consistency strategy.

5.2 Overall Performance

It is important to note that the experiments utilized the 32k version of GPT-4 and the 16k version of GPT-3.5-Turbo.

BIRD Results  In Table 1, we report the performance of our method and baseline methods on the BIRD dataset. It is evident that our method surpasses all LLM-based methods in terms of execution accuracy (EX) and valid efficiency score (VES) on both the development and test sets. Specifically, our method outperforms the second-best method by 4.63% on the development set and by 2.18% on the test set. At the time of writing, MAC-SQL+GPT-4 achieves an execution accuracy of 59.59 when evaluated on the BIRD benchmark, establishing a new state-of-the-art (SOTA) on its holdout test set.

Spider Results  Currently, Spider has open-sourced its test set, so we can evaluate our method on both the development and the test set. As shown in Table 2, on the dev set of Spider (Yu et al., 2018), our method achieves the highest execution accuracy using GPT-4. These results demonstrate the generalization ability of our MAC-SQL framework.
Method            | Simple | Mod.  | Chall. | All
MAC-SQL + GPT-4   | 65.73  | 52.69 | 40.28  | 59.39
w/o Selector      | 65.73  | 52.04 | 35.14  | 57.28 (↓)
w/o Decomposer    | 61.51  | 48.82 | 38.89  | 55.54 (↓)
w/o Refiner       | 63.24  | 44.52 | 33.33  | 54.76 (↓)

Table 3: Execution accuracy of the MAC-SQL ablation study on the BIRD dev set. For brevity, "Mod." stands for "Moderate" and "Chall." denotes "Challenging".

Few-shot | BIRD EX | BIRD VES | Spider EM | Spider EX
0-shot   | 55.54   | 63.31    | 58.42     | 74.22
1-shot   | 57.26   | 64.32    | 59.68     | 78.35
2-shot   | 59.39   | 66.24    | 63.20     | 86.75

Table 4: Results of MAC-SQL+GPT-4 on the dev sets of BIRD and Spider with few-shot evaluation.

5.3 Ablation Study

Table 3 presents the results of an ablation study of the MAC-SQL model on the BIRD dev set. The table lists different variations of the MAC-SQL model, including with and without certain components such as the Selector, Decomposer, and Refiner. The other columns represent the accuracy of the model on different levels of difficulty: Simple, Moderate, and Challenging, as well as the overall accuracy (All). The findings show that the original MAC-SQL + GPT-4 model achieves an accuracy of 65.73% on Simple, 52.69% on Moderate, and 40.28% on Challenging, with an overall accuracy of 59.39%. When removing the Selector component, the accuracy remains the same for Simple but decreases to 52.04% for Moderate and 35.14% for Challenging, resulting in an overall accuracy of 57.28% (a decrease of 2.11%). Similarly, removing the Decomposer and Refiner components also leads to decreases in accuracy across all difficulty levels.

Overall, the ablation study indicates that each component of the MAC-SQL model (Selector, Decomposer, and Refiner) plays a crucial role in achieving high accuracy, as their removal results in decreased performance across all difficulty levels.

5.4 Discussion

Impact of the number of demonstrations  Table 4 shows evaluation results of MAC-SQL with different numbers of demonstrations on the BIRD and Spider datasets. As the number of shots increases from 0 to 2, there is a consistent improvement in the performance metrics (EX, VES, and EM) for both BIRD and Spider. This indicates that the model benefits from additional demonstration examples and is able to generalize better with more data. The highest performance is achieved with 2-shot evaluation, indicating that the model is capable of learning effectively from a small number of examples. The high cost of the GPT-4 interface results in a significant consumption of tokens during a full test of the dev sets of Spider and BIRD, estimated at approximately 6 million and 10 million tokens, respectively. Due to these cost constraints, our analysis is limited to a maximum of 2 shots, and further experiments involving more shots (e.g., k > 2) will have to await a more budget-friendly implementation of GPT-4.

5.5 Error Analysis

In order to thoroughly assess the limitations of our method, we begin by choosing two datasets (BIRD and Spider) that contain various types of structured data, as shown in Figure 5.

Figure 5 displays the error type distributions on the BIRD and Spider datasets. "Gold Error" is the most common error type, accounting for 30% and 22% in BIRD and Spider, respectively, signifying the significance of gold standard annotations. "Semantic Correct" is another prevalent error type, representing 14% and 22% in BIRD and Spider, respectively, indicating the importance of semantic understanding and correctness. However, "Schema Linking Error" is less frequent in BIRD (2%) than in Spider (8%), demonstrating differences in schema linking behavior between the datasets. This analysis underscores the need for addressing gold standard annotations, semantic correctness, and schema linking in dataset development and evaluation, thereby improving their quality and reliability. Appendix B contains detailed examples of error types.
Figure 5: Error Distributions of MAC-SQL on dev set of BIRD and Spider.
6 Related Work

LLMs for Text-to-SQL  Recent advancements in text-to-SQL tasks using large language models (LLMs) have focused on improving prompt design and developing multi-stage refined frameworks. In the early stages of the emergence of large language models, research efforts were primarily focused on designing high-quality prompts to better exploit the potential of LLMs for SQL generation. For example, Tai et al. (2023) systematically studied how to enhance LLMs' reasoning ability through chain-of-thought style prompting, including the original chain-of-thought prompting and least-to-most prompting. Similarly, Chang and Fosler-Lussier (2023) comprehensively investigated the impact of prompt constructions across various settings when constructing the prompt text for text-to-SQL inputs. Additionally, DAIL-SQL (Gao et al., 2023) systematically examined prompt engineering for LLM-based Text-to-SQL methods, including question representations, prompt components, example selections, and example organizations. Later studies, like C3-SQL (Dong et al., 2023), DIN-SQL (Pourreza and Rafiei, 2023), and StructGPT (Jiang et al., 2023), proposed frameworks for simplifying databases, generating SQL, verifying queries, and integrating answers through zero-shot approaches, query decomposition, and specialized interfaces for structured data access. However, the aforementioned methods have several issues. Firstly, the experiments were conducted solely on the Spider family of datasets, failing to demonstrate their generalization to more complex datasets like BIRD, hence limiting their real-world applicability. Secondly, certain methods depend on difficulty-level classifiers and customized biases specific to the Spider dataset for error correction, thus lacking the ability to generalize to a broader spectrum of error types. Thirdly, these methods neglect the utilization of external tools and the collaboration of different modules. Thus, we propose a framework centered on multi-agent collaboration that can be utilized for more intricate data scenarios and a broader spectrum of error types for detection and correction.

LLM-based Agents  LLM-based agents have been a prominent area of study in both the academic and industry communities for an extended period (Wang et al., 2023). Recently, through the acquisition of vast amounts of web knowledge, LLMs have demonstrated remarkable potential in achieving human-level intelligence. This development has led to a surge in research exploring autonomous agents based on LLMs. AutoGPT (Team, 2023) is an open-source implementation of an AI agent that follows a single-agent paradigm, in which it augments the AI model with many useful tools and does not support multi-agent collaboration. Similarly, OpenAgents (Xie et al., 2023) develops three distinct agents, the Data Agent for data analysis, the Plugins Agent for plugin integration, and the Web Agent for autonomous web browsing, each specializing in different domains, similar to OpenAI's ChatGPT Plugins. Additionally, AutoGen (Wu et al., 2023) is an open-source framework that enables developers to build customizable, conversable agents that can operate in various modes, employing combinations of LLMs, human inputs, and tools to accomplish tasks. However, how to apply LLM-based agents to Text-to-SQL parsing remains under-explored.

We fill this gap by proposing a multi-agent collaborative Text-to-SQL framework, which integrates multiple LLM-based agents to collectively interpret SQL queries and address the complexity and diversity of SQL queries encountered in real-world scenarios.

7 Conclusion

In summary, this paper proposes the MAC-SQL framework, which utilizes multi-agent collaboration to address challenges in Text-to-SQL tasks. The framework, along with the open-sourced SQL-Llama model, achieved an execution accuracy of 59.59 when evaluated on the BIRD benchmark, establishing a new state-of-the-art (SOTA) on its holdout test set. This work presents a novel approach to Text-to-SQL and provides practical guidance for achieving high performance in this domain.
Limitations

There are two limitations of our work. Firstly, we did not extensively engineer the prompts, which may not be optimal. Secondly, this paper reports the fine-tuning results of the 7B Code Llama model. Although it performs at a comparable level, we believe its performance can be further improved by using larger models.

Ethics Statement

The datasets and models utilized in this paper, the implementation of the code, and the resulting models are not associated with any ethical concerns.

References

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.

Shuaichen Chang and Eric Fosler-Lussier. 2023. How to prompt LLMs for text-to-SQL: A study in zero-shot, single-domain, and cross-domain settings.

Xuemei Dong, Chao Zhang, Yuhang Ge, Yuren Mao, Yunjun Gao, Lu Chen, Jinshu Lin, and Dongfang Lou. 2023. C3: Zero-shot text-to-SQL with ChatGPT.

Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. 2023. Text-to-SQL empowered by large language models: A benchmark evaluation.

Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards complex text-to-SQL in cross-domain database with intermediate representation.

Pengcheng He, Yi Mao, Kaushik Chakrabarti, and Weizhu Chen. 2019. X-SQL: Reinforce schema representation with context.

Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, and Ji-Rong Wen. 2023. StructGPT: A general framework for large language model to reason over structured data.

Jinyang Li, Binyuan Hui, Ge Qu, Binhua Li, Jiaxi Yang, Bowen Li, Bailin Wang, Bowen Qin, Rongyu Cao, Ruiying Geng, Nan Huo, Xuanhe Zhou, Chenhao Ma, Guoliang Li, Kevin C. C. Chang, Fei Huang, Reynold Cheng, and Yongbin Li. 2023. Can LLM already serve as a database interface? A big bench for large-scale database grounded text-to-SQLs.

OpenAI. 2023. GPT-4 technical report. ArXiv.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback.

Mohammadreza Pourreza and Davood Rafiei. 2023. DIN-SQL: Decomposed in-context learning of text-to-SQL with self-correction.

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2024. ChatDev: Communicative agents for software development.

Bowen Qin, Binyuan Hui, Lihan Wang, Min Yang, Jinyang Li, Binhua Li, Ruiying Geng, Rongyu Cao, Jian Sun, Luo Si, et al. 2022. A survey on text-to-SQL parsing: Concepts, methods, and future directions. arXiv preprint arXiv:2208.13629.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2023. Exploring the limits of transfer learning with a unified text-to-text transformer.

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. 2023. Code Llama: Open foundation models for code.

Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. 2021. PICARD: Parsing incrementally for constrained auto-regressive decoding from language models.

Ruoxi Sun, Sercan O. Arik, Hootan Nakhost, Hanjun Dai, Rajarishi Sinha, Pengcheng Yin, and Tomas Pfister. 2023. SQL-PaLM: Improved large language model adaptation for text-to-SQL.

Chang-You Tai, Ziru Chen, Tianshu Zhang, Xiang Deng, and Huan Sun. 2023. Exploring chain-of-thought style prompting for text-to-SQL.

AutoGPT Team. 2023. AutoGPT: Build and use AI agents.

Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2021. RAT-SQL: Relation-aware schema encoding and linking for text-to-SQL parsers.
Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2023. A survey on large language model based autonomous agents. arXiv preprint arXiv:2308.11432.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-thought prompting elicits reasoning in large language models.

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. 2023. AutoGen: Enabling next-gen LLM applications via multi-agent conversation.

Tianbao Xie, Chen Henry Wu, Peng Shi, Ruiqi Zhong, Torsten Scholak, Michihiro Yasunaga, Chien-Sheng Wu, Ming Zhong, Pengcheng Yin, Sida I. Wang, Victor Zhong, Bailin Wang, Chengzu Li, Connor Boyle, Ansong Ni, Ziyu Yao, Dragomir Radev, Caiming Xiong, Lingpeng Kong, Rui Zhang, Noah A. Smith, Luke Zettlemoyer, and Tao Yu. 2022. UnifiedSKG: Unifying and multi-tasking structured knowledge grounding with text-to-text language models.

Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, Leo Z. Liu, Yiheng Xu, Hongjin Su, Dongchan Shin, Caiming Xiong, and Tao Yu. 2023. OpenAgents: An open platform for language agents in the wild.

Xiaojun Xu, Chang Liu, and Dawn Song. 2017. SQLNet: Generating structured queries from natural language without reinforcement learning.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proc. of EMNLP.

Ruiqi Zhong, Tao Yu, and Dan Klein. 2020. Semantic evaluation for text-to-SQL with distilled test suites. In Proc. of EMNLP.

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. 2022. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625.
Task Description
As an experienced and professional database administrator , your task is to ...
Instruction
1. Discard any table schema that is not related to the user question and evidence .
2. Sort the columns in each relevant table in descending order of relevance and keep the top 6
columns .
3. Ensure that at least 3 tables are included in the final output JSON .
4. The output should be in JSON format .
Demonstration
[ DB_ID ] banking_system
[ Schema ] Table schemas of account , client , loan , district ...
[ Foreign keys ] ...
[ Question ]: What is the gender of the youngest client who opened account in the lowest average
salary branch ?
[ Evidence ]: Later birthdate refers to younger age ; A11 refers to average salary
[ Answer ]
``` json
{ " account " : " keep_all " ,
" client " : " keep_all " ,
" loan " : " drop_all " ,
" district " : [ " district_id " , " A11 " , " A2 " , ...] }
```
Test Question
[ DB_ID ] { db_id }
[ Schema ] { desc_str }
[ Foreign keys ] { fk_str }
[ Question ] { query }
[ Evidence ] { evidence }
[ Answer ]
Figure 6: An example of Selector prompt. The specific details are omitted for the sake of brevity.
A Prompt Details
A.1 Selector Prompt
As an experienced and professional database administrator, your task is to analyze a user question
and a database schema to provide relevant information. The database schema consists of table
descriptions, each containing multiple column descriptions. Your goal is to identify the relevant
tables and columns based on the user question and evidence provided.
[Instruction]
1. Discard any table schema that is not related to the user question and evidence.
2. Sort the columns in each relevant table in descending order of relevance and keep the top 6
columns.
3. Ensure that at least 3 tables are included in the final output JSON.
4. The output should be in JSON format.
[Requirements]
1. If a table has less than or equal to 10 columns, mark it as "keep_all".
2. If a table is completely irrelevant to the user question and evidence, mark it as "drop_all".
3. Prioritize the columns in each relevant table based on their relevance.
==========
[DB_ID] banking_system
[Schema]
# Table: account
[
(account_id, the id of the account. Value examples: [11382, 11362, 2, 1, 2367].),
(district_id, location of branch. Value examples: [77, 76, 2, 1, 39].),
(frequency, frequency of the acount. Value examples: [’POPLATEK MESICNE’, ’POPLATEK
TYDNE’, ’POPLATEK PO OBRATU’].),
(date, the creation date of the account. Value examples: [’1997-12-29’, ’1997-12-28’].)
]
# Table: client
[
(client_id, the unique number. Value examples: [13998, 13971, 2, 1, 2839].),
(gender, gender. Value examples: [’M’, ’F’]. And F:female . M:male ),
(birth_date, birth date. Value examples: [’1987-09-27’, ’1986-08-13’].), (district_id, location
of branch. Value examples: [77, 76, 2, 1, 39].)
]
# Table: loan
[
(loan_id, the id number identifying the loan data. Value examples: [4959, 4960, 4961].),
(account_id, the id number identifying the account. Value examples: [10, 80, 55, 43].),
(date, the date when the loan is approved. Value examples: [’1998-07-12’, ’1998-04-19’].),
(amount, the id number identifying the loan data. Value examples: [1567, 7877, 9988].),
(duration, the id number identifying the loan data. Value examples: [60, 48, 24, 12, 36].),
(payments, the id number identifying the loan data. Value examples: [3456, 8972, 9845].),
(status, the id number identifying the loan data. Value examples: [’C’, ’A’, ’D’, ’B’].)
]
# Table: district
[
(district_id, location of branch. Value examples: [77, 76].),
(A2, area in square kilometers. Value examples: [50.5, 48.9].),
(A4, number of inhabitants. Value examples: [95907, 95616].),
(A5, number of households. Value examples: [35678, 34892].),
(A6, literacy rate. Value examples: [95.6, 92.3, 89.7].),
(A7, number of entrepreneurs. Value examples: [1234, 1456].),
(A8, number of cities. Value examples: [5, 4].),
(A9, number of schools. Value examples: [15, 12, 10].),
(A10, number of hospitals. Value examples: [8, 6, 4].),
(A11, average salary. Value examples: [12541, 11277].),
(A12, poverty rate. Value examples: [12.4, 9.8].),
(A13, unemployment rate. Value examples: [8.2, 7.9].),
(A15, number of crimes. Value examples: [256, 189].)
]
[Foreign keys]
client.‘district_id‘ = district.‘district_id‘
[Question]
What is the gender of the youngest client who opened account in the lowest average salary branch?
[Evidence]
Later birthdate refers to younger age; A11 refers to average salary
[Answer]
```json
{
"account": "keep_all",
"client": "keep_all",
"loan": "drop_all",
"district": ["district_id", "A11", "A2", "A4", "A6", "A7"]
}
```
Question Solved.
==========
[DB_ID] {db_id}
[Schema]
{desc_str}
[Foreign keys]
{fk_str}
[Question]
{query}
[Evidence]
{evidence}
[Answer]
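For reference, below is a small sketch of how the Selector's JSON answer above could be applied to prune a schema; representing the schema as a dictionary from table name to column list is an assumption for illustration, not part of the released code.

```python
# Sketch: apply the Selector's JSON verdict ("keep_all" / "drop_all" / column list)
# to a schema represented as {table_name: [column, ...]} (assumed representation).
import json
from typing import Dict, List


def prune_schema(selector_answer: str, schema: Dict[str, List[str]]) -> Dict[str, List[str]]:
    verdict = json.loads(selector_answer)
    pruned = {}
    for table, columns in schema.items():
        decision = verdict.get(table, "keep_all")  # unmentioned tables are kept by default
        if decision == "keep_all":
            pruned[table] = columns
        elif decision == "drop_all":
            continue
        else:  # an explicit list of the most relevant columns
            pruned[table] = [c for c in columns if c in set(decision)]
    return pruned
```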
A.2 Decomposer Prompt
Given a [Database schema] description, a knowledge [Evidence] and the [Question], you need to
use valid SQLite and understand the database and knowledge, and then decompose the question
into subquestions for text-to-SQL generation.
==========
[Database schema]
# Table: frpm
[
(CDSCode, CDSCode. Value examples: [’01100170109835’, ’01100170112607’].),
(Charter School (Y/N), Charter School (Y/N). Value examples: [1, 0, None]. And 0: N;. 1: Y),
(Enrollment (Ages 5-17), Enrollment (Ages 5-17). Value examples: [5271.0, 4734.0].),
(Free Meal Count (Ages 5-17), Free Meal Count (Ages 5-17). Value examples: [3864.0, 2637.0].
And eligible free rate = Free Meal Count / Enrollment)
]
# Table: satscores
[
(cds, California Department Schools. Value examples: [’10101080000000’,
’10101080109991’].),
(sname, school name. Value examples: [’None’, ’Middle College High’, ’John F. Kennedy
High’, ’Independence High’, ’Foothill High’].),
(NumTstTakr, Number of Test Takers in this school. Value examples: [24305, 4942, 1, 0, 280].
And number of test takers in each school),
(AvgScrMath, average scores in Math. Value examples: [699, 698, 289, None, 492]. And
average scores in Math), (NumGE1500, Number of Test Takers Whose Total SAT Scores Are
Greater or Equal to 1500. Value examples: [5837, 2125, 0, None, 191]. And Number of Test Takers
Whose Total SAT Scores Are Greater or Equal to 1500. . commonsense evidence:. . Excellence
Rate = NumGE1500 / NumTstTakr)
]
[Foreign keys]
frpm.‘CDSCode‘ = satscores.‘cds‘
[Question]
List school names of charter schools with an SAT excellence rate over the average.
[Evidence]
Charter schools refers to ‘Charter School (Y/N)‘ = 1 in the table frpm; Excellence rate =
NumGE1500 / NumTstTakr
Decompose the question into sub questions, considering [Constraints], and generate the SQL after
thinking step by step:
Sub question 1: Get the average value of SAT excellence rate of charter schools.
SQL
```sql
SELECT AVG(CAST(T2.‘NumGE1500‘ AS REAL) / T2.‘NumTstTakr‘)
FROM frpm AS T1
INNER JOIN satscores AS T2
ON T1.‘CDSCode‘ = T2.‘cds‘
WHERE T1.‘Charter School (Y/N)‘ = 1
```
Sub question 2: List out school names of charter schools with an SAT excellence rate over the
average.
SQL
```sql
SELECT T2.‘sname‘
FROM frpm AS T1
INNER JOIN satscores AS T2
ON T1.‘CDSCode‘ = T2.‘cds‘
WHERE T2.‘sname‘ IS NOT NULL
AND T1.‘Charter School (Y/N)‘ = 1
AND CAST(T2.‘NumGE1500‘ AS REAL) / T2.‘NumTstTakr‘ > (
SELECT AVG(CAST(T4.‘NumGE1500‘ AS REAL) / T4.‘NumTstTakr‘)
FROM frpm AS T3
INNER JOIN satscores AS T4
ON T3.‘CDSCode‘ = T4.‘cds‘
WHERE T3.‘Charter School (Y/N)‘ = 1
)
```
Question Solved.
==========
[Database schema]
# Table: account
[
(account_id, the id of the account. Value examples: [11382, 11362, 2, 1, 2367].),
(district_id, location of branch. Value examples: [77, 76, 2, 1, 39].),
(frequency, frequency of the acount. Value examples: [’POPLATEK MESICNE’, ’POPLATEK
TYDNE’, ’POPLATEK PO OBRATU’].),
(date, the creation date of the account. Value examples: [’1997-12-29’, ’1997-12-28’].)
]
# Table: client
[
(client_id, the unique number. Value examples: [13998, 13971, 2, 1, 2839].),
(gender, gender. Value examples: [’M’, ’F’]. And F:female . M:male ),
(birth_date, birth date. Value examples: [’1987-09-27’, ’1986-08-13’].),
(district_id, location of branch. Value examples: [77, 76, 2, 1, 39].)
]
# Table: district
[
(district_id, location of branch. Value examples: [77, 76, 2, 1, 39].),
(A4, number of inhabitants . Value examples: [’95907’, ’95616’, ’94812’].),
(A11, average salary. Value examples: [12541, 11277, 8114].) ]
[Foreign keys]
account.‘district_id‘ = district.‘district_id‘
client.‘district_id‘ = district.‘district_id‘
[Question]
What is the gender of the youngest client who opened account in the lowest average salary branch?
[Evidence]
Later birthdate refers to younger age; A11 refers to average salary
Decompose the question into sub questions, considering [Constraints], and generate the SQL after
thinking step by step:
Sub question 1: What is the district_id of the branch with the lowest average salary?
SQL
```sql
SELECT ‘district_id‘
FROM district
ORDER BY ‘A11‘ ASC
LIMIT 1
```
Sub question 2: What is the youngest client who opened account in the lowest average
salary branch?
SQL
```sql
SELECT T1.‘client_id‘
FROM client AS T1
INNER JOIN district AS T2
ON T1.‘district_id‘ = T2.‘district_id‘
ORDER BY T2.‘A11‘ ASC, T1.‘birth_date‘ DESC
LIMIT 1
```
Sub question 3: What is the gender of the youngest client who opened account in the lowest
average salary branch?
SQL
```sql
SELECT T1.‘gender‘
FROM client AS T1
INNER JOIN district AS T2
ON T1.‘district_id‘ = T2.‘district_id‘
ORDER BY T2.‘A11‘ ASC, T1.‘birth_date‘ DESC
LIMIT 1
```
Question Solved.
==========
[Database schema]
{desc_str}
[Foreign keys]
{fk_str}
[Question]
{query}
[Evidence]
{evidence}
Decompose the question into sub questions, considering [Constraints], and generate the SQL after
thinking step by step:
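To connect this prompt with line 6 of Algorithm 1 (sql = subSQLs[-1]), the following is a small sketch of how the sub-questions and SQL blocks could be extracted from the Decomposer's reply; the regular expressions assume the "Sub question N:" and fenced ```sql block format shown above and are an illustration, not the released parser.

```python
# Sketch: parse the Decomposer's reply into (sub_questions, sub_sqls) and take the
# last SQL as the final answer, mirroring sql = subSQLs[-1] in Algorithm 1.
import re
from typing import List, Tuple


def parse_decomposer_reply(reply: str) -> Tuple[List[str], List[str]]:
    sub_questions = re.findall(r"Sub question \d+:\s*(.+)", reply)
    sub_sqls = [block.strip() for block in re.findall(r"```sql\s*(.*?)```", reply, re.DOTALL)]
    return sub_questions, sub_sqls


# Usage: the final SQL handed to the Refiner is the last extracted block.
# sub_qs, sub_sqls = parse_decomposer_reply(llm_reply)
# final_sql = sub_sqls[-1]
```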
A.3 Refiner Prompt
[Instruction]
When executing SQL below, some errors occurred, please fix up SQL based on query and database
info. Solve the task step by step if you need to. Using SQL format in the code block, and indicate
script type in the code block. When you find an answer, verify the answer carefully. Include
verifiable evidence in your response if possible.
[Constraints]
- In ‘SELECT <column>‘, just select needed columns in the [Question] without any unnecessary
column or value
- In ‘FROM <table>‘ or ‘JOIN <table>‘, do not include unnecessary table
- If use max or min func, ‘JOIN <table>‘ FIRST, THEN use ‘SELECT MAX(<column>)‘ or
‘SELECT MIN(<column>)‘
- If [Value examples] of <column> has ’None’ or None, use ‘JOIN <table>‘ or ‘WHERE <column>
is NOT NULL‘ is better
- If use ‘ORDER BY <column> ASC|DESC‘, add ‘GROUP BY <column>‘ before to select distinct
values
[Query]
{query}
[Evidence]
{evidence}
[Database info]
{desc_str}
[Foreign keys]
{fk_str}
[old SQL]
```sql
{sql}
```
[SQLite error]
{sqlite_error}
[Exception class]
{exception_class}
Now please fixup old SQL and generate new SQL again.
[correct SQL]
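A minimal sketch of how the error information fed into this prompt ([SQLite error] and [Exception class]) could be collected is shown below; the field names mirror the template above, while the function itself is an assumption rather than the released implementation.

```python
# Sketch: run a candidate SQL against SQLite and collect the feedback that the
# Refiner prompt consumes ([SQLite error] and [Exception class]).
import sqlite3
from typing import Optional, Tuple


def execute_and_analyze(db_path: str, sql: str) -> Tuple[bool, Optional[str], Optional[str]]:
    """Returns (ok, sqlite_error, exception_class)."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(sql).fetchall()
        return True, None, None
    except sqlite3.Error as err:
        return False, str(err), type(err).__name__
    finally:
        conn.close()
```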
[Figure 7 body omitted: each error type is illustrated with a question, evidence, gold SQL, predicted SQL, and an error description. The types shown include Semantic Correct (the same answer returned with a different column order), Question Misunderstand (the predicted SQL misses one of the names in the question), Dirty Database Values (the tables cards and set_translations both have the column setCode but with inconsistent values), Wrong Schema Linking (a different table join order and the wrong table id are used), Evidence Misunderstand (the knowledge "K-12" is misinterpreted as explicit grade filters), and Other Errors.]

Figure 7: Eight major types of error cases of BIRD are presented. Some cases are shortened for better presentation.