MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL
Bing Wang1 , Changyu Ren1 , Jian Yang1 , Xinnian Liang1 , Jiaqi Bai1 , Linzheng Chai1
Zhao Yan2 , Qian-Wen Zhang2 , Di Yin2 , Xing Sun2 , Zhoujun Li1 †
1 Beihang University    2 Tencent Youtu Lab
{bingwang,cyren,jiaya,xnliang,bjq,challenging,lizj}@buaa.edu.cn
{zhaoyan,cowenzhang,endymecyyin,winfredsun}@tencent.com
Abstract

Recent LLM-based Text-to-SQL methods usually suffer from significant performance degradation on "huge" databases and complex user questions that require multi-step reasoning. Moreover, most existing methods neglect the crucial significance of LLMs utilizing external tools and model collaboration. To address these challenges, we introduce MAC-SQL, a novel LLM-based multi-agent collaborative framework. Our framework comprises a core Decomposer agent for Text-to-SQL generation with few-shot chain-of-thought reasoning, accompanied by two auxiliary agents that utilize external tools or models to acquire smaller sub-databases and refine erroneous SQL queries. The Decomposer agent collaborates with the auxiliary agents, which are activated as needed and can be expanded to accommodate new features or tools for effective Text-to-SQL parsing. In our framework, we initially leverage GPT-4 as the strong backbone LLM for all agent tasks to determine the upper bound of our framework. We then fine-tune an open-source instruction-following model, SQL-Llama, by leveraging Code Llama 7B, to accomplish all tasks as GPT-4 does. Experiments show that SQL-Llama achieves a comparable execution accuracy of 43.94, compared to the baseline accuracy of 46.35 for vanilla GPT-4. At the time of writing, MAC-SQL+GPT-4 achieves an execution accuracy of 59.59 when evaluated on the BIRD benchmark, establishing a new state-of-the-art (SOTA) on its holdout test set.¹

¹https://github.com/wbbeyourself/MAC-SQL

Tables:
  frpm: CDSCode, County Code, School Code, Charter School(Y/N)
  satscores: cds, sname, AvgScrMath, NumTstTakr, NumGE1500, …
  schools: CDSCode, NCESDist, County, City, Zip, …
Evidence:
  SAT_Excellence_Rate = CAST(NumGE1500 AS REAL) / NumTstTakr
Gold SQL:
  SELECT ST.sname FROM frpm FR JOIN satscores ST
  ON FR.CDSCode = ST.cds WHERE FR.`Charter School (Y/N)` = 1
  AND SAT_Excellence_Rate > (
    SELECT AVG(SAT_Excellence_Rate) FROM frpm fr JOIN satscores st
    ON fr.CDSCode = st.cds WHERE fr.`Charter School (Y/N)` = 1 )

Figure 1: A complex example of Text-to-SQL. In the Gold SQL, we use SAT_Excellence_Rate to represent "CAST(NumGE1500 AS REAL)/NumTstTakr" for the sake of brevity.

1 Introduction

Text-to-SQL aims to automate the process of generating Structured Query Language (SQL) queries for databases from natural language text. This long-standing challenge is essential for improving database accessibility without requiring the expertise of SQL (Qin et al., 2022; Sun et al., 2023). Over the past decade, research in this field has progressed through three stages. In the initial phase, systems encoded the input sequence using pre-trained models, and SQL queries were decoded using either abstract syntax trees (Xu et al., 2017; Guo et al., 2019; Wang et al., 2021) or predefined sketches (He et al., 2019). More recent systems (Raffel et al., 2023; Xie et al., 2022; Scholak et al., 2021) have adopted sequence-to-sequence methodologies. The latest research (Ouyang et al., 2022; OpenAI, 2023; Rozière et al., 2023) has demonstrated the remarkable capabilities of Large Language Models (LLMs) in this task. The success of these models can be ascribed to their emergent abilities (Wei et al., 2023; Brown et al., 2020) and the robust reasoning capabilities inherent in LLMs.

Recent research on LLM-based Text-to-SQL (Dong et al., 2023; Pourreza and Rafiei, 2023; Gao et al., 2023) has mainly concentrated on in-context learning prompt strategies and supervised fine-tuning using data derived from the target domain.
[Figure 2 body omitted: the Selector takes the user question "List school names of charter schools with an SAT excellence rate over the average." together with the schemas of the tables schools, satscores, and frpm and keeps only the relevant sub-database; the Decomposer splits the question into Sub Q1 ("Get the average value of SAT excellence rate of charter schools.") and Sub Q2, producing SQL 1 and SQL 2 step by step; the Refiner executes the SQL with SQLite and refines it into the final SQL.]

Figure 2: The overview of our MAC-SQL framework, which comprises three agents: (i) the Selector, which decomposes a large database into a smaller sub-database to mitigate the interference of irrelevant information; (ii) the Decomposer, which breaks down a complex question into simpler sub-questions and resolves them progressively by chain-of-thought reasoning; and (iii) the Refiner, which uses an external tool for SQL execution, obtains feedback, and then refines faulty SQL queries.
However, these approaches usually suffer from significant performance degradation on "huge" databases and complex user questions that require multi-step reasoning, as demonstrated in Figure 1. Moreover, most existing methods neglect the crucial significance of LLMs utilizing external tools and model collaboration.

To alleviate the above challenges, we introduce MAC-SQL, a novel LLM-based multi-agent collaborative framework, which exploits LLMs as intelligent agents with different functionalities for effective Text-to-SQL parsing. Our framework comprises a core Decomposer agent for Text-to-SQL generation, accompanied by two auxiliary agents, the Selector and the Refiner, for tool usage and SQL refinement. Specifically, the Decomposer breaks down a complex question into simpler sub-questions and resolves them progressively by chain-of-thought reasoning. When necessary, the Selector decomposes a large database into a smaller sub-database to minimize the interference of irrelevant information, while the Refiner employs an external tool for SQL execution, obtains feedback, and refines erroneous SQL queries.

Furthermore, we have fine-tuned an instruction-following model, SQL-Llama, by leveraging Code Llama 7B, using agent instruction data from MAC-SQL, thus enabling capabilities in database simplification, question decomposition, SQL generation, and SQL correction.

In our experiments, we initially leverage GPT-4 as a strong backbone LLM for all agent tasks to determine the upper bound of our MAC-SQL framework on the widely used BIRD and Spider datasets. Experimental results demonstrate that MAC-SQL+GPT-4 achieves an execution accuracy of 59.59 on the holdout test set of BIRD, establishing a new state-of-the-art (SOTA) at the time of writing. Furthermore, we utilize SQL-Llama(7B) to accomplish all tasks as GPT-4 does. Surprisingly, despite SQL-Llama having an order of magnitude fewer parameters than GPT-4, its execution accuracy reaches 43.94, which is remarkably close to the accuracy of GPT-4 (46.35).

Contributions  Our main contributions and results are summarized as follows:

1. We propose MAC-SQL, a novel multi-agent collaborative framework for Text-to-SQL, which integrates external tools and facilitates model collaboration to address intricate scenarios.

2. We introduce an instruction-tuned model, named SQL-Llama, to fill the gap in open-source agent-instruction-following models for the task of Text-to-SQL.

3. Experimental results demonstrate that MAC-SQL achieves a state-of-the-art execution accuracy of 59.59% on the BIRD test set at the time of writing.
2 Preliminaries

2.1 Problem Definition of Text-to-SQL

Given a triple X = (Q, S, K), where Q, S, and K are the natural language question, the database schema, and external knowledge (optional), the database schema S is defined as {T, C}, where T represents multiple tables {T_1, T_2, ..., T_|T|} and C represents columns {C_1, C_2, ..., C_|C|}. The purpose of the Text-to-SQL task is to generate the correct SQL Y corresponding to the question Q.

2.2 Large Language Model for Text-to-SQL

The task of Text-to-SQL has recently been formulated as a generation task (Dong et al., 2023; Pourreza and Rafiei, 2023), designing appropriate prompts to guide a large language model M to generate SQL queries token by token. The generation process can be formulated as follows:

    P(Y | X; M) = ∏_{i=1}^{|Y|} P(y_i | y_{<i}, Q, S, K; M)    (1)

3 MAC-SQL Framework

3.1 Overview

In Figure 2, we introduce MAC-SQL, a novel LLM-based multi-agent collaborative framework, which exploits LLMs as intelligent agents with different functionalities for effective Text-to-SQL parsing. MAC-SQL comprises a core Decomposer agent for Text-to-SQL generation, accompanied by two auxiliary agents, the Selector and the Refiner, for tool usage and SQL refinement. Algorithm 1 presents the collaboration process of the three agents in MAC-SQL. In the following sections, we present a detailed introduction of the three agents.

Algorithm 1 The algorithm of MAC-SQL
Input: question q, database db, knowledge kg
Output: sql
 1: if db needs simplification then
 2:     db = LLM_Selector(q, db, kg)
 3: end if
 4: dbDesc = getDbRepresentation(db, kg)
 5: subQs, subSQLs = LLM_Decomposer(q, dbDesc)
 6: sql = subSQLs[-1]
 7: count = 0
 8: while count < maxTryTimes do
 9:     ok, err = executeAndAnalyze(sql, db)
10:     if ok then
11:         return sql
12:     else
13:         sql = LLM_Refiner(q, dbDesc, sql, err)
14:     end if
15:     count = count + 1
16: end while
17: return sql

3.2 Selector

Given an input triple X = (Q, S, K), where the database schema is S = {T, C}, the Selector agent aims to locate the minimal schema S′ = {T′, C′}, where T′ ⊆ T and C′ ⊆ C, that suffices to answer the question Q with knowledge K. The function of the Selector agent can be described as:

    S′ = f_selector(Q, S, K | M)    (2)

where f_selector(·|M) denotes the function of the Selector implemented by prompting the LLM M. The motivation behind designing the Selector primarily involves two key factors. Firstly, introducing too many irrelevant schema items in the prompt increases the likelihood of the LLM generating irrelevant schema items in the output SQL. Secondly, using the complete database schema results in excessive text length, leading to unnecessary API costs, and may exceed the maximum context length of the LLM. It is important to note that the Selector is only activated when the length of the database schema prompt exceeds a length threshold; otherwise, the original database schema S is used for the subsequent process. The complete prompt of the Selector agent is shown in Figure 6.
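To make the collaboration in Algorithm 1 concrete, the following is a minimal Python sketch of the loop. The selector, decomposer, and refiner callables stand in for prompting an LLM with the corresponding agent prompts, and the schema-length threshold value is illustrative only; this is not the released implementation.

```python
# Minimal sketch of the MAC-SQL collaboration loop (Algorithm 1), not the
# released implementation. The selector/decomposer/refiner callables stand in
# for prompting an LLM with the corresponding agent prompts.
import sqlite3
from typing import Callable, List, Tuple

MAX_TRY_TIMES = 3
SCHEMA_LENGTH_THRESHOLD = 25_000  # characters; illustrative value only


def mac_sql(question: str,
            db_path: str,
            schema_desc: str,
            knowledge: str,
            selector: Callable[[str, str, str], str],
            decomposer: Callable[[str, str, str], Tuple[List[str], List[str]]],
            refiner: Callable[[str, str, str, str], str]) -> str:
    # Selector: prune the schema only when its textual description is too long.
    if len(schema_desc) > SCHEMA_LENGTH_THRESHOLD:
        schema_desc = selector(question, schema_desc, knowledge)

    # Decomposer: chain-of-thought decomposition; the last sub-SQL answers the question.
    _sub_questions, sub_sqls = decomposer(question, schema_desc, knowledge)
    sql = sub_sqls[-1]

    # Refiner: execute, collect error feedback, and retry a bounded number of times.
    for _ in range(MAX_TRY_TIMES):
        conn = sqlite3.connect(db_path)
        try:
            conn.execute(sql).fetchall()
            return sql  # executable SQL: accept it
        except sqlite3.Error as err:
            sql = refiner(question, schema_desc, sql, str(err))
        finally:
            conn.close()
    return sql
```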
3.3 Decomposer
4.2 Multi-task Supervised Fine-tuning

Our research has been primarily focused on the development of open-source models within the MAC-SQL framework, to achieve performance levels comparable to closed-source models like GPT-4. To achieve this, we have put significant effort into preparing the data for model training and have open-sourced SQL-Llama, a model fine-tuned on the instruction data of the three intelligent agents. The SQL-Llama model, based on Code Llama 7B, has undergone supervised fine-tuning using agent instruction data from MAC-SQL, which has enhanced its capabilities in database simplification, question decomposition, SQL generation, and SQL correction.

Given the Agent-Instruct dataset with N (N=3) instruction tasks, D = {D_i}_{i=1}^{N}, the LLM trained on D can learn from these tasks and complete the agent tasks. The supervised fine-tuning process can be described as:

    L = − Σ_{i=1}^{N} E_{(Q, S^i, K, Y^i) ∼ D} [ log P(Y^i | Q, S^i, K; M) ]    (5)

where L is the training objective over the N tasks, and S^i and Y^i are the selected database schema and intermediate SQL query of the i-th task.

One of the key challenges we encountered during the model training process was balancing model complexity with performance. We had to carefully optimize the model architecture and parameters to ensure that it could effectively handle the complexities of database-related tasks while still maintaining high performance. Additionally, ensuring the quality and relevance of the instruction dataset for training was crucial, as it directly impacted the model's performance.
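To make the objective concrete, the following is a minimal sketch of how Eq. (5) could be computed for one mixed batch drawn from the three agent-instruction datasets. The Hugging Face-style model and tokenizer interface and the batch field names ('prompt' holding the rendered (Q, S^i, K) and 'target' holding Y^i) are assumptions, not the released training code.

```python
# A minimal sketch of the multi-task SFT objective in Eq. (5), assuming a
# Hugging Face-style causal LM and tokenizer; the batch layout is an assumption.
import torch
import torch.nn.functional as F


def sft_loss(model, tokenizer, batch):
    """Average negative log-likelihood of the targets, with prompt tokens masked out."""
    losses = []
    for example in batch:
        prompt_ids = tokenizer(example["prompt"], return_tensors="pt").input_ids
        target_ids = tokenizer(example["target"], return_tensors="pt").input_ids
        input_ids = torch.cat([prompt_ids, target_ids], dim=1)

        logits = model(input_ids=input_ids).logits        # (1, L, vocab)
        shift_logits = logits[:, :-1, :]                  # token t predicts token t+1
        shift_labels = input_ids[:, 1:].clone()
        shift_labels[:, : prompt_ids.size(1) - 1] = -100  # only Y^i tokens contribute

        losses.append(F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_labels.reshape(-1),
            ignore_index=-100,
        ))
    return torch.stack(losses).mean()
```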
Method                    | Dev EX | Dev VES | Test EX | Test VES
Palm-2                    | 27.38  | -       | 33.04   | -
ChatGPT + CoT             | 36.64  | 42.30   | 40.08   | 56.56
Claude-2                  | 42.70  | -       | 49.02   | -
GPT-4                     | 46.35  | 49.77   | 54.89   | 60.77
DIN-SQL + GPT-4           | 50.72  | 58.79   | 55.90   | 59.44
DAIL-SQL + GPT-4          | 54.76  | 56.08   | 57.41   | 61.95
SQL-Llama(7B)             | 32.87  | 55.67   | -       | -
MAC-SQL + SQL-Llama(7B)   | 43.94  | 57.36   | -       | -
  + Oracle Schema         | 51.43  | 58.24   | -       | -
MAC-SQL + GPT-3.5-Turbo   | 50.56  | 61.25   | -       | -
  + Oracle Schema         | 65.78  | 60.62   | -       | -
MAC-SQL + GPT-4           | 59.39  | 66.39   | 59.59   | 67.68
  + Oracle Schema         | 70.28  | 62.63   | -       | -

Table 1: Execution accuracy (EX) and valid efficiency score (VES) on both the dev and test sets of the BIRD dataset. The term "Oracle Schema" refers to the utilization of a ground-truth sub-database as the input for the Decomposer, rather than employing the results obtained from the Selector.

Method                    | EX (Dev) | EX (Test)
C3 + ChatGPT              | 81.80    | 82.30
DIN-SQL + GPT-4           | 82.80    | 85.30
DAIL-SQL + GPT-4          | 84.40    | 86.60
SQL-Llama(7B)             | 65.48    | 61.63
MAC-SQL + SQL-Llama(7B)   | 76.25    | 70.58
MAC-SQL + GPT-3.5-Turbo   | 80.56    | 75.53
MAC-SQL + GPT-4           | 86.75    | 82.80

Table 2: Execution accuracy (EX) on both the dev and test sets of Spider.

Datasets  The Spider (Yu et al., 2018) dataset is frequently employed for assessing the performance of text-to-SQL parsing across multiple databases, necessitating models to demonstrate adaptability to unfamiliar database structures. The dataset comprises 7,000 question-query pairs in the training set and 1,034 pairs in the development set, encompassing 200 distinct databases and 138 domains. In this study, we assess the efficacy of our framework on the Spider development set, as the test set is not accessible.

The BIRD (Li et al., 2023) dataset released by Alibaba DAMO Academy is a new benchmark for large-scale real databases, containing 95 large-scale databases and high-quality Text-to-SQL pairs, with a data storage volume of up to 33.4 GB spanning 37 professional domains. Unlike Spider, BIRD focuses on massive and real database content, external knowledge reasoning between natural language questions and database content, and new challenges in SQL efficiency when dealing with large databases.

Evaluation Metrics  Following BIRD (Li et al., 2023) and Test-suite (Zhong et al., 2020), we consider three metrics, exact match accuracy (EM), execution accuracy (EX), and valid efficiency score (VES), to evaluate text-to-SQL models confronted with real-world scenarios with large database contents. Exact Match Accuracy (EM) treats each clause as a set and compares the prediction for each clause to its corresponding clause in the reference query. A predicted SQL query is considered correct only if all of its components match the ground truth; this metric does not take values into account. Execution Accuracy (EX) is defined as the proportion of questions in the evaluation set for which the execution results of both the predicted and ground-truth queries are identical, relative to the overall number of queries. Valid Efficiency Score (VES) is designed to measure the efficiency of valid SQLs generated by models, where "valid SQLs" refers to predicted SQL queries whose result sets align with those of the ground-truth SQLs.
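As an illustration of the EX metric defined above, the sketch below executes the predicted and ground-truth SQL against the same SQLite database and compares their result sets. Treating results as order-insensitive multisets and the example field layout are simplifying assumptions; the official evaluation scripts should be used for reported numbers.

```python
# Sketch of Execution Accuracy (EX): a prediction counts as correct when its
# execution result matches the gold query's result on the same database.
import sqlite3
from typing import List, Tuple


def execute(db_path: str, sql: str) -> List[Tuple]:
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


def execution_accuracy(examples: List[dict]) -> float:
    """examples: dicts with 'db_path', 'pred_sql', and 'gold_sql' keys (assumed layout)."""
    correct = 0
    for ex in examples:
        try:
            pred = execute(ex["db_path"], ex["pred_sql"])
            gold = execute(ex["db_path"], ex["gold_sql"])
            if sorted(map(repr, pred)) == sorted(map(repr, gold)):
                correct += 1
        except sqlite3.Error:
            pass  # a prediction that fails to execute counts as wrong
    return correct / len(examples)
```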
Baselines  We conduct experiments on both the BIRD and Spider datasets and compare our method with the following baselines:

• GPT-4 (OpenAI, 2023) uses a simple zero-shot text-to-SQL prompt for SQL generation.

• DIN-SQL (Pourreza and Rafiei, 2023) decomposes the text-to-SQL task into smaller subtasks and designs different prompts for each subtask to instruct GPT-4 to complete each subtask and obtain the final SQL.

• DAIL-SQL (Gao et al., 2023) encodes structure knowledge as SQL statements, selects few-shot demonstrations based on their skeleton similarities, and removes cross-domain knowledge from examples for token efficiency.

• C3-SQL (Dong et al., 2023) first performs schema linking filtering and then directs GPT-4 with a calibration bias prompt designed for Spider using a self-consistency strategy.

5.2 Overall Performance

It is important to note that the experiments utilized the 32k version of GPT-4 and the 16k version of GPT-3.5-Turbo.

BIRD Results  In Table 1, we report the performance of our method and baseline methods on the BIRD dataset. It is evident that our method surpasses all LLM-based methods in terms of execution accuracy (EX) and valid efficiency score (VES) on both the development and test sets. Specifically, our method outperforms the second-best method by 4.63% on the development set and by 2.18% on the test set. At the time of writing, MAC-SQL+GPT-4 achieves an execution accuracy of 59.59 when evaluated on the BIRD benchmark, establishing a new state-of-the-art (SOTA) on its holdout test set.

Spider Results  Currently, Spider has open-sourced its test set, so we can evaluate our method on both the development and the test set. As shown in Table 2, on the dev set of Spider (Yu et al., 2018), our method achieves the highest execution accuracy using GPT-4. These results demonstrate the generalization ability of our MAC-SQL framework.
Method            | Simple | Mod.  | Chall. | All
MAC-SQL + GPT-4   | 65.73  | 52.69 | 40.28  | 59.39
w/o Selector      | 65.73  | 52.04 | 35.14  | 57.28 (↓)
w/o Decomposer    | 61.51  | 48.82 | 38.89  | 55.54 (↓)
w/o Refiner       | 63.24  | 44.52 | 33.33  | 54.76 (↓)

Table 3: Execution accuracy of the MAC-SQL ablation study on the BIRD dev set. For brevity, "Mod." stands for "Moderate" and "Chall." denotes "Challenging".

Few-shot | BIRD EX | BIRD VES | Spider EM | Spider EX
0-shot   | 55.54   | 63.31    | 58.42     | 74.22
1-shot   | 57.26   | 64.32    | 59.68     | 78.35
2-shot   | 59.39   | 66.24    | 63.20     | 86.75

Table 4: Results of MAC-SQL+GPT-4 on the dev sets of BIRD and Spider with few-shot evaluation.

5.3 Ablation Study

Table 3 presents the results of an ablation study of the MAC-SQL model on the BIRD dev set. The table lists different variations of the MAC-SQL model, including with and without certain components such as the Selector, Decomposer, and Refiner. The other columns represent the accuracy of the model on different levels of difficulty: Simple, Moderate, and Challenging, as well as the overall accuracy (All). The findings show that the original MAC-SQL + GPT-4 model achieves an accuracy of 65.73% on Simple, 52.69% on Moderate, and 40.28% on Challenging, with an overall accuracy of 59.39%. When removing the Selector component, the accuracy remains the same for Simple but decreases to 52.04% for Moderate and 35.14% for Challenging, resulting in an overall accuracy of 57.28% (a decrease of 2.11%). Similarly, removing the Decomposer and Refiner components also leads to decreases in accuracy across all difficulty levels.

Overall, the ablation study indicates that each component of the MAC-SQL model (Selector, Decomposer, and Refiner) plays a crucial role in achieving high accuracy, as their removal results in decreased performance across all difficulty levels.

5.4 Discussion

Impact of the number of demonstrations  Table 4 shows evaluation results of MAC-SQL with different numbers of demonstrations on the BIRD and Spider datasets. As the number of shots increases from 0 to 2, there is a consistent improvement in the performance metrics (EX, VES, and EM) for both BIRD and Spider. This indicates that the model benefits from additional demonstration examples and is able to generalize better with more data. The highest performance is achieved with 2-shot evaluation, indicating that the model is capable of learning effectively from a small number of examples. The high cost of the GPT-4 interface results in a significant consumption of tokens during a full test of the dev sets of Spider and BIRD, estimated at approximately 6 million and 10 million tokens, respectively. Due to these cost constraints, our analysis is limited to a maximum of 2 shots, and further experiments involving more shots (e.g., k > 2) will have to await a more budget-friendly implementation of GPT-4.

5.5 Error Analysis

In order to thoroughly assess the limitations of our method, we begin by choosing two datasets (BIRD and Spider) that contain various types of structured data, as shown in Figure 5.

Figure 5 displays the error type distributions on the BIRD and Spider datasets. "Gold Error" is the most common error type, accounting for 30% and 22% in BIRD and Spider, respectively, signifying the significance of gold standard annotations. "Semantic Correct" is another prevalent error type, representing 14% and 22% in BIRD and Spider, respectively, indicating the importance of semantic understanding and correctness. However, "Schema Linking Error" is less frequent in BIRD (2%) than in Spider (8%), demonstrating differences in schema linking behavior between the datasets. This analysis underscores the need for addressing gold standard annotations, semantic correctness, and schema linking in dataset development and evaluation, thereby improving their quality and reliability. Appendix B contains detailed examples of error types.
Figure 5: Error Distributions of MAC-SQL on dev set of BIRD and Spider.
6 Related Work

LLMs for Text-to-SQL  Recent advancements in text-to-SQL tasks using large language models (LLMs) have focused on improving prompt design and developing multi-stage refined frameworks. In the early stages of the emergence of large language models, research efforts were primarily focused on designing high-quality prompts to better exploit the potential of LLMs for SQL generation. For example, Tai et al. (2023) systematically studied how to enhance LLMs' reasoning ability through chain-of-thought style prompting, including the original chain-of-thought prompting and least-to-most prompting. Similarly, Chang and Fosler-Lussier (2023) comprehensively investigated the impact of prompt constructions across various settings when constructing the prompt text for text-to-SQL inputs. Additionally, DAIL-SQL (Gao et al., 2023) systematically examined prompt engineering for LLM-based Text-to-SQL methods, including question representations, prompt components, example selections, and example organizations. Later studies, like C3-SQL (Dong et al., 2023), DIN-SQL (Pourreza and Rafiei, 2023), and StructGPT (Jiang et al., 2023), proposed frameworks for simplifying databases, generating SQL, verifying queries, and integrating answers through zero-shot approaches, query decomposition, and specialized interfaces for structured data access. However, the aforementioned methods have several issues. Firstly, the experiments were conducted solely on the Spider family of datasets, failing to demonstrate their generalization to more complex datasets like BIRD, hence limiting their real-world applicability. Secondly, certain methods depend on difficulty-level classifiers and customized biases specific to the Spider dataset for error correction, thus lacking the ability to generalize to a broader spectrum of error types. Thirdly, these methods neglect the utilization of external tools and the collaboration of different modules. Thus, we propose a framework centered on multi-agent collaboration that can be utilized for more intricate data scenarios and a broader spectrum of error types for detection and correction.

LLM-based Agents  LLM-based agents have been a prominent area of study in both the academic and industry communities for an extended period (Wang et al., 2023). Recently, through the acquisition of vast amounts of web knowledge, LLMs have demonstrated remarkable potential in achieving human-level intelligence. This development has led to a surge in research exploring autonomous agents based on LLMs. AutoGPT (Team, 2023) is an open-source implementation of an AI agent that follows a single-agent paradigm, in which it augments the AI model with many useful tools and does not support multi-agent collaboration. Similarly, OpenAgents (Xie et al., 2023) develops three distinct agents, the Data Agent for data analysis, the Plugins Agent for plugin integration, and the Web Agent for autonomous web browsing, each specializing in different domains, similar to OpenAI's ChatGPT Plugins. Additionally, AutoGen (Wu et al., 2023) is an open-source framework that enables developers to build customizable, conversable agents that can operate in various modes, employing combinations of LLMs, human inputs, and tools to accomplish tasks. However, how to apply LLM-based agents to Text-to-SQL parsing remains under-explored.

We fill this gap by proposing a multi-agent collaborative Text-to-SQL framework, which integrates multiple LLM-based agents to collectively interpret SQL queries and address the complexity and diversity of SQL queries encountered in real-world scenarios.

7 Conclusion

In summary, this paper proposes the MAC-SQL framework, which utilizes multi-agent collaboration to address challenges in Text-to-SQL tasks. The framework, along with the open-sourced SQL-Llama model, achieved an execution accuracy of 59.59 when evaluated on the BIRD benchmark, establishing a new state-of-the-art (SOTA) on its holdout test set. This work presents a novel approach to Text-to-SQL and provides practical guidance for achieving high performance in this domain.
Limitations

There are two limitations of our work. Firstly, we did not extensively engineer the prompts, which may not be optimal. Secondly, this paper reports the fine-tuning results of the 7B Code Llama model. Although it performs at a comparable level, we believe its performance can be further improved by using larger models.

Ethics Statement

The datasets and models utilized in this paper, the implementation of the code, and the resulting models are not associated with any ethical concerns.

References

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.

Shuaichen Chang and Eric Fosler-Lussier. 2023. How to prompt LLMs for text-to-SQL: A study in zero-shot, single-domain, and cross-domain settings.

Xuemei Dong, Chao Zhang, Yuhang Ge, Yuren Mao, Yunjun Gao, Lu Chen, Jinshu Lin, and Dongfang Lou. 2023. C3: Zero-shot text-to-SQL with ChatGPT.

Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. 2023. Text-to-SQL empowered by large language models: A benchmark evaluation.

Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards complex text-to-SQL in cross-domain database with intermediate representation.

Pengcheng He, Yi Mao, Kaushik Chakrabarti, and Weizhu Chen. 2019. X-SQL: Reinforce schema representation with context.

Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, and Ji-Rong Wen. 2023. StructGPT: A general framework for large language model to reason over structured data.

Jinyang Li, Binyuan Hui, Ge Qu, Binhua Li, Jiaxi Yang, Bowen Li, Bailin Wang, Bowen Qin, Rongyu Cao, Ruiying Geng, Nan Huo, Xuanhe Zhou, Chenhao Ma, Guoliang Li, Kevin C. C. Chang, Fei Huang, Reynold Cheng, and Yongbin Li. 2023. Can LLM already serve as a database interface? A big bench for large-scale database grounded text-to-SQLs.

OpenAI. 2023. GPT-4 technical report. ArXiv.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback.

Mohammadreza Pourreza and Davood Rafiei. 2023. DIN-SQL: Decomposed in-context learning of text-to-SQL with self-correction.

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2024. ChatDev: Communicative agents for software development.

Bowen Qin, Binyuan Hui, Lihan Wang, Min Yang, Jinyang Li, Binhua Li, Ruiying Geng, Rongyu Cao, Jian Sun, Luo Si, et al. 2022. A survey on text-to-SQL parsing: Concepts, methods, and future directions. arXiv preprint arXiv:2208.13629.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2023. Exploring the limits of transfer learning with a unified text-to-text transformer.

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. 2023. Code Llama: Open foundation models for code.

Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. 2021. PICARD: Parsing incrementally for constrained auto-regressive decoding from language models.

Ruoxi Sun, Sercan O. Arik, Hootan Nakhost, Hanjun Dai, Rajarishi Sinha, Pengcheng Yin, and Tomas Pfister. 2023. SQL-PaLM: Improved large language model adaptation for text-to-SQL.

Chang-You Tai, Ziru Chen, Tianshu Zhang, Xiang Deng, and Huan Sun. 2023. Exploring chain-of-thought style prompting for text-to-SQL.

AutoGPT Team. 2023. AutoGPT: Build and use AI agents.

Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2021. RAT-SQL: Relation-aware schema encoding and linking for text-to-SQL parsers.
Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2023. A survey on large language model based autonomous agents. arXiv preprint arXiv:2308.11432.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-thought prompting elicits reasoning in large language models.

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. 2023. AutoGen: Enabling next-gen LLM applications via multi-agent conversation.

Tianbao Xie, Chen Henry Wu, Peng Shi, Ruiqi Zhong, Torsten Scholak, Michihiro Yasunaga, Chien-Sheng Wu, Ming Zhong, Pengcheng Yin, Sida I. Wang, Victor Zhong, Bailin Wang, Chengzu Li, Connor Boyle, Ansong Ni, Ziyu Yao, Dragomir Radev, Caiming Xiong, Lingpeng Kong, Rui Zhang, Noah A. Smith, Luke Zettlemoyer, and Tao Yu. 2022. UnifiedSKG: Unifying and multi-tasking structured knowledge grounding with text-to-text language models.

Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, Leo Z. Liu, Yiheng Xu, Hongjin Su, Dongchan Shin, Caiming Xiong, and Tao Yu. 2023. OpenAgents: An open platform for language agents in the wild.

Xiaojun Xu, Chang Liu, and Dawn Song. 2017. SQLNet: Generating structured queries from natural language without reinforcement learning.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proc. of EMNLP.

Ruiqi Zhong, Tao Yu, and Dan Klein. 2020. Semantic evaluation for text-to-SQL with distilled test suites. In Proc. of EMNLP.

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. 2022. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625.
Task Description
As an experienced and professional database administrator , your task is to ...
Instruction
1. Discard any table schema that is not related to the user question and evidence .
2. Sort the columns in each relevant table in descending order of relevance and keep the top 6
columns .
3. Ensure that at least 3 tables are included in the final output JSON .
4. The output should be in JSON format .
Demonstration
[ DB_ID ] banking_system
[ Schema ] Table schemas of account , client , loan , district ...
[ Foreign keys ] ...
[ Question ]: What is the gender of the youngest client who opened account in the lowest average
salary branch ?
[ Evidence ]: Later birthdate refers to younger age ; A11 refers to average salary
[ Answer ]
``` json
{ " account " : " keep_all " ,
" client " : " keep_all " ,
" loan " : " drop_all " ,
" district " : [ " district_id " , " A11 " , " A2 " , ...] }
```
Test Question
[ DB_ID ] { db_id }
[ Schema ] { desc_str }
[ Foreign keys ] { fk_str }
[ Question ] { query }
[ Evidence ] { evidence }
[ Answer ]
Figure 6: An example of Selector prompt. The specific details are omitted for the sake of brevity.
A Prompt Details
A.1 Selector Prompt
As an experienced and professional database administrator, your task is to analyze a user question
and a database schema to provide relevant information. The database schema consists of table
descriptions, each containing multiple column descriptions. Your goal is to identify the relevant
tables and columns based on the user question and evidence provided.
[Instruction]
1. Discard any table schema that is not related to the user question and evidence.
2. Sort the columns in each relevant table in descending order of relevance and keep the top 6
columns.
3. Ensure that at least 3 tables are included in the final output JSON.
4. The output should be in JSON format.
[Requirements]
1. If a table has less than or equal to 10 columns, mark it as "keep_all".
2. If a table is completely irrelevant to the user question and evidence, mark it as "drop_all".
3. Prioritize the columns in each relevant table based on their relevance.
==========
[DB_ID] banking_system
[Schema]
# Table: account
[
(account_id, the id of the account. Value examples: [11382, 11362, 2, 1, 2367].),
(district_id, location of branch. Value examples: [77, 76, 2, 1, 39].),
(frequency, frequency of the acount. Value examples: [’POPLATEK MESICNE’, ’POPLATEK
TYDNE’, ’POPLATEK PO OBRATU’].),
(date, the creation date of the account. Value examples: [’1997-12-29’, ’1997-12-28’].)
]
# Table: client
[
(client_id, the unique number. Value examples: [13998, 13971, 2, 1, 2839].),
(gender, gender. Value examples: [’M’, ’F’]. And F:female . M:male ),
(birth_date, birth date. Value examples: [’1987-09-27’, ’1986-08-13’].), (district_id, location
of branch. Value examples: [77, 76, 2, 1, 39].)
]
# Table: loan
[
(loan_id, the id number identifying the loan data. Value examples: [4959, 4960, 4961].),
(account_id, the id number identifying the account. Value examples: [10, 80, 55, 43].),
(date, the date when the loan is approved. Value examples: [’1998-07-12’, ’1998-04-19’].),
(amount, the id number identifying the loan data. Value examples: [1567, 7877, 9988].),
(duration, the id number identifying the loan data. Value examples: [60, 48, 24, 12, 36].),
(payments, the id number identifying the loan data. Value examples: [3456, 8972, 9845].),
(status, the id number identifying the loan data. Value examples: [’C’, ’A’, ’D’, ’B’].)
]
# Table: district
[
(district_id, location of branch. Value examples: [77, 76].),
(A2, area in square kilometers. Value examples: [50.5, 48.9].),
(A4, number of inhabitants. Value examples: [95907, 95616].),
(A5, number of households. Value examples: [35678, 34892].),
(A6, literacy rate. Value examples: [95.6, 92.3, 89.7].),
(A7, number of entrepreneurs. Value examples: [1234, 1456].),
(A8, number of cities. Value examples: [5, 4].),
(A9, number of schools. Value examples: [15, 12, 10].),
(A10, number of hospitals. Value examples: [8, 6, 4].),
(A11, average salary. Value examples: [12541, 11277].),
(A12, poverty rate. Value examples: [12.4, 9.8].),
(A13, unemployment rate. Value examples: [8.2, 7.9].),
(A15, number of crimes. Value examples: [256, 189].)
]
[Foreign keys]
client.‘district_id‘ = district.‘district_id‘
[Question]
What is the gender of the youngest client who opened account in the lowest average salary branch?
[Evidence]
Later birthdate refers to younger age; A11 refers to average salary
[Answer]
```json
{
"account": "keep_all",
"client": "keep_all",
"loan": "drop_all",
"district": ["district_id", "A11", "A2", "A4", "A6", "A7"]
}
```
Question Solved.
==========
[DB_ID] {db_id}
[Schema]
{desc_str}
[Foreign keys]
{fk_str}
[Question]
{query}
[Evidence]
{evidence}
[Answer]
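For reference, below is a small sketch of how the Selector's JSON answer above could be applied to prune a schema; representing the schema as a dictionary from table name to column list is an assumption for illustration, not part of the released code.

```python
# Sketch: apply the Selector's JSON verdict ("keep_all" / "drop_all" / column list)
# to a schema represented as {table_name: [column, ...]} (assumed representation).
import json
from typing import Dict, List


def prune_schema(selector_answer: str, schema: Dict[str, List[str]]) -> Dict[str, List[str]]:
    verdict = json.loads(selector_answer)
    pruned = {}
    for table, columns in schema.items():
        decision = verdict.get(table, "keep_all")  # unmentioned tables are kept by default
        if decision == "keep_all":
            pruned[table] = columns
        elif decision == "drop_all":
            continue
        else:  # an explicit list of the most relevant columns
            pruned[table] = [c for c in columns if c in set(decision)]
    return pruned
```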
A.2 Decomposer Prompt
Given a [Database schema] description, a knowledge [Evidence] and the [Question], you need to
use valid SQLite and understand the database and knowledge, and then decompose the question
into subquestions for text-to-SQL generation.
==========
[Database schema]
# Table: frpm
[
(CDSCode, CDSCode. Value examples: [’01100170109835’, ’01100170112607’].),
(Charter School (Y/N), Charter School (Y/N). Value examples: [1, 0, None]. And 0: N;. 1: Y),
(Enrollment (Ages 5-17), Enrollment (Ages 5-17). Value examples: [5271.0, 4734.0].),
(Free Meal Count (Ages 5-17), Free Meal Count (Ages 5-17). Value examples: [3864.0, 2637.0].
And eligible free rate = Free Meal Count / Enrollment)
]
# Table: satscores
[
(cds, California Department Schools. Value examples: [’10101080000000’,
’10101080109991’].),
(sname, school name. Value examples: [’None’, ’Middle College High’, ’John F. Kennedy
High’, ’Independence High’, ’Foothill High’].),
(NumTstTakr, Number of Test Takers in this school. Value examples: [24305, 4942, 1, 0, 280].
And number of test takers in each school),
(AvgScrMath, average scores in Math. Value examples: [699, 698, 289, None, 492]. And
average scores in Math), (NumGE1500, Number of Test Takers Whose Total SAT Scores Are
Greater or Equal to 1500. Value examples: [5837, 2125, 0, None, 191]. And Number of Test Takers
Whose Total SAT Scores Are Greater or Equal to 1500. . commonsense evidence:. . Excellence
Rate = NumGE1500 / NumTstTakr)
]
[Foreign keys]
frpm.‘CDSCode‘ = satscores.‘cds‘
[Question]
List school names of charter schools with an SAT excellence rate over the average.
[Evidence]
Charter schools refers to ‘Charter School (Y/N)‘ = 1 in the table frpm; Excellence rate =
NumGE1500 / NumTstTakr
Decompose the question into sub questions, considering [Constraints], and generate the SQL after
thinking step by step:
Sub question 1: Get the average value of SAT excellence rate of charter schools.
SQL
```sql
SELECT AVG(CAST(T2.‘NumGE1500‘ AS REAL) / T2.‘NumTstTakr‘)
FROM frpm AS T1
INNER JOIN satscores AS T2
ON T1.‘CDSCode‘ = T2.‘cds‘
WHERE T1.‘Charter School (Y/N)‘ = 1
```
Sub question 2: List out school names of charter schools with an SAT excellence rate over the
average.
SQL
```sql
SELECT T2.‘sname‘
FROM frpm AS T1
INNER JOIN satscores AS T2
ON T1.‘CDSCode‘ = T2.‘cds‘
WHERE T2.‘sname‘ IS NOT NULL
AND T1.‘Charter School (Y/N)‘ = 1
AND CAST(T2.‘NumGE1500‘ AS REAL) / T2.‘NumTstTakr‘ > (
SELECT AVG(CAST(T4.‘NumGE1500‘ AS REAL) / T4.‘NumTstTakr‘)
FROM frpm AS T3
INNER JOIN satscores AS T4
ON T3.‘CDSCode‘ = T4.‘cds‘
WHERE T3.‘Charter School (Y/N)‘ = 1
)
```
Question Solved.
==========
[Database schema]
# Table: account
[
(account_id, the id of the account. Value examples: [11382, 11362, 2, 1, 2367].),
(district_id, location of branch. Value examples: [77, 76, 2, 1, 39].),
(frequency, frequency of the acount. Value examples: [’POPLATEK MESICNE’, ’POPLATEK
TYDNE’, ’POPLATEK PO OBRATU’].),
(date, the creation date of the account. Value examples: [’1997-12-29’, ’1997-12-28’].)
]
# Table: client
[
(client_id, the unique number. Value examples: [13998, 13971, 2, 1, 2839].),
(gender, gender. Value examples: [’M’, ’F’]. And F:female . M:male ),
(birth_date, birth date. Value examples: [’1987-09-27’, ’1986-08-13’].),
(district_id, location of branch. Value examples: [77, 76, 2, 1, 39].)
]
# Table: district
[
(district_id, location of branch. Value examples: [77, 76, 2, 1, 39].),
(A4, number of inhabitants . Value examples: [’95907’, ’95616’, ’94812’].),
(A11, average salary. Value examples: [12541, 11277, 8114].) ]
[Foreign keys]
account.‘district_id‘ = district.‘district_id‘
client.‘district_id‘ = district.‘district_id‘
[Question]
What is the gender of the youngest client who opened account in the lowest average salary branch?
[Evidence]
Later birthdate refers to younger age; A11 refers to average salary
Decompose the question into sub questions, considering [Constraints], and generate the SQL after
thinking step by step:
Sub question 1: What is the district_id of the branch with the lowest average salary?
SQL
```sql
SELECT ‘district_id‘
FROM district
ORDER BY ‘A11‘ ASC
LIMIT 1
```
Sub question 2: What is the youngest client who opened account in the lowest average
salary branch?
SQL
```sql
SELECT T1.‘client_id‘
FROM client AS T1
INNER JOIN district AS T2
ON T1.‘district_id‘ = T2.‘district_id‘
ORDER BY T2.‘A11‘ ASC, T1.‘birth_date‘ DESC
LIMIT 1
```
Sub question 3: What is the gender of the youngest client who opened account in the lowest
average salary branch?
SQL
```sql
SELECT T1.‘gender‘
FROM client AS T1
INNER JOIN district AS T2
ON T1.‘district_id‘ = T2.‘district_id‘
ORDER BY T2.‘A11‘ ASC, T1.‘birth_date‘ DESC
LIMIT 1
```
Question Solved.
==========
[Database schema]
{desc_str}
[Foreign keys]
{fk_str}
[Question]
{query}
[Evidence]
{evidence}
Decompose the question into sub questions, considering [Constraints], and generate the SQL after
thinking step by step:
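To connect this prompt with line 6 of Algorithm 1 (sql = subSQLs[-1]), the following is a small sketch of how the sub-questions and SQL blocks could be extracted from the Decomposer's reply; the regular expressions assume the "Sub question N:" and fenced ```sql block format shown above and are an illustration, not the released parser.

```python
# Sketch: parse the Decomposer's reply into (sub_questions, sub_sqls) and take the
# last SQL as the final answer, mirroring sql = subSQLs[-1] in Algorithm 1.
import re
from typing import List, Tuple


def parse_decomposer_reply(reply: str) -> Tuple[List[str], List[str]]:
    sub_questions = re.findall(r"Sub question \d+:\s*(.+)", reply)
    sub_sqls = [block.strip() for block in re.findall(r"```sql\s*(.*?)```", reply, re.DOTALL)]
    return sub_questions, sub_sqls


# Usage: the final SQL handed to the Refiner is the last extracted block.
# sub_qs, sub_sqls = parse_decomposer_reply(llm_reply)
# final_sql = sub_sqls[-1]
```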
A.3 Refiner Prompt
[Instruction]
When executing SQL below, some errors occurred, please fix up SQL based on query and database
info. Solve the task step by step if you need to. Using SQL format in the code block, and indicate
script type in the code block. When you find an answer, verify the answer carefully. Include
verifiable evidence in your response if possible.
[Constraints]
- In ‘SELECT <column>‘, just select needed columns in the [Question] without any unnecessary
column or value
- In ‘FROM <table>‘ or ‘JOIN <table>‘, do not include unnecessary table
- If use max or min func, ‘JOIN <table>‘ FIRST, THEN use ‘SELECT MAX(<column>)‘ or
‘SELECT MIN(<column>)‘
- If [Value examples] of <column> has ’None’ or None, use ‘JOIN <table>‘ or ‘WHERE <column>
is NOT NULL‘ is better
- If use ‘ORDER BY <column> ASC|DESC‘, add ‘GROUP BY <column>‘ before to select distinct
values
[Query]
{query}
[Evidence]
{evidence}
[Database info]
{desc_str}
[Foreign keys]
{fk_str}
[old SQL]
```sql
{sql}
```
[SQLite error]
{sqlite_error}
[Exception class]
{exception_class}
Now please fixup old SQL and generate new SQL again.
[correct SQL]
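A minimal sketch of how the error information fed into this prompt ([SQLite error] and [Exception class]) could be collected is shown below; the field names mirror the template above, while the function itself is an assumption rather than the released implementation.

```python
# Sketch: run a candidate SQL against SQLite and collect the feedback that the
# Refiner prompt consumes ([SQLite error] and [Exception class]).
import sqlite3
from typing import Optional, Tuple


def execute_and_analyze(db_path: str, sql: str) -> Tuple[bool, Optional[str], Optional[str]]:
    """Returns (ok, sqlite_error, exception_class)."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(sql).fetchall()
        return True, None, None
    except sqlite3.Error as err:
        return False, str(err), type(err).__name__
    finally:
        conn.close()
```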
[Figure 7 body omitted: each error type is illustrated with a question, evidence, gold SQL, predicted SQL, and an error description. The types shown include Semantic Correct (the same answer returned with a different column order), Question Misunderstand (the predicted SQL misses one of the names in the question), Dirty Database Values (the tables cards and set_translations both have the column setCode but with inconsistent values), Wrong Schema Linking (a different table join order and the wrong table id are used), Evidence Misunderstand (the knowledge "K-12" is misinterpreted as explicit grade filters), and Other Errors.]

Figure 7: Eight major types of error cases of BIRD are presented. Some cases are shortened for better presentation.