
DTS-SQL: Decomposed Text-to-SQL with Small Large Language Models

Mohammadreza Pourreza
University of Alberta
pourreza@ualberta.ca

Davood Rafiei
University of Alberta
drafiei@ualberta.ca

Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8212-8220, November 12-16, 2024. ©2024 Association for Computational Linguistics

Abstract

Leading models for the text-to-SQL task heavily rely on proprietary Large Language Models (LLMs), posing concerns over data privacy. Closing the performance gap between small open-source models and large proprietary models is crucial to mitigate this reliance. To this end, we introduce a novel two-stage fine-tuning approach that decomposes the task into two simpler tasks. Through comprehensive evaluation on three large cross-domain datasets and two small LLMs, we show that this approach improves execution accuracy by 3 to 7 percent, effectively aligning the performance of open-source models with their proprietary counterparts. Our proposed method has achieved 60.31% execution accuracy on the BIRD hold-out test set, which is the highest performance among methods using 7B parameter models.

1 Introduction

Natural language interfaces for databases allow users to derive insights from structured databases using natural language instead of complex SQL queries. Leading open-source methods (Pourreza and Rafiei, 2024; Gao et al., 2023; Wang et al., 2023) for this task heavily depend on proprietary Large Language Models (LLMs) like GPT-4 and GPT-3.5-turbo, which have demonstrated superior performance in Text-to-SQL benchmarks (Yu et al., 2018; Li et al., 2023c; Gan et al., 2021). However, this reliance on large proprietary models has privacy and cost implications. For instance, many large enterprises cannot share their customer data with the model-providing companies due to privacy considerations. Additionally, cost is a factor, especially for small businesses, in adopting these models.

Recent attempts to utilize open-source LLMs (Gao et al., 2023) and fine-tune them using question-SQL query pairs have fallen short of the zero-shot performance of GPT-3.5-turbo. Table 1 presents a performance comparison of the fine-tuned open-source LLMs on the Spider development set, contrasting with methods that employ GPT-4's prompting techniques. Our hypothesis is that the task of text-to-SQL is too complex to be mastered in a single stage using small LLMs. We aim to address this disparity by introducing a novel two-step decomposed fine-tuning method, employing two smaller LLMs. This approach, utilizing a model with a parameter size of 7 billion, achieves a performance comparable to methods using GPT-4 with few-shot learning and well-designed prompts.

Model                                          EX     EM
Fine-tuning methods
Llama2 7B (Gao et al., 2023)                   66.7   63.9
Llama2 13B (Gao et al., 2023)                  67.0   62.7
Prompting methods
DAIL-SQL + GPT-4 (Gao et al., 2023)            84.4   74.4
DIN-SQL + GPT-4 (Pourreza and Rafiei, 2024)    74.2   60.1

Table 1: Performance comparison of prompting methods and fine-tuning methods on the Spider validation set. EX stands for execution accuracy and EM stands for exact match accuracy.

We evaluate the performance of our proposed method using three Text-to-SQL benchmarks: Spider (Yu et al., 2018), BIRD (Li et al., 2023c), and Spider-SYN (Gan et al., 2021), along with two 7B LLMs: DeepSeek (DeepSeek-AI, 2024) and Mistral (Jiang et al., 2023). Our approach demonstrates a performance improvement of approximately 3 to 7 percent in execution accuracy compared to the conventional single-step fine-tuning method employed in previous studies (Gao et al., 2023). This consistent performance gain across these datasets highlights the generalizability of our method. Moreover, our fine-tuning strategy, utilizing a 7 billion parameter LLM, surpasses all previous open-source methods on the Spider development set and
achieves comparable results to the state-of-the-art open-source methods using GPT-4 (Pourreza and Rafiei, 2024; Gao et al., 2023) on the Spider test set. On the hold-out BIRD test set, our method with DeepSeek 7B surpasses all of the methods using 7B parameter models (Li et al., 2024) and ranks second on the leaderboard among methods with publicly available papers, with 60.31% execution accuracy. All the necessary code to replicate the results is provided in our GitHub repository [1].

[1] https://anonymous.4open.science/r/DTS-SQL-2A42

1.1 Related Works

Early efforts by the database community, such as custom templates, marked initial advancements but required substantial manual effort (Zelle and Mooney, 1996). Recently, text-to-SQL methodologies have increasingly incorporated transformer-based models, particularly sequence-to-sequence architectures (Vaswani et al., 2017; Sutskever et al., 2014).

Initial sequence-to-sequence models, such as IRNet, utilized a bidirectional LSTM architecture and self-attention to encode the database schema representation (Guo et al., 2019). Advanced models like RAT-SQL (Wang et al., 2019) and RASAT (Qi et al., 2022) used relation-aware self-attention mechanisms. Models like SADGA (Cai et al., 2021) and LGESQL (Cao et al., 2021) adopted graph neural networks to represent relational structures between database schemas and queries.

The field has also benefited from recent methodological innovations in large language models (LLMs). Early approaches leveraged the zero-shot in-context learning capabilities of LLMs for SQL generation (Rajkumar et al., 2022). Subsequent models like DIN-SQL (Pourreza and Rafiei, 2024), DAIL-SQL (Gao et al., 2023), MAC-SQL (Wang et al., 2023), and C3 (Dong et al., 2023) have enhanced performance through task decomposition and techniques like Chain-of-Thought (CoT) prompting (Wei et al., 2022) and self-consistency (Wang et al., 2022). Concurrent with our work, Blar-SQL (Domínguez et al., 2024) proposed a fine-tuning approach for improving the performance of smaller LLMs. However, their method showed significantly lower performance compared to ours on the BIRD benchmark, with roughly a 9% gap on the BIRD development set.

2 Methodology

A notable development in LLMs is their post-pretraining refinement, which enhances their alignment with preferred behaviors, as documented by Mishra et al. (2021); Victor et al. (2022); Thoppilan et al. (2022). Common methods of alignment include Supervised Fine-Tuning (SFT) using human demonstrations, as reported by Ouyang et al. (2022); Tunstall et al. (2023), and Reinforcement Learning from Human Feedback (RLHF), as detailed by Christiano et al. (2017); Ziegler et al. (2019); Stiennon et al. (2020); Bai et al. (2022).

The absence of extensive datasets containing either human or AI feedback (Lee et al., 2023) has led to a predominant focus on supervised fine-tuning in the text-to-SQL field. This approach necessitates a collection of specific instructions or prompts along with their corresponding outputs or responses. In the following section, we delve into the established methods of supervised fine-tuning for LLMs within the Text-to-SQL context. Subsequently, we introduce our novel two-step fine-tuning approach, designed to enhance the performance of models in the Text-to-SQL domain.

2.1 Supervised fine-tuning for Text-to-SQL

In this section, we explore the supervised fine-tuning process for Text-to-SQL tasks, as practiced in the open-source community (Gao et al., 2023). Given a set of databases D_i comprising pairs of questions q_i and corresponding SQL queries s_i, the goal is to fine-tune a large language model M using a set of training data T = {(q_i, s_i, D_i)}, where q_i and s_i represent the natural language question and its associated SQL query on database D_i. The objective of supervised fine-tuning is to minimize the empirical loss defined as:

\min_{M^*} \frac{1}{|T|} \sum_{i=1}^{|T|} L(M^*(\sigma_f(q_i, D_i)), s_i)    (1)

where L is the next token prediction loss function used to measure the difference between the SQL queries generated by the model and the actual, correct (ground truth) queries. The function \sigma_f determines the formatting of the question, the database schema, and the SQL queries. A key challenge during inference is that we do not know in advance, among all of the tables inside the database, which tables are relevant to a given question for generating accurate SQL queries. Therefore, a common approach in fine-tuning involves including all of the tables within the prompts together with the question and SQL pairs. This method serves a dual purpose: teaching the model to generate the correct SQL query and to identify the relevant tables from among all the provided tables. This approach of training for two objectives simultaneously complicates the SQL generation task for LLMs. Each task - generating SQL queries and correctly linking to the relevant schema - demands its own reasoning process. A significant proportion of errors in large language models can be attributed to incorrect schema linking, highlighting this as a major challenge in the field (Pourreza and Rafiei, 2024; Dong et al., 2023).
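To make this single-stage formulation concrete, the sketch below shows one plausible implementation of the formatting function \sigma_f: the full schema of the database (every table) is serialized into the prompt together with the question, and the gold SQL query becomes the completion on which the next-token loss is computed. The prompt wording, schema serialization, and function names are illustrative assumptions; the exact format used for the experiments is described in Appendix A.5.

```python
# Illustrative sketch of the single-stage formatting function sigma_f.
# The concrete prompt wording is an assumption; see Appendix A.5 for the
# format actually used in the paper.

def serialize_schema(schema: dict) -> str:
    """Render every table as 'table(col1, col2, ...)', one line per table."""
    return "\n".join(f"{table}({', '.join(cols)})" for table, cols in schema.items())

def sigma_f(question: str, schema: dict) -> str:
    """Format the question together with the FULL schema (all tables)."""
    return (
        "### Database schema:\n"
        f"{serialize_schema(schema)}\n"
        "### Question:\n"
        f"{question}\n"
        "### SQL:\n"
    )

def build_sft_example(question: str, schema: dict, gold_sql: str) -> dict:
    # The model is trained with next-token prediction on `target`, conditioned
    # on `prompt` -- a single objective that mixes schema linking with SQL
    # generation, which is exactly the difficulty discussed above.
    return {"prompt": sigma_f(question, schema), "target": gold_sql}

if __name__ == "__main__":
    schema = {
        "singer": ["singer_id", "name", "country", "age"],
        "concert": ["concert_id", "venue", "year"],
        "singer_in_concert": ["concert_id", "singer_id"],
    }
    example = build_sft_example(
        "How many singers are from France?",
        schema,
        "SELECT count(*) FROM singer WHERE country = 'France';",
    )
    print(example["prompt"] + example["target"])
```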
2.2 Decomposed Supervised Fine-tuning

We propose a two-stage fine-tuning process, which separates schema linking and SQL generation, aiming to enhance overall performance.

2.2.1 Schema-linking Fine-tuning

Schema linking involves identifying the pertinent columns and tables in a database in response to natural language queries. It has been demonstrated to enhance cross-domain generalizability (Lei et al., 2020) and has been a part of the pipeline for both early seq-to-seq models (Cao et al., 2021; Guo et al., 2019; Xu et al., 2021) and recent in-context learning methods using LLMs (Pourreza and Rafiei, 2024; Wang et al., 2023). However, it has not been treated as a separate module for fine-tuning LLMs. In this work, we treat schema linking as a distinct task and explicitly fine-tune LLMs to identify relevant tables and columns when presented with a natural language query. Given a training dataset T = {(q_i, s_i, D_i)}, we extract all of the columns and tables used in the SQL queries and create a new dataset T = {(q_i, T_i, C_i, D_i)}, where T_i and C_i represent the lists of tables and columns used in the SQL query s_i. The primary objective during supervised fine-tuning for schema linking is to minimize the empirical loss, as defined by the following equation:

\min_{M^*} \frac{1}{|T|} \sum_{i=1}^{|T|} L(M^*(\sigma_s(q_i, D_i)), C_i, T_i)    (2)

Here, L represents the next token prediction loss, comparing the predicted column and table names with the actual ground truth names. Since our objective is to predict the set of relevant tables and columns, and order does not matter in set prediction but can affect the next token prediction loss, we always sort the schema in the prompt. By doing so, we ask the model to predict the schema in alphabetically sorted order to incorporate order in the prediction. Additionally, since the number of tokens required to include all columns and tables can exceed the context window size of smaller LLMs, we first extract the ground truth tables. Then, we sort the remaining tables based on their embedding similarity to the question's embedding. We continue adding tables in order of similarity until we reach the context window limit or there are no more tables to include. To avoid introducing any order bias, we shuffle the tables in the schema at the final step.

2.2.2 SQL Generation Fine-tuning

After identifying the appropriate tables for SQL generation, the next step is to utilize a model that constructs the SQL query based on the question and the schema of the correct tables. Since we have already identified the potentially correct tables using the schema-linking module, there is no need to include all tables in the input for the SQL generation model. In contrast to previous approaches for fine-tuning LLMs, we extract the relevant tables from the training dataset T = {(q_i, s_i, D_i)} corresponding to the ground truth SQL queries. We then fine-tune the LLM while minimizing the following loss function:

\min_{M^*} \frac{1}{|T|} \sum_{i=1}^{|T|} L(M^*(\sigma_g(q_i, T_i)), s_i)    (3)

The loss function is the same as the one defined in Section 2.1. This decomposition of the Text-to-SQL training process allows LLMs to be trained with a singular objective. By segregating the schema-linking and SQL query generation tasks, we improve the training process, enabling more focused and effective fine-tuning.
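The following sketch illustrates how the two training sets described above can be derived from a single annotated example (q_i, s_i, D_i). The SQL parsing, the embedding function, and the table budget are illustrative assumptions rather than the exact implementation; the ordering logic mirrors Section 2.2.1 (ground-truth tables targeted in alphabetical order, distractor tables added by embedding similarity to the question until the budget is exhausted, and the prompt schema shuffled), and the SQL-generation example keeps only the gold tables as in Section 2.2.2.

```python
# Sketch of building schema-linking and SQL-generation training examples.
# extract_tables_and_columns(), embed(), and the 16-table budget are
# illustrative assumptions, not the exact implementation.
import random
import re

def extract_tables_and_columns(sql: str, schema: dict):
    """Naive extraction of the schema items referenced by a gold SQL query."""
    tokens = set(re.findall(r"[A-Za-z_][A-Za-z_0-9]*", sql.lower()))
    tables = sorted(t for t in schema if t.lower() in tokens)  # alphabetical target order
    columns = sorted(
        f"{t}.{c}" for t in tables for c in schema[t] if c.lower() in tokens
    )
    return tables, columns

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return dot / norm if norm else 0.0

def build_schema_linking_example(question, sql, schema, embed, max_tables=16):
    gold_tables, gold_columns = extract_tables_and_columns(sql, schema)

    # Distractor tables are ranked by similarity to the question embedding and
    # added until the (illustrative) table budget is reached.
    q_vec = embed(question)
    rest = [t for t in schema if t not in gold_tables]
    rest.sort(key=lambda t: cosine(embed(t), q_vec), reverse=True)
    prompt_tables = gold_tables + rest[: max(0, max_tables - len(gold_tables))]
    random.shuffle(prompt_tables)  # avoid order bias in the prompt

    prompt = (
        "### Tables:\n"
        + "\n".join(f"{t}({', '.join(schema[t])})" for t in prompt_tables)
        + f"\n### Question:\n{question}\n### Relevant tables and columns:\n"
    )
    # The target is the alphabetically sorted set, so the output order is deterministic.
    target = ", ".join(gold_tables) + " | " + ", ".join(gold_columns)
    return {"prompt": prompt, "target": target}

def build_sql_generation_example(question, sql, schema):
    gold_tables, _ = extract_tables_and_columns(sql, schema)
    prompt = (
        "### Tables:\n"
        + "\n".join(f"{t}({', '.join(schema[t])})" for t in gold_tables)
        + f"\n### Question:\n{question}\n### SQL:\n"
    )
    return {"prompt": prompt, "target": sql}
```

At inference time the two fine-tuned models are chained in the same spirit: the schema-linking model is prompted first, its predicted tables are parsed from its output, and only those tables are serialized into the SQL-generation prompt.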
3 Experiments

3.1 Models

Our methodology was evaluated using two recent LLMs from distinct architectures, namely Mistral 7B (Jiang et al., 2023) and DeepSeek 7B (DeepSeek-AI, 2024). Mistral 7B, not specifically pretrained for code generation, surpasses many counterparts in its scale category (Jiang et al., 2023). Details about the hyperparameters are included in Appendix A.4.

3.2 Datasets

We conducted our evaluation using three cross-domain, challenging Text-to-SQL datasets: (1) Spider, introduced by Yu et al. (2018), includes 160 schemas allocated for training and development, while the remaining 40 are set aside for testing purposes. (2) Spider-Syn (Gan et al., 2021) modifies the Spider dataset by replacing schema-related words with synonyms and removing explicit mentions of schema links in the questions. (3) BIRD (Li et al., 2023c) is a pioneering, cross-domain dataset that examines the impact of extensive database contents on text-to-SQL parsing, with over 12,751 unique question-SQL pairs and 95 databases. Details about the metrics used for evaluation are provided in Appendix A.3. For the Spider results, we trained the models on the official Spider training set, and for the BIRD results, we fine-tuned the models on the BIRD training set.

3.3 Results

3.3.1 Spider test set

As depicted in Table 2, our method employing DeepSeek 7B, when tested on the Spider test dataset, achieves results comparable to state-of-the-art open-source methods in terms of execution accuracy and exact set match accuracy.

Model                                            EX     EM
DAIL-SQL + GPT-4 (Gao et al., 2023)              86.6   -
DIN-SQL + GPT-4 (Pourreza and Rafiei, 2024)      85.3   60
DTS-SQL + DeepSeek 7B (Ours)                     84.4   73.7
C3 + ChatGPT + Zero-Shot (Dong et al., 2023)     82.3   -
RESDSQL-3B + NatSQL (Li et al., 2023a)           79.9   72
DTS-SQL + Mistral (Ours)                         77.1   69.3
Graphix-3B + PICARD (Li et al., 2023b)           -      74

Table 2: The comparison of different methods on the test set of Spider.

3.3.2 Spider dev set

In Table 3, we offer a detailed comparison between our method and various other baseline approaches. For the baselines, we selected diverse methods from different families of approaches that use LLMs and are available as open source. Our two-stage decomposed approach with DeepSeek 7B attained state-of-the-art performance on the Spider development set, surpassing all previous methods that utilized prompting techniques and fine-tuning. Additionally, the results of our two-stage method on the Spider-SYN dataset are provided in Table 7 in the appendix.

Model                                            EX     EM
Instruction tuning methods
DTS-SQL + Mistral 7B (Ours)                      78.6   73.3
DTS-SQL + DeepSeek 7B (Ours)                     85.5   79.1
Llama2 7B (Gao et al., 2023)                     66.7   63.9
Llama2 13B (Gao et al., 2023)                    67.0   62.7
Prompting methods
DIN-SQL + GPT-4 (Pourreza and Rafiei, 2024)      74.2   60.1
DIN-SQL + CodeX (Pourreza and Rafiei, 2024)      69.9   57.2
DAIL-SQL + GPT-4 (Gao et al., 2023)              84.4   74.4
C3 + GPT-3.5 (Dong et al., 2023)                 81.8   -

Table 3: Performance of different methods with LLMs on the dev set of Spider.

To validate that the performance gain stems from the decomposition proposed in this work rather than from the base LLMs, we also compared our two-stage method against the same models with vanilla fine-tuning. In Table 4, we showcase the results of our two-stage fine-tuning method on the development set of Spider. The performance is compared against two distinct scenarios: firstly, a one-stage scenario where the model is fine-tuned on all tables without employing our two-stage approach, and secondly, a perfect schema linking scenario where we provide the ground truth tables to our fine-tuned SQL generators. This latter scenario is denoted as the 'Upper Bound' in the table. Our two-stage model's performance is measured by initially using our fine-tuned schema linker model to identify potentially relevant tables, which are then provided as context to the SQL generator model.

Model         Tuning        EX     EM
Mistral 7B    FT Tuning     71.9   70.9
Mistral 7B    DTS-SQL       78.6   73.3
Mistral 7B    Upper bound   86.6   80.7
DeepSeek 7B   FT Tuning     82.1   69.0
DeepSeek 7B   DTS-SQL       85.5   79.1
DeepSeek 7B   Upper bound   90.3   84.2

Table 4: Performance of the LLMs with different tuning methods on the Spider development set. FT stands for full-tables fine-tuning; the upper bound is the performance achievable with perfect schema linking.

3.4 BIRD results

To further validate the robustness of our proposed method, we also evaluated its performance on the BIRD benchmark test and development sets. As shown in Table 5, our proposed method with DeepSeek 7B achieved the second highest performance among all published works
on the test set, which shows the effectiveness of our method in achieving comparable results with proprietary LLMs, or even surpassing them, with a small LLM.

Model                                        Test EX   Dev EX
SFT CodeS-15B (Li et al., 2024)              60.37     58.47
DTS-SQL + DeepSeek (Ours)                    60.31     55.8
MAC-SQL + GPT-4 (Wang et al., 2023)          59.59     57.56
SFT CodeS-7B (Li et al., 2024)               59.25     57.17
DAIL-SQL + GPT-4 (Gao et al., 2023)          57.41     54.76
DIN-SQL + GPT-4 (Pourreza and Rafiei, 2024)  55.90     50.72
Blar-SQL (Domínguez et al., 2024)            -         46.68

Table 5: The comparison of different methods on the test set and development set of the BIRD benchmark.

3.4.1 Schema-linking Performance

As discussed in Section 2, our approach employs two LLMs: one for schema linking and another for SQL query generation. The schema-linking model plays a pivotal role in our pipeline, as inaccuracies in table detection could hinder the SQL generator's ability to formulate the correct SQL queries. We fine-tuned two models, based on the DeepSeek and Mistral models, for schema linking. Evaluation metrics, including exact set match, precision, and recall, were used to assess their performance. Detailed information about these models on two distinct datasets can be found in Table 6.

Model       Dataset      EX     PR     RE
DeepSeek    Spider       93.1   98.4   97.7
Mistral     Spider       91.1   97.5   97.8
DeepSeek    Spider-SYN   87.6   94.6   94.7
Mistral     Spider-SYN   85.3   91.2   90.5

Table 6: Performance of the schema-linker model on the Spider and Spider-SYN dev sets. PR stands for precision, RE is recall, and EX is exact set match accuracy.
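The set-level metrics reported in Table 6 can be computed as in the short sketch below, which scores a predicted table set against the gold set; the evaluation script used for the paper may apply additional normalization.

```python
# Sketch of the set-level metrics used for the schema-linking model:
# exact set match, precision, and recall over predicted vs. gold tables.
def schema_linking_metrics(predicted: list[str], gold: list[str]) -> dict:
    pred = {t.lower() for t in predicted}
    ref = {t.lower() for t in gold}
    tp = len(pred & ref)
    return {
        "exact_set_match": float(pred == ref),
        "precision": tp / len(pred) if pred else 0.0,
        "recall": tp / len(ref) if ref else 0.0,
    }

# Example: one spurious table keeps recall at 1.0 but lowers precision
# and misses the exact set match.
print(schema_linking_metrics(["singer", "concert"], ["singer"]))
# {'exact_set_match': 0.0, 'precision': 0.5, 'recall': 1.0}
```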
Figure 1: Precision, recall, and exact set match performance of the schema-linking model on different numbers of tables.

To further investigate the relationship between schema-linking performance and the effect of schema size, we conducted an analysis of precision, recall, and exact set match accuracy for the DeepSeek schema-linking model on the Spider test set with varying numbers of tables. The results, shown in Figure 1, indicate that exact set match accuracy generally decreases as the number of tables increases. However, the schema-linking model consistently maintains precision and recall values above 0.96, even with more than 14 tables.

4 Discussion

While our two-step approach has achieved comparable performance to larger models like GPT-4 on three large cross-domain datasets, there is still significant room for improvement, particularly for the schema-linking models. Currently, our schema-linking models achieve roughly 90% exact set match accuracy. However, as noted in Table 4, the substantial gap between the upper bound performance of the SQL generator and that of DTS-SQL calls for further research into schema linking.

5 Conclusion

Before our research, small open-source models lagged behind large proprietary models in performance on the text-to-SQL task. Our two-stage fine-tuning approach breaks down the task into two simpler components, enabling small open-source models to rival larger ones. Subsequent efforts could focus on enhancing the performance of these stages and exploring improved methods for transferring the output of one stage to the next.

Limitations

This paper has placed its primary emphasis on enhancing
the performance of both stages of fine-tuning small large language models for the Text-to-SQL task. However, there remains scope for further investigation and comparison of various techniques for schema linking. Exploring approaches like retrieval methods or in-context learning, when applied in conjunction with larger models such as GPT-4 for the schema-linking task, could yield valuable insights into identifying the most effective methodologies for schema linking.

Acknowledgements

We gratefully acknowledge the financial support provided by the Natural Sciences and Engineering Research Council of Canada (NSERC), which made this research possible.

Ethics Statement

In this paper, we place a strong emphasis on the significance of ethical considerations in every aspect of our research, from its inception to its presentation. We wholeheartedly commit to adhering to the ACL Ethics Policy and upholding ethical principles and guidelines throughout our research journey.

We have taken proactive measures to minimize any potential biases or discriminatory elements in our research design, data selection, and interpretation of results. Our dedication to transparency, precision, and fairness in reporting our findings is unwavering, and we have duly acknowledged and cited the work of others to give proper credit.

By incorporating this ethics statement, we aim to underscore our unwavering commitment to conducting research with integrity, respecting ethical principles, and contributing responsibly to the advancement of knowledge in our field.

References

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.

Ruichu Cai, Jinjie Yuan, Boyan Xu, and Zhifeng Hao. 2021. SADGA: Structure-aware dual graph aggregation network for text-to-SQL. Advances in Neural Information Processing Systems, 34:7664-7676.

Ruisheng Cao, Lu Chen, Zhi Chen, Yanbin Zhao, Su Zhu, and Kai Yu. 2021. LGESQL: Line graph enhanced text-to-SQL model with mixed local and non-local relations. arXiv preprint arXiv:2106.01093.

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.

Tri Dao. 2023. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems.

DeepSeek-AI. 2024. DeepSeek LLM: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954.

José Manuel Domínguez, Benjamín Errázuriz, and Patricio Daher. 2024. Blar-SQL: Faster, stronger, smaller NL2SQL. arXiv preprint arXiv:2401.02997.

Xuemei Dong, Chao Zhang, Yuhang Ge, Yuren Mao, Yunjun Gao, Jinshu Lin, Dongfang Lou, et al. 2023. C3: Zero-shot text-to-SQL with ChatGPT. arXiv preprint arXiv:2307.07306.

Yujian Gan, Xinyun Chen, Qiuping Huang, Matthew Purver, John R. Woodward, Jinxia Xie, and Pengsheng Huang. 2021. Towards robustness of text-to-SQL models against synonym substitution. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2505-2515, Online. Association for Computational Linguistics.

Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. 2023. Text-to-SQL empowered by large language models: A benchmark evaluation. arXiv preprint arXiv:2308.15363.

Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards complex text-to-SQL in cross-domain database with intermediate representation. arXiv preprint arXiv:1905.08205.

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825.

Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. 2023. RLAIF: Scaling reinforcement learning from human feedback with AI feedback. arXiv preprint arXiv:2309.00267.

Wenqiang Lei, Weixin Wang, Zhixin Ma, Tian Gan, Wei Lu, Min-Yen Kan, and Tat-Seng Chua. 2020. Re-examining the role of schema linking in text-to-SQL.
In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6943-6954.

Haoyang Li, Jing Zhang, Cuiping Li, and Hong Chen. 2023a. RESDSQL: Decoupling schema linking and skeleton parsing for text-to-SQL. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 13067-13075.

Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, and Hong Chen. 2024. CodeS: Towards building open-source language models for text-to-SQL. arXiv preprint arXiv:2402.16347.

Jinyang Li, Binyuan Hui, Reynold Cheng, Bowen Qin, Chenhao Ma, Nan Huo, Fei Huang, Wenyu Du, Luo Si, and Yongbin Li. 2023b. Graphix-T5: Mixing pre-trained transformers with graph-aware layers for text-to-SQL parsing. arXiv preprint arXiv:2301.07507.

Jinyang Li, Binyuan Hui, Ge Qu, Binhua Li, Jiaxi Yang, Bowen Li, Bailin Wang, Bowen Qin, Rongyu Cao, Ruiying Geng, et al. 2023c. Can LLM already serve as a database interface? A big bench for large-scale database grounded text-to-SQLs. arXiv preprint arXiv:2305.03111.

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2021. Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730-27744.

Mohammadreza Pourreza and Davood Rafiei. 2024. DIN-SQL: Decomposed in-context learning of text-to-SQL with self-correction. Advances in Neural Information Processing Systems, 36.

Jiexing Qi, Jingyao Tang, Ziwei He, Xiangpeng Wan, Yu Cheng, Chenghu Zhou, Xinbing Wang, Quanshi Zhang, and Zhouhan Lin. 2022. RASAT: Integrating relational structures into pretrained seq2seq model for text-to-SQL. arXiv preprint arXiv:2205.06983.

Nitarshan Rajkumar, Raymond Li, and Dzmitry Bahdanau. 2022. Evaluating the text-to-SQL capabilities of large language models. arXiv preprint arXiv:2204.00498.

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008-3021.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27.

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. 2022. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239.

Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, et al. 2023. Zephyr: Direct distillation of LM alignment. arXiv preprint arXiv:2310.16944.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Sanh Victor, Webson Albert, Raffel Colin, Bach Stephen, Sutawika Lintang, Alyafeai Zaid, Chaffin Antoine, Stiegler Arnaud, Raja Arun, Dey Manan, et al. 2022. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations.

Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2019. RAT-SQL: Relation-aware schema encoding and linking for text-to-SQL parsers. arXiv preprint arXiv:1911.04942.

Bing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, Qian-Wen Zhang, Zhao Yan, and Zhoujun Li. 2023. MAC-SQL: Multi-agent collaboration for text-to-SQL. arXiv preprint arXiv:2312.11242.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824-24837.

Kuan Xu, Yongbo Wang, Yongliang Wang, Zujie Wen, and Yang Dong. 2021. SeaD: End-to-end text-to-SQL generation with schema-aware denoising. arXiv preprint arXiv:2105.07911.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. arXiv preprint arXiv:1809.08887.
John M Zelle and Raymond J Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the National Conference on Artificial Intelligence, pages 1050-1055.

Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.

A Appendix

A.1 Spider-SYN dataset

To assess the efficacy of our proposed method, we evaluated its performance on the development set of Spider-SYN. Although Spider-SYN possesses a distinct training set, we opted to test our fine-tuned models directly on its development set, without any additional tuning on the Spider-SYN training set. The same performance gain is observed on this dataset (see Table 7) even though the model was not directly trained on it.

Model         Tuning        EX     EM
Mistral 7B    FT Tuning     67.0   63.9
Mistral 7B    DTS-SQL       71.1   64.6
Mistral 7B    Upper bound   81.9   74.5
DeepSeek 7B   FT Tuning     70.4   56.6
DeepSeek 7B   DTS-SQL       76.2   68.9
DeepSeek 7B   Upper bound   85.5   78.1

Table 7: Performance of the LLMs with different tuning methods on the Spider-SYN dev set. FT stands for full-tables fine-tuning; the upper bound is the performance achievable with perfect schema linking.

A.2 Error Analysis

In this section, following the exact same setting used in (Pourreza and Rafiei, 2024) for error analysis, we sampled 400 queries from the development set of Spider and compared our two-stage fine-tuning approach with vanilla full fine-tuning to evaluate the effect of our proposed method. As illustrated in Figure 2, our method consistently reduces the number of error cases across different classes of errors, with the largest improvement on the schema-linking class of errors. Additionally, having a separate schema-linking module also reduced the error cases for other classes, such as queries with GROUP BY or NESTED errors, which shows the importance of schema linking in helping the LLM generate the query by removing the confusing tables.

Figure 2: Error analysis on 400 queries from the Spider development set, comparing our DTS-SQL method with the vanilla full fine-tuning (FF) method.

A.3 Metrics

For Spider, we used exact set match accuracy (EM) and execution accuracy (EX). EM involves comparing the components of SQL queries, such as the select, where, having, group by, and order by clauses, focusing on the matching of columns and predicates without considering the order. EX determines equivalence between a model-generated query and a reference query if they produce identical results across various database instances. For BIRD, we used two metrics: the valid efficiency score (VES), which evaluates SQL query performance by considering both accuracy and execution, and execution accuracy. We achieved a similar ranking compared to other models on VES, but due to its high variance and dependence on the computational environment, we exclude it from the current analysis.
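As an illustration of the EX check described above, the sketch below executes a predicted and a gold query against a SQLite database file and compares their result multisets. The official Spider and BIRD evaluators repeat this comparison across multiple database instances and apply further normalization, so this is a simplification.

```python
# Simplified sketch of execution accuracy (EX): two queries are considered
# equivalent if they return the same results on the database.
import sqlite3
from collections import Counter

def execution_match(db_path: str, predicted_sql: str, gold_sql: str) -> bool:
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(predicted_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as a miss
    finally:
        conn.close()
    # Compare result multisets so that row order does not matter.
    return Counter(pred_rows) == Counter(gold_rows)
```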

A.4 Hyperparameters

The two LLMs, the schema-linking generator and the SQL generator, were trained on Nvidia Tesla A100 GPUs, employing batch sizes of 64 and 32 with learning rates of 1e-5 and 5e-5, respectively. To enhance training efficiency, we incorporated Flash Attention techniques as detailed in (Dao et al., 2022; Dao, 2023).
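A hedged sketch of what such a training configuration might look like with the Hugging Face Trainer API is shown below. Only the effective batch sizes, learning rates, and the use of FlashAttention-2 come from the description above; the per-device batch split, epoch counts, model checkpoint, and output paths are placeholder assumptions.

```python
# Illustrative training configurations for the two fine-tuning stages.
# Batch sizes, learning rates, and FlashAttention-2 follow the text above;
# everything else is a placeholder assumption.
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

def load_model(checkpoint: str):
    # `checkpoint` would be a 7B base model such as DeepSeek or Mistral.
    return AutoModelForCausalLM.from_pretrained(
        checkpoint,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",  # FlashAttention (Dao et al.)
    )

schema_linker_args = TrainingArguments(
    output_dir="schema_linker_ckpt",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,   # effective batch size 64, as reported
    learning_rate=1e-5,
    num_train_epochs=3,              # assumption; not stated above
    bf16=True,
)

sql_generator_args = TrainingArguments(
    output_dir="sql_generator_ckpt",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,   # effective batch size 32, as reported
    learning_rate=5e-5,
    num_train_epochs=3,              # assumption
    bf16=True,
)
```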

A.5 Prompt

In conducting all our experiments on both models, we adhered to a standardized prompt format to ensure consistency and facilitate reliable comparisons. The chosen prompt format is well established as effective in the Text-to-SQL domain, as demonstrated in prior research by Gao et al. (2023). In this format, we provided information about the foreign key constraints, primary keys, and column types. Furthermore, to guide the models in understanding how data is stored within the database, our prompt incorporated three sample rows showcasing data entries.

The specific prompts used for our experiments are shown in Figures 3 and 4, and the table representation they rely on is shown in Figure 5.

Figure 3: The prompt used for SQL generation. The database schema is where we put the table representations.

Figure 4: The prompt used for schema linking. The database schema is where we put the table representations.

Figure 5: A sample table representation. All of the tables in a database are represented as above and used in the prompts.