
LLM-based SQL Generation with Reinforcement Learning

Mariia Berdnyk¹, Marine Collery¹

¹IBM France Lab
mariia.berdnyk@ibm.com, marine.collery@ibm.com

Abstract

The text-to-SQL problem remains a challenging task, even with the advancements of Large Language Models (LLMs). Current state-of-the-art models require extensive preprocessing steps and powerful LLMs to achieve accurate SQL query generation, which leads to significant resource utilization. We introduce two models deriving from one another, SQL-RL-GEN and SQL-RL-GEN*, that improve text-to-SQL generation while minimizing the resources needed for training and maximizing flexibility. SQL-RL-GEN generates a reward function to guide the agent's training process, while SQL-RL-GEN* uses this reward function to tune a base LLM to solve the specified task. Our models achieve an accuracy improvement of 2-7% compared to state-of-the-art methods on a limited training dataset composed of only 1000 samples and with a small LLM of 248M parameters.

Code — https://github.com/IBM/sql-rl-gen
Datasets — https://ibm.box.com/v/sql-rl-gen-data

Introduction

Large Language Models (LLMs) have exhibited remarkable capabilities in various tasks, including text and code generation problems (Jiang et al. 2024). This success is largely attributed to the vast amount of data available for training and tuning processes.

The text-to-SQL generation problem is a critical area of research within the fields of natural language processing (NLP) and database systems. Since SQL remains one of the most widely used programming languages for database management (51.52%), text-to-SQL translation enables non-expert users to access structured databases as engineers do, using everyday language (Hong et al. 2024).

The current best text-to-SQL models, which achieve the top scores on the most comprehensive SQL datasets, modify the model structure by adding several preprocessing steps between the input and SQL generation. For instance, ExSL + granite-34b-code by IBM Research combines two steps before passing the question to the model: schema linking and content linking (Martineau 2024). SQLNet uses a sketch-based approach, incorporating a dependency graph to guide token predictions based on their dependencies (Xu, Liu, and Song 2017). However, the question remains open whether generation without a solid data background can be further improved and generalized easily, regardless of the model used. Another approach based on Reinforcement Learning (RL), Seq2SQL, uses basic rewards (1 for a correct query generation and -1 otherwise) obtained from in-the-loop query execution over the database to learn a policy for generating better queries (Zhong, Xiong, and Socher 2017). Despite the impressive results that Seq2SQL demonstrated at the time of its publication, subsequent work suggests that this basic reward is not enough to solve the text-to-SQL problem (Xu, Liu, and Song 2017).

Reward function design for a generation task demands significant human effort and is known to be notoriously difficult in practice (Sutton and Barto 1995). For this purpose, a generic novel reward design algorithm, EUREKA (Ma et al. 2024), powered by coding LLMs, was recently proposed. Unlike prior works using LLMs to aid reward design, EUREKA is completely free of task-specific prompts, reward templates, and few-shot examples (Ma et al. 2024). Instead, it uses evolutionary search and feedback to generate the best reward function with an LLM.

In this paper, we introduce two models deriving from one another, SQL-RL-GEN and SQL-RL-GEN*.

The SQL-RL-GEN algorithm finds the best reward function (the reference reward function) to be used for training an RL agent to generate SQL queries from text, with techniques similar to those proposed by EUREKA, i.e. implementing the reward design for SQL generation, feedback formulation, and an evolutionary search for the best reward function.

SQL-RL-GEN* uses the reference reward function generated by SQL-RL-GEN on a reference dataset to tune a base LLM (flan-t5-base) for SQL generation with limited resources.

The approach makes the following key contributions compared to existing work:

1. Versatility and efficiency of the reference reward function for SQL generation: SQL-RL-GEN* outperforms state-of-the-art SQL generation models on a different dataset than the one used to generate the reference reward function, with only 1000 samples used for training and a relatively small base LLM of 248M parameters. This makes SQL-RL-GEN* efficient in terms of resource utilization.
2. Domain adaptability: the SQL-RL-GEN algorithm is easily adaptable for generating reward functions in various text-to-code domains, enabling its application in diverse settings.
Figure 1: SQL-RL-GEN takes as inputs a system prompt, an SQL environment code, and a task description prompt. The coding LLM iteratively generates N reward function candidates, each used to train an SQL generation model from scratch with the RL Proximal Policy Optimization (PPO) algorithm. The resulting models are evaluated by comparing the rows obtained from the execution of the generated SQL queries with those from the ground-truth queries. The evaluation results (feedback) and the reward function selected as best by accuracy are fed back to the coding LLM for the next iteration. SQL-RL-GEN* is a special case where the best reward function from a previous SQL-RL-GEN training is used directly to train the RL agent.
Problem Statement

Given a textual prompt input p, which is part of the set of all possible textual prompts P = {p1, p2, ..., pn}, and an LLM L : P → O that maps prompts to code outputs in the space of all possible code outputs O = {o1, o2, ..., om}, our goal is to train L to generate an SQL query s ∈ S from the input prompt p, where S ⊆ O is the set of all possible SQL queries.

The prompt is represented as p = (I, T, Q), where:
• I is a set of possible instructions, e.g., "convert", "summarize", "answer", etc. It can be represented as a binary vector i ∈ {0, 1}^|I|, where each element corresponds to one of the instructions in I.
• T is a set of possible table schemas: T = (t1, t2, ..., tj). Each t is a single table, represented as a tuple of columns t = (c1, c2, ..., ck), where k is the number of columns in the table t.
• Q is a set of possible questions, e.g., "How many...", "What is...", etc. Each question can be represented as a string q.

As the instruction (I) for the problem remains unchanged, the training and testing datasets consist of pairs of input data (t, q) and the corresponding (ground truth) query s, such that a dataset D is defined as D = (((t1, q1), s1), ..., ((tN, qN), sN)), where N is the number of samples.

Once trained, the model Ltrained should return, for a specific prompt p, a generated SQL query sgen to be compared with the corresponding (ground truth) query s.
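To make this notation concrete, the sketch below shows one way a dataset sample and its prompt could be represented in Python. The field names and the textual prompt layout are illustrative assumptions, not the exact format used by SQL-RL-GEN.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Table:
    """A table schema t = (c1, ..., ck): a name plus its column names."""
    name: str
    columns: List[str]

@dataclass
class Sample:
    """One dataset entry ((t, q), s): tables and question paired with the gold query."""
    tables: List[Table]   # T
    question: str         # q in Q
    gold_query: str       # s, the ground-truth SQL query

def build_prompt(sample: Sample, instruction: str = "convert") -> str:
    """Assemble the textual prompt p = (I, T, Q) fed to the LLM L.

    The exact textual layout is an assumption made for illustration only.
    """
    schema = "; ".join(f"{t.name}({', '.join(t.columns)})" for t in sample.tables)
    return f"Instruction: {instruction}\nTables: {schema}\nQuestion: {sample.question}"

# Example usage with a hypothetical sample
sample = Sample(
    tables=[Table("employees", ["id", "name", "salary"])],
    question="How many employees earn more than 50000?",
    gold_query="SELECT COUNT(*) FROM employees WHERE salary > 50000",
)
print(build_prompt(sample))
```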
Method

An overview of the approach of SQL-RL-GEN is illustrated in Figure 1. An initialization step is followed by a loop composed of:
• the generation of a reward function,
• the training of the RL agent,
• the evaluation of the tuned SQL generation model and the supply of textual feedback.

Initialization. In the initialization stage, similarly to the original EUREKA approach, we provide the LLM with a prompt that outlines the task and the SQL environment. It is composed of the following parts.
1. The system prompt explicitly defines the role of the LLM as a reward engineer and provides an example of the reward function signature.
2. The task description specifies the goal of the model during training and generation. For SQL generation, it is set to "Converting question and database tables into SQL query".
3. The SQL environment component is crucial and provides the LLM with the context in which the trained agent will operate and execute generated reward functions during training. In the same manner as in EUREKA, SQL-RL-GEN feeds the raw environment source code (excluding reward code, if present) as context, with minimal explanations of external functions (Ma et al. 2024).

The entire initialization stage sets the generation goal, allowing adaptation to different tasks by modifying the initial prompts to solve similar problems in a comparable manner. All initialization prompts are available in the Appendix.
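As a rough illustration of this initialization step, the snippet below assembles the three components into a single prompt for the coding LLM, reusing the placeholder names that appear in the appendix prompts ({task_reward_signature_string}, {task_environment_code_string}, {task_description}). The constant names, the file-based loading of the environment source, and the concatenation scheme are assumptions for illustration, not the released implementation.

```python
from pathlib import Path

# Prompt templates with the placeholders shown in the appendix.
SYSTEM_PROMPT = (
    "You are a reward engineer trying to write reward functions to solve "
    "reinforcement learning tasks as effective as possible. ... An example of the "
    "reward function signature can be:\n```python\n{task_reward_signature_string}\n```"
)
TASK_PROMPT = (
    "The Python environment is {task_environment_code_string}. "
    "Write a reward function for the following task: {task_description}."
)

def build_initialization_prompt(env_source_file: str) -> str:
    """Fill in the reward signature, the raw environment source, and the task description."""
    signature = "def compute_reward(self, input_item, predicted_text) -> Tuple[float, Dict]"
    env_code = Path(env_source_file).read_text()  # raw environment source, reward code excluded
    task = "Converting question and database tables into SQL query"
    system = SYSTEM_PROMPT.format(task_reward_signature_string=signature)
    user = TASK_PROMPT.format(task_environment_code_string=env_code,
                              task_description=task)
    return system + "\n\n" + user
```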
Reward Function Generation and Training. Thanks to the provided prompts, the coding LLM generates multiple reward functions, which are used to train RL agents with the PPO algorithm (Schulman et al. 2017) in a similar manner to EUREKA, and to obtain a tuned SQL generation LLM.

Evaluation and Feedback. In order to improve the next iteration of reward function generation, textual feedback on the performance of the best tuned SQL generation LLM is provided to the coding LLM, along with the reward function with which this model was trained. The SQL generation LLM is considered the best (out of the multiple generated) if, after training, it yields a higher average accuracy during the evaluation step than the other models from both previous and current iterations.

To evaluate the performance of the tuned SQL generation LLM, similarly to the Seq2SQL approach, the evaluation step of both SQL-RL-GEN and SQL-RL-GEN* consists in comparing the SQL rows resulting from the execution of the generated SQL query with those obtained with the ground truth query. The generated queries are only executed when they do not modify the execution environment.

The evaluation results are saved, converted into text and provided back to the LLM as feedback, with quantitative information on the performance (accuracy, precision, recall, F1-score and intersection over union (IoU)). In addition, if errors are encountered during the execution of generated queries, the error types along with the error descriptions are returned in the feedback. The error descriptions do not provide specific information about the database context and are data independent.
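A minimal sketch of this execution-based comparison is shown below, using Python's standard sqlite3 module to run both queries and compare the returned rows. The metric definitions here (exact row-set match for accuracy, row-set IoU) and the SELECT-only guard are our reading of the description above, not the exact released implementation.

```python
import sqlite3
from typing import Optional, Set, Tuple

def run_query(db_path: str, query: str) -> Optional[Set[Tuple]]:
    """Execute a query and return its rows as a set, or None if execution fails."""
    try:
        with sqlite3.connect(db_path) as conn:
            return set(map(tuple, conn.execute(query).fetchall()))
    except sqlite3.Error:
        return None  # the error type/description would go into the textual feedback

def evaluate_pair(db_path: str, generated_sql: str, gold_sql: str) -> dict:
    """Compare the rows returned by the generated query with the ground-truth rows."""
    # Only execute queries that cannot modify the execution environment.
    if not generated_sql.lstrip().lower().startswith("select"):
        return {"executable": False, "accuracy": 0.0, "iou": 0.0}
    gen_rows = run_query(db_path, generated_sql)
    gold_rows = run_query(db_path, gold_sql)
    if gen_rows is None or gold_rows is None:
        return {"executable": gen_rows is not None, "accuracy": 0.0, "iou": 0.0}
    union = gen_rows | gold_rows
    return {
        "executable": True,
        "accuracy": float(gen_rows == gold_rows),  # exact row-set match
        "iou": len(gen_rows & gold_rows) / len(union) if union else 1.0,
    }
```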
As shown in Figure 1, SQL-RL-GEN* is derived from SQL-RL-GEN and consists in retrieving the best reward function generated by a former training of SQL-RL-GEN and using it to directly train an RL agent.
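Putting the pieces together, the loop below sketches the overall SQL-RL-GEN iteration from Figure 1: sample several candidate reward functions from the coding LLM, train an agent with each, keep the best by accuracy, and feed the evaluation summary back for the next round. The callables (sample_reward_candidates, train_with_ppo, evaluate_agent, format_feedback) are stand-ins for components described in this paper, not actual APIs from the released code.

```python
from typing import Callable, Dict, List

def sql_rl_gen(
    sample_reward_candidates: Callable[[str, int], List[str]],  # coding LLM wrapper: (feedback, n) -> reward function sources
    train_with_ppo: Callable[[str], object],                    # trains a fresh agent with a given reward function
    evaluate_agent: Callable[[object], Dict[str, float]],       # executes generated vs. gold queries, returns metrics
    format_feedback: Callable[[str, Dict[str, float]], str],    # turns metrics/errors into textual feedback
    iterations: int = 5,
    candidates_per_iter: int = 4,
) -> str:
    """Evolutionary search for the reference reward function (sketch of Figure 1)."""
    feedback = ""                 # textual feedback from the previous iteration
    best_fn, best_acc = "", -1.0
    for _ in range(iterations):
        # 1. The coding LLM proposes N candidate reward functions,
        #    conditioned on the initialization prompt and the latest feedback.
        for reward_fn in sample_reward_candidates(feedback, candidates_per_iter):
            # 2. Train an SQL generation agent from scratch with PPO
            #    using this candidate as the environment reward.
            agent = train_with_ppo(reward_fn)
            # 3. Evaluate by comparing rows from generated and ground-truth queries.
            metrics = evaluate_agent(agent)
            if metrics["accuracy"] > best_acc:
                best_fn, best_acc = reward_fn, metrics["accuracy"]
                feedback = format_feedback(reward_fn, metrics)
    # SQL-RL-GEN* reuses best_fn to train the final agent directly.
    return best_fn
```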
Experiments

In order to evaluate the validity and usefulness of SQL-RL-GEN, we apply it on the Spider dataset (Yu et al. 2019) to obtain our reference reward function. The WikiSQL dataset (Zhong, Xiong, and Socher 2017) is then used to evaluate the validity and robustness of this reference reward function with SQL-RL-GEN*.

Spider Dataset. Spider consists of 10181 questions and 5693 unique complex SQL queries on 200 databases with multiple tables, covering 138 different domains. In Spider 1.0, different complex SQL queries and databases appear in the train (8659 examples) and test (1034 examples) sets.

WikiSQL Dataset. WikiSQL consists of a corpus of 87726 hand-annotated SQL query and natural language question pairs. These SQL queries are further split into training (61297 examples), development (9145 examples) and test (17284 examples) sets.

Experimental Setting. For each dataset, a subset of 1000 randomly selected samples is used for training and another subset of 1000 randomly selected samples is used for testing. The experiments are carried out with a k-fold cross-validation strategy with k = 5.

The reward function generation and reflection are implemented using llama-3-405b-instruct (Touvron et al. 2023). This model is free, open-source and is known for its good instruction-following generation capabilities (Touvron et al. 2023), which makes it a better choice than the proprietary model described in the EUREKA reference paper. Characteristics of the model are available in Appendix, Table 4. The initial LLMs (agents) used for generating SQL queries are flan-t5-base (Chung et al. 2024) and a version of flan-t5-base pretrained on SQL syntax (noa 2023). The flan-t5-base transformer-based model consists of only 248 million parameters, which makes its training process computationally efficient and light. To evaluate the efficiency of SQL-RL-GEN*, the trained flan-t5-base was compared with Seq2SQL and SQLNet reference models trained on the same samples and configured according to their original papers. All agent characteristics can be found in Appendix, Table 5.

The PPO algorithm is configured in the exact same manner as in (Schulman et al. 2017) and as described in the EUREKA reference paper. The parameters are listed in Table 6 in the Appendix. However, unlike the original PPO approach, which only allows a single trial per sample before switching to another, for the training of SQL-RL-GEN and SQL-RL-GEN* we introduce an improvement that enables the model to experiment 10 times on the same sample before moving on. This allows the agent to learn from its mistakes and refine its policy for generating better SQL queries. By allowing multiple trials on the same sample, we can more effectively capture the nuances of text generation problems, which often demand a more refined approach than the original single-trial method.
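The multi-trial modification could look roughly like the rollout loop below, which keeps resampling a query for the same training sample (up to 10 attempts) and only moves on once the reward signals success or the budget is spent. generate_query and reward_for are stand-ins for the agent's sampling step and the environment reward, not functions from the released codebase.

```python
from typing import Callable, List, Tuple

def collect_rollouts(
    samples: List[dict],
    generate_query: Callable[[dict], str],      # agent sampling step (stand-in)
    reward_for: Callable[[dict, str], float],   # environment reward (stand-in)
    max_trials: int = 10,
) -> List[Tuple[dict, str, float]]:
    """Gather (sample, query, reward) tuples, allowing several attempts per sample."""
    rollouts = []
    for sample in samples:
        for _ in range(max_trials):
            query = generate_query(sample)
            reward = reward_for(sample, query)
            rollouts.append((sample, query, reward))  # every attempt is kept for the PPO update
            if reward > 0:  # stop early once a satisfactory query is produced
                break
    return rollouts
```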
All experiments are GPU-based and were conducted on a Lenovo ThinkPad P15 Gen 1 with an Intel Core i7-10750H CPU (12 cores), Quadro T1000/PCIe/SSE2 graphics with 4 GB of memory, running Red Hat Enterprise Linux 8.10.

Preliminary Results

SQL-RL-GEN and Reference Reward Function Generation. Training SQL-RL-GEN on the Spider dataset, with the flan-t5-base model as the initial SQL generation LLM, does not lead to any improvements in terms of accuracy (0%). This is due to the fact that the flan-t5-base model has not been trained on any code or SQL queries, that the training on the Spider dataset is severely limited by the constrained size of 1000 training samples, and that Spider features highly intricate and complex queries. However, as shown in Table 1, when training SQL-RL-GEN on the Spider dataset with a flan-t5-base model pretrained on SQL syntax as the initial SQL generation LLM, the performance in terms of accuracy improves by more than 3% and, on average, there are almost 3% more executable generated queries.

                 pretrained flan-t5-base    SQL-RL-GEN
accuracy (%)     44.7 ± 1.6                 48.0 ± 0.78
exec sgen (%)    61.5 ± 1.5                 64.3 ± 1.3

Table 1: Average accuracies and percentages of generated executable queries sgen, along with standard errors for 5-fold cross validation, for the initial LLM (flan-t5-base pretrained on SQL syntax) and after SQL-RL-GEN training on the Spider dataset. Metrics shown are obtained on the Spider testing dataset.

Versatility of the Reference Reward Function. As shown in Table 2, SQL-RL-GEN*, which uses the reference reward function to fine-tune flan-t5-base, outperforms the state-of-the-art models Seq2SQL (Zhong, Xiong, and Socher 2017) and SQLNet (Xu, Liu, and Song 2017) on the WikiSQL dataset, both in terms of accuracy and number of executable generated SQL queries. This points out the versatility of the reference reward function and how efficient SQL-RL-GEN* is in terms of resource utilization, as only 1000 samples were used for training compared to the entire dataset for the other models.

             Seq2SQL    SQLNet    SQL-RL-GEN*
accuracy     7.1%       11.3%     13.8%
exec sgen    12.8%      12.1%     30.6%

Table 2: Accuracies and percentages of executable generated queries sgen for Seq2SQL, SQLNet and SQL-RL-GEN* obtained on the WikiSQL test dataset.

Reusability of the Reference Reward Function. Finally, in order to validate that the reference reward function can also be used in other RL-based algorithms, we compared the Seq2SQL model to a version of Seq2SQL trained with our reference reward function, as shown in Table 3. The metrics employed for model evaluation align with those utilized in the original Seq2SQL paper (and are described in the Appendix). Again, usage of the reference reward function improved all of the different accuracies defined in (Zhong, Xiong, and Socher 2017) to evaluate SQL generation. This reward function can therefore be reused in other RL-based contexts in the text-to-SQL generation field.

               Seq2SQL    Seq2SQL with SQL-RL-GEN* reference reward function
Dev Accqm      53.1%      55.0%
Dev Accexec    60.4%      62.5%
Test Accqm     52.7%      55.3%
Test Accexec   60.0%      63.2%

Table 3: Accuracy comparison on the WikiSQL dataset between Seq2SQL and Seq2SQL with the SQL-RL-GEN* reference reward function. Accqm and Accexec indicate the query-match (string match) and the execution accuracy (correct result) (Zhong, Xiong, and Socher 2017), respectively, on the development and testing datasets.

Limitations and Future Directions

While SQL-RL-GEN and SQL-RL-GEN* show strong improvements with limited data, further analysis is needed:
1. Error Mitigation: The reward function penalizes syntax errors, logical inconsistencies, and schema mismatches. A detailed breakdown of its impact on correction rates would clarify its role in improving performance.
2. Generalization: The model improves when transferring from Spider to WikiSQL, but its adaptability to unseen schemas requires further evaluation across diverse benchmarks.
3. PPO Trials: Additional trials refine the reward function but increase computational cost. Analyzing diminishing returns could optimize efficiency.
4. Scalability: Testing on varied datasets and resource constraints would help assess robustness and adaptability.

Conclusion

We have presented SQL-RL-GEN and SQL-RL-GEN*, which derive from one another. The first proposes a reference reward function calibrated for SQL generation thanks to evolutionary search and feedback formulation (Ma et al. 2024); the second uses this reward function to tune an LLM with limited resources. The experiments demonstrated that SQL-RL-GEN* outperforms state-of-the-art methods and that the reference reward function can boost the generation capability of RL-based methods on the WikiSQL and Spider datasets.

References

2023. Hugging Face. https://huggingface.co/juierror/flan-t5-text2sql-with-schema-v2. Accessed: 2024-11-15.
Chung, H. W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S.; Webson, A.; Gu, S. S.; Dai, Z.; Suzgun, M.; Chen, X.; Chowdhery, A.; Castro-Ros, A.; Pellat, M.; Robinson, K.; Valter, D.; Narang, S.; Mishra, G.; Yu, A.; Zhao, V.; Huang, Y.; Dai, A.; Yu, H.; Petrov, S.; Chi, E. H.; Dean, J.; Devlin, J.; Roberts, A.; Zhou, D.; Le, Q. V.; and Wei, J. 2024. Scaling Instruction-Finetuned Language Models. Journal of Machine Learning Research, 25(70): 1–53.
Hong, Z.; Yuan, Z.; Zhang, Q.; Chen, H.; Dong, J.; Huang, F.; and Huang, X. 2024. Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL. arXiv:2406.08426.
Jiang, J.; Wang, F.; Shen, J.; Kim, S.; and Kim, S. 2024. A Survey on Large Language Models for Code Generation. arXiv:2406.00515.
Ma, Y. J.; Liang, W.; Wang, G.; Huang, D.-A.; Bastani, O.; Jayaraman, D.; Zhu, Y.; Fan, L.; and Anandkumar, A. 2024. Eureka: Human-Level Reward Design via Coding Large Language Models. arXiv:2310.12931.
Martineau, K. 2024. IBM text-to-SQL generator tops leaderboard. https://research.ibm.com/blog/granite-LLM-text-to-SQL. Accessed: 2024-10-31.
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal Policy Optimization Algorithms. arXiv:1707.06347.
Sutton, R. S.; and Barto, A. G. 1995. Reinforcement Learning: An Introduction. MIT Press.
Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; Rodriguez, A.; Joulin, A.; Grave, E.; and Lample, G. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971.
Xu, X.; Liu, C.; and Song, D. 2017. SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning. arXiv:1711.04436.
Yu, T.; Zhang, R.; Yang, K.; Yasunaga, M.; Wang, D.; Li, Z.; Ma, J.; Li, I.; Yao, Q.; Roman, S.; Zhang, Z.; and Radev, D. 2019. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. arXiv:1809.08887.
Zhong, V.; Xiong, C.; and Socher, R. 2017. Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning. arXiv:1709.00103.

Appendix - Initialization Prompts

System Prompt

You are a reward engineer trying to write reward functions to solve reinforcement learning tasks as effective as possible. Your goal is to write a reward function for the environment that will help the agent learn the task described in text. Your reward function should use useful variables from the environment as inputs. An example of the reward function signature can be:
```python
{task_reward_signature_string}
```
You need to generate the reward functions of EXACTLY this syntax. Everything else is not accepted. Please make sure that the code is compatible with Gym env. **PROVIDE ONLY PYTHON CODE.**

Task Description

The Python environment is {task_environment_code_string}. Write a reward function for the following task: {task_description}.

SQL environment

```python
class SQLRLEnv(TextRLEnv):
    def __init__(self, model, tokenizer, dataset, ...):
        super().__init__(model, tokenizer, observation_input,
                         max_length, compare_sample,
                         unfreeze_layer_from_past)
        ...

    def sql_query_execution_feedback(self, input_item,
                                     predicted_text) -> Dict:
        ...

    # Base method
    def get_reward(self, input_item, predicted_list, finish):
        if finish:
            predicted_text = self.tokenizer.convert_tokens_to_string(
                predicted_list[0])
            reward, metrics = self.compute_reward(input_item,
                                                  predicted_text)
            metrics["reward"] = reward
            ...
            return reward
        return 0.0

    # Skeleton of generation
    def compute_reward(self, input_item,
                       predicted_text) -> Tuple[float, Dict]
```
Appendix - Experimental Settings

                        llama-3-405b-instruct
Number of parameters    405B
Temperature             0.95
Context size            15 000
Decoding method         sample

Table 4: llama-3-405b-instruct and flan-t5-base characteristics.
                        flan-t5-base                       SQLNet                          Seq2SQL
Architecture            Encoder-Decoder Transformer (T5)   BiLSTM + attention + seq2set    Encoder-Decoder + RL
Number of parameters    248M                               38.5M                           37M
Pretrained              Yes                                No                              No
Fine-tuning required    Yes                                Yes                             Yes
Temperature             0.8                                0.8                             0.8

Table 5: Experimental flan-t5-base, Seq2SQL and SQLNet agent model characteristics.

Parameters Values
Tensors type F32
Temperature 0.8
Top k 100
Top p 0.85
Update interval 50
Minibatch size 512
Number of Epochs 5000
Number of steps 1000
Number of evaluation episodes 5
Maximum training episodes length 1000
Evaluation interval 10
Maximum new tokens 250
Minimum new tokens 10

Table 6: PPO algorithm settings.

Appendix - Reference Reward Function generated with SQL-RL-GEN

Figure 2: Reference Reward Function generated with SQL-RL-GEN and used for training of SQL-RL-GEN*.
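The actual reference reward function is only available as an image (Figure 2) and in the released code. Purely as an illustration of the expected shape, a hypothetical compute_reward implementation matching the skeleton of the SQL environment above could look like the following; the weighting scheme and the keys assumed to be returned by sql_query_execution_feedback are our assumptions, not the generated function itself.

```python
from typing import Dict, Tuple

# Illustrative only: a candidate of the kind the coding LLM is asked to produce,
# meant to be attached to SQLRLEnv (hence the `self` parameter).
def compute_reward(self, input_item, predicted_text) -> Tuple[float, Dict]:
    """Hypothetical reward: execution feedback blended with row overlap."""
    feedback = self.sql_query_execution_feedback(input_item, predicted_text)
    if not feedback.get("executable", False):
        # Penalize queries that fail to parse or execute.
        return -1.0, {"executable": 0.0, "iou": 0.0}
    iou = feedback.get("iou", 0.0)          # overlap between returned and ground-truth rows (assumed key)
    exact = 1.0 if feedback.get("rows_match", False) else 0.0
    reward = 0.2 + 0.5 * iou + 0.3 * exact  # small bonus for executing at all
    return reward, {"executable": 1.0, "iou": iou, "exact_match": exact}
```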
