This makes SQL-RL-GEN∗ efficient in terms of resource utilization.
2. Domain adaptability: the SQL-RL-GEN algorithm is easily adaptable for generating reward functions in various text-to-code domains, enabling its application in diverse settings.
Problem Statement

Given a textual prompt input p, which is part of the set of all possible textual prompts P = {p1, p2, ..., pn}, and an LLM L : P → O that maps prompts to code outputs in the space of all possible code outputs O = {o1, o2, ..., om}, our goal is to train L to generate an SQL query s ∈ S from the input prompt p, where S ⊆ O is the set of all possible SQL queries.
The prompt is represented as p = (I, T, Q), where:
• I is a set of possible instructions, e.g., "convert", "summarize", "answer", etc. It can be represented as a binary vector i ∈ {0, 1}^|I|, where each element corresponds to one of the instructions in I.
• T is a set of possible table schemas: T = (t1, t2, ..., tj). t is a single table, represented as a tuple of columns t = (c1, c2, ..., ck), where k is the number of columns in the table t.
• Q is a set of possible questions, e.g., "How many...", "What is...", etc. Each question can be represented as a string q.
As the instruction (I) for the problem remains unchanged, the training and testing datasets consist of pairs of input data (t, q) and corresponding (ground truth) queries s, such that a dataset D is defined as D = {((t1, q1), s1), ..., ((tN, qN), sN)}, where N is the number of samples.

Once trained, the model Ltrained should return, for a specific prompt p, a generated SQL query sgen to be compared with the corresponding (ground truth) query s.
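To make this data layout concrete, the sketch below shows one possible Python representation of a sample ((t, q), s) and of the prompt p = (I, T, Q); the class and field names are illustrative and are not taken from the paper's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TableSchema:
    # A single table t represented as a tuple of columns (c1, ..., ck).
    name: str
    columns: Tuple[str, ...]


@dataclass
class Sample:
    # One dataset element ((t, q), s): table schema and question as input,
    # ground-truth SQL query s as target.
    instruction: str    # element of I, fixed to the same value for this task
    table: TableSchema  # element of T
    question: str       # element of Q
    gold_sql: str       # ground-truth query s


def render_prompt(sample: Sample) -> str:
    """Flatten p = (I, T, Q) into the textual prompt fed to the LLM."""
    schema = f"{sample.table.name}({', '.join(sample.table.columns)})"
    return f"{sample.instruction}\nTables: {schema}\nQuestion: {sample.question}"


# Dataset D = {((t1, q1), s1), ..., ((tN, qN), sN)} with a single illustrative sample.
dataset: List[Sample] = [
    Sample(
        instruction="Converting question and database tables into SQL query",
        table=TableSchema("singer", ("singer_id", "name", "country", "age")),
        question="How many singers are there?",
        gold_sql="SELECT count(*) FROM singer",
    ),
]
print(render_prompt(dataset[0]))
```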
Method

An overview of the approach of SQL-RL-GEN is illustrated in Figure 1. An initialization step is followed by a loop composed of:
• the generation of a reward function,
• the training of the RL agent,
• the evaluation of the tuned SQL generation model and the supply of textual feedback.

Initialization. In the initialization stage, similarly to the original EUREKA approach, we provide the LLM with a prompt that outlines the task and the SQL environment. It is composed of the following parts.
1. The system prompt explicitly defines the role of the LLM as a reward engineer and provides an example of the reward function signature.
2. The task description specifies the goal of the model during training and generation. For SQL generation, it is set to "Converting question and database tables into SQL query".
3. The SQL environment component is crucial and provides the LLM with the context in which the trained agent will operate and execute generated reward functions during training. In the same manner as in EUREKA, SQL-RL-GEN feeds the raw environment source code (excluding reward code, if present) as context, with minimal explanations of external functions (Ma et al. 2024).
The entire initialization stage sets the generation goal, allowing adaptation to different tasks by modifying the initial prompts to solve similar problems in a comparable manner. All initialization prompts are available in the Appendix.
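As an illustration of how these three components could be assembled into a single prompt for the coding LLM, the following sketch combines a system prompt, the task description and the raw environment source code; the constant and helper names are assumptions, and the prompt wording is abbreviated (the full texts are given in the Appendix).

```python
import inspect

# Abbreviated versions of the prompts listed in the Appendix.
SYSTEM_PROMPT = (
    "You are a reward engineer trying to write reward functions to solve "
    "reinforcement learning tasks as effective as possible. ..."
)
TASK_DESCRIPTION = "Converting question and database tables into SQL query"


def build_initialization_prompt(env_class) -> str:
    """Combine the system prompt, the task description and the raw environment
    source code (reward code assumed to be stripped upstream) into one prompt."""
    env_source = inspect.getsource(env_class)  # environment code given as context
    return (
        f"{SYSTEM_PROMPT}\n\n"
        f"The Python environment is:\n```python\n{env_source}\n```\n"
        f"Write a reward function for the following task: {TASK_DESCRIPTION}."
    )
```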
Reward Function Generation and Training. Thanks to the provided prompts, the coding LLM generates multiple reward functions that are used to train RL agents with the PPO algorithm (Schulman et al. 2017), in a similar manner to EUREKA, and to obtain a tuned SQL generation LLM.
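The resulting loop can be sketched as follows, with hypothetical helper callables (sample_reward_functions, train_with_ppo, evaluate) standing in for the actual components: the coding LLM proposes several candidate reward functions, each candidate is used to train an RL agent with PPO, and the best-performing candidate plus its evaluation feedback are carried to the next iteration.

```python
def reward_search(coding_llm, agent_init, env, base_prompt,
                  sample_reward_functions, train_with_ppo, evaluate,
                  iterations=5, candidates=4):
    """Eureka-style outer loop (helper callables are hypothetical stand-ins):
    propose reward functions, train an agent with each, keep the best one."""
    best = {"accuracy": float("-inf"), "reward_fn": None, "agent": None}
    feedback = ""  # textual feedback from the previous iteration
    for _ in range(iterations):
        # 1. The coding LLM proposes several candidate reward functions.
        reward_fns = sample_reward_functions(coding_llm, base_prompt + feedback,
                                             n=candidates)
        for reward_fn in reward_fns:
            # 2. Train an RL agent with PPO using this candidate reward.
            agent = train_with_ppo(agent_init, env, reward_fn)
            # 3. Evaluate the tuned SQL generation model.
            metrics = evaluate(agent, env)
            if metrics["accuracy"] > best["accuracy"]:
                best = {"accuracy": metrics["accuracy"],
                        "reward_fn": reward_fn, "agent": agent}
        # 4. Convert the best evaluation results into textual feedback
        #    for the next round of reward generation.
        feedback = f"\nBest accuracy so far: {best['accuracy']:.3f}"
    return best
```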
Evaluation and Feedback. In order to improve the next iteration of reward function generation, textual feedback on the performance of the best tuned SQL generation LLM is provided to the coding LLM, together with the reward function with which this model was trained. The SQL generation LLM is considered the best (out of the multiple generated) if, after training, it yields a higher average accuracy during the evaluation step than the other models from both previous and current iterations.

To evaluate the performance of the tuned SQL generation LLM, similarly to the Seq2SQL approach, the evaluation step of both SQL-RL-GEN and SQL-RL-GEN∗ consists of comparing the SQL rows resulting from the execution of the generated SQL query with the ones obtained with the ground truth query. The generated queries are only executed when they do not modify the execution environment.

The evaluation results are saved, converted into text and provided back to the LLM as feedback, with quantitative information on the performance (accuracy, precision, recall, F1-score and intersection over union (IoU)). In addition, if errors are encountered during the execution of generated queries, the error types along with the error descriptions are returned in the feedback. The error descriptions do not provide specific information about the database context and are data independent.
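A minimal sketch of this execution-based comparison is given below, assuming a SQLite copy of the database; only the exact-match accuracy and the IoU metric are illustrated, and the function names are not taken from the paper's implementation.

```python
import sqlite3
from typing import Dict

READ_ONLY_PREFIXES = ("select",)  # only non-mutating queries are executed


def execute(conn: sqlite3.Connection, query: str):
    """Run a query if it cannot modify the environment; return rows or an error tag."""
    if not query.strip().lower().startswith(READ_ONLY_PREFIXES):
        return None, "non_read_only_query"
    try:
        return set(map(tuple, conn.execute(query).fetchall())), None
    except sqlite3.Error as exc:
        # Error type and message are returned as data-independent feedback.
        return None, f"{type(exc).__name__}: {exc}"


def compare(conn: sqlite3.Connection, generated: str, gold: str) -> Dict[str, object]:
    """Compare the result rows of the generated query with the ground-truth ones."""
    gen_rows, gen_err = execute(conn, generated)
    gold_rows, _ = execute(conn, gold)
    if gen_err is not None or gold_rows is None:
        return {"accuracy": 0.0, "iou": 0.0, "error": gen_err}
    inter = len(gen_rows & gold_rows)
    union = len(gen_rows | gold_rows) or 1
    return {
        "accuracy": float(gen_rows == gold_rows),  # exact result-set match
        "iou": inter / union,                      # intersection over union of rows
        "error": None,
    }
```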
As shown in Figure 1, SQL-RL-GEN∗ is derived from SQL-RL-GEN and consists of retrieving the best generated reward function from a former training of SQL-RL-GEN and using it to directly train an RL agent.
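SQL-RL-GEN∗ therefore skips the search loop entirely; with the same kind of hypothetical helpers as above, it reduces to reloading the stored reward function and launching a single PPO training run.

```python
def sql_rl_gen_star(reward_fn_path, agent_init, env,
                    load_reward_fn, train_with_ppo):
    """SQL-RL-GEN*: reuse the best reward function found by a previous
    SQL-RL-GEN run and train the RL agent with it directly, without the
    reward-search loop (helper callables are hypothetical)."""
    reward_fn = load_reward_fn(reward_fn_path)  # e.g. the stored generated reward code
    return train_with_ppo(agent_init, env, reward_fn)
```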
Experiments

In order to evaluate the validity and usefulness of SQL-RL-GEN, we apply it on the Spider dataset (Yu et al. 2019) to obtain our reference reward function. The WikiSQL dataset (Zhong, Xiong, and Socher 2017) is then used to evaluate the validity and robustness of this reference reward function through SQL-RL-GEN∗.
Spider Dataset. Spider consists of 10181 questions and 5693 unique complex SQL queries on 200 databases with multiple tables, covering 138 different domains. In Spider 1.0, different complex SQL queries and databases appear in the train (8659 examples) and test (1034 examples) sets.
WikiSQL Dataset. WikiSQL consists of a corpus of 87726 hand-annotated SQL query and natural language question pairs. These SQL queries are further split into training (61297 examples), development (9145 examples) and test (17284 examples) sets.
Experimental Setting. For each dataset, a subset of 1000 randomly selected samples is used for training and another subset of 1000 randomly selected samples is used for testing. The experiments are carried out with a k-fold cross-validation strategy with k = 5.

The reward function generation and reflection are implemented using llama-3-405b-instruct (Touvron et al. 2023). This model is free, open-source and known for its good instructed generation capabilities (Touvron et al. 2023), which makes it a better choice than the proprietary one described in the EUREKA reference paper. Characteristics of the model are available in the Appendix, Table 4. The initial LLMs (agents) used for generating SQL queries are flan-t5-base (Chung et al. 2024) and a version of flan-t5-base pretrained on SQL syntax (noa 2023). The flan-t5-base transformer-based model consists of only 248 million parameters, which makes its training process computationally efficient and light. To evaluate the efficiency of SQL-RL-GEN∗, the trained flan-t5-base was compared with the Seq2SQL and SQLNet reference models, trained on the same samples and configured according to their original papers. All agent characteristics can be found in the Appendix, Table 5.

The PPO algorithm is configured in the exact same manner as in (Schulman et al. 2017) and as described in the EUREKA reference paper. The parameters are listed in Table 6 in the Appendix. However, unlike the original PPO approach, which only allows a single trial per sample before switching to another, for the training of SQL-RL-GEN and SQL-RL-GEN∗ we introduce an improvement by enabling the model to experiment 10 times on the same sample before moving on. This allows the agent to learn from its mistakes and refine its policy for generating better SQL queries, and it more effectively captures the nuances of text generation problems, which often demand a more refined approach than the original single-trial method.

All experiments are GPU-based and were conducted on a Lenovo ThinkPad P15 Gen 1 with an Intel Core i7-10750H CPU, 12 cores, Quadro T1000/PCIe/SSE2 graphics with 4 GB of memory, running Red Hat Enterprise Linux 8.10.
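The multi-trial modification can be sketched as follows, assuming a hypothetical agent and environment API: the agent is given up to 10 attempts on each sample, receiving a reward and a PPO update after every attempt, before training moves on to the next sample.

```python
MAX_TRIALS_PER_SAMPLE = 10  # the original PPO setup uses a single trial


def train_on_samples(agent, samples, compute_reward, ppo_update):
    """Give the agent several attempts on each sample so that it can learn
    from its own errors before moving on (agent, compute_reward and
    ppo_update are hypothetical stand-ins for the real components)."""
    for sample in samples:
        for _ in range(MAX_TRIALS_PER_SAMPLE):
            query = agent.generate(sample)            # propose an SQL query
            reward = compute_reward(sample, query)    # execution-based reward
            ppo_update(agent, sample, query, reward)  # policy update after each trial
```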
Preliminary Results

SQL-RL-GEN and Reference Reward Function Generation. Training SQL-RL-GEN on the Spider dataset, with the flan-t5-base model as initial SQL generation LLM, does not lead to any improvements in terms of accuracy (0%). This is due to the fact that the flan-t5-base model has not been trained on any code or SQL queries, that the training on the Spider dataset is severely limited by the constrained size of 1000 training samples, and that Spider features highly intricate and complex queries. However, as shown in Table 1, when training SQL-RL-GEN on the Spider dataset with a flan-t5-base model pretrained on SQL syntax as initial SQL generation LLM, the performance in terms of accuracy is improved by more than 3% and, on average, there are almost 3% more executable generated queries.

                 pretrained flan-t5-base   SQL-RL-GEN
accuracy (%)     44.7 ± 1.6                48.0 ± 0.78
exec sgen (%)    61.5 ± 1.5                64.3 ± 1.3

Table 1: Average accuracies and percentages of generated executable queries sgen, along with standard errors for 5-fold cross-validation, for the initial LLM (flan-t5-base pretrained on SQL syntax) and after SQL-RL-GEN training on the Spider dataset. Metrics shown are obtained on the Spider testing dataset.
Versatility of the Reference Reward Function. As shown in Table 2, SQL-RL-GEN∗, which uses the reference reward function to fine-tune flan-t5-base, outperforms the state-of-the-art models Seq2SQL (Zhong, Xiong, and Socher 2017) and SQLNet (Xu, Liu, and Song 2017) on the WikiSQL dataset, both in terms of accuracy and number of executable generated SQL queries. This points out the versatility of the reference reward function and how efficient SQL-RL-GEN∗ is in terms of resource utilization, as only 1000 samples were used for training compared to the entire dataset for the other models.

             Seq2SQL   SQLNet   SQL-RL-GEN∗
accuracy     7.1%      11.3%    13.8%
exec sgen    12.8%     12.1%    30.6%

Table 2: Accuracies and percentages of executable generated queries sgen for Seq2SQL, SQLNet and SQL-RL-GEN∗ obtained on the WikiSQL test dataset.
Reusability of the Reference Reward Function. Finally, in order to validate that the reference reward function can also be used in other RL-based algorithms, we compared the Seq2SQL model to a version of Seq2SQL trained with our reference reward function, as shown in Table 3. The metrics employed for model evaluation align with those utilized in the original Seq2SQL paper (and are described in the Appendix). Again, usage of the reference reward function improved all of the different accuracies defined in (Zhong, Xiong, and Socher 2017) to evaluate SQL generation. This reward function can therefore be reused in other RL-based contexts in the text-to-SQL generation field.

                Seq2SQL   Seq2SQL with SQL-RL-GEN∗ reference reward function
Dev Accqm       53.1%     55.0%
Dev Accexec     60.4%     62.5%
Test Accqm      52.7%     55.3%
Test Accexec    60.0%     63.2%

Table 3: Accuracy comparison on the WikiSQL dataset between Seq2SQL and Seq2SQL with the SQL-RL-GEN∗ reference reward function. Accqm and Accexec indicate the query-match (string match) accuracy and the execution accuracy (correct result) (Zhong, Xiong, and Socher 2017), respectively, on the development and testing datasets.
Limitations and Future Directions

While SQL-RL-GEN and SQL-RL-GEN∗ show strong improvements with limited data, further analysis is needed:
1. Error Mitigation: The reward function penalizes syntax errors, logical inconsistencies, and schema mismatches. A detailed breakdown of its impact on correction rates would clarify its role in improving performance.
2. Generalization: The model improves when transferring from Spider to WikiSQL, but its adaptability to unseen schemas requires further evaluation across diverse benchmarks.
3. PPO Trials: Additional trials refine the reward function but increase computational cost. Analyzing diminishing returns could optimize efficiency.
4. Scalability: Testing on varied datasets and resource constraints would help assess robustness and adaptability.

Conclusion

We have presented SQL-RL-GEN and SQL-RL-GEN∗, the latter deriving from the former. The first proposes a reference reward function calibrated for SQL generation thanks to evolutionary search and feedback formulation (Ma et al. 2024), which can be used by the second to tune an LLM with limited resources. The experiments demonstrated that SQL-RL-GEN∗ outperforms state-of-the-art methods and that the reference reward function can boost the generation capability of RL-based methods on the WikiSQL and Spider datasets.
References

2023. Hugging Face. https://huggingface.co/juierror/flan-t5-text2sql-with-schema-v2. Accessed: 2024-11-15.
Chung, H. W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S.; Webson, A.; Gu, S. S.; Dai, Z.; Suzgun, M.; Chen, X.; Chowdhery, A.; Castro-Ros, A.; Pellat, M.; Robinson, K.; Valter, D.; Narang, S.; Mishra, G.; Yu, A.; Zhao, V.; Huang, Y.; Dai, A.; Yu, H.; Petrov, S.; Chi, E. H.; Dean, J.; Devlin, J.; Roberts, A.; Zhou, D.; Le, Q. V.; and Wei, J. 2024. Scaling Instruction-Finetuned Language Models. Journal of Machine Learning Research, 25(70): 1–53.
Hong, Z.; Yuan, Z.; Zhang, Q.; Chen, H.; Dong, J.; Huang, F.; and Huang, X. 2024. Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL. arXiv:2406.08426.
Jiang, J.; Wang, F.; Shen, J.; Kim, S.; and Kim, S. 2024. A Survey on Large Language Models for Code Generation. arXiv:2406.00515.
Ma, Y. J.; Liang, W.; Wang, G.; Huang, D.-A.; Bastani, O.; Jayaraman, D.; Zhu, Y.; Fan, L.; and Anandkumar, A. 2024. Eureka: Human-Level Reward Design via Coding Large Language Models. arXiv:2310.12931.
Martineau, K. 2024. IBM text-to-SQL generator tops leaderboard. https://research.ibm.com/blog/granite-LLM-text-to-SQL. Accessed: 2024-10-31.
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal Policy Optimization Algorithms. arXiv:1707.06347.
Sutton, R. S.; and Barto, A. G. 1995. Reinforcement Learning: An Introduction. MIT Press.
Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; Rodriguez, A.; Joulin, A.; Grave, E.; and Lample, G. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971.
Xu, X.; Liu, C.; and Song, D. 2017. SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning. arXiv:1711.04436.
Yu, T.; Zhang, R.; Yang, K.; Yasunaga, M.; Wang, D.; Li, Z.; Ma, J.; Li, I.; Yao, Q.; Roman, S.; Zhang, Z.; and Radev, D. 2019. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. arXiv:1809.08887.
Zhong, V.; Xiong, C.; and Socher, R. 2017. Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning. arXiv:1709.00103.
Appendix - Initialization Prompts

System Prompt

You are a reward engineer trying to write reward functions to solve reinforcement learning tasks as effective as possible. Your goal is to write a reward function for the environment that will help the agent learn the task described in text. Your reward function should use useful variables from the environment as inputs. An example of the reward function signature can be:
```python
{task_reward_signature_string}
```
You need to generate the reward functions of EXACTLY this syntax. Everything else is not accepted. Please make sure that the code is compatible with Gym env. **PROVIDE ONLY PYTHON CODE.**

Task Description

The Python environment is {task_environment_code_string}. Write a reward function for the following task: {task_description}.

SQL environment

```python
class SQLRLEnv(TextRLEnv):
    def __init__(self, model, tokenizer, dataset, ...):
        super().__init__(model, tokenizer, observation_input,
                         max_length, compare_sample,
                         unfreeze_layer_from_past)
        ...

    def sql_query_execution_feedback(self, input_item,
                                     predicted_text) -> Dict:
        ...

    # Base method
    def get_reward(self, input_item, predicted_list, finish):
        if finish:
            predicted_text = self.tokenizer.convert_tokens_to_string(
                predicted_list[0])
            reward, metrics = self.compute_reward(input_item,
                                                  predicted_text)
            metrics["reward"] = reward
            ...
            return reward
        return 0.0

    # Skeleton of generation
    def compute_reward(self, input_item,
                       predicted_text) -> Tuple[float, Dict]:
        ...
```
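For illustration, a reward function of the kind the coding LLM is asked to generate might look like the sketch below; it follows the compute_reward(self, input_item, predicted_text) -> Tuple[float, Dict] skeleton above, but it is a hand-written example with assumed helpers (self.execute_query, input_item["gold_rows"]), not one of the generated functions evaluated in the paper.

```python
from typing import Dict, Tuple


def compute_reward(self, input_item, predicted_text) -> Tuple[float, Dict]:
    """Illustrative reward: penalize non-executable queries, give a small
    positive signal for executable ones and the full reward when the result
    rows match the ground truth (execute_query and gold_rows are assumptions)."""
    rows, error = self.execute_query(predicted_text)   # hypothetical helper
    metrics: Dict[str, float] = {}
    if error is not None:
        metrics["executable"] = 0.0
        return -1.0, metrics                            # broken SQL is penalized
    metrics["executable"] = 1.0
    exact = float(rows == input_item["gold_rows"])      # execution accuracy
    metrics["accuracy"] = exact
    return 0.1 + 0.9 * exact, metrics
```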
Appendix - Experimental Settings

                        llama-3-405b-instruct
Number of parameters    405B
Temperature             0.95
Context size            15 000
Decoding method         sample

Table 4: llama-3-405b-instruct and flan-t5-base characteristics.
                        flan-t5-base       SQLNet               Seq2SQL
Architecture            Encoder-Decoder    BiLSTM + attention   Encoder-Decoder
                        Transformer (T5)   + seq2set            + RL
Number of parameters    248M               38.5M                37M
Pretrained              Yes                No                   No
Fine-tuning required    Yes                Yes                  Yes
Temperature             0.8                0.8                  0.8

Table 5: Characteristics of the SQL generation agents (flan-t5-base, SQLNet and Seq2SQL).
Parameters                          Values
Tensors type                        F32
Temperature                         0.8
Top k                               100
Top p                               0.85
Update interval                     50
Minibatch size                      512
Number of Epochs                    5000
Number of steps                     1000
Number of evaluation episodes       5
Maximum training episodes length    1000
Evaluation interval                 10
Maximum new tokens                  250
Minimum new tokens                  10

Table 6: PPO training parameters.