README.md
+50 -13 (50 additions & 13 deletions)
@@ -1,30 +1,61 @@
1
-
##chiDB SQL Parser
1
+
# chiDB SQL Compiler Front-End
2
2
3
3
This is a SQL parser and compiler front end that can parse a reasonably large subset of SQL. The parser is generated by Lex and Yacc, and all other code is in C. The parser generates an abstract syntax tree (AST) representation of SQL, which is based on relational algebra extended to fit SQL more closely. This extended relational algebra is termed SRA (sugared relational algebra), since it can itself be compiled into more or less pure relational algebra.
4
4
5
-
### Example
6
-
Consider the following SQL query:
5
+
Since relational algebra deals primarily with queries, there are separate data structures for the `Create Table`, `Create Index`, `Insert Into`, and `Delete From` commands (and possibly others in the future). These don't use RA, but they do use, for example, the Expression part of the abstract syntax tree.
7
6
8
-
```sql
7
+
## Methodologies
9
8
10
-
SQL:
11
-
select f.a as Col1, g.a as Col2 from Foo f, Foo g where Col1 != Col2;
9
+
The parser is generated by Lex (a lexer generator) and Yacc (a parser generator). All other code is written in C, along with prototypes in Haskell.
10
+
11
+
### Lex
12
+
13
+
The Lex file contains a series of regular expressions that define the tokens of the language, each paired with the instructions to perform when that regular expression is matched. The Lex utility converts this file into C code which scans the input for the next token and performs the instructions associated with it. These can be as simple as returning an integral value indicating the token's type (drawn from an enum generated by Yacc), or more involved, such as handling comments, converting escape sequences, or storing the value of a constant (like an integer or string) or the name of a variable.
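To make the description above concrete, here is a minimal hand-written C sketch of what a scanner of this kind does: it returns a token kind and stores the value of an integer constant or the name of an identifier. This is not the Lex-generated code from the project; `TokenKind`, `Token`, and `next_token` are hypothetical names chosen for illustration, and the token set is deliberately tiny.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical token kinds; in the real project the token values
 * come from an enum generated by Yacc. */
typedef enum { TOK_EOF, TOK_INT, TOK_IDENT, TOK_SYMBOL } TokenKind;

typedef struct {
    TokenKind kind;
    long int_value;   /* meaningful only when kind == TOK_INT */
    char text[64];    /* identifier name, or a single symbol character */
} Token;

/* Scan the next token starting at *cursor and advance *cursor past it. */
Token next_token(const char **cursor)
{
    Token tok = { TOK_EOF, 0, "" };
    assert(cursor != NULL && *cursor != NULL);
    const char *p = *cursor;

    while (isspace((unsigned char)*p))
        p++;                                   /* skip whitespace */

    if (*p == '\0') {
        /* end of input: tok stays TOK_EOF */
    } else if (isdigit((unsigned char)*p)) {
        char *end;
        tok.kind = TOK_INT;
        tok.int_value = strtol(p, &end, 10);   /* store the constant's value */
        p = end;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        size_t n = 0;
        tok.kind = TOK_IDENT;
        while (isalnum((unsigned char)*p) || *p == '_') {
            if (n < sizeof tok.text - 1)       /* truncate over-long names */
                tok.text[n++] = *p;
            p++;
        }
        tok.text[n] = '\0';
    } else {
        tok.kind = TOK_SYMBOL;                 /* any other single character */
        tok.text[0] = *p++;
        tok.text[1] = '\0';
    }

    *cursor = p;
    return tok;
}
```

Calling `next_token` repeatedly on the same cursor yields one token per call until `TOK_EOF` is returned. A Lex specification expresses the same branches declaratively, as one regular expression per token.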
14
+
15
+
### Yacc
16
+
17
+
Yacc is used to generate the parser. A Yacc file is similar to a Lex file in that it has a series of definitions of structures, and instructions for when those structures are encountered. However, in Yacc the structures are grammatical rules, and the instructions that accompany them usually build a parse tree (abstract syntax tree), which is a representation of the language in data structures. Yacc allows us to rapidly construct a correct and efficient parser, gives us detailed error messages (both when parsing and when writing the grammar itself), and saves us a lot of time compared to a hand-written parser.
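The tree-building instructions attached to grammar rules typically just call constructor functions. The C sketch below shows what such constructors might look like for a tiny expression type, together with a pretty printer and a destructor. The names (`Expr`, `make_column`, `make_add`, `expr_print`, `expr_free`) are assumptions made for this example and are not taken from the chiDB sources.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef enum { EXPR_COLUMN, EXPR_ADD } ExprKind;

typedef struct Expr {
    ExprKind kind;
    char *name;               /* column name, EXPR_COLUMN only */
    struct Expr *lhs, *rhs;   /* operands, EXPR_ADD only */
} Expr;

/* Leaf node: a reference to a column. The name is copied. */
Expr *make_column(const char *name)
{
    Expr *e = calloc(1, sizeof *e);
    assert(e != NULL);
    e->kind = EXPR_COLUMN;
    e->name = malloc(strlen(name) + 1);
    assert(e->name != NULL);
    strcpy(e->name, name);
    return e;
}

/* Interior node; a Yacc action for a rule like  expr '+' expr
 * could call this with the two sub-expressions. Takes ownership. */
Expr *make_add(Expr *lhs, Expr *rhs)
{
    Expr *e = calloc(1, sizeof *e);
    assert(e != NULL);
    e->kind = EXPR_ADD;
    e->lhs = lhs;
    e->rhs = rhs;
    return e;
}

/* Pretty-print into buf; output is truncated if it does not fit.
 * The fixed-size scratch buffers keep the sketch short. */
int expr_print(const Expr *e, char *buf, size_t size)
{
    char l[128], r[128];
    if (e->kind == EXPR_COLUMN)
        return snprintf(buf, size, "%s", e->name);
    expr_print(e->lhs, l, sizeof l);
    expr_print(e->rhs, r, sizeof r);
    return snprintf(buf, size, "Add(%s, %s)", l, r);
}

/* Free a node and everything below it. */
void expr_free(Expr *e)
{
    if (e == NULL)
        return;
    expr_free(e->lhs);
    expr_free(e->rhs);
    free(e->name);
    free(e);
}
```

Building `make_add(make_column("x"), make_column("y"))` and printing it gives the `Add(x, y)` notation used for expressions elsewhere in this README.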
18
+
19
+
## Sugared Relational Algebra
20
+
21
+
In the chiDB SQL parser, the instructions in the Yacc file produce a representation which we call Sugared Relational Algebra (SRA). This is an extended form of relational algebra, with several differences. For example:
22
+
23
+
* it contains as primitives multiple join types (Inner Join, Full/Left/Right outer join, Natural Join) and Intersection
24
+
25
+
* it has no Rho operator, because all renaming and aliasing is carried alongside the expressions. For example, in SRA a `Project` structure has a list of expressions which can optionally have aliases, but in RA a `Pi` structure has only expressions, and any aliasing must be done with a `Rho` operator.
26
+
27
+
* SRA is also allowed to use `*` as in SQL to stand for "all columns in the table". The step which translates from SRA to RA will expand all `*`s into the actual list of columns.
12
28
29
+
The translation from SRA to RA is called desugaring. Here's an example. Say we have a table `t` which has columns `w`, `x`, and `y`:
30
+
31
+
```
32
+
SQL:
33
+
SELECT *, x+y as z from t;
34
+
35
+
SRA:
36
+
Project([*, (Add(x, y), z)],
37
+
Table(t))
38
+
39
+
RA:
40
+
Pi([w, x, y, z],
41
+
Rho(Add(x,y), z,
42
+
Pi([w, x, y, Add(x,y)],
43
+
Table(t))))
13
44
```
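One part of desugaring, the expansion of `*`, can be sketched in a few lines of C. The version below is a simplification made for illustration: it treats the projection list and the table's columns as flat arrays of names, whereas the real structures are lists of expressions, and `expand_star` is a hypothetical function name rather than one from the project.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the projection list `items` into `out`, replacing every "*"
 * with the table's column names `cols`. Returns the number of
 * entries written. `out` must have room for `out_cap` entries;
 * only pointers are copied, not the strings themselves. */
size_t expand_star(const char *const *items, size_t n_items,
                   const char *const *cols, size_t n_cols,
                   const char **out, size_t out_cap)
{
    size_t n_out = 0;
    for (size_t i = 0; i < n_items; i++) {
        if (strcmp(items[i], "*") == 0) {
            /* "*" stands for all columns of the table, in order */
            for (size_t j = 0; j < n_cols; j++) {
                assert(n_out < out_cap);
                out[n_out++] = cols[j];
            }
        } else {
            assert(n_out < out_cap);
            out[n_out++] = items[i];
        }
    }
    return n_out;
}
```

For the example above, the list `[*, z]` over a table with columns `w`, `x`, and `y` expands to `[w, x, y, z]`, matching the outer `Pi` in the RA tree.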
14
45
15
-
The parser will create an AST which looks like this:
46
+
A more complicated example:
16
47
17
48
```
49
+
SQL:
50
+
select f.a as Col1, g.a as Col2 from Foo f, Foo g where Col1 != Col2;
51
+
52
+
SRA:
18
53
Project([(f,a,Col1), (g,a,Col2)],
19
54
Select(Col1 != Col2,
20
55
Join([(Foo,f), (Foo,g)])
21
56
)
22
57
)
23
-
```
24
58
25
-
The desugaring step will produce a Relational Algebra tree:
26
-
27
-
```
28
59
Pi([Col1, Col2],
29
60
Sigma(Col1 != Col2,
30
61
Cross(
@@ -35,9 +66,15 @@ Pi([Col1, Col2],
35
66
)
36
67
```
37
68
38
-
Since relational algebra primarily deals with only queries, there are separate data structures to deal with `Create Table`, `Create Index`, `Insert Into`, and `Delete From` commands (and possibly other things in the future). These don't use RA, but they do, for example, use the Expression part of the abstract syntax tree.
69
+
## Current Status
70
+
71
+
Currently, we have a good deal of machinery in place. The parser and lexer are finished, and all of the data structures for both SRA and RA are written, along with constructors, destructors, pretty printers, etc. There is a doubly-linked list library which is quite robust; it is not fully thread-safe, though that could be added with reasonable effort. There is also a vector library which might be useful, for example, for string-building, or if a vector-based rather than list-based representation of columns, expressions, etc. is desired.
72
+
73
+
## Future Directions
74
+
75
+
The largest thing missing at this point is the desugarer. However, we have one written in Haskell (`Desugar.hs` and `SRA.hs`, with examples in `Tests.hs`), and translating this code into C should be fairly straightforward. Haskell allows us to represent and manipulate structures of RA and SRA with ease, conciseness, and correctness, and I humbly recommend that any modification of the code be prototyped in Haskell before it is coded in C.