I have a parsed representation of an SQL query in the form of an abstract syntax tree. I'd like to convert it into an intermediate logical tree representation, which can in turn be expanded into candidate physical operation trees by the optimizer.
Following various resources discussing the topic (CMU lecture, Database Management System, Dayal's 1987 paper on nested subqueries), I managed to translate most of the query into an intermediate algebraic form (e.g. π_a(R⋈S)) but I get stuck at translating common table expressions and lateral joins.
- I can't find a way to include CTEs in the algebraic tree without inlining all references to them. While running down the algebraic tree, the optimizer would come across the same expanded CTE, which seems inefficient.
- By definition, subqueries in lateral joins can reference columns provided by preceding FROM items. When using a relational algebra tree representation, FROM items preceding the subquery are parent nodes, but join nodes can only reference columns output by child nodes. For instance:
SELECT *
FROM
(SELECT 1 id) A,
(
(SELECT 2) B
JOIN LATERAL (SELECT A.id) C ON TRUE
) D;
NAT. JOIN
├── A
└── LATERAL JOIN (D)
├── B
└── C <- should be able to reference A
Can these constraints be expressed in terms of algebraic operations or do they typically need bespoke data structures? For example, separate subtrees for CTEs instead of a single tree for the whole logical plan - but then, how can the optimizer optimize the query globally instead of each subtree individually?
More generally, are there better intermediate representations than algebraic trees for the purpose of query optimization?