We were discussing how to approximately compute the number of I/O
operations required for different physical query plan operations.
The cost of a whole query can be estimated as the sum of the estimated costs of the operations
in it.
The query optimizer often chooses from several possible plans for a given query.
The number of I/Os of the plan we choose is influenced by:
The particular logical operators chosen to implement the query.
The sizes of intermediate relations.
The physical operators used to implement the logical operators.
The ordering of similar operations such as joins.
The method of passing arguments from one physical operator to another.
Obtaining Estimates for Size Parameters
The DBMS stores values of `V(R,A)` in its catalog.
These statistics are not always kept up to date, but they can be explicitly recomputed by the DBA.
For example, in Oracle (mainly up to 10g) and MySQL, we can run
ANALYZE TABLE some_table COMPUTE STATISTICS;
Oracle also has a more modern dbms_stats utility for this purpose.
In Postgres and sqlite, the command is just ANALYZE or ANALYZE table_name.
The DBMS may compute a histogram rather than just `V(R,A)`. That is, a plot of ranges of values for an attribute versus the number of tuples in each range.
Usually such a histogram uses equal-width buckets; one can also compute equal-height (equi-depth) histograms and most-frequent-value statistics.
Most of these statistics can be computed by a table scan.
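As a concrete illustration, here is a minimal sketch of equal-width histogram construction over one attribute; the attribute values and bucket count are made up for this example.

```python
# A minimal sketch of building an equal-width histogram for one attribute.
# The values and number of buckets below are made up for illustration.

def equal_width_histogram(values, num_buckets):
    """Return a list of (low, high, count) buckets of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # Clamp the maximum value into the last bucket.
        i = min(int((v - lo) / width), num_buckets - 1)
        counts[i] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i])
            for i in range(num_buckets)]

ages = [21, 22, 22, 25, 30, 31, 38, 40, 41, 59]
for low, high, count in equal_width_histogram(ages, 4):
    print(f"[{low:.1f}, {high:.1f}): {count}")
```

Note that a single table scan suffices once the minimum and maximum are known (or one can scan twice: once for the range, once for the counts).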
In-Class Exercise
Create a new sqlite database foo with table FOO(A INT, B INT, C INT) and GOO(C INT PRIMARY KEY, D INT);
Run SELECT * FROM sqlite_stat1; before you have run ANALYZE on anything. What do you see?
Run ANALYZE on FOO then do a SELECT * FROM sqlite_stat1; to see the statistics table.
Run ANALYZE on GOO then do a SELECT * FROM sqlite_stat1; to see the statistics table.
Insert three rows into each table. After each insert do an analyze on the table followed by a SELECT * FROM sqlite_stat1;
Make sure one of your inserts into GOO's D column has a duplicate value, then create an index on the D column of GOO and run ANALYZE on GOO. What
do you suppose the second value in the stat column of sqlite_stat1 keeps track of? If you had an index on multiple columns of a table, there would be additional
fields in the stat column of sqlite_stat1. Find out what they are for.
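If you'd like to script the exercise rather than type it at the sqlite shell, here is a sketch using Python's built-in sqlite3 module with an in-memory database instead of a file named foo (the index name goo_d and the inserted values are made up):

```python
# A sketch of the in-class exercise using Python's built-in sqlite3
# module with an in-memory database instead of a file named foo.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE FOO(A INT, B INT, C INT);
    CREATE TABLE GOO(C INT PRIMARY KEY, D INT);
    INSERT INTO FOO VALUES (1, 1, 1), (2, 2, 2), (3, 3, 3);
    INSERT INTO GOO VALUES (1, 5), (2, 5), (3, 6);  -- duplicate D value
    CREATE INDEX goo_d ON GOO(D);
""")
con.execute("ANALYZE")
# sqlite_stat1 now holds one row per analyzed index.
for tbl, idx, stat in con.execute("SELECT * FROM sqlite_stat1"):
    print(tbl, idx, stat)
```

Comparing the stat value of goo_d against the number of distinct values you inserted into D should suggest the answer to the question above.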
Once statistics have been computed, we want to be able to maintain them without having to recompute everything.
For example, to maintain `T(R)`, we can just add +1 on an insert and -1 on a delete.
If there is a B-tree index on any attribute of `R`, we could also estimate `T(R)` using the number of leaves in the B-tree, assuming each leaf is about 3/4 full. (Then we don't have to update the statistic on each insert.)
If there is an index on `A`, then `V(R,A)` can also be maintained as we insert into or delete from the index.
In the particular case that `A` is a key, then `T(R)=V(R,A)`.
In the no-index case, one might use a rudimentary structure that holds each distinct value of `A`.
One might also sample data and assume the data is according to some distribution: uniform, Zipfian, etc.
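The incremental bookkeeping described above can be sketched as follows; the class and method names are made up, and the multiset of `A`-values stands in for whatever rudimentary structure or index the system actually uses:

```python
# A hypothetical sketch of maintaining T(R) and V(R,A) incrementally,
# using an in-memory multiset of A-values. Names here are made up.

from collections import Counter

class TableStats:
    def __init__(self):
        self.t = 0                 # T(R): number of tuples
        self.a_counts = Counter()  # multiplicity of each A-value

    def v(self):
        return len(self.a_counts)  # V(R,A): number of distinct A-values

    def on_insert(self, a):
        self.t += 1                # +1 on an insert
        self.a_counts[a] += 1

    def on_delete(self, a):
        self.t -= 1                # -1 on a delete
        self.a_counts[a] -= 1
        if self.a_counts[a] == 0:
            del self.a_counts[a]

stats = TableStats()
for a in [1, 2, 2, 3]:
    stats.on_insert(a)
print(stats.t, stats.v())   # 4 tuples, 3 distinct values
stats.on_delete(3)
print(stats.t, stats.v())   # 3 tuples, 2 distinct values
```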
Heuristics for Reducing the Cost of Logical Query Plans
We might try things like pushing selects down the tree.
As the preferred logical query plan is being generated, we consider plans generated by applying several such heuristics.
Then we compute the cost before and after each transformation.
If the transformation doesn't help, we don't do it.
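The accept-only-if-cheaper loop can be sketched as below; the plan representation, the push_select_down rewrite, and the cost numbers are all made up for illustration:

```python
# A minimal sketch of cost-based acceptance of heuristic transformations:
# each rewrite is kept only if the estimated cost goes down. The plan
# representation and cost numbers here are made up for illustration.

def optimize(plan, transformations, cost):
    for transform in transformations:
        candidate = transform(plan)
        if cost(candidate) < cost(plan):  # keep only improving rewrites
            plan = candidate
    return plan

# Toy example: a "plan" is a tuple of operator names; pushing the select
# below the join is modeled as reordering the tuple.
def push_select_down(plan):
    return ("scan", "select", "join") if plan == ("scan", "join", "select") else plan

toy_cost = {("scan", "join", "select"): 100, ("scan", "select", "join"): 40}
best = optimize(("scan", "join", "select"), [push_select_down], toy_cost.get)
print(best)  # ('scan', 'select', 'join')
```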
Approaches to Enumerating Physical Plans
There are two standard approaches:
Top down: Work from the root down. For each possible implementation of the operation at the root, consider each possible way to evaluate its arguments. Compute the cost of each possibility and choose the least.
Bottom up: For each sub-expression, compute the cost of all possible ways to compute that sub-expression. Choose the one of least cost.
We'll focus on the bottom-up approach.
More on Enumerating Physical Plans
Heuristic selection - Try different heuristics to improve cost. For example:
We can try using an index if we are doing a select.
If we are doing a join and there is an index on the join attribute, then we can try using an index join.
If we are doing a join and one attribute of the join is sorted, then we can try using a sort join.
If we are doing a set operation on several relations, we can try combining the smallest relations first.
Branch and Bound Plan Enumeration - Use heuristics to first find a good plan, say with cost `C`. Then enumerate plans for sub-expressions. If any cost for a sub-expression is above `C`, can discard all plans that involve this way to compute the sub-expression. Hopefully, using this can speed up over exhaustive search.
Hill Climbing - make small local changes to the plan; if a change improves the cost, we switch to the changed plan. If no small change yields an improvement within a timeout period for considering plans, then output the best current plan.
Dynamic Programming - we'll say more about this in a minute; it is used mainly for join orderings.
Selinger-Style Optimization - modifies dynamic programming. It keeps the low-cost plan, and also some higher-cost plans for sub-expressions that might make up for their slowness by the properties of their output. For example, a higher-cost sub-expression might compute its output in sorted order, and this might be useful later in the query. Details can be found in Selinger et al. (1979).
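The bottom-up dynamic programming idea can be sketched concretely for join ordering. The cost model below is deliberately simple and made up: the cost of joining two sub-results is the product of their estimated sizes, and the result size is that product scaled by a fixed selectivity.

```python
# A sketch of dynamic programming over subsets for join ordering.
# The cost model is made up: joining two sub-results costs the product
# of their estimated sizes, and the result size is that product scaled
# by a fixed selectivity.

from itertools import combinations

def best_join_order(sizes, selectivity=0.01):
    """sizes: dict relation-name -> estimated tuples. Returns (cost, size, plan)."""
    names = sorted(sizes)
    # best[subset] = (cost so far, estimated result size, plan expression)
    best = {frozenset([r]): (0, sizes[r], r) for r in names}
    for k in range(2, len(names) + 1):
        for subset in map(frozenset, combinations(names, k)):
            candidates = []
            # Try every split of the subset into two non-empty halves.
            for m in range(1, k):
                for left in map(frozenset, combinations(sorted(subset), m)):
                    right = subset - left
                    lc, ls, lp = best[left]
                    rc, rs, rp = best[right]
                    cost = lc + rc + ls * rs
                    candidates.append((cost, selectivity * ls * rs,
                                       f"({lp} JOIN {rp})"))
            best[subset] = min(candidates)
    return best[frozenset(names)]

cost, size, plan = best_join_order({"A": 1000, "B": 100, "C": 10})
print(cost, plan)
```

With these toy sizes the cheapest plan joins B and C first, for a total cost of 11000 under this model.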
Join Order - Significance of Left and Right Join Arguments
When ordering joins, we should remember our algorithms are often asymmetric in terms of cost.
It is possible that $A \bowtie B$ is cheaper than $B \bowtie A$.
When performing a:
Nested-loop join, we should use the smaller relation in the outer loop.
Index-join, we should use as the right argument the table with the index on the join attribute.
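A quick arithmetic check shows why argument order matters for the simple one-block-at-a-time nested-loop join, whose I/O cost is roughly B(outer) + B(outer) * B(inner); the block counts are made up.

```python
# Why join argument order matters for the simple nested-loop join:
# I/O cost is roughly B(outer) + B(outer) * B(inner).
# The block counts below are made up for illustration.

def nested_loop_cost(b_outer, b_inner):
    return b_outer + b_outer * b_inner

b_a, b_b = 1000, 10   # B(A) = 1000 blocks, B(B) = 10 blocks
print(nested_loop_cost(b_a, b_b))  # A outer: 1000 + 1000*10 = 11000
print(nested_loop_cost(b_b, b_a))  # B outer: 10 + 10*1000   = 10010
```

Using the smaller relation B as the outer loop saves almost 10% here, consistent with the rule of thumb above.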
Join Trees
Consider $(A \bowtie B \bowtie C)$. We could order this as:
$((A \bowtie B) \bowtie C)$
or as
$(A \bowtie (B \bowtie C))$.
There are exponentially many ways to take the natural join of `n` tables.
It would take too long if we had to enumerate all of them.
Left-Deep Join Trees
An ordering like $(((A_1 \bowtie A_2) \bowtie A_3) \ldots)$ is called a left
deep ordering.
Note $(((A_3 \bowtie A_1) \bowtie A_2) \ldots)$ computes the same join and is
also left deep.
One can define a right deep ordering in a similar fashion by associating to the right.
A join like $((A_1 \bowtie A_2) \bowtie (A_3 \bowtie A_4))$ uses neither a left-deep nor a right-deep ordering and is called bushy.
Even if we allow non-join operations, we might still call it a left-deep ordering, provided the binary operators are parenthesized in the above way.
In general, there is only one left-deep tree shape for `n` relations.
However, one is able to put the relations into this shape in `n!` ways.
The number of tree shapes is given by
`T(1)=1`,
`T(n) = \sum_{i=1}^{n-1} T(i)T(n-i)`
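The recurrence above is easy to evaluate directly; the sketch below also prints `n!`, the number of ways to place `n` relations into the single left-deep shape.

```python
# Number of join tree shapes T(n) from the recurrence above, alongside
# n!, the number of ways to assign relations to the one left-deep shape.

from functools import lru_cache
from math import factorial

@lru_cache(maxsize=None)
def tree_shapes(n):
    if n == 1:
        return 1
    # T(n) = sum over splits into a left subtree of i leaves
    # and a right subtree of n - i leaves.
    return sum(tree_shapes(i) * tree_shapes(n - i) for i in range(1, n))

for n in range(1, 7):
    print(n, tree_shapes(n), factorial(n))  # shapes grow as Catalan numbers
```

Restricting attention to left-deep trees thus shrinks the search space from (number of shapes) x n! down to n! orderings.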
Some Advantages of Left Deep Join Trees
They tend to produce efficient plans because:
If one-pass join algorithms are used, and the build relation is on the left, then the amount of memory needed at any one time tends to be smaller than if we used a right-deep or bushy tree.
If nested-loop joins are used, implemented by iterators, then we avoid having to construct any intermediate relation more than once.