Outline
- Dynamic Programming for Join Order
- Completing the Physical-Query-Plan Selection
- In-Class Exercise
- Kinds of System Failures
Dynamic Programming for Join Order
- Suppose we want to compute $(R_1 \bowtie R_2 \bowtie \ldots \bowtie R_n)$
- We can build a table that contains:
- Each table `R_i`, as well as its size and number of values for the join attribute
- For each subset of `j` tables, `R_{i_1} ... R_{i_j}`, an estimate
of the total cost of performing the join of this subset.
- The second items above can be computed from the best joins of subsets `j-1`
tables by:
- Go through the entries for subsets of size `j-1`, joining one more table.
- Check if this j-subset is already in the table.
- If it is, compare its currently listed cost with the new way of joining.
- If the new value is smaller, then replace its value in the table.
In-Class Exercise
- Suppose we have three tables R(A,B), S(B,C), U(C,D). Here T(R)=300, T(S) = 600, T(U) = 900 and V(R,A)=10, V(R,B)=20, V(S,B)=30, V(S,C)=40, V(U,C)=50, V(U,D)=60. Use the algorithm
of the last slide to determine the cost of computing the natural join of these three tables.
- Please post your solution to the Mar 15 In-Class Exercise Thread.
Completing the Physical-Query
- To finish up creating a query plan, we need to consider the following issues:
- How to select algorithms to implement the operations of the query
plan not done as part of some earlier step as in the dynamic programming for join order
- Decide whether to materialize intermediate results or pipeline them
- Notations for physical plan operators
Choosing a selection method
- What way should we use to evaluate `sigma_C(R)`?
- Assuming there are no multidimensional attributes we can, if possible:
- Choose an attribute `A` in `C`, with an index, and which is compared to a constant
- After using the index to find those things satisfying this
part of `C`, apply the rest of the condition to complete the selection.
- If no such attribute exists, we need to do a table-scan
- If there is more than one choice for `A`, we want to choose the most selective single indexed attribute,
so we might do cost estimates over the different possibilities.
Choosing a Join Method
- Request a nested loop join, if one can hope that the outer table
can fit totally in memory (so maybe don't need to execute outer loop more than once).
- Use a sort-join when one or both tables is already sorted on the join attribute or if there are two or more tables joined on the same attribute
(as then sort step can be amortized).
- If there is an index on the join attribute and the results are expected to be small use an index join.
- Fall back to nested loop join if none of the above apply.
Pipelining Versus Materialization
- The naive way to evaluate queries is to write intermediate results back out
to disk.
- This is called materialization.
- Pipelining is a technique whereby we pass the output of the sub-query
computed so far to the next stage in the query evaluation. It is typically
implemented using iterators.
Notations for Physical Query Plans
- Each node in logical query plan needs to be replaced by a physical operator.
- Operators for leaves:
- TableScan(R)
- SortScan(R,L) - sort according to list L
- IndexScan(R,C) - look up index according to condition C
- IndexScan(R,A) - entire relation is retrieved via lookup on attribute A.
- Operators for Selection:
- Filter(C) -- filter child results according to C.
- Some select might be an IndexScan followed by a Filter
- Operator for Sorting:
- Sort(L) -- may occur anywhere interior to tree.
- Other operators
- Join( type, # of buffers) - for example type might be two pass hash-join,
and number of buffers might be 100
Ordering of Physical Operations
- The physical query plan is typically represented as a tree.
- Data must flow up the tree.
- Since interior nodes may not be ancestors or descendants of each other in a bushy
tree, we need to figure out an order to evaluate sub-trees of the same height.
- To do this we evaluate sub-trees from the bottom up and evaluate sub-trees of the same
height from left to right. (Pre-order.)
- Execute all nodes of each sub-tree using a network of iterators.
DBMS Goals
- We now start a new topic, how to handle system failures.
- We would like our DBMS to:
- Protect data in the case of system failure.
- Keep the database in a consistent state even if multiple operations are
being carried out at the same time.
Possible Kind of Database Errors
- Erroneous Data Entry -- We can try to prevent by type checking.
- Media Failures -- We can try to prevent by using a RAID system or archiving data to tape
- Catastrophic failures -- For example, fires, rogue airplanes, etc. We can try to distribute the database (How Sun recovered 9/11).
- System failures -- These are things like power loss or software errors that cause a particular transaction to be lost.
- Handling system failures are what we are going to focus for the next several lectures.
Transactions
- In order to understand the recovery process from system failure, we need to understand a little about transactions.
- When doing a sequence of operations from a DMBS shell, each query or statement is roughly a transaction.
- In an embedded SQL/JDBC setting, a transaction will be a sequence of
database commands up until either a COMMIT or ROLLBACK (abort) command is issued.
- The transaction manager in the DBMS has the job of making sure a transaction is executed correctly.
Jobs of the Transaction Manager
- The Transaction Manager:
- Issues signals to the log manager so that necessary log records are written.
- Ensures that concurrently executing transactions do not interfere with each other.