Create sorted sublists of `M` blocks each, sorted on the join attribute `Y`, for both `R` and `S`.
Bring the first block of each sublist into a buffer (we assume there are `le M` sublists in total).
Repeatedly find the least `Y` value `y` among the sublist blocks in memory.
Identify tuples in both relations having this `Y` value.
Output the join of all these tuples.
This takes time `3(B(R)+B(S))` provided `B(R)+B(S) lt M^2`.
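Here is a minimal Python sketch of the merge phase, assuming the sorted sublists are already available as in-memory lists of `(y, tuple)` pairs (function and variable names are illustrative, and block buffering is elided):

```python
import heapq
from itertools import product

def merge_join(r_runs, s_runs):
    """Merge phase of the sort-join: r_runs and s_runs are sorted
    sublists of (y, tuple) pairs, sorted on the join attribute y."""
    r = heapq.merge(*r_runs, key=lambda p: p[0])  # merged sorted stream for R
    s = heapq.merge(*s_runs, key=lambda p: p[0])  # merged sorted stream for S

    def groups(stream):
        """Yield (y, [tuples]) for each run of equal y values."""
        cur_y, cur = object(), []
        for y, t in stream:
            if y != cur_y:
                if cur:
                    yield cur_y, cur
                cur_y, cur = y, []
            cur.append(t)
        if cur:
            yield cur_y, cur

    gr, gs = groups(r), groups(s)
    ry, sy = next(gr, None), next(gs, None)
    while ry is not None and sy is not None:
        if ry[0] < sy[0]:          # advance the side with the smaller y
            ry = next(gr, None)
        elif ry[0] > sy[0]:
            sy = next(gs, None)
        else:                      # least common y: output the join of all matches
            for tr, ts in product(ry[1], sy[1]):
                yield ry[0], tr, ts
            ry, sy = next(gr, None), next(gs, None)
```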
Partitioning Relations by Hashing
For the algorithms we'll consider next, it is useful to be able to
read a file quickly into a hash table.
To do this we use one block of main memory as an input buffer for reading the file, and use the other `M-1` blocks as output buffers, one for each bucket of the hash table (so we assume the hash table has at most `M-1` buckets).
We read successive blocks from the file, apply the hash function to each tuple in these blocks, and move each tuple to its bucket's buffer.
When a bucket buffer fills, we write it to disk.
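A sketch of this partitioning step, with Python lists standing in for disk blocks and bucket files (the block size and names are made up for illustration):

```python
def hash_partition(blocks, m_minus_1, hash_fn=hash):
    """Partition a stream of blocks (lists of tuples) into M-1 bucket
    files, keeping one output buffer per bucket in memory."""
    BLOCK_SIZE = 100                                # tuples per block; illustrative
    buffers = [[] for _ in range(m_minus_1)]        # M-1 one-block bucket buffers
    bucket_files = [[] for _ in range(m_minus_1)]   # stand-ins for on-disk files

    for block in blocks:                 # the remaining block reads the input file
        for t in block:
            b = hash_fn(t) % m_minus_1   # hash each tuple to its bucket
            buffers[b].append(t)
            if len(buffers[b]) == BLOCK_SIZE:       # buffer full: write it to disk
                bucket_files[b].append(buffers[b])
                buffers[b] = []
    for b, buf in enumerate(buffers):    # flush partially filled buffers at the end
        if buf:
            bucket_files[b].append(buf)
    return bucket_files
```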
Hash-based Duplicate Elimination
First hash the file into bucket files.
Then eliminate duplicates from each bucket file using the one-pass algorithm and output the results.
The idea is that if a tuple `t` occurs multiple times, every occurrence hashes to the same bucket, so we can eliminate duplicates bucket by bucket.
This works provided the hash function is random-like and `B(R)` is less than `M(M-1)`.
The bound comes from the fact that each bucket ends up with roughly `(B(R))/(M-1)` blocks, and we need to be able to read a whole bucket into memory to do duplicate elimination on it, so we need `(B(R))/(M-1) < M`.
It takes `3 B(R)` time.
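A sketch of the whole algorithm, building on the `hash_partition` sketch above and assuming each bucket file fits in memory:

```python
def hash_dedup(blocks, m_minus_1):
    """Hash-based duplicate elimination: partition first, then run the
    one-pass algorithm (here, a set) on each bucket file independently."""
    for bucket_file in hash_partition(blocks, m_minus_1):
        seen = set()                 # duplicates of a tuple all land in this bucket
        for block in bucket_file:
            for t in block:
                if t not in seen:
                    seen.add(t)
                    yield t          # output each distinct tuple once
```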
Hash-Join
Use only the join attribute as the hash key.
Hash R and S to hash files of buckets.
Pick the hash file whose largest bucket is smaller.
As long as that file's largest bucket is `lt M` blocks, we can read each of its buckets into at most `M-1` buffer blocks and use the remaining block to read from the other hash file.
So in this situation we can do a one-pass join of the bucket 1's, the bucket 2's, etc.
Works provided `min(B(R), B(S)) lt M(M-1)`.
Takes time `3(B(R)+B(S))`.
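A sketch, again reusing `hash_partition`; for simplicity it always builds the in-memory table from `R`'s bucket, whereas the text says to build from whichever file has the smaller largest bucket:

```python
from collections import defaultdict

def hash_join(r_blocks, s_blocks, r_key, s_key, m_minus_1):
    """Hash-join sketch: hash both relations on the join attribute,
    then one-pass join each pair of corresponding buckets."""
    r_files = hash_partition(r_blocks, m_minus_1, lambda t: hash(r_key(t)))
    s_files = hash_partition(s_blocks, m_minus_1, lambda t: hash(s_key(t)))
    for rb, sb in zip(r_files, s_files):     # join bucket 1's, bucket 2's, ...
        table = defaultdict(list)            # build side; assumed < M blocks
        for block in rb:
            for t in block:
                table[r_key(t)].append(t)
        for block in sb:                     # probe side, read block by block
            for u in block:
                for t in table[s_key(u)]:
                    yield t, u
```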
Index-based Selection
Suppose we have a clustered index on `a` and want to compute
`sigma_{a=v}(R)`.
Then the number of I/Os will be roughly `(B(R))/(V(R,a))`.
Here we are ignoring the cost of the index, roundoff errors,
and that the blocks of `R` might not be completely filled with tuples from `R`.
If the index on `R` is non-clustered, then the cost will be approximately `(T(R))/(V(R,a))`, since each matching tuple may sit on a different block. So it is more likely that a table scan will be faster.
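As an illustration with made-up numbers: if `B(R)=1000`, `T(R)=20000`, and `V(R,a)=100`, the clustered estimate is `1000/100 = 10` I/Os and the non-clustered estimate is `20000/100 = 200` I/Os, both better than the roughly `1000` I/Os of a table scan. But if `V(R,a)=10`, the non-clustered estimate becomes `2000` I/Os and the scan wins.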
Index-based Join
Examine each block of `R`. For each tuple `t` in that block,
look up all tuples of `S` that join with it and output the joined tuples.
The cost will be approximately `(T(R) cdot T(S))/(V(S,Y))`, where `Y` is the join
attribute, if the index on `S` is non-clustered, and `(T(R) cdot B(S))/(V(S,Y))` if it is clustered.
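A sketch in Python, where `s_index` is assumed to be some dict-like index mapping a join-attribute value to the matching tuples of `S`:

```python
def index_join(r_blocks, s_index, r_key):
    """Index-based join sketch: for each tuple of R, probe the index on
    S's join attribute Y and output the joined pairs."""
    for block in r_blocks:                       # examine each block of R
        for t in block:                          # for each tuple t in that block
            for u in s_index.get(r_key(t), []):  # look up everything joining with t
                yield t, u
```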
Quiz
Which of the following is true?
It is impossible for a table to be in more than one tablespace.
Sorting a table cannot be done if the table cannot fit in RAM.
In the map reduce paradigm, a combiner runs the reduce job on the local machine for
each sorted file, producing sorted reduced files which can then be sent to their respective machines.
Obtaining Estimates for Size Parameters
The DBMS stores values of `V(R,A)` in its catalog.
This is not always updated, but can be explicitly updated by the DBA.
For example, in Oracle (mainly up to 10g) and MySQL, we can use
ANALYZE TABLE some_table COMPUTE STATISTICS;
Oracle also has a more modern dbms_stats utility for this purpose.
In Postgres and sqlite, the command is just ANALYZE or ANALYZE table_name.
The DBMS may compute a histogram rather than just `V(R,A)`: that is, a plot of ranges of values of an attribute versus the number of tuples in each range.
Usually such a histogram has equal-width buckets, but one can also compute equal-height histograms and most-frequent-value statistics.
Most of these statistics can be computed by a table scan.
Incremental Computation of Statistics
Once statistics have been computed, we want to be able to maintain them without having to recompute everything.
For example, to maintain `T(R)`, we can just add +1 on an insert and -1 on a delete.
If there is a B-tree index on any attribute of `R`, we could also estimate `T(R)` using the number of leaves in the B-tree, assuming each is 3/4 full. (This estimate doesn't have to change on each insert.)
If there is an index on `a`, then `V(R,a)` can also be maintained as we insert into and delete from the index.
In the particular case that `a` is a key, then `T(R)=V(R,a)`.
In the no-index case, one might use a rudimentary structure that holds each value of `a`.
One might also sample data and assume the data is according to some distribution: uniform, Zipfian, etc.
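A toy sketch of such incremental maintenance for the no-index case (the class and method names are made up):

```python
from collections import Counter

class TableStats:
    """Incrementally maintained statistics for one attribute a of R,
    using a rudimentary structure that holds each value of a."""
    def __init__(self):
        self.t_r = 0               # T(R): number of tuples
        self.counts = Counter()    # occurrences of each value of a

    def insert(self, a_value):
        self.t_r += 1              # +1 on an insert
        self.counts[a_value] += 1

    def delete(self, a_value):
        self.t_r -= 1              # -1 on a delete
        self.counts[a_value] -= 1
        if self.counts[a_value] == 0:
            del self.counts[a_value]

    def v_r_a(self):
        return len(self.counts)    # V(R,a): number of distinct values
```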
Incremental Indexing
Since we are on the topic of incremental algorithms, we next consider log-structured merge trees.
The sparse secondary indexes we discussed earlier tend to be both smaller than B-trees and involve more contiguous block accesses on the disk.
So if we have such an index, lookups can often be faster than with a B-tree.
The drawback, though, is maintaining it when there are many inserts and deletes.
So B-trees are a better mutable storage structure and sparse indexes are a better immutable storage structure.
In the case where we know the changes to the tables will usually be append-only inserts, we can employ an incremental approach to sparse indexing.
Building a Log-Structured Merge Tree
First, at the file level, we assume the file is organized into partitions. We maintain a partition in memory to which new inserts are written. When this partition fills, we write it to disk at the end of the file.
The log-structured merge tree index or file is computed simultaneously with this process.
To keep things simple we consider the index. Such an index is organized into tiers.
When a partition is written to disk we create an index for just this partition and add it to the set of tier 0 indexes.
If we have `M-1` tier 0 indexes, we use our merge operation that we discussed for sorting to produce a new tier 1 index for the `M-1` partitions these tier 0 indexes corresponded to.
Similarly, if we have `M-1` tier 1 indexes, we use our merge operation that we discussed for sorting to produce a new tier 2 index for these `(M-1)^2` partitions, and so on.
Index lookup in this data structure involves doing binary search in the index files of each of the at most `log_{M-1} B(R)` many tiers.
In the case that the key is also increasing as we insert, we can often predict which particular tier file to look at directly, and then only have to do binary search in it.
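A toy sketch of the tiering and lookup logic, with sorted Python lists standing in for the per-partition index files (details like record pointers are elided):

```python
import bisect
import heapq

class LSMIndex:
    """Toy log-structured merge index: tier 0 holds indexes of single
    partitions; merging M-1 tier-i indexes yields one tier-(i+1) index."""
    def __init__(self, fanout):
        self.fanout = fanout            # plays the role of M-1
        self.tiers = []                 # tiers[i] = list of sorted key runs

    def add_partition_index(self, keys):
        self._add(0, sorted(keys))      # index a freshly written partition

    def _add(self, tier, run):
        while len(self.tiers) <= tier:
            self.tiers.append([])
        self.tiers[tier].append(run)
        if len(self.tiers[tier]) == self.fanout:    # tier full: merge upward
            merged = list(heapq.merge(*self.tiers[tier]))
            self.tiers[tier] = []
            self._add(tier + 1, merged)

    def lookup(self, key):
        for tier in self.tiers:         # at most log_{M-1} B(R) tiers
            for run in tier:
                i = bisect.bisect_left(run, key)    # binary search in this file
                if i < len(run) and run[i] == key:
                    return True
        return False
```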
Enterprise Storage Subsystems and Business Continuity
We next look at the storage hardware aspect of physical database organization.
First, we are going to look at disk arrays and RAID.
Then we will look at network storage techniques such as SANs and NAS.
Finally, we will look at how these techniques can be applied to business continuity.
Intermittent Failures
There are four main kinds of disk failures: intermittent failures, media decay, write failures, and disk crashes.
We now discuss each of these in more detail as well as how to mitigate against them.
Disk sectors are usually stored with some extra
bits that enable us to tell if the sector's content is
correct or not on a read or a write.
We model a disk read as returning a pair `(w,s)`
where `w` is the data and `s` is the status as to
whether the read was okay.
In an intermittent failure, we may get several bad
`s` values before a good one.
Usually, we have some max tries (say 100) before we permanently fail.
For write requests, after the write, we can do a read to see if the `s` is
correct, and hence, if the write was successful.
This read will typically take an additional disk revolution.
If the write was not successful, we can repeat it.
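A small sketch of this retry logic, where `read_sector` stands for some hypothetical driver call returning a `(w, s)` pair:

```python
MAX_TRIES = 100   # illustrative cap before we report a permanent failure

def reliable_read(read_sector):
    """Retry an intermittently failing read until the status s
    reports a good read, up to MAX_TRIES attempts."""
    for _ in range(MAX_TRIES):
        w, s = read_sector()
        if s:                    # good status: the data w is trusted
            return w
    raise IOError("permanent read failure after %d tries" % MAX_TRIES)
```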
Checksums
What should we use for `s`?
A common thing to do is to take the parity of
the bits in the sector for `s`. That is, `s` is 1 if the
number of `1` bits in the sector is odd and `0`
otherwise.
For example, if `w=0101100` then `s=1`. If
`w=1111000` then `s=0`.
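This parity computation is a one-liner in Python; the asserts reproduce the two examples above:

```python
def parity(w):
    """Status bit s for sector contents w (a string of '0'/'1' bits):
    s is 1 if the number of 1 bits is odd and 0 otherwise."""
    return w.count("1") % 2

assert parity("0101100") == 1   # three 1 bits: odd, so s = 1
assert parity("1111000") == 0   # four 1 bits: even, so s = 0
```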
Stable Storage
We want a policy in effect ensuring that, when we write to disk,
either the sector is correctly written, or any data that
was in the sector before the write can be safely recovered.
This policy is called stable storage.
Implementing Stable Storage
We pair sectors, with each pair representing the contents `X` of one sector.
Call the two copies the left and right copies and denote them `X_L` and `X_R`.
To write:
Write `X` into `X_L` and check that `s` is good. If not, repeat the write. If it fails repeatedly, assume `X_L` is corrupted and replace it with a different sector. Then repeat this process for `X_R`.
To read:
Try to read `X` from `X_L` and check if `s` is good. If yes, then that is `X`; if not, repeat the read. If it fails repeatedly, try to get the value from `X_R`.
The above scheme works well for media and write failures.
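A sketch of both policies, where `write_sector(side, x)` and `read_sector(side)` are hypothetical driver calls (sector replacement after repeated write failures is elided):

```python
def stable_write(x, write_sector, max_tries=100):
    """Write x into X_L, verifying the status s after each attempt,
    then repeat the whole process for X_R."""
    for side in ("L", "R"):
        for _ in range(max_tries):
            if write_sector(side, x):     # returns True when s is good
                break
        else:                             # repeated failure: sector corrupted
            raise IOError("copy X_%s appears corrupted; replace the sector" % side)

def stable_read(read_sector, max_tries=100):
    """Try to read X from X_L; on repeated failure fall back to X_R."""
    for side in ("L", "R"):
        for _ in range(max_tries):
            w, s = read_sector(side)
            if s:                         # good status: this is X
                return w
    raise IOError("both copies unreadable")
```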
Failure Model for Disks
We now want to consider RAID (redundant arrays of independent disks) schemes that work well for disk crashes.
The average time until a disk crash is called the mean time to failure (MTTF). For modern disks this is about 10 years.
We model this by assuming roughly 10% of the disks fail in a given year.
We want to use multiple disks to make the mean time to data loss to be a much bigger number than the MTTF.
Mirroring as a Redundancy Technique
Suppose you have two disks with the same data.
One is called the data disk and the other is the redundant disk.
If a disk fails we replace it with a new disk and copy the data from the other disk that is still okay.
What is the chance the second disk could fail during the replacement process?
If replacement takes 3 hours = `1/8` of a day = `1/2920` of a year, then the chance that the second disk fails during the replacement is `1/10 times 1/2920`, or one in `29200`. The chance that one of the two disks fails in a given year is `2/10 = 1/5`, so the MTTF for the 2-disk system is 5 years. Hence the mean time to data loss is `5 times 29200 = 146,000` years.
RAID level 1 is to use mirroring.
RAID level 0 is to use striping: writing the `i`th bit or block of our file to disk `i mod n`, where `n` is our number of disks.
RAID 2 and RAID 3 are rarely used. They combine either bit- or byte-level striping with either a Hamming code or a parity bit for error detection.
Parity Blocks
Mirroring is somewhat wasteful.
The RAID level 4 scheme assumes we have `n` data disks and one parity disk.
The `i`th block of the parity disk consists of the parities of the data stored in the `i`th blocks of the other disks.
For example, if disk 1 block `i` had `1010` and disk 2 block `i` had `1101`, then the parity disk would have `0111`.
To read in this set-up, we don't bother with the parity disk.
To write, we need to read the old block and the old parity block, then write the new block and the recalculated parity block.
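A sketch of this write path, with blocks modeled as equal-length bit lists and disks as lists of blocks (all names are illustrative):

```python
def raid4_write(data_disks, parity_disk, d, i, new_block):
    """Write new_block to block i of data disk d: read the old block and
    the old parity block, then write the new block and the new parity.
    New parity = old parity XOR old data XOR new data, bit by bit."""
    old_block = data_disks[d][i]               # read old data block
    old_parity = parity_disk[i]                # read old parity block
    parity_disk[i] = [p ^ o ^ n                # recalculate the parity
                      for p, o, n in zip(old_parity, old_block, new_block)]
    data_disks[d][i] = new_block               # write the new data block
```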
RAID 4 and Failures
If the parity disk crashes, we just replace it.
If one disk fails, then we can replace that disk and compute the value for each of its blocks based on the other disks and the parity blocks. For example, if