Create sorted sublists of `M` blocks each, sorted on the join attribute `Y`, for both `R` and `S`.
Bring the first block of each sublist into a buffer (we assume there are `le M` sublists in total).
Repeatedly find the least `Y` value `y` among the sublist blocks in memory.
Identify tuples in both relations having this `Y` value.
Output the join of all these tuples.
This takes time `3(B(R)+B(S))` provided `B(R)+B(S) lt M^2`.
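Here is a minimal Python sketch of the merge phase, assuming the sorted sublists are already available as in-memory lists of `(y, tuple)` pairs (function and variable names are illustrative, and block buffering is elided):

```python
import heapq
from itertools import product

def merge_join(r_runs, s_runs):
    """Merge phase of the sort-join: r_runs and s_runs are sorted
    sublists of (y, tuple) pairs, sorted on the join attribute y."""
    r = heapq.merge(*r_runs, key=lambda p: p[0])  # merged sorted stream for R
    s = heapq.merge(*s_runs, key=lambda p: p[0])  # merged sorted stream for S

    def groups(stream):
        """Yield (y, [tuples]) for each run of equal y values."""
        cur_y, cur = object(), []
        for y, t in stream:
            if y != cur_y:
                if cur:
                    yield cur_y, cur
                cur_y, cur = y, []
            cur.append(t)
        if cur:
            yield cur_y, cur

    gr, gs = groups(r), groups(s)
    ry, sy = next(gr, None), next(gs, None)
    while ry is not None and sy is not None:
        if ry[0] < sy[0]:          # advance the side with the smaller y
            ry = next(gr, None)
        elif ry[0] > sy[0]:
            sy = next(gs, None)
        else:                      # least common y: output the join of all matches
            for tr, ts in product(ry[1], sy[1]):
                yield ry[0], tr, ts
            ry, sy = next(gr, None), next(gs, None)
```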
Partitioning Relations by Hashing
For the algorithms we'll consider next, it is useful to be able to
read a file quickly into a hash table.
To do this we use one block of main memory as an input buffer for reading the file, and use the other `M-1` blocks as output buffers, one for each bucket of the hash table (so we assume the hash table has at most `M-1` buckets).
We read successive blocks from the file, apply the hash function to each tuple in these blocks, and move each tuple to its bucket's buffer.
When a bucket buffer fills, we write it to disk.
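A sketch of this partitioning step, with Python lists standing in for disk blocks and bucket files (the block size and names are made up for illustration):

```python
def hash_partition(blocks, m_minus_1, hash_fn=hash):
    """Partition a stream of blocks (lists of tuples) into M-1 bucket
    files, keeping one output buffer per bucket in memory."""
    BLOCK_SIZE = 100                                # tuples per block; illustrative
    buffers = [[] for _ in range(m_minus_1)]        # M-1 one-block bucket buffers
    bucket_files = [[] for _ in range(m_minus_1)]   # stand-ins for on-disk files

    for block in blocks:                 # the remaining block reads the input file
        for t in block:
            b = hash_fn(t) % m_minus_1   # hash each tuple to its bucket
            buffers[b].append(t)
            if len(buffers[b]) == BLOCK_SIZE:       # buffer full: write it to disk
                bucket_files[b].append(buffers[b])
                buffers[b] = []
    for b, buf in enumerate(buffers):    # flush partially filled buffers at the end
        if buf:
            bucket_files[b].append(buf)
    return bucket_files
```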
Hash-based Duplicate Elimination
First hash the file into bucket files.
Then eliminate duplicates from each bucket file using the one-pass algorithm and output the results.
The idea is that if a tuple `t` occurs multiple times, every occurrence hashes to the same bucket, so we can eliminate duplicates bucket by bucket.
This works provided the hash function is random-like and `B(R)` is less than `M(M-1)`.
The bound comes from the fact that each bucket ends up with roughly `(B(R))/(M-1)` blocks, and we need to be able to read a whole bucket into memory to do duplicate elimination on it, so we need `(B(R))/(M-1) < M`.
It takes `3 B(R)` time.
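A sketch of the whole algorithm, building on the `hash_partition` sketch above and assuming each bucket file fits in memory:

```python
def hash_dedup(blocks, m_minus_1):
    """Hash-based duplicate elimination: partition first, then run the
    one-pass algorithm (here, a set) on each bucket file independently."""
    for bucket_file in hash_partition(blocks, m_minus_1):
        seen = set()                 # duplicates of a tuple all land in this bucket
        for block in bucket_file:
            for t in block:
                if t not in seen:
                    seen.add(t)
                    yield t          # output each distinct tuple once
```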
Hash-Join
Use only the join attribute as the hash key.
Hash R and S to hash files of buckets.
Pick the hash file whose largest bucket is smaller.
As long as that file's largest bucket is `lt M` blocks, we can read each of its buckets into at most `M-1` buffer blocks and use the remaining block to read from the other hash file.
So in this situation we can do a one-pass join of the bucket 1's, the bucket 2's, etc.
Works provided `min(B(R), B(S)) lt M(M-1)`.
Takes time `3(B(R)+B(S))`.
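A sketch, again reusing `hash_partition`; for simplicity it always builds the in-memory table from `R`'s bucket, whereas the text says to build from whichever file has the smaller largest bucket:

```python
from collections import defaultdict

def hash_join(r_blocks, s_blocks, r_key, s_key, m_minus_1):
    """Hash-join sketch: hash both relations on the join attribute,
    then one-pass join each pair of corresponding buckets."""
    r_files = hash_partition(r_blocks, m_minus_1, lambda t: hash(r_key(t)))
    s_files = hash_partition(s_blocks, m_minus_1, lambda t: hash(s_key(t)))
    for rb, sb in zip(r_files, s_files):     # join bucket 1's, bucket 2's, ...
        table = defaultdict(list)            # build side; assumed < M blocks
        for block in rb:
            for t in block:
                table[r_key(t)].append(t)
        for block in sb:                     # probe side, read block by block
            for u in block:
                for t in table[s_key(u)]:
                    yield t, u
```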
Index-based Selection
Suppose we have a clustered index on `a` and want to compute
`sigma_{a=v}(R)`.
Then the number of I/Os will be roughly `(B(R))/(V(R,a))`.
Here we are ignoring the cost of the index, roundoff errors,
and that the blocks of `R` might not be completely filled with tuples from `R`.
If the index on `R` is non-clustered, then the cost will be approximately `(T(R))/(V(R,a))`, since each matching tuple may sit on a different block. So it is more likely that a table scan will be faster.
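As an illustration with made-up numbers: if `B(R)=1000`, `T(R)=20000`, and `V(R,a)=100`, the clustered estimate is `1000/100 = 10` I/Os and the non-clustered estimate is `20000/100 = 200` I/Os, both better than the roughly `1000` I/Os of a table scan. But if `V(R,a)=10`, the non-clustered estimate becomes `2000` I/Os and the scan wins.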
Index-based Join
Examine each block of `R`. For each tuple `t` in that block,
look up all tuples of `S` that join with it and output the joined tuples.
The cost will be approximately `(T(R) cdot T(S))/(V(S,Y))`, where `Y` is the join
attribute, if the index on `S` is non-clustered, and `(T(R) cdot B(S))/(V(S,Y))` if it is clustered.
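A sketch in Python, where `s_index` is assumed to be some dict-like index mapping a join-attribute value to the matching tuples of `S`:

```python
def index_join(r_blocks, s_index, r_key):
    """Index-based join sketch: for each tuple of R, probe the index on
    S's join attribute Y and output the joined pairs."""
    for block in r_blocks:                       # examine each block of R
        for t in block:                          # for each tuple t in that block
            for u in s_index.get(r_key(t), []):  # look up everything joining with t
                yield t, u
```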
Quiz
Which of the following is true?
It is impossible for a table to be in more than one tablespace.
Sorting a table cannot be done if the table cannot fit in RAM.
In the map reduce paradigm, a combiner runs the reduce job on the local machine for
each sorted file, producing sorted reduced files which can then be sent to their respective machines.
Obtaining Estimates for Size Parameters
The DBMS stores values of `V(R,A)` in its catalog.
This is not always updated, but can be explicitly updated by the DBA.
For example, in Oracle (mainly up to 10g) and MySQL, we can use
ANALYZE TABLE some_table COMPUTE STATISTICS;
Oracle also has a more modern dbms_stats utility for this purpose.
In Postgres and sqlite, the command is just ANALYZE or ANALYZE table_name.
The DBMS may compute a histogram rather than just `V(R,A)`: that is, a plot of ranges of values of an attribute versus the number of tuples in each range.
Usually such a histogram has equal-width buckets, but one can also compute equal-height histograms and most-frequent-value statistics.
Most of these statistics can be computed by a table scan.
Incremental Computation of Statistics
Once statistics have been computed, we want to be able to maintain them without having to recompute everything.
For example, to maintain `T(R)`, we can just add +1 on an insert and -1 on a delete.
If there is a B-tree index on any attribute of `R`, we could also estimate `T(R)` using the number of leaves in the B-tree, assuming each is 3/4 full. (This estimate doesn't have to change on each insert.)
If there is an index on `a`, then `V(R,a)` can also be maintained as we insert into and delete from the index.
In the particular case that `a` is a key, then `T(R)=V(R,a)`.
In the no-index case, one might use a rudimentary structure that holds each value of `a`.
One might also sample data and assume the data is according to some distribution: uniform, Zipfian, etc.
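A toy sketch of such incremental maintenance for the no-index case (the class and method names are made up):

```python
from collections import Counter

class TableStats:
    """Incrementally maintained statistics for one attribute a of R,
    using a rudimentary structure that holds each value of a."""
    def __init__(self):
        self.t_r = 0               # T(R): number of tuples
        self.counts = Counter()    # occurrences of each value of a

    def insert(self, a_value):
        self.t_r += 1              # +1 on an insert
        self.counts[a_value] += 1

    def delete(self, a_value):
        self.t_r -= 1              # -1 on a delete
        self.counts[a_value] -= 1
        if self.counts[a_value] == 0:
            del self.counts[a_value]

    def v_r_a(self):
        return len(self.counts)    # V(R,a): number of distinct values
```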
Incremental Indexing
Since we are on the topic of incremental algorithms, we next consider log-structured merge trees.
The sparse secondary indexes we discussed earlier tend to be both smaller than B-trees and involve more contiguous block accesses on the disk.
So if we have such an index, lookups can often be faster than with a B-tree.
The drawback, though, is maintaining it when there are many inserts and deletes.
So B-trees are a better mutable storage structure and sparse indexes are a better immutable storage structure.
In the case where we know the changes to the tables will usually be append-only inserts, we can employ an incremental approach to sparse indexing.
Building a Log-Structured Merge Tree
First, at the file level, we assume the file is organized into partitions. We maintain a partition in memory to which new inserts are written. When this partition fills, we write it to disk at the end of the file.
The log-structured merge tree index or file is computed simultaneously with this process.
To keep things simple we consider the index. Such an index is organized into tiers.
When a partition is written to disk we create an index for just this partition and add it to the set of tier 0 indexes.
If we have `M-1` tier 0 indexes, we use our merge operation that we discussed for sorting to produce a new tier 1 index for the `M-1` partitions these tier 0 indexes corresponded to.
Similarly, if we have `M-1` tier 1 indexes, we use our merge operation that we discussed for sorting to produce a new tier 2 index for these `(M-1)^2` partitions, and so on.
Index lookup in this data structure involves doing binary search in the index files of each of the at most `log_{M-1} B(R)` many tiers.
In the case that the key is also increasing as we insert, we can often predict which particular tier file to look at directly, and then only have to do binary search in it.
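A toy sketch of the tiering and lookup logic, with sorted Python lists standing in for the per-partition index files (details like record pointers are elided):

```python
import bisect
import heapq

class LSMIndex:
    """Toy log-structured merge index: tier 0 holds indexes of single
    partitions; merging M-1 tier-i indexes yields one tier-(i+1) index."""
    def __init__(self, fanout):
        self.fanout = fanout            # plays the role of M-1
        self.tiers = []                 # tiers[i] = list of sorted key runs

    def add_partition_index(self, keys):
        self._add(0, sorted(keys))      # index a freshly written partition

    def _add(self, tier, run):
        while len(self.tiers) <= tier:
            self.tiers.append([])
        self.tiers[tier].append(run)
        if len(self.tiers[tier]) == self.fanout:    # tier full: merge upward
            merged = list(heapq.merge(*self.tiers[tier]))
            self.tiers[tier] = []
            self._add(tier + 1, merged)

    def lookup(self, key):
        for tier in self.tiers:         # at most log_{M-1} B(R) tiers
            for run in tier:
                i = bisect.bisect_left(run, key)    # binary search in this file
                if i < len(run) and run[i] == key:
                    return True
        return False
```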
Enterprise Storage Subsystems and Business Continuity
We next look at the storage hardware aspect of physical database organization.
First, we are going to look at disk arrays and RAID.
Then we will look at network storage techniques such as SANs and NAS.
Finally, we will look at how these techniques can be applied to business continuity.
Intermittent Failures
There are four main kinds of disk failures: intermittent failures, media decay, write failures, and disk crashes.
We now discuss each of these in more detail as well as how to mitigate against them.
Disk sectors are usually stored with some extra
bits that enable us to tell if the sector's content is
correct or not on a read or a write.
We model a disk read as returning a pair `(w,s)`
where `w` is the data and `s` is the status as to
whether the read was okay.
In an intermittent failure, we may get several bad
`s` values before a good one.
Usually, we have some max tries (say 100) before we permanently fail.
For write requests, after the write, we can do a read to see if the `s` is
correct, and hence, if the write was successful.
This read will typically take an additional disk revolution.
If the write was not successful, we can repeat it.
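A small sketch of this retry logic, where `read_sector` stands for some hypothetical driver call returning a `(w, s)` pair:

```python
MAX_TRIES = 100   # illustrative cap before we report a permanent failure

def reliable_read(read_sector):
    """Retry an intermittently failing read until the status s
    reports a good read, up to MAX_TRIES attempts."""
    for _ in range(MAX_TRIES):
        w, s = read_sector()
        if s:                    # good status: the data w is trusted
            return w
    raise IOError("permanent read failure after %d tries" % MAX_TRIES)
```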
Checksums
What should we use for `s`?
A common thing to do is to take the parity of
the bits in the sector for `s`. That is, `s` is 1 if the
number of `1` bits in the sector is odd and `0`
otherwise.
For example, if `w=0101100` then `s=1`. If
`w=1111000` then `s=0`.
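This parity computation is a one-liner in Python; the asserts reproduce the two examples above:

```python
def parity(w):
    """Status bit s for sector contents w (a string of '0'/'1' bits):
    s is 1 if the number of 1 bits is odd and 0 otherwise."""
    return w.count("1") % 2

assert parity("0101100") == 1   # three 1 bits: odd, so s = 1
assert parity("1111000") == 0   # four 1 bits: even, so s = 0
```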
Stable Storage
We want a policy in effect ensuring that, when we write to disk,
either the sector is correctly written, or any data that
was in the sector before the write can be safely recovered.
This policy is called stable storage.
Implementing Stable Storage
We pair sectors, with each pair representing the contents `X` of one sector.
Call the two copies the left and right copies and denote them `X_L` and `X_R`.
To write:
Write `X` into `X_L` and check that `s` is good. If not, repeat the write. If it fails repeatedly, assume `X_L` is corrupted and replace it with a different sector. Then repeat this process for `X_R`.
To read:
Try to read `X` from `X_L` and check if `s` is good. If yes, then that is `X`; if not, repeat the read. If it fails repeatedly, try to get the value from `X_R`.
The above scheme works well for media and write failures.
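A sketch of both policies, where `write_sector(side, x)` and `read_sector(side)` are hypothetical driver calls (sector replacement after repeated write failures is elided):

```python
def stable_write(x, write_sector, max_tries=100):
    """Write x into X_L, verifying the status s after each attempt,
    then repeat the whole process for X_R."""
    for side in ("L", "R"):
        for _ in range(max_tries):
            if write_sector(side, x):     # returns True when s is good
                break
        else:                             # repeated failure: sector corrupted
            raise IOError("copy X_%s appears corrupted; replace the sector" % side)

def stable_read(read_sector, max_tries=100):
    """Try to read X from X_L; on repeated failure fall back to X_R."""
    for side in ("L", "R"):
        for _ in range(max_tries):
            w, s = read_sector(side)
            if s:                         # good status: this is X
                return w
    raise IOError("both copies unreadable")
```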
Failure Model for Disks
We now want to consider RAID (redundant arrays of independent disks) schemes that work well for disk crashes.
The average time until a disk crash is called the mean time to failure (MTTF). For modern disks this is about 10 years.
We model this by assuming roughly 10% of the disks fail in a given year.
We want to use multiple disks to make the mean time to data loss to be a much bigger number than the MTTF.
Mirroring as a Redundancy Technique
Suppose you have two disks with the same data.
One is called the data disk and the other is the redundant disk.
If a disk fails we replace it with a new disk and copy the data from the other disk that is still okay.
What is the chance the second disk could fail during the replacement process?
If replacement takes 3 hours = `1/8` of a day = `1/2920` of a year, then the chance that the second disk fails during the replacement is `1/10 times 1/2920`, or one in `29200`. The chance that one of the two disks fails in a given year is `2/10 = 1/5`, so the MTTF for the 2-disk system is 5 years. Hence the mean time to data loss is `5 times 29200 = 146,000` years.
RAID level 1 is to use mirroring.
RAID level 0 is to use striping: writing the `i`th bit or block of our file to disk `i mod n`, where `n` is our number of disks.
RAID 2 and RAID 3 are rarely used. They combine either bit- or byte-level striping with either a Hamming code or a parity bit for error detection.
Parity Blocks
Mirroring is somewhat wasteful.
The RAID level 4 scheme assumes we have `n` data disks and one parity disk.
The `i`th block of the parity disk consists of the parities of the data stored in the `i`th blocks of the other disks.
For example, if disk 1 block `i` had `1010` and disk 2 block `i` had `1101`, then the parity disk would have `0111`.
To read in this set-up, we don't bother with the parity disk.
To write, we need to read the old block and the old parity block, then write the new block and the recalculated parity block.
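A sketch of this write path, with blocks modeled as equal-length bit lists and disks as lists of blocks (all names are illustrative):

```python
def raid4_write(data_disks, parity_disk, d, i, new_block):
    """Write new_block to block i of data disk d: read the old block and
    the old parity block, then write the new block and the new parity.
    New parity = old parity XOR old data XOR new data, bit by bit."""
    old_block = data_disks[d][i]               # read old data block
    old_parity = parity_disk[i]                # read old parity block
    parity_disk[i] = [p ^ o ^ n                # recalculate the parity
                      for p, o, n in zip(old_parity, old_block, new_block)]
    data_disks[d][i] = new_block               # write the new data block
```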
RAID 4 and Failures
If the parity disk crashes, we just replace it.
If one disk fails, then we can replace that disk and compute the value for each of its blocks based on the other disks and the parity blocks. For example, if