On Wednesday, we said we would begin this semester by talking about how database management systems (DBMSs) are implemented.
We began by looking at the memory hierarchy for computers going from the microprocessor registers and caches up through tertiary
storage.
With respect to DBMS implementation, we are concerned with the capacity, speed, and volatility of the different levels of this hierarchy.
So far, we have considered register, cache, and RAM memory and we have compared sizes of these memories with respect to commonly encountered
file sizes and famous data sets.
We next look at virtual memory and secondary storage.
Virtual Memory
Programs make use of a virtual addressing scheme for storing data during program execution.
The goal is to fit as much data in memory as possible at one time. If a program needs to look up some data that is not currently in memory, the operating system reads the relevant block from disk (and may write a block back to disk to make room).
For older 32 bit processors, one word can store up to `2^{32}` different numbers.
For modern 64 bit processors, one word can store up to `2^{64}` different numbers.
This word can be used as an address for the location of a memory word.
Somewhat counter-intuitively, software in use often changes much more slowly than hardware, so many operating systems, and much software written for them, still make legacy 32 bit assumptions. For this reason (and because the numbers are smaller), we'll use 32 bits as our example.
For a traditional 32 bit machine, a process would pretend it had 4GiB (the number of bytes addressable with a single word) of main memory (even if one had less) and would swap blocks of memory into and out of main memory to disk corresponding to data that was immediately needed.
Swapping between memory and disk is done using page-sized blocks (typically 4KiB, sometimes up to 64KiB).
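The 32 bit arithmetic above can be sketched in a few lines (the 4KiB page size is an assumed common value; real systems vary):

```python
# Sanity-check the 32-bit figures above (assumes byte-addressable
# memory and a common 4 KiB page size).
WORD_BITS = 32
PAGE_SIZE = 4 * 1024                      # 4 KiB page, an assumption

address_space = 2 ** WORD_BITS            # bytes addressable with one 32-bit word
num_pages = address_space // PAGE_SIZE    # pages in the virtual address space

print(address_space)                      # 4294967296 bytes = 4 GiB
print(num_pages)                          # 1048576 pages = 2^20
```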
Volatile/Non-Volatile Memory
In our memory hierarchy, we have indicated some layers are volatile and others are non-volatile.
Roughly, volatile memory is memory whose contents cannot be expected to survive when the computer's power goes off.
On the other hand, non-volatile memory is expected to persist even after a sustained power outage.
For most DBMS operations, we view the completion of the operation as occurring when the operation's effects have been committed to non-volatile memory (secondary storage).
Secondary Storage
Secondary Storage is any kind of memory device other than main memory which is non-volatile. It is often also slower, but more capacious than main memory.
Usually, secondary storage means a hard disk drive (HDD) (these store data by magnetizing portions of a rotating disk) or a solid-state disk (SSD) (these are based on NAND flash and use transistors and a charge trap to store data).
Raw read speeds for RAM (high single-digit GB/s for DDR5) are about 10x faster than SSDs (around 1 GB/s), which are in turn about 10x faster than HDDs (low hundreds of MB/s).
In practice, the data we want is typically not laid out contiguously on an HDD, so effective HDD speeds are much lower.
On the other hand, flash-based memory suffers from memory wear and may only be overwritten (program-erase cycles) some fixed number of times (on the order of 100,000).
One confusing thing about HDD specs is that they often report the SATA bus speed, which is much higher (such as 6Gb/s -- notice b, bits, not B, bytes). Slow HDD speeds are somewhat offset by their RAM buffers, which are often a few hundred MBs.
Currently (2023), a 16TB single-drive HDD (fits nicely into a safety deposit box) goes for about $170 (on sale; $270 not on sale, on Amazon), and an 8TB SSD around $640. I.e., SSDs are more than 4x the cost/TB.
Data on disks is organized in blocks.
Disk reads and disk writes (Disk I/Os) operate on the block level.
Unlike HDDs, on SSDs an erased block starts as all 1's; changing 1's to 0's can be done at the bit level, but resetting 0's to 1's must be done at the block level, adding to SSD complexity.
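A toy model makes this asymmetry concrete (this is an illustrative sketch of the 1-to-0 rule, not real flash firmware):

```python
# Toy model of NAND flash program/erase rules: after an erase, every bit
# in the block is 1; a "program" operation can only flip bits 1 -> 0.
BLOCK_BITS = 8                    # a tiny 8-bit "block" for illustration
ERASED = (1 << BLOCK_BITS) - 1    # 0b11111111: what an erase produces

def can_program(old: int, new: int) -> bool:
    """A new value is writable in place only if no bit goes 0 -> 1."""
    return (old & new) == new

# Clearing 1's is fine on a freshly erased block:
assert can_program(ERASED, 0b10110000)
# But turning a 0 back into a 1 requires erasing the whole block first:
assert not can_program(0b10110000, 0b11110000)
```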
Data read off an HDD is usually temporarily stored in a buffer (the drive's RAM cache, or sometimes an SSD cache) before transfer to main memory.
Tertiary Storage - Reliability
Tertiary Storage is slower (latency in seconds rather than milliseconds), but more capacious and reliable, than secondary storage.
Once written but never accessed, data on SSDs and HDDs has a lifespan measured in years. Here we are talking about cold-storage workloads, not day-to-day laptop usage. As SSDs have no moving parts, they tend not to physically break and consume less power.
They do have a maximum number of read/write cycles.
Another issue is how long the charge traps persist without losing their bits. This is especially an issue if the SSD suffers multiple power outages or is run too hot.
For this reason, SSD reliability is still somewhat of an open question. Studies from Backblaze seem to indicate SSDs are more reliable than HDDs, but the average age of the SSDs in the study was 1-2 years, versus 6 years or so for the HDDs.
The technology for data recovery is more mature for HDDs. Finally, tapes can last more than a decade.
Examples of tertiary storage are tape drives with robotic tape switchers or optical-disk juke boxes.
How HDDs Work
A given platter is broken into tracks, which are further
broken into arcs called sectors.
Disk Controller
A disk controller is a small processor in an HDD capable of:
Controlling the actuator arm that moves the head assembly.
Selecting a surface to read/write from.
Transferring bits between the desired sector and the computer's memory.
Disk Storage Characteristics
Rotation speed: usually 5400 RPM, 7200 RPM, or 10,000 RPM (faster is better).
Number of Platters: 2 and 3 platter drives are common.
Number of tracks per surface: low 100,000s.
Number of bytes per track: between 1 and 10MB.
If you know drive size, tracks/surface, and number of platters, you can calculate bytes/track.
For example, a 4TB drive with 2 platters has 4 surfaces, so 1TB/surface. If each surface has 200,000 tracks, then the
number of bytes/track = 1TB/200,000 tracks = `(10^{12}B)/(2 times 10^5 mbox{tracks}) = 5 times 10^6 B = 5MB`.
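The arithmetic above can be packaged as a small helper (all figures are the hypothetical ones from the example):

```python
# Bytes/track from drive capacity, platter count, and tracks/surface,
# as in the 4 TB, 2-platter example above.
def bytes_per_track(drive_bytes: float, platters: int, tracks_per_surface: int) -> float:
    surfaces = 2 * platters                 # both sides of each platter hold data
    bytes_per_surface = drive_bytes / surfaces
    return bytes_per_surface / tracks_per_surface

print(bytes_per_track(4e12, 2, 200_000))    # 5000000.0 bytes = 5 MB/track
```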
Quiz
Which of the following statements is true?
Insurance points in this class are directly added to your final score.
Computer main memory is typically faster than Cache Memory.
A gibibyte is larger than a gigabyte.
Disk Access Characteristics Review
The time to access a sector on the disk and read it into RAM from an HDD is split into
the following steps:
Processor/controller time
Time to position head to correct cylinder (seek time).
5-10mS
Time to rotate head over a sector (rotational latency) - this depends only on rotational speed.
For example, for a 7200 RPM drive, Time/round = 1/(7200 rounds/min) = 60/7200 S/round = 8.3mS. Average rotational latency is half this,
so about 4.2 mS.
Transfer time - time for the whole sector, and the gaps between adjacent sectors, to rotate under the head.
Let's use these steps in an example.
Disk Access Characteristics Example
Suppose we want to read a 16384 byte block from a disk (the book calls it the Megatron 747) that rotates at 7200 RPM and has a total of 65536 tracks/surface with 256 sectors/track.
To move between cylinders takes 1mS to start and stop, plus an additional 1mS for every 4000 cylinders traveled, giving a time of 1.00025mS to move one track over. This gives
a total time to move from the innermost to the outermost track of 1mS + 65536/4000 mS = 17.38 mS.
10% of a track consists of gaps between sectors. I.e., 360/10 = 36 degrees of arc are gaps.
The remaining 324 degrees of arc are sectors.
So if there are 4096bytes in a sector, the block we want fills four sectors.
To read 4 sectors, the head must pass over 4 sectors and 3 gaps. Hence,
the head must pass over `(36 \times 3)/256 + (324 \times 4)/256 = 5.48` degrees of arc to read the data.
Thus, to read the whole block takes `8.3mS \times 5.48/360 = .13mS`. This is the best case and assumes the head was at the start of the block.
In the worst case, we have to add to this the time to go from innermost to outermost track, plus wait one full revolution to get to the data. 17.38 + 8.33 +0.13mS = 25.84mS.
For the average case, the .13mS transfer time remains the same, the rotational latency is halved (4.17mS),
and the seek time is 1mS plus the time to traverse roughly 1/3 of the cylinders: 1 + (65536/3)/4000 = 6.46mS. This gives a total of .13 + 4.17 + 6.46 = 10.76 mS.
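The best-, worst-, and average-case figures can be recomputed in a short script (all parameters come from the running Megatron 747 example):

```python
# Recompute the Megatron 747 access times: 7200 RPM, 65,536 cylinders,
# 256 sectors/track, 4,096-byte sectors (so a 16,384-byte block = 4 sectors),
# 10% of each track is gaps.
RPM = 7200
CYLINDERS = 65536
SECTORS_PER_TRACK = 256

rev_ms = 60_000 / RPM                          # ~8.33 ms per revolution

# Transfer time: the head passes over 4 sectors and the 3 gaps between them.
gap_deg = 36 / SECTORS_PER_TRACK               # each gap's share of 36 degrees
sector_deg = 324 / SECTORS_PER_TRACK           # each sector's share of 324 degrees
transfer_ms = rev_ms * (3 * gap_deg + 4 * sector_deg) / 360   # ~0.13 ms

max_seek_ms = 1 + CYLINDERS / 4000             # innermost to outermost: ~17.38 ms
avg_seek_ms = 1 + (CYLINDERS / 3) / 4000       # ~6.46 ms
avg_rot_ms = rev_ms / 2                        # ~4.17 ms

best = transfer_ms                             # head already at the block: ~0.13 ms
worst = max_seek_ms + rev_ms + transfer_ms     # ~25.84 ms
avg = avg_seek_ms + avg_rot_ms + transfer_ms   # ~10.76 ms
print(round(best, 2), round(worst, 2), round(avg, 2))
```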
Speeding Up Access to Secondary Storage
If you only have one HDD in use by your computer, the time for the DBMS to get a block it wants will typically be larger than the average access time just calculated, because other processes besides the DBMS will also be making requests for blocks.
So one also has to consider disk I/O scheduling latency.
There are several techniques one can do to decrease the average disk access time, and thus increase the number of disks accesses/second that can be handled (throughput):
Place blocks that are accessed together on the same cylinder.
Use several smaller disks rather than a larger one. Each has an independent head, so can increase the number of block accesses/S.
"Mirror" a disk: Have two or more copies of the same data on different disks. This both allows several blocks to be accessed at the same time, and it helps protect against disk failures.
Use a better disk scheduling algorithm (in OS or DBMS) to choose which blocks should be read or written next.
Prefetch blocks to main memory in anticipation of later use.
I/O Model of Computation
We would like to speed up the performance of our DBMS when we need to use HDDs or SSDs.
In order to do this we need to fix a performance model that captures the idea that disk is much slower than RAM.
In trying to estimate the performance of a DBMS we assume:
The database is too large to fit in RAM.
That key, frequently used parts of the database are buffered in RAM.
That there are many users (so the disk controller needs to queue requests).
That disk requests will be for data at random locations on the drive.
We will also assume the I/O Model of Computation:
The time required to move a block to or from main memory is much greater than any time spent manipulating the data in memory, so the latter time will be ignored in our estimates.
I/O Model Example
Suppose we make a DB query to a relation R for a record with key value `k`.
In the I/O computation model it is generally more useful to have an index on R to identify the disk block on which the tuple with `k` lives than to have the index also tell us where on the block this tuple appears.
To see this for the Megatron 747 drive of our earlier example, notice its time to read a block was 11mS.
In 11mS, a modern CPU can execute millions of instructions, so in particular it can do a linear search within a 16KiB block in around 1/100 of this time.
So the cost of looking within a block is negligible compared to the I/O time.
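As a rough, illustrative check (timings vary by machine; this is a sketch, not a benchmark), scanning a 16KiB block already in RAM takes a tiny fraction of one disk I/O:

```python
# Time a linear scan of a 16 KiB in-memory block and compare it to the
# ~11 ms Megatron 747 block read from the example above.
import time

block = bytes(16 * 1024)      # a 16 KiB block of zero bytes, standing in for a disk block
target = 0xFF                 # a byte value that does not occur in the block

start = time.perf_counter()
found = target in block       # linear scan of the whole block
elapsed_ms = (time.perf_counter() - start) * 1000

print(found, elapsed_ms)      # False, and typically well under 0.1 ms
```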
Organizing Data by Cylinder
Seek time takes on average about half the time used to find a block on the disk.
By storing data, such as relations, that are likely to be accessed together in adjacent blocks we cut down on this time.
Example:
Suppose relation `R` is stored in 1024 disk blocks on our Megatron 747 drive.
Suppose we need to access all of its tuples (say we want to compute a sum of something like salary).
If the blocks were distributed at random around the drive, each block would take on average 10.76 mS to access, for a total time of `1024 times 10.76`mS, or about 11S, to access all blocks.
The book says the Megatron 747 has 8 platters, hence 16 surfaces, for a total of 16 tracks/cylinder.
As there are 256 sectors/track and 4 sectors/block, there are 64 blocks in a track, giving a total of 1024 blocks/cylinder.
So it is possible to hold the whole relation on one cylinder.
After an average seek time of 6.46mS to find the cylinder, and 16 revolutions to completely read all of the blocks of the relation (we can only read 1 track at a time), we'd be done. This gives a total time of `(6.46 + 16 times 8.33) approx 140 mS`.
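The cylinder-resident read can be recomputed directly (all figures come from the Megatron 747 example: 16 tracks/cylinder, 8.33mS per revolution, 6.46mS average seek):

```python
# Time to read a whole 1024-block relation stored on one cylinder:
# one average seek, then one revolution per track.
TRACKS_PER_CYLINDER = 16      # 8 platters -> 16 surfaces
REV_MS = 60_000 / 7200        # ~8.33 ms per revolution
AVG_SEEK_MS = 6.46            # average seek from the earlier calculation

total_ms = AVG_SEEK_MS + TRACKS_PER_CYLINDER * REV_MS
print(round(total_ms, 1))     # ~139.8 ms, versus ~11 s for random placement
```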
Using Multiple Disks
Suppose we replace one disk with `n` disks, each operating independently.
Provided the bus and main memory handle the higher transfer rates, this set-up should be able to read/write `n` times as much data/S.
However, the time to access any particular block will be unchanged.
If we want to speed up access to large database objects/relations, we can use a technique called striping to spread the object's blocks across the different disks.
Example:
Suppose we have the same 1024 block relation of the previous slide, but are now using 4 disks.
We can stripe the blocks as follows: on the first disk put blocks 1, 5, 9, etc; on the second disk 2, 6, 10, etc; on the third disk blocks 3, 7, 11, etc; and on the fourth disk blocks 4, 8, 12 etc.
Each disk will have 256 blocks, which can be stored on 4 tracks of a cylinder, so reading the whole relation will now take `(6.46 + 4 times 8.33) = 39.8 mS`.
In practice, there is some overhead (as it is highly likely at least one disk will have a longer-than-average seek time), so we would get a speed-up closer to 3 than 4 on average.
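The round-robin striping scheme above amounts to a simple mapping (blocks numbered from 1, as in the example):

```python
# Which disk holds a given block under round-robin striping?
def stripe_disk(block_no: int, n_disks: int) -> int:
    """Return the disk (1-based) holding block_no (1-based)."""
    return (block_no - 1) % n_disks + 1

assert stripe_disk(1, 4) == 1      # blocks 1, 5, 9, ... on disk 1
assert stripe_disk(2, 4) == 2      # blocks 2, 6, 10, ... on disk 2
assert stripe_disk(5, 4) == 1
assert stripe_disk(1024, 4) == 4   # the last block of the relation
```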
Mirroring
On the previous slide, we used multiple disks to hold distinct data from a DB.
It can also make sense to have two or more disks which hold the same data.
Such disks are said to be mirrors of each other.
One use case is if one disk fails, the other will still be available as a back up.
Another advantage is that with `n` disks, we can read `n` blocks from a relation in about the same amount of time as one block in the single-disk setting.
Even better, the disk controller can assign the same read request to any of the disks, so we can choose the disk whose head is closest to the block we want next, getting a further speed-up.
Although read speeds will be faster with mirroring, write speeds will not have a speed-up as each new block must be written to all disks.