Finish Transactions - Parallel and Distributed Databases
CS157b
Chris Pollett
Apr 24, 2023
Outline
Deadlocks
Long Running Transactions
Models of Parallelism
Parallel Algorithms on Relations
Quiz
Map Reduce
Distributed Databases
Introduction
On Wednesday, we were looking at the interplay of concurrency control techniques and recovery management.
We identified one bad situation that can occur with two-phase locking: cascading rollbacks - where aborting one transaction causes
a second one to abort.
If this happens and there is a crash, it could lead to uncommitted data being on disk.
This problem does not occur for timestamp scheduling.
We said to avoid the problem we can modify 2PL to strict 2PL - don't release any locks until a transaction either commits or aborts.
We said a schedule is recoverable if, for all of its transactions, a transaction commits only after each transaction from which it has read commits.
We gave examples showing that there are recoverable schedules that are not serializable and vice versa.
We said a schedule is ACR (avoids cascading rollbacks) if transactions may read only values written by committed transactions.
Finally, we argued strict 2PL implies (via 2PL implying) serializability and strict 2PL implies ACR implies recoverable.
At the end of Wednesday, we looked at how rollbacks should be handled with respect to block and row level elements.
We start today by finishing up one remaining issue with locking: deadlocks.
Dealing with Deadlocks
A deadlock occurs when every transaction `x` in a set `S` of at least two transactions is waiting for a lock held by some other transaction `y in S` where `x ne y`.
Deadlocking can occur even if we are using 2PL with shared, upgrade, and exclusive locks:
`SL_1(A)`, `R_1(A)`, `UL_2(A)`, `R_2(A)`,`SL_2(B)`, `R_2(B)`, `UL_1(B)`, `R_1(B)`, `XL_1(B)` denied, `XL_2(A)` denied
Some common ways to deal with deadlocks are:
Use a timeout: a transaction waits for a certain period for the lock and if it doesn't get it, it aborts and rolls back. Other waiting transactions might now have a chance to complete before their timeouts.
Use a wait-for graph (Holt 1972): A wait-for graph has as its nodes the active transactions and we have an edge from `T_i` to `T_j` if `T_j` has a lock on an item that prevents `T_i` from getting the lock it needs. If this graph has a cycle, then we are in a deadlock situation. To avoid deadlocks, if a transaction requests a lock that would make the wait-for graph have a cycle, we roll it back.
Order the DB elements: We assume the database elements are ordered in some fixed order `X_1, X_2, ..., X_n`. We then require that all transactions must get locks on elements starting from the element with the lowest number they need to the element with the highest number they need. For example, `T` is allowed to request a lock on `X_3` then request a lock on `X_5` as `3 lt 5`, but it wouldn't then be allowed to request a lock on `X_1` as `3 gt 1`.
Use Timestamps (Rosenkrantz, Stearns, Lewis, 1978): Give each transaction a deadlock timestamp corresponding to when it started. If a transaction is rolled back and restarted, the deadlock timestamp is not changed. We then use one of the following two policies for all transactions:
Wait-Die: If `T` is older than `U` and needs a lock held by `U`, then `T` is allowed to wait. On the other hand, if `U` is older than `T`, then `T` is rolled back.
Wound-Wait: If `T` is older than `U`, then unless `U` has already finished and released its locks, we force `U` to be rolled back. On the other hand, if `U` is older than `T`, then `T` waits for the locks held by `U`.
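The two timestamp policies above can be sketched as follows. The function names and the convention that a smaller deadlock timestamp means an older transaction are assumptions for illustration; a real lock manager would act on these decisions rather than return them.

```python
# Sketch of the Wait-Die and Wound-Wait deadlock-avoidance policies.
# Assumed convention: a smaller deadlock timestamp = an older transaction.

def wait_die(T_ts, U_ts):
    """T (timestamp T_ts) requests a lock held by U (timestamp U_ts)."""
    if T_ts < U_ts:
        return "T waits"        # T is older, so it is allowed to wait
    return "roll back T"        # T is younger, so it dies (keeping its timestamp)

def wound_wait(T_ts, U_ts):
    """The same lock request handled under Wound-Wait."""
    if T_ts < U_ts:
        return "roll back U"    # T is older, so it wounds U
    return "T waits"            # T is younger, so it waits for U
```

Note that under either policy the older transaction is never rolled back, so a transaction that keeps its original timestamp after a restart eventually becomes the oldest and cannot starve.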
Comparing Deadlock Handling Methods
Using either timeouts or timestamps tends to be the easiest to implement.
Using Wait-for graphs minimizes the number of rolled back transactions.
Of the remaining options, timestamps come closest to the low number of rollbacks achieved by wait-for graphs.
Quiz
Which of the following statements is true?
Serializable schedules are always recoverable.
It is impossible to have a write-too-late situation when multi-version timestamping is used for concurrency control.
A DBMS handling a transaction at a serializable isolation might need to make use of IX or IS locks.
Long Duration Transactions (Gray 1981)
A long transaction is a transaction that takes too long to be allowed to hold locks that another transaction needs.
Depending on the context, a long transaction might take seconds, minutes, hours, or even days to complete.
Some example applications that might have long transactions are:
Conventional DBMS Applications. For example, most banking transactions are short (e.g., an ATM withdrawal). However, you might need to have the occasional long transaction that does something such as examine all accounts to verify a total balance is correct.
Design Systems. We might be using a database as a version control system for a CAD application like the design of a car. Users can check out components, modify them, and check them back in. We wouldn't want check outs to be equivalent to getting an exclusive lock, as it might take a while for someone to modify a part, and during this time no one could access it.
Workflow Systems. These systems involve a collection of processes, many executed by software alone, but with some requiring human interaction. These transactions can involve things like billing which might require various office paperwork and can take days to complete. We wouldn't want other transactions to be locked out while these operations were ongoing.
Example Workflow
Notice each of the operations above could involve an action where data is stored in a database.
Some of the operations involve humans and may be slow, so we wouldn't want to use exclusive locks for the duration of the whole transaction.
Sagas
The example of the last slide suggests one way to handle long-duration transactions called sagas.
A saga is a collection of actions that taken together form a long-duration transaction. More specifically, it consists of:
A collection of actions
A directed graph whose nodes are either actions or the terminal nodes Abort and Complete.
A start node which is the action that begins the transaction.
The graph on the previous slide would be an example saga. The path `A_1A_2` is an example of an abort path. The path `A_1A_2A_3A_4A_6A_5` is an example of a complete path where the transaction succeeds.
Concurrency Control for Sagas
Two facilities manage concurrency control for sagas:
We view each action as a short transaction and use conventional concurrency control techniques to implement it.
The overall transaction, which can mean any path from the start node to an Abort or Complete node, is managed by using "compensating transactions".
For each action `A`, we insist there is an action `A^{-1}` which we can use to undo that action in the event that whole long-duration
transaction needs to be aborted. Here `A^{-1}` is called a compensating transaction.
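As a rough sketch of how this could be coordinated (the executor below and its representation of actions and compensations are hypothetical, not part of the saga model itself):

```python
# Minimal sketch of saga execution along a single path: run each action
# as its own short transaction; on failure, run the compensating
# transactions A^{-1} of the completed actions in reverse order.

def run_saga(actions, compensations):
    """actions: list of (name, callable); compensations: dict name -> callable."""
    done = []
    for name, action in actions:
        try:
            action()                        # each action is a short transaction
            done.append(name)
        except Exception:
            for finished in reversed(done):
                compensations[finished]()   # undo with the compensating transaction
            return ("aborted", done)
    return ("complete", done)
```

The compensations run in the reverse of the order the actions committed, mirroring how a saga's abort path unwinds the work already done without ever holding long-duration locks.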
Parallel and Distributed Databases
Today, we are going to look at parallel and distributed databases.
In terms of DBMSs, a parallel database can perform several operations at the same time and we assume that the cost of communication between different processors carrying out these operations is low compared to a disk IO.
This might be because it takes advantage of the multiple cores in a CPU or because it can take advantage of the many hundreds or thousands of cores in a GPU.
Oracle's Cloud product makes use of GPUs, IBM's DB2 BLU makes use of GPUs, PG-Strom is a GPU-based extension to Postgres, and Brytlyt and MariaDB have a partnership to bring GPU processing to MariaDB.
There are also many new GPU-based DBMSs: SQream DB, MapD, Blazegraph, RAPIDS etc.
A distributed database, on the other hand, consists of several different machines running, either at the same site or in different sites, a copy of or a portion of some database.
In this setting, as the machines might be quite distant from each other, and poorly connected, the cost of communication might be relatively high compared to a disk IO.
Models of Parallelism
Parallel architectures typically involve many processing units, but how they share or don't share different levels of the memory hierarchy varies:
A shared-memory machine
Each processor in this set-up has access to a shared common memory through some kind of bus/switch. In addition, it usually has its own local memory. This is a standard GPU set-up, and could also be the set-up of some rack-based parallel machines.
A shared-disk machine
Here each machine has its own memory, but there is a global set of disks which are shared. Examples of this set-up include: network attached storage (NAS) drives (sharing at file level) and storage area networks (SAN) (sharing at block level).
A shared-nothing machine
Here each processing unit has its own RAM and disk. This is often the cheapest to deploy. It is useful for example with a web application where you might have several copies of the database on different machines. Read/Write requests might go to a single master machine, but read only requests could go to one of several rsync'd slave machines.
Parallel Algorithms on Relations
How do the algorithms we had earlier in the semester change when we are in a parallel computing environment?
There are two different ways we could try to exploit parallelism:
inter-query parallelism, do all of a query on a single machine, but rotate among machines as to who handles the query.
intra-query parallelism, try to speed up the processing of individual queries themselves by trying to compute different portions of their outputs on different machines.
Tuple-at-a-Time Operations in Parallel
Consider a select operation, `sigma_C(R)`, in a shared-nothing architecture:
We could imagine that each machine has some fraction of all the tuples in `R`.
All machines in parallel compute `sigma_C(R)` on their fraction.
Then each machine copies its results to a common server/output.
This same kind of approach can be used to handle project operations `pi_A(R)` as well.
If we have `N` machines, and the tuples are equally distributed, the time to compute
`sigma_C(R)` on these machines will be the time on 1 machine divided by `N` together with the communication overhead.
The scheme we use for partitioning tuples should be random with respect to the select operations we are likely to do.
What we don't want is a situation where we do a query like `sigma_{sal=10000}(EMPLOYEE)` and all the people with a salary of `10000` are on the same machine.
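The three steps above can be sketched as follows, simulating each machine's fraction of `R` as a list of tuples (the tuple layout and the salary condition are made up for illustration):

```python
# Simulate sigma_C(R) on a shared-nothing setup: each "machine" holds a
# fraction of R's tuples, filters its fraction independently (step done
# in parallel on real hardware), and the partial results are unioned at
# a common output site.

def parallel_select(fragments, condition):
    partial = [[t for t in frag if condition(t)] for frag in fragments]
    output = []
    for part in partial:        # each machine ships its result to the output site
        output.extend(part)
    return output
```

The same skeleton handles a projection `pi_A(R)` if the per-fragment filter is replaced by a per-fragment projection.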
Parallel Algorithms for Full-Relation Operations - Distinct Case
Suppose we wanted to do a query like:
SELECT DISTINCT SALARY
FROM EMPLOYEE;
in parallel. I.e., in relational algebra, a `delta(R)` operation.
If tuples are distributed to the processors according to a hash function `h` that depends on just the SALARY portion of the tuple, then any two tuples with `t_1(SALARY) = t_2(SALARY)` have `h(t_1) = h(t_2)`.
We could then have each machine run the distinct algorithm from earlier in the semester
on its portion and output the results.
The combined outputs would all be unique.
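A sketch of this hash-partitioned `delta(R)`, here operating on bare salary values with Python's built-in `hash` standing in for `h`:

```python
def parallel_distinct(values, num_machines):
    """Route equal values to the same machine via the hash function,
    eliminate duplicates locally on each machine, then concatenate.
    Because equal values always hash to the same machine, no duplicates
    can survive across machines."""
    machines = [[] for _ in range(num_machines)]
    for v in values:
        machines[hash(v) % num_machines].append(v)  # h depends only on SALARY
    output = []
    for frag in machines:
        output.extend(set(frag))    # local duplicate elimination per machine
    return output
```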
Parallel Algorithms for Full-Relation Operations - General Case
Similar techniques work for Union, Intersection, Difference, Grouping, and Join.
For these operations, each machine `M` runs the following code:
(1) Determine a hash function `h` for the query operation at hand (each machine should compute the same hash function). For example, for the query of the previous slide we chose a hash function that depended only on SALARY. For join, you might choose a hash function that depends only on the join attributes.
(2) Compute `h(t)` for each tuple in `R` (in the case of a binary operation like `R cup S`, for each tuple in `R` and `S`).
(3) The output of the hash function should be a machine number `i` for a machine `M_i` that should handle that particular tuple.
(4) For each `i`, send all the tuples `t` with hash `i` to machine `M_i`.
(5) After receiving all the tuples that other machines have sent to it, `M` computes the relational algebra operation on its received portion of the tuples and outputs the result.
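For example, a natural join hash-partitioned on the join attribute might be sketched as below; the tuple format, with the join attribute in the first position, is an assumption for illustration:

```python
def parallel_hash_join(R, S, num_machines):
    """Send R- and S-tuples whose join attributes hash to i to machine
    M_i; each machine then joins only the fragments it received."""
    r_frags = [[] for _ in range(num_machines)]
    s_frags = [[] for _ in range(num_machines)]
    for t in R:
        r_frags[hash(t[0]) % num_machines].append(t)  # h uses the join attribute
    for t in S:
        s_frags[hash(t[0]) % num_machines].append(t)
    result = []
    for i in range(num_machines):       # machine M_i works on its own fragments
        for r in r_frags[i]:
            for s in s_frags[i]:
                if r[0] == s[0]:
                    result.append(r + s[1:])
    return result
```

Since matching tuples always hash to the same machine, no cross-machine comparisons are ever needed.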
Cost of Parallel Algorithm for Full-Relations
Suppose we wanted to use the algorithm of the last slide to compute in parallel on `N`
processors one binary operation on `R` and `S`. For concreteness, let's say `R bowtie S`.
Step (1) takes no IOs.
Steps (2) and (3) are applied to all the blocks that make up `R` and `S`, so they take `B(R) + B(S)` total IOs. If each machine has its own disk, then the time to do these IOs will be `(B(R) + B(S))/N`.
Step (4) involves sending over the network `(N-1)/(N)(B(R) + B(S))` blocks, which we hope is of negligible cost, and then doing `(B(R) + B(S))/N` IOs on each machine to write its received data to disk.
Step (5), using our algorithm from earlier in the semester, would take `(3(B(R) + B(S)))/N` IOs.
So the total time is `(5(B(R) + B(S)))/N` disk IOs, which for `N ge 2` is faster than the `3(B(R) + B(S))` IOs we needed to do the operation on a single machine.
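The count above can be checked with a quick calculation; the block counts below are chosen arbitrarily:

```python
def parallel_join_ios(B_R, B_S, N):
    """Per-machine disk IOs for the hash-partitioned join."""
    read_and_hash = (B_R + B_S) / N      # steps (2)-(3): read each block once
    write_received = (B_R + B_S) / N     # step (4): write shipped tuples to disk
    local_join = 3 * (B_R + B_S) / N     # step (5): hash join on the local fragment
    return read_and_hash + write_received + local_join   # = 5(B(R)+B(S))/N

def single_machine_ios(B_R, B_S):
    return 3 * (B_R + B_S)               # the one-machine hash-join cost
```

With `B(R) = 1000`, `B(S) = 500`, and `N = 5`, the parallel version takes 1500 IOs per machine versus 4500 on a single machine.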
Map-Reduce Parallelism Framework
Map-Reduce is a framework developed at Google for doing massively parallel computations on very large datasets.
It is often used to compute properties of the web graph such as PageRank.
The original paper in which it was described is Dean and Ghemawat (2004).
The Basic Framework
Map-Reduce was inspired by the map and reduce functions found in functional programming languages such as Lisp.
The map function takes as its argument a function `f` and a list of elements `l = langle l_1, ..., l_n rangle`. It returns a new list
`map(f,l) = langle f(l_1), ..., f(l_n) rangle`.
The reduce function takes a function `g` and a list of elements `l = langle l_1, ..., l_n rangle`. It returns a new element `l'` such that
`l' = reduce(g,l) = g(l_1, g(l_2, g(l_3, ... )))`.
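Python's built-ins mirror these Lisp functions. One caveat: `functools.reduce` folds from the left, `g(g(l_1, l_2), l_3)...`, which agrees with the right fold above whenever `g` is associative, as in the sum below.

```python
from functools import reduce

# map(f, l) applies f to each element of l
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]

# reduce(g, l) combines the whole list down to a single value
total = reduce(lambda a, b: a + b, [1, 2, 3, 4])     # 10
```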
From a high-level view, a Map-Reduce program creates a sequence of key/value pairs, performs some computations on them, and outputs another sequence of key/value pairs.
Keys and values are often strings, but may be of any data type.
Distinct Phases of a Map-Reduce Job
Map Phase, key/value pairs are read from the input and the map function is applied to each of them individually. The function is of the general form:
`map: langle k, v rangle |-> langle langle k_1, v_1 rangle, langle k_2, v_2 rangle, ... rangle`
Shuffle Phase, the pairs produced during the map phase are sorted by their key, and all values for the same key are grouped together.
Reduce Phase, the reduce function is applied to each key and its values.
`reduce: langle k, langle v_1, v_2, ... rangle rangle |-> langle k, langle v_1', v_2', ... rangle rangle`
That is, for each key the reduce function processes the list of associated values and outputs another list of values. The output values and their number may not be the same as the input values.
Example Map-Reduce Job for Counting
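A minimal sketch of the classic word-count job over the three phases; the input documents below are made up for illustration:

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map phase: emit a (word, 1) pair for every word occurrence."""
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    """Reduce phase: sum the grouped counts for one word."""
    return (word, [sum(counts)])

def map_reduce(input_pairs):
    mapped = [kv for k, v in input_pairs for kv in map_fn(k, v)]
    groups = defaultdict(list)      # shuffle phase: group the values by key
    for k, v in mapped:
        groups[k].append(v)
    return [reduce_fn(k, groups[k]) for k in sorted(groups)]
```

In a real deployment the map calls, the shuffle, and the reduce calls would each be spread across many machines; here they run in one process to show the data flow.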
Distributed Databases
The difference between a distributed database system and a shared-nothing parallel database system is that for the former we assume network communication is expensive, and for the latter we don't.
Some situations where we might use a distributed database are:
A bank may have several branches. Each branch might have a database of its accounts, but customers can choose to bank at any branch. Each of these databases might be further replicated offsite so as to be recoverable in the case of a fire.
A department store chain has individual stores each with their own database of sales. We also have a central database used to keep track of chain-wide inventory.
A digital library might consist of a collection of universities each of which holds different on-line books and documents. We might want to be able to do a search across all of these collected universities for a particular book.
Notice we can often view these systems as being a big database partitioned across machines.
For example, each store in the department store chain might have a table Sales(item, date, price, purchaser), so we can imagine they each store some tuples of a big global table.
Next day, we will look at how to handle transactions in this distributed set-up.