Introduction

Since last week we have been talking about distributed databases.
These are databases which span several machines and for which the communication cost between machines is considered to be large.
We gave three example situations where a distributed database might be used: banking, department store chains, and digital libraries.
On Monday, we looked at how to modify queries in this setting, in particular, join computations, so as to minimize communication costs.
We then looked at how serializability and recovery change in the distributed setting and described the two-phase commit protocol.
Today, we begin by looking at how to modify locking if we have multiple databases in a distributed setting.

Distributed Locking

We assume each site has its own lock manager which is responsible locking elements on its component of a transaction.
We also assume components of a transaction can only request locks on elements at its site.
If we have multiple copies of the same data element `X` (i.e., we replicate `X`), we need to ensure that all copies of `X` are changed in the same way by each transaction.
We distinguish locking the logical database element `X` from locking one or more of its copies.

Centralized Locking

In centralized locking, we maintain a lock site.
This site has a lock table for logical elements, whether or not the lock site has copies of these elements at its site.
When a transaction wants a lock on logical element `X`, it sends a request to the lock site, which grants it as appropriate.
The cost of locking is then three messages per lock (request, grant, release).
Some drawbacks of centralized locking are:
1. The lock site can become a bottleneck to the throughput of the system.
2. If the lock site goes down, no transactions can be processed.

Using a Lock Coordinator

Let's now consider the setting where there is no replication, and each site has its own locking mechanisms for the elements that it has.
A distributed transaction might need locks on several database items `X`, `Y`, and `Z` that don't live all on the same site.
Component parts of the transaction at different sites need to at least exchange messages to determine if all the necessary locks for the transaction have been obtained before proceeding.
To simplify this process, we assume one component of every transaction acts as a lock coordinator for that transaction.
It is responsible for gathering all the locks needed for the transaction.
For locks needed on its own site, the lock coordinator doesn't need any messages to do locking.
To lock `X` on a different site `S`, the following messages need to occur:
1. A message to `S` requesting the lock for `X`.
2. A reply message granting the lock or a denial message (assume lock will eventually be granted by a later message).
3. A message to `S` releasing the lock on `X`.
For a given transaction, we typically choose the lock manager to be the site of the component of the transaction that needs the most locks.

Locking with Replication -- the Problem

Suppose we don't have a global lock table, but we do have replicated database elements `X`.
What we don't want is to have two copies of logical element `X` on different sites, say `X_1` and `X_2`, and for transaction `T_1` to get the lock on `X_1` and at the same time `T_2` to get the lock on `X_2` and each change the value of their copy, but because they don't have the lock, be unable to change the other copy.
The database in this bad scenario might end up in an inconsistent state, and we might not be able to serialize `T_1` and `T_2`.
To assure serializability, we need transactions to take global locks on the logical elements, even though these don't exist physically (only the copies exist physically).
We next look at a scheme for turning local locks into global locks which have the property:
- A logical element `X` can have either one exclusive lock and no shared lock, or any number of shared locks and no exclusive lock.

Primary-Copy Locking

In the primary-copy lock method, each logical element `X` has one of its copies designated the "primary copy".
Different database elements `X`, `Y` might have their primary copies on different sites.
In order to get a lock on a logical element `X`, a transaction sends a request to the site of the primary copy of `X`.
This site has a lock table with `X` in it and can appropriately grant or deny the request.

Distributed Deadlocks

As in the non-distributed setting, it is quite possible in the distributed setting to get deadlocks.
It is often harder, however, to detect them.
For this reason, rather than use wait-for graphs or timestamping, in the distributed setting most systems use timeouts.
If a transaction has not complete after an appropriate amount of time, it is assumed to have deadlocks, and is rolled back.

In-Class Exercise

Suppose we have `n` copies of a database element `X`.
Rather than use primary copy locking, suppose we require `s` of the copies of the element to grant a shared lock before we allow a transaction to assume it has a shared lock.
Similarly, we require `x` of the copies of the element to grant an exclusive lock before we allow a transaction to assume it has an exclusive lock.
What are some good choices for `s` and `x`? Why?
Give an advantage of this approach. Give a disadvantage of this approach.
Post your solutions to the May 2 In-Class Exercise Thread.

Information Integration

We now start a new topic: information integration.
Information integration is the process of taking several databases or information sources and making the data in these sources available to application users as if they were a single database.
The integrated database may be either physical, for example, stored in a data warehouse; or virtual, a "mediator" may be queried and it turn queries the various sources.
In the case of systems with mediators, special kinds of query optimization are needed and further the kind of optimization varies by the kind of mediator.
There are two main kinds of mediators:
1. global-as-view where the data at the integrated database is defined by how it is constructed from the sources. Here we might be interested in capability based optimization.
2. local-as-view where the content of the sources is defined in terms of the schema that the integrated database supports. Here the optimizer has to be able to figure out how to compose the answer to a query from defined views.
Different data sources might refer to the same entity. So another task that is needed by an information integration solution is to resolve discrepancies such as misspelt names, out-of-date addresses, etc. that different sources have. This task is called entity resolution.

Use Cases of Information Integration

Databases are often created independently, but are later asked to work together. For example, different police stations might have their own database of case files, but the state later mandates that they share certain kinds of data.
The use of databases evolves, so we cannot design a database to support every possible future use. For example, Facebook was originally developed using a Mysql database, which does not look anything like their current set-up. Facebook still might want to access old data from the original database and schema from time-to-time, but make it appear as if from fresh modern source.
As another example, a university might have started automating different functions at different times. For example,
- The registrar at the university created a database of courses and records. This was used to build a transcript app.
- At some different time, the bursar built a different database to manage tuition.
- The HR department asynchronously made a database for employees.
- The grants office made a separate one for funding, etc.
- The university might want it to be the case that the registrar does not record grades for students who don't pay their fees.
- This is not possible without information integration.
One more example is the case of corporate mergers: For example, when American and US Airways merged, American stored bookings in Sabre, which dates to the 1960s, US Airways, was using a more modern booking system from EDS. In turn, US Airways had earlier merged with America West using EDS, and had switched its system from Sabre to what American West was using.

The Heterogeneity Problem

If information sources were developed independently they can differ in many ways, even if they are intended to store the same data. This is called the Heterogeneity Problem.
- The book has an example of Aardvark Automobile Co which has 1000 dealers, each of which has its own database of its stock.
- Aardvark wants to create an integrated database so that dealers can locate a particular model that a different dealer might have in the case that the local dealer doesn't have it.
Heterogeneity can come in several different forms:
- Communication Heterogeneity. One system might be accessed through one protocol say HTTP, but another might use CORBA RPC calls, and another might use FTP.
- Query-Language Heterogeneity. This might involve different dialects of SQL at different sites. Or maybe some site uses Excel and another XQuery. Maybe one even has a legacy site that uses QUEL.
- Schema Heterogeneity. The database schema at the different sites might be different and so the queries needed to get the same data need to change. For example, one site might have a single table CARS, another dealer might have two tables AUTOS and OPTIONS.
- Data Type Heterogeneity. Serial numbers might be stored as a CHAR in one site as an INTEGER at another, etc.
- Value Heterogeneity. One site might use Black to mean a black car, another site might use the code BL.
- Semantic Heterogeneity. Terms might differ from site to site. One site might use CARS to refer to both cars and trucks while another site only puts cars in the CARS table.

Distributed Locking - Information Integration

Outline

Introduction

Distributed Locking

Centralized Locking

Using a Lock Coordinator

Locking with Replication -- the Problem

Primary-Copy Locking

Distributed Deadlocks

In-Class Exercise

Information Integration

Use Cases of Information Integration

The Heterogeneity Problem