Introduction

Last day, we were discussing the basics of the relational model.
We said in this model a database consists of a set of relations.
Each relation consists of a set of tuples whose attributes match the types of the relational schema for that relation.
We then looked at the relational algebra, a way of defining new relations based on a set of database relations.
We discussed the operations of projection, selection, Cartesian product, join, theta join, and renaming.
We also showed how to build up more complicated expressions using composition, and how these expressions might be used to answer human language queries about data stored in our relations.
Today, we continue our description of the relational model by look at more complicated constraints on data in the model that we might need to be able to handle.

Constraints on Relations

So far, we have seen only one kind of constraint on the data stored in database relations: that an attribute or set of attributes form a key.
There are many other kinds of constraints.
The first of these we will consider are set based constraints. These are quite general and can be used to define most of the other constraint types we will see.
The second kind we will consider are perhaps the most common. In these, a key column of one relation is used for values of another relation.
For example, the Project relation might use as one of its columns supervisor_id where this value must be an employee_id of the Employee table.
This is called a referential integrity constraint.

We can use expressions in the relational algebra to express constraints.
Two ways in which this can be done are:
1. Constraints of the form: `R = emptyset`, which says there are no tuples in the result of `R`. Here `R` may be a single relation or it could be a relational algebra query.
2. Constraints of the form: `R subseteq S`, which says every tuple `t` of `R` must be a tuple of `S`.
Both ways above actually turn out to be equivalent. `R subseteq S` can be expressed as `R - S = emptyset`, and `R = emptyset` can be written as `R subseteq emptyset`.
The first style turns out to be more commonly used in SQL programming.

A referential integrity constraint asserts that a value appearing in one context also appears in another, related context.
In our Movies DB, a tuple in StarsIn has a starName component where we expect that name to be the name of a star in the MovieStar relation.
More formally, a referential integrity constraint arises, if `v` is a value of attribute `A` of relation `R`, and because of our design intentions, we may expect that `v` will also appear in a particular component (say `B`) of some other relation `S`.
We can express this integrity constraint in the relational algebra as `pi_A(R) subseteq pi_B(S)`.
Below are some example referential integrity constraints for the Movies database: `pi_{mbox(starName)}(StarsIn) subseteq pi_{mbox(starName)}(MovieStar)`
`pi_{mbox(producerC#)} (Movies) subseteq pi_{mbox(cert#)} (MovieExec)`
`pi_{mbox(movieTitle,movieYear)} (StarsIn) subseteq pi_{mbox(title,year)} (Movies)`

The same constraint notation allows us also to express algebraically the constraint that a certain set of attributes form a key.

For example, consider the table:

MovieStar(name, address, gender, birthdate)

The book uses name as the key for this relation.
One consequence of this constraint is that if two tuples agree on name, they must also agree on address.
Write `MS1` as shorthand for `rho_{MS1(mbox(name,address,gender,birthdate))}(MovieStar)`.
Write `MS2` as shorthand for `rho_{MS2(mbox(name,address,gender,birthdate))}(MovieStar)`.
Then we can express the address constraint as:
`sigma_{MS1.mbox(name) = MS2.mbox(name) mbox( AND ) MS1.mbox(address) ne MS2.mbox(address)}(MS1 times MS2) = emptyset`.
Using a sequence of such constraints we could enforce that name was a key.

A large family of constraints involve the permitted values in a context.
For example, the fact that each value has a type constrains the values of that attribute.
Examples of such constraints are "integers only" or "characters up to length 30".
These are quite general, however, we might want a constraint which says that the value of column must come from a fixed enumerated type or define some complex limitation.
For example, suppose we wanted to restrict the gender type to be one of `M` or `F`. We could do this with the constraint:
`sigma_{mbox(gender) ne 'F' mbox( AND ) mbox(gender) ne 'M'}(mbox(MovieStar)) = emptyset`
Another constraint might be that we require the net worth of a movie studio president to be at least $10,000,000. We could express this as:
`pi_{mbox(presC#)}(Studio) subseteq pi_{mbox(cert#)}(sigma_{mbox(netWorth) geq 10000000}(mbox(MovieExec)))`

Suppose we have two relations: Emp(eid, name, address, salary), Project(pno, pname, manager_id).
Express the following constraints in the form query `= emptyset`:
1. pno is a key.
2. A value for a manager_id must come from an eid in Emp relation.
3. No employee has a salary less than 20000.
Please post your solution to the Sep 13 In-Class Exercise Thread.

There are several different approaches one could take to figuring out what relations one should have for a database.
In a couple of weeks we will learn a high-level notation for describing the structure of data diagrammatically. We will then show how to convert these diagrams to relations.
We can also examine the requirements for a DB and define relations directly without using a modeling/diagramming notation.
After we get an initial design, we then must ask ourselves, does ones choice of relations and constraints make sense?
To analyze this question in a rigorous way, DB designers make use of "dependencies" and study what kind of dependencies exist on the data and how well these dependencies get mapped into our choice of relations.
The first kind of dependencies we will consider are functional dependencies. We will then later consider multivalued dependencies.

A functional dependency (FD) on a relation `R` is a statement of the form "If two tuples of `R` agree on all of the attributes `A_1, A_2, ..., A_n`, then they will agree on the attributes `B_1, ..., B_m`.
If `t[vec{A}]` denotes those columns of `t` from `A_1, ..., A_n`, then what we are saying is `t_1[vec{A}] = t_2[vec{A}] -> t_1[vec{B}] = t_2[vec{B}]`.
We write such a functional dependency as `A_1, ..., A_n -> B_1, ..., B_m` and read this as "`A_1, ..., A_n` functionally determines `B_1,...,B_m`."
If we know a FD will hold for the tuples of every instance of a relation `R`, we say `R` satisfies the functional dependency.
We can expand an FD `A_1, ..., A_n -> B_1, ..., B_m` as a set of FDs:
`A_1, ..., A_n -> B_1`
`A_1, ..., A_n -> B_2`
...
`A_1, ..., A_n -> B_m`.

Consider the relation:

Movies1(title, year, length, genre, studioName, starName)

It seems reasonable that

title year -> length genre studioName

is an FD, but

title year -> starName

is not.