Last week, we were discussing the basics of the relational model.
We said in this model a database consists of a set of relations.
Each relation consists of a set of tuples whose attributes match
the types of the relational schema for that relation.
We then looked at the relational algebra, a way of defining new relations based on a set of database relations.
We discussed the operations of projection, selection, Cartesian product, join, theta join, and renaming.
We also showed how to build up more complicated expressions using composition, and how these expressions might be
used to answer human language queries about data stored in our relations.
Today, we continue our description of the relational model by look at more complicated constraints on data in
the model that we might need to be able to handle.
Constraints on Relations
So far, we have seen only one kind of constraint on the data stored in database relations: that an attribute or set of attributes form a key.
There are many other kinds of constraints.
The first of these we will consider are set based constraints. These are quite general and can be used to define most of the other constraint types we will see.
The second kind we will consider are perhaps the most common. In these, a key column of one relation is used for values of another relation.
For example, the Project relation might use as one of its columns supervisor_id where this value must be an employee_id of the Employee table.
This is called a referential integrity constraint.
Relational Algebra as a Constraint Language
We can use expressions in the relational algebra to express constraints.
Two ways in which this can be done are:
Constraints of the form: `R = emptyset`, which says there are no tuples in the result of `R`. Here `R` may be a single relation or it could be a relational algebra query.
Constraints of the form: `R subseteq S`, which says every tuple `t` of `R` must be a tuple of `S`.
Both ways above actually turn out to be equivalent. `R subseteq S` can be expressed as `R - S = emptyset`, and `R = emptyset` can be written as
`R subseteq emptyset`.
The first style turns out to be more commonly used in SQL programming.
Referential Integrity Constraints
A referential integrity constraint asserts that a value appearing in one context also appears in another, related context.
In our Movies DB, a tuple in StarsIn has a starName component where we expect that name to be the name of a star in the MovieStar relation.
More formally, a referential integrity constraint arises, if `v` is a value of attribute `A` of relation `R`, and because of our design intentions, we may expect that `v` will also appear in a particular component (say `B`) of some other relation `S`.
We can express this integrity constraint in the relational algebra as `pi_A(R) subseteq pi_B(S)`.
Below are some example referential integrity constraints for the Movies database:
`pi_{mbox(starName)}(StarsIn) subseteq pi_{mbox(starName)}(MovieStar)`
`pi_{mbox(producerC#)} (Movies) subseteq pi_{mbox(cert#)} (MovieExec)`
`pi_{mbox(movieTitle,movieYear)} (StarsIn) subseteq pi_{mbox(title,year)} (Movies)`
Key Constraints
The same constraint notation allows us also to express algebraically the constraint that a certain set of attributes form a key.
For example, consider the table:
MovieStar(name, address, gender, birthdate)
The book uses name as the key for this relation.
One consequence of this constraint is that if two tuples agree on name, they must also agree on address.
Write `MS1` as shorthand for `rho_{MS1(mbox(name,address,gender,birthdate))}(MovieStar)`.
Write `MS2` as shorthand for `rho_{MS2(mbox(name,address,gender,birthdate))}(MovieStar)`.
Then we can express the address constraint as:
`sigma_{MS1.mbox(name) = MS2.mbox(name) mbox( AND ) MS1.mbox(address) ne MS2.mbox(address)}(MS1 times MS2) = emptyset`.
Using a sequence of such constraints we could enforce that name was a key.
Other Kinds of Constraints
A large family of constraints involve the permitted values in a context.
For example, the fact that each value has a type constrains the values of that attribute.
Examples of such constraints are "integers only" or "characters up to length 30".
These are quite general, however, we might want a constraint which says that the value of column
must come from a fixed enumerated type or define some complex limitation.
For example, suppose we wanted to restrict the gender type to be one of `M` or `F`. We could do this
with the constraint:
`sigma_{mbox(gender) ne 'F' mbox( AND ) mbox(gender) ne 'M'}(mbox(MovieStar)) = emptyset`
Another constraint might be that we require the net worth of a movie studio president to be at least $10,000,000.
We could express this as:
`pi_{mbox(presC#)}(Studio) subseteq pi_{mbox(cert#)}(sigma_{mbox(netWorth) geq 10000000}(mbox(MovieExec)))`
Quiz
Which of the following is true?
The command to delete a SQL table is DELETE TABLE.
A SQL table can have more than one primary key.
The relational algebra select operation applied to R makes a new relation with only the listed columns in the select of R.
Database Design Theory
There are several different approaches one could take to figuring out what relations one should have for a database.
In a couple of weeks we will learn a high-level notation for describing the structure of data diagrammatically. We will then show how to convert these diagrams to relations.
We can also examine the requirements for a DB and define relations directly without using a modeling/diagramming notation.
After we get an initial design, we then must ask ourselves, does ones choice of relations and constraints make sense?
To analyze this question in a rigorous way, DB designers make use of "dependencies" and study what kind of dependencies exist on the data
and how well these dependencies get mapped into our choice of relations.
The first kind of dependencies we will consider are functional dependencies. We will then later consider multivalued dependencies.
Definition of a Function Dependency
A functional dependency (FD) on a relation `R` is a statement of the form "If two tuples of `R` agree on all of the attributes
`A_1, A_2, ..., A_n`, then they will agree on the attributes `B_1, ..., B_m`.
If `t[vec{A}]` denotes those columns of `t` from `A_1, ..., A_n`, then what we are saying is `t_1[vec{A}] = t_2[vec{A}] -> t_1[vec{B}] = t_2[vec{B}]`.
We write such a functional dependency as `A_1, ..., A_n -> B_1, ..., B_m` and read this as "`A_1, ..., A_n` functionally determines `B_1,...,B_m`."
If we know a FD will hold for the tuples of every instance of a relation `R`, we say `R` satisfies the functional dependency.
We can expand an FD `A_1, ..., A_n -> B_1, ..., B_m` as a set of FDs:
`A_1, ..., A_n -> B_1`
`A_1, ..., A_n -> B_2`
...
`A_1, ..., A_n -> B_m`.
Now that we have fixed the notion of FD, we can give a more precise definition of key.
We say a set of attributes `{A_1, ..., A_n}` is a key for a relation `R` if:
It is a superkey. That is, `A_1,...,A_n -> B` is satisfied for any column `B` in R.
No proper subset of `{A_1, ..., A_n}` is a superkey. I.e., `{A_1, ..., A_n}` is a minimal superkey.
For example, {title, year, starName} form a key for the relation Movies1 of the previous slide.
A relation can have more than one key. Often, we choose one of these to be the primary key.
This is the key we will thenceforth use most frequently when talking about a key for that relation.
Reasoning About Functional Dependencies
We now look at a couple of different ways we can infer new functional dependencies given that we know some dependencies already hold.
For example, suppose we know `R(A,B,C)` satisfies `A -> B` and `B -> C`. Then we also know if we fix column `A` we must also fix column `C`.
i.e., `A->C`.
This shows that we might not need to list all the FDs on a table, it suffices to list a subset from which all other FDs can be derived.
To reason about this we introduce the following definitions:
We say two sets of FDs `S` and `T` are equivalent if the set of relation instances satisfying `S` is exactly the same as the set of relation instances satisfying `T`.
We say a set of `FD`'s `S` follows from a set of FD's `T` if every relation instance satisfying all the FDs in `T` also satisfies those of `S`.