A Flavor of Transactions, Scaling Database Architectures
CS157a
Chris Pollett
Dec 2, 2019
Outline
Transactions
Quiz
Introduction
Before the holiday, we discussed how to connect to a DBMS from an application programming language such as Java.
There are many scenarios in which we might want to do this.
For example, we might want to write the software that runs on an ATM and have it interact with a database that
is storing bank account information.
Our programs might be run in a setting where multiple instances of our program connect to the same database at the same time.
A single ATM interaction might involve several operations on the database that we'd like to treat as a unit.
For example, to do a withdrawal, we want to get a user's current balance, say using a SELECT query, check whether the withdrawal is feasible,
and if so do an UPDATE of the balance and give the user money.
To indicate to the DBMS that we'd like to treat a sequence of consecutive operations as a logical unit, our program can execute
the SQL command: BEGIN TRANSACTION.
When our transaction is done, our program should execute the SQL command COMMIT, to let the database know that the operations so
far should be committed to secondary storage.
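The lecture's setting is JDBC, but the same BEGIN TRANSACTION / COMMIT pattern can be sketched with Python's built-in sqlite3 module. The table name, account id, and amounts below are made up for illustration; the point is that the SELECT and UPDATE form one logical unit that is either committed or rolled back as a whole.

```python
import sqlite3

# Autocommit mode so we can issue BEGIN/COMMIT/ROLLBACK ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100)")

def withdraw(conn, account_id, amount):
    conn.execute("BEGIN TRANSACTION")   # start treating operations as a unit
    (balance,) = conn.execute(
        "SELECT balance FROM account WHERE id = ?", (account_id,)).fetchone()
    if balance < amount:
        conn.execute("ROLLBACK")        # abort: no changes take effect
        return False
    conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?",
                 (amount, account_id))
    conn.execute("COMMIT")              # make the whole unit durable
    return True

withdraw(conn, 1, 40)   # succeeds, leaving a balance of 60
```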
We now discuss a couple of issues with transactions and their implications for how DBMSs work. This to some degree foreshadows topics we will
discuss in more detail in CS157B.
Concurrency Control
Suppose a husband and wife each have an ATM card for the same bank account.
They each go to a different ATM, and withdraw $40.
What we don't want to happen is the following: the instance of our program handling the husband's transaction reads the current balance, then the
wife's instance reads the current balance. Both get the same value `X`, the update statement is run twice based on this same value to set the new balance
to `X-40`, and each person is given 40 dollars. That is, they withdrew $80, but the balance only reflects a $40 change.
It is the job of a DBMS process known as the Transaction Manager to prevent this from happening while still allowing some interleaving of
read and write operations from various clients of the DBMS.
Techniques for doing this in a way such that each transaction client feels it is the only one accessing the database are known as concurrency control.
For example, one way to solve the above problem is for the DBMS to maintain a table of locks on DB resources and which transactions currently have them.
A transaction is only allowed to proceed with an operation after the transaction manager has assigned it the lock for the database item it wants to manipulate;
otherwise, it must wait.
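The lock table idea can be sketched as follows. This is a toy design, not the mechanism of any particular DBMS: a dict maps each database item to the transaction currently holding its lock, and a transaction whose request is denied must wait and retry.

```python
# Toy lock table: item -> id of the transaction currently holding its lock.
class LockTable:
    def __init__(self):
        self.locks = {}

    def acquire(self, txn, item):
        """Grant the lock if it is free or already held by txn; else deny."""
        holder = self.locks.get(item)
        if holder is None or holder == txn:
            self.locks[item] = txn
            return True
        return False  # txn must wait and retry later

    def release(self, txn, item):
        if self.locks.get(item) == txn:
            del self.locks[item]

# The ATM scenario: the second withdrawal must wait for the first to finish.
table = LockTable()
table.acquire("husband-txn", "account-42")           # granted
blocked = table.acquire("wife-txn", "account-42")    # denied: must wait
table.release("husband-txn", "account-42")
granted = table.acquire("wife-txn", "account-42")    # now granted
```

With this discipline, the wife's transaction cannot read the balance until the husband's transaction has finished updating it, so the lost update above cannot occur.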
Distributed Databases
Using concurrency control techniques, multiple client applications can connect to and use a database while each still feels as if it has the database to itself.
As the usage of a database goes up, it might no longer be the case that a single machine holding a database can handle all of the traffic and operations that need to be done on it.
For this reason, it might be necessary to split the database across multiple machines.
Since each of these machines runs processes using its own clock, the result is a so-called distributed database.
There are many other reasons why one might end up with a database split across multiple machines. For example, Company A acquires Company B, each of which previously had its own database system, and now those systems must communicate.
Distributed databases and how to handle transactions in them is a topic that is covered in more detail in CS157B.
For the remainder of today though, I'd like to consider the particular sub-case of scaling a three-tier web application that uses a database where we replicate the whole database.
Quiz
Which of the following is true?
An application server is another name for a DBMS.
To connect to a database management system on another machine we would typically use the DBMS's application network layer interface.
Queries encapsulated using PreparedStatements are recompiled every time executeQuery is invoked.
Scaling a Web-site
Suppose we have a web-site and we want to make it so that rather than handling a few thousand hits a day, it is capable of handling thousands of hits per second.
Rough traffic (2008): addons.mozilla.com (3,000-6,000 hits/second); wikipedia.org, a top-ten site (30,000-40,000 hits/second).
(I know these numbers are old, but I am reasonably sure of them, and they would still be roughly in the ballpark for a large website.)
Scaling is a term that is often used in a vague sense.
To make this term precise, we will say a web-site is scalable if:
It can accommodate increased usage
It can accommodate an increased dataset
It is maintainable.
Throwing more hardware at a problem might help solve the first two points, but if the additional hardware makes the system too hard to manage, the third will fail.
Let's look at an example web-site architecture in terms of scalability.
Example Web-site Architecture
Database Part of the Architecture
The diagram on the last slide only shows one master DB and many slaves.
The slaves would be read only and the master usually supports reads and writes. The slaves could be kept
in sync with the master using rsync.
To handle more data, one might have several master DBs and horizontally partition the data between them.
By horizontally partitioning the data, I mean that table data for larger tables would be split across machines according to some key field. For example, Machine 100
might get the rows of a table with key values between `x` and `y`.
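A range-based partitioning scheme can be sketched as a small routing function. The machine names and key ranges below are made up for illustration; a real system would also have to handle re-partitioning as machines are added.

```python
# Illustrative partition map: each machine owns one contiguous key range.
PARTITIONS = [
    ("machine-100", 0, 4999),      # rows with key values 0..4999
    ("machine-101", 5000, 9999),   # rows with key values 5000..9999
]

def machine_for_key(key):
    """Route a row to the machine whose key range covers it."""
    for machine, lo, hi in PARTITIONS:
        if lo <= key <= hi:
            return machine
    raise KeyError(f"no partition covers key {key}")
```

A query on a single key goes to one machine; a query spanning ranges would have to be sent to every machine whose range it touches and the results combined.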
ORMCache Part of the Architecture
If our web-site is built using an MVC/MVA pattern, then
one of the common sources of database queries will be marshaling data for our models.
Let's briefly consider how this is done.
Typically, models are used to get objects into and out of a database.
As an example, a user might make many posts to a discussion board.
This is an example of a one-to-many relationship.
We might have two database tables user and post. To represent this relationship the post table might have a column uid as a foreign key reference into the user table.
Now a User object might be modeled as a PHP associative array consisting of the name and other fields of the user, one of these fields being an array of post objects.
An object relational mapping is a mapping between these kind of objects and database tables.
Besides one-to-many relationships, one might need to handle many-to-many, belongs-to, and one-to-one relationships.
In pretty much all cases, marshaling an object requires several database queries, which can be slow, so it may be sped up by caching.
An ORM Cache such as Redis or Memcache solves this problem by providing a RAM-based key value store for objects that have been previously marshaled and which have not been written to.
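The read-through, invalidate-on-write idea behind such a cache can be sketched in a few lines. Here a plain dict stands in for Redis or Memcache, and fetch_user_from_db is a hypothetical stand-in for the several queries needed to marshal a User object; the key format and field names are made up.

```python
cache = {}  # stand-in for a RAM-based key-value store such as Redis

def fetch_user_from_db(uid):
    # In reality: one query against user, more against post, etc.
    return {"uid": uid, "name": f"user{uid}", "posts": []}

def get_user(uid):
    key = f"user:{uid}"
    if key not in cache:                  # cache miss: marshal from the DB
        cache[key] = fetch_user_from_db(uid)
    return cache[key]                     # cache hit: skip the DB entirely

def save_user(user):
    # Writing the object invalidates the cached copy, so the next read
    # re-marshals the fresh version from the database.
    cache.pop(f"user:{user['uid']}", None)
```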
Reverse Proxy Part of the Architecture
The reverse proxy in front of the architecture from a couple slides back may provide several functions:
Act as a cache for previously calculated whole web pages that have not changed.
Choose which web server should handle a request (might have several machines all running the identical web app).
Hardware support for things like SSL.
The reverse proxy in front of everything is typically implemented using Varnish, Squid, or a hardware specific system.
An advantage of a layer-seven hardware network appliance over Squid is that it can do SSL in hardware. On the
other hand, hardware is harder to change.
This completes our brief overview of our architecture.
I'd like to point out that many products nowadays run virtualized and cloudified as well.
Content Delivery Networks
One problem with even an architecture like that provided a couple slides back is that all traffic still has to come
to your subnetwork of the internet, and this could lead to a bottleneck.
To scale so that geographic effects don't mess with your ability to provide content, you often would also use a
content delivery network.
Akamai is one early example of such a network. There is also Cloudflare. Amazon has CloudFront and Microsoft has Azure CDN.
CDN's are often implemented using a distributed hash table.
Like a usual hash table, such a table stores key value pairs.
Unlike a hash table, there is also a distance function d(k1, k2) which can be used to compare the distance between two keys.
A key might contain information about a url, an ip address (geographic info), and a replication number; a value might be a web page.
Each node in the network also has a key associated with itself.
Nodes in such a DHT network usually are arranged in some kind of topology like a ring with additional edges to keep the number of hops between any two nodes logarithmic.
When a request comes in to one node in this network, the node checks if any neighbor is closer to the key; if so, it lets the closest neighbor try to handle the request; otherwise, it handles the request itself.
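The greedy routing step can be sketched on a toy ring. The ring size, node keys, and extra "finger" edges below are invented for illustration; real DHTs such as Chord use much larger key spaces and maintain the finger tables automatically.

```python
RING = 64  # toy key space: keys are integers mod 64

def d(k1, k2):
    """Ring distance between two keys (shorter way around the circle)."""
    return min((k1 - k2) % RING, (k2 - k1) % RING)

# node key -> neighbor node keys: a ring with a few extra long-range edges,
# keeping the number of hops between any two nodes small.
NEIGHBORS = {0: [8, 16, 32], 8: [0, 16, 40], 16: [8, 32, 48],
             32: [16, 48, 0], 40: [8, 48, 0], 48: [32, 40, 16]}

def route(node, key):
    """Greedy lookup: forward to a strictly closer neighbor, else stop."""
    best = min(NEIGHBORS[node], key=lambda n: d(n, key))
    if d(best, key) < d(node, key):
        return route(best, key)   # hand off to the closest neighbor
    return node                   # no neighbor is closer: handle it here
```

For example, a request for key 44 arriving at node 0 hops to node 32, then node 48, which is closest and serves it.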
C10k Problem and its Variants
Ideally, one would like to run one's web app on as few machines as possible, because fewer machines are easier to manage.
The term C10k refers to 10,000 simultaneously open connections on a single machine and was coined by Kegel in 1999.
Modern variants replace the 10k with larger numbers, such as 10M.
Originally, 10,000 connections was a bottleneck for Apache web servers because the server uses multiple concurrently running modules, each in its own process, each of
which in turn runs multiple threads.
Socket connections require file handles. The operating system usually places a limit on the number of threads and file handles that a process is allowed to have open. The limits on threads are often caused by how much memory one has, and Apache threads can be heavy.
Once these limits have been exhausted, blocking occurs, slowing down the server.
One approach to avoid this issue is to use event-driven I/O and not spawn threads or processes. This reduces the memory footprint of the server, and then one can use ulimit or some other such construct to raise the limit on the number of open connections allowed for a process.
This is the approach the nginx (Engine X) web server takes, and it is also the approach of Node.js.
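The event-driven idea can be sketched with Python's selectors module: one thread multiplexes many sockets rather than spawning a thread per connection. This shows the general technique, not nginx's or Node's actual code; a socketpair stands in for an accepted client connection.

```python
import selectors
import socket

sel = selectors.DefaultSelector()
server, client = socket.socketpair()   # stand-in for one accepted connection
server.setblocking(False)              # never block; the selector tells us
sel.register(server, selectors.EVENT_READ)  # when a socket is readable

client.sendall(b"ping")

echoed = b""
for key, _events in sel.select(timeout=1):  # one turn of the event loop
    data = key.fileobj.recv(1024)      # ready now, so this does not block
    key.fileobj.sendall(data)          # echo the data back to the client
    echoed = data

sel.unregister(server)
```

A real server would register thousands of sockets with the one selector and loop over sel.select() forever, which is why a single process can hold far more connections than a thread-per-connection design.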