In my last post, Economies of Scale, I did a quick thought experiment to see how "big" an Ancho model might get. The reference number I came up with is about 20 GB of raw data, produced in anywhere from a few hours to a few seconds, depending on whether you're running your Ancho model on a single machine or a cluster.
Desktop computers routinely ship with 500 GB hard disks nowadays, and 20 GB may not seem like a huge amount of storage. It's not like you're going to be keeping all of that raw data forever. Ancho will generate it, analyze it to produce your model's reports, and then dispose of the raw data.
Despite that, here are some reasons why I think that Ancho's storage requirements merit a large NoSQL database like Cassandra.
Size and Scale
There are lots of different ways that someone could interact with Ancho, and not all of them involve a traditional desktop or laptop computer. If you're writing Python code, you will probably develop your model on a PC. However, let's say that you're simply plugging your own data into a model that someone else built, or you're building a model using library functions that someone else has written. There's no reason why you would need a full PC for this. It could be done in a web application, or on a tablet device.
In the web application case, you would be sharing that system with lots of other users, all of whom might be creating their own multi-gigabyte models. In the tablet case, the tablet computer simply doesn't have the local storage or the CPU for this kind of computing. You could think of it more as a display client for a system that was really running on a server elsewhere -- just like the web application.
In either of these cases, the data needs to be kept in a cloud-based database that has the potential to store vast amounts of data, even though much of that data is only stored temporarily.
Rate of Data Generation
If you're running models on a cluster of computers, the rate at which the data is generated starts to be an issue.
At peak, let's say you've got a cluster of 1000 computers running 100 different models for 100 different people, all generating data at a rate of about 2 GB per hour per machine. That's 2 terabytes in an hour, or over 500 megabytes per second. The theoretical maximum write speed for a run-of-the-mill SATA disk is about 150 MB per second. Machines interacting with a database over a network incur various forms of overhead, from the network protocols to the speed of the network itself, that further limit those speeds.
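Spelled out as a quick calculation (the cluster size, per-machine rate, and disk speed are the same illustrative figures as above, not measurements):

    # Back-of-the-envelope arithmetic for the peak-load scenario above.
    machines = 1000               # model-runner nodes in the cluster
    gb_per_hour_per_machine = 2   # raw data each node generates

    total_gb_per_hour = machines * gb_per_hour_per_machine   # 2,000 GB, i.e. 2 TB
    total_mb_per_second = total_gb_per_hour * 1000 / 3600    # ~556 MB/s

    sata_mb_per_second = 150      # rough ceiling for a commodity SATA disk

    print(f"Cluster output: {total_gb_per_hour} GB/hour, "
          f"about {total_mb_per_second:.0f} MB/s")
    print(f"Commodity disks needed just to absorb the writes: "
          f"{total_mb_per_second / sata_mb_per_second:.1f}")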
Even if the database software is very fast, the hardware it's running on simply isn't going to be able to keep up. A cluster of model runners is going to need a clustered database.
So, what do we really need in a database?
- High Throughput. The database must be structurally designed to write data to disk very quickly, just to keep up with the cluster that generates it.
- High Availability. With the architecture I've got in mind, the model runner nodes come and go, but the database is always there. If the database goes down for any reason, nobody can run their models. The simple fact is that hardware fails -- in a cluster of 1,000 machines you're always going to have a node or two down for some reason. Our database needs to work around those problems without bothering users.
- Scale. The database must be able to scale to at least tens of terabytes without hitting a wall due to hardware limitations. It must also be able to scale back down once we've post-processed the model data and discarded it, so we're not paying for idle computers with empty disks when we don't need to.
- Flexible data model. In the usage scenarios I described above, many users will be storing data from many different models into the same database, so the database cannot insist on a single, predefined structure for everything it holds.
That Rules Out Relational Databases, Then
Even if you don't know it, you've interacted with relational databases (also known as relational database management systems, or RDBMSes) quite a bit. They are very good at what they do:
- Storing long lists (tables) of more or less identically structured items (rows), such as accounting transactions
- Indexing them so they can be queried efficiently
- Enforcing "transaction" rules so an operation either does, or does not, complete. (The database is never in an invalid, in-between state. The technical terms are atomicity and consistency.)
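To make that "all or nothing" guarantee concrete, here is a minimal sketch using Python's built-in sqlite3 module; the accounts table and the transfer are hypothetical, chosen only to show a transaction either committing completely or rolling back completely.

    import sqlite3

    # A toy illustration of atomicity using Python's built-in sqlite3 module.
    # The "accounts" table and the transfer are hypothetical, not part of Ancho.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("alice", 100), ("bob", 0)])
    conn.commit()

    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - 40"
                         " WHERE name = 'alice'")
            conn.execute("UPDATE accounts SET balance = balance + 40"
                         " WHERE name = 'bob'")
            raise RuntimeError("simulated crash before the transfer finished")
    except RuntimeError:
        pass

    # The half-finished transfer was rolled back; both balances are unchanged.
    print(dict(conn.execute("SELECT name, balance FROM accounts")))
    # {'alice': 100, 'bob': 0}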
Unfortunately, relational databases (SQL databases like Oracle, PostgreSQL, MySQL or Microsoft SQL Server) lack some features that are important for us. They also have several resource-intensive features that we can do without.
Relational Databases Do Not Scale Well
When your database gets big, you have two choices: scale up (vertically) or scale out (horizontally). Scaling up requires exotic hardware; for instance, you can run Oracle on a single machine with 64 CPUs, vast amounts of RAM and huge disk resources. The costs are enormous, and you still have a single point of failure: if that machine goes down, so does your entire operation. We don't want to scale up. The other, increasingly popular choice is to scale out by spreading the database across a cluster of smaller, cheaper machines. Relational databases don't do this very well, because they rely on a single master in order to maintain consistency. Also, clustered relational databases -- even open source packages -- are difficult and/or expensive to set up.
Relational Databases Are Not Very Flexible
Generally, a relational database's schema must be defined in advance, which would be a problem for Ancho since we need to support heterogeneous data models.
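As a hypothetical illustration (the table, columns, and models here are invented), a schema designed around one model's output leaves no room for the next model's:

    # Hypothetical: a relational schema designed for one model's output...
    schema_for_model_a = """
    CREATE TABLE results (
        run_id    INTEGER,
        timestep  INTEGER,
        portfolio REAL      -- model A's single output value per timestep
    );
    """

    # ...cannot hold a record from model B, whose runs have a different shape.
    model_b_record = {"run_id": 7, "day": 120, "infected": 1834, "recovered": 912}

    # In an RDBMS, every new model shape means an ALTER TABLE or a brand-new
    # table; a schemaless store can accept both shapes without ceremony.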
Relational Databases Have Features We Do Not Need
- Transaction support -- a key benefit of relational databases -- is not that important for Ancho.
- Relational databases are also very good at complex ad-hoc queries, such as "Show me all X's that have a Y whose Z property is greater than 3." Because Ancho models are all built within the same framework and follow common design patterns, we should know the kinds of queries we'll need in advance, so this isn't very important either.
That Leaves Non-Relational Databases
There are several types of non-relational databases. So-called "NoSQL" databases typically scale well, mostly because they don't enforce any relational or transactional constraints. Many of them are explicitly designed to scale horizontally by adding machines. The databases that power Google, Amazon and Facebook are of this type.
Either the "wide column store" or the "document store" style of NoSQL database would suit our needs. In practice there is some overlap between the two. Neither one enforces the kind of schema restriction that a relational database does.
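To give a flavor of the difference, here is a rough sketch of how one run's raw output might be laid out in each style; the keys, column names, and values are invented for illustration, not part of any Ancho design:

    # Wide column store (Cassandra/HBase flavor): the row key identifies the run,
    # and each timestep becomes another column appended to that row as it arrives.
    wide_column_row = {
        "row_key": "model-1234/run-0007",
        "columns": {"t=0": 100.0, "t=1": 101.7, "t=2": 99.4},  # ...and so on
    }

    # Document store (CouchDB/MongoDB flavor): each run is a self-contained
    # document with whatever nested structure the model cares to produce.
    document = {
        "_id": "model-1234/run-0007",
        "model": "model-1234",
        "run": 7,
        "values": [100.0, 101.7, 99.4],
    }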
This comparison of several candidate database packages -- Cassandra, CouchDB, HBase and MongoDB -- shows their features side-by-side. All are open source, "free software," non-SQL, non-transactional data stores.
The biggest differences between them are:
- Whether they support specific data types (characters, dates, numbers) as opposed to simple binary "blob" data.
- Their support for querying by values of certain columns (document databases generally can't give you a list of all the entries with a certain property)
- How well they support scalability. Do they have single points of failure? Do they deal with the loss of a member computer gracefully? Are they reliable?
Cassandra For The Win!
For the features we need, I believe that Apache Cassandra is the winner.
It is fully decentralized with no single point of failure. Read and write performance scales linearly as machines are added to the database cluster. Individual machines in the cluster can fail (or be taken down for maintenance) without interrupting access to your data. The cluster can grow or shrink as needed based on the amount of data we need to store.
It is also reputed to be much easier to deploy and keep running than some of the alternatives. For instance, Cassandra can be deployed on a single developer's machine for development purposes much more easily than HBase, which drags in a long list of dependencies whose versions must all match correctly.
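As a taste of what talking to that single development node looks like, here is a minimal sketch assuming the DataStax cassandra-driver package for Python and a Cassandra instance listening on localhost; the keyspace, table, and column names are invented for illustration and are not an actual Ancho schema.

    from cassandra.cluster import Cluster

    # Connect to a single-node development Cassandra running locally.
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()

    # A throwaway keyspace/table; replication_factor=1 is only sensible for dev.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS ancho_dev
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS ancho_dev.raw_results (
            run_id   text,
            timestep int,
            value    double,
            PRIMARY KEY (run_id, timestep)
        )
    """)

    # Write one hypothetical data point and read it back.
    session.execute(
        "INSERT INTO ancho_dev.raw_results (run_id, timestep, value) VALUES (%s, %s, %s)",
        ("model-1234/run-0007", 0, 100.0),
    )
    for row in session.execute(
            "SELECT * FROM ancho_dev.raw_results WHERE run_id = %s",
            ("model-1234/run-0007",)):
        print(row.run_id, row.timestep, row.value)

    cluster.shutdown()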
Cassandra was originally built by Facebook engineers. It was open-sourced in 2008 and its features continue to develop nicely. It is built on many of the same principles that underlie Google's BigTable and Amazon's Dynamo databases, neither of which is open source or free for others to use. It is currently in production use at Cisco (WebEx), Constant Contact (their social media marketing application), Netflix (the "Watch Instantly" streaming service), and Twitter.
In summary, Cassandra does all the things I believe we'll need a database to do, is ready for production use, and its price point is just right: free. If you think I'm being ridiculous, and that a massively scalable data store like Cassandra is either overkill or the wrong fit, please let me know why.