Economies of Scale

Update: I re-ran this code in October 2014 on a MacBook Pro and the time estimates are, somewhat predictably, very different. Where my old laptop would take 10 hours to run the model, the new hardware could run it in about 45 minutes.

I did a little "code dive" to see how an Ancho model might work.  The test simulated a Run of 10 Sequences, each with 500 Frames, and each Frame randomizing 1,000 variables.  A real Run would need something more on the order of 10,000 Sequences to get meaningful results.  Those numbers are somewhat arbitrary, but they're reasonable for getting a feel for the scale of things.

Here's the time trial code, which I posted at Pastie mostly for the syntax highlighting.  Let's see what it tells us about Ancho's requirements for CPU and storage.
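In outline, the trial looks something like the sketch below. This is not the exact Pastie code; the uniform-random draws stand in for whatever the real model actually computes per variable.

```python
import random
import time

SEQUENCES = 10       # a real Run would be more like 10,000
FRAMES = 500
VARIABLES = 1_000

start = time.time()
for sequence in range(SEQUENCES):
    for frame in range(FRAMES):
        # Randomize 1,000 variables per Frame. A real model would feed
        # these draws into whatever equations define the system.
        values = [random.random() for _ in range(VARIABLES)]
elapsed = time.time() - start

per_sequence = elapsed / SEQUENCES
print(f"{per_sequence:.2f} seconds per Sequence")
print(f"10,000 Sequences: ~{per_sequence * 10_000 / 3600:.1f} hours")
```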

Based on the results of this time trial (on my four-year-old MacBook Air), it would take about 10 hours to run 10,000 Sequences of this model.  Assuming 4 bytes per variable for a single-precision floating-point number, it would generate 4 kilobytes per Frame, about 2 megabytes per Sequence, and about 19 gigabytes for the entire Run.
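Just to check the byte math (these "gigabytes" are binary, which is why 20 billion bytes comes out to about 19):

```python
bytes_per_var = 4                      # single-precision float
per_frame = 1_000 * bytes_per_var      # 4,000 bytes, ~4 KB
per_sequence = 500 * per_frame         # 2,000,000 bytes, ~2 MB
per_run = 10_000 * per_sequence        # 20,000,000,000 bytes

print(per_run / 2**30)                 # ~18.6, i.e. "about 19 gigabytes"
```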

Ten hours? Dang.

Yep. The point of Monte Carlo is not computational efficiency; the point is to throw computing horsepower at the problem.  Still, you probably don't want to sit there for ten hours while your laptop generates, and then aggregates, almost 20 gigabytes of data.

Fortunately, an Ancho model should be what is referred to in computer science as "embarrassingly parallel."  Each Frame can depend on the results of previous Frames in a Sequence, but each Sequence is totally independent of the other Sequences.  That means you could have a cluster of 10,000 computers, each doing a part of the problem, and complete the Run in about 4 seconds.  Obviously, there are some trade-offs involved: time to set up the cluster and time to move data back and forth over the network.
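The same fan-out works whether the workers are CPU cores on one laptop or machines in a cluster. Here's a sketch using Python's multiprocessing module; run_sequence is a placeholder for the real model, not Ancho code.

```python
import random
from multiprocessing import Pool

FRAMES = 500
VARIABLES = 1_000

def run_sequence(seed):
    """One complete Sequence. Frames can depend on earlier Frames,
    but no Sequence ever depends on another Sequence -- which is
    exactly what makes the Run embarrassingly parallel."""
    rng = random.Random(seed)  # per-Sequence seed: independent, reproducible
    result = 0.0
    for frame in range(FRAMES):
        values = [rng.random() for _ in range(VARIABLES)]
        result = sum(values)   # stand-in for the real per-Frame computation
    return result

if __name__ == "__main__":
    with Pool() as pool:       # defaults to one worker per CPU core
        results = pool.map(run_sequence, range(10_000))
    print(f"{len(results)} Sequences complete")
```

Swap the local worker pool for 10,000 machines and the four-second arithmetic above follows.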

But... I Don't Have a Cluster of Computers

Few people do. Fortunately, the era of "cloud computing" is upon us. Companies like Amazon, Rackspace, or Linode have racks of computers set up and waiting, and they will rent them to you for pennies per hour.

Part of my idea for Ancho is that it will do this for you.  You define your model, and it handles setting up a temporary cluster, installing all the necessary software, running the specified number of Sequences, storing the data, and finally de-allocating the cluster so you aren't continuing to pay for it.  On Amazon's EC2, the sample model described above could probably be run for about $1.00.
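On EC2, the skeleton of that lifecycle looks roughly like the sketch below. I'm using boto3 (AWS's Python SDK) purely as an example; it's an assumption on my part, not necessarily what Ancho will use, and the AMI ID is a placeholder for an image with the model pre-installed.

```python
import boto3  # assumption: AWS's Python SDK; Ancho's actual tooling is TBD

ec2 = boto3.client("ec2", region_name="us-east-1")

# 1. Allocate the temporary cluster. The AMI ID is a placeholder for
#    an image with the model and its dependencies baked in.
reservation = ec2.run_instances(
    ImageId="ami-xxxxxxxx",
    InstanceType="c3.large",
    MinCount=10,
    MaxCount=10,
)
instance_ids = [i["InstanceId"] for i in reservation["Instances"]]
ec2.get_waiter("instance_running").wait(InstanceIds=instance_ids)

# 2. ...fan the Sequences out to the nodes and collect the results
#    into the Run database (the interesting part, elided here)...

# 3. De-allocate the cluster so the hourly billing stops.
ec2.terminate_instances(InstanceIds=instance_ids)
```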

Doing it that way, the only computing infrastructure that needs to be permanent is the database that stores the results of the Run.  In my next post, I'll look at the database packages I considered and explain my reasoning for the one that I chose.