Ancho Architecture Plan

In trying to get Ancho's wheels turning again, I've spent some time over the last week figuring out how to translate my concepts into running software. As you may recall, one of Ancho's design goals is to allow fairly large models to run faster by running them in parallel on a cluster.

These are all just my perceptions, so if you're reading this and I got something completely wrong, please let me know.

Resource Manager

Mesos is a project of the Berkeley AMPLab, part of their so-called BDAS (pronounced "bad-ass") stack. Its purpose is to allow multiple compute scheduling frameworks to share the same computing cluster without having to statically partition nodes to specific frameworks. The other alternative here was YARN, which is the resource manager used in most Hadoop distributions. I read some academic papers on Mesos, as well as the paper on Google's Omega scheduler.

The Omega paper helped me understand the architectural differences between Mesos and YARN. Mesos listens for announcements that computing slots (units of CPU and RAM) are available, offers them to running frameworks based on policies you can set, and the frameworks can accept or reject the available resources. This has some drawbacks, as identified in the Omega paper, but it generally works better in multi-framework environments than YARN's philosophy of centralizing control over what runs where. It's easier in practice for each framework to figure out what it wants or doesn't want.
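To make the offer mechanism concrete, here's a rough sketch of the decision loop a framework's scheduler has to implement, built around the resourceOffers callback from the Mesos Python bindings. The resource thresholds and the task-building helper are hypothetical placeholders, not actual Ancho code, and the import path varies by Mesos version.

    # Rough sketch of a Mesos framework scheduler's offer handling.
    # Import path is for the newer mesos.interface Python bindings; older
    # releases ship a top-level "mesos" module instead.
    from mesos.interface import Scheduler

    class AnchoScheduler(Scheduler):
        def resourceOffers(self, driver, offers):
            # Mesos calls this with resources no other framework is using.
            for offer in offers:
                cpus = sum(r.scalar.value for r in offer.resources if r.name == "cpus")
                mem = sum(r.scalar.value for r in offer.resources if r.name == "mem")

                if cpus >= 1 and mem >= 512:
                    # Accept: launch a task sized to fit inside this offer.
                    task = self._build_task(offer, cpus=1, mem=512)
                    driver.launchTasks(offer.id, [task])
                else:
                    # Reject: the resources go back to Mesos for other frameworks.
                    driver.declineOffer(offer.id)

        def _build_task(self, offer, cpus, mem):
            # TaskInfo construction (including a Docker ContainerInfo) omitted here.
            raise NotImplementedError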

So, I've chosen Mesos as my preferred underlying resource manager.

Hadoop Ecosystem vs. BDAS

The choices here were basically the Hadoop ecosystem (MapReduce, HDFS, Hive, Pig, Mahout, etc) or something else. In all honesty, I have always had trouble understanding how the various parts of Hadoop fit together. There are so many moving parts that I would want to use a curated distribution (Cloudera, Hortonworks, or MapR), and Hortonworks is the only distro that's fully open.

The Hadoop distributions are all YARN-based. Mesos makes more sense to me conceptually, and if Ancho is going to be a framework running on top of the resource manager, I need to program against something I understand. Also, Mesos explicitly supports tasks that run in Docker-based executors. I like Docker, I feel like I understand what it's doing, and I want to use it for sandboxing user-written model code, so this has a lot of appeal.

BDAS isn't quite as neatly packaged, but it also seems easier to put together just the components that I need without a Master's degree in compiler theory.

Distributed Storage

A long time ago I decided that of all the NoSQL databases, I prefer Cassandra. As above, I feel like I understand what it's doing, and its data model fits well with what Ancho models will generate and then need to analyze. Cassandra is essentially a very large multi-level hashtable. It makes more sense to me than trying to use something with file semantics like HDFS, even if database semantics are added on top of it with a layer like Hive. Also, it seems like it should have better performance for my use cases.
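To sketch what the "multi-level hashtable" fit looks like in practice, here's a hypothetical table layout for per-Sequence results using the DataStax cassandra-driver for Python. The keyspace, table, and column names are made up for illustration; Ancho's real schema doesn't exist yet.

    # Hypothetical Cassandra schema for per-Sequence results.
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS ancho
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    """)

    # Partition by run_id, cluster by sequence_id: every result for a Run
    # lands in one partition, which matches "write per Sequence, analyze per Run".
    session.execute("""
        CREATE TABLE IF NOT EXISTS ancho.sequence_results (
            run_id uuid,
            sequence_id int,
            outputs list<double>,
            PRIMARY KEY (run_id, sequence_id)
        )
    """)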

Distributed Computing

Apache Spark seems like the way to go on this, for several reasons. It's fast, it's directly integrated with Mesos, it can read and write directly to Cassandra, and its Python API is a first-class Spark citizen. I feel like I understand the "Resilient Distributed Dataset" (RDD) abstraction that forms the core of Spark.
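As a small illustration of how those pieces connect, here's a hedged PySpark sketch that points a SparkContext at a Mesos master and runs a trivial RDD computation. The master URL and the data are placeholders; reading straight from Cassandra would additionally need a connector package.

    # Minimal PySpark sketch: Spark on Mesos plus a trivial RDD computation.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("mesos://zk://10.0.0.1:2181/mesos")  # placeholder master URL
            .setAppName("ancho-summary"))
    sc = SparkContext(conf=conf)

    # Toy RDD standing in for per-Sequence outputs pulled from Cassandra.
    outputs = sc.parallelize([0.1, 0.4, 0.35, 0.8], numSlices=4)
    total = outputs.reduce(lambda a, b: a + b)  # runs in parallel across executors
    print(total)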

On top of Spark, their machine learning library MLlib seems to offer a lot of the functions I will need for computing summary statistics after an Ancho model has run. (The MLlib documentation also says it interoperates with NumPy, although I haven't seen any details about this.) It would be nice to have access to Pandas for this, but Pandas is not designed to run across distributed data as far as I know, and SparklingPandas (Pandas on top of Spark) doesn't seem to be very actively developed.
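For the summary-statistics piece specifically, MLlib's Statistics.colStats computes column-wise summaries over an RDD of vectors, and it accepts NumPy arrays directly, which looks like the NumPy interoperability in its simplest form. A hedged sketch, reusing the SparkContext sc from above with stand-in data:

    # Column-wise summary statistics over a distributed dataset with MLlib.
    import numpy as np
    from pyspark.mllib.stat import Statistics

    # Each element is a NumPy vector; MLlib treats them as dense vectors.
    rdd = sc.parallelize([np.array([1.0, 10.0]),
                          np.array([2.0, 20.0]),
                          np.array([3.0, 30.0])])

    summary = Statistics.colStats(rdd)
    print(summary.mean())      # per-column means
    print(summary.variance())  # per-column variances
    print(summary.max())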

Spark seems able to handle everything I can imagine needing a distributed computing framework for.

Architecture

So, my architecture diagram ends up looking something like this:

Diagram generated with Draw.io

User Story

The usage scenario would run something like this:

  • Users author their models in Python against an Ancho API. Development, testing and small-scale runs of the model can be done locally without any cluster infrastructure.
  • To run the model in a cluster, the user would submit the model to an Ancho Cluster (running atop Mesos as a framework) by one of two means:
    • Submitting the needed parameters by POSTing a JSON file (or something similar) to a REST-like interface. There would probably be a command line tool for this, but the API would allow jobs to be submitted by an automated process. (See the sketch after this list for what such a payload might look like.)
    • Logging into a web site that allows users to submit jobs, monitor their execution, and get the results.
  • Parameters submitted to the Ancho Cluster for a model run would include:
    • Location of your model code -- Ancho can accept a tarball, or check it out of a source control repository, or pull it from S3, or whatever.
    • Any parameters you need to submit to your Run: starting values, number of sequences to run, target values, etc.
  • Once the Run is submitted, the Ancho Cluster framework would start one or more Docker executors across the Mesos cluster to run your model. Those executors would download the model, install any Python packages required (as specified in pip-requirements.txt) and begin running Sequences. After each Sequence completes, the executor reports its results back to the Ancho framework (which stores them in Cassandra), and the framework eventually declares the Run complete and decommissions most of the executors.
  • A few remaining executors will be told to generate the summary statistics about the Run using Spark/MLlib. I originally considered having the framework do the summary work, but this would make it impossible for users to define their own summary functions. I think that Cassandra's security model will allow me to restrict user code to accessing just the user's own data (only certain keyspaces/tables), but I may need to revisit this.
  • The final results will be exposed as a data structure to users of the RESTy interface. Web users could be presented something more interactive, such as an IPython session with the result data preloaded, or something prettier and easier to digest, such as a PDF generated via ReportLab.
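To make the REST-like submission option in the list above concrete, here's a hedged sketch of what a job-submission request might look like. The endpoint URL, field names, and payload shape are all hypothetical placeholders; the point is just that everything a Run needs should fit in a small JSON document.

    # Hypothetical job-submission payload; endpoint and field names are placeholders.
    import json
    import requests

    run_request = {
        "model": {
            "source": "git",                             # or "tarball", "s3", ...
            "url": "https://example.com/user/my-model.git",
        },
        "parameters": {
            "starting_values": {"x0": 0.5, "y0": 1.2},
            "num_sequences": 10000,
            "target_values": {"x": 0.0},
        },
        "requirements": "pip-requirements.txt",
    }

    resp = requests.post("https://ancho.example.com/api/runs",
                         data=json.dumps(run_request),
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()
    print(resp.json()["run_id"])  # hypothetical response field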

MapReduce and NumPy?

This is a question, but it's a long question and I'm hoping someone out there might know the answer. I'm looking for something equivalent to NumPy/SciPy that is implemented in a Map/Reduce fashion -- or a better way to conceive this problem.

Say, for instance, that you have a very large array of floating point numbers, and you want to compute summary statistics like standard deviation, skewness and kurtosis. It might fit in memory on a modern machine, and it would definitely fit on disk, where you could work with it using numpy.memmap, which is basically an array-like object backed by a file on disk. However, the process that generated this very large array was distributed across several nodes, so the data isn't all in one place. So basically we've mapped our function, and we need a reduce step that produces the needed result -- preferably one that doesn't require copying all the data around between nodes.
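For reference, the single-node version really is that easy; a tiny sketch (the filename is a placeholder):

    # Single-node version: treat a file on disk as a 1-D float array.
    import numpy as np
    from scipy import stats

    data = np.memmap("very_large_array.dat", dtype="float64", mode="r")
    print(data.std(), stats.skew(data), stats.kurtosis(data))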

Sure, you could copy all the data into one place and do the reduction, but it seems like it would be drastically more efficient to use the distributed infrastructure to solve this problem: do some of the work at each node, and then do a final step to aggregate their work. The problem is that I don't want to re-implement all these functions myself if someone else has already done it, because I will probably do it badly compared to someone who really understands the math.
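To illustrate the shape of what I'm after: for moment-based statistics, each node could compute raw power sums over its local chunk, and a single reduce step could combine them into the mean, standard deviation, skewness, and kurtosis. A rough sketch in plain NumPy, with local chunks standing in for the nodes (raw power sums can lose precision on large data, which is exactly why I'd rather use something written by someone who understands the numerics):

    # Combine per-node partial results into global summary statistics.
    # Each "node" contributes only (n, sum x, sum x^2, sum x^3, sum x^4),
    # so no raw data has to move between nodes.
    import numpy as np

    def partial_moments(chunk):
        # Runs on each node against its local slice of the array.
        return np.array([chunk.size,
                         chunk.sum(),
                         (chunk ** 2).sum(),
                         (chunk ** 3).sum(),
                         (chunk ** 4).sum()])

    def combine(partials):
        # Final reduce step: add up the power sums, then convert to central moments.
        n, s1, s2, s3, s4 = np.sum(partials, axis=0)
        mean = s1 / n
        m2 = s2 / n - mean ** 2
        m3 = s3 / n - 3 * mean * s2 / n + 2 * mean ** 3
        m4 = s4 / n - 4 * mean * s3 / n + 6 * mean ** 2 * s2 / n - 3 * mean ** 4
        skewness = m3 / m2 ** 1.5
        kurtosis = m4 / m2 ** 2 - 3  # excess kurtosis
        return mean, np.sqrt(m2), skewness, kurtosis

    # Local stand-in for "several nodes": split one array into chunks.
    data = np.random.randn(1000000)
    chunks = np.array_split(data, 8)
    print(combine([partial_moments(c) for c in chunks]))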

So, does a fork or subproject of numpy/scipy exist that can do functions on parts of arrays and then combine them? Does a map/reduce-aware implementation of numpy exist?

UPDATE

Brian Hicks suggested using mrjob or Sparkling Pandas. The Pandas-on-Spark project might do it, but I think I'm going to sink some time into learning Spark's MLlib, which seems like it would do what I need.

Thank You, Generous Benefactor

I tweeted about this on July 21, but the tale deserves to be told in longer form: I scored an Apache-licensed accounting package for inclusion in Ancho (or any other Python project).

Photo Credit via Flickr / Creative Commons

This story starts out in 2008, when I attended PyCon in Chicago. One of the vendors who had a booth in the expo hall was FiveDash, an Australian company that was working on a Python-based accounting system. I expressed some interest and we corresponded briefly after that, mostly about the incredible complexity of implementing sales tax calculations for all the various jurisdictions in the United States. That conversation died down, and that was that... until 2013.

A few months ago, I realized that in order to implement some of my use cases for Ancho, I was going to need a double-entry general ledger accounting system that I could embed inside my own project, or at least call out to through some reasonable API.  I took a look at what was out there in the open source world.  Generally, what I found was that most open source accounting systems are either:

  • Too tightly integrated with an ERP package, e.g. OpenERP, requiring this huge complicated install to get the tiny amount of functionality that I need; or
  • Too tightly integrated with their view layer, such that the accounting component can't be used via an API on a "headless" basis.

I remembered the FiveDash project, which hadn't come up in any of my searching, so I looked for it explicitly by name. As it turns out, the company behind the accounting software has shut down. The new owners of fivedash.com are selling software-defined radios. The only thing I could find about the accounting package was their Launchpad repository.

The source code is in decent shape, and it has at least some of the functions I need, even if it's coupled fairly tightly to web2py and PostgreSQL. The biggest problem was the license... it was released under version 3 of the GNU General Public License, so if I created a derivative work and integrated it into Ancho, I would run the risk of "infecting" Ancho with the GPL.

Please note that I don't have anything against the GPL.  It's good for certain projects.  I've met Richard Stallman -- once when he spoke at Northwestern, and possibly another time on the Tremont side of Boston Common.  (Although that may have been a hobo. It was hard to tell.)   I ate some pizza with Bradley Kuhn at a meeting in Chicago.  They are nice people and deserving of many thanks.  However, my observation has been that software projects which are non-academic, somewhat esoteric, and potentially of commercial use tend to get broader adoption if they use BSD or Apache style licensing.  So that is what I'd planned to use for Ancho.

So with nothing to lose, I emailed the project maintainer, through Launchpad, at his last known address. Somewhat to my surprise, I got an email back a couple of days later indicating that the project had been abandoned by the copyright holder. They gave me commit access to the repository and permission to change the licensing to permit usage under either the GPL or the Apache License version 2.0.

So, now there is an Apache-licensed accounting package for Python, just waiting for adaptation for my (or anyone else's) use.  Just because I asked.  Pretty sweet.

The next job is to dig into the source code and see how much work it's going to be to:

  • Separate the underlying accounting system from the user interface layer
  • (Maybe) modify the system so it can be used with SQLite as the data store, to make it a little more lightweight.

If anybody's interested in helping, please contact me via Twitter or email at tbecker@fortpedro.com.