Observer pattern / Event oriented programming in Python

I recently decided that in one portion of Ancho's code, it would make sense to use an event-oriented paradigm. I'm familiar with the Observer pattern from the Gang of Four (Design Patterns) book, and I've used event models in things like GUI programming and JavaScript. The general idea is that instead of having your code constantly check for some condition, you can tell the event system "let me know when this happens," which can be much more efficient.
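As a sketch of the idea (the names here are my own, not from any particular library):

```python
class Subject:
    """Minimal observer pattern: observers register callbacks once, and
    the subject notifies them when something happens -- no polling."""

    def __init__(self):
        self._observers = []

    def subscribe(self, callback):
        self._observers.append(callback)

    def notify(self, *args, **kwargs):
        for callback in self._observers:
            callback(*args, **kwargs)


# Instead of repeatedly checking for a condition, register interest once.
events = Subject()
events.subscribe(lambda msg: print("got:", msg))
events.notify("something happened")
```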

Event programming can be hard to wrap your head around at first, but this is in a part of Ancho that casual users probably won't see much. I think it's worth the added complexity.

Python famously tends to have one obvious way to do something. But I've never done event programming in Python before, and my initial searches did not turn up a single obvious, preferred approach. So I'm documenting my search here in the hope that it saves others some effort in answering the same question.

Narrowing the Search

There are some sub-categories of event-based programming:

  1. Networked message passing, job queues, etc. This would include things like Celery on top of an underlying task queue. (Actually seems like a good summary of all the queue systems out there.)
  2. Responding to events from the network, dealing with concurrency and non-blocking I/O. This would include things like Twisted, Eventlet, GEvent, and the like.
  3. Libraries for handling in-process event propagation without regard for networking or I/O. Mostly these are concerned with performance, concurrency issues, and prevention of memory leaks. These can be synchronous (calls to event handlers must return before the next handler is called) or asynchronous (calls to event handlers execute in their own threads and may run simultaneously). For my purposes, I had planned to use synchronous calls.

For my purposes, I am exclusively interested in the last of the above categories.
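To make the synchronous/asynchronous distinction in category 3 concrete, here is a toy dispatcher supporting both modes (my own sketch, not code from any of the libraries discussed below):

```python
import threading


class Dispatcher:
    def __init__(self):
        self._handlers = []

    def connect(self, handler):
        self._handlers.append(handler)

    def fire_sync(self, *args):
        # Synchronous: each handler must return before the next is called.
        for handler in self._handlers:
            handler(*args)

    def fire_async(self, *args):
        # Asynchronous: each handler runs in its own thread and may
        # execute simultaneously with the others.
        threads = [threading.Thread(target=h, args=args) for h in self._handlers]
        for t in threads:
            t.start()
        return threads
```

For Ancho I'd be using the `fire_sync` style.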

Found, but Rejected

  • The Python standard library's signal module. This is really for dealing with Unix-style signals -- interruptions delivered to the process from outside, like what happens when you hit Ctrl-C.
  • PyDispatcher. This software was last updated in 2011, which is either because it's really stable and perfect (not likely) or because it's no longer actively used or maintained. Since I'm looking for something that is actively under development, this means that PyDispatcher is out. Besides, Django stopped using PyDispatcher several years ago, in favor of a modified version that was roughly 10 times faster.
  • Michael Foord's Event Hook. This is more of a recipe than an actual piece of code. I'm looking for something that's being maintained.
  • Circuits. This appears to be more heavyweight than what I'm looking for. It is described as a "component architecture" and includes a web server and networking i/o components, which would put it more in my second category.
  • Py-Notify. Last release was in 2008. The author's benchmarking, however, is of interest: compared with PyGTK (a C-based implementation), Py-Notify is considerably slower. That's no surprise, but I might use a similar benchmarking method to compare the libraries that seem like viable candidates.
  • Observed. Interesting, but it is more language-level in that it sends events for all method calls. You don't seem to have the level of control needed to create your own events. Not quite what I was looking for.
  • PySignals. Based on the django.dispatch code that replaced PyDispatcher; but apparently not touched since 2010.
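To illustrate why the signal module is the wrong tool here: its handlers respond to signals the operating system delivers to the process, not to events raised inside your own code. A quick Unix-only demonstration (self-signaling just for illustration):

```python
import os
import signal


def handler(signum, frame):
    print("caught signal", signum)


# The handler fires when the *process* receives SIGUSR1 from the OS or
# another process -- there's no way to define application-level events.
signal.signal(signal.SIGUSR1, handler)
os.kill(os.getpid(), signal.SIGUSR1)  # deliver the signal to ourselves
```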


These, on the other hand, are all close enough to what I'm looking for to merit a fuller evaluation.

  • zope.event. Updated in 2014, and seems to be regularly developed. I'm not sure whether it has any dependencies on the rest of Zope; if so, that's a big negative. I have also read that it does not support passing arguments through to handlers. That seems like a potential downside, depending on how my usage patterns shake out.
  • Django's signals implementation. Like zope.event, it's probably integrated somewhat with other code I don't need, but it's well-tested in production environments.
  • PyPubSub. Last updated in February 2014, so that's fairly recent. Support activity seems weak; I don't see many references to it online. I'll try it anyway, because its implementation (publish/subscribe with a central dispatcher) is a little different from the others.
  • Events. Bills itself as "Python event handling the C# style." The C# pedigree doesn't do anything for me, but I do like that it's well-documented, has frequent commit activity, and has explicit support for Python 2.7, 3.3 and 3.4.
  • Smokesignal. This seems to be an indirect descendant of Django's signals, as PySignals was, except that this is still actively maintained. As of this writing the last commit was 11 days ago.
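The C#-ish style that Events advertises -- and that Michael Foord's recipe pioneered -- looks roughly like the following. This is my own self-contained approximation, not the actual code of either:

```python
class EventHook:
    """An event that handlers attach to with `+=`, in the style of C# /
    the Events library / Michael Foord's Event Hook recipe."""

    def __init__(self):
        self._handlers = []

    def __iadd__(self, handler):
        self._handlers.append(handler)
        return self

    def __isub__(self, handler):
        self._handlers.remove(handler)
        return self

    def __call__(self, *args, **kwargs):
        for handler in list(self._handlers):
            handler(*args, **kwargs)


class Downloader:
    def __init__(self):
        self.on_progress = EventHook()

    def run(self):
        self.on_progress(100)  # fire the event


dl = Downloader()
dl.on_progress += lambda pct: print("progress:", pct)
dl.run()
```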

So, my plan is to read up on these candidates; develop a test workload similar to my use case that can be implemented on each of them; benchmark them for performance; and report back on their overall fitness and ease of use. My initial guess is that zope.event and django.dispatch will not be easy fits, because of their connections to those larger frameworks... but we'll see.
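The benchmark harness will probably look something like this timeit-based skeleton, with one small adapter written per candidate library (the adapter shown is a plain-Python baseline, not real library code):

```python
import timeit


def bench(setup_dispatch, n_handlers=10, n_events=1000):
    """Time firing `n_events` events into `n_handlers` no-op handlers.

    `setup_dispatch` returns (connect, fire) callables wrapping one
    candidate library behind a uniform interface.
    """
    connect, fire = setup_dispatch()
    for _ in range(n_handlers):
        connect(lambda *a: None)
    return timeit.timeit(lambda: fire("ping"), number=n_events)


# Baseline adapter: a plain list of callbacks, i.e. the do-it-yourself
# floor that every library's overhead gets compared against.
def plain_list():
    handlers = []

    def fire(*args):
        for h in handlers:
            h(*args)

    return handlers.append, fire


print("baseline:", bench(plain_list))
```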

The heat will be on!

Mesos, Cassandra, Spark, etc.

A hobo stove. (Vagrant. Hobo. Get it?) Photo Credit:  Flickr / Creative Commons


Over the weekend I developed a Vagrant setup that installed a working one-node Mesos installation, with Zookeeper, with Cassandra and Spark frameworks, based on Ubuntu 14.04 LTS (Trusty Tahr) 64-bit. Docker container support is also enabled. This was not super-easy to get working, and I went down some blind alleys. There are still a few outstanding issues.

  • I couldn't get Mesos to build at all until I raised the VM's memory allocation to 2GB. This seems obvious in hindsight, but it would have saved me some time if the docs had indicated a RAM requirement for the build step.
  • I couldn't get Mesos to work under OpenJDK 7. The build completes successfully, but the tests fail. The docs, which seem a little out of date, ask you to use OpenJDK 6 -- maybe that's for some good reason? I had always heard that OpenJDK was not very performant, but it seems that these days it is mostly the same as Oracle's JDK except for a few proprietary details, mostly having to do with web browser integration.
  • It wasn't super clear to me which version of Spark to download; Spark has to be built for particular Hadoop versions.  I guessed that the version "pre-built for Hadoop 1.x" would be what I needed, because I'd seen some documents indicating that Mesos and Hadoop 2.x do not play well together. It seemed to work.
  • I had some difficulties getting Spark Shell to recognize my environment variables; in the end I got it working by making a symbolic link from the location it expected to the location where it actually was.
  • Mesos and Spark both seem to assume that you're going to use HDFS alongside Mesos, but neither of them install HDFS for you. I didn't get around to that because frankly I don't (yet) understand how all the various components of HDFS work.

I also got most of the same setup working on Ubuntu 12.04 LTS (Precise Pangolin) 64-bit but I would prefer to standardize on 14.04 if possible. Precise is over two years old now. I don't have any religious attachment to Ubuntu, it's just what I am familiar with. I've used LTS releases of Ubuntu in production for over 8 years now.

I realize that a one-node VM with all this stuff is not very realistic, compared to the multi-node, many-core, many-GB-of-RAM, SSD machines you would want to deploy on for production. But certain elements of the Ancho infrastructure will run on this, and it is important to me -- and to hypothetical future project contributors -- to be able to quickly replicate a development environment, on a typical developer's laptop.

Once I get the kinks worked out, I would probably publish this as a ready-to-use Vagrant box, because it takes a non-trivial amount of time for all this stuff to build.

Ancho Architecture Plan

In trying to get Ancho's wheels turning again, I've spent some time in the last week trying to figure out how to translate my concepts into running software. As you may recall, one of Ancho's design goals is to allow fairly large models to run faster by running them in parallel on a cluster.

These are all just my perceptions, so if you're reading this and I got something completely wrong, please let me know.

Resource Manager

Mesos is a project of the Berkeley AMPLab, part of their so-called BDAS (pronounced "bad-ass") stack. Its purpose is to allow multiple compute scheduling frameworks to share the same computing cluster without having to statically partition nodes to specific frameworks. The other alternative here was YARN, which is the resource manager used in most Hadoop distributions. I read some academic papers on Mesos, as well as the paper on Google's Omega scheduler.

The Omega paper helped me understand the architectural differences between Mesos and YARN. Mesos listens for announcements that computing slots (units of CPU & RAM) are available, offers them to running frameworks based on policies you can set, and the frameworks can accept or reject the available resources. This has some drawbacks, as identified in the Omega paper, but it generally works better in multi-framework environments than YARN's philosophy of centralizing control over what runs where. In practice, it's easier for each framework to figure out what it wants or doesn't want.
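As I understand the offer model, it boils down to something like this toy simulation (my own sketch, not Mesos API code):

```python
# Toy simulation of Mesos-style resource offers: the master offers each
# free slot to the running frameworks in turn; each framework -- not the
# master -- decides whether to accept or decline.
def allocate(offers, frameworks):
    """offers: list of (cpus, mem_mb) slots reported free by slave nodes.
    frameworks: callables returning True to accept an offer."""
    placements = []
    for offer in offers:
        for fw in frameworks:
            if fw(offer):                  # framework decides, not the master
                placements.append((fw, offer))
                break                      # slot consumed; move to next offer
    return placements


# One framework that only wants beefy slots, and one that takes anything.
spark_like = lambda offer: offer[0] >= 4   # insists on >= 4 CPUs
cassandra_like = lambda offer: True

placed = allocate([(8, 16384), (1, 2048)], [spark_like, cassandra_like])
# The 8-CPU slot goes to the picky framework; the 1-CPU slot to the other.
```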

So, I've chosen Mesos as my preferred underlying resource manager.

Hadoop Ecosystem vs. BDAS

The choices here were basically the Hadoop ecosystem (MapReduce, HDFS, Hive, Pig, Mahout, etc) or something else. In all honesty, I have always had trouble understanding how the various parts of Hadoop fit together. There are so many moving parts that I would want to use a curated distribution (Cloudera, Hortonworks or MapR) and Hortonworks is the only distro that's fully open.

The Hadoop distributions are all YARN-based. Mesos makes more sense to me conceptually, and if Ancho will need to be a framework running on top of the resource manager, I need to program against something I understand.  Also, Mesos explicitly supports tasks that run in Docker-based executors. I like Docker, I feel like I understand what it's doing, and I want to use it for sandboxing user-written model code, so this has a lot of appeal.

BDAS isn't quite as neatly packaged, but it also seems like it's easier to put together just the components that I need without needing a Master's degree in compiler theory.

Distributed Storage

A long time ago I decided that of all the NoSQL databases, I prefer Cassandra. As above, I feel like I understand what it's doing, and its data model fits well with what Ancho models will generate and then need to analyze. Cassandra is essentially a very large multi-level hashtable. It makes more sense to me than trying to use something with file semantics like HDFS, even if database semantics are added on top of it with a layer like Hive. Also, it seems like it should have better performance for my use cases.
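The "very large multi-level hashtable" mental model, in plain Python terms (an analogy only, not the full story of the CQL data model):

```python
# Rough mental model of Cassandra's storage: keyspace -> table ->
# partition key -> clustering key -> value.  In a real cluster the
# partition key also determines which nodes hold the row.
store = {
    "ancho": {                              # keyspace
        "results": {                        # table / column family
            "run-42": {                     # partition key (one model Run)
                ("seq-001", "score"): 0.91,  # clustering key -> value
                ("seq-002", "score"): 0.87,
            }
        }
    }
}

# Lookups are hash hops, not file scans -- which is why this feels more
# natural to me than HDFS-style file semantics for this workload.
score = store["ancho"]["results"]["run-42"][("seq-001", "score")]
```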

Distributed Computing

Apache Spark seems like the way to go on this, for several reasons. It's fast, it's directly integrated with Mesos, it can read and write directly to Cassandra, and its Python API is a first-class Spark citizen. I feel like I understand the "Resilient Distributed Dataset" (RDD) abstraction that forms the core of Spark.
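PySpark code over an RDD reads much like functional Python over a list. The following is a plain-Python analogy of the shape of it -- with a real SparkContext the pipeline would start as `sc.parallelize(results)` and the same chained calls would execute lazily across the cluster:

```python
from functools import reduce

# Plain-Python stand-in for an RDD pipeline over per-Sequence results.
results = [0.91, 0.87, 0.15, 0.78]

passing = list(filter(lambda x: x >= 0.5, results))  # like RDD.filter
scaled = list(map(lambda x: x * 100, passing))       # like RDD.map
total = reduce(lambda a, b: a + b, scaled)           # like RDD.reduce
```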

On top of Spark, their machine learning library MLlib seems to offer a lot of the functions I will need for computing summary statistics after an Ancho model has run. (It also says that it interoperates with NumPy, although I haven't seen any details about this.) It would be nice to have access to Pandas for this, but Pandas is not designed to run across distributed data as far as I know, and SparklingPandas (Pandas on top of Spark) doesn't seem to be very actively developed.

Spark seems to cover everything I can imagine needing a distributed computing framework for.


So, my architecture diagram ends up looking something like this:

Diagram generated with

User Story

The usage scenario would run something like this:

  • Users author their models in Python against an Ancho API. Development, testing and small scale runs of the model can be done locally without any cluster infrastructure.
  • To run the model in a cluster, the user would submit the model to an Ancho Cluster (running atop Mesos as a framework) by one of two means:
    • Submitting the needed parameters by POSTing a JSON file (or something similar) to a REST-like interface. There would probably be a command line tool for this, but the API would allow jobs to be submitted by an automated process.
    • Logging into a web site that allows users to submit jobs, monitor their execution, and get the results.
  • Parameters submitted to the Ancho Cluster for a model run would include:
    • Location of your model code -- Ancho can accept a tarball, or check it out of a source control repository, or pull it from S3, or whatever.
    • Any parameters you need to submit to your Run: starting values, number of sequences to run, target values, etc.
  • Once the Run is submitted, the Ancho Cluster framework would start one or more Docker executors across the Mesos cluster to run your model. Those executors would download the model, install any Python packages required (as specified in pip-requirements.txt) and begin running Sequences. After each Sequence completes, it reports its results back to the Ancho framework (which stores them in Cassandra); the framework eventually declares the Run complete and decommissions most of the executors.
  • A few remaining executors will be told to generate the summary statistics about the Run using Spark/MLlib. I originally considered having the framework do the summary work, but this would make it impossible for users to define their own summary functions. I think that Cassandra's security model will allow me to restrict user code to accessing just the user's own data (only certain keyspaces/tables), but I may need to revisit this.
  • The final results will be exposed as a data structure to users of the REST-like interface. Web users could be presented with something more interactive, such as an IPython session with the result data preloaded, or something prettier and easier to digest, such as a PDF generated via ReportLab.
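A submission payload for the REST-like interface might look something like this. All field names and the endpoint are illustrative guesses, not a finalized API:

```python
import json

# Hypothetical Run-submission payload, mirroring the parameter list above.
payload = {
    "model": {
        # Ancho could accept a tarball URL, a source-control repo, S3, etc.
        "source": "git",
        "url": "https://example.com/user/my-model.git",
    },
    "run": {
        "starting_values": {"population": 1000},
        "sequences": 500,
        "target_values": {"converged": True},
    },
}

body = json.dumps(payload)
# A CLI tool or automated process would POST `body` to the Ancho Cluster,
# e.g. with requests: requests.post(cluster_url + "/runs", data=body)
```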