Mesos, Cassandra, Spark, etc.

A hobo stove. (Vagrant. Hobo. Get it?) Photo Credit: Flickr / Creative Commons

Over the weekend I developed a Vagrant setup that installs a working one-node Mesos cluster, with ZooKeeper and the Cassandra and Spark frameworks, based on Ubuntu 14.04 LTS (Trusty Tahr) 64-bit. Docker container support is also enabled. This was not super-easy to get working, and I went down some blind alleys. There are still a few outstanding issues.

  • I couldn't get Mesos to build at all until I raised the VM's memory allocation to 2GB. This seems obvious in hindsight, but it would have saved me some time if the docs had indicated a RAM requirement for the build step.
  • I couldn't get Mesos to work under OpenJDK 7. The build completes successfully, but the tests fail. The docs, which seem a little out of date, ask you to use OpenJDK 6 -- maybe that's for some good reason? I had always heard that OpenJDK was not very performant, but it seems that these days it is mostly the same as Oracle's JDK except for a few proprietary details, mostly having to do with web browser integration.
  • It wasn't super clear to me which version of Spark to download; Spark has to be built for particular Hadoop versions. I guessed that the version "pre-built for Hadoop 1.x" would be what I needed, because I'd seen some documents indicating that Mesos and Hadoop 2.x do not play well together. It seemed to work.
  • I had some difficulties getting Spark Shell to recognize my environment variables; in the end I got it working by making a symbolic link from the location it expected to the location that actually existed.
  • Mesos and Spark both seem to assume that you're going to use HDFS alongside Mesos, but neither of them installs HDFS for you. I didn't get around to that because frankly I don't (yet) understand how all the various components of HDFS work.
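
For anyone hitting the same build failure as in the first item, the memory bump can be sketched in a Vagrantfile roughly like this. This is a minimal sketch, not my actual Vagrantfile: it assumes the VirtualBox provider and the public "ubuntu/trusty64" box, neither of which is specified above.

```ruby
# Minimal Vagrantfile sketch. The box name and the VirtualBox provider
# are assumptions for illustration; substitute whatever you actually use.
Vagrant.configure("2") do |config|
  # A stock Ubuntu 14.04 LTS 64-bit base box (assumed).
  config.vm.box = "ubuntu/trusty64"

  config.vm.provider "virtualbox" do |vb|
    # The Mesos build did not complete for me until the VM had 2GB of RAM.
    vb.memory = 2048
  end
end
```

If the VM already exists, running `vagrant reload` should restart it with the new allocation.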

I also got most of the same setup working on Ubuntu 12.04 LTS (Precise Pangolin) 64-bit, but I would prefer to standardize on 14.04 if possible. Precise is over two years old now. I don't have any religious attachment to Ubuntu; it's just what I am familiar with. I've used LTS releases of Ubuntu in production for over 8 years now.

I realize that a one-node VM with all this stuff is not very realistic, compared to the multi-node, many-core, many-GB-of-RAM, SSD machines you would want to deploy on for production. But certain elements of the Ancho infrastructure will run on this, and it is important to me -- and to hypothetical future project contributors -- to be able to quickly replicate a development environment, on a typical developer's laptop.

Once I get the kinks worked out, I will probably publish this as a ready-to-use Vagrant box, because it takes a non-trivial amount of time for all this stuff to build.