MapReduce and NumPy?

This is a long question, and I'm hoping someone out there knows the answer. I'm looking for something equivalent to NumPy/SciPy that is implemented in a Map/Reduce fashion -- or a better way to think about this problem.

Say, for instance, that you have a very large array of floating point numbers, and you want to compute summary statistics like standard deviation, skewness and kurtosis. It might fit in memory on modern machines, and it would definitely fit on disk, where you could work with it using numpy.memmap, which is basically an array-like object that is backed by a file on disk. However, the process that generated this very large array was distributed across several nodes, so the data isn't all in one place. So basically we've mapped our function, and we need a reduce step that produces the needed result. Preferably, one that doesn't require copying all the data around between nodes.

Sure, you could copy all the data into one place and do the reduction, but it seems like it would be drastically more efficient to use the distributed infrastructure to solve this problem: do some of the work at each node, and then do a final step to aggregate their work. The problem is that I don't want to re-implement all these functions myself if someone else has already done it, because I will probably do it badly compared to someone who really understands the math.
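For the specific statistics mentioned above, the "work at each node, then aggregate" idea can be sketched in a few lines. This is only an illustration under stated assumptions, not a replacement for a vetted library: the function names (summarize, combine, finalize) are hypothetical, the merge step uses the pairwise update formulas for central moment sums commonly attributed to Chan et al. and Pébay, and it is written in pure Python to stay self-contained (in practice NumPy would vectorize the per-node step). Each node ships only five numbers, so no raw data moves between nodes.

```python
import math
from functools import reduce

def summarize(chunk):
    """Map step (runs on each node): reduce a chunk to (n, mean, M2, M3, M4),
    where Mk is the sum of (x - mean)**k over the chunk."""
    n = len(chunk)
    mean = math.fsum(chunk) / n
    dev = [x - mean for x in chunk]
    return (n, mean,
            math.fsum(d ** 2 for d in dev),
            math.fsum(d ** 3 for d in dev),
            math.fsum(d ** 4 for d in dev))

def combine(a, b):
    """Reduce step: merge two partial summaries without touching raw data."""
    na, mean_a, m2a, m3a, m4a = a
    nb, mean_b, m2b, m3b, m4b = b
    n = na + nb
    delta = mean_b - mean_a
    mean = mean_a + delta * nb / n
    m2 = m2a + m2b + delta ** 2 * na * nb / n
    m3 = (m3a + m3b
          + delta ** 3 * na * nb * (na - nb) / n ** 2
          + 3 * delta * (na * m2b - nb * m2a) / n)
    m4 = (m4a + m4b
          + delta ** 4 * na * nb * (na ** 2 - na * nb + nb ** 2) / n ** 3
          + 6 * delta ** 2 * (na ** 2 * m2b + nb ** 2 * m2a) / n ** 2
          + 4 * delta * (na * m3b - nb * m3a) / n)
    return (n, mean, m2, m3, m4)

def finalize(summary):
    """Turn a merged summary into population statistics.
    Assumes the data is not constant (M2 > 0)."""
    n, mean, m2, m3, m4 = summary
    return {
        "n": n,
        "mean": mean,
        "std": math.sqrt(m2 / n),                # population standard deviation
        "skewness": math.sqrt(n) * m3 / m2 ** 1.5,
        "kurtosis": n * m4 / m2 ** 2 - 3.0,      # excess (Fisher) kurtosis
    }

# Usage: three "nodes", each holding part of the data.
chunks = [[1.0, 2.0, 4.0], [8.0, 16.0], [3.0, 5.0, 7.0, 9.0]]
print(finalize(reduce(combine, map(summarize, chunks))))
```

Because combine is associative, the partial summaries can be merged in any grouping, which is what lets a map/reduce framework aggregate them in a tree rather than on a single node.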

So, does a fork or subproject of numpy/scipy exist that can apply functions to parts of arrays and then combine the results? Does a map/reduce-aware implementation of numpy exist?


Brian Hicks suggested using mrjob or Sparkling Pandas. The Pandas-on-Spark project might do it, but I think I'm going to sink some time into learning Spark's MLlib, which seems like it would do what I need.

How to Open Government?

Photo Credit  via Flickr / Creative Commons

So, this is something I have been thinking about for a while now.

I have friends who are supporters of open data writ large, for several reasons.  Use of open data techniques creates a general transparency for public analysis, which is good for citizens and for journalists who are trying to report on government and public affairs.  It makes it easier for staff and planning professionals to do their jobs.  And, at least theoretically, it makes it easier for citizens to supervise the actions of their elected officials -- it makes it harder to "hide the ball."

However, I am also an elected official in Crystal City -- a small town in Missouri.  And while all that sounds very nice, most of the efforts I see toward "open data" for government are directed at large cities or major metropolitan areas, not at cities our size.  Our staff does not include a web developer, and our web site is sadly out of date.  We have a high degree of vendor lock-in with our current administrative systems vendor, whose software handles everything from police bookings to water bills.  I have no idea what options they might offer, if any, that would make it easier to publish our data in an open fashion.

Complicating this is my lack of understanding of exactly what open data means.  It definitely seems to mean different things to different people.  Some are talking about crime reporting; some are talking about financial data; some are talking about statistics for things like broken sidewalks and potholes.

So, let's say hypothetically that a small city with no technical staff wanted to participate in an open government / open data initiative.  Here are the questions I have:

  • What does that mean, in language that a typical elected official can understand?  Is there a standard for reporting formats that we can say we comply with?
  • What does that participation get for the local government, specifically, that it was not already getting?  (I've found that we can comply with standards more readily if it means extra grant money or matching funds.)
  • How would we go about doing this, given our lack of technical staff and funds to outsource those functions?  Can we do that with our current technology vendor? If not, who are the players who are providing standardized software for doing this?

I think most elected officials are interested in transparency, but we aren't sure how best to provide it, and we aren't sure what tangible benefits it might offer.  Therefore, it never rises to the top of the priority list.  I would appreciate any pointers or tips that open government folks can offer.

Docker on Raspberry Pi

Photo Credit  via Flickr / Creative Commons

I have an application I'm building that needs (well, "needs") to run on a Raspberry Pi.  Deploying new versions of a full-stack application to a Pi is a pain, because if you screw something up there's no out-of-band management like there is on a cloud server.  As a result, I've been trying to streamline my devops process to use Docker so I can leave the operating system alone and only change the containerized application.

Good News, Bad News

In case you hadn't noticed, running Docker on Raspberry Pi has gotten a lot easier.  In the latest Arch Linux for Raspberry Pi images, you can actually just:

pacman -S docker

...and it works.  All of the needed features are in the kernel, and the userland tools for Docker 0.10 are in the Arch Linux for ARM repositories.  While I am more comfortable in Debian-based distributions (like Ubuntu or Raspbian), Arch is good for this purpose because it's a much smaller, more barebones OS -- perfect as the host layer underneath Docker containers.

I'm having less luck, however, getting Arch Linux for Raspberry Pi to run in an emulated environment under QEMU.  I would have sworn I had it working a couple of months ago, using the Raspbian image and a kernel I got from XEC Design, but with the latest QEMU I can't seem to get it running again.

So What?

Docker is the new hotness in lightweight application containers.  Where today's mainstream virtualization runs a separate guest operating system for each virtual machine, and for all practical purposes each virtual machine might as well be its own physical box, Docker's containers are more like isolated processes that share the host's kernel.  You can package up just as much environment as you need, and they can be instantiated so fast they act more like processes than like virtualized machines.

As of right now, Docker is officially only supported on a few x86_64-based Linux distributions.  But it really only relies on Linux kernel features plus userland binaries that are written in Go.  Since those kernel features are fairly portable across architectures -- at least to ARM -- and Go is officially supported on ARM, there should be no reason why it can't work.  And indeed it does... mostly.

That's Great, But... So What?

Raspberry Pi is a great platform for fooling around, but deploying anything to it is kind of a pain.  Pis would be nice little machines to run low-resource applications on, were it not for that problem.  Docker apps can be packaged up and distributed fairly neatly, assuming that you've got a base operating system that supports running them.  Now that we do, you can deploy your application without worrying about corruption of the underlying operating system and tools.  That can be a big help if your Raspberry Pi is down in the basement, plugged into a sensor, with no keyboard or mouse hooked up.  I think development for Raspberry Pi and other ARM-based devices is going to get a lot more fun.

I'll be giving a talk about this at the St. Louis Docker Meetup on Wednesday, June 4th in Clayton, MO.