This is a question, but it's a long question and I'm hoping someone out there might know the answer. I'm looking for something equivalent to NumPy/SciPy that is implemented in a Map/Reduce fashion -- or a better way to conceive this problem.
Say, for instance, that you have a very large array of floating point numbers, and you want to compute summary statistics like standard deviation, skewness and kurtosis. It might fit in memory on modern machines, and it would definitely fit on disk and you could work with it using numpy.memmap, which is is basically an array-like object that is backed by a file on disk. However, the process that generated this very large array was distributed across several nodes, so the data isn't all in one place. So basically we've mapped our function, and we need a reduce step that produces the needed result. Preferably, one that doesn't require copying all the data around between nodes.
Sure, you could copy all the data into one place and do the reduction, but it seems like it would be drastically more efficient to use the distributed infrastructure to solve this problem. Do some of the work at each node, and then do a final step to aggregate their work. The problem is that I don't want to re-implement all these functions myself if someone else has already done it, because I will probably do it badly compared to someone that really understands the math.
So, does a fork or subproject of numpy/scipy exist that can do functions on parts of arrays and then combine them? Does a map/reduce-aware implementation of numpy exist?
Brian Hicks suggested using mrjob or Sparkling Pandas. The Pandas-on-Spark project might do it, but I think I'm going to sink some time into learning Spark's MLlib, which seems like it would do what I need.