Using Python with Hadoop

First, some review

Hadoop is a very powerful MapReduce framework based on white papers released by Google documenting how they successfully tackled the problem of processing large amounts of data (on the scale of petabytes in many cases) using their proprietary distributed filesystem, GFS. Hadoop is the open source counterpart of that system ((Technically, Hadoop is an umbrella name for the whole project, whereas HDFS is the component that fills the role of GFS.)). It is heavily supported by companies like Yahoo, Google, Amazon, Adobe, Facebook, Hulu, IBM, and RackSpace, and has a growing number of related projects hosted by the Apache Foundation.

Why we need to learn “yet another language”

Yet, even with all of the buzz and hoopla, many people find it difficult to get set up and start writing applications capable of leveraging the awesome power of a Hadoop cluster; the learning curve of Java and the Hadoop APIs is very steep.

Fortunately, one of the features available in Hadoop is Hadoop Streaming, which allows programmers to specify any program (or script) that reads records on stdin and writes them to stdout as a mapper and/or reducer. Consequently, one of the most popular scripting languages to use alongside Hadoop is Python ((If you aren’t familiar with Python and want to learn, here is an excellent site for diving into the language and here is an excellent video series walking you through the basics.)).
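To make that contract concrete, here is a minimal pair of streaming word-count scripts. They assume only what Hadoop Streaming guarantees: records arrive line by line on stdin and leave on stdout as tab-separated key/value pairs. The file names and HDFS paths in the sample invocation are placeholders, and the location of the streaming jar varies between Hadoop versions.

    #!/usr/bin/env python
    # mapper.py: emit a tab-separated ("word", 1) pair for every word.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py: Hadoop sorts mapper output by key before handing it to
    # the reducer, so identical words arrive on consecutive lines.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

You then hand both scripts to the streaming jar (its path below is typical of older Hadoop layouts, so adjust for your install):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -input /user/me/books -output /user/me/counts \
        -mapper mapper.py -reducer reducer.py \
        -file mapper.py -file reducer.py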

One of the reasons Python is well suited to this type of work is that it can be written in a functional style, provided you are careful about how you write it. Keeping the map and reduce logic in pure, side-effect-free functions makes chopping well-written Python map/reduce scripts into distributable units much easier, as the sketch below shows.
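In the mapper below, all of the logic lives in one pure function (the name tokenize is mine, not anything Hadoop requires). Because the function does no I/O and touches no global state, you can unit test it locally with plain assert statements, and Hadoop can safely run as many parallel copies of the script as it likes.

    # A pure tokenizing function: its output depends only on its input,
    # so any number of copies can run in parallel without coordination.
    def tokenize(line):
        return [(word, 1) for word in line.strip().split()]

    if __name__ == "__main__":
        import sys
        for line in sys.stdin:
            for pair in tokenize(line):
                print("%s\t%d" % pair)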

There’s a framework for that

While it is possible to write plain Python scripts, the folks at last.fm have helped create an excellent Python framework for Hadoop called Dumbo, which streamlines the process of writing MapReduce jobs in Python. Dumbo seems to be a fairly simple framework with plenty of examples you can adapt to your particular needs.
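For flavor, here is roughly what word count looks like in Dumbo, modeled on the wordcount example from Dumbo’s own tutorial; treat it as a sketch rather than the definitive API.

    # wordcount.py: Dumbo feeds the mapper (key, value) pairs and the
    # reducer a key plus an iterator over that key's values.
    def mapper(key, value):
        for word in value.split():
            yield word, 1

    def reducer(key, values):
        yield key, sum(values)

    if __name__ == "__main__":
        import dumbo
        dumbo.run(mapper, reducer)

You would then launch it with something like dumbo start wordcount.py -hadoop /path/to/hadoop -input books -output counts, and Dumbo wires the script through Hadoop Streaming for you.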

There’s a framework for that too

Hadoop has many sub-projects, and one of the more popular ones is HBase, which allows a more structured, database-like approach to storing and retrieving data. An excellent Python framework for quickly parsing data into HBase tables is Zohmg. It allows programmers to define tables in a YAML configuration file and to write the corresponding mappers as simple Python scripts.
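To give a feel for the shape of a Zohmg project, here is an illustrative sketch of a table definition and its mapper. Be warned that the exact YAML keys, the mapper signature, and the parse_logline() helper are all assumptions of mine based on the project’s description, not verbatim Zohmg API; consult the project’s own examples for the real schema.

    # dataset.yaml (sketch; key names are assumptions, not verbatim Zohmg)
    dataset: pageviews
    dimensions:
      - domain
      - path
    units:
      - hits

    # mapper sketch: yield a dict of dimension values and a dict of unit
    # increments for each log line; parse_logline() is hypothetical.
    def map(key, value):
        domain, path = parse_logline(value)
        yield {'domain': domain, 'path': path}, {'hits': 1}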

Bringing it back home

One of the biggest drawbacks to using Hadoop Streaming is that it is inherently slower than writing MapReduce jobs in Java: the target script or application has to be launched as a separate process, and each record has to be serialized, piped to that application/script, processed, and then piped back (if there are any reducers). All of this context switching and inter-process communication adds overhead that wouldn’t exist if the MapReduce job were kept in the JVM where Hadoop runs.

Jython is a viable answer for compiling existing Python code into Java bytecode, avoiding much of that performance penalty. This utility can come in handy if you decide that your “quick and dirty” Python script needs to be moved into a production environment.
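For reference, here is a sketch of what a mapper can look like in Jython, modeled on the Jython word-count example that shipped in Hadoop’s source tree; it implements Hadoop’s old “mapred” Java API directly, and historically had to be compiled to bytecode with jythonc and bundled into a jar before Hadoop could run it.

    # WordCountMap.py: a Jython mapper implementing Hadoop's Java Mapper
    # interface directly, so the job never leaves the JVM.
    from org.apache.hadoop.io import IntWritable, Text
    from org.apache.hadoop.mapred import Mapper, MapReduceBase

    class WordCountMap(Mapper, MapReduceBase):
        one = IntWritable(1)

        def map(self, key, value, output, reporter):
            # value arrives as a Hadoop Text object; convert it to a
            # Python string before splitting into words.
            for word in value.toString().split():
                output.collect(Text(word), self.one)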