We have been doing a lot of batch processing with Hadoop MapReduce lately, and we quickly realized how painful it can be to write MapReduce jobs by hand. Some parts of our workflow require up to TEN MapReduce jobs to execute in sequence, requiring a lot of hand-coordination of intermediate data and execution order. Additionally, anyone who has done really complex MapReduce workflows knows how hard it is to keep “thinking” in MapReduce. Luckily, we discovered a great new open source product called Cascading which has alleviated a ton of our pain. Cascading is the brainchild and work of Chris Wensel, and he’s done a great job developing an API which solves many of our problems. Cascading abstracts away MapReduce into a more natural logical model and provides a workflow management layer to handle things like intermediate data and data staleness.
The post is a very good walkthrough of how they take a tuple-based problem and use Cascading to simplify the management of pipes, particularly forking pipes and merging them back together.
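To make the fork-and-merge idea concrete, here is a minimal plain-Python sketch. It does not use Cascading's actual API (which is Java). The function names and the sample `(user, page)` tuples are hypothetical and exist only to show the shape of the dataflow: one tuple source forked into two branches, each aggregated separately, then merged back together on a shared key.

```python
from collections import defaultdict


def group_count(tuples, key_index):
    """Count tuples per key, roughly a group-by followed by a count."""
    counts = defaultdict(int)
    for t in tuples:
        counts[t[key_index]] += 1
    return dict(counts)


def join_on_key(left, right):
    """Merge two keyed branches, keeping only keys present in both."""
    return {k: (left[k], right[k]) for k in left if k in right}


def run_flow(events):
    """Fork one tuple stream into two branches, then merge them by user."""
    # Branch 1: total visits per user.
    visits_per_user = group_count(events, 0)
    # Branch 2: distinct pages per user (deduplicate tuples first).
    distinct_pages_per_user = group_count(set(events), 0)
    # Merge: join the two branches on the user key.
    return join_on_key(visits_per_user, distinct_pages_per_user)


# Hypothetical sample data: (user, page) tuples.
events = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/about"),
    ("alice", "/home"),
]
merged = run_flow(events)
```

Written by hand as MapReduce, each branch and the final join would typically be its own job with intermediate data to shuttle between them, which is the coordination burden the quoted post says Cascading takes over.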
You may also want to look at Yahoo Research’s Pig, another abstraction layer over MapReduce. Such layers seem to be all the rage now, since we need an easy way to query, join, and generally work with these large datasets. Pig seems to rely heavily on SQL-like syntax, an approach I’m not as fond of as the one Cascading takes.