Lately I have picked up some long-overdue interest in a highly specialized tool in the big-data industry called Pig. It sits on top of a slightly better-known piece of software, Hadoop, which implements the MapReduce paradigm of distributed computing. In my rudimentary understanding of these tools, both are written in Java, whose virtual machine and close-to-C++ performance allow efficient cross-platform interaction. The reason Pig exists is that, despite the nice MapReduce architecture, implementing something as simple as a join, a sort, or a group still takes a series of carefully designed steps to truly leverage the power of distributed computing. Pig instead encapsulates these operations in simple SQL-style commands, like ORDER, JOIN, GROUP, COGROUP, etc., hiding all the details from the end user.
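To give a flavor of that encapsulation, here is what a join, a group, and a sort look like in Pig Latin; each line would otherwise be a hand-rolled chain of map and reduce jobs. The file names, relation names, and schemas below are made up for illustration:

```
-- hypothetical input files and schemas, for illustration only
users  = LOAD 'users.tsv'  AS (id:int, name:chararray);
orders = LOAD 'orders.tsv' AS (uid:int, amount:double);

joined  = JOIN users BY id, orders BY uid;  -- a distributed join in one line
grouped = GROUP orders BY uid;              -- the shuffle stays hidden from you
sorted  = ORDER orders BY amount DESC;      -- distributed sort, also one line

STORE sorted INTO 'sorted_orders';
```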
While this all sounds nice and dandy, the fact that Pig is advertised as a programming language makes it easy to fall for the illusion that it can be treated like a scripting language such as Python. As with any wrapper language (another example being Ruby on Rails), one has to respect convention much more than configurability. There are many things Pig does not support natively but that can be achieved through so-called UDFs (user-defined functions), which can be written in a variety of languages, most prominently Java, Python, and JavaScript. After a while one discovers that the Java UDFs feel the most natural to use, since Pig's data structures have Java APIs that integrate easily.
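For what it is worth, hooking a Java UDF into a script is short once the jar is built. The jar, package, and function names below are placeholders, not anything from a real project:

```
-- placeholder names, for illustration only
REGISTER myudfs.jar;
raw    = LOAD 'input.tsv' AS (line:chararray);
parsed = FOREACH raw GENERATE myudfs.FunctionXYZ(line);
```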
Here I would like to rant about a few things concerning this indispensable tool, and how I partially overcame them, in painful but necessary ways. People have complained that Pig is nothing without the UDFs. I think that's somewhat exaggerated; why would they not write their own Hadoop sort or join functions then? It is somewhat annoying that writing a Java UDF can take upward of a few hours for something that could be achieved within minutes in Python. But the fact that Java classes can be bundled into a jar at least keeps the effort well organized. Another unproductive thing about Pig is that you have to specify the schema very dogmatically, unlike in Python, where variable typing is virtually non-existent. And it's not just about typing. Because of Pig's (intentionally) close resemblance to SQL and other relational database languages, the map structure is completely marginalized. For instance, according to this thread, you cannot easily create and save a map structure. The only available tool is the TOMAP function, which takes the odd approach of using the parity of its arguments to identify key-value pairs. I struggled on this point for one self-defeating midnight, trying to parse a JSON dictionary in my UDF without realizing that I was dealing with a tuple instead.
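To illustrate the parity trick: TOMAP reads its arguments positionally, odd positions becoming keys and even positions becoming values. The relation and field names here are hypothetical:

```
-- odd arguments become keys, even arguments become values
people = LOAD 'people.tsv' AS (name:chararray, age:int);
mapped = FOREACH people GENERATE TOMAP('name', name, 'age', age) AS info;
-- a value is then retrieved by key: info#'name'
```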
The UDF tutorial on the Apache website is admittedly very well written. But without going through it carefully (and many of us time-deprived netizens don't), the interface can look bewildering. Why in the world is the input argument to the main exec function always a tuple? Well, if you think of how you call the UDF in the Pig script, myudfs.functionxyz(myarg1, myarg2), the arguments are treated collectively as a tuple; I guess there is no passing by keyword, to draw the Python analogy. Once you get used to this, it's no surprise that tuples crop up again and again as you unravel the data structures inside the UDF. Every time a bag (a DataBag in Java) is looped over, the individual elements are tuples, whose first field can be cast to the appropriate object under the JSON structure. An additional annoyance is that you need to write your own output-schema method, unless the output is as simple as a scalar.
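Putting the pieces together, here is a minimal sketch of what such a Java UDF might look like, assuming Pig's EvalFunc API is on the classpath; the class name and the shape of the input (a single bag of tuples whose first field is a chararray) are my own invention, not from the tutorial:

```java
// Minimal sketch of a Java UDF; assumes pig.jar on the classpath.
// The class name and input shape are illustrative assumptions.
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class FunctionXYZ extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        // All call-site arguments arrive bundled in this one tuple.
        DataBag bag = (DataBag) input.get(0);
        StringBuilder sb = new StringBuilder();
        for (Tuple t : bag) {
            // Each bag element is itself a tuple; cast its first field.
            sb.append((String) t.get(0));
        }
        System.out.println("debug: " + sb);  // ends up in the task's stdout log
        return sb.toString();
    }

    @Override
    public Schema outputSchema(Schema input) {
        // Without this, Pig would not know the UDF returns a chararray.
        return new Schema(new Schema.FieldSchema(null, DataType.CHARARRAY));
    }
}
```

The outputSchema override is the output-schema method mentioned above; declaring the return type there is what lets downstream operators type-check the UDF's result.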
But overall, it is time well invested to write a few sophisticated UDFs of your own early on in learning Pig, since you will likely come back to reuse them. You can even put print statements in the UDF for easy debugging; they will simply be logged to stdout. Downloading the Pig source code and browsing it in a capable IDE such as IntelliJ is also a good idea, though I haven't figured out why IntelliJ doesn't flag import errors automatically, presumably because of an unsuccessful build.
One final note for those frustrated Pig debuggers: it is always a good idea to test the script locally with as minimal an input file as humanly possible. But even a moderately large input is okay, since the overhead of dealing with a remote server through Hadoop is significantly reduced once you are in local mode (pig -x local). It may take some effort to set up local mode, as you need to configure things like the name node properly, but these steps should be well documented.
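For reference, running a script in local mode is just a flag on the command line; the script name below is a placeholder:

```
# run against the local filesystem instead of the cluster
pig -x local -f myscript.pig
```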