Having spent a good chunk of the last two weeks getting a prototype analytics system running, I thought I would write up my findings. I was pleased to find that installing all the pieces was smooth via Homebrew, but getting them all to play together was less smooth.
The Playing Field
Hadoop is a framework for distributed computing. The name is also used interchangeably to refer to an entire ecosystem of technologies.
HDFS is the underlying distributed file system that makes Hadoop possible.
HBase is a non-relational data store built on top of Hadoop. It provides concepts like rows, columns and keys. The similarity to relational databases stops there.
Zookeeper provides configuration management for Hadoop cluster machines.
First, you start the server with start-hbase.sh. Then you can enter an interactive shell with hbase shell and create some test tables. HBase schema design is a whole separate discussion; for now, we're going to create a table with a column family of "stats". Our row keys are going to be in the format md5(customer id)[:5] + customer id + date.
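For example, in the HBase shell (the table name here is an assumption):

```
$ hbase shell
hbase> create 'test_table', 'stats'
hbase> list
```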
Getting Pig to connect to HBase is a little tricky. It involves some monkeying around with CLASSPATH variables. You can run these export commands in bash to set everything up properly. Note, this is for a very specific combination of versions, but you can substitute newer versions easily.
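Something along these lines works; the install locations and version numbers below are placeholders for a Homebrew setup — substitute whatever you actually have installed:

```shell
# Placeholder paths/versions for a Homebrew install -- adjust to yours.
export HBASE_HOME="/usr/local/Cellar/hbase/0.90.4/libexec"
export PIG_HOME="/usr/local/Cellar/pig/0.9.1/libexec"
export HBASE_CONF_DIR="$HBASE_HOME/conf"
# Pig picks up extra jars from PIG_CLASSPATH; HBase and Zookeeper jars
# ship inside the HBase install.
export PIG_CLASSPATH="$HBASE_HOME/hbase-0.90.4.jar:$HBASE_HOME/lib/zookeeper-3.3.2.jar:$HBASE_CONF_DIR"
```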
You can enter the Pig shell (aka grunt) by simply running pig. If you run into problems, you may find it useful to examine Pig's classpath with pig -secretDebugCmd, and to run Pig in verbose mode with pig -debug DEBUG -verbose.
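As a quick smoke test, you can round-trip a few rows from HBase out to a text file. This is a hypothetical sketch — the table name and column spec are assumptions:

```pig
-- Sketch: load the stats:count column from HBase and write it back out
-- as text. Table/column names are assumptions.
raw = LOAD 'hbase://test_table'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('stats:count')
      AS (count:chararray);
STORE raw INTO '/tmp/test_table.csv' USING PigStorage(',');
cat /tmp/test_table.csv
```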
Note: the last cat command is Pig's version of cat. Outside Pig, the data is actually stored in a directory called /tmp/test_table.csv/, in separate part files. But they are just regular text files.
For this example, let's create a larger data set. Here is a simple python script to create a CSV file in the correct format.
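A minimal sketch of such a script, assuming a two-column layout of (row key, count) — the column layout and the random counts are assumptions:

```python
import csv
import hashlib
import random

def make_key(customer_id, date):
    """Build a row key: md5(customer id)[:5] + customer id + date."""
    prefix = hashlib.md5(str(customer_id).encode("utf-8")).hexdigest()[:5]
    return "%s%s%s" % (prefix, customer_id, date)

def write_csv(path, customer_ids, dates):
    """Write one (key, count) row per customer per day."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for customer_id in customer_ids:
            for date in dates:
                writer.writerow([make_key(customer_id, date),
                                 random.randint(1, 100)])

if __name__ == "__main__":
    write_csv("/tmp/import.csv", range(1, 101),
              ["20120101", "20120102", "20120103"])
```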
You can grab a pre-rendered version and save it locally with curl -L http://chase-seibert.github.com/blog/files/import.csv > /tmp/import.csv. Then, you can import it in pig like so. One confusing note here is that you don't include the ID field in the store command; that's automatic.
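The import might look like the following sketch — the CSV column layout and target table name are assumptions:

```pig
-- Hypothetical import; assumes columns are (id, count).
A = LOAD '/tmp/import.csv' USING PigStorage(',') AS (id:chararray, count:int);
STORE A INTO 'hbase://test_table'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('stats:count');
-- The first field (id) becomes the row key automatically, so it is not
-- listed in the column spec.
```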
If you switch back to hbase shell, you should be able to scan and see those records.
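For example (limiting the scan so a large table doesn't flood the terminal):

```
hbase> scan 'test_table', {LIMIT => 5}
```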
Aggregating with Map/Reduce
There is a lot you can do with the built-in Pig Latin language. Here is one example, where we are going to get an average count by day across all customers. Because the day is only represented as an encoded portion of my row key, I will break that up as part of the aggregation.
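One way to sketch that in Pig Latin — assuming the date is the trailing eight characters of the row key, which is an assumption about the key layout:

```pig
-- Load row keys plus counts; -loadKey makes the key the first field.
A = LOAD 'hbase://test_table'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('stats:count', '-loadKey')
    AS (id:chararray, count:int);
-- Extract the date (assumed: last 8 chars of the key, e.g. 20120101).
B = FOREACH A GENERATE
        SUBSTRING(id, (int)SIZE(id) - 8, (int)SIZE(id)) AS day, count;
C = GROUP B BY day;
-- Emit (row key, date, count); here the date doubles as the row key
-- of the aggregate table.
D = FOREACH C GENERATE group AS key, group AS date, AVG(B.count) AS count;
DUMP D;
```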
You could save this dataset back to HBase using store D into 'hbase://test_table2' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('stats:date stats:count');. Remember that you need to create the table first.
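Creating it is one line in the HBase shell:

```
hbase> create 'test_table2', 'stats'
```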
If you have not set up your CLASSPATH properly (i.e., run the export statements), you may get any one of the following errors: