Having spent a good chunk of the last two weeks getting a prototype analytics system running, I thought I would write up my findings. I was pleased to find that installing all the pieces was smooth via Homebrew, but getting them all to play together was less smooth.
The Playing Field
- Hadoop is a framework for distributed computing. The name is also used interchangeably to refer to an entire ecosystem of technologies.
- HDFS is the underlying distributed file system that makes Hadoop possible.
- HBase is a non-relational data store built on top of Hadoop. It provides familiar concepts like rows, columns and keys, but the similarity to relational databases stops there.
- Zookeeper provides configuration management for the machines in a Hadoop cluster.
- Pig is a high-level language for writing map/reduce queries.
- Hive is a SQL-like high-level language for map/reduce queries.
- Thrift is a cross-language RPC framework; HBase ships with a Thrift server that exposes its API to non-Java clients.
- HappyBase is a Python client that talks to HBase over Thrift.
Getting Started with HBase
First, start the server with `start-hbase.sh`. Then you can enter an interactive shell with `hbase shell` and create some test tables. HBase schema design is a whole separate discussion. For now, we're going to create a table with a column family of "stats". Our row keys are going to be in the format `md5(customer id)[:5] + customer id + date`.
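As a sketch, that row key construction might look like this in Python (the zero-padded customer id and `YYYYMMDD` date format are my assumptions):

```python
import hashlib

def make_row_key(customer_id, day):
    """Build a row key as md5(customer_id)[:5] + customer_id + date."""
    prefix = hashlib.md5(customer_id.encode()).hexdigest()[:5]
    return prefix + customer_id + day

# e.g. a zero-padded customer id and a YYYYMMDD date
key = make_row_key("0042", "20130115")
```

The short hash prefix spreads sequential customer ids across regions, while still keeping all the dates for one customer contiguous.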
Take a look at the full list of HBase shell commands.
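For example, creating the table with its "stats" column family and inserting a test row looks like this in the shell (the row key here is just an illustration):

```
create 'test_table', 'stats'
put 'test_table', 'abcde004220130115', 'stats:count', '10'
scan 'test_table'
```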
Getting Started with Pig
Getting Pig to connect to HBase is a little tricky. It involves some monkeying around with `CLASSPATH` variables. You can run these `export` commands in bash to set everything up properly. Note that this is for a very specific combination of versions, but you can substitute newer versions easily.
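As a sketch, the exports look something like the following; the paths and version numbers are placeholders for whatever Homebrew actually installed on your machine:

```shell
# Placeholder paths/versions -- check your own /usr/local/Cellar
export HBASE_HOME=/usr/local/Cellar/hbase/0.94.2/libexec
export PIG_CLASSPATH="$HBASE_HOME/hbase-0.94.2.jar:$HBASE_HOME/lib/zookeeper-3.4.3.jar:$HBASE_HOME/conf"
```

The `pig` launcher script appends `PIG_CLASSPATH` to its own classpath, which is how the HBase and Zookeeper jars become visible to your scripts.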
You can enter the Pig shell (aka grunt) by simply running `pig`. If you run into problems, you may find it useful to examine Pig's classpath with `pig -secretDebugCmd`, and to run Pig in verbose mode with `pig -debug DEBUG -verbose`.
Note: the last `cat` command is Pig's version of cat. Outside Pig, the data is actually stored in a directory called `/tmp/test_table.csv/`, in separate part files. But they are just regular text files.
For this example, let's create a larger data set. Here is a simple Python script to create a CSV file in the correct format.
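A minimal version of such a script might look like this; the customer ids, date range and random counts below are all made up:

```python
import csv
import hashlib
import random
from datetime import date, timedelta

# Writes rows of: row key, date, count
# Row key format: md5(customer_id)[:5] + customer_id + YYYYMMDD
with open("/tmp/import.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for customer_id in ("0001", "0002", "0003"):
        prefix = hashlib.md5(customer_id.encode()).hexdigest()[:5]
        for offset in range(30):
            day = (date(2013, 1, 1) + timedelta(days=offset)).strftime("%Y%m%d")
            writer.writerow([prefix + customer_id + day, day, random.randint(1, 100)])
```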
You can grab a pre-rendered version and save it locally with `curl -L http://chase-seibert.github.com/blog/files/import.csv > /tmp/import.csv`. Then you can import it in Pig like so. One confusing note here is that you don't include the ID field in the store command; that's automatic.
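A sketch of the load/store pair (the schema names here are my assumptions, based on the CSV layout above):

```pig
A = LOAD '/tmp/import.csv' USING PigStorage(',') AS (id: chararray, date: chararray, count: int);
-- the first field (id) becomes the row key automatically; only the
-- remaining fields are mapped to the listed columns
STORE A INTO 'hbase://test_table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('stats:date stats:count');
```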
If you switch back to the HBase shell, you should be able to `scan` and see those records.
Aggregating with Map/Reduce
There is a lot you can do with the built-in Pig Latin language. Here is one example, where we are going to get an average count by day for all customers. Because the day is only represented as an encoded portion of the row key, I will break it out as part of the aggregation.
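A sketch of that aggregation, assuming a 5-character hash prefix, 4-character customer ids and `YYYYMMDD` dates (so the date starts at offset 9 of the row key):

```pig
-- -loadKey makes the row key available as the first field
A = LOAD 'hbase://test_table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('stats:count', '-loadKey') AS (id: chararray, count: int);
B = FOREACH A GENERATE SUBSTRING(id, 9, 17) AS date, count;
C = GROUP B BY date;
D = FOREACH C GENERATE group AS date, AVG(B.count) AS avg_count;
DUMP D;
```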
You could save this dataset back to HBase using `store D into 'hbase://test_table2' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('stats:date stats:count');`. Remember that you need to create the table first.
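Back in the HBase shell, creating that target table is just:

```
create 'test_table2', 'stats'
```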
If you have not set up your `CLASSPATH` properly (i.e., the `export` statements), you may get any one of the following errors:

```
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. org/apache/hadoop/hbase/filter/WritableByteArrayComparable
```