HBase/Pig/Python Quickstart on OSX
Having spent a good chunk of the last two weeks getting a prototype analytics system running, I thought I would write up my findings. I was pleased to find that installing all the pieces was smooth via Homebrew, but getting them all to play together was less smooth.
The Playing Field
- Hadoop is a framework for distributed computing. The name is also used interchangeably to refer to an entire ecosystem of technologies.
- HDFS is the underlying distributed file system that makes Hadoop possible.
- HBase is a non-relational data store built on top of Hadoop. It provides concepts like rows, columns and keys; the similarity to relational databases stops there.
- Zookeeper provides configuration management for the machines in a Hadoop cluster.
- Pig is a high-level language for map/reduce queries.
- Hive is a SQL-like high-level language for map/reduce queries.
- Thrift is a cross-language RPC framework; HBase ships a Thrift server that exposes its API.
- HappyBase is a Python client for HBase that talks to that Thrift server (a minimal example is sketched just below).
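HappyBase itself only needs a couple of lines to read data back out. Here is a minimal sketch, assuming the Thrift server is running (`hbase thrift start`) and that the test_table created later in this post exists:

```python
import happybase

# Connects to the HBase Thrift server on localhost (default port 9090)
connection = happybase.Connection('localhost')
table = connection.table('test_table')

for key, data in table.scan(limit=5):
    print(key, data)  # data is a dict of {column: value}
```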
Installing
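As mentioned above, Homebrew handles the individual installs. Formula names and versions drift over time, but something like the following should be all it takes (HBase runs its own ZooKeeper in standalone mode, so you don't need a separate install for that):

```bash
# Formula names may have changed; check `brew search` if these fail.
brew install hadoop
brew install hbase
brew install pig
```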
Getting Started with HBase
First, you start the server with `start-hbase.sh`, then you can enter an interactive shell with `hbase shell` and create some test tables. HBase schema design is a whole separate discussion. For now, we're going to create a table with a column family of “stats”. Our row keys are going to be in the format `md5(customer id)[:5] + customer id + date`.
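A minimal session might look like this; the row key here is a made-up example for customer id 1 on 2013-01-01, built with the format above:

```
$ start-hbase.sh
$ hbase shell
hbase> create 'test_table', 'stats'
hbase> put 'test_table', 'c4ca4120130101', 'stats:count', '2'
hbase> scan 'test_table'
```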
Take a look at the full list of HBase shell commands.
Getting Started with Pig
Getting Pig to connect to HBase is a little tricky. It involves some monkeying around with `CLASSPATH` variables. You can run these `export` commands in bash to set everything up properly. Note, this is for a very specific combination of versions, but you can substitute newer versions easily.
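Here is a sketch of those exports, assuming Homebrew's standard Cellar layout. The paths, version numbers and jar names are examples only; substitute whatever `brew list --versions hadoop hbase pig` reports on your machine:

```bash
# Example versions only; adjust to match what Homebrew actually installed.
export HADOOP_HOME=/usr/local/Cellar/hadoop/1.1.1/libexec
export HBASE_HOME=/usr/local/Cellar/hbase/0.94.2/libexec
export PIG_HOME=/usr/local/Cellar/pig/0.10.0/libexec
export HBASE_CONF_DIR=$HBASE_HOME/conf

# Pig needs the HBase and ZooKeeper jars, plus HBase's conf dir, on its classpath.
export PIG_CLASSPATH="$HBASE_HOME/hbase-0.94.2.jar:$HBASE_HOME/lib/zookeeper-3.4.3.jar:$HBASE_CONF_DIR"
```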
Exporting Data
You can enter the pig shell (aka grunt) by simply running `pig`. If you run into problems, you may find it useful to examine Pig's classpath with `pig -secretDebugCmd`, or to run Pig in verbose mode with `pig -debug DEBUG -verbose`.
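Here is a sketch of the export, run from the grunt shell. It assumes the test_table and stats:count column created earlier; adjust the column list to match whatever you actually put in the table:

```
A = LOAD 'hbase://test_table'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('stats:count', '-loadKey true')
    AS (id:chararray, count:chararray);
STORE A INTO '/tmp/test_table.csv' USING PigStorage(',');
cat /tmp/test_table.csv;
```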
Note: the last `cat` command is Pig's version of cat. Outside Pig, the data is actually stored in a directory called `/tmp/test_table.csv/`, in separate part files. But they are just regular text files.
Importing Data
For this example, let's create a larger data set. Here is a simple Python script to create a CSV file in the correct format.
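Something along these lines works; the customer and date ranges and the random counts are arbitrary, but the key follows the format described above:

```python
import random
from datetime import date, timedelta
from hashlib import md5

# Write one row per customer per day: row key, date, count
with open('/tmp/import.csv', 'w') as f:
    for customer_id in range(1, 101):
        for day in range(31):
            day_str = (date(2013, 1, 1) + timedelta(days=day)).strftime('%Y%m%d')
            cid = str(customer_id)
            key = md5(cid.encode('utf-8')).hexdigest()[:5] + cid + day_str
            f.write('%s,%s,%s\n' % (key, day_str, random.randint(1, 100)))
```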
You can grab a pre-rendered version and save it locally with `curl -L http://chase-seibert.github.com/blog/files/import.csv > /tmp/import.csv`. Then, you can import it in pig like so. One confusing note here is that you don't include the ID field in the store command; that's automatic.
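A sketch of that import from grunt; the first field of the relation (the id) becomes the HBase row key, which is why it does not appear in the HBaseStorage column list:

```
-- Assumes /tmp/import.csv is readable by Pig (e.g. local mode); copy it into
-- HDFS first if you are running in mapreduce mode.
A = LOAD '/tmp/import.csv' USING PigStorage(',')
    AS (id:chararray, date:chararray, count:int);
STORE A INTO 'hbase://test_table'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('stats:date stats:count');
```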
If you switch back to hbase shell, you should be able to scan and see those records.
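For example:

```
hbase> count 'test_table'
hbase> scan 'test_table', {LIMIT => 3}
```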
Aggregating with Map/Reduce
There is a lot you can do with the built-in Pig Latin language. Here is one example, where we are going to get an average count by day for all customers. Because the day is only represented as an encoded portion of the row key, I will break that up as part of the aggregation.
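Here is a sketch of that aggregation. The SUBSTRING arithmetic assumes the key format above, where the date is always the last 8 characters of the row key:

```
A = LOAD 'hbase://test_table'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('stats:count', '-loadKey true')
    AS (id:chararray, count:int);
-- Pull the trailing YYYYMMDD date back out of the row key
B = FOREACH A GENERATE SUBSTRING(id, (int)(SIZE(id) - 8), (int)SIZE(id)) AS date, count;
C = GROUP B BY date;
D = FOREACH C GENERATE group AS date_key, group AS date, AVG(B.count) AS count;
DUMP D;
```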
You could save this dataset back to HBase using `store D into 'hbase://test_table2' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('stats:date stats:count');`. Remember that you need to create the table first (e.g. `create 'test_table2', 'stats'` in the HBase shell).
Troubleshooting
If you have not set up your `CLASSPATH` properly (i.e., the `export` statements above), you may get any one of the following errors:
```
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. org/apache/hadoop/hbase/filter/WritableByteArrayComparable
```