Sqoop/HBase Quickstart on Linux
Sqoop is a tool for bulk copying data between a relational database like MySQL and HDFS or another Hadoop based data store like Hive or HBase. It can either export a table or set of tables, or you can specify a custom SQL query to pull the data out. It’s the best solution out there for moving massive data sets; it can even fan out sqoop workers to a configurable number of Hadoop data nodes, which will all run partitioned versions of the main SQL query in parallel.
Install
To get started, you will need to install Sqoop. The easiest method on Linux is to use the Cloudera repository. You will also need the JDBC MySQL driver, and the JDK (Sqoop compiles a JAR on the fly and sends it out as a MapReduce job).
Run it!
Say you have a MySQL table user
and a HBase table with the same name. If you want to do a straight copy of the data and use the id
column as the HBase rowkey and store all the columns in a HBase column family named data
, all you need to do is:
At least for HBase, you are almost always want to compose some more sophisticated rowkey, to avoid region hotspotting. If you can express that rowkey as a SQL statement, you’re good to go. Instead of --table
, you specify a --query
, and change the column referenced in --hbase-row-key
.
Say we want the rowkey to be the first five characters of the md5 hash of the company_id
field in the user
table, plus the date_added
field formatted as an eight character string, plus the id
field. Example: a8f5c-20130101-47
.
Note: the $CONDITIONS
is a placeholder for the dynamic partitioning of data across server. You may also need to specify a --split-by
column.
Tuning
For a large number of rows, you may find that Sqoop is using a lot of memory to copy the data over. You may even run into a java.lang.OutOfMemoryError: Java heap space
or a java.lang.OutOfMemoryError: GC overhead limit exceeded
. If you do, it’s likely because the MySQL database driver is fetching all of the rows of the table, and keeping them in memory. You can tell it to chunk up the query into pages and use a cursor by changing the connect string to: mysql://$MYSQL_SERVER/$DATABASE?dontTrackOpenResources=true\&defaultFetchSize=1000\&useCursorFetch=true
. See the documentation.
Trouble-shooting
If you get an error saying that sqoop cannot load the MySQL driver, you may need to do a manual sudo cp /usr/share/java/mysql.jar /usr/lib/sqoop/lib
to copy it to the right place.
If you get the error 0000-00-00 00:00:00' can not be represented as java.sql.Timestamp
, you should modify your connect string to add the zeroDateTimeBehavior
flag, ie mysql://$MYSQL_SERVER/$DATABASE?zeroDateTimeBehavior=convertToNull