# hbase-jruby

*hbase-jruby* is a Ruby-esque interface for accessing HBase from JRuby. With JRuby, you can of course just use the native Java APIs of HBase, but doing so requires lots of keystrokes even for the most basic operations and can lead to overly verbose code that will be frowned upon by Rubyists. Anyhow, JRuby is Ruby, not Java, right?

*hbase-jruby* provides the following:

- An easy, Ruby-esque interface for the fundamental HBase operations
- ActiveRecord-like method chaining for data retrieval
- Automatic Hadoop/HBase dependency resolution

## A quick example

```ruby
require 'hbase-jruby'

HBase.resolve_dependency! 'cdh4.1.2'

hbase = HBase.new
table = hbase.table(:test_table)

# PUT
table.put :rowkey1 => { 'cf1:a' => 100, 'cf2:b' => "Hello" }

# GET
row    = table.get(:rowkey1)
number = row.fixnum('cf1:a')
string = row.string('cf2:b')

# SCAN
table.range('rowkey1'..'rowkey9').
      filter('cf1:a' => 100..200,           # cf1:a between 100 and 200
             'cf1:b' => 'Hello',            # cf1:b = 'Hello'
             'cf2:c' => /world/i,           # cf2:c matches /world/i
             'cf2:d' => ['foo', /^BAR/i]).  # cf2:d = 'foo' OR matches /^BAR/i
      project('cf1:a', 'cf2').
      each do |row|
  puts row.fixnum('cf1:a')
end

# DELETE
table.delete(:rowkey9)
```

## Installation

    $ gem install hbase-jruby

## Setting up

### Resolving Hadoop/HBase dependency

To access HBase from JRuby, the Hadoop/HBase dependency must first be satisfied. This can be done either by setting up the CLASSPATH variable beforehand or by `require`ing the relevant JAR files after launching JRuby. However, that's a lot of work, so *hbase-jruby* provides the `HBase.resolve_dependency!` helper method, which resolves the Hadoop/HBase dependency automatically.

#### Preconfigured dependencies

Apache Maven is the de facto standard dependency management mechanism for Java projects. The current version of *hbase-jruby* ships with Maven dependency specifications for the following Hadoop/HBase distributions:

* [cdh4.1.2](https://github.com/junegunn/hbase-jruby/blob/master/lib/hbase-jruby/pom/cdh4.1.2.xml)
  * Recommended as of now
* [cdh3u5](https://github.com/junegunn/hbase-jruby/blob/master/lib/hbase-jruby/pom/cdh3u5.xml)
  * Does not support some features

```ruby
require 'hbase-jruby'

HBase.resolve_dependency! 'cdh4.1.2'
```

(If you're behind an HTTP proxy, set up your ~/.m2/settings.xml file as described on [this page](http://maven.apache.org/guides/mini/guide-proxies.html).)

#### Custom dependency

If you use other versions of Hadoop and HBase, you can point to your own Maven pom.xml file that declares the Hadoop/HBase dependencies.

```ruby
HBase.resolve_dependency! '/project/my-hbase/pom.xml'
```

#### Using `hbase classpath` command

If you have HBase installed on your system, it's possible to locate the JAR files of the local installation with the `hbase classpath` command. You can tell the `resolve_dependency!` method to do so by passing it the special `:hbase` parameter.

```ruby
HBase.resolve_dependency! :hbase
```

### Connecting to HBase

```ruby
# HBase on localhost
hbase = HBase.new

# HBase on remote host
hbase = HBase.new 'hbase.zookeeper.quorum' => 'remote-server.mydomain.net'

# Extra configuration
hbase = HBase.new 'hbase.zookeeper.quorum'       => 'remote-server.mydomain.net',
                  'hbase.client.retries.number'  => 3,
                  'hbase.client.scanner.caching' => 1000,
                  'hbase.rpc.timeout'            => 120000

# Close HBase connection
hbase.close
```
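Since `HBase.new` opens a live connection, it should always be paired with `hbase.close`. Here is a minimal sketch of a typical script's lifecycle, assuming the `cdh4.1.2` profile shown above; the begin/ensure wrapper is just ordinary Ruby, not part of the library's API:

```ruby
require 'hbase-jruby'

HBase.resolve_dependency! 'cdh4.1.2'

hbase = HBase.new 'hbase.zookeeper.quorum' => 'remote-server.mydomain.net'
begin
  hbase.table(:test_table) do |table|  # Table is closed at the end of the block
    table.put :rowkey1 => { 'cf1:a' => 100 }
  end
ensure
  hbase.close  # Release the connection even if an error was raised
end
```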
## Accessing data with HBase::Table instance

The `HBase#table` method creates an `HBase::Table` instance which represents a table on HBase.

```ruby
table = hbase.table(:test_table)
```

An `HBase::Table` instance must be closed after use.

```ruby
# Always close table instance after use
table.close

# If a block is given, the table is automatically closed at the end of the block
hbase.table(:test_table) do |table|
  # ...
end
```

## Basic table administration

### Creating tables

```ruby
table = hbase.table(:my_table)

# Drop table if exists
table.drop! if table.exists?

# Create table with two column families
table.create! :cf1 => {},
              :cf2 => { :compression => :snappy, :bloomfilter => :row }
```

### Table inspection

```ruby
puts table.inspect
```

## Basic operations

### PUT

```ruby
# Putting a single row
table.put 'rowkey1', 'cf1:col1' => "Hello", 'cf2:col2' => "World"

# Putting multiple rows
table.put 'rowkey1' => { 'cf1:col1' => "Hello",   'cf2:col2' => "World" },
          'rowkey2' => { 'cf1:col1' => "Howdy",   'cf2:col2' => "World" },
          'rowkey3' => { 'cf1:col1' => "So long", 'cf2:col2' => "World" }

# Putting values with timestamps
table.put 'rowkey1' => {
  'cf1:col1' => {
    1353143856665 => "Hello",
    1352978648642 => "Goodbye" },
  'cf2:col2' => "World"
}
```

### GET

HBase stores everything as a byte array, so when you fetch data from HBase, you need to explicitly specify the type of each value stored.

```ruby
row = table.get('rowkey1')

# Rowkey
rowk = row.rowkey

# Column value as a raw Java byte array
col0 = row.raw 'cf1:col0'

# Decode column values
col1 = row.string     'cf1:col1'
col2 = row.fixnum     'cf1:col2'
col3 = row.bigdecimal 'cf1:col3'
col4 = row.float      'cf1:col4'
col5 = row.boolean    'cf1:col5'
col6 = row.symbol     'cf1:col6'

# Decode multiple columns at once
row.string ['cf1:str1', 'cf1:str2']
  # [ "Hello", "World" ]
```

#### Batch GET

```ruby
# Pass an array of row keys as the parameter
rows = table.get(['rowkey1', 'rowkey2', 'rowkey3'])
```

#### Decode all versions with plural-form (-s) methods

```ruby
# Decode all versions as a Hash indexed by their timestamps
row.strings 'cf1:str'
  # {1353143856665=>"Hello", 1353143856662=>"Goodbye"}

# Decode all versions of multiple columns
row.strings ['cf1:str1', 'cf1:str2']
  # [
  #   {1353143856665=>"Hello", 1353143856662=>"Goodbye"},
  #   {1353143856665=>"World", 1353143856662=>"Cruel world"}
  # ]

# Plural-form methods are provided for the other data types as well
cols0 = row.raws        'cf1:col0'
cols1 = row.strings     'cf1:col1'
cols2 = row.fixnums     'cf1:col2'
cols3 = row.bigdecimals 'cf1:col3'
cols4 = row.floats      'cf1:col4'
cols5 = row.booleans    'cf1:col5'
cols6 = row.symbols     'cf1:col6'
```

#### Intra-row scan

An intra-row scan can be done with the `each` method, which yields `HBase::Cell` instances.

```ruby
# Intra-row scan (all versions)
row.each do |cell|
  family    = cell.family
  qualifier = cell.qualifier(:string)  # Column qualifier as String
  timestamp = cell.timestamp

  # Cell value as Java byte array
  bytes     = cell.bytes

  # Typed access
  # value_as_string = cell.string
  # value_as_fixnum = cell.fixnum
  # ...
end
```
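For instance, the cells yielded by an intra-row scan can be folded into a simple qualifier-to-value Hash. A sketch, assuming string qualifiers and string values, and assuming cells of a column arrive newest-first as the descending timestamps in the versions example above suggest:

```ruby
row = table.get('rowkey1')

# Build a { 'family:qualifier' => value } Hash from the cells;
# ||= keeps only the first (i.e. latest) version seen per column
values = {}
row.each do |cell|
  name = "#{cell.family}:#{cell.qualifier(:string)}"
  values[name] ||= cell.string
end
```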
#### `to_hash`

```ruby
# Returns the Hash representation of the record with the specified schema
schema = {
  'cf1:col1' => :string,
  'cf1:col2' => :fixnum,
  'cf1:col3' => :bigdecimal,
  'cf1:col4' => :float,
  'cf1:col5' => :boolean,
  'cf1:col6' => :symbol }

table.get('rowkey1').to_hash(schema)

# Returns all versions for each column indexed by their timestamps
table.get('rowkey1').to_hash_with_versions(schema)
```

### DELETE

```ruby
# Deletes a row
table.delete('rowkey1')

# Deletes all columns in the specified column family
table.delete('rowkey1', 'cf1')

# Deletes a column
table.delete('rowkey1', 'cf1:col1')

# Deletes a column with an empty qualifier.
# (Not the same as deleting all columns in the family. Note the trailing colon.)
table.delete('rowkey1', 'cf1:')

# Deletes a version of a column
table.delete('rowkey1', 'cf1:col1', 1352978648642)

# Deletes multiple versions of a column
table.delete('rowkey1', 'cf1:col1', 1352978648642, 1352978649642)

# Batch delete
table.delete(['rowkey1'], ['rowkey2'], ['rowkey3', 'cf1:col1'])

# Truncate table
table.truncate!
```

### Atomic increment of column values

```ruby
# Atomically increase cf1:counter by one
table.increment('rowkey1', 'cf1:counter', 1)

# Atomically increase two columns by one and two respectively
table.increment('rowkey1', 'cf1:counter' => 1, 'cf1:counter2' => 2)
```

### SCAN

`HBase::Table` itself is an enumerable object.

```ruby
# Full scan
table.each do |row|
  # ...
end
```

## Scoped access

SCAN and GET operations are actually implemented in the enumerable `HBase::Scoped` class, whose instances are created by `HBase::Table#each` calls.

```ruby
scoped = table.each
scoped.get(1)
scoped.to_a
```

An `HBase::Scoped` object provides a set of methods for controlling data retrieval, such as `range`, `filter`, `project`, `versions`, `caching`, et cetera. However, it doesn't respond to data manipulation methods (`put`, `delete` and `increment`) or to table administration methods.

An `HBase::Table` object also responds to the data retrieval methods described above, but those calls are simply forwarded to an implicitly created `HBase::Scoped` object. For example, `table.range(start, end)` is just a shorthand notation for `table.each.range(start, end)`.

### Chaining methods

Methods of `HBase::Scoped` can be chained as follows.

```ruby
# Chaining methods
import org.apache.hadoop.hbase.filter.RandomRowFilter

table.range('A'..'Z').                   # Row key range,
      project('cf1:a').                  # Select cf1:a column
      project('cf2').                    # Select cf2 family as well
      filter('cf1:a' => 'Hello').        # Filter by cf1:a value
      filter('cf2:d' => 100..200).       # Range filter on cf2:d
      filter('cf2:e' => [10, 20..30]).   # Set-inclusion condition on cf2:e
      filter(RandomRowFilter.new(0.5)).  # Any Java HBase filter
      while('cf2:f' => { ne: 'OPEN' }).  # Early termination of scan
      limit(10).                         # Limits the size of the result set
      versions(2).                       # Only fetches 2 versions for each value
      batch(100).                        # Batch size for scan set to 100
      caching(100).                      # Caching 100 rows
      to_a                               # To Array
```

### *range*

The `HBase::Scoped#range` method is used to filter rows based on their row keys.

```ruby
# 100 ~ 900 (inclusive end)
table.range(100..900)

# 100 ~ 900 (exclusive end)
table.range(100...900)

# 100 ~ 900 (exclusive end)
table.range(100, 900)

# 100 ~
table.range(100)

# ~ 900 (exclusive end)
table.range(nil, 900)
```
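Chained calls like those above can also be built up step by step and enumerated at the end. A small sketch reusing only the methods already shown, assuming (as the chained form implies) that each call yields a scope:

```ruby
# Build a scope incrementally; the scan runs when the scope is enumerated
scope = table.each
scope = scope.range(100...900)  # Row keys from 100 (inclusive) to 900 (exclusive)
scope = scope.project('cf1:a')  # Fetch only cf1:a
scope = scope.limit(10)         # At most 10 rows

scope.each do |row|
  puts row.fixnum('cf1:a')
end
```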
Optionally, a prefix filter can be applied as follows.

```ruby
# Prefix filter
# Row keys with "APPLE" prefix
#   Start key is automatically set to "APPLE",
#   stop key to "APPLF" to avoid unnecessary disk access
table.range(:prefix => 'APPLE')

# Row keys with "ACE", "BLUE" or "APPLE" prefix
#   Start key is automatically set to "ACE",
#   stop key to "BLUF"
table.range(:prefix => ['ACE', 'BLUE', 'APPLE'])

# Prefix filter with start key and stop key.
table.range('ACE', 'BLUEMARINE', :prefix => ['ACE', 'BLUE', 'APPLE'])
```

Subsequent calls to `#range` override the range previously defined.

```ruby
# Previous ranges are discarded
scope.range(1, 100).
      range(50..100).
      range(:prefix => 'A').
      range(1, 1000)
  # Same as `scope.range(1, 1000)`
```

### *filter*

You can configure server-side filtering of rows and columns with `HBase::Scoped#filter` calls. Multiple calls have conjunctive effects.

```ruby
# Range scanning the table with filters
table.range(nil, 1000).
      filter(
        # Numbers and characters: Checks if the value is equal to the given value
        'cf1:a' => 'Hello',
        'cf1:b' => 1024,

        # Range of numbers or characters: Checks if the value falls within the range
        'cf1:c' => 100..200,
        'cf1:d' => 'A'..'C',

        # Regular expression: Checks if the value matches the regular expression
        'cf1:e' => /world$/i,

        # Hash: Tests the value with 6 types of operators (:gt, :lt, :gte, :lte, :eq, :ne)
        'cf1:f' => { gt: 1000, lte: 2000 },
        'cf1:g' => { ne: 1000 },

        # Array of the aforementioned types: OR condition (disjunctive)
        'cf1:h' => %w[A B C],
        'cf1:i' => ['A'...'B', 'C', /^D/, { lt: 'F' }]).
      # Multiple calls for conjunctive filtering
      filter('cf1:j' => ['Alice'..'Bob', 'Cat']).
      # Any number of Java filters can be applied
      filter(org.apache.hadoop.hbase.filter.RandomRowFilter.new(0.5)).
each do |record|
  # ...
end
```

### *while*

The `HBase::Scoped#while` method takes the same parameters as the `filter` method. The difference is that each filtering condition passed to the `while` method is wrapped in a `WhileMatchFilter`, which aborts the scan immediately when the condition is not met at a certain row. See the following example.

```ruby
(0...30).each do |idx|
  table.put idx, 'cf1:a' => idx % 10
end

table.filter('cf1:a' => { lte: 1 }).to_a
  # 0, 1, 10, 11, 20, 21
table.while('cf1:a' => { lte: 1 }).to_a
  # 0, 1
  # Scan terminates immediately when the condition is not met.
```

### *project*

`HBase::Scoped#project` allows you to fetch only a subset of columns from each row. Multiple calls have additive effects.

```ruby
# Fetches cf1:a and all columns in column families cf2 and cf3
scoped.project('cf1:a', 'cf2').
       project('cf3')
```

HBase filters can not only filter rows but also columns. Since column filtering can be thought of as a kind of projection, it makes sense to internally apply column filters in `HBase::Scoped#project` instead of in `HBase::Scoped#filter`, although it's still perfectly valid to pass a column filter to the `filter` method.

```ruby
# Column prefix filter:
#   Fetch columns whose qualifiers start with the specified prefixes
scoped.project(:prefix => 'alice').
       project(:prefix => %w[alice bob])

# Column range filter:
#   Fetch columns whose qualifiers fall within the given ranges
scoped.project(:range => 'a'...'c').
       project(:range => ['i'...'k', 'x'...'z'])

# Column pagination filter:
#   Fetch columns within the specified intra-scan offset and limit
scoped.project(:offset => 1000, :limit => 10)
```
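Projection also pairs naturally with `to_hash` described earlier: project exactly the columns of your schema Hash, then decode each row with it. A sketch, assuming rows yielded by a scan support `to_hash` just like GET results:

```ruby
schema = { 'cf1:a' => :fixnum, 'cf2:x' => :string }

# Fetch only the columns we intend to decode
table.project(*schema.keys).each do |row|
  p row.to_hash(schema)  # e.g. { 'cf1:a' => 100, 'cf2:x' => 'Hello' }
end
```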
When using column filters on *fat* rows with many columns, it's advised that you limit the batch size with an `HBase::Scoped#batch` call to avoid fetching all columns at once. Note, however, that when a batch size is set, a single fat row may be returned as multiple rows with the same row key during a scan.

```ruby
# Let's say that we have rows with more than 10 columns whose qualifiers start with `str`
puts scoped.range(1..100).
            project(:prefix => 'str').
            batch(10).
            map { |row| [row.rowkey(:fixnum), row.count].map(&:to_s).join ': ' }

  # 1: 10
  # 1: 10
  # 1: 5
  # 2: 10
  # 2: 2
  # 3: 10
  # ...
```

### Scoped SCAN / GET

```ruby
scoped = table.versions(1).                  # Limits the number of versions
               filter('cf1:a' => 'Hello',    # With filters
                      'cf1:b' => 100...200,
                      'cf1:c' => 'Alice'..'Bob').
               range('rowkey0'..'rowkey2').  # Range of rowkeys.
               project('cf1', 'cf2:x')       # Projection

# Scoped GET
#   Nonexistent or filtered rows are returned as nils
scoped.get(['rowkey1', 'rowkey2', 'rowkey4'])

# Scoped SCAN
scoped.each do |row|
  row.each do |cell|
    # Intra-row scan
  end
end

# Scoped COUNT
#   When counting the number of rows, use `HBase::Scoped#count`
#   instead of just iterating through the scope, as it internally
#   minimizes the amount of data fetched with KeyOnlyFilter
scoped.count

# This should be even faster as it dramatically reduces the number of RPC calls
scoped.caching(1000).count
```

## Basic aggregation using coprocessor

You can perform some basic aggregation using the built-in coprocessor called `org.apache.hadoop.hbase.coprocessor.AggregateImplementation`.

To enable this feature, call the `enable_aggregation!` method, which adds the coprocessor to the table.

```ruby
table.enable_aggregation!
# Just a shorthand notation for
#   table.add_coprocessor! 'org.apache.hadoop.hbase.coprocessor.AggregateImplementation'
```

Then you can get the sum, average, minimum, maximum, row count, and standard deviation of the projected columns.

```ruby
# cf1:a must hold 8-byte integer values
table.project('cf1:a').aggregate(:sum)
table.project('cf1:a').aggregate(:avg)
table.project('cf1:a').aggregate(:min)
table.project('cf1:a').aggregate(:max)
table.project('cf1:a').aggregate(:std)
table.project('cf1:a').aggregate(:row_count)

# Aggregation of multiple columns
table.project('cf1:a', 'cf1:b').aggregate(:sum)
```

By default, the `aggregate` method assumes that the projected values are 8-byte integers. For other data types, you can pass your own `ColumnInterpreter`.

```ruby
table.project('cf1:b').aggregate(:sum, MyColumnInterpreter.new)
```

## Advanced topics

### Lexicographic scan order

HBase stores rows in the lexicographic order of their rowkeys in byte array representation. Thus the type of a row key affects the scan order.

```ruby
data = { 'cf1:a' => 1 }
(1..15).each do |i|
  table.put i,      data  # Fixnum rowkey (8-byte integer)
  table.put i.to_s, data  # String rowkey
end

table.range(1..3).map { |r| r.rowkey :fixnum }
  # [1, 2, 3]
table.range('1'..'3').map { |r| r.rowkey :string }
  # %w[1 10 11 12 13 14 15 2 3]
```
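When string row keys must nevertheless sort numerically, a common HBase practice (general advice, not specific to this library) is to zero-pad the numeric part to a fixed width so that lexicographic order matches numeric order:

```ruby
# Zero-padded string keys sort lexicographically in numeric order
(1..15).each do |i|
  table.put format('%04d', i), 'cf1:a' => i
end

table.range('0001'..'0003').map { |r| r.rowkey :string }
  # ["0001", "0002", "0003"]
```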
### Non-string column qualifier

If a column qualifier is not a String, an `HBase::ColumnKey` instance should be used instead of the conventional `FAMILY:QUALIFIER` String.

```ruby
table.put 'rowkey',
  'cf1:col1'                    => 'Hello world',
  HBase::ColumnKey(:cf1, 100)   => "Byte representation of an 8-byte integer",
  HBase::ColumnKey(:cf1, bytes) => "Qualifier is an arbitrary byte array"
    # (`bytes` here stands for some Java byte array)

table.get('rowkey').string('cf1:col1')
table.get('rowkey').string(HBase::ColumnKey(:cf1, 100))
# ...
```

### Shorter integers

A Ruby Fixnum is an 8-byte integer, which is equivalent to the `long` type in Java. When you want to use shorter integer types such as int, short, or byte, you can use the special Hash representation of integers.

```ruby
# 4-byte int value as the rowkey
table.put({ int: 12345 }, 'cf1:a' => { byte:  100 },  # 1-byte integer
                          'cf1:b' => { short: 200 },  # 2-byte integer
                          'cf1:c' => { int:   300 },  # 4-byte integer
                          'cf1:4' => 400)             # Ordinary 8-byte integer

result = table.get(int: 12345)

result.byte('cf1:a')   # 100
result.short('cf1:b')  # 200
result.int('cf1:c')    # 300
# ...
```

### Working with byte arrays

In HBase, virtually everything is stored as a byte array. Although *hbase-jruby* tries hard to hide this fact, at some point you may need to get your hands dirty with native Java byte arrays. For example, it's [a common practice][1] to use a composite row key, which is a concatenation of several components of different types.

[1]: http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/

`HBase::ByteArray` is a boxed class for native Java byte arrays, which makes byte array manipulation much easier.

A ByteArray can be created as a concatenation of any number of objects.

```ruby
ba = HBase::ByteArray(100, 3.14, {int: 300}, "Hello World")
```

Then you can slice it and decode each part,

```ruby
# Slicing
first  = ba[0, 8]
second = ba[8...16]

first.decode(:fixnum)  # 100
second.decode(:float)  # 3.14
```

prepend or append more elements to it,

```ruby
ba.unshift 200, true  # Prepend
ba << { short: 300 }  # Append
```

concatenate another ByteArray,

```ruby
ba += HBase::ByteArray(1024)
```

or shift decoded objects from it.

```ruby
ba.shift(:fixnum)
ba.shift(:boolean)
ba.shift(:fixnum)
ba.shift(:float)
ba.shift(:int)
ba.shift(:string, 11)  # Byte length must be given as Strings are not fixed in size
```

The `ByteArray#java` method returns the underlying native Java byte array.

```ruby
ba.java  # Returns the native Java byte array (byte[])
```
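Putting these pieces together, a composite row key can be assembled and taken apart again with `HBase::ByteArray`. A sketch with an illustrative three-part layout, assuming `put` accepts the raw byte array returned by `ByteArray#java` as a row key:

```ruby
# Composite row key: 8-byte id, 4-byte int bucket, variable-length name
key = HBase::ByteArray(42, { int: 2012 }, 'alice')

# Use it as a row key; ByteArray#java returns the native byte[]
table.put key.java, 'cf1:a' => 1

# Decompose the key again (shift consumes the ByteArray)
id     = key.shift(:fixnum)     # 42
bucket = key.shift(:int)        # 2012
name   = key.shift(:string, 5)  # "alice"
```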
### Table administration

`HBase::Table` provides a few *synchronous* table administration methods.

```ruby
# Create a table with configurable table-level properties
table.create!(
  # 1st Hash: Column family specification
  { :cf1 => { :compression => :snappy }, :cf2 => {} },

  # 2nd Hash: Table properties
  :max_filesize       => 256 * 1024 ** 2,
  :deferred_log_flush => false)

# Alter table properties
table.alter!(
  :max_filesize       => 512 * 1024 ** 2,
  :memstore_flushsize =>  64 * 1024 ** 2,
  :readonly           => false,
  :deferred_log_flush => true
)

# Add column family
table.add_family! :cf3, :compression => :snappy,
                        :bloomfilter => :row

# Alter column family
table.alter_family! :cf2, :bloomfilter => :rowcol

# Remove column family
table.delete_family! :cf1

# Add coprocessor
unless table.has_coprocessor?(cp_class_name1)
  table.add_coprocessor! cp_class_name1
end
table.add_coprocessor! cp_class_name2,
  :path => path, :priority => priority, :params => params

# Remove coprocessor
table.remove_coprocessor! cp_class_name1
```

You can perform other types of administrative tasks with the native Java [HBaseAdmin object](http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html), which can be obtained with the `HBase#admin` method. Optionally, a block can be given so that the HBaseAdmin object is automatically closed at the end of the block.

```ruby
# Advanced table administration with HBaseAdmin object
#   http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html
admin = hbase.admin
# ...
admin.close

# With the block
hbase.admin do |admin|
  # ...
end
```

## Test

```bash
# Bash script
export HBASE_JRUBY_TEST_ZK='your-hbase.domain.net'

# Test both on 1.8 and 1.9 modes
for v in --1.8 --1.9; do
  export JRUBY_OPTS=$v
  rake test
done
```

## Contributing

1. Fork it
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create new Pull Request