org.apache.cassandra.hadoop
Class ColumnFamilyInputFormat

java.lang.Object
  extended by org.apache.hadoop.mapreduce.InputFormat<java.nio.ByteBuffer,java.util.SortedMap<java.nio.ByteBuffer,IColumn>>
      extended by org.apache.cassandra.hadoop.ColumnFamilyInputFormat

public class ColumnFamilyInputFormat
extends org.apache.hadoop.mapreduce.InputFormat<java.nio.ByteBuffer,java.util.SortedMap<java.nio.ByteBuffer,IColumn>>

Hadoop InputFormat allowing map/reduce against Cassandra rows within one ColumnFamily. At minimum, you need to set the column family and predicate (a description of the columns to extract from each row) in your Hadoop job Configuration. The ConfigHelper class is provided to make this simple:

    ConfigHelper.setColumnFamily
    ConfigHelper.setSlicePredicate

You can also configure the number of rows per InputSplit with ConfigHelper.setInputSplitSize. This should be "as big as possible, but no bigger": each InputSplit is read from Cassandra with multiple get_range_slices queries, and the per-call overhead of get_range_slices is high, so larger split sizes are better -- but if the split is too large, you will run out of memory. The default split size is 64k rows.
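
For illustration, a minimal job setup following the calls named above. This is a sketch, not part of this class's API: "Keyspace1", "Standard1", the column name, and the CassandraJobSetup class are placeholders, and exact ConfigHelper signatures may differ between Cassandra releases.

    import java.nio.ByteBuffer;
    import java.util.Arrays;

    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class CassandraJobSetup
    {
        public static Job configure(Configuration conf) throws Exception
        {
            Job job = new Job(conf, "cassandra-input-example");
            job.setInputFormatClass(ColumnFamilyInputFormat.class);

            // Required: the column family to read and the columns to extract
            // from each row ("Keyspace1"/"Standard1" are placeholder names).
            ConfigHelper.setColumnFamily(job.getConfiguration(), "Keyspace1", "Standard1");
            SlicePredicate predicate = new SlicePredicate().setColumn_names(
                    Arrays.asList(ByteBuffer.wrap("name".getBytes("UTF-8"))));
            ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);

            // Optional: rows per InputSplit -- "as big as possible, but no bigger."
            ConfigHelper.setInputSplitSize(job.getConfiguration(), 65536);
            return job;
        }
    }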


Constructor Summary
ColumnFamilyInputFormat()
           
 
Method Summary
 org.apache.hadoop.mapreduce.RecordReader<java.nio.ByteBuffer,java.util.SortedMap<java.nio.ByteBuffer,IColumn>> createRecordReader(org.apache.hadoop.mapreduce.InputSplit inputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext taskAttemptContext)
          Create a record reader for a given split.
 java.util.List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
          Logically split the set of input files for the job.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ColumnFamilyInputFormat

public ColumnFamilyInputFormat()

Method Detail

getSplits

public java.util.List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
                                                                 throws java.io.IOException
Description copied from class: org.apache.hadoop.mapreduce.InputFormat
Logically split the set of input files for the job.

Each InputSplit is then assigned to an individual Mapper for processing.

Note: The split is a logical split of the inputs; the input files are not physically split into chunks. For example, a split could be an <input-file-path, start, offset> tuple. The InputFormat also creates the RecordReader to read the InputSplit.

Specified by:
getSplits in class org.apache.hadoop.mapreduce.InputFormat<java.nio.ByteBuffer,java.util.SortedMap<java.nio.ByteBuffer,IColumn>>
Parameters:
context - job configuration.
Returns:
a list of InputSplits for the job.
Throws:
java.io.IOException
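
A brief usage sketch, assuming a Hadoop release (such as 0.20.x, contemporary with this class) in which Job extends JobContext; CassandraJobSetup is the hypothetical helper sketched in the class description above:

    // List the splits Hadoop would create for a configured job.
    Job job = CassandraJobSetup.configure(new Configuration());
    java.util.List<org.apache.hadoop.mapreduce.InputSplit> splits =
            new ColumnFamilyInputFormat().getSplits(job);
    System.out.println(splits.size() + " splits; one mapper runs per split");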

createRecordReader

public org.apache.hadoop.mapreduce.RecordReader<java.nio.ByteBuffer,java.util.SortedMap<java.nio.ByteBuffer,IColumn>> createRecordReader(org.apache.hadoop.mapreduce.InputSplit inputSplit,
                                                                                                                                         org.apache.hadoop.mapreduce.TaskAttemptContext taskAttemptContext)
                                                                                                                                  throws java.io.IOException,
                                                                                                                                         java.lang.InterruptedException
Description copied from class: org.apache.hadoop.mapreduce.InputFormat
Create a record reader for a given split. The framework will call RecordReader.initialize(InputSplit, TaskAttemptContext) before the split is used.

Specified by:
createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<java.nio.ByteBuffer,java.util.SortedMap<java.nio.ByteBuffer,IColumn>>
Parameters:
inputSplit - the split to be read
taskAttemptContext - the information about the task
Returns:
a new record reader
Throws:
java.io.IOException
java.lang.InterruptedException
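
The RecordReader created here presents each Cassandra row to the mapper as a ByteBuffer row key paired with a SortedMap of that row's columns, keyed by column name. A sketch of a mapper consuming those types; RowCountMapper, its output types, and the UTF-8 key encoding are illustrative assumptions, not part of this API:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.charset.Charset;
    import java.util.SortedMap;

    import org.apache.cassandra.db.IColumn;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class RowCountMapper
            extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable>
    {
        @Override
        protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
                throws IOException, InterruptedException
        {
            // Each call receives one Cassandra row: decode the row key (assumed
            // UTF-8 here) and emit how many columns the slice predicate matched.
            byte[] raw = new byte[key.remaining()];
            key.duplicate().get(raw);
            context.write(new Text(new String(raw, Charset.forName("UTF-8"))),
                          new LongWritable(columns.size()));
        }
    }

Register such a mapper with job.setMapperClass(RowCountMapper.class) alongside the configuration shown in the class description.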


Copyright © 2010 The Apache Software Foundation