Parent

Files

Ankusa::CassandraStorage

Attributes

cassandra[R]

Public Class Methods

new(host='127.0.0.1', port=9160, keyspace = 'ankusa', max_classes = 100) click to toggle source

Necessary to set max classes since current implementation of ruby cassandra client doesn’t support table scans. Using crufty get_range method at the moment.

# File lib/ankusa/cassandra_storage.rb, line 21
def initialize(host='127.0.0.1', port=9160, keyspace = 'ankusa', max_classes = 100)
  @cassandra  = Cassandra.new('system', "#{host}:#{port}")
  @klass_word_counts, @klass_doc_counts = {}
  @keyspace    = keyspace
  @max_classes = max_classes
  init_tables
end

Public Instance Methods

classnames() click to toggle source

Fetch the names of the distinct classes for classification: eg. :spam, :good, etc

# File lib/ankusa/cassandra_storage.rb, line 33
def classnames
  @cassandra.get_range(:totals, {:start => '', :finish => '', :count => @max_classes}).inject([]) do |cs, key_slice|
    cs << key_slice.key.to_sym
  end
end
close() click to toggle source

Doesn’t do anything

# File lib/ankusa/cassandra_storage.rb, line 172
def close
end
doc_count_totals() click to toggle source
# File lib/ankusa/cassandra_storage.rb, line 165
def doc_count_totals
  get_summary "doc_count"
end
drop_tables() click to toggle source

Drop ankusa keyspace, reset internal caches

FIXME: truncate doesn’t work with cassandra-beta2

# File lib/ankusa/cassandra_storage.rb, line 49
def drop_tables
  @cassandra.truncate!('classes')
  @cassandra.truncate!('totals')
  @cassandra.drop_keyspace(@keyspace)
  @klass_word_counts, @klass_doc_counts = {}
end
get_doc_count(klass) click to toggle source

Fetch total documents for a given class and cache it

# File lib/ankusa/cassandra_storage.rb, line 107
def get_doc_count(klass)
  @klass_doc_counts[klass] = @cassandra.get(:totals, klass.to_s, "doc_count").values.last.to_f
end
get_total_word_count(klass) click to toggle source

Fetch total word count for a given class and cache it

# File lib/ankusa/cassandra_storage.rb, line 100
def get_total_word_count(klass)
  @klass_word_counts[klass] = @cassandra.get(:totals, klass.to_s, "wordcount").values.last.to_f
end
get_vocabulary_sizes() click to toggle source

Does a table ‘scan’ of summary table pulling out the ‘vocabsize’ column from each row. Generates a hash of (class, vocab_size) key value pairs

# File lib/ankusa/cassandra_storage.rb, line 93
def get_vocabulary_sizes
  get_summary "vocabsize"
end
get_word_counts(word) click to toggle source

Fetch hash of word counts as a single row from cassandra. Here column_name is the class and column value is the count

# File lib/ankusa/cassandra_storage.rb, line 82
def get_word_counts(word)
  # fetch all (class,count) pairs for a given word
  row = @cassandra.get(:classes, word.to_s)
  return row.to_hash if row.empty?
  row.inject({}){|counts, col| counts[col.first.to_sym] = [col.last.to_f,0].max; counts}
end
incr_doc_count(klass, count) click to toggle source

Increment total document count for a given class by ‘count’

# File lib/ankusa/cassandra_storage.rb, line 157
def incr_doc_count(klass, count)
  klass = klass.to_s
  doc_count = @cassandra.get(:totals, klass, "doc_count").values.last.to_i
  doc_count += count
  @cassandra.insert(:totals, klass, {"doc_count" => doc_count.to_s})
  @klass_doc_counts[klass.to_sym] = doc_count
end
incr_total_word_count(klass, count) click to toggle source

Increment total word count for a given class by ‘count’

# File lib/ankusa/cassandra_storage.rb, line 146
def incr_total_word_count(klass, count)
  klass = klass.to_s
  wordcount = @cassandra.get(:totals, klass, "wordcount").values.last.to_i
  wordcount += count
  @cassandra.insert(:totals, klass, {"wordcount" => wordcount.to_s})
  @klass_word_counts[klass.to_sym] = wordcount
end
incr_word_count(klass, word, count) click to toggle source

Increment the count for a given (word,class) pair. Evidently, cassandra does not support atomic increment/decrement. Psh. HBase uses ZooKeeper to implement atomic operations, ain’t it special?

# File lib/ankusa/cassandra_storage.rb, line 116
def incr_word_count(klass, word, count)
  # Only wants strings
  klass = klass.to_s
  word  = word.to_s

  prior_count = @cassandra.get(:classes, word, klass).values.last.to_i
  new_count   = prior_count + count
  @cassandra.insert(:classes, word, {klass => new_count.to_s})

  if (prior_count == 0 && count > 0)
    #
    # we've never seen this word before and we're not trying to unlearn it
    #
    vocab_size = @cassandra.get(:totals, klass, "vocabsize").values.last.to_i
    vocab_size += 1
    @cassandra.insert(:totals, klass, {"vocabsize" => vocab_size.to_s})
  elsif new_count == 0
    #
    # we've seen this word before but we're trying to unlearn it
    #
    vocab_size = @cassandra.get(:totals, klass, "vocabsize").values.last.to_i
    vocab_size -= 1
    @cassandra.insert(:totals, klass, {"vocabsize" => vocab_size.to_s})
  end
  new_count
end
init_tables() click to toggle source

Create required keyspace and column families

# File lib/ankusa/cassandra_storage.rb, line 60
def init_tables
  # Do nothing if keyspace already exists
  if @cassandra.keyspaces.include?(@keyspace)
    @cassandra.keyspace = @keyspace
  else
    freq_table    = Cassandra::ColumnFamily.new({:keyspace => @keyspace, :name => "classes"}) # word  => {classname => count}
    summary_table = Cassandra::ColumnFamily.new({:keyspace => @keyspace, :name => "totals"})  # class => {wordcount => count}
    ks_def = Cassandra::Keyspace.new({
        :name               => @keyspace,
        :strategy_class     => 'org.apache.cassandra.locator.SimpleStrategy',
        :replication_factor => 1,
        :cf_defs            => [freq_table, summary_table] 
      })
    @cassandra.add_keyspace ks_def
    @cassandra.keyspace = @keyspace
  end
end
reset() click to toggle source
# File lib/ankusa/cassandra_storage.rb, line 39
def reset
  drop_tables
  init_tables
end

Protected Instance Methods

get_summary(name) click to toggle source

Fetch 100 rows from summary table, yes, increase if necessary

# File lib/ankusa/cassandra_storage.rb, line 180
def get_summary(name)
  counts = {}
  @cassandra.get_range(:totals, {:start => '', :finish => '', :count => @max_classes}).each do |key_slice|
    # keyslice is a clunky thrift object, map into a ruby hash
    row = key_slice.columns.inject({}){|hsh, c| hsh[c.column.name] = c.column.value; hsh}
    counts[key_slice.key.to_sym] = row[name].to_f
  end
  counts
end

[Validate]

Generated with the Darkfish Rdoc Generator 2.