Sha256: b0d99ef96bcf1a423d520bc25e46a510b6ba98158e7bf761f3591fcc40a5f255
Contents?: true
Size: 1.99 KB
Versions: 3
Compression:
Stored size: 1.99 KB
Contents
#!/usr/bin/env ruby
require 'rubygems'
require 'wukong/script'

module WordCount
  class Mapper < Wukong::Streamer::LineStreamer
    #
    # Emit each word in each line.
    #
    def process line
      tokenize(line).each{|word| yield [word, 1] }
    end

    #
    # Split a string into its constituent words.
    #
    # This is pretty simpleminded:
    # * downcase the word
    # * Split at any non-alphanumeric boundary, including '_'
    # * However, preserve the special cases of 's, 'd or 't at the end of a
    #   word.
    #
    #   tokenize("Ability is a poor man's wealth #johnwoodenquote")
    #   # => ["ability", "is", "a", "poor", "man's", "wealth", "johnwoodenquote"]
    #
    def tokenize str
      return [] if str.blank?
      str = str.downcase;
      # kill off all punctuation except [stuff]'s or [stuff]'t
      # this includes hyphens (words are split)
      str = str.
        gsub(/[^a-zA-Z0-9\']+/, ' ').
        gsub(/(\w)\'([std])\b/, '\1!\2').gsub(/\'/, ' ').gsub(/!/, "'")
      # Busticate at whitespace
      words = str.split(/\s+/)
      words.reject!{|w| w.blank? }
      words
    end
  end

  #
  # A bit kinder to your memory manager: accumulate the sum record-by-record:
  #
  class Reducer2 < Wukong::Streamer::AccumulatingReducer
    def start!(*args)
      @key_count = 0
    end

    def accumulate(*args)
      @key_count += 1
    end

    def finalize
      yield [ key, @key_count ]
    end
  end

  #
  # You can stack up all the values in a list then sum them at once.
  #
  # This isn't good style, as it means the whole list is held in memory
  #
  class Reducer1 < Wukong::Streamer::ListReducer
    def finalize
      yield [ key, values.map(&:last).map(&:to_i).inject(0){|x,tot| x+tot } ]
    end
  end

  #
  # ... easiest of all, though: this is common enough that it's already included
  #
  require 'wukong/streamer/count_keys'
  class Reducer3 < Wukong::Streamer::CountKeys
  end
end

# Execute the script
Wukong.run(
  WordCount::Mapper,
  WordCount::Reducer2
  )
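The tokenizing and record-by-record counting described in the comments above can be tried without Wukong or Hadoop. The following is a minimal plain-Ruby sketch, not part of the gem: `map_line` and `reduce_sorted` are hypothetical helper names, `blank?` is replaced with core-Ruby checks, and the sort step only stands in for the shuffle that a real streaming job would do between the mapper and the reducer.

```ruby
# Standalone sketch of the word-count flow; no Wukong or ActiveSupport needed.
# map_line and reduce_sorted are illustrative names, not Wukong API.

# Same rules as the Mapper's tokenize: downcase, split on anything that is
# not alphanumeric, but keep a trailing 's, 'd or 't attached to its word.
def tokenize(str)
  return [] if str.nil? || str.strip.empty?
  str.downcase
     .gsub(/[^a-z0-9']+/, ' ')           # punctuation (and '_', '-') becomes a space
     .gsub(/(\w)'([std])\b/, '\1!\2')    # protect 's, 'd, 't with a placeholder
     .gsub(/'/, ' ')                     # drop every other apostrophe
     .gsub(/!/, "'")                     # restore the protected apostrophes
     .split(/\s+/)
     .reject(&:empty?)
end

# Mapper step: one [word, 1] record per word in the line.
def map_line(line)
  tokenize(line).map { |word| [word, 1] }
end

# Reducer step in the style of Reducer2: walk records that are already
# sorted by key and keep only a running count for the current key.
def reduce_sorted(pairs)
  results = []
  current_key = nil
  count = 0
  pairs.each do |key, _value|
    if key != current_key
      results << [current_key, count] unless current_key.nil?
      current_key = key
      count = 0
    end
    count += 1
  end
  results << [current_key, count] unless current_key.nil?
  results
end

lines  = ["The cat sat", "the cat's hat"]
pairs  = lines.flat_map { |line| map_line(line) }.sort
counts = reduce_sorted(pairs)
counts.each { |word, n| puts "#{word}\t#{n}" }
```

Because only one counter is live at a time, memory use stays constant per key, which is the advantage the source attributes to Reducer2 over the list-based Reducer1.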
Version data entries
3 entries across 3 versions & 1 rubygems
Version | Path |
---|---|
wukong-3.0.0.pre | old/examples/simple_word_count.rb |
wukong-2.0.2 | examples/simple_word_count.rb |
wukong-2.0.1 | examples/simple_word_count.rb |