Sha256: b0d99ef96bcf1a423d520bc25e46a510b6ba98158e7bf761f3591fcc40a5f255

Contents?: true

Size: 1.99 KB

Versions: 3

Compression:

Stored size: 1.99 KB

Contents

#!/usr/bin/env ruby
require 'rubygems'
require 'wukong/script'

module WordCount
  class Mapper < Wukong::Streamer::LineStreamer
    #
    # Emit each word in each line.
    #
    def process line
      tokenize(line).each{|word| yield [word, 1] }
    end

    #
    # Split a string into its constituent words.
    #
    # This is pretty simpleminded:
    # * Downcase the whole string.
    # * Split at any non-alphanumeric boundary, including '_' and '-'.
    # * However, preserve the special cases of 's, 'd or 't at the end of a
    #   word.
    #
    #   tokenize("Ability is a poor man's wealth #johnwoodenquote")
    #   # => ["ability", "is", "a", "poor", "man's", "wealth", "johnwoodenquote"]
    #
    def tokenize str
      return [] if str.blank?
      str = str.downcase
      # Kill off all punctuation except [stuff]'s, [stuff]'d or [stuff]'t.
      # This includes hyphens, so hyphenated words are split.
      str = str.
        gsub(/[^a-zA-Z0-9\']+/, ' ').
        gsub(/(\w)\'([std])\b/, '\1!\2').gsub(/\'/, ' ').gsub(/!/, "'")
      # Busticate at whitespace
      words = str.split(/\s+/)
      words.reject!{|w| w.blank? }
      words
    end
  end

  #
  # A bit kinder to your memory manager: keep a running count, one record at a time.
  #
  class Reducer2 < Wukong::Streamer::AccumulatingReducer
    
    def start!(*args)
      @key_count = 0
    end
    
    def accumulate(*args)
      @key_count += 1
    end
    
    def finalize
      yield [ key, @key_count ]
    end
  end

  #
  # You can stack up all the values in a list, then sum them at once.
  #
  # This isn't good style, as it means the whole list is held in memory.
  #
  class Reducer1 < Wukong::Streamer::ListReducer
    def finalize
      yield [ key, values.map(&:last).map(&:to_i).inject(0){|tot,x| tot+x } ]
    end
  end

  #
  # Easiest of all, though: counting keys is common enough that a reducer for it is already included.
  #
  require 'wukong/streamer/count_keys'
  class Reducer3 < Wukong::Streamer::CountKeys
  end
end

# Execute the script
Wukong.run(
  WordCount::Mapper,
  WordCount::Reducer2
  )
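
To try the logic without Wukong or Hadoop, the sketch below reproduces the same map, sort and reduce steps in plain Ruby. It is an illustration only, not Wukong's implementation: `map_phase`, `reduce_phase` and `word_count` are hypothetical names, `Array#sort` stands in for the shuffle that Hadoop streaming would perform between the two phases, and `String#empty?` replaces the `blank?` helper.

```ruby
# Plain-Ruby sketch of the word count above. No Wukong, no Hadoop.
# All method names here are illustrative assumptions.

# Same regex steps as the Mapper's tokenize method.
def tokenize(str)
  return [] if str.nil? || str.strip.empty?
  str = str.downcase
  # Replace every run of punctuation (hyphens and '_' included) with a space.
  str = str.gsub(/[^a-z0-9']+/, ' ')
  # Protect a trailing 's, 'd or 't with a '!' placeholder.
  str = str.gsub(/(\w)'([std])\b/, '\1!\2')
  # Drop every remaining apostrophe, then restore the protected ones.
  str = str.gsub(/'/, ' ')
  str = str.gsub(/!/, "'")
  str.split(/\s+/).reject(&:empty?)
end

# Map: emit a [word, 1] pair for each word in each line.
def map_phase(lines)
  lines.flat_map { |line| tokenize(line).map { |word| [word, 1] } }
end

# Reduce: walk key-sorted pairs while holding only the current key and its
# running count, in the spirit of Reducer2.
def reduce_phase(sorted_pairs)
  counts = []
  current_key = nil
  key_count = 0
  sorted_pairs.each do |key, _value|
    if key != current_key
      counts << [current_key, key_count] unless current_key.nil?
      current_key = key
      key_count = 0
    end
    key_count += 1
  end
  counts << [current_key, key_count] unless current_key.nil?
  counts
end

# Sorting groups identical words next to each other before the reduce step.
def word_count(lines)
  reduce_phase(map_phase(lines).sort)
end

word_count(["The cat's hat", "the hat-box"]).each do |word, count|
  puts "#{word}\t#{count}"
end
```

Like Reducer2, `reduce_phase` counts records rather than summing their values, which gives the same answer here only because the map step always emits 1.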

Version data entries

3 entries across 3 versions & 1 rubygems

Version Path
wukong-3.0.0.pre old/examples/simple_word_count.rb
wukong-2.0.2 examples/simple_word_count.rb
wukong-2.0.1 examples/simple_word_count.rb