Sha256: 5a4271751ff03d0d2b43ee7c219eba60188eeef4b2a0d83f1945d93065ac955e

Contents?: true

Size: 784 Bytes

Versions: 2

Compression:

Stored size: 784 Bytes

Contents

require 'yaml'

module Lda
  class Document
    attr_reader :corpus, :words, :counts, :length, :total, :tokens

    def initialize(corpus)
      @corpus = corpus

      @words  = Array.new
      @counts = Array.new
      @tokens = Array.new
      @length = 0
      @total  = 0
    end

    #
    # Recompute @total (the sum of all entries in @counts) and
    # @length (the number of entries in @words).
    #
    def recompute
      @total = @counts.inject(0) { |sum, i| sum + i }
      @length = @words.size
    end

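    # Reports whether the document carries raw text. The base class
    # always returns false.
    def has_text?
      false
    end

    # Hook applied to the token list inside tokenize. The base
    # implementation returns the tokens unchanged.
    def handle(tokens)
      tokens
    end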

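    # Replace every run of characters other than letters, apostrophes,
    # and whitespace with a single space, collapse whitespace runs, and
    # lowercase the result. The split tokens pass through handle and are
    # stored in @tokens; the method itself returns nil.
    def tokenize(text)
      clean_text = text.gsub(/[^A-Za-z'\s]+/, ' ').gsub(/\s+/, ' ').downcase
      @tokens = handle(clean_text.split(' '))
      nil
    end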
  end
end
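
The base class above leaves `handle` and `has_text?` as overridable hooks and never fills `@words` or `@counts` itself. The following is a minimal sketch of how a subclass might use those hooks, not the gem's own implementation. `StopwordDocument`, its `STOPWORDS` list, and the choice to store token strings in `@words` are all hypothetical; the gem may represent words differently. A trimmed copy of `Lda::Document` is included only so the sketch runs without the gem installed.

```ruby
# Stand-in copy of the base class from the listing, trimmed to what the
# sketch needs, so it runs without the lda-ruby gem installed.
module Lda
  class Document
    attr_reader :corpus, :words, :counts, :length, :total, :tokens

    def initialize(corpus)
      @corpus = corpus
      @words  = []
      @counts = []
      @tokens = []
      @length = 0
      @total  = 0
    end

    def recompute
      @total  = @counts.inject(0) { |sum, i| sum + i }
      @length = @words.size
    end

    def handle(tokens)
      tokens
    end

    def tokenize(text)
      clean_text = text.gsub(/[^A-Za-z'\s]+/, ' ').gsub(/\s+/, ' ').downcase
      @tokens = handle(clean_text.split(' '))
      nil
    end
  end
end

# Hypothetical subclass (not part of lda-ruby): drops a few stop words in
# handle, then fills @words and @counts from the surviving tokens.
class StopwordDocument < Lda::Document
  STOPWORDS = %w[the a of and].freeze

  def initialize(corpus, text)
    super(corpus)
    tokenize(text)                      # sets @tokens via handle
    tally = Hash.new(0)
    @tokens.each { |t| tally[t] += 1 }  # count occurrences per token
    @words  = tally.keys                # assumption: words kept as strings
    @counts = tally.values
    recompute                           # refresh @total and @length
  end

  def has_text?
    true
  end

  # Override the hook: remove stop words from the token list.
  def handle(tokens)
    tokens.reject { |t| STOPWORDS.include?(t) }
  end
end

doc = StopwordDocument.new(nil, "The cat sat; the cat ran 3 times.")
p doc.tokens   # ["cat", "sat", "cat", "ran", "times"]
p doc.length   # 4
p doc.total    # 5
```

The punctuation and the digit are replaced by spaces before splitting, so they never appear as tokens. `nil` stands in for the corpus because the sketch never touches it.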

Version data entries

2 entries across 2 versions & 1 rubygems

Version Path
lda-ruby-0.3.7 lib/lda-ruby/document/document.rb
lda-ruby-0.3.6 lib/lda-ruby/document/document.rb