Sha256: 163ca4ddb3f9b9e8707dfdc658c87d7a6b35b09874fee0f33ff5f79bf5cca93b

Contents?: true

Size: 748 Bytes

Versions: 5

Compression:

Stored size: 748 Bytes

Contents

module Lda
  class Document
    attr_reader :corpus, :words, :counts, :length, :total, :tokens

    def initialize(corpus)
      @corpus = corpus

      @words  = []
      @counts = []
      @tokens = []
      @length = 0
      @total  = 0
    end

    #
    # Recompute @total (the sum of all entries in @counts) and
    # @length (the number of entries in @words).
    #
    def recompute
      @total = @counts.inject(0) { |sum, count| sum + count }
      @length = @words.size
    end

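    # The base document holds no raw text, so this returns false.
    def has_text?
      false
    end

    # Hook applied to the token list in tokenize; the base class returns
    # the tokens unchanged.
    def handle(tokens)
      tokens
    end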

    def tokenize(text)
      # Replace every run of characters other than letters, apostrophes, and
      # whitespace with a space, then collapse whitespace runs to single spaces.
      clean_text = text.gsub(/[^A-Za-z'\s]+/, ' ').gsub(/\s+/, ' ')
      @tokens = handle(clean_text.split(' '))
    end
  end
end
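
The sketch below shows how `tokenize`, `handle`, and `recompute` in the listing above might be exercised. It repeats a compact copy of the class so that it runs without the gem installed. `DowncasingDocument`, the `nil` corpus, the sample sentence, and the integer word ids and counts are all assumptions made for illustration; none of them come from the gem.

```ruby
# Compact copy of Lda::Document from the listing, so the sketch is self-contained.
module Lda
  class Document
    attr_reader :corpus, :words, :counts, :length, :total, :tokens

    def initialize(corpus)
      @corpus = corpus
      @words  = []
      @counts = []
      @tokens = []
      @length = 0
      @total  = 0
    end

    def recompute
      @total  = @counts.inject(0) { |sum, count| sum + count }
      @length = @words.size
    end

    def handle(tokens)
      tokens
    end

    def tokenize(text)
      clean_text = text.gsub(/[^A-Za-z'\s]+/, ' ').gsub(/\s+/, ' ')
      @tokens = handle(clean_text.split(' '))
    end
  end
end

# Hypothetical subclass (not part of the gem): lowercases every token
# by overriding the handle hook.
class DowncasingDocument < Lda::Document
  def handle(tokens)
    tokens.map { |token| token.downcase }
  end
end

doc = DowncasingDocument.new(nil) # nil stands in for a real corpus object

# Digits and punctuation are replaced by spaces; apostrophes survive.
result = doc.tokenize("Don't panic: 42 topics, 3 docs!")
p result # => ["don't", "panic", "topics", "docs"]

# Made-up word ids and counts, appended through the reader-exposed arrays.
doc.words.push(0, 3)
doc.counts.push(2, 5)
doc.recompute
p [doc.length, doc.total] # => [2, 7]
```

Because `tokenize` passes its token list through `handle`, a subclass can change token post-processing without touching the regular expressions. Note that `recompute` must be called explicitly after `words` or `counts` change, since `length` and `total` are cached values.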

Version data entries

5 entries across 5 versions & 2 rubygems

Version Path
ealdent-lda-ruby-0.3.0 lib/lda-ruby/document/document.rb
ealdent-lda-ruby-0.3.1 lib/lda-ruby/document/document.rb
lda-ruby-0.3.5 lib/lda-ruby/document/document.rb
lda-ruby-0.3.4 lib/lda-ruby/document/document.rb
lda-ruby-0.3.1 lib/lda-ruby/document/document.rb