Sha256: a98f9c25f244a72d819bb610bbbaf51c60591f75425c902596f4c27e101a6788

Contents?: true

Size: 838 Bytes

Versions: 1

Compression:

Stored size: 838 Bytes

Contents

require 'nokogiri'
require 'pathname'
require 'securerandom'
require 'tmpdir'

class RTesseract
  module Box
    # Directory in which tesseract writes its temporary hOCR output.
    def self.temp_dir
      @file_path = Pathname.new(Dir.tmpdir)
    end

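    # Runs tesseract on +source+ with hOCR output enabled and returns the
    # parsed word boxes. Note that this sets tessedit_create_hocr on the
    # options object that was passed in.
    def self.run(source, options)
      name = "rtesseract_#{SecureRandom.uuid}"
      options.tessedit_create_hocr = 1

      # The output is written next to the base name with a .hocr extension.
      RTesseract::Command.new(source, temp_dir.join(name).to_s, options).run

      parse(temp_dir.join("#{name}.hocr").read)
    end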

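    # Extracts every word span from the hOCR markup and returns an array of
    # hashes holding the word text and its bounding box coordinates.
    def self.parse(content)
      html = Nokogiri::HTML(content)
      html.css('span.ocrx_word, span.ocr_word').map do |word|
        # The title looks like "bbox x0 y0 x1 y1; x_wconf N", so after
        # splitting, indices 1..4 hold the box coordinates.
        attributes = word.attributes['title'].value.to_s.gsub(';', '').split(' ')

        {
          word: word.text,
          x_start: attributes[1].to_i,
          y_start: attributes[2].to_i,
          x_end: attributes[3].to_i,
          y_end: attributes[4].to_i
        }
      end
    end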
  end
end
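
The index arithmetic in parse depends on the layout of the hOCR title attribute. The sketch below isolates that step with plain Ruby strings so it can be tried without Nokogiri or tesseract. The helper name bbox_from_title and the sample title value are assumptions made for illustration; they are not part of the gem.

```ruby
# Hypothetical helper (not part of rtesseract) that mirrors the title
# parsing done inside RTesseract::Box.parse.
def bbox_from_title(title)
  # Drop the ';' separators, then split on whitespace:
  # "bbox 36 92 96 116; x_wconf 95"
  #   => ["bbox", "36", "92", "96", "116", "x_wconf", "95"]
  parts = title.gsub(';', '').split(' ')

  {
    x_start: parts[1].to_i,
    y_start: parts[2].to_i,
    x_end: parts[3].to_i,
    y_end: parts[4].to_i
  }
end

# Assumed sample value in the usual "bbox x0 y0 x1 y1; ..." shape.
sample_title = 'bbox 36 92 96 116; x_wconf 95'
box = bbox_from_title(sample_title)
puts box.inspect
```

Because the fields are read by position, this approach assumes bbox is the first property in the title. A title that lists another property first would yield wrong coordinates, so a stricter variant could search for the bbox token instead.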

Version data entries

1 entry across 1 version & 1 rubygem

Version Path
rtesseract-3.0.0 lib/rtesseract/box.rb