Sha256: 9177a9ea4599c8a989ce3cf91b4888c292beeb9645e71e7cfd071f879afcf2e8

Contents?: true

Size: 917 Bytes

Versions: 6

Compression:

Stored size: 917 Bytes

Contents

require 'set'

class OCRSDK::PDF < OCRSDK::Image
  # We're on shaky ground regarding which PDFs
  # should be recognized and which shouldn't.
  # Currently, if
  #   images * 20 > length of text
  # then we assume the document might need recognition.
  # The assumption is that there might be a title,
  # page numbers or credits along with the images.
  def recognizeable?
    reader = PDF::Reader.new @image_path

    images = 0
    text   = 0
    chars  = Set.new
    reader.pages.each do |page|
      text   += page.text.length
      chars  += page.text.chars.map(&:ord).uniq
      images += page.xobjects.map { |_name, stream| stream.hash[:Subtype] }.count(:Image)
    end

    # Also count the number of distinct characters, to catch
    # "searchable" documents whose text layer was recognized incorrectly.
    images * 20 > text || chars.length < 10
  rescue PDF::Reader::MalformedPDFError, PDF::Reader::UnsupportedFeatureError
    false
  end
end
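
The decision rule above can be tried without a real PDF. The following is a minimal sketch, not part of ocrsdk: the helper name `needs_recognition?`, the two constants, and the hash-based page format (`:text`, `:images`) are all assumptions made for illustration. It applies the same two checks (image count weighted by 20 against text length, and fewer than 10 distinct characters) to plain Ruby data.

```ruby
require 'set'

# Hypothetical names for the thresholds that appear as literals in the source.
IMAGE_WEIGHT       = 20
MIN_DISTINCT_CHARS = 10

# Hypothetical stand-in for OCRSDK::PDF#recognizeable?.
# `pages` is an array of hashes such as { text: "...", images: 1 },
# an assumed format replacing the PDF::Reader page objects.
def needs_recognition?(pages)
  images = 0
  text   = 0
  chars  = Set.new

  pages.each do |page|
    text   += page[:text].length
    chars.merge(page[:text].chars)
    images += page[:images]
  end

  # Mostly images, or a text layer with suspiciously few distinct characters.
  images * IMAGE_WEIGHT > text || chars.size < MIN_DISTINCT_CHARS
end

scanned  = [{ text: "Page 1", images: 1 }]
digital  = [{ text: "The quick brown fox jumps over the lazy dog. " * 20, images: 1 }]
bad_text = [{ text: "~." * 500, images: 0 }]

puts needs_recognition?(scanned)   # one image outweighs six characters of text
puts needs_recognition?(digital)   # plenty of varied text around a single image
puts needs_recognition?(bad_text)  # long text layer, but only two distinct characters
```

Keeping the thresholds in named constants makes it easier to experiment with the heuristic, which the source comment itself describes as uncertain.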

Version data entries

6 entries across 6 versions & 1 rubygem

Version Path
ocrsdk-0.3.1 lib/ocrsdk/pdf.rb
ocrsdk-0.3.0 lib/ocrsdk/pdf.rb
ocrsdk-0.2.0 lib/ocrsdk/pdf.rb
ocrsdk-0.1.2 lib/ocrsdk/pdf.rb
ocrsdk-0.1.1 lib/ocrsdk/pdf.rb
ocrsdk-0.1.0 lib/ocrsdk/pdf.rb