Sha256: 9177a9ea4599c8a989ce3cf91b4888c292beeb9645e71e7cfd071f879afcf2e8
Contents?: true
Size: 917 Bytes
Versions: 6
Compression:
Stored size: 917 Bytes
Contents
class OCRSDK::PDF < OCRSDK::Image

  # We're on a shaky ground regarding what kind of pdfs
  # should be recognized and what shouldn't.
  # Currently we count that if there are
  #   images * 20 > length of text
  # then this document might need recognition.
  # Assumption is that there might be a title,
  # page numbers or credits along with images.
  def recognizeable?
    reader = PDF::Reader.new @image_path
    images = 0
    text   = 0
    chars  = Set.new
    reader.pages.each do |page|
      text   += page.text.length
      chars  += page.text.split('').map(&:ord).uniq
      images += page.xobjects.map {|k, v| v.hash[:Subtype]}.count(:Image)
    end

    # count number of distinct characters
    # in case of "searchable", but incorrectly recognized document
    images * 20 > text || chars.length < 10
  rescue PDF::Reader::MalformedPDFError, PDF::Reader::UnsupportedFeatureError
    false
  end

end
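
The decision rule in `recognizeable?` can be tried without a PDF on disk. The following is a minimal sketch, not part of the gem: `PageStats`, `needs_recognition?`, and the keyword parameters `image_weight` and `min_distinct_chars` are hypothetical names introduced here for illustration, and the page text and image counts are supplied directly instead of being read through `PDF::Reader`.

```ruby
require 'set'

# Hypothetical stand-in for one PDF page: extracted text plus image count.
PageStats = Struct.new(:text, :image_count)

# Applies the same rule as the method above to plain data:
# flag the document when images outweigh text (images * 20 > text length)
# or when the text uses fewer than 10 distinct characters.
def needs_recognition?(pages, image_weight: 20, min_distinct_chars: 10)
  images      = 0
  text_length = 0
  chars       = Set.new

  pages.each do |page|
    text_length += page.text.length
    chars.merge(page.text.each_char.map(&:ord))
    images += page.image_count
  end

  images * image_weight > text_length || chars.length < min_distinct_chars
end

# One image and only a page label: 1 * 20 > 6
scanned = [PageStats.new("Page 1", 1)]

# One image but 450 characters of varied text: 1 * 20 > 450 is false
digital = [PageStats.new("The quick brown fox jumps over the lazy dog. " * 10, 1)]

# No images and plenty of text, but only 2 distinct characters
garbled = [PageStats.new("|||||.....|||||" * 50, 0)]

puts needs_recognition?(scanned) # true
puts needs_recognition?(digital) # false
puts needs_recognition?(garbled) # true
```

Under this rule a document with no extractable text at all is also flagged, because it has fewer than 10 distinct characters.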
Version data entries
6 entries across 6 versions & 1 rubygems
Version | Path |
---|---|
ocrsdk-0.3.1 | lib/ocrsdk/pdf.rb |
ocrsdk-0.3.0 | lib/ocrsdk/pdf.rb |
ocrsdk-0.2.0 | lib/ocrsdk/pdf.rb |
ocrsdk-0.1.2 | lib/ocrsdk/pdf.rb |
ocrsdk-0.1.1 | lib/ocrsdk/pdf.rb |
ocrsdk-0.1.0 | lib/ocrsdk/pdf.rb |