Sha256: 37360537420fa8765353ff7db36a9437badcf26631e50e93b709016c4377387d
Contents?: true
Size: 1.09 KB
Versions: 3
Compression:
Stored size: 1.09 KB
Contents
    class OCRSDK::PDF < OCRSDK::Image

      # We're on a shaky ground regarding what kind of pdfs
      # should be recognized and what shouldn't.
      # Currently we count that if there are
      #   images * 20 > length of text
      # then this document might need recognition.
      #
      # Assumption is that there might be a title,
      # page numbers or credits along with images.
      #
      # In case of title page we also skip the first page
      # which should not affect documents which will not
      # need to be recognized
      #
      def recognizeable?
        reader = PDF::Reader.new @image_path
        images = 0
        text = 0
        chars = Set.new
        start = reader.pages.length > 1 ? 1 : 0
        reader.pages[start..-1].each do |page|
          text += page.text.length
          chars += page.text.split('').map(&:ord).uniq
          images += page.xobjects.map {|k, v| v.hash[:Subtype]}.count(:Image)
        end

        # count number of distinct characters
        # in case of "searchable", but incorrectly recognized document
        images * 20 > text || chars.length < 10
      rescue PDF::Reader::MalformedPDFError, PDF::Reader::UnsupportedFeatureError
        false
      end

    end
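The decision rule described in the comments above can be tried out without PDF::Reader. The following is a minimal sketch, not part of the gem: `needs_recognition?` and its two parameters are hypothetical names, and the per-page text and image counts are assumed to have been extracted already.

```ruby
require 'set'

# Hypothetical helper (not part of ocrsdk): applies the same rule as
# OCRSDK::PDF#recognizeable? to page data that was extracted beforehand.
#
# page_texts   - array of strings, one per page being considered
# image_counts - array of integers, images found on each of those pages
def needs_recognition?(page_texts, image_counts)
  text_length    = page_texts.sum(&:length)
  images         = image_counts.sum
  distinct_chars = Set.new(page_texts.flat_map(&:chars))

  # Either images dominate the text, or the text layer uses so few
  # distinct characters that it is probably a bad earlier recognition.
  images * 20 > text_length || distinct_chars.length < 10
end

# A scanned page: one image, only a page number as text.
puts needs_recognition?(["12"], [1])

# A plain text page with no images and a varied alphabet.
puts needs_recognition?(["The quick brown fox jumps over the lazy dog"], [0])
```

The first call prints `true` because a single image outweighs two characters of text. The second prints `false` because there are no images and the text contains well over ten distinct characters.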
Version data entries
3 entries across 3 versions & 1 rubygems
Version | Path |
---|---|
ocrsdk-0.3.4 | lib/ocrsdk/pdf.rb |
ocrsdk-0.3.3 | lib/ocrsdk/pdf.rb |
ocrsdk-0.3.2 | lib/ocrsdk/pdf.rb |