Sha256: 7b4a9d362026b56c7cacda6d3906866d883d2bfc7e748386f984ad9fb3618683
Contents?: true
Size: 1.27 KB
Versions: 3
Compression:
Stored size: 1.27 KB
Contents
#!/usr/bin/env ruby
# coding: utf-8

# A sample script that attempts to extract bates numbers from a PDF file.
# Bates numbers are often used to markup documents being used in legal
# cases. For more info, see http://en.wikipedia.org/wiki/Bates_numbering
#
# Acrobat 9 introduced a markup syntax that directly specifies the bates
# number for each page. For earlier versions, the easiest way to find
# the number is to look for words that match a pattern.
#
# This example attempts to extract numbers using the Acrobat 9 syntax.
# As a fall back, you can use a regular expression to look for words
# that match the numbers you expect in the page content.

require 'rubygems'
require 'pdf/reader'

class BatesReceiver

  attr_reader :numbers

  def initialize
    @numbers = []
  end

  def begin_marked_content(*args)
    return unless args.size >= 2
    return unless args.first == :Artifact
    return unless args[1][:Subtype] == :BatesN

    @numbers << args[1][:Contents]
  end
  alias :begin_marked_content_with_pl :begin_marked_content

end

PDF::Reader.open("bates.pdf") do |reader|
  reader.pages.each do |page|
    receiver = BatesReceiver.new
    page.walk(receiver)
    if receiver.numbers.empty?
      puts page.scan(/CC.+/)
    else
      puts receiver.numbers.inspect
    end
  end
end
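
The header comments describe a regular-expression fallback for PDFs that lack the Acrobat 9 markup. The following is a minimal sketch of that idea on a plain Ruby string, independent of pdf-reader. The names `BATES_PATTERN` and `bates_numbers_in` are hypothetical, and the pattern (a two-to-four letter prefix followed by at least six digits) is only an assumption about what the numbers might look like. Adjust it to the numbering scheme you expect.

```ruby
# Hypothetical pattern: 2-4 uppercase letters followed by 6 or more digits,
# e.g. "CC000123". This is an assumption, not part of the original script.
BATES_PATTERN = /\b[A-Z]{2,4}\d{6,}\b/

# Returns every substring of +text+ that matches +pattern+.
# +text+ is assumed to be the already-extracted text of one page.
def bates_numbers_in(text, pattern = BATES_PATTERN)
  text.scan(pattern)
end

sample = "Exhibit A  CC000123\nContinued on CC000124. See also page 12."
puts bates_numbers_in(sample).inspect
```

A narrower pattern than the script's `/CC.+/` avoids capturing the rest of the line after the prefix, at the cost of needing to know the number format in advance.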
Version data entries
3 entries across 3 versions & 2 rubygems
Version | Path |
---|---|
fireinc-pdf-reader-0.11.0 | examples/extract_bates.rb |
fireinc-pdf-reader-0.11.0.alpha | examples/extract_bates.rb |
pdf-reader-0.11.0.alpha | examples/extract_bates.rb |