The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.

It provides programmatic access to the contents of a PDF file with a high degree of flexibility.

The PDF 1.7 specification is a weighty document and not all aspects are currently supported. We welcome submission of PDF files that exhibit unsupported aspects of the spec to assist with improving our support.

= Installation

The recommended installation method is via Rubygems.

  gem install pdf-reader

= Usage

PDF::Reader is designed with a callback-style architecture. The basic concept is to build a receiver class and pass that into PDF::Reader along with the PDF to process.

As PDF::Reader walks the file and encounters various objects (pages, text, images, shapes, etc.) it will call methods on the receiver class. What those methods do is entirely up to you: save the text, extract images, count pages, read metadata, whatever.

For a full list of the supported callback methods and a description of when they will be called, refer to PDF::Reader::Content. See the code examples below for a way to print a list of all the callbacks generated by a file to STDOUT.

= Exceptions

There are two key exceptions that you will need to watch out for when processing a PDF file:

MalformedPDFError - The PDF appears to be corrupt in some way. If you believe the file should be valid, or that a corrupt file didn't raise an exception, please forward a copy of the file to the maintainers and we can attempt to improve the code.

UnsupportedFeatureError - The PDF uses a feature that PDF::Reader doesn't currently support. Again, we welcome submissions of PDF files that exhibit these features to help us with future code improvements.

= Maintainers

- Peter Jones
- James Healy

= Examples

The easiest way to explain how this works in practice is to show some examples.

== Page Counter

A simple app to count the number of pages in a PDF file.

  require 'rubygems'
  require 'pdf/reader'

  class PageReceiver
    attr_accessor :page_count

    def initialize
      @page_count = 0
    end

    # Called when page parsing ends
    def end_page
      @page_count += 1
    end
  end

  receiver = PageReceiver.new
  pdf = PDF::Reader.file("somefile.pdf", receiver)
  puts "#{receiver.page_count} pages"

== List all callbacks generated by a single PDF

WARNING: this will generate a *lot* of output, so you probably want to pipe it through less or to a text file.

  require 'rubygems'
  require 'pdf/reader'

  receiver = PDF::Reader::RegisterReceiver.new
  pdf = PDF::Reader.file("somefile.pdf", receiver)
  receiver.callbacks.each do |cb|
    puts cb
  end

== Basic RSpec of a generated PDF

  require 'rubygems'
  require 'pdf/reader'
  require 'pdf/writer'
  require 'spec'

  class PageTextReceiver
    attr_accessor :content

    def initialize
      @content = []
    end

    # Called when page parsing starts
    def begin_page(arg = nil)
      @content << ""
    end

    def show_text(string, *params)
      @content.last << string.strip
    end

    # there are a few text callbacks, so make sure we process them all
    alias :super_show_text :show_text
    alias :move_to_next_line_and_show_text :show_text
    alias :set_spacing_next_line_show_text :show_text
  end

  context "My generated PDF" do
    specify "should have the correct text on 2 pages" do

      # generate our PDF
      pdf = PDF::Writer.new
      pdf.text "Chunky", :font_size => 32, :justification => :center
      pdf.start_new_page
      pdf.text "Bacon", :font_size => 32, :justification => :center
      pdf.save_as("chunkybacon.pdf")

      # process the PDF
      receiver = PageTextReceiver.new
      PDF::Reader.file("chunkybacon.pdf", receiver)

      # confirm the text appears on the correct pages
      receiver.content.size.should eql(2)
      receiver.content[0].should eql("Chunky")
      receiver.content[1].should eql("Bacon")
    end
  end

== Extract ISBNs

Parse all text in the requested PDF file and print out any valid book ISBNs. Requires the rbook-isbn gem.

  require 'rubygems'
  require 'pdf/reader'
  require 'rbook/isbn'

  class ISBNReceiver

    # there are a few text callbacks, so make sure we process them all
    def show_text(string, *params)
      process_words(string.split(/\W+/))
    end

    def super_show_text(string, *params)
      process_words(string.split(/\W+/))
    end

    def move_to_next_line_and_show_text(string)
      process_words(string.split(/\W+/))
    end

    def set_spacing_next_line_show_text(aw, ac, string)
      process_words(string.split(/\W+/))
    end

    private

    # check if any items in the supplied array are a valid ISBN, and print
    # any that are to the console
    def process_words(words)
      words.each do |word|
        word.strip!
        puts "#{RBook::ISBN.convert_to_isbn13(word)}" if RBook::ISBN.valid_isbn?(word)
      end
    end
  end

  receiver = ISBNReceiver.new
  PDF::Reader.file("somefile.pdf", receiver)

= Resources

- PDF::Reader Homepage: http://software.pmade.com/pdfreader
- PDF::Reader Rubyforge Page: http://rubyforge.org/projects/pdf-reader/
- PDF Specification: http://www.adobe.com/devnet/pdf/pdf_reference.html
- PDF Tutorial Slide Presentations: http://home.comcast.net/~jk05/presentations/PDFTutorials.html
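
= Appendix: A Toy Sketch of Callback Dispatch

The callback-style architecture described under Usage can be tried out without a PDF file. The sketch below is an illustration only: ToyWalker, PageCountReceiver and the hard-coded event list are hypothetical names invented for this example and are not part of PDF::Reader. It assumes, as the Page Counter example suggests, that callbacks the receiver does not define are simply skipped.

```ruby
# Toy illustration only: ToyWalker and its event list are hypothetical and
# are NOT part of PDF::Reader. They mimic the idea of a parser calling
# methods on a receiver as it encounters objects.
class ToyWalker
  # events is an array of [callback_name, *args] entries
  def initialize(events)
    @events = events
  end

  # Call each callback on the receiver, skipping any it does not define
  def walk(receiver)
    @events.each do |name, *args|
      receiver.send(name, *args) if receiver.respond_to?(name)
    end
  end
end

# A receiver that only cares about one callback
class PageCountReceiver
  attr_reader :page_count

  def initialize
    @page_count = 0
  end

  # Called when a page ends
  def end_page
    @page_count += 1
  end
end

# A made-up stream of events for a two page document
events = [
  [:begin_page],
  [:show_text, "Chunky"],
  [:end_page],
  [:begin_page],
  [:show_text, "Bacon"],
  [:end_page]
]

receiver = PageCountReceiver.new
ToyWalker.new(events).walk(receiver)
puts "#{receiver.page_count} pages"
```

The receiver here defines only end_page, so the begin_page and show_text events are ignored. A real receiver would be passed to PDF::Reader.file in the same way as in the examples above.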