Sha256: 07d59c6e974ed5d4805f5d2ff1ecb4c3f44368611c455990302791be032eced2

Contents?: true

Size: 1.32 KB

Versions: 1

Compression:

Stored size: 1.32 KB

Contents

= rTika

A JRuby wrapper around the excellent Apache Tika content extraction library.
Feed rTika your files and get extracted text and metadata in return.

== Usage
Make sure you're on JRuby first.

  require 'rubygems'
  require 'rtika'

  result = RTika::FileParser.parse("mywordfile.doc")
  puts result.content # prints out the document's contents
  puts result.title   # fetches title from the doc's metadata
  puts result.author  # fetches author from the doc's metadata

  result = RTika::StringParser.parse("<html>
		<head><title>MYTITLE</title></head>
		<body>this is my very ... long ... string</body></html>")
  puts result.content # returns <body> contents
  puts result.title   # returns <title> contents

Options
  
  :remove_boilerplate => true
    # uses the Boilerpipe library that ships with Tika to remove headers & footers

== Note on Patches/Pull Requests
 
* Fork the project.
* Make your feature addition or bug fix.
* Add tests for it. This is important so I don't break it in a
  future version unintentionally.
* Commit, do not mess with rakefile, version, or history.
  (if you want to have your own version, that is fine but bump version in a commit by itself I can ignore when I pull)
* Send me a pull request. Bonus points for topic branches.

== Copyright

Copyright (c) 2010 Pradeep Elankumaran. See LICENSE for details.

Version data entries

1 entries across 1 versions & 1 rubygems

Version Path
rtika-0.3.0 README.rdoc