README.md in henkei-1.28.5.2 vs README.md in henkei-2.2.0.1

- old
+ new

@@ -1,6 +1,6 @@
-[![Github Build Status](https://github.com/abrom/henkei/actions/workflows/test.yml/badge.svg)](https://github.com/abrom/henkei/actions/workflows/test.yml)
+[![Travis Build Status](http://img.shields.io/travis/abrom/henkei.svg?style=flat)](https://travis-ci.org/abrom/henkei)
 [![Maintainability](https://api.codeclimate.com/v1/badges/d06e8c917cf7d8c07234/maintainability)](https://codeclimate.com/github/abrom/henkei/maintainability)
 [![Test Coverage](https://api.codeclimate.com/v1/badges/d06e8c917cf7d8c07234/test_coverage)](https://codeclimate.com/github/abrom/henkei/test_coverage)
 [![Gem Version](http://img.shields.io/gem/v/henkei.svg?style=flat)](#)
 
 # Henkei 変形
@@ -19,10 +19,19 @@
 - Portable Document Format (.pdf)
 
 For the complete list of supported formats, please visit the Apache Tika
 [Supported Document Formats](http://tika.apache.org/0.9/formats.html) page.
 
+## Upgrading from v1.x to v2.x
+
+Apache Tika v2.x brings with it some changes. One key change is that the Tika client and server applications have
+been split up. To keep the gem size down Henkei will only include the client app. That is to say, each time you
+call to Henkei, a new Java process will be started, run your command, then terminate.
+
+Another change is the metadata keys. A lot of duplicate keys have been removed in favour of a more standards
+based approach. A list of the old vs new key names can be found [here](https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0#MigratingtoTika2.0.0-Metadata)
+
 ## Usage
 
 Text, metadata and MIME type information can be extracted by calling `Henkei.read` directly:
 
 ```ruby
@@ -65,9 +74,23 @@
 ```ruby
 post '/:name/:filename' do
   henkei = Henkei.new params[:data][:tempfile]
   henkei.text
 end
+```
+
+### Reading text from inside images (OCR)
+
+You can enable OCR by specifying the optional `include_ocr: true` when calling to the `text` or `html` instance methods,
+as well as the `read` class method.
+Note that Tika does indicate this will greatly increase processing time.
+
+```ruby
+henkei = Henkei.new 'sample.pages'
+text_with_ocr = henkei.text(include_ocr: true)
+html_with_ocr = henkei.html(include_ocr: true)
+
+data = File.read 'sample.pages'
+text_with_ocr = Henkei.read :text, data, include_ocr: true
 ```
 
 ### Reading metadata
 
 Metadata is returned as a hash.
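
Note on the metadata key change in the "Upgrading from v1.x to v2.x" hunk: code written against the old key names may need a small translation layer while it is being migrated. The following is a minimal sketch, not part of Henkei. The constant `LEGACY_KEY_MAP` and the helper `fetch_legacy_key` are hypothetical names, and the key pairs shown are illustrative assumptions that should be checked against the Tika migration table linked in the README.

```ruby
# Hypothetical mapping from Tika 1.x metadata key names to their 2.x names.
# The pairs below are illustrative assumptions; verify each one against the
# Tika 2.0.0 migration table before relying on it.
LEGACY_KEY_MAP = {
  'Author' => 'dc:creator',
  'title' => 'dc:title',
  'Last-Modified' => 'dcterms:modified'
}.freeze

# Looks up a value in a metadata hash using an old (v1.x) key name.
# Old names without a mapping are looked up unchanged.
def fetch_legacy_key(metadata, old_key, key_map = LEGACY_KEY_MAP)
  metadata[key_map.fetch(old_key, old_key)]
end

# A plain hash stands in for the metadata hash returned by Henkei.
metadata = { 'dc:creator' => 'Jane Doe', 'Content-Type' => 'application/pdf' }

author = fetch_legacy_key(metadata, 'Author')
content_type = fetch_legacy_key(metadata, 'Content-Type')
```

Keeping the mapping in one frozen hash means call sites can be moved to the new key names one at a time, and the helper can be deleted once none of them use the old names.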