README.md in henkei-1.28.5.2 vs README.md in henkei-2.2.0.1
- old
+ new
@@ -1,6 +1,6 @@
-[![Github Build Status](https://github.com/abrom/henkei/actions/workflows/test.yml/badge.svg)](https://github.com/abrom/henkei/actions/workflows/test.yml)
+[![Travis Build Status](http://img.shields.io/travis/abrom/henkei.svg?style=flat)](https://travis-ci.org/abrom/henkei)
[![Maintainability](https://api.codeclimate.com/v1/badges/d06e8c917cf7d8c07234/maintainability)](https://codeclimate.com/github/abrom/henkei/maintainability)
[![Test Coverage](https://api.codeclimate.com/v1/badges/d06e8c917cf7d8c07234/test_coverage)](https://codeclimate.com/github/abrom/henkei/test_coverage)
[![Gem Version](http://img.shields.io/gem/v/henkei.svg?style=flat)](#)
# Henkei 変形
@@ -19,10 +19,19 @@
- Portable Document Format (.pdf)
For the complete list of supported formats, please visit the Apache Tika
[Supported Document Formats](http://tika.apache.org/0.9/formats.html) page.
+## Upgrading from v1.x to v2.x
+
+Apache Tika v2.x brings some changes. One key change is that the Tika client and server applications have
+been split up. To keep the gem size down, Henkei includes only the client app. As a result, each call to
+Henkei starts a new Java process, which runs your command and then terminates.
+
+Another change affects the metadata keys. Many duplicate keys have been removed in favour of a more
+standards-based approach. The Tika migration guide lists the [old and new key names](https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0#MigratingtoTika2.0.0-Metadata).
+
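Code that has to work with metadata produced under both key schemes could look up a value by trying the new key first and falling back to the old one. The sketch below is only an illustration: `metadata_value` is a hypothetical helper that is not part of Henkei, and the key names `dc:creator` and `Author` are assumed examples that should be checked against the migration list linked above.

```ruby
# Hypothetical helper (not part of Henkei): return the value stored under the
# first candidate key that is present in the metadata hash, or nil if none is.
def metadata_value(metadata, *candidate_keys)
  key = candidate_keys.find { |k| metadata.key?(k) }
  key && metadata[key]
end

# Plain hashes standing in for a metadata hash returned by Henkei.
# The key names are assumptions used for illustration only.
v1_style = { 'Author' => 'Jane Doe' }
v2_style = { 'dc:creator' => 'Jane Doe' }

# Prefer the v2.x key and fall back to the v1.x key.
author_from_v1 = metadata_value(v1_style, 'dc:creator', 'Author')
author_from_v2 = metadata_value(v2_style, 'dc:creator', 'Author')
```

Listing the new key first means the fallback is only used for hashes that lack it, so the old key can be dropped from the call once it is no longer needed.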
## Usage
Text, metadata and MIME type information can be extracted by calling `Henkei.read` directly:
```ruby
@@ -65,9 +74,23 @@
```ruby
post '/:name/:filename' do
henkei = Henkei.new params[:data][:tempfile]
henkei.text
end
+```
+
+### Reading text from inside images (OCR)
+
+You can enable OCR by passing the optional `include_ocr: true` argument to the `text` or `html` instance methods,
+or to the `read` class method. Note that, according to Tika, this will greatly increase processing time.
+
+```ruby
+henkei = Henkei.new 'sample.pages'
+text_with_ocr = henkei.text(include_ocr: true)
+html_with_ocr = henkei.html(include_ocr: true)
+
+data = File.read 'sample.pages'
+text_with_ocr = Henkei.read :text, data, include_ocr: true
```
### Reading metadata
Metadata is returned as a hash.