# OCR-File
A tool to combine PDF tools, OCR tools and image processing into a
single interface as both a CLI and a library.

## Installation

Add this line to your application's Gemfile:

```ruby
gem 'ocr-file'
```

And then execute:

    $ bundle

Or install it yourself as:

    $ gem install ocr-file

### Other required dependencies
You will need `tesseract` installed on your system with the language data you want to use.
`pdftoppm` (part of Poppler) and ImageMagick also need to be available.
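Because these are external programs, it can help to confirm they are on your `PATH` before running OCR. The snippet below is a minimal sketch and not part of the gem: `executable_on_path?`, `missing_tools` and `REQUIRED_TOOLS` are hypothetical names, and the ImageMagick binary names (`magick` or `convert`) are an assumption about your installation. It does not handle Windows executable extensions.

```ruby
# Hypothetical helper, not part of ocr-file: checks whether a command
# exists as an executable file in any directory listed in PATH.
def executable_on_path?(command, path_env = ENV.fetch('PATH', ''))
  path_env.split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, command)
    File.file?(candidate) && File.executable?(candidate)
  end
end

# Assumed binary names for each tool; adjust for your platform.
REQUIRED_TOOLS = {
  'tesseract'   => ['tesseract'],
  'pdftoppm'    => ['pdftoppm'],
  'ImageMagick' => ['magick', 'convert']
}.freeze

# Returns the names of tools for which none of the candidate binaries were found.
def missing_tools(path_env = ENV.fetch('PATH', ''))
  not_found = REQUIRED_TOOLS.reject do |_name, binaries|
    binaries.any? { |bin| executable_on_path?(bin, path_env) }
  end
  not_found.keys
end

missing = missing_tools
puts missing.empty? ? 'All external tools found.' : "Missing: #{missing.join(', ')}"
```

A check like this only tells you the binaries exist; it does not verify that the Tesseract language data you need is installed.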

## Usage
```ruby
  require 'ocr-file'

  config = {
    # Images from PDF
    filetype: 'png',
    quality: 100,
    dpi: 300,
    # Text to PDF
    font: 'Helvetica',
    font_size: 5, #8 # 12
    text_x: 20,
    text_y: 800,
    minimum_word: 5,
    # Cloud-Vision OCR
    image_annotator: nil, # Needed for Cloud-Vision
    type_of_ocr: OcrFile::OcrEngines::CloudVision::DOCUMENT_TEXT_DETECTION,
    ocr_engine: 'tesseract', # 'cloud-vision'
    # Image Pre-Processing
    image_pre_preprocess: true,
    effects: ['bw', 'norm'],
    threshold: 0.25,
    # PDF to Image Processing
    optimise_pdf: true,
    extract_pdf_images: true, # if false will screenshot each PDF page
    temp_filename_prefix: 'image',
    # Console Output
    verbose: true,
  }

  doc = OcrFile::Document.new(
    original_file_path: '/path-to-original-file/', # supports PDFs and images
    save_file_path: '/folder-to-save-to/',
    config: config # Not needed as defaults are used when not provided
  )

  doc.to_s # Returns text, removes temp files and won't save
  doc.to_pdf # Saves a PDF (either searchable over the images or dumped text)
  doc.to_text # Saves a text file with OCR text

  # How to generate PDFs of images or text files:
  original_file_path = 'file.txt' # or 'file.png'

  doc = OcrFile::Document.new(
    original_file_path: original_file_path, # supports PDFs and images
    save_file_path: '/folder-to-save-to/',
    config: config # Not needed as defaults are used when not provided
  )

  doc.to_pdf

  # How to merge files into a single PDF:
  file_paths = [] # add the paths of the PDFs to merge
  documents = file_paths.map { |path| OcrFile::ImageEngines::PdfEngine.open_pdf(path, password: '') }
  merged_document = OcrFile::ImageEngines::PdfEngine.merge(documents)
  OcrFile::ImageEngines::PdfEngine.save_pdf(merged_document, save_file_path, optimise: true)
```
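The list of paths in the merge example has to be filled in by the caller. One way to build it, sketched here with only Ruby's standard library, is to glob a directory for PDFs and sort the result so the merge order is predictable. `pdf_paths_in` is a hypothetical helper, not part of the gem, and `/folder-with-pdfs/` is a placeholder path.

```ruby
# Hypothetical helper, not part of ocr-file: lists the PDF files directly
# inside `directory`, sorted by path so the merge order is deterministic.
def pdf_paths_in(directory)
  Dir.glob(File.join(directory, '*.pdf')).sort
end

file_paths = pdf_paths_in('/folder-with-pdfs/')
puts "Found #{file_paths.length} PDF(s) to merge"
```

Note that the `*.pdf` pattern may skip files with an upper-case `.PDF` extension on case-sensitive filesystems, and a missing directory simply yields an empty list.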

### Notes / Tips
Set `extract_pdf_images` to `false` for higher quality OCR. However, this will consume more temporary space per PDF page and will also be considerably slower.
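To switch between the faster and the higher-quality mode without duplicating the whole config, you can derive one hash from another with `Hash#merge`. This is a sketch in plain Ruby: `base_config` is a trimmed, illustrative subset of the options shown in Usage, and the variable names are hypothetical.

```ruby
# Illustrative subset of the options from the Usage section.
base_config = {
  dpi: 300,
  ocr_engine: 'tesseract',
  extract_pdf_images: true, # if false will screenshot each PDF page
  verbose: true
}.freeze

# Hash#merge returns a new hash, so base_config is left unchanged.
high_quality_config = base_config.merge(extract_pdf_images: false)

puts high_quality_config[:extract_pdf_images] # prints "false"
puts base_config[:extract_pdf_images]         # prints "true"
```

Either hash can then be passed as the `config:` argument when creating a document.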

Image pre-processing is not yet implemented.

## Development

After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).

### TODOs
- input validation
- CLI
- image processing
- password
- Base64 encoding
- requirements checking (installed dependencies etc ...)
- Tests
- Configurable temp folder cleanup
- Improve console output

### Tests
To run tests execute:

    $ rake test

## Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/trex22/ocr-file. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.

## License

The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).

## Code of Conduct

Everyone interacting in the OCR-File project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/trex22/ocr-file/blob/master/CODE_OF_CONDUCT.md).