README.md in ocr-file-0.0.4 vs README.md in ocr-file-0.0.6
- old
+ new
@@ -47,10 +47,11 @@
automatic_reprocess: true, # May double (or more) the number of operations, but can produce better results automatically
# PDF to Image Processing
optimise_pdf: true,
extract_pdf_images: true, # if false will screenshot each PDF page
temp_filename_prefix: 'image',
+ spelling_correction: true, # Will attempt to fix text at the end (not used for searchable pdf output)
# Console Output
verbose: true,
timing: true,
}
@@ -74,10 +75,11 @@
)
doc.to_pdf
# How to merge files into a single PDF:
+ # The files can be images or other PDFs
file_paths = [] # paths of the files to merge (name must match the `file_paths.map` call below)
documents = file_paths.map { |path| OcrFile::ImageEngines::PdfEngine.open_pdf(path, password: '') }
merged_document = OcrFile::ImageEngines::PdfEngine.merge(documents)
OcrFile::ImageEngines::PdfEngine.save_pdf(merged_document, save_file_path, optimise: true)
```
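The merge snippet above starts from an empty list of paths. As a minimal sketch of one way to fill it, the hypothetical helper below (`collect_mergeable_paths`, not part of `ocr-file`) gathers files from a directory. The extension list is an assumption; the README only says the inputs can be images or other PDFs.

```ruby
require 'tmpdir'

# Assumed set of extensions; adjust to whatever your inputs actually are.
MERGEABLE_EXTENSIONS = %w[.pdf .png .jpg .jpeg .tif .tiff].freeze

# Hypothetical helper: returns the sorted paths of regular files in
# `directory` whose extension (case-insensitive) is in the list above.
def collect_mergeable_paths(directory)
  Dir.children(directory)
     .sort
     .map { |name| File.join(directory, name) }
     .select do |path|
       File.file?(path) &&
         MERGEABLE_EXTENSIONS.include?(File.extname(path).downcase)
     end
end
```

The result could then be assigned to `file_paths` before the `open_pdf` / `merge` calls shown above. Sorting by name keeps the page order predictable.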
@@ -118,9 +120,14 @@
- Tests
- Configurable temp folder cleanup
- Improve console output
- Fix spaces in file names
- Better verbosity
+- Docker
+- pdftk / pdf merge for text and bookmarks etc ...
+ - https://github.com/tesseract-ocr/tesseract/issues/660
+ - tesseract -c naked_pdf=true
+-
### Tests
To run tests execute:
$ rake test