103

First, apologies if this has been asked before - I searched through the existing posts for a while, but could not find an answer.

I am interested in a solution for Fedora to OCR a multipage non-searchable PDF and to turn this PDF into a new PDF file that contains the text layer on top of the image. On Mac OSX or Windows we could use Adobe Acrobat, but is there a solution on Linux, specifically on Fedora?

This seems to describe a solution - but unfortunately I am already lost when retrieving exact-image.

1
  • 2
    There is a problem with the nice pdfocr script that the page you are linking to recommends: it relies upon pdftk which is essentially deprecated (for two reasons, its dependence on libgcj and on iText5+). So a different solution is needed anyway... Commented Mar 14, 2017 at 6:04

4 Answers

143

OCRmyPDF

ocrmypdf does a good job and can be used like this:

# Create a new searchable PDF/A file from a scanned PDF or image file:
ocrmypdf input_file output.pdf

# Replace a scanned PDF file with a searchable PDF file:
ocrmypdf file.pdf file.pdf

# Skip pages of a mixed-format input PDF file that already contain text:
ocrmypdf --skip-text input.pdf output.pdf

# Clean, de-skew, and rotate pages of a poor scan:
ocrmypdf --clean --deskew --rotate-pages path/to/input_file path/to/output.pdf

# Set the metadata of the searchable PDF file:
ocrmypdf --title "title" --author "author" --subject "subject" --keywords "keyword; key phrase; ..." path/to/input_file path/to/output.pdf

To install:

pip install ocrmypdf

or

sudo apt install ocrmypdf     # Ubuntu or Debian
sudo dnf -y install ocrmypdf  # Fedora

Alternatively:

sudo snap install ocrmypdf    # Ubuntu, usually newer than the apt package
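
For a whole folder of scans, the commands above can be wrapped in a small loop. This is only a sketch: the function name ocr_dir, the OCR_CMD variable, and the directory names are made up for illustration. --skip-text and -l are existing ocrmypdf options, and the language codes are Tesseract's (the matching language data has to be installed).

```shell
#!/bin/sh
# Sketch: OCR every PDF in one directory with ocrmypdf and write the results
# to a separate output directory. ocr_dir, OCR_CMD, and the paths are
# illustrative assumptions, not part of ocrmypdf.

OCR_CMD="${OCR_CMD:-ocrmypdf}"

ocr_dir() {
    in_dir=$1
    out_dir=$2
    lang=${3:-eng}                  # Tesseract language code(s), e.g. "eng+deu"

    mkdir -p "$out_dir" || return 1
    for pdf in "$in_dir"/*.pdf; do
        [ -e "$pdf" ] || continue   # nothing matched: the glob stays literal
        name=$(basename "$pdf")
        # --skip-text leaves pages that already contain text untouched
        "$OCR_CMD" --skip-text -l "$lang" "$pdf" "$out_dir/$name" \
            || echo "failed: $pdf" >&2
    done
}

# Example call (placeholder paths):
# ocr_dir ./scans ./searchable eng
```

Writing to a separate directory keeps the original scans intact, so a failed run can simply be repeated.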
8
  • 7
    Used ocrmypdf on Fedora 30 (via dnf install) - worked like a charm. Commented Jan 23, 2020 at 13:02
  • 3
    Very good, thanks. Unlike the other OCR tools proposed in this thread, this one gives an output only slightly bigger than the original (image PDF). It would be even better if it could give a smaller output (text only): is that possible? Commented Apr 16, 2020 at 16:36
  • 7
    OCRmyPDF worked like a dream for me too. It’s based on Tesseract under the hood, so (among other things) handles many languages well: I just used it for a document in a mixture of English and Georgian (ქართული ენა) and got near-perfect results. Commented Feb 17, 2021 at 9:33
  • 2
    is there an option to output a simple .txt? Commented Oct 22, 2024 at 11:20
  • 3
    yes, there is: ocrmypdf --sidecar output.txt input.pdf output.pdf Commented Oct 22, 2024 at 13:05
24

pdfsandwich

Last version is from 2018-08-10.

After learning that Tesseract can now also produce searchable PDFs, I found the script pdfsandwich.

After installing dependencies (this might not be the complete list)

sudo dnf install svn ocaml unpaper tesseract

I followed the script's guide for compiling from source

Compile from sources

pdfsandwich is open source software (license: GPL). You can download the sources either as .tar.bz2 package from the download area on the project website or check them out by subversion:

svn checkout svn://svn.code.sf.net/p/pdfsandwich/code/trunk/src pdfsandwich

If OCaml is installed on your system, you can compile and install as follows:

cd pdfsandwich
./configure
make
sudo make install

and this now allows me to run

pdfsandwich multipaged-non-searchable.pdf

resulting in a searchable PDF.

Here is a list of repositories (e.g., Debian Stable, AUR, Homebrew) containing pdfsandwich.

8

OCRFeeder

An easy tool available on Ubuntu is 'ocrfeeder'. It allows the generation of PDFs with OCR text overlaid on the original documents. It makes use of Tesseract plus other OCR engines (not sure which) and also provides image rotation, 'unpaper' cleanup, etc.

6

I had this same problem so I wrote this over the weekend. Give it a shot; it works great! It is a simple wrapper around tesseract. It uses pdftoppm to convert a PDF into a bunch of TIFF files, then it uses tesseract to perform OCR (Optical Character Recognition) on them and produce a searchable PDF as output. All intermediate temporary files are automatically deleted when the script completes.
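
The pipeline described above can be sketched roughly as follows. This is not the actual PDF2SearchablePDF source, just a minimal illustration: the function name pdf_to_searchable, the 300 DPI setting, and the default output name are assumptions. It relies on pdftoppm's -tiff and -r options and on tesseract accepting a text file that lists image paths.

```shell
#!/bin/sh
# Sketch of a pdftoppm + tesseract pipeline. Not the PDF2SearchablePDF code;
# the function name and defaults below are assumptions for illustration.

pdf_to_searchable() {
    input=$1
    output_base=${2:-${input%.pdf}_searchable}   # tesseract appends ".pdf"
    lang=${3:-eng}

    workdir=$(mktemp -d) || return 1

    # 1. Render every page to a TIFF image (page-1.tif, page-2.tif, ...).
    pdftoppm -tiff -r 300 "$input" "$workdir/page" \
        || { rm -rf "$workdir"; return 1; }

    # 2. Write the image paths to a list file, one per line. Assumes pdftoppm
    #    zero-pads page numbers so that glob order equals page order.
    for img in "$workdir"/page-*.tif; do
        printf '%s\n' "$img"
    done > "$workdir/pages.txt"

    # 3. OCR all listed images into a single multi-page searchable PDF.
    tesseract "$workdir/pages.txt" "$output_base" -l "$lang" pdf
    rc=$?

    # 4. Remove the intermediate files.
    rm -rf "$workdir"
    return $rc
}

# Example call (placeholder file name):
# pdf_to_searchable mypdf.pdf
```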

Source code: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF

Instructions to install & use pdf2searchablepdf:

Tested on Ubuntu 18.04 on 11 Nov. 2019 and on Ubuntu 20.04 in Nov. 2020.

Install:

git clone https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF.git
./PDF2SearchablePDF/install.sh

sudo apt update
sudo apt install tesseract-ocr

Use:

# General:
pdf2searchablepdf [options] <input.pdf|dir_of_imgs> [lang]

# Make a PDF searchable:
pdf2searchablepdf mypdf.pdf

# Make an entire directory of images into a single searchable PDF:
pdf2searchablepdf directory_of_imgs

You'll now have a pdf called mypdf_searchable.pdf, which contains searchable text!

Done. It has no Python dependencies, as it's currently written entirely in Bash.

See pdf2searchablepdf -h for the help menu and more options and examples.

References or Related Resources:

  1. PDF2SearchablePDF: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF
  2. https://askubuntu.com/questions/473843/how-to-turn-a-pdf-into-a-text-searchable-pdf/1187881#1187881
  3. https://askubuntu.com/questions/16268/whats-the-best-simplest-ocr-solution
  4. https://askubuntu.com/questions/150100/extracting-embedded-images-from-a-pdf/1187844#1187844
  5. pdfsandwich: Alternative software wrapper I just discovered, that is worth checking out too! http://www.tobias-elze.de/pdfsandwich/
10
  • Good utility. One thing you might do is add support for file names with spaces in them. Right now, that doesn't work (you get a usage message for pdftoppm). Just adding a few quotation marks in some of the commands should do it. Commented Jan 4, 2020 at 1:01
  • 1
    Thanks for the feedback! I'll see when I can make the change and test it. I opened an issue here: github.com/ElectricRCAircraftGuy/PDF2SearchablePDF/issues/6 Commented Jan 5, 2020 at 8:51
  • 1
    @WilsonF, done! v0.4.0 just released to resolve this issue. github.com/ElectricRCAircraftGuy/PDF2SearchablePDF/releases Commented Mar 15, 2020 at 3:54
  • 1
    Excellent utility. Faster and more stable than my brief experiments with ocrmypdf and pdfsandwich. This on Ubuntu 18.04 and only a couple of PDF scanned documents as images. My issues arose when having relatively high resolutions (300dpi). Commented Dec 1, 2021 at 14:57
  • 1
    @GabrielStaples pdf2searchablepdf is fast and stable. I had issues with ocrmypdf and pdfsandwich when doing only a couple of tests with resolutions >= 300dpi. Commented Dec 3, 2021 at 15:13
