How to OCR a PDF file and get the text stored within the PDF?

Question

First, apologies if this has been asked before - I searched for a while through the existing posts, but could not find support.

I am interested in a solution for Fedora to OCR a multipage non-searchable PDF and to turn this PDF into a new PDF file that contains the text layer on top of the image. On Mac OSX or Windows we could use Adobe Acrobat, but is there a solution on Linux, specifically on Fedora?

This seems to describe a solution - but unfortunately I am already lost when retrieving exact-image.

There is a problem with the nice pdfocr script that the page you are linking to recommends: it relies on pdftk, which is essentially deprecated (for two reasons: its dependence on libgcj and on iText5+). So a different solution is needed anyway...
– Maxim, Commented Mar 14, 2017 at 6:04

Pablo A · Accepted Answer · 2025-10-18 03:55:07Z

143

OCRmyPDF

ocrmypdf does a good job and can be used like this:

# Create a new searchable PDF/A file from a scanned PDF or image file:
ocrmypdf input_file output.pdf

# Replace a scanned PDF file with a searchable PDF file:
ocrmypdf file.pdf file.pdf

# Skip pages of a mixed-format input PDF file that already contain text:
ocrmypdf --skip-text input.pdf output.pdf

# Clean, de-skew, and rotate pages of a poor scan:
ocrmypdf --clean --deskew --rotate-pages path/to/input_file path/to/output.pdf

# Set the metadata of the searchable PDF file:
ocrmypdf --title "title" --author "author" --subject "subject" --keywords "keyword; key phrase; ..." path/to/input_file path/to/output.pdf
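
For a whole directory of scans, the commands above can be wrapped in a small loop. This is only a sketch, not part of OCRmyPDF: the function name ocr_dir and its arguments are made up for illustration, and it assumes ocrmypdf is on the PATH and that the Tesseract language data for the requested language is installed.

```shell
# Hypothetical helper (not part of OCRmyPDF): OCR every PDF in one directory
# and write the searchable copies to another directory.
# Usage: ocr_dir <input_dir> <output_dir> [tesseract_lang]
ocr_dir() {
    in_dir=$1
    out_dir=$2
    lang=${3:-eng}    # e.g. "eng+deu" for mixed-language documents

    mkdir -p "$out_dir" || return 1
    for pdf in "$in_dir"/*.pdf; do
        [ -e "$pdf" ] || continue    # the glob matched nothing
        # --skip-text leaves pages that already contain text untouched.
        ocrmypdf -l "$lang" --skip-text "$pdf" "$out_dir/$(basename "$pdf")" \
            || echo "failed: $pdf" >&2
    done
}
```

Example call: ocr_dir ~/scans ~/scans-ocr eng+deu. Files that fail are reported on stderr and the loop carries on with the next one.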

To install:

pip install ocrmypdf

or

sudo apt install ocrmypdf     # Ubuntu or Debian
sudo dnf -y install ocrmypdf  # Fedora

Alternatively:

sudo snap install ocrmypdf    # Ubuntu, usually more fresh than the apt package
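
To check that a PDF really has a text layer after OCR, one option is to try extracting its text. A minimal sketch, assuming pdftotext (from the poppler-utils package) is installed; the function name pdf_has_text is made up for illustration.

```shell
# Hypothetical helper: succeed if the PDF contains any extractable text.
# Assumes pdftotext from poppler-utils is installed.
pdf_has_text() {
    # "-" sends the extracted text to stdout instead of a .txt file.
    pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}
```

Example call: pdf_has_text output.pdf && echo "searchable". A scanned PDF without a text layer yields no letters or digits, so the function returns a non-zero status.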

edited Oct 18 at 3:55 by Pablo A

answered Feb 3, 2018 at 19:23 by Eduard Florinescu

7

Used ocrmypdf on Fedora 30 (via dnf install) - worked like a charm.

– Heinrich Ulbricht, Commented Jan 23, 2020 at 13:02
3

Very good, thanks. Unlike the other OCR tools proposed in this thread, this one produces output only slightly bigger than the original (image PDF). It would be even better if it could produce a smaller output (text only): is that possible?

– Duns, Commented Apr 16, 2020 at 16:36
7

OCRmyPDF worked like a dream for me too. It’s based on Tesseract under the hood, so (among other things) handles many languages well: I just used it for a document in a mixture of English and Georgian (ქართული ენა) and got near-perfect results.

– PLL, Commented Feb 17, 2021 at 9:33
2

Is there an option to output a simple .txt?

– mario, Commented Oct 22, 2024 at 11:20
3

Yes, there is: ocrmypdf --sidecar output.txt input.pdf output.pdf

– mario, Commented Oct 22, 2024 at 13:05


Pablo A · Accepted Answer · 2025-10-18 03:36:35Z

24

pdfsandwich

(Last version is from 2018-08-10.)

After learning that Tesseract can now also produce searchable PDFs, I found the script pdfsandwich.

After installing dependencies (this might not be the complete list)

sudo dnf install svn ocaml unpaper tesseract

I followed the script's guide for compiling from source

Compile from sources

pdfsandwich is open source software (license: GPL). You can download the sources either as .tar.bz2 package from the download area on the project website or check them out by subversion:

svn checkout svn://svn.code.sf.net/p/pdfsandwich/code/trunk/src pdfsandwich

If OCaml is installed on your system, you can compile and install as follows:

cd pdfsandwich
./configure
make
sudo make install

and this now allows me to run

pdfsandwich multipaged-non-searchable.pdf

resulting in a searchable PDF.

Here is a list of repositories (e.g., Debian Stable, AUR, Homebrew) containing pdfsandwich.
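
For documents that are not in English, pdfsandwich can pass a language on to Tesseract. A minimal sketch, assuming the installed pdfsandwich accepts the -lang and -o options (check pdfsandwich -h on your version) and that the matching Tesseract language data is installed; the function name sandwich_lang is made up for illustration.

```shell
# Hypothetical wrapper: run pdfsandwich with an explicit OCR language.
# Usage: sandwich_lang <input.pdf> [tesseract_lang]
sandwich_lang() {
    in_pdf=$1
    lang=${2:-eng}
    # Write next to the input, e.g. scan.pdf -> scan_ocr.pdf.
    out_pdf=${in_pdf%.pdf}_ocr.pdf
    pdfsandwich -lang "$lang" -o "$out_pdf" "$in_pdf"
}
```

Example call: sandwich_lang scan.pdf deu.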

edited Oct 18 at 3:36 by Pablo A

answered Aug 4, 2016 at 15:39 by ingli

For a related but separate question that builds on this one, see unix.stackexchange.com/questions/306051/…

– ingli, Commented Aug 27, 2016 at 18:25
4

FWIW: pdfsandwich is also available in Ubuntu's apt package repository. Other distros might have it as well.

– Laurence Gonsalves, Commented Mar 14, 2018 at 6:25

Just came across fedoramagazine.org/4-cool-new-projects-try-copr-october-2018, which shows a COPR package for Fedora that packages pdfsandwich.

– ingli, Commented Oct 26, 2018 at 8:59


Pablo A · Accepted Answer · 2025-10-18 03:44:14Z

8

OCRFeeder

An easy tool available on Ubuntu is 'ocrfeeder'. It generates PDFs with OCR text overlaid on the original documents. It uses Tesseract plus other OCR engines (not sure which) and also provides image rotation, 'unpaper' cleanup, etc.

https://wiki.gnome.org/OCRFeeder Older info
https://github.com/GNOME/ocrfeeder Read-only mirror

edited Oct 18 at 3:44 by Pablo A

answered Oct 18, 2018 at 4:14 by jdpipe


Gabriel Staples · Accepted Answer · 2022-01-20 07:07:04Z

6

I had this same problem so I wrote this over the weekend. Give it a shot; it works great! It is a simple wrapper around tesseract. It uses pdftoppm to convert a PDF into a bunch of TIFF files, then it uses tesseract to perform OCR (Optical Character Recognition) on them and produce a searchable PDF as output. All intermediate temporary files are automatically deleted when the script completes.
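
The pipeline described above can be sketched in a few lines. This is not the actual pdf2searchablepdf source: the function name, the 300 dpi resolution and the default language are illustrative assumptions, and it expects pdftoppm (poppler-utils) and tesseract to be installed.

```shell
# Sketch of the pdftoppm + tesseract pipeline described above.
# Not the actual pdf2searchablepdf source; name and defaults are illustrative.
# Usage: pdf_to_searchable <input.pdf> [tesseract_lang]
pdf_to_searchable() {
    in_pdf=$1
    lang=${2:-eng}
    out_base=${in_pdf%.pdf}_searchable    # tesseract appends ".pdf"

    tmp_dir=$(mktemp -d) || return 1

    # 1. Render each page to a TIFF image (300 dpi is an assumed default).
    pdftoppm -tiff -r 300 "$in_pdf" "$tmp_dir/page" || { rm -rf "$tmp_dir"; return 1; }

    # 2. List the page images in order; pdftoppm zero-pads page numbers,
    #    so the sorted glob matches page order.
    printf '%s\n' "$tmp_dir"/page-*.tif > "$tmp_dir/pages.txt"

    # 3. OCR all listed pages into one searchable PDF.
    tesseract "$tmp_dir/pages.txt" "$out_base" -l "$lang" pdf
    status=$?

    # 4. Remove the intermediate files.
    rm -rf "$tmp_dir"
    return "$status"
}
```

Example call: pdf_to_searchable mypdf.pdf, which would write mypdf_searchable.pdf next to the input.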

Source code: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF

Instructions to install & use `pdf2searchablepdf`:

Tested on Ubuntu 18.04 on 11 Nov. 2019 and on Ubuntu 20.04 in Nov. 2020.

Install:

git clone https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF.git
./PDF2SearchablePDF/install.sh

sudo apt update
sudo apt install tesseract-ocr

Use:

# General:
pdf2searchablepdf [options] <input.pdf|dir_of_imgs> [lang]

# Make a PDF searchable:
pdf2searchablepdf mypdf.pdf

# Make an entire directory of images into a single searchable PDF:
pdf2searchablepdf directory_of_imgs

You'll now have a pdf called mypdf_searchable.pdf, which contains searchable text!

Done. It has no Python dependencies, as it's currently written entirely in Bash.

See pdf2searchablepdf -h for the help menu and more options and examples.

References or Related Resources:

PDF2SearchablePDF: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF
https://askubuntu.com/questions/473843/how-to-turn-a-pdf-into-a-text-searchable-pdf/1187881#1187881
https://askubuntu.com/questions/16268/whats-the-best-simplest-ocr-solution
https://askubuntu.com/questions/150100/extracting-embedded-images-from-a-pdf/1187844#1187844
pdfsandwich: Alternative software wrapper I just discovered, that is worth checking out too! http://www.tobias-elze.de/pdfsandwich/

edited Jan 20, 2022 at 7:07

answered Nov 11, 2019 at 9:22 by Gabriel Staples

Good utility. One thing you might do is add support for file names with spaces in them. Right now, that doesn't work (you get a usage message for pdftoppm). Just adding a few quotation marks in some of the commands should do it.

– Wilson F, Commented Jan 4, 2020 at 1:01
1

Thanks for the feedback! I'll see when I can make the change and test it. I opened an issue here: github.com/ElectricRCAircraftGuy/PDF2SearchablePDF/issues/6

– Gabriel Staples, Commented Jan 5, 2020 at 8:51
1

@WilsonF, done! v0.4.0 just released to resolve this issue. github.com/ElectricRCAircraftGuy/PDF2SearchablePDF/releases

– Gabriel Staples, Commented Mar 15, 2020 at 3:54
1

Excellent utility. Faster and more stable than my brief experiments with ocrmypdf and pdfsandwich. This on Ubuntu 18.04 and only a couple of PDF scanned documents as images. My issues arose when having relatively high resolutions (300dpi).

– Patrick Refondini, Commented Dec 1, 2021 at 14:57
1

@GabrielStaples pdf2searchablepdf is fast and stable. I had issues with ocrmypdf and pdfsandwich in only a couple of tests at resolutions >= 300dpi.

– Patrick Refondini, Commented Dec 3, 2021 at 15:13

