Extract text from image

Extracting text from an image is a classic image processing task. The technique is called Optical Character Recognition (OCR).

A popular OCR engine is tesseract, which is available for various operating systems.

OCR with tesseract

You can do OCR in Python by using the tesseract binary. The first step is to install tesseract on your system. Then you can run the code below.
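
Before running the OCR code, it can help to confirm that the tesseract binary is reachable from Python. The snippet below is a minimal sketch, assuming the executable is named tesseract and is on your PATH. The name find_binary is a hypothetical helper, not part of any library.

```python
import shutil

def find_binary(name='tesseract'):
    """Return the full path of an executable on PATH, or None if it is missing.

    find_binary is a hypothetical helper name used only for this sketch.
    """
    return shutil.which(name)

location = find_binary()
if location is None:
    print('tesseract was not found on PATH; install it first')
else:
    print('tesseract found at', location)
```

If the binary is reported as missing, install tesseract with your system's package manager or add its install directory to PATH.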

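The function starts the tesseract process with the input image and an output file name as arguments. Tesseract writes the recognized text to that output file, with a .txt extension added. The function reads the file back, cleans up, and returns the contents. The program then prints the returned text to the screen.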

import os
import tempfile
import subprocess

def ocr(path):
    # Work in a temporary directory that is removed automatically afterwards
    with tempfile.TemporaryDirectory() as tmpdir:
        # tesseract appends '.txt' to the output base name it is given
        base = os.path.join(tmpdir, 'out')

        # Run tesseract; check=True raises CalledProcessError if it fails
        subprocess.run(['tesseract', path, base], stdout=subprocess.PIPE,
                       stderr=subprocess.STDOUT, check=True)

        with open(base + '.txt', 'r', encoding='utf-8') as handle:
            return handle.read()

text = ocr('image.png')
print(text)

You can use any image to test the program, but it should be a very clear image. It shouldn’t have rotation, blur or a busy background. Dark text on a plain white background works best. If your image is not clear, you need to do some image preprocessing before running tesseract.
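
One common preprocessing step is to convert the image to grayscale and then threshold it to pure black and white. The sketch below assumes the Pillow library is installed. The names threshold_table and preprocess are hypothetical helpers, and the default threshold of 128 is an arbitrary starting point that usually needs tuning per image.

```python
def threshold_table(threshold=128):
    """Build a 256-entry lookup table mapping gray levels to black (0) or white (255)."""
    return [255 if level >= threshold else 0 for level in range(256)]

def preprocess(src_path, dst_path, threshold=128):
    """Save a grayscale, thresholded copy of src_path to dst_path (assumes Pillow)."""
    # Imported here so threshold_table stays usable without Pillow installed
    from PIL import Image

    with Image.open(src_path) as image:
        gray = image.convert('L')                         # 8-bit grayscale
        binary = gray.point(threshold_table(threshold))   # apply the lookup table
        binary.save(dst_path)
    return dst_path
```

You could then call preprocess('image.png', 'clean.png') and pass 'clean.png' to the OCR function instead of the original file. Rotation and blur need other techniques that this sketch does not cover.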

Run the OCR script to see the recognized text, which is printed to the terminal. The example image here contains the famous “Lorem ipsum” text, so that is what should appear.

Besides calling the OCR engine directly, you could use one of these modules:

pytesseract
pyocr
tesserwrap
pytesser

They all use the same OCR engine beneath: tesseract.
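
For comparison, here is roughly what the same task looks like with pytesseract, the first module in the list. This is a sketch that assumes pytesseract and Pillow are installed alongside the tesseract binary. The name ocr_with_pytesseract is a hypothetical helper, and the language code 'eng' is an assumed default.

```python
def ocr_with_pytesseract(path, lang='eng'):
    """Return the text recognized in the image at path (assumes pytesseract and Pillow)."""
    # Imported inside the function so defining it does not require the packages
    from PIL import Image
    import pytesseract

    with Image.open(path) as image:
        # image_to_string runs the tesseract binary and returns its text output
        return pytesseract.image_to_string(image, lang=lang)
```

Calling print(ocr_with_pytesseract('image.png')) should behave much like the subprocess version, since the wrapper still calls the tesseract binary underneath.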
