Text can be extracted from an image using image processing. This technique is called Optical Character Recognition (OCR).
A popular OCR engine is Tesseract, which runs on various operating systems.
OCR with tesseract
You can do OCR in Python by calling the tesseract binary. First install tesseract on your system, then run the code below.
The code starts the tesseract process with the input image as an argument. The function returns the text that tesseract recognized, and the script prints it to the screen.
import os
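A minimal sketch of this approach, not the original listing: it assumes the `tesseract` binary is on your PATH and uses the special output name `stdout` so that tesseract prints the recognized text instead of writing a file. The function names `build_command` and `ocr` and the default language `eng` are choices made for this example.

```python
import subprocess


def build_command(image_path, lang="eng"):
    """Return the tesseract command line; 'stdout' makes tesseract print the text."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr(image_path, lang="eng"):
    """Run the tesseract binary on image_path and return the recognized text."""
    result = subprocess.run(
        build_command(image_path, lang),
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output bytes to str
        check=True,           # raise CalledProcessError if tesseract fails
    )
    return result.stdout
```

With an image file in the current directory (say a hypothetical `ocr.png`), `print(ocr("ocr.png"))` would show the recognized text in the terminal.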
You can use any image to test the program, but it should be very clear: no rotation, no blur and no busy background. Plain black text on a white background works best. If your image is not clear, do some image preprocessing before running tesseract.
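One common preprocessing step is binarization: every pixel becomes either pure black or pure white. The sketch below shows the idea on a plain list of grayscale values so it runs without extra packages. The function name `binarize`, the threshold of 128 and the sample data are assumptions for illustration. On a real image you would typically do the same with an imaging library, for example Pillow's `Image.convert("L")` followed by `Image.point`.

```python
def binarize(rows, threshold=128):
    """Map each grayscale value (0-255) to pure black (0) or pure white (255)."""
    return [[255 if value >= threshold else 0 for value in row] for row in rows]


# A tiny hypothetical 2x3 grayscale "image" with dark and light pixels mixed.
sample = [
    [30, 200, 90],
    [220, 10, 180],
]

cleaned = binarize(sample)
print(cleaned)  # [[0, 255, 0], [255, 0, 255]]
```

A fixed threshold is the simplest choice; unevenly lit scans may need a different value or an adaptive method.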
Run the program to see the recognized text printed in the terminal.
The test image used here contains the famous “Lorem ipsum” text.
Besides calling the OCR engine directly, you could use one of these modules:
- pytesseract
- pyocr
- tesserwrap
- pytesser
They all use the same OCR engine beneath: tesseract.
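As an illustration of the wrapper route, here is a minimal sketch using pytesseract together with Pillow. It assumes both packages and the tesseract binary are installed. The function name `ocr_with_pytesseract` is made up for this example, while `pytesseract.image_to_string` and `PIL.Image.open` are the libraries' own calls.

```python
def ocr_with_pytesseract(image_path, lang="eng"):
    """Return the text tesseract recognizes in image_path, via the pytesseract wrapper."""
    # Imported here so the file still loads when the optional packages are missing.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang)
```

Calling `ocr_with_pytesseract("ocr.png")` on a hypothetical image file should return the same text as running the binary directly, since the wrapper starts tesseract for you.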

How can I find those modules (pytesser, tesserwrap)?
You can install those modules, including PyTesser and TesserWrap, with the pip package manager; they are available on PyPI. If pip is not installed on your computer, install it using your system's package manager or during the Python setup process.
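To see which of the wrappers are already available in your Python environment, a small check like the one below can help. It is only a sketch: the helper name `is_installed` is made up here, and it assumes each module's import name matches the name in the list above.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can find a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


# Module names taken from the list above; import names are assumed to match.
for name in ("pytesseract", "pyocr", "tesserwrap"):
    if is_installed(name):
        print(name, "-> installed")
    else:
        print(name, "-> missing, try: pip install", name)
```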