123

I am looking for documentation or examples on how to extract text from a PDF file using PDFMiner with Python.

It looks like PDFMiner updated its API, and all the relevant examples I have found contain outdated code (classes and methods have changed). The libraries I have found that make extracting text from a PDF file easier still use the old PDFMiner syntax, so I'm not sure how to do this.

As it is, I'm just looking at the source code to see if I can figure it out.

9 Comments
  • Please check out stackoverflow.com/help/how-to-ask and stackoverflow.com/help/mcve and update your answer so it is in a better format and aligns to the guidelines. Commented Oct 21, 2014 at 19:03
  • Which distribution of Python are you using, 2.7.x or 3.x.x? It should be noted that the author explicitly detailed that PDFminer doesn't work with Python 3.x.x. That might be the reason you're getting import errors. You should use pdfminer3k if so, as it is the standing Python 3 import of said library. Commented Oct 21, 2014 at 19:13
  • @Nanashi, sorry, I forgot to add my Python version. It's 2.7 so that isn't the issue. I have been looking through the source-code and it looks like they restructured some things which is why the imports are breaking. I can't find any documentation for PDFMiner either or I would just be working off of that :( Commented Oct 21, 2014 at 19:14
  • I have just literally installed PDFminer off from GitHub and it imports fine. Can you kindly post your code and post your full error traceback as well? Commented Oct 21, 2014 at 19:18
  • Possible duplicate of How do I use pdfminer as a library Commented Mar 2, 2016 at 7:17

6 Answers

217

Here is a working example of extracting text from a PDF file using the current version of PDFMiner (September 2016):

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0     # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means all pages

    # The with block closes the file even if a page fails to parse.
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                      password=password, caching=caching,
                                      check_extractable=True):
            interpreter.process_page(page)

    text = retstr.getvalue()
    device.close()
    retstr.close()
    return text

PDFMiner's structure changed recently, so this should work for extracting text from the PDF files.

Edit: Still working as of June 7, 2018. Verified with Python 3.x.

Edit: The solution works with Python 3.7 as of October 3, 2019. I used the Python library pdfminer.six, released in November 2018.
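If you need the text page by page, the string returned by convert_pdf_to_txt can be split afterwards. The sketch below assumes that TextConverter ends each page with a form feed character ('\x0c'), which is what I see in its output; the helper name split_pages is made up for illustration and is not part of PDFMiner.

```python
def split_pages(text):
    """Split extracted text into one string per page.

    Assumption: the converter wrote a form feed ("\f") after each page.
    """
    pages = text.split("\f")
    # A trailing form feed leaves one empty string at the end; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


# Stand-in for the value returned by convert_pdf_to_txt(path).
sample = "first page\n\fsecond page\n\f"
for number, page in enumerate(split_pages(sample), start=1):
    print(number, repr(page))
```

With a real file, you would pass the result of convert_pdf_to_txt('some.pdf') instead of the sample string.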


15 Comments

works fine, but, how can I deal with spaces in for example names? suppose I have a pdf that contains 4 columns where I have first- and lastname in one col, now it get parsed with firstname in one row and lastname in one row, here's an example docdro.id/rRyef3x
Currently getting an import error with this code: ImportError: No module named 'pdfminer.pdfpage'
Thanks, it works on Python 2.7.12 on Ubuntu 16.04. My sample PDF had an encoding issue, though, and setting the default encoding to UTF-8 before running this resolved it: import sys; reload(sys); sys.setdefaultencoding('utf-8')
@DuckPuncher, is it still working now? I had to change file(path, 'rb') to open(path, 'rb') to get mine to work.
Still working for Python3.7 users. Installed pdfminer.six==20181108 package. Best solution so far for my case and I compared numerous solutions.
35

This works in May 2020 using pdfminer.six in Python 3.

Installing the package

$ pip install pdfminer.six

Importing the package

from pdfminer.high_level import extract_text

Using a PDF saved on disk

text = extract_text('report.pdf')

Or alternatively:

with open('report.pdf','rb') as f:
    text = extract_text(f)

Using PDF already in memory

If the PDF is already in memory, for example if retrieved from the web with the requests library, it can be converted to a stream using the io library:

import io

import requests

# url is the address of the PDF to download
response = requests.get(url)
text = extract_text(io.BytesIO(response.content))
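When the bytes come from the web, the response is sometimes an HTML error page rather than a PDF. A minimal sketch of a sanity check before extraction follows; the helper name pdf_stream_from_bytes is hypothetical, and looking for the "%PDF-" marker in the first 1024 bytes is a heuristic of mine, not something pdfminer.six requires.

```python
import io


def pdf_stream_from_bytes(data):
    """Wrap raw bytes in a BytesIO after a rough PDF signature check."""
    # Heuristic (assumption): a PDF carries the "%PDF-" marker near its start.
    if b"%PDF-" not in data[:1024]:
        raise ValueError("content does not look like a PDF")
    return io.BytesIO(data)


# Fake payload standing in for response.content.
stream = pdf_stream_from_bytes(b"%PDF-1.7\n...rest of the file...")
print(stream.read(8))
```

The returned stream can then be handed to extract_text in place of io.BytesIO(response.content).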

Performance and Reliability compared with PyPDF2

pdfminer.six works more reliably than PyPDF2, which fails with certain types of PDFs, in particular PDF version 1.7.

However, text extraction with pdfminer.six is slower than with PyPDF2 by roughly a factor of 6.

I timed text extraction with timeit on a 15" MBP (2018), timing only the extraction function (no file opening etc.) with a 10-page PDF and got the following results:

PDFminer.six: 2.88 sec
PyPDF2:       0.45 sec
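For anyone who wants to repeat this kind of measurement, here is a rough sketch of a timeit-based harness. It is not the script used for the numbers above; the helper name time_extraction and the dummy extractor are placeholders of mine.

```python
import timeit


def time_extraction(extract, path, number=3):
    """Return the average seconds taken by extract(path) over `number` runs."""
    total = timeit.timeit(lambda: extract(path), number=number)
    return total / number


def dummy_extract(path):
    # Placeholder so the sketch runs without any PDF library installed.
    return "text from " + path


print("%.6f sec" % time_extraction(dummy_extract, "report.pdf"))
```

To compare libraries, pass each library's extraction function (for example extract_text from pdfminer.high_level) in place of dummy_extract and use the same file for both.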

pdfminer.six also has a large footprint: it requires pycryptodome, which needs GCC and other build tools installed, pushing a minimal Docker image on Alpine Linux from 80 MB to 350 MB. PyPDF2 has no noticeable storage impact.

Update (2022-08-04): According to Martin Thoma, PyPDF2 has improved a lot in the past 2 years, so do give it a try as well. Here's his benchmark

3 Comments

PyPDF2 has had a lot of improvements since this answer was given; text extraction in particular has improved a lot. In my benchmark, PyPDF2's text extraction is now better than pdfminer's.
Do you have any tips on extracting Russian text? I just get stuff like this: ПРОЕКТНА(cid:601) ДЕКЛАРА(cid:592)И(cid:601) (cid:651) 30-000198 (cid:616)(cid:620) 06.06.2024, that is, a small amount of Cyrillic text and these weird (cid:number) markers. If I open the PDF in LibreOffice Draw, the (cid:number) parts are not visible at all. However, the PDF opens and displays correctly.
This doesn't address the scope of the question, which was specific to Python 2.7.
33

Terrific answer from DuckPuncher. For Python 3, make sure you install pdfminer2 and do:

import io

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    fp.close()
    device.close()
    text = retstr.getvalue()
    retstr.close()
    return text

9 Comments

It doesn't work for me: ModuleNotFoundError: No module named 'pdfminer.pdfpage' i am using python 3.6
@Atti, just in case, make sure that you have pdfminer2 installed, as there is another package pdfminer (I hate this). It works for pdfminer2==20151206 version when doing pip3 freeze.
thanks i got it working eventually, i installed pdfminer.six from conda forge
For Python 3, pdfminer.six is the recommended package - github.com/pdfminer/pdfminer.six
Is this still current. I'm getting the same ImportError: message
29

Full disclosure: I am one of the maintainers of pdfminer.six, a community-maintained version of pdfminer for Python 3.

Nowadays it has multiple APIs for extracting text from a PDF, depending on your needs. Behind the scenes, all of these APIs use the same logic for parsing and analyzing the layout.

(All the examples assume your PDF file is called example.pdf)

Command line

If you want to extract text just once, you can use the command-line tool pdf2txt.py:

$ pdf2txt.py example.pdf

High-level API

If you want to extract text (properties) with Python, you can use the high-level API. This approach is the go-to solution if you want to programmatically extract information from a PDF.

from pdfminer.high_level import extract_pages, extract_text

# Extract text from a pdf.
text = extract_text('example.pdf')

# Extract iterable of LTPage objects.
pages = extract_pages('example.pdf')
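As a rough illustration of what can be done with the iterable of LTPage objects, the sketch below collects the text of each page separately. The helper names are made up for this example, and the hasattr check is a duck-typing shortcut that relies on text elements exposing get_text(); checking against a pdfminer.layout text class would be the stricter variant.

```python
def text_by_page(pages):
    """Return one string per page from an iterable of page layouts."""
    result = []
    for page_layout in pages:
        parts = []
        for element in page_layout:
            # Duck typing (assumption): text elements expose get_text().
            if hasattr(element, "get_text"):
                parts.append(element.get_text())
        result.append("".join(parts))
    return result


def text_by_page_from_file(path):
    # Imported here so text_by_page stays usable without pdfminer.six.
    from pdfminer.high_level import extract_pages
    return text_by_page(extract_pages(path))
```

Calling text_by_page_from_file('example.pdf') would then give a list with one entry per page, with non-text elements such as figures and lines skipped.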

Composable API

There is also a composable API that gives a lot of flexibility in handling the resulting objects. For example, it allows you to create your own layout algorithm. This method is suggested in the other answers, but I would only recommend it when you need to customize some component.

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('example.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())

Similar question and answer here. I'll try to keep them in sync.

2 Comments

pdf2text should be imported from import tools.pdf2 module not the pdfminer.high_level
This doesn't address the scope of the question, which was specific to Python 2.7
1

This code was tested with pdfminer for Python 3 (pdfminer-20191125):

from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.layout import LTTextBoxHorizontal

def parsedocument(document):
    # convert all horizontal text into a lines list (one entry per line)
    # document is a file stream
    lines = []
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(document):
        interpreter.process_page(page)
        layout = device.get_result()
        for element in layout:
            if isinstance(element, LTTextBoxHorizontal):
                lines.extend(element.get_text().splitlines())
    return lines

5 Comments

I have PDF files which I am able to convert using the Nitro Pro tool. When I try to convert the same PDF using the code posted here, however, I get output which suggests that there is a permissions error. Here is the output: ('from the SAGE Social Science Collections. All Rights Reserved.\n\n\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c')
What do you mean a file stream?
@Vincent with open(file,'rb') as stream: [...]
do you manage to get this file as a table/pandas ideally? groupe-psa.com/en/publication/monthly-world-sales-march-2020
This doesn't address the scope of the question, which was specific to Python 2.7
0

For anyone trying to use pdfminer, you should switch to pdfminer.six which is the currently maintained version.

1 Comment

Or to PyPDF2 :-)
