Extracting text from a PDF file using PDFMiner in python? [closed]

Question

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.

This question does not appear to be about programming within the scope defined in the help center.

Closed last year.

The community reviewed whether to reopen this question last year and left it closed:

Original close reason(s) were not resolved

I am looking for documentation or examples on how to extract text from a PDF file using PDFMiner with Python.

It looks like PDFMiner updated their API, and all the relevant examples I have found contain outdated code (classes and methods have changed). The libraries I have found that make extracting text from a PDF file easier use the old PDFMiner syntax, so I'm not sure how to do this.

As it is, I'm just looking at source-code to see if I can figure it out.

Please check out stackoverflow.com/help/how-to-ask and stackoverflow.com/help/mcve and update your answer so it is in a better format and aligns to the guidelines. — Parker
– Parker, Commented Oct 21, 2014 at 19:03
Which distribution of Python are you using, 2.7.x or 3.x.x? It should be noted that the author explicitly detailed that PDFminer doesn't work with Python 3.x.x. That might be the reason you're getting import errors. You should use pdfminer3k if so, as it is the standing Python 3 import of said library. — WGS
– WGS, Commented Oct 21, 2014 at 19:13
@Nanashi, sorry, I forgot to add my Python version. It's 2.7 so that isn't the issue. I have been looking through the source-code and it looks like they restructured some things which is why the imports are breaking. I can't find any documentation for PDFMiner either or I would just be working off of that :( — RattleyCooper
– RattleyCooper, Commented Oct 21, 2014 at 19:14
I have just literally installed PDFminer off from GitHub and it imports fine. Can you kindly post your code and post your full error traceback as well? — WGS
– WGS, Commented Oct 21, 2014 at 19:18

Trenton McKinney · Accepted Answer · 2019-10-04 04:10:06Z

217

Here is a working example of extracting text from a PDF file using the current version of PDFMiner (September 2016):

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # in-memory buffer that collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()  # default layout analysis parameters
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0     # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means all pages

    # The context manager closes the file even if a page fails to parse.
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
            interpreter.process_page(page)

    text = retstr.getvalue()

    device.close()
    retstr.close()
    return text

PDFMiner's structure changed recently, so this should work for extracting text from PDF files.

Edit: Still working as of June 7, 2018. Verified with Python 3.x.

Edit: The solution works with Python 3.7 as of October 3, 2019. I used the Python library pdfminer.six, released in November 2018.
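
A small follow-up sketch, not part of the original answer: the string returned by convert_pdf_to_txt normally contains a form feed character at the end of each page, because TextConverter writes one after every page it processes. Assuming that behaviour holds for your pdfminer version, the text can be split back into pages with plain string handling. The helper name split_pages is hypothetical.

```python
FORM_FEED = '\x0c'


def split_pages(text):
    """Split extracted text into per-page strings on form feeds.

    Assumes each page's text is terminated by a form feed character.
    """
    pages = text.split(FORM_FEED)
    # A trailing form feed leaves one empty string at the end; drop it.
    if pages and pages[-1] == '':
        pages.pop()
    return pages


sample = 'first page\n\x0csecond page\n\x0c'
for number, page_text in enumerate(split_pages(sample), start=1):
    print(number, repr(page_text))
```

A run of consecutive form feeds in the output would then show up as a series of empty pages, which usually means those pages contain no extractable text (scanned images, for example).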

edited Oct 4, 2019 at 4:10

Trenton McKinney

answered Oct 21, 2014 at 19:47

RattleyCooper

15 Comments

Deusdeorum Over a year ago

works fine, but, how can I deal with spaces in for example names? suppose I have a pdf that contains 4 columns where I have first- and lastname in one col, now it get parsed with firstname in one row and lastname in one row, here's an example docdro.id/rRyef3x

Jeffrey Swan Over a year ago

Currently getting an import error with this code: ImportError: No module named 'pdfminer.pdfpage'

sib10 Over a year ago

Thanks it works on python v2.7.12 and on ubuntu 16.04, though it would be better to load the pdf document with encoding utf-8, because my sample pdf has some encoding issue so try this after encoding with utf-8 and it resolve the issue... import sys reload(sys) sys.setdefaultencoding('utf-8')

craned Over a year ago

@DuckPuncher, is it still working now? I had to change file(path, 'rb') to open(path, 'rb') to get mine to work.

aze45sq6d Over a year ago

Still working for Python3.7 users. Installed pdfminer.six==20181108 package. Best solution so far for my case and I compared numerous solutions.

Cornelius Roemer · Accepted Answer · 2022-08-08 11:36:52Z

35

This works as of May 2020 using pdfminer.six in Python 3.

Installing the package

$ pip install pdfminer.six

Importing the package

from pdfminer.high_level import extract_text

Using a PDF saved on disk

text = extract_text('report.pdf')

Or alternatively:

with open('report.pdf','rb') as f:
    text = extract_text(f)

Using PDF already in memory

If the PDF is already in memory, for example if retrieved from the web with the requests library, it can be converted to a stream using the io library:

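import io

import requests
from pdfminer.high_level import extract_text

# `url` is assumed to hold the address of the PDF to download.
response = requests.get(url)
text = extract_text(io.BytesIO(response.content))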
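
When the bytes come from a web request, the response may be an HTML error page rather than a PDF. The following is a minimal sketch, not part of the original answer, that checks for the "%PDF-" marker before wrapping the bytes in a stream. The helper name pdf_stream_from_bytes is hypothetical, and the check is deliberately strict: some real-world files carry a few stray bytes before the marker and would be rejected here.

```python
import io


def pdf_stream_from_bytes(data):
    """Wrap raw bytes in a binary stream after a basic sanity check."""
    # PDF files begin with the marker "%PDF-"; anything else (for example
    # an HTML error page returned by a server) is rejected early.
    if not data.startswith(b'%PDF-'):
        raise ValueError('data does not look like a PDF')
    return io.BytesIO(data)


stream = pdf_stream_from_bytes(b'%PDF-1.7\n...')
print(stream.read(8))  # b'%PDF-1.7'
```

The returned stream can be passed to extract_text in place of io.BytesIO(response.content).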

Performance and Reliability compared with PyPDF2

pdfminer.six works more reliably than PyPDF2, which fails with certain types of PDFs, in particular PDF version 1.7.

However, text extraction with pdfminer.six is significantly slower than with PyPDF2, by a factor of roughly 6.

I timed text extraction with timeit on a 15" MBP (2018), timing only the extraction function (no file opening etc.) with a 10 page PDF and got the following results:

PDFminer.six: 2.88 sec
PyPDF2:       0.45 sec
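
For readers who want to repeat this kind of measurement, here is a minimal sketch of timing a single extraction call with timeit. It is not the script used for the numbers above; time_extraction and fake_extract are hypothetical names, and the stand-in workload should be replaced with a real extraction call.

```python
import timeit


def time_extraction(extract, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` single runs."""
    # number=1 runs the callable once per measurement; min() reduces noise.
    return min(timeit.repeat(extract, number=1, repeat=repeat))


def fake_extract():
    # Stand-in workload; replace with a real call such as
    # lambda: extract_text(stream) to time pdfminer.six.
    return sum(range(100_000))


print(f'{time_extraction(fake_extract):.4f} sec')
```

Opening the file outside the timed callable keeps the measurement limited to the extraction function, as described above.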

pdfminer.six also has a huge footprint: it requires pycryptodome, which needs GCC and other build tools installed. This pushed a minimal Docker image on Alpine Linux from 80 MB to 350 MB. PyPDF2 has no noticeable storage impact.

Update (2022-08-04): According to Martin Thoma, PyPDF2 has improved a lot in the past 2 years, so do give it a try as well. Here's his benchmark

edited Aug 8, 2022 at 11:36

answered May 17, 2020 at 19:07

Cornelius Roemer

3 Comments

Martin Thoma Over a year ago

PyPDF2 had a lot of improvements since this answer was given. Especially the text extraction was improved a lot. In my benchmark the text extraction of PyPDF2 is now better than the one of pdfminer

Pedroski Over a year ago

Do you have any tips on extracting Russian text? I just get stuff like this: ПРОЕКТНА(cid:601) ДЕКЛАРА(cid:592)И(cid:601) (cid:651) 30-000198 (cid:616)(cid:620) 06.06.2024. There is a small amount of Cyrillic text and these weird (cid:number) markers. If I open the PDF in LibreOffice Draw, the (cid:number) parts are not visible at all. However, the PDF opens and displays correctly.

TylerH Over a year ago

This doesn't address the scope of the question, which was specific to Python 2.7.
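
On the (cid:number) markers reported by Pedroski: as far as I know, pdfminer emits these placeholders when a font provides no Unicode mapping for a character ID. The sketch below does not recover the missing characters; it only lists which IDs are affected, which can help when deciding whether OCR or a different tool is needed. The names CID_PATTERN and find_unmapped_cids are hypothetical.

```python
import re

# Matches the "(cid:NNN)" placeholders seen in the extracted text.
CID_PATTERN = re.compile(r'\(cid:(\d+)\)')


def find_unmapped_cids(text):
    """Return the sorted, de-duplicated character IDs found as placeholders."""
    return sorted({int(match) for match in CID_PATTERN.findall(text)})


sample = 'ПРОЕКТНА(cid:601) ДЕКЛАРА(cid:592)И(cid:601)'
print(find_unmapped_cids(sample))  # [592, 601]
```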

manish Prasad · Accepted Answer · 2020-03-07 16:08:44Z

33

Terrific answer from DuckPuncher. For Python 3, make sure you install pdfminer2 and do:

import io

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    fp.close()
    device.close()
    text = retstr.getvalue()
    retstr.close()
    return text

edited Mar 7, 2020 at 16:08

manish Prasad

answered Jun 10, 2017 at 18:34

juan Isaza

9 Comments

Atti Over a year ago

It doesn't work for me: ModuleNotFoundError: No module named 'pdfminer.pdfpage' i am using python 3.6

juan Isaza Over a year ago

@Atti, just in case, make sure that you have pdfminer2 installed, as there is another package pdfminer (I hate this). It works for pdfminer2==20151206 version when doing pip3 freeze.

Atti Over a year ago

thanks i got it working eventually, i installed pdfminer.six from conda forge

Mike Driscoll Over a year ago

For Python 3, pdfminer.six is the recommended package - github.com/pdfminer/pdfminer.six

user9410826 Over a year ago

Is this still current? I'm getting the same ImportError message.

Pieter · Accepted Answer · 2021-11-21 15:50:52Z

29

Full disclosure: I am one of the maintainers of pdfminer.six. It is a community-maintained version of pdfminer for Python 3.

Nowadays, it has multiple APIs to extract text from a PDF, depending on your needs. Behind the scenes, all of these APIs use the same logic for parsing and analyzing the layout.

(All the examples assume your PDF file is called example.pdf)

Command line

If you want to extract text just once, you can use the command-line tool pdf2txt.py:

$ pdf2txt.py example.pdf

High-level API

If you want to extract text (properties) with Python, you can use the high-level API. This approach is the go-to solution if you want to programmatically extract information from a PDF.

from pdfminer.high_level import extract_pages, extract_text

# Extract text from a pdf.
text = extract_text('example.pdf')

# Extract iterable of LTPage objects.
pages = extract_pages('example.pdf')
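
To show what can be done with the LTPage objects, here is a minimal sketch that collects the text of each page separately. It is not taken from the pdfminer.six documentation: text_by_page and extract_text_by_page are hypothetical names, and the hasattr check is a duck-typing shortcut chosen so the helper can be exercised without the library. An isinstance check against LTTextContainer from pdfminer.layout is the more explicit alternative.

```python
def text_by_page(pages):
    """Return one string per page, joining the text of its layout elements."""
    result = []
    for page in pages:
        # Text boxes expose get_text(); figures, lines and rectangles do not.
        parts = [element.get_text() for element in page
                 if hasattr(element, 'get_text')]
        result.append(''.join(parts))
    return result


def extract_text_by_page(path):
    # Imported here so the helper above stays usable without pdfminer.six.
    from pdfminer.high_level import extract_pages
    return text_by_page(extract_pages(path))
```

Calling extract_text_by_page('example.pdf') would then return a list with one entry per page.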

Composable API

There is also a composable API that gives a lot of flexibility in handling the resulting objects. For example, it allows you to create your own layout algorithm. This method is suggested in the other answers, but I would only recommend it when you need to customize some component.

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('example.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())

Similar question and answer here. I'll try to keep them in sync.

edited Nov 21, 2021 at 15:50

answered May 17, 2020 at 16:45

Pieter

2 Comments

famas23 Over a year ago

pdf2text should be imported from import tools.pdf2 module not the pdfminer.high_level

TylerH Over a year ago

This doesn't address the scope of the question, which was specific to Python 2.7

Brault Gilbert · Accepted Answer · 2019-12-20 10:43:32Z

1

This code is tested with pdfminer for Python 3 (pdfminer-20191125).

from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.layout import LTTextBoxHorizontal

def parsedocument(document):
    # convert all horizontal text into a lines list (one entry per line)
    # document is a file stream
    lines = []
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(document):
        interpreter.process_page(page)
        layout = device.get_result()
        for element in layout:
            if isinstance(element, LTTextBoxHorizontal):
                lines.extend(element.get_text().splitlines())
    return lines

answered Dec 20, 2019 at 10:43

Brault Gilbert

5 Comments

b00kgrrl Over a year ago

I have PDF files which I am able to convert using the Nitro Pro tool. When I try to convert the same PDF using the code posted here, however, I get output which suggests that there is a permissions error. Here is the output: ('from the SAGE Social Science Collections. All Rights Reserved.\n\n\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c')

Vincent Over a year ago

What do you mean a file stream?

Rodrigo Formighieri Over a year ago

@Vincent with open(file,'rb') as stream: [...]

Je Je Over a year ago

do you manage to get this file as a table/pandas ideally? groupe-psa.com/en/publication/monthly-world-sales-march-2020

TylerH Over a year ago

This doesn't address the scope of the question, which was specific to Python 2.7

TylerH · Accepted Answer · 2024-10-25 15:39:07Z

0

For anyone trying to use pdfminer: you should switch to pdfminer.six, which is the currently maintained version.

edited Oct 25, 2024 at 15:39

TylerH

answered Sep 16, 2022 at 18:20

julie

1 Comment

Martin Thoma Over a year ago

Or to PyPDF2 :-)
