23

How can I convert PDF files to HTML with Python?

I was thinking of something along the lines of what Google does (or seems to do) to index PDF files.

My final goal is to set up Apache to show the HTML for the PDF files, so anything leading me in that direction would also be appreciated.

1 Answer

7

The poppler package provides a pdftohtml utility that you might be able to use. There is also a Python binding to libpoppler.
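
As a rough sketch of how the command-line route could be driven from Python: the snippet below shells out to poppler's `pdftohtml` with `subprocess`. The function names `build_pdftohtml_command` and `convert_pdf` are made up for illustration, and the choice of the `-noframes` and `-i` flags is an assumption about what you want. It also assumes `pdftohtml` is installed and on your `PATH`.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftohtml_command(pdf_path, out_base):
    """Build the argument list for one pdftohtml run (hypothetical helper)."""
    return [
        "pdftohtml",
        "-noframes",  # write a plain HTML document instead of a frameset
        "-i",         # ignore images, keep only the text
        str(pdf_path),
        str(out_base),
    ]


def convert_pdf(pdf_path, out_dir):
    """Convert a single PDF to HTML inside out_dir (hypothetical helper)."""
    if shutil.which("pdftohtml") is None:
        raise RuntimeError("pdftohtml (from poppler) was not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Use the PDF's base name as the output base; pdftohtml adds the extension.
    out_base = out_dir / Path(pdf_path).stem
    subprocess.run(build_pdftohtml_command(pdf_path, out_base), check=True)
    return out_base


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 3:
        convert_pdf(sys.argv[1], sys.argv[2])
```

If `out_dir` is a directory that Apache already serves, the generated HTML should be reachable without further setup. The exact output file names can vary with the poppler version and the flags used, so it is worth checking what one run produces before scripting a whole directory.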


3 Comments

The Python binding is mostly for rendering PDFs in a GTK widget/UI, so I am not sure it would help here.
I haven't actually used it, but it does expose poppler_page_get_text, which might be useful to the OP.
Right, but pulling in the GTK/GLib bindings seems like a big waste if that's all the OP wants, especially as there are easier ways that don't depend on a UI toolkit (e.g. the pdftohtml utility you mention). I should say I generally like the bindings, and was the original author. Maybe not in this case, though.
