How to read Arabic text from PDF using Python script

Question

I have a script written in Python that reads PDF files and converts them to text files.

The problem occurred when I tried to read Arabic text from PDF files. I know that the error is in the coding and encoding process but I don't know how to fix it.

The script processes Arabic PDF files, but the resulting text file is empty and this error is displayed:

Traceback (most recent call last):
  File "C:\Users\test\Downloads\pdf-txt\text maker.py", line 68, in <module>
    f.write(content)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 50: ordinal not in range(128)

Code:

import os
from os import chdir, getcwd, listdir, path
import codecs
import pyPdf
from time import strftime

def check_path(prompt):
    ''' (str) -> str
    Verifies if the provided absolute path does exist.
    '''
    abs_path = raw_input(prompt)
    while path.exists(abs_path) != True:
        print "\nThe specified path does not exist.\n"
        abs_path = raw_input(prompt)
    return abs_path    

print "\n"

folder = check_path("Provide absolute path for the folder: ")

list=[]
directory=folder
for root,dirs,files in os.walk(directory):
    for filename in files:
        if filename.endswith('.pdf'):
            t=os.path.join(directory,filename)

            list.append(t)

m=len(list)
print (m)
i=0
while i<=m-1:

    path=list[i]
    print(path)
    head,tail=os.path.split(path)
    var="\\"

    tail=tail.replace(".pdf",".txt")
    name=head+var+tail

    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for j in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(j).extractText() + "\n"
    print strftime("%H:%M:%S"), " pdf  -> txt "
    f=open(name,'w')
    content.encode('utf-8')
    f.write(content)
    f.close
    i=i+1

Is there an exception or does the script exit silently? Does it work as expected for PDFs that contain only text written with Latin script?
– lenz, Commented Dec 20, 2017 at 9:01
@lenz The script works as expected, with no error, on non-Arabic content, but when it comes to Arabic it converts the PDF to an empty text file.
– Rany Fahed, Commented Dec 20, 2017 at 9:17
Oh I see. You have to write content = content.encode('utf-8') on line 68. String methods never modify strings in-place, you always have to capture the return value.
– lenz, Commented Dec 20, 2017 at 10:48
Rany, did this work? Because once you have fixed your code, I suggest you delete this post, since it's very unlikely to help future readers. Your problem turned out to have nothing to do with encoding, Arabic, or PDF – it's simply a bug that shows up when the content contains non-ASCII characters.
– lenz, Commented Dec 20, 2017 at 16:35
@lenz The error is gone, but the converted file is still empty.
– Rany Fahed, Commented Dec 21, 2017 at 6:43

Mark Tolonen · Accepted Answer · 2017-12-21 07:41:17Z

2

You have a couple of problems:

content.encode('utf-8') on its own has no effect. It returns the encoded content, but you have to assign that return value to a variable. Better yet, open the file with an encoding and write Unicode strings to it; content appears to be Unicode data.

Example (works for both Python 2 and 3):

import io
f = io.open(name,'w',encoding='utf8')
f.write(content)

If you don't close the file properly, you may see no content because the file is not flushed to disk. You have f.close, not f.close(); without the parentheses the method is only referenced, never called. It's better to use with, which ensures the file is closed when the block exits.

Example:

import io
with io.open(name,'w',encoding='utf8') as f:
    f.write(content)

In Python 3, you don't need to import io and use io.open, although it still works; the built-in open is equivalent. Python 2 needs the io.open form.
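
Both points can be checked without a PDF. The following is a minimal sketch, assuming Python 3; the sample string and the file name sample.txt are made up for illustration and stand in for whatever the PDF extraction returns. It shows that encode returns a new object instead of changing content, and that a file opened with an explicit encoding round-trips Arabic text.

```python
import io
import os
import tempfile

# Hypothetical sample text standing in for the extracted PDF content.
content = u"\u0645\u0631\u062d\u0628\u0627 \xa9 2017"

# encode() returns a new bytes object; content itself is left unchanged.
encoded = content.encode("utf-8")
unchanged = (content == u"\u0645\u0631\u062d\u0628\u0627 \xa9 2017")

# Write the Unicode string directly and let the file object do the encoding.
name = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(name, "w", encoding="utf8") as f:
    f.write(content)

# Read it back with the same encoding to confirm nothing was lost.
with io.open(name, "r", encoding="utf8") as f:
    round_trip = f.read()

print(type(encoded).__name__, unchanged, round_trip == content)
```

Because the with block closes the file, the data is flushed before the second open reads it back.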

6 Comments

Rany Fahed Over a year ago

I used your answer to fix my code. Now it converts the Arabic PDF text into a TXT file, but with unreadable characters.

lenz Over a year ago

@RanyFahed What software do you use to inspect the text file? The viewer/editor might be using the wrong encoding.

Mark Tolonen Over a year ago

@RanyFahed Also since it looks like you are on Windows, many Windows programs assume a localized encoding such as Windows-1252 on U.S. Windows. You can use utf-8-sig to write a byte order mark (BOM) signature and some programs recognize this to know to use UTF-8.

Rany Fahed Over a year ago

@lenz For PDF files I am using PDF Complete; for TXT files I am using Notepad++.

Rany Fahed Over a year ago

@MarkTolonen I did not understand your comment.
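
To illustrate the utf-8-sig suggestion from the comment above, here is a minimal sketch, assuming Python 3. The sample string and the file name sample_bom.txt are hypothetical. It writes the text with a byte order mark and then inspects the raw bytes to show what an editor would see at the start of the file.

```python
import codecs
import io
import os
import tempfile

# Hypothetical sample text standing in for the extracted PDF content.
content = u"\u0645\u0631\u062d\u0628\u0627"

name = os.path.join(tempfile.mkdtemp(), "sample_bom.txt")

# "utf-8-sig" writes the three-byte BOM before the UTF-8 encoded text.
with io.open(name, "w", encoding="utf-8-sig") as f:
    f.write(content)

# Inspect the raw bytes: the file should start with the BOM.
with open(name, "rb") as f:
    raw = f.read()
starts_with_bom = raw.startswith(codecs.BOM_UTF8)

# Reading with "utf-8-sig" strips the BOM again.
with io.open(name, "r", encoding="utf-8-sig") as f:
    decoded = f.read()

print(starts_with_bom, decoded == content)
```

Whether a given editor honors the BOM is up to that editor; if the text still looks wrong, setting the encoding to UTF-8 manually in the editor is worth trying.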

Ameen Reda · Accepted Answer · 2021-11-19 15:25:58Z

0

You can use another library called pdfplumber instead of pyPdf or PyPDF2:

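import pdfplumber
import arabic_reshaper
from bidi.algorithm import get_display

with pdfplumber.open(r'example.pdf') as pdf:
    # pages is zero-based, so index 10 is the eleventh page
    my_page = pdf.pages[10]
    thepages = my_page.extract_text()
    # Join letters into their contextual forms, then reorder for right-to-left display
    reshaped_text = arabic_reshaper.reshape(thepages)
    bidi_text = get_display(reshaped_text)
    print(bidi_text)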
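
To convert a whole document to a text file, as the question asks, the same calls can be combined with a UTF-8 file write. This is only a sketch, assuming pdfplumber is installed; the helper names join_page_text and pdf_to_txt are made up for illustration. It leaves out the reshaping step, since whether that is needed depends on how the PDF stores its text and on the program used to view the result.

```python
import io


def join_page_text(pages):
    """Concatenate the text of each page, treating pages with no text as empty."""
    parts = []
    for page in pages:
        text = page.extract_text()
        # Depending on the pdfplumber version, a page without text may
        # give None or an empty string, so handle both the same way.
        parts.append(text or u"")
    return u"\n".join(parts)


def pdf_to_txt(pdf_path, txt_path):
    """Extract all pages of pdf_path and write them to txt_path as UTF-8."""
    # Third-party dependency, imported here so join_page_text works without it.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        content = join_page_text(pdf.pages)
    with io.open(txt_path, "w", encoding="utf8") as f:
        f.write(content)
```

A call such as pdf_to_txt(r'example.pdf', r'example.txt') would then write one text file per PDF.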

