
I have a Python script that reads PDF files and converts them to text files.

The problem occurs when I try to read Arabic text from PDF files. I suspect the error is in the encoding step, but I don't know how to fix it.

The script processes Arabic PDF files, but the resulting text file is empty and this error is displayed:

Traceback (most recent call last):
  File "C:\Users\test\Downloads\pdf-txt\text maker.py", line 68, in <module>
    f.write(content)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 50: ordinal not in range(128)

Code:

import os
from os import chdir, getcwd, listdir, path
import codecs
import pyPdf
from time import strftime

def check_path(prompt):
    ''' (str) -> str
    Verifies if the provided absolute path does exist.
    '''
    abs_path = raw_input(prompt)
    while path.exists(abs_path) != True:
        print "\nThe specified path does not exist.\n"
        abs_path = raw_input(prompt)
    return abs_path    

print "\n"

folder = check_path("Provide absolute path for the folder: ")

list=[]
directory=folder
for root,dirs,files in os.walk(directory):
    for filename in files:
        if filename.endswith('.pdf'):
            t=os.path.join(directory,filename)

            list.append(t)

m=len(list)
print (m)
i=0
while i<=m-1:

    path=list[i]
    print(path)
    head,tail=os.path.split(path)
    var="\\"

    tail=tail.replace(".pdf",".txt")
    name=head+var+tail

    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
            # Iterate pages
    for j in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(j).extractText() + "\n"
    print strftime("%H:%M:%S"), " pdf  -> txt "
    f=open(name,'w')
    content.encode('utf-8')
    f.write(content)
    f.close
    i=i+1
  • Is there an exception or does the script exit silently? Does it work as expected for PDFs that contain only text written with Latin script? Commented Dec 20, 2017 at 9:01
  • @lenz The script works as expected, with no error, on non-Arabic content, but for Arabic content it converts the PDF to an empty text file. Commented Dec 20, 2017 at 9:17
  • Oh I see. You have to write content = content.encode('utf-8') on line 68. String methods never modify strings in-place, you always have to capture the return value. Commented Dec 20, 2017 at 10:48
  • Rany, did this work? Because once you fixed your code, I suggest you delete this post, since it's very unlikely to help future readers. Your problem turned out to have nothing to do with encoding, Arabic, or PDF – it's simply a bug that shows up when the content contains non-ASCII characters. Commented Dec 20, 2017 at 16:35
  • @lenz the error is gone but still the converted file is empty Commented Dec 21, 2017 at 6:43

2 Answers


You have a couple of problems:

  1. content.encode('utf-8') on its own has no effect: it returns the encoded bytes and leaves content unchanged, so you have to assign the result to a variable. Better yet, open the file with an encoding and write Unicode strings to it. content appears to be Unicode data.

Example (works for both Python 2 and 3):

import io
f = io.open(name,'w',encoding='utf8')
f.write(content)
  2. If you don't close the file properly, you may see no content because the data is never flushed to disk. You have f.close, not f.close(), so the method is never actually called. It's better to use with, which ensures the file is closed when the block exits.

Example:

import io
with io.open(name,'w',encoding='utf8') as f:
    f.write(content)

In Python 3, the built-in open is equivalent to io.open, so the import is optional there (though it still works). Python 2 needs the io.open form.
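
As a small illustration of point 1, the sketch below shows that encode returns a new object and leaves the original string untouched, and that writing through io.open with an encoding round-trips Arabic text. The sample string, the temporary directory, and the variable names are assumptions made for the demo; they are not taken from the question's script.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Sample text standing in for the text extracted from the PDF (assumption)
content = u"\u0645\u0631\u062d\u0628\u0627"  # the Arabic word "marhaba"

# encode() returns a NEW bytes object; it does not modify content
encoded = content.encode("utf-8")
unchanged = (content == u"\u0645\u0631\u062d\u0628\u0627")

# Write the Unicode string through a file opened with an explicit encoding;
# the with block closes (and therefore flushes) the file on exit
out_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(out_path, "w", encoding="utf-8") as f:
    f.write(content)

# Read the file back to confirm the text survived the round trip
with io.open(out_path, "r", encoding="utf-8") as f:
    round_trip = f.read()
```

Here encoded holds the UTF-8 bytes, while content is still the same Unicode string, which is why the bare content.encode('utf-8') line in the question had no effect.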


6 Comments

I used your answer to fix my code. Now it converts the Arabic PDF text into a TXT file, but with unreadable characters.
@RanyFahed What software do you use to inspect the text file? The viewer/editor might be using the wrong encoding.
@RanyFahed Also since it looks like you are on Windows, many Windows programs assume a localized encoding such as Windows-1252 on U.S. Windows. You can use utf-8-sig to write a byte order mark (BOM) signature and some programs recognize this to know to use UTF-8.
@lenz For PDF files I am using PDF Complete; for TXT files I am using Notepad++.
@MarkTolonen i did not understand your comment
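
The utf-8-sig suggestion in the comments above can be sketched roughly as follows. The sample text and output path are placeholders chosen for the demo. The only change from the answer's code is the encoding name, which makes Python write a three-byte byte order mark (BOM) at the start of the file.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Placeholder text standing in for the extracted PDF content (assumption)
content = u"\u0646\u0635 \u0639\u0631\u0628\u064a"

out_path = os.path.join(tempfile.mkdtemp(), "with_bom.txt")

# "utf-8-sig" writes the UTF-8 BOM (EF BB BF) before the text
with io.open(out_path, "w", encoding="utf-8-sig") as f:
    f.write(content)

# Inspect the raw bytes to confirm the BOM is present
with open(out_path, "rb") as f:
    raw = f.read()

has_bom = raw.startswith(b"\xef\xbb\xbf")
```

Some Windows editors use that signature to detect UTF-8. Whether it fixes the unreadable characters depends on the viewer and on what the PDF library actually extracted.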

You can use another library, pdfplumber, instead of pypdf or PyPDF2:

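import pdfplumber
import arabic_reshaper
from bidi.algorithm import get_display

with pdfplumber.open(r'example.pdf') as pdf:
    my_page = pdf.pages[10]  # pages are zero-indexed, so this is the 11th page
    page_text = my_page.extract_text()
    # Join Arabic letters into their contextual (connected) forms
    reshaped_text = arabic_reshaper.reshape(page_text)
    # Reorder the text for display in viewers without right-to-left support
    bidi_text = get_display(reshaped_text)
    print(bidi_text)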
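
To match the question's goal of producing a text file, the same idea could be extended to all pages and written out as UTF-8. This is only a sketch: join_page_texts and pdf_to_txt are hypothetical helper names, and the file paths are placeholders. It assumes extract_text may return None for a page with no extractable text. It also skips the reshaping and bidi step on the assumption that a bidi-aware editor will handle display of the saved file.

```python
import io


def join_page_texts(page_texts):
    """Join per-page strings with newlines, treating None as an empty page."""
    return u"\n".join(text or u"" for text in page_texts)


def pdf_to_txt(pdf_path, txt_path):
    """Extract text from every page of pdf_path and save it as UTF-8."""
    # Third-party dependency, imported lazily so the helper above
    # can be used without pdfplumber installed
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        content = join_page_texts(page.extract_text() for page in pdf.pages)

    # Explicit encoding avoids the 'ascii' codec error from the question
    with io.open(txt_path, "w", encoding="utf-8") as f:
        f.write(content)
```

A call would look like pdf_to_txt(r'example.pdf', r'example.txt').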
