Encoding issue when writing to text file, with Python

Question

I'm writing a short Python script to 'manually' rearrange a CSV file into proper JSON syntax. I read the input file with readlines() to get a list of rows, which I manipulate and concatenate into a single string, which is then written to a separate .txt file. The output, however, contains gibberish instead of the Hebrew characters that were present in the input file, and it is double-spaced horizontally (a whitespace character is added between each pair of characters). As far as I can tell, the problem has to do with the encoding, but I haven't been able to figure out what exactly. When I check the encoding of the input and output files (using the .encoding attribute), both return None, which means they use the system default. Technical details: Python 2.7, Windows 7.

While there are a number of questions out there on this topic, I didn't find a direct answer to my problem. Detecting the system defaults won't help me in this case, because I need the program to be portable.

Here's the code:

def txt_to_JSON(csv_list):
    # ...some manipulation of the list...
    return JSON_string

file_name = "input_file.txt"
my_file = open(file_name)
# make each line of the input file a value in a list
lines = my_file.readlines()
# break up each line into a list such that each 'column' is a value in that list
for i in range(0, len(lines)):
    lines[i] = lines[i].split("\t")
J_string = txt_to_JSON(lines)
json_file = open("output_file.txt", "w+")
json_file.write(J_string)
json_file.close()

It's worth noting that when working with files in Python, it's best to use the with statement.
– Gareth Latty, Commented Apr 24, 2013 at 14:48
@PauloBu He's reading Hebrew characters, but he's using ASCII in his program. This is most likely the problem.
– Aleph, Commented Apr 24, 2013 at 15:05
I'm glad. If you want some background to explain to your leader, these links will be very helpful, especially the first: joelonsoftware.com/articles/Unicode.html , stackoverflow.com/questions/3951722/… and stackoverflow.com/questions/643694/utf-8-vs-unicode
– Paulo Bu, Commented Apr 24, 2013 at 17:18

Community · Accepted Answer · 2017-05-23 10:25:44Z

1

All data needs to be encoded to be stored on disk. If you don't know the encoding, the best you can do is guess. There's a library for that: https://pypi.python.org/pypi/chardet
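If installing chardet is not an option, a cheaper first guess is to look for a byte order mark (BOM) at the start of the file. The sketch below is not what chardet does; it is a much narrower standard-library heuristic, and the helper name sniff_bom and the "utf-8" fallback are assumptions made for illustration. Files without a BOM simply fall through to the default.

```python
import codecs

# (BOM bytes, codec name) pairs. The 4-byte UTF-32 marks are tested before
# UTF-16 because BOM_UTF32_LE starts with the same two bytes as BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw, default="utf-8"):
    """Return a codec name guessed from a leading BOM (hypothetical helper)."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    # No BOM found: this is only a fallback, not a detection result.
    return default


# The "utf-16" encoder prepends a BOM, so the sample is recognisable.
sample = u"\u05e9\u05dc\u05d5\u05dd".encode("utf-16")
print(sniff_bom(sample))          # utf-16
print(sniff_bom(b"plain ascii"))  # utf-8
```

The returned name can be passed straight to bytes.decode(); the "utf-16", "utf-32" and "utf-8-sig" codecs consume the BOM while decoding.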

I highly recommend Ned Batchelder's presentation http://nedbatchelder.com/text/unipain.html for details.

There's an explanation of the use of "Unicode" as an encoding name on Windows: What's the difference between Unicode and UTF-8?

TL;DR: Microsoft uses UTF-16 as the encoding for Unicode strings but calls it "Unicode", since that is also what Windows uses internally.
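This would also account for the "double-spaced" output in the question. Here is a small sketch of what happens when UTF-16 bytes are treated as single-byte text. It assumes the input really is UTF-16 (the question does not confirm that), and latin-1 is only a stand-in for whatever single-byte code page the system default happens to be.

```python
# ASCII "id", a tab, then a Hebrew word written with escapes.
text = u"id\t\u05e9\u05dc\u05d5\u05dd"

# In UTF-16 every character in this sample takes two bytes.
raw = text.encode("utf-16-le")
print(len(text), len(raw))

# Misreading those bytes with a single-byte code page: each ASCII character
# is followed by a NUL byte, and each Hebrew letter turns into two
# unrelated characters.
misread = raw.decode("latin-1")
print(repr(misread))

# Decoding with the encoding that was actually used recovers the text.
print(raw.decode("utf-16-le") == text)
```

Many editors draw the NUL bytes as blanks, which looks like an extra space after every character.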

Even though Python 2 is a bit lenient about str/unicode conversions, you should get used to always decoding on input and encoding on output.

In your case

filename = 'where your data lives'
# read the raw bytes, then decode them explicitly
with open(filename, 'rb') as f:
    encoded_data = f.read()
decoded_data = encoded_data.decode("UTF-16")

# do stuff, resulting in result (all on unicode strings)
# text_to_json stands for your own conversion function
result = text_to_json(decoded_data)

encoded_result = result.encode("UTF-16")  # really, just using UTF-8 for everything makes things a lot easier
outfile = 'where your data goes'
with open(outfile, 'wb') as f:
    f.write(encoded_result)
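The same decode-on-input, encode-on-output pattern can also be written with io.open, which takes an encoding argument on both Python 2.7 and Python 3 and does the conversion while reading and writing. This is a sketch, not the code above: the convert and tabs_to_commas names are made up for illustration, tabs_to_commas is only a stand-in for the real conversion step, and UTF-16 input with UTF-8 output is an assumption.

```python
import io
import os
import tempfile


def tabs_to_commas(text):
    # Hypothetical stand-in for the question's txt_to_JSON step.
    return text.replace(u"\t", u",")


def convert(in_path, out_path, transform,
            in_encoding="utf-16", out_encoding="utf-8"):
    # io.open decodes while reading, so `text` is a unicode string.
    with io.open(in_path, "r", encoding=in_encoding) as src:
        text = src.read()
    result = transform(text)
    # io.open encodes while writing; newline="" leaves "\n" untranslated.
    with io.open(out_path, "w", encoding=out_encoding, newline="") as dst:
        dst.write(result)


# Usage with throwaway files in a temporary directory.
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, "input_file.txt")
out_path = os.path.join(tmp_dir, "output_file.txt")
with io.open(in_path, "w", encoding="utf-16") as f:
    f.write(u"name\t\u05e9\u05dc\u05d5\u05dd\n")

convert(in_path, out_path, tabs_to_commas)
with io.open(out_path, "r", encoding="utf-8") as f:
    print(repr(f.read()))
```

Because the encodings are explicit arguments, the script no longer depends on the system default, which is what the question needs for portability.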

edited May 23, 2017 at 10:25

answered Apr 24, 2013 at 18:34

Thomas Fenzl

8 Comments

ygesher Over a year ago

Thanks for the input. When I do this, however, the output file (created by f.write()) is still encoded as ANSI, so I get UnicodeEncodeError when it gets to the Hebrew characters. And btw, utf_16 is the proper notation.

ygesher Over a year ago

Following your link, I changed the encoding from 'utf_16' to 'utf_16_le', and got a similar error, just relating to the very beginning of the file rather than the non-ASCII characters.

Thomas Fenzl Over a year ago

What program do you use to open the output file?

ygesher Over a year ago

I use notepad. How would this affect the encoding?

Thomas Fenzl Over a year ago

The program has to decode the file to interpret what's in it. Can you put both files, or similar files with nonsense original text, someplace? I'd like to take a look.
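One plausible explanation for the error moving to the very beginning of the file after switching to 'utf_16_le' is the byte order mark. This is a guess, since the files were never posted. The sketch below builds its own BOM-prefixed sample and shows how the two codecs treat it differently.

```python
import codecs

text = u"\u05e9\u05dc\u05d5\u05dd"
# Bytes as a BOM-prefixed little-endian UTF-16 file would contain them.
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# "utf-16-le" treats the BOM as ordinary data and keeps it as U+FEFF.
as_le = raw.decode("utf-16-le")
# "utf-16" reads the BOM to pick the byte order, then discards it.
as_auto = raw.decode("utf-16")

print(repr(as_le[0]))
print(as_auto == text)
# If "utf-16-le" has to be used, the mark can be stripped by hand.
print(as_le.lstrip(u"\ufeff") == text)
```

With 'utf_16_le' the first decoded character is U+FEFF, which is not ASCII, so an implicit ASCII encode on write would fail at position 0 instead of at the first Hebrew letter.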

Community · Accepted Answer · 2017-05-23 11:50:19Z

0

You need to tell Python to decode the file with the right Unicode encoding so that the Hebrew characters are read correctly. Here's a link showing how to read Unicode characters in Python: Character reading from file in Python

edited May 23, 2017 at 11:50

answered Apr 24, 2013 at 15:18

user2286078

1 Comment

ygesher Over a year ago

Sorry, I didn't find a solution there. I tried using the codecs module, but nothing changed in the output.
