I'm writing a short Python script to 'manually' convert a CSV file into valid JSON. I read the input file with readlines() to get a list of rows, manipulate the rows and concatenate them into a single string, and write that string to a separate .txt file. The output, however, contains gibberish instead of the Hebrew characters that were present in the input file, and it is double-spaced horizontally (a whitespace character appears between every two characters). As far as I can tell, the problem has to do with the encoding, but I haven't been able to figure out exactly what. When I check the encoding of the input and output file objects (using the .encoding attribute), both return None, which means they use the system default. Technical details: Python 2.7, Windows 7.

While there are a number of questions out there on this topic, I didn't find a direct answer to my problem. Detecting the system defaults won't help me in this case, because I need the program to be portable.

Here's the code:

def txt_to_JSON(csv_list):
    ...some manipulation of the list...
    return JSON_string
file_name = "input_file.txt"
my_file = open(file_name)
# make each line of input file a value in a list
lines = my_file.readlines()
# break up each line into a list such that each 'column' is a value in that list 
for i in range(0,len(lines)):
    lines[i] = lines[i].split("\t")
J_string = txt_to_JSON(lines)
json_file = open("output_file.txt", "w+")
json_file.write(J_string)
json_file.close()
  • It's worth noting that when working with files in Python, it's best to use the with statement. Commented Apr 24, 2013 at 14:48
  • Do you know what's the encoding of the input file? Commented Apr 24, 2013 at 14:53
  • @PauloBu He's reading Hebrew characters, but he's using ASCII in his program. This is most likely the problem. Commented Apr 24, 2013 at 15:05
  • What version of Python? Commented Apr 24, 2013 at 15:08
  • I'm glad. If you want some background to explain this to your leader, these links will be very helpful, especially the first: joelonsoftware.com/articles/Unicode.html , stackoverflow.com/questions/3951722/… and stackoverflow.com/questions/643694/utf-8-vs-unicode Commented Apr 24, 2013 at 17:18

2 Answers

All data needs to be encoded to be stored on disk. If you don't know the encoding, the best you can do is guess. There's a library for that: https://pypi.python.org/pypi/chardet
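Before reaching for a guessing library, a cheaper first check is to look for a byte order mark (BOM) at the start of the file. The sketch below uses only the standard library `codecs` constants. It is not chardet and not a substitute for it; the function name `sniff_bom` is hypothetical, and a file without a BOM simply yields `None`.

```python
import codecs

# BOM signatures. UTF-32 comes first because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw_bytes):
    """Return a codec name if raw_bytes starts with a known BOM, else None."""
    for bom, codec_name in _BOMS:
        if raw_bytes.startswith(bom):
            return codec_name
    return None


# Hebrew "shalom", encoded the way Notepad's "Unicode" option would (with a BOM)
sample = u"\u05e9\u05dc\u05d5\u05dd".encode("utf-16")
detected = sniff_bom(sample)
decoded = sample.decode(detected)
```

The codec names returned here ("utf-16", "utf-32", "utf-8-sig") all consume the BOM while decoding, so `decoded` holds only the text.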

I highly recommend Ned Batchelder's presentation http://nedbatchelder.com/text/unipain.html for details.

There's an explanation of the use of "Unicode" as the name of an encoding on Windows: What's the difference between Unicode and UTF-8?

TL;DR: Microsoft uses UTF-16 as the encoding for Unicode strings, but decided to call it "Unicode" because they also use it internally.
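If the input file really is UTF-16 (an assumption; the question does not confirm it), that would account for both symptoms. A small sketch of what the bytes look like when UTF-16 text is treated as a single-byte encoding:

```python
text = u"ab\u05d0"  # two ASCII letters followed by the Hebrew letter alef

# UTF-16 little-endian, no BOM: each of these characters takes two bytes
raw = text.encode("utf-16-le")
as_bytes = list(bytearray(raw))

# Reading those bytes as a single-byte encoding yields one character per byte:
# every ASCII letter is followed by a NUL, and the Hebrew letter becomes two
# unrelated characters.
misread = raw.decode("latin-1")

# Decoding with the right codec recovers the original text
recovered = raw.decode("utf-16-le")
```

The NUL characters between the ASCII letters are typically shown as blanks by editors, which looks like horizontal double spacing, and the two bytes of each Hebrew letter show up as gibberish.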

Even though Python 2 is a bit lenient about str/unicode conversions, you should get used to always decoding on input and encoding on output.

In your case

filename = 'where your data lives'
with open(filename, 'rb') as f:
    encoded_data = f.read()
decoded_data = encoded_data.decode("UTF-16")

# do stuff, resulting in result (all on unicode strings)
result = text_to_json(decoded_data)

encoded_result = result.encode("UTF-16")  # really, just using UTF-8 for everything makes things a lot easier
outfile = 'where your data goes'
with open(outfile, 'wb') as f:
    f.write(encoded_result)
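The same decode-on-input, encode-on-output pattern can also be written with `io.open`, which exists in both Python 2.7 and Python 3 and does the conversion while reading and writing. The following is only a sketch: the `convert` function, the temporary file names, the `" | "` join standing in for the real manipulation, and the choice of UTF-16 in and UTF-8 out are all assumptions for illustration.

```python
import io
import os
import tempfile


def convert(in_path, out_path, in_encoding="utf-16", out_encoding="utf-8"):
    """Read tab-separated text, process it as unicode, write it back encoded."""
    # io.open decodes while reading, so every line is already a unicode string
    with io.open(in_path, "r", encoding=in_encoding) as src:
        rows = [line.rstrip(u"\r\n").split(u"\t") for line in src]

    # Placeholder for the real list manipulation from the question
    result = u"\n".join(u" | ".join(row) for row in rows)

    # io.open encodes while writing, so no manual .encode() call is needed
    with io.open(out_path, "w", encoding=out_encoding) as dst:
        dst.write(result)
    return rows


# Demo with a throwaway UTF-16 input file containing Hebrew text
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, "input_file.txt")
out_path = os.path.join(tmp_dir, "output_file.txt")
with io.open(in_path, "w", encoding="utf-16") as f:
    f.write(u"name\tcity\n\u05d3\u05e0\u05d9\u05d0\u05dc\t\u05d7\u05d9\u05e4\u05d4\n")

rows = convert(in_path, out_path)
```

Naming the encodings explicitly at both ends keeps the script independent of the system default, which matters for the portability requirement in the question.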

8 Comments

Thanks for the input. When I do this, however, the output file (created by f.write()) is still encoded as ANSI, so I get UnicodeEncodeError when it gets to the Hebrew characters. And btw, utf_16 is the proper notation.
Following your link, I changed the encoding from 'utf_16' to 'utf_16_le', and got a similar error, just relating to the very beginning of the file rather than to the non-ASCII characters.
What program do you use to open the output file?
I use Notepad. How would this affect the encoding?
The program has to decode the file to interpret what's in it. Can you put both files, or similar files with nonsense text in place of the original, somewhere? I'd like to take a look.
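On the 'utf_16_le' error at the very beginning of the file mentioned in the comments above: one possible explanation (an assumption, since the actual files were never posted) is the byte order mark. The `utf-16` codec reads and discards a leading BOM, while `utf-16-le` keeps it as the character U+FEFF, which a later implicit ASCII encode cannot handle. A sketch:

```python
import codecs

text = u"\u05e9\u05dc\u05d5\u05dd"  # Hebrew "shalom"

# Build a little-endian UTF-16 payload by hand so the result does not
# depend on the byte order of the machine running the example.
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

via_utf_16 = raw.decode("utf-16")        # BOM is consumed by the codec
via_utf_16_le = raw.decode("utf-16-le")  # BOM survives as U+FEFF

# If the explicit little-endian codec is needed, strip the BOM manually
cleaned = via_utf_16_le.lstrip(u"\ufeff")
```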

You need to tell Python which Unicode encoding to use when decoding the Hebrew characters; otherwise it falls back to a default that cannot represent them. Here's a link showing how to read Unicode characters in Python: Character reading from file in Python

1 Comment

Sorry, I didn't find a solution there. I tried using the codecs module, but nothing changed in the output.
