Python UnicodeDecodeError

Question

I am writing a Python program to read in the output of a DOS tree command that was written to a text file. When I reach the 533rd iteration of the loop, Eclipse gives an error:

Traceback (most recent call last):
  File "E:\Peter\Documents\Eclipse Workspace\MusicManagement\InputTest.py", line 24, in <module>
    input = myfile.readline()
  File "C:\Python33\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 3551: character maps to <undefined>

I have read other posts, but setting the encoding to latin-1 does not resolve the issue: it raises a UnicodeDecodeError on a different character, and the same happens with utf-8.

The following is the code:

import os
from Album import *

os.system("tree F:\\Music > tree.txt")

myfile = open('tree.txt')
myfile.readline()
myfile.readline()
myfile.readline()

albums = []
x = 0

while x < 533:
    if not input: break
    input = myfile.readline()
    if len(input) < 14:
        artist = input[4:-1]
    elif input[13] != '-':
        artist = input[4:-1]
    else:
        albums.append(Album(artist, input[15:-1], input[8:12]))
    x += 1

for x in albums:
    print(x.artist + ' - ' + x.title + ' (' + str(x.year) + ')')

You need to figure out what encoding tree.com used; according to this post that could be UTF-16.
– Martijn Pieters, Commented Jan 31, 2013 at 21:32
In this case using Python's os.walk rather than the DOS command might be easier.
– mmmmmm, Commented Jan 31, 2013 at 21:34
If the encoding used maps single bytes to single characters and maps bytes 0 through 127 to the same values as ASCII, then you can probably deduce what the encoding being used is. Just read the line as bytes, remove byte 0x81 or replace it with a blank, and decode the resulting byte string as though it were ASCII encoded. Then see if you can guess what the missing character is using a bit of human intuition, and go research what codec might map 0x81 to that character.
– Mark Amery, Commented Jan 31, 2013 at 21:49
Also, given that this file comes from DOS, a possible guess for the codec that hasn't yet been suggested is Code Page 437, which is named 'cp437' in Python. See: en.wikipedia.org/wiki/Code_page_437 That would make your mystery character a ü though, which is a fairly unusual character (unless you're German).
– Mark Amery, Commented Jan 31, 2013 at 21:51
cp437 got me a lot further into the file than any other encoding has. I'm currently looking further into what the encoding may be. Thanks for getting me on the right track though.
– pbecker13, Commented Jan 31, 2013 at 21:58
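
The "read the line as bytes and look at what surrounds 0x81" idea from the comments could be sketched roughly as follows. The helper name context_around and the sample bytes are made up for illustration; in practice the bytes would come from open('tree.txt', 'rb').read().

```python
def context_around(raw, bad_byte=0x81, width=20):
    """Return the text surrounding each occurrence of bad_byte.

    The mystery byte is shown as '?'. Any other non-ASCII byte becomes
    U+FFFD, so only the plain ASCII neighbours remain readable.
    """
    marker = bytes([bad_byte])
    snippets = []
    pos = raw.find(marker)
    while pos != -1:
        window = raw[max(0, pos - width):pos + width + 1]
        window = window.replace(marker, b"?")
        snippets.append(window.decode("ascii", errors="replace"))
        pos = raw.find(marker, pos + 1)
    return snippets


# Hypothetical sample line; real input would be the raw bytes of tree.txt.
sample = b"+---Motley Cr\x81e"
for snippet in context_around(sample, width=5):
    print(snippet)
```

Seeing "Cr?e" next to a band name makes it fairly easy to guess that the missing character is ü, which then narrows the search to code pages that map 0x81 to ü.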

Community · Accepted Answer · 2017-05-23 12:32:56Z

9

You need to figure out what encoding tree.com used; according to this post that could be any of the MS-DOS codepages.

You could go through each of the MS-DOS encodings; most of those have a codec in the Python standard library. I'd try cp437 and cp850 first; the latter is the MS-DOS predecessor of cp1252 I think.

Pass the encoding to open():

myfile = open('tree.txt', encoding='cp437')
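
Going through the candidate encodings can be automated. This is only a sketch: the function name, the candidate list, and the sample bytes are assumptions chosen for illustration, and a clean decode does not prove the guess is right, so the decoded names still need to be checked by eye.

```python
CANDIDATES = ("cp437", "cp850", "cp1252", "utf-8")


def encodings_that_decode(raw, candidates=CANDIDATES):
    """Return the candidate encodings that decode raw without an error."""
    usable = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        usable.append(name)
    return usable


# Hypothetical sample: box-drawing bytes followed by a name containing 0x81.
# For the real file, use: raw = open('tree.txt', 'rb').read()
sample = b"\xc3\xc4\xc4\xc4Cr\x81e"
for name in encodings_that_decode(sample):
    print(name, repr(sample.decode(name)))
```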

You really should look into using os.walk() instead of tree.com for this task, though; it'll save you from having to deal with issues like these at least.
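
A rough sketch of the os.walk() route, assuming the library is laid out as <root>/<artist>/<album> (that layout is inferred from the parsing code in the question, not stated there). The function name iter_albums is made up for this example.

```python
import os


def iter_albums(root):
    """Yield (artist, album_folder) pairs, assuming root/<artist>/<album>."""
    for dirpath, dirnames, _filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a predictable order
        rel = os.path.relpath(dirpath, root)
        if rel == os.curdir:
            continue  # the root itself is neither an artist nor an album
        parts = rel.split(os.sep)
        if len(parts) == 2:
            dirnames[:] = []  # do not descend below the album level
            yield parts[0], parts[1]


if __name__ == "__main__":
    # Path taken from the question; nothing is printed if it does not exist.
    for artist, album in iter_albums("F:\\Music"):
        print(artist + " - " + album)
```

Because the names come straight from the file system as str objects, there is no intermediate text file and no code page to guess.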

edited May 23, 2017 at 12:32

CommunityBot

answered Jan 31, 2013 at 21:41

Martijn Pieters

7 Comments

pbecker13 Over a year ago

Traceback (most recent call last):
  File "E:\Peter\Documents\Eclipse Workspace\MusicManagement\InputTest.py", line 15, in <module>
    myfile.readline()
  File "C:\Python33\lib\codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "C:\Python33\lib\encodings\utf_16.py", line 67, in _buffer_decode
    raise UnicodeError("UTF-16 stream does not start with BOM")
UnicodeError: UTF-16 stream does not start with BOM

Martijn Pieters Over a year ago

@pbecker13: You could force it with utf_16_le (little endian), see if that works. I doubt it is UTF-16 actually if you didn't see 0-bytes all over the place. It's just that the NTFS file system uses UTF-16 by default and I suspect that tree will use that when outputting non-ASCII names.

Martijn Pieters Over a year ago

@pbecker13: Added some more options to look into.

Mark Ransom Over a year ago

@MartijnPieters, although the filenames may be stored as UTF-16 they will be converted to a code page when written to a file. The trick is to determine which code page.

pbecker13 Over a year ago

Finally got the entire file to read, using Code Page 850 ('cp850'). Thanks for the help!
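
The UTF-16 question raised in these comments can also be checked directly on the raw bytes: look for a byte order mark, or for the "0-bytes all over the place" that mostly-ASCII UTF-16 text would contain. This is a heuristic sketch only; the function name and the 0.3 threshold are arbitrary choices for illustration.

```python
import codecs


def looks_like_utf16(raw, nul_ratio=0.3):
    """Guess whether raw is UTF-16: BOM present, or many NUL bytes."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return True
    if not raw:
        return False
    # ASCII-heavy text in UTF-16 has a NUL in roughly every second byte.
    return raw.count(b"\x00") / len(raw) > nul_ratio


# For the real file, use: raw = open('tree.txt', 'rb').read()
print(looks_like_utf16("Folder PATH listing".encode("utf-16-le")))
print(looks_like_utf16("Folder PATH listing".encode("cp850")))
```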

Emanuele Paolini · Accepted Answer · 2013-01-31 21:34:22Z

1

In this line:

myfile = open('tree.txt')

you should specify the encoding of your file. On Windows, try:

myfile = open('tree.txt',encoding='cp1250')

answered Jan 31, 2013 at 21:34

Emanuele Paolini

2 Comments

pbecker13 Over a year ago

This still throws the same error. It's just the output of a tree command in DOS, I am not sure why it would output something where encoding is such an issue.

Martijn Pieters Over a year ago

The default encoding Python used was cp1252; you can see that from the traceback.
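
To see why neither Windows code page helps here, the single byte from the traceback can be decoded under each codec discussed in this thread. The helper name decode_byte is made up for this sketch.

```python
def decode_byte(value, encoding):
    """Return the character a single byte maps to, or None if undefined."""
    try:
        return bytes([value]).decode(encoding)
    except UnicodeDecodeError:
        return None


# 0x81 is the byte reported in the question's traceback.
for name in ("cp1252", "cp1250", "cp437", "cp850"):
    print(name, repr(decode_byte(0x81, name)))
```

In Python's codecs, 0x81 has no mapping in cp1252 or cp1250, while both DOS code pages map it to ü.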
