
I am writing a Python program that reads the output of a DOS tree command that was redirected into a text file. When I reach the 533rd iteration of the loop, Eclipse gives an error:

Traceback (most recent call last):
  File "E:\Peter\Documents\Eclipse Workspace\MusicManagement\InputTest.py", line 24, in <module>
    input = myfile.readline()
  File "C:\Python33\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 3551: character maps to <undefined>

I have read other posts, but setting the encoding to latin-1 does not resolve the issue: it raises a UnicodeDecodeError on a different character, and the same happens with utf-8.

The following is the code:

import os
from Album import *

os.system("tree F:\\Music > tree.txt")

myfile = open('tree.txt')
myfile.readline()
myfile.readline()
myfile.readline()

albums = []
x = 0

while x < 533:
    if not input: break
    input = myfile.readline()
    if len(input) < 14:
        artist = input[4:-1]
    elif input[13] != '-':
        artist = input[4:-1]
    else:
        albums.append(Album(artist, input[15:-1], input[8:12]))
    x += 1

for x in albums:
    print(x.artist + ' - ' + x.title + ' (' + str(x.year) + ')')
  • You need to figure out what encoding tree.com used; according to this post that could be UTF-16. Commented Jan 31, 2013 at 21:32
  • In this case using Python's os.walk rather than the DOS command might be easier. Commented Jan 31, 2013 at 21:34
  • If the encoding used maps single bytes to single characters and maps bytes 0 through 127 to the same values as ASCII, then you can probably deduce which encoding is being used. Just read the line as bytes, remove byte 0x81 or replace it with a blank, and decode the resulting byte string as though it were ASCII encoded. Then see if you can guess what the missing character is using a bit of human intuition, and go research which codec might map 0x81 to that character. Commented Jan 31, 2013 at 21:49
  • Also, given that this file comes from DOS, a possible guess for the codec that hasn't yet been suggested is Code Page 437, which is named 'cp437' in Python. See: en.wikipedia.org/wiki/Code_page_437 That would make your mystery character a ü though, which is a fairly unusual character (unless you're German). Commented Jan 31, 2013 at 21:51
  • cp437 got me a lot further into the file than any other encoding has. I'm currently looking further into what the encoding may be. Thanks for getting me on the right track though. Commented Jan 31, 2013 at 21:58
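
The byte-level approach suggested in the comments above could be sketched roughly as follows. The sample line and the helper name decode_byte are made up for illustration; only the byte 0x81 and the codec names come from this thread.

```python
# Hypothetical sample: one tree-style line containing the raw byte 0x81.
raw = b"|   +---M\x81nchen"

# Step 1: blank out the mystery byte and read the rest as plain ASCII.
print(raw.replace(b"\x81", b"?").decode("ascii"))

# Step 2: see what each candidate codec makes of that single byte.
def decode_byte(byte, codecs_to_try):
    results = {}
    for name in codecs_to_try:
        try:
            results[name] = byte.decode(name)
        except UnicodeDecodeError:
            results[name] = None  # this codec has no mapping for the byte
    return results

candidates = decode_byte(b"\x81", ["cp437", "cp850", "cp1252"])
for name, char in candidates.items():
    # ascii() keeps the output printable whatever the console encoding is.
    print(name, ascii(char))
```

If the guessed character fits the surrounding word, the codec is at least a plausible candidate; several DOS code pages share the same mapping for a given byte, so one byte alone will not settle it.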

2 Answers


You need to figure out what encoding tree.com used; according to this post that could be any of the MS-DOS codepages.

You could go through each of the MS-DOS encodings; most of those have a codec in the Python standard library. I'd try cp437 and cp850 first; the latter is the MS-DOS code page for Western European languages and covers roughly the same characters as cp1252.

Pass the encoding to open():

myfile = open('tree.txt', encoding='cp437')
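
If you want to go through the candidates systematically, something like the sketch below might help. The function name find_working_encodings and the sample bytes are hypothetical; in practice you would pass it the result of open('tree.txt', 'rb').read().

```python
def find_working_encodings(data, candidates=("cp437", "cp850", "cp1252", "utf-8")):
    """Return the candidate codecs that decode data without raising."""
    working = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this codec cannot represent at least one byte
        working.append(name)
    return working

# Made-up sample: box-drawing bytes as tree.com might write them, plus 0x81.
sample = b"\xc3\xc4\xc4\xc4M\x81nchen\r\n"
print(find_working_encodings(sample))
```

Keep in mind that cp437 and cp850 assign a character to every byte value, so they never raise. A clean decode only tells you the codec is possible; you still have to look at the decoded names to see whether they are right.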

You really should look into using os.walk() instead of tree.com for this task, though; it would at least save you from having to deal with issues like these.
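
A minimal sketch of the os.walk() route, assuming the music folder is laid out as root/artist/album (that layout is a guess based on the parsing code in the question, and iter_albums is a made-up name):

```python
import os

def iter_albums(root):
    """Yield (artist, album_folder) pairs from an assumed root/artist/album layout."""
    for dirpath, dirnames, _filenames in os.walk(root):
        dirnames.sort()  # visit folders in a predictable order
        rel = os.path.relpath(dirpath, root)
        if rel == os.curdir:
            continue  # the root itself; its subfolders are the artists
        for album in dirnames:
            yield rel, album
        dirnames[:] = []  # do not descend below the album level

if __name__ == "__main__":
    # Path taken from the question; adjust to your own library.
    for artist, album in iter_albums(r"F:\Music"):
        print(artist + " - " + album)
```

Because the names come straight from the file system as str objects, there is no intermediate text file and no code page to guess.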


7 Comments

Traceback (most recent call last):
  File "E:\Peter\Documents\Eclipse Workspace\MusicManagement\InputTest.py", line 15, in <module>
    myfile.readline()
  File "C:\Python33\lib\codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "C:\Python33\lib\encodings\utf_16.py", line 67, in _buffer_decode
    raise UnicodeError("UTF-16 stream does not start with BOM")
UnicodeError: UTF-16 stream does not start with BOM
@pbecker13: You could force it with utf_16_le (little endian), see if that works. I doubt it is UTF-16 actually if you didn't see 0-bytes all over the place. It's just that the NTFS file system uses UTF-16 by default and I suspect that tree will use that when outputting non-ASCII names.
@pbecker13: Added some more options to look into.
@MartijnPieters, although the filenames may be stored as UTF-16 they will be converted to a code page when written to a file. The trick is to determine which code page.
Finally got the entire file to read, using Code Page 850 ('cp850'). Thanks for the help!
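
For anyone wanting to rule UTF-16 in or out before trying code pages, the "0-bytes all over the place" check from the comments can be sketched like this. It is a heuristic only, and looks_like_utf16 and its threshold are made up for illustration.

```python
import codecs

def looks_like_utf16(data):
    """Heuristic: UTF-16 text usually starts with a BOM or is full of NUL bytes."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return True
    # ASCII-range characters encoded as UTF-16 carry one zero byte each.
    return data.count(b"\x00") > len(data) // 4

print(looks_like_utf16("Folder PATH listing".encode("utf-16-le")))  # no BOM, many NULs
print(looks_like_utf16(b"Folder PATH listing"))                      # single-byte text
```

On the real file you would feed it a chunk read in binary mode, for example open('tree.txt', 'rb').read(4096).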

In this line:

myfile = open('tree.txt')

you should specify the encoding of your file. On Windows, try:

myfile = open('tree.txt', encoding='cp1250')

2 Comments

This still throws the same error. It's just the output of a tree command in DOS, I am not sure why it would output something where encoding is such an issue.
The default encoding Python used was cp1252; you can see that from the traceback.
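
Rather than reading it off a traceback, you can also ask Python directly which encoding open() falls back to in text mode when none is given. A small sketch; the result depends on the machine and its locale settings:

```python
import locale

# The encoding open() uses for text files when no encoding= is passed.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```

If this prints the same codec you were about to pass to open(), specifying it explicitly will not change the outcome.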
