I'm sure this is terribly wrong, and I'm having a couple of problems. I've written out an array of WIN32_FIND_DATAW structures to disk, one after another, and I'd like to consume and parse them in my Python script.

The code I'm currently using is:

>>> fp = open('findData', 'r').read()
>>> data = ctypes.cast(fp, ctypes.POINTER(wintypes.WIN32_FIND_DATAW))
>>> print str(data[0].cFileName)

The first problem is that the third line doesn't print a nice string like I would expect. Instead of printing $Recycle.Bin it prints UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)

This is the result of just printing the data stored there:

>>> data[0].cFileName
u'\U00520024\U00630065\U00630079\U0065006c\U0042002e\U006e0069'

This looks relatively reasonable. $ is ASCII 0x24, R is ASCII 0x52 and so on.

So why can't I print it like a string?

My second question is that doing:

>>> data[1].cFileName

Gives me ridiculous data. I'm fairly sure I'm not using that ctypes.cast correctly. How should I be doing it to access these? To clarify, in C, I'd just point a PWIN32_FIND_DATAW pointer to the beginning of the buffer and access the individual structs in the array using similar code, and I'm trying to do the same in Python.

Update

Doing:

>>> data[0].cFileName.encode('windows-1252')

Yields this error:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-5: character maps to <undefined>

Update

The beginning of the first entry (data[0] up to the first part of cFileName) looks like the following:

user@ubuntu:~/data$ hexdump -C findData | head -n 6
00000000  16 00 00 00 dc 5a 9f d2  31 04 ca 01 ba 81 89 1a  |.....Z..1.......|
00000010  81 e2 cd 01 ba 81 89 1a  81 e2 cd 01 00 00 00 00  |................|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 24 00 52 00  |............$.R.|
00000030  65 00 63 00 79 00 63 00  6c 00 65 00 2e 00 42 00  |e.c.y.c.l.e...B.|
00000040  69 00 6e 00 00 00 00 00  00 00 00 00 00 00 00 00  |i.n.............|
00000050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

I can post more data if needed.

  • 1
    have you tried reading it in binary (rb)? Commented Mar 24, 2013 at 8:09
  • 1
    It seems that the two-byte Windows Unicode characters are being treated as four-byte Linux Unicode characters. The first character in the string isn't 0x24 but 0x520024. Where did the original file come from? Could you post some of the data you are trying to read? Commented Mar 24, 2013 at 8:36
  • 2
    How are you even importing ctypes.wintypes on Linux? Did you create a new wintypes module by copying from the original? A c_wchar is 2 bytes on Windows, but 4 bytes on other platforms. Please show what you're using for WIN32_FIND_DATAW on Linux. Commented Mar 24, 2013 at 8:49
  • 1
    Did a quick look at the python sources and it confirms that the native wchar_t is used for ctypes.c_wchar. Trying to find a solution. Commented Mar 24, 2013 at 9:00
  • 1
    Sorry, I didn't think that through. In structs, c_char arrays can be annoying because they try to create Python strings instead of just returning the array, so it stops at the first null. You'd need to use c_ubyte instead. Then it's bytearray(data[0].cFileName).decode('utf-16le'). Commented Mar 24, 2013 at 9:51
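The mismatch described in the comments can be reproduced without the original file. The following is a minimal sketch, written for Python 3 (an assumption, since the question uses Python 2); the variable names are illustrative. It reinterprets the UTF-16LE bytes of the file name as four-byte units, which is what a four-byte `c_wchar` would do:

```python
import ctypes

# The file name exactly as Windows wrote it: 2 bytes per character.
name_bytes = "$Recycle.Bin".encode("utf-16-le")

# Size of the native wchar_t that ctypes.c_wchar maps to:
# 2 on Windows, typically 4 on Linux.
wchar_size = ctypes.sizeof(ctypes.c_wchar)
print(wchar_size)

# Reading the same bytes as 4-byte little-endian units fuses each pair of
# adjacent characters into one value, e.g. '$' (0x24) and 'R' (0x52).
wide_units = [
    int.from_bytes(name_bytes[i:i + 4], "little")
    for i in range(0, len(name_bytes), 4)
]
print([hex(unit) for unit in wide_units])

# Decoding explicitly as UTF-16LE does not depend on the platform.
decoded = name_bytes.decode("utf-16-le")
print(decoded)
```

The first fused value is 0x520024, the same value that appears in the question's output, and it lies outside the Unicode range, which is consistent with the encode errors reported there.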

2 Answers

As already mentioned in the comments, this is due to differences between Windows and Linux: ctypes.c_wchar follows the native wchar_t, which is 2 bytes on Windows but 4 bytes on Linux. The ctypes module adapts to the local environment, hence the mismatch. The best solution is to use the struct module, which handles the data in a platform-independent manner. The following code shows how this can be done for a single record.

# Set up test data based on the incomplete sample in the question
data = b"\x16\x00\x00\x00\xdc\x5a\x9f\xd2\x31\x04\xca\x01\xba\x81\x89\x1a\x81\xe2\xcd\x01\xba\x81\x89\x1a\x81\xe2\xcd\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x24\x00\x52\x00\x65\x00\x63\x00\x79\x00\x63\x00\x6c\x00\x65\x00\x2e\x00\x42\x00\x69\x00\x6e\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"
# Pad with zero bytes up to the size of one full record
data = data + b"\x00" * (592 - len(data))

import struct
import codecs

# typedef struct _WIN32_FIND_DATA {
#   DWORD    dwFileAttributes;
#   FILETIME ftCreationTime;
#   FILETIME ftLastAccessTime;
#   FILETIME ftLastWriteTime;
#   DWORD    nFileSizeHigh;
#   DWORD    nFileSizeLow;
#   DWORD    dwReserved0;
#   DWORD    dwReserved1;
#   TCHAR    cFileName[MAX_PATH];
#   TCHAR    cAlternateFileName[14];
# } WIN32_FIND_DATA;
# In the W variant, TCHAR is a 2-byte WCHAR.

# "<" = little-endian, no padding: DWORD, 3 x FILETIME, 4 x DWORD,
# then 520 and 28 bytes of UTF-16 text
fmt = "<L3Q4L520s28s"

attrs, creation, access, write, sizeHigh, sizeLow, reserved0, reserved1, name, alternateName = struct.unpack(fmt, data)
# Decode as UTF-16LE and keep only the text before the first NUL terminator
name = codecs.utf_16_le_decode(name)[0].split('\x00', 1)[0]
alternateName = codecs.utf_16_le_decode(alternateName)[0].split('\x00', 1)[0]
print(name)

NOTE: This assumes that MAX_PATH is 260 (which should be true, but you never know), so that cFileName occupies 260 * 2 = 520 bytes and cAlternateFileName occupies 14 * 2 = 28 bytes.

To read all values from the file, open it in binary mode ('rb'), read it in blocks of 592 bytes (the size of one record), and decode each block as above.
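Reading the records block by block might look like the sketch below. It is written for Python 3 (an assumption, since the answer targets Python 2), and `FindData`, `decode_wchars` and `iter_find_data` are hypothetical names introduced only for illustration. The named tuple follows the `collections.namedtuple` suggestion from the comments.

```python
import struct
from collections import namedtuple

# Layout from the answer: DWORD, 3 x FILETIME, 4 x DWORD, WCHAR[260],
# WCHAR[14]. "<" selects little-endian byte order with no padding.
RECORD = struct.Struct("<L3Q4L520s28s")

# Hypothetical record type; the field names are illustrative.
FindData = namedtuple(
    "FindData",
    "attrs creation access write size_high size_low reserved0 reserved1 "
    "name alternate_name",
)


def decode_wchars(raw):
    """Decode a fixed-size UTF-16LE buffer, keeping the text before the first NUL."""
    text = raw.decode("utf-16-le", errors="replace")
    return text.split("\x00", 1)[0]


def iter_find_data(fileobj):
    """Yield one FindData per complete record read from a binary file object."""
    while True:
        block = fileobj.read(RECORD.size)
        if len(block) < RECORD.size:
            break  # end of file, or a truncated trailing record
        fields = RECORD.unpack(block)
        names = (decode_wchars(fields[8]), decode_wchars(fields[9]))
        yield FindData(*(fields[:8] + names))
```

With the file from the question, usage would be along the lines of opening it with `open('findData', 'rb')` and looping over `iter_find_data(fp)`, printing `entry.name` for each entry.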

7 Comments

This is awesome. When I run this on my own data set, it prints ␀刀攀挀礀挀氀攀⸀䈀椀渀 though. Your script prints correctly for me with your data. And it's on 32-bit Ubuntu.
Looks like using utf_16_be_decode fixes my problem, although I have no idea why. Thank you two so much for all your help!
@eryksun Thanks again. Haven't used structs that much before.
@omghai2u Found an extra character in my sample data. Updated it and my code based on previous suggestions from eryksun. Also note that the size of the data needs to be 592 and not 590.
@eryksun Thanks for the tip re `collections.namedtuple`. Leaving that for the reader :)
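The garbled output reported in the comments above is consistent with the name field being read one byte too early, which would also explain why big-endian decoding appeared to fix it. This is an inference from the posted output, not something the thread confirms; the sketch below (Python 3, illustrative variable names) reproduces the effect:

```python
import struct

fmt = "<L3Q4L520s28s"
record_size = struct.calcsize(fmt)  # size of one complete record
print(record_size)

name = "$Recycle.Bin".encode("utf-16-le")

# Simulate a field that starts one byte early: a stray zero byte precedes
# the real data, so every 2-byte unit has its two bytes swapped.
shifted = (b"\x00" + name)[:len(name)]

garbled = shifted.decode("utf-16-le")      # mostly CJK-looking characters
accidental = shifted.decode("utf-16-be")   # looks right, but only by accident
print(garbled)
print(accidental)
```

If that is what happened, the robust fix is to correct the record size and offsets (592 bytes per record) and keep decoding as UTF-16LE, instead of switching to the big-endian decoder.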

You should be using the struct module from the standard library (http://docs.python.org/2/library/struct.html), since you are parsing a binary file format. The ctypes module is meant for integrating shared libraries (DLLs) with a binary API into a Python app. I'm not saying that what you are trying to do is impossible, but using ctypes is more complicated than simply parsing C structs from a binary file.

Just remember that in C there is no such thing as a PWIN32_FIND_DATAW pointer. This is just a typedef that will resolve down to one of the raw C datatypes such as a 32-bit pointer, a 64-bit pointer, etc. The data in the file represents the raw base C datatypes.

In answer to the comment: avoid looking for shortcuts. You really do need a deep understanding of the bits that are being written to the file and how they are organized. For that you will likely need to do some hexdumps and check the actual data representation. According to MS (http://msdn.microsoft.com/en-ca/library/windows/desktop/aa365740(v=vs.85).aspx), this is not a particularly complex structure. If the structure in wintypes doesn't work for you, it is possible that you have found a bug. It is also possible that the on-disk structure is not identical to the in-RAM structure. Often an in-RAM data structure includes padding to maintain alignment on 16 or 64 byte boundaries. But programmers have been known NOT to dump the struct as is, but to pick it apart and write it to a file minus the padding. Since ctypes/wintypes is intended for making binary API calls to a DLL, its bias would be to include padding in the data layout, but the file might not include this.
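If a ctypes-style structure is still preferred, one possible approach is to declare the layout by hand with fixed-width integer types, so that it no longer depends on the platform's wchar_t, and then check the size to see whether any padding was introduced. The sketch below is written for Python 3; `FileTime`, `FindDataW`, `wchars_to_str` and `parse_records` are hypothetical names, not part of ctypes or wintypes, and the layout assumes the file is a raw dump of the Windows structures.

```python
import ctypes


class FileTime(ctypes.LittleEndianStructure):
    # Two 32-bit halves, as in the Windows FILETIME. A single 64-bit field
    # would raise the alignment to 8 and introduce padding after the
    # leading DWORD of the outer structure.
    _fields_ = [
        ("dwLowDateTime", ctypes.c_uint32),
        ("dwHighDateTime", ctypes.c_uint32),
    ]


class FindDataW(ctypes.LittleEndianStructure):
    _fields_ = [
        ("dwFileAttributes", ctypes.c_uint32),
        ("ftCreationTime", FileTime),
        ("ftLastAccessTime", FileTime),
        ("ftLastWriteTime", FileTime),
        ("nFileSizeHigh", ctypes.c_uint32),
        ("nFileSizeLow", ctypes.c_uint32),
        ("dwReserved0", ctypes.c_uint32),
        ("dwReserved1", ctypes.c_uint32),
        # 16-bit units instead of c_wchar, whose size varies by platform
        ("cFileName", ctypes.c_uint16 * 260),
        ("cAlternateFileName", ctypes.c_uint16 * 14),
    ]


def wchars_to_str(array):
    """Decode a c_uint16 array as UTF-16LE, keeping the text before the first NUL."""
    raw = bytes(array)  # raw memory of the array via the buffer protocol
    return raw.decode("utf-16-le", errors="replace").split("\x00", 1)[0]


def parse_records(data):
    """Copy each complete record out of a bytes object into a FindDataW."""
    size = ctypes.sizeof(FindDataW)
    return [
        FindDataW.from_buffer_copy(data, offset)
        for offset in range(0, len(data) - size + 1, size)
    ]
```

Here `ctypes.sizeof(FindDataW)` comes out at 592, matching the record size used with the struct module, which suggests this layout contains no hidden padding. Whether the file on disk matches it still has to be confirmed against a hexdump.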

1 Comment

Sounds great. I was just hoping to use the WIN32_FIND_DATA structure already in wintypes. To use the struct module, my question now becomes how do I create the WIN32_FIND_DATA structure with struct? And how will I unpack the multiple WIN32_FIND_DATA structures in that file?
