Python: How to decode file names retrieved from 'dir' command using subprocess?

Question

I am trying to get a directory listing on a Windows 10 file system using the subprocess.Popen function and the dir command in Python 3.8.2. To be more specific, I have this piece of code:

import subprocess

process = subprocess.Popen(['dir'], shell = True, stdout = subprocess.PIPE, stderr = subprocess.STDOUT)
for line in iter(process.stdout.readline, b''):
  print(line.decode('utf-16'))
process.stdout.close()

When I run the above in a directory that has file names with Unicode characters (such as "háčky a čárky.txt"), I get the following error:

Traceback (most recent call last):
  File "error.py", line 5, in <module>
    print(line.decode('utf-16'))
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x0a in position 42: truncated data

Obviously, the problem is with the encoding. I have tried using 'utf-8' instead of 'utf-16', but with no success. When I remove the decode('utf-16') call and use just print(line), I get the following output:

b' Volume in drive C is OSDisk\r\n'
b' Volume Serial Number is 9E2B-67E3\r\n'
b'\r\n'
b' Directory of C:\\Users\\asamec\\Dropbox\\DIY\\Python\\AccessibleRunner\\AccessibleRunner\r\n'
b'\r\n'
b'05/14/2021  09:19 AM    <DIR>          .\r\n'
b'05/14/2021  09:19 AM    <DIR>          ..\r\n'
b'05/13/2021  09:46 PM             5,697 AccessibleRunner.py\r\n'
b'05/14/2021  09:18 AM               214 error.py\r\n'
b'05/13/2021  05:48 PM             5,642 h\xa0cky a c\xa0rky.txt.py\r\n'
b'               3 File(s)         11,553 bytes\r\n'
b'               2 Dir(s)  230,706,778,112 bytes free\r\n'

When I remove the 'utf-16' argument and leave just print(line.decode()), I get the following error:

Traceback (most recent call last):
  File "error.py", line 5, in <module>
    print(line.decode())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 40: invalid start byte

So the question is: how should I decode the process's standard output so that I can print the correct characters?

Update

Running the chcp 65001 command in the Windows command line before running the Python script solves the problem. However, the following gives me the same error as above:

import subprocess

process = subprocess.Popen(['cmd', '/c', 'chcp 65001 & dir'], shell = True, stdout = subprocess.PIPE, stderr = subprocess.STDOUT)
for line in iter(process.stdout.readline, b''):
  print(line.decode('utf-16'))
process.stdout.close()

However, when running this same Python script for the second time, it starts to work as the code page is already set to 65001. So the question now is how can I set the Windows console code page not prior to running the Python script, but rather in that Python script?

There are plenty more direct ways to get the contents of a directory than trying to parse the stdout of dir - why mess around with the funny edge cases of this method?
– esqew, Commented May 13, 2021 at 18:14
I am building a simple command line and dir is just an example of a command that could be run in that tool.
– Adam, Commented May 13, 2021 at 18:50
What if you use print(line) ### .decode('utf-16'))? Please include that info for "háčky a čárky.txt" in your minimal reproducible example. For me it's UTF-8 b'h\xc3\xa1\xc4\x8dky a \xc4\x8d\xc3\xa1rky.txt\r\n' because my REG QUERY "HKLM\SYSTEM\CurrentControlSet\Control\Nls\CodePage" -v *CP returns 65001 in ACP as well as OEMCP; yours could be different… print(line.decode()) should work.
– JosefZ, Commented May 13, 2021 at 21:05
@JosefZ I have updated the question to address your suggestions.
– Adam, Commented May 14, 2021 at 8:08
Have you set the PYTHONIOENCODING environment variable? Mine is PYTHONIOENCODING=utf-8.
– JosefZ, Commented May 14, 2021 at 11:18

JosefZ · Accepted Answer · 2021-05-14 10:46:46Z

0

Set the console to UTF-8 before running the script (use CHCP 65001). The script then runs smoothly:

.\SO\67524114.py

Active code page: 65001
HL~Real~Def.txt
html.txt
háčky a čárky.txt

I can reproduce the issue using the following call:

>NUL chcp 852
.\SO\67524114.py

Active code page: 852
HL~Real~Def.txt
html.txt
Traceback (most recent call last):
  File "D:\bat\SO\67524114.py", line 7, in <module>
    print(line.decode('utf-8').strip())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 1: invalid start byte

Modified script used for testing:

import subprocess

process = subprocess.Popen(['cmd', '/c', 'chcp&dir /B h*.txt'], shell = True, stdout = subprocess.PIPE, stderr = subprocess.STDOUT)
for line in iter(process.stdout.readline, b''):
  print(line.decode('utf-8').strip())

process.stdout.close()
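To illustrate why the same bytes decode under one code page and fail under another, here is a minimal sketch that does not need a Windows console. It assumes the console emitted the file name in OEM code page 852, as in the reproduction above; try_decode is a hypothetical helper introduced only for this illustration.

```python
def try_decode(raw: bytes, encoding: str) -> str:
    """Decode raw with the given codec, or describe why decoding failed."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"<{encoding} failed: {exc.reason}>"


name = "háčky a čárky.txt"
# Assumption: these are the bytes a console set to code page 852 would emit.
raw = name.encode("cp852")
print(raw)  # a bytes repr is ASCII-only, so it is safe to print anywhere

# Try the codecs that appear in the question and in this answer.
results = {
    encoding: try_decode(raw, encoding)
    for encoding in ("utf-8", "utf-16", "cp852")
}
```

Only the codec that matches the code page used by the producer round-trips the name; in this sketch the second byte is 0xa0, the same byte reported in the traceback above.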

answered May 14, 2021 at 10:46

JosefZ

Adam Over a year ago

Thanks, this is almost the solution, however not exactly. Please, see the updated question.

Adam · Accepted Answer · 2021-05-14 22:36:20Z

0

As @JosefZ suggested in his answer, the UTF-8 code page must be set in the Windows command line prior to running the dir command. Below is the complete solution to my question:

import subprocess

subprocess.call(['chcp', '65001'], shell = True)
process = subprocess.Popen(['dir'], shell = True, stdout = subprocess.PIPE, stderr = subprocess.STDOUT)
for line in iter(process.stdout.readline, b''):
  print(line.decode('utf-8'))
process.stdout.close()
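A possible variation, offered as a sketch rather than a verified replacement: switch the code page and run the command inside a single cmd.exe invocation, then let subprocess decode the output as UTF-8. The helper names build_utf8_command and run_in_utf8_console are hypothetical. The sketch assumes a console is attached to the process, and chcp still changes the code page of that console, so the setting remains after the script exits.

```python
import subprocess
import sys
from typing import List


def build_utf8_command(command: str) -> List[str]:
    """Wrap a cmd.exe command so the code page is switched to UTF-8 first."""
    # The chcp message is discarded so it does not end up in the captured output.
    return ["cmd", "/c", f"chcp 65001 >NUL & {command}"]


def run_in_utf8_console(command: str) -> List[str]:
    """Run a command through cmd.exe and return its output lines as str."""
    completed = subprocess.run(
        build_utf8_command(command),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        encoding="utf-8",
        errors="replace",  # substitute U+FFFD instead of raising on a stray byte
    )
    return completed.stdout.splitlines()


# cmd.exe exists only on Windows, so the call is guarded.
if __name__ == "__main__" and sys.platform == "win32":
    for line in run_in_utf8_console("dir"):
        print(line)
```

Passing encoding to subprocess.run() removes the manual line.decode('utf-8') call, and errors="replace" keeps one unexpected byte from aborting the whole listing.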

answered May 14, 2021 at 22:36

Adam

Brainor · Accepted Answer · 2022-03-01 08:20:24Z

Since Python 3.6 (released in December 2016), subprocess.run() accepts an encoding parameter, so you can specify the encoding used to decode the output.

So, if you don't want to change the encoding of the CMD:

1. Type chcp in your CMD to get the active code page, e.g. 936.
2. Look up the encoding in Code Page Identifiers, e.g. Identifier(936): .NET Name(gb2312). gb2312 is an encoding name that Python recognizes in most cases, but you can check the Standard Encodings of Python 3.10 to be sure (thanks to Mark Amery).
3. Add encoding='gb2312' to your subprocess.run() call:

process_list = subprocess.run('dir', shell = True, stdout = subprocess.PIPE, stderr = subprocess.STDOUT, text=True, encoding='gb2312').stdout.split('\n')[:-1]
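The code page lookup described above can also be attempted from Python instead of typing chcp by hand. The following is a sketch under assumptions: console_output_encoding is a hypothetical helper, it relies on the Win32 GetConsoleOutputCP function reached through ctypes, and it falls back to a default on other platforms or when no console is attached. Python accepts names of the form cp936 or cp852 directly, so no lookup table is used.

```python
import codecs
import sys


def console_output_encoding(fallback: str = "utf-8") -> str:
    """Return a Python codec name for the console's active output code page."""
    if sys.platform != "win32":
        return fallback  # console code pages are a Windows concept
    import ctypes

    code_page = ctypes.windll.kernel32.GetConsoleOutputCP()
    if code_page == 0:  # 0 signals failure, for example no console attached
        return fallback
    name = f"cp{code_page}"
    try:
        codecs.lookup(name)  # make sure Python ships a codec with this name
    except LookupError:
        return fallback
    return name


encoding = console_output_encoding()
```

The result could then be passed as the encoding argument of subprocess.run() in place of a hard-coded name.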
The subprocess.Popen constructor also has an encoding parameter if you really want to stick with Popen, although the documentation states: "The recommended approach to invoking subprocesses is to use the run() function for all use cases it can handle."
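For the Popen route, a minimal sketch of line-by-line reading with the encoding parameter might look like this. read_lines is a hypothetical helper, and the child process here is a stand-in that writes cp852 bytes, so the example runs on any platform; on Windows the argument list could instead be something like ['cmd', '/c', 'dir'] with the console's actual code page.

```python
import subprocess
import sys
from typing import List


def read_lines(args: List[str], encoding: str) -> List[str]:
    """Run a child process and return its output lines decoded as str."""
    with subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        encoding=encoding,  # stdout becomes a text stream using this codec
        errors="replace",
    ) as process:
        # Iterating the text stream yields one decoded line at a time.
        return [line.rstrip("\r\n") for line in process.stdout]


# Stand-in child: writes a file name encoded in cp852, like a console would.
payload = "háčky a čárky.txt\r\n".encode("cp852")
child = [sys.executable, "-c", f"import sys; sys.stdout.buffer.write({payload!r})"]
lines = read_lines(child, "cp852")
```

With encoding set, Popen hands back str lines directly, so the explicit line.decode(...) call from the question is no longer needed.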

If you want to change the encoding of the CMD, refer to the answer by JosefZ.
