Performance: fastest way of reading in files with Python

Question

So I have about 400 files ranging from 10kb to 56mb in size, file type being .txt/.doc(x)/.pdf/.xml and I have to read them all. My read in files are basically:

#for txt files
with open("TXT\\" + path, 'r') as content_file:
    content = content_file.read().split(' ')

#for doc files using pydoc
contents = '\n'.join([para.text for para in doc.paragraphs]).encode("ascii","ignore").decode("utf-8").split(' ')

#for pdf files using pypdf2
for i in range(0, pdf.getNumPages()):
    content += pdf.getPage(i).extractText() + "\n"
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
contents = content.encode("ascii","ignore").decode("utf-8").split(' ')

#for xml files using lxml
tree = etree.parse(path)
contents = etree.tostring(tree, encoding='utf8', method='text')
contents = contents.decode("utf-8").split(' ')

But I notice even reading 30 text files with under 50kb size each and doing operations on it will take 41 seconds. But If I read a single text file with 56mb takes me 9 seconds. So I'm guessing that it's the file I/O that's slowing me down instead of my program.

Any idea on how to speed up this process? Maybe break down each file type into 4 different threads? But how would you go about doing that since they are sharing the same list and that single list will be written to a file when they are done.

If you have working code that is underperforming, you really should try a profiler to identify where the code is slow, rather than guessing. — user1277476
– user1277476, Commented Nov 26, 2014 at 23:14
Do you have a profiler you like to use? or how would I go about doing it other then sticking print time since start of program everywhere? — MooingRawr
– MooingRawr, Commented Nov 26, 2014 at 23:17
Even without a profiler, you can very quickly verify whether you're right about the I/O operations: Just test how long it takes to read the files and do nothing, vs. how long it takes to do your processing. If it's about the same, you're right, it's definitely the I/O time. If it's a lot faster… well, it might still be I/O time (e.g., maybe the module you're using does a lot of inefficient seeks or small reads), but it might not be, so you need to profile. — abarnert
– abarnert, Commented Nov 26, 2014 at 23:22

abarnert · Accepted Answer · 2014-12-01 19:35:07Z

1

If you're blocked on file I/O, as you suspect, there's probably not much you can do.

But parallelizing to different threads might help if you have great bandwidth but terrible latency. Especially if you're dealing with, say, a networked filesystem or a multi-platter logical drive. So, it can't hurt to try.

But there's no reason to do it per file type; just use a single pool to handle all your files. For example, using the futures module:*

import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = executor.map(process_file, list_of_filenames)

A ThreadPoolExecutor is slightly smarter than a basic thread pool, because it lets you build composable futures, but here you don't need any of that, so I'm just using it as a basic thread pool because Python doesn't have one of those.**

The constructor creates 4 threads, and all the queues and anything else needed to manage putting tasks on those threads and getting results back.

Then, the map method just goes through each filename in list_of_filenames, creates a task out of calling process_file on that filename, submits it to the pool, and then waits for all of the tasks to finish.

In other words, this is the same as writing:

results = [process_file(filename) for filename in list_of_filenames]

… except that it uses four threads to process the files in parallel.

There are some nice examples in the docs if this isn't clear enough.

_{* If you're using Python 2.x, you'll need to install a backport before you can use this. Or you can use multiprocessing.dummy.Pool instead, as noted below.}

_{** Actually, it does, in multiprocessing.dummy.Pool, but that's not very clearly documented.}

edited Dec 1, 2014 at 19:35

answered Nov 26, 2014 at 23:04

abarnert

368k54 gold badges626 silver badges691 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

MooingRawr Over a year ago

Not really sure what this is doing. Would you mind to explain it, or I'll look it up when I have time to. Thanks for the reply.

user2379410 Over a year ago

map(...) should be executor.map(...)?

abarnert Over a year ago

@moarningsun: Yes, definitely. Thanks for catching that!

Collectives™ on Stack Overflow

Performance: fastest way of reading in files with Python

1 Answer 1

3 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

3 Comments

Your Answer

Sign up or log in

Post as a guest

Related