So I have about 400 files ranging from 10 KB to 56 MB in size, of types .txt, .doc(x), .pdf, and .xml, and I have to read them all. My code for reading the files is basically:
#for txt files
with open("TXT\\" + path, 'r') as content_file:
    content = content_file.read().split(' ')
#for doc files using python-docx
contents = '\n'.join([para.text for para in doc.paragraphs]).encode("ascii","ignore").decode("utf-8").split(' ')
#for pdf files using PyPDF2
for i in range(0, pdf.getNumPages()):
    content += pdf.getPage(i).extractText() + "\n"
content = " ".join(content.replace(u"\xa0", " ").strip().split())
contents = content.encode("ascii","ignore").decode("utf-8").split(' ')
#for xml files using lxml
tree = etree.parse(path)
contents = etree.tostring(tree, encoding='utf8', method='text')
contents = contents.decode("utf-8").split(' ')
But I notice that even reading 30 text files of under 50 KB each and doing operations on them takes 41 seconds, whereas reading a single 56 MB text file takes 9 seconds. So I'm guessing that it's the file I/O that's slowing me down rather than my program.
Any ideas on how to speed up this process? Maybe split the four file types across 4 different threads? But how would you go about doing that, since the threads would share the same list, and that single list is written to a file when they are done?
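One way the threading idea could be structured without any shared list is sketched below. This is only a rough sketch under assumptions: `read_txt` and `read_all` are hypothetical names, and the reader shown handles plain text only. Each worker returns its own word list, and the main thread alone merges them, so no locking is needed.

```python
from concurrent.futures import ThreadPoolExecutor


def read_txt(path):
    # Hypothetical reader for one file type; returns that file's own word list.
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        return f.read().split(" ")


def read_all(paths, reader, max_workers=4):
    """Run `reader` on each path in a thread pool and merge the results in order."""
    words = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order; only this thread touches `words`.
        for file_words in pool.map(reader, paths):
            words.extend(file_words)
    return words
```

Threads are only likely to help if the time really is spent waiting on I/O. If the per-file parsing is CPU-bound, the GIL limits what threads can gain, and `ProcessPoolExecutor` from the same module may be worth trying instead.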
Time how long it takes to just read the files and do nothing, vs. how long it takes to do your processing. If it's about the same, you're right, it's definitely the I/O time. If it's a lot faster… well, it might still be I/O time (e.g., maybe the module you're using does a lot of inefficient seeks or small reads), but it might not be, so you need to profile.
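A minimal sketch of that comparison, assuming plain byte reads are a fair stand-in for the I/O cost. The function names are hypothetical, and `split_words` is only a placeholder for whatever the real processing is:

```python
import time
from pathlib import Path


def time_read_only(paths):
    """Read every file fully and discard the bytes; returns elapsed seconds."""
    start = time.perf_counter()
    for p in paths:
        Path(p).read_bytes()
    return time.perf_counter() - start


def time_read_and_process(paths, process):
    """Read every file and run `process` on its bytes; returns elapsed seconds."""
    start = time.perf_counter()
    for p in paths:
        process(Path(p).read_bytes())
    return time.perf_counter() - start


def split_words(data):
    # Hypothetical stand-in for the real per-file work.
    return data.decode("utf-8", errors="ignore").split(" ")
```

Keep in mind that the operating system caches recently read files, so whichever measurement runs second may look faster than it would on a cold read. Running each a few times, or in alternating order, gives a fairer picture.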