
I need to process a large text file (4 GB) containing data like:

12 23 34
22 78 98
76 56 77

I need to read each line and do some work based on it. Currently I am doing:

sample = 'filename.txt'

with open(sample) as f:
    for line in f:
        line = line.split()
        line = [int(i) for i in line]
        a = line[0]
        b = line[1]
        c = line[2]
        do_someprocess()

It is taking a huge amount of time to execute. Is there a better way to do this in Python?

  • What does do_someprocess() do? Are you sure that split() and int() are the functions taking the most time? Commented Nov 17, 2014 at 5:43
  • You can run python -m cProfile myscript.py so you're sure to optimize the right functions. Commented Nov 17, 2014 at 5:44

2 Answers


If do_someprocess() takes a long time compared to reading the lines, and you have spare CPU cores, you could use the multiprocessing module.
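A minimal sketch of that idea (Python 3 syntax; do_someprocess here is a stand-in that just sums the three ints, and the list of lines stands in for the open file object):

```python
from multiprocessing import Pool

def do_someprocess(line):
    # Stand-in for the real per-line work: parse the three ints and sum them.
    a, b, c = [int(s) for s in line.split()]
    return a + b + c

if __name__ == "__main__":
    # In the real script you would pass the open file object here instead.
    lines = ["12 23 34", "22 78 98", "76 56 77"]
    with Pool(processes=2) as pool:
        # chunksize keeps inter-process overhead low on big inputs.
        results = pool.map(do_someprocess, lines, chunksize=1000)
    print(results)  # [69, 198, 209]
```

Note this only pays off when the per-line work dominates; for a loop that does nothing but split() and int(), the pickling overhead between processes can eat the gains.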

Try using PyPy if possible. For some compute-intensive tasks it is dozens of times faster than CPython.

If there are a lot of duplicate ints in the file, it can, surprisingly, be faster to use a dict mapping tokens to ints than to call int() each time, since it saves the cost of creating new int objects.
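A sketch of that trick (the names here are illustrative): keep a dict from string token to parsed int, and only call int() on a cache miss:

```python
cache = {}

def to_int(token):
    # int() builds a new object on every call; the dict hands back the one
    # we already built for a previously seen token.
    try:
        return cache[token]
    except KeyError:
        value = cache[token] = int(token)
        return value

values = [to_int(tok) for tok in "12 23 12".split()]
print(values)      # [12, 23, 12]
print(len(cache))  # 2 distinct tokens parsed
```

Whether this beats plain int() depends on how skewed the data is, so measure it on a real slice of the file.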

The first step is to profile, as @nathancahill suggests in the comments. Then focus your efforts on the parts where the biggest gains can be made.
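For example, a minimal cProfile run over the parsing loop (the function name and the synthetic input are made up for illustration) looks like:

```python
import cProfile
import io
import pstats

def parse(lines):
    # Same shape as the OP's loop, minus the file I/O.
    for line in lines:
        a, b, c = [int(s) for s in line.split()]

pr = cProfile.Profile()
pr.enable()
parse(["12 23 34"] * 100000)
pr.disable()

out = io.StringIO()
pstats.Stats(pr, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())  # top five entries show where the time actually goes
```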



split() returns a list, and then you access the first, second and third elements with:

line = [int(i) for i in line]
a = line[0]
b = line[1]
c = line[2]

Instead of that, you can directly say a, b, c = line.split(); then a will contain line[0], b will contain line[1], and c will contain line[2]. This should save you some time.

with open(sample) as f:
    for line in f:
        a, b, c = line.split()
        do_someprocess()

An example:

with open("sample.txt","r") as f:
    for line in f:
        a, b, c = line.split()
        print a, b, c

Contents of sample.txt:

12 34 45
78 67 45

Output:

12 34 45
78 67 45

EDIT: I thought I'd elaborate on this. I have used the timeit module to compare the time taken by the two versions. Please let me know if I'm doing something wrong here. The following is the OP's way of writing the code.

v = """with open("sample.txt","r") as f:
    for line in f:
        line = line.split()
        line = [int(i) for i in line]
        a = line[0]
        b = line[1]
        c = line[2]"""
import timeit
print timeit.timeit(stmt=v, number=100000)

Output:

8.94879606286   ## seconds to complete 100000 times.

The following is my way of writing the code.

s = """with open("sample.txt","r") as f:
    for line in f:
        a, b, c = [int(s) for s in line.split()]"""

import timeit
print timeit.timeit(stmt=s, number=100000)

Output:

7.60287380216 ## seconds to complete same number of times.

Comments

  • Note that this will fail if line has more than 3 elements. Better to say a, b, c = line.split()[:3]
  • OP just mentioned only three data values in his/her example.
  • Your code is skipping the step of converting the values to int.
  • @gnibbler You were right. I have edited my code now.
