
I need to process a large text file (4 GB) containing data like:

12 23 34
22 78 98
76 56 77

I need to read each line and do some work based on it. Currently I am doing:

sample = 'filename.txt'

with open(sample) as f:
    for line in f:
        line = line.split()
        line = [int(i) for i in line]
        a = line[0]
        b = line[1]
        c = line[2]
        do_someprocess()

It is taking a huge amount of time to execute. Is there a better way to do this in Python?

  • What does do_someprocess() do? Are you sure that split() and int() are the functions taking the most time? Commented Nov 17, 2014 at 5:43
  • You can run python -m cProfile myscript.py so you're sure to optimize the right functions. Commented Nov 17, 2014 at 5:44

2 Answers


If do_someprocess() takes a long time compared to reading the lines, and you have spare CPU cores, you could use the multiprocessing module.
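A minimal sketch of that idea (Python 3 syntax; do_someprocess here is a stand-in that just sums the three ints, and the list of lines stands in for the open file object):

```python
from multiprocessing import Pool

def do_someprocess(line):
    # Stand-in for the real per-line work: parse the three ints and sum them.
    a, b, c = [int(s) for s in line.split()]
    return a + b + c

if __name__ == "__main__":
    # In the real script you would pass the open file object here instead.
    lines = ["12 23 34", "22 78 98", "76 56 77"]
    with Pool(processes=2) as pool:
        # chunksize keeps inter-process overhead low on big inputs.
        results = pool.map(do_someprocess, lines, chunksize=1000)
    print(results)  # [69, 198, 209]
```

Note this only pays off when the per-line work dominates; for a loop that does nothing but split() and int(), the pickling overhead between processes can eat the gains.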

Try using PyPy if possible. For some compute-intensive tasks it is dozens of times faster than CPython.

If there are a lot of duplicate ints in the file, it can, surprisingly, be faster to use a dict mapping tokens to ints than to call int() each time, since it saves the cost of creating new int objects.
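A sketch of that trick (the names here are illustrative): keep a dict from string token to parsed int, and only call int() on a cache miss:

```python
cache = {}

def to_int(token):
    # int() builds a new object on every call; the dict hands back the one
    # we already built for a previously seen token.
    try:
        return cache[token]
    except KeyError:
        value = cache[token] = int(token)
        return value

values = [to_int(tok) for tok in "12 23 12".split()]
print(values)      # [12, 23, 12]
print(len(cache))  # 2 distinct tokens parsed
```

Whether this beats plain int() depends on how skewed the data is, so measure it on a real slice of the file.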

The first step is to profile, as @nathancahill suggests in the comments. Then focus your efforts on the parts where the biggest gains can be made.
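For example, a minimal cProfile run over the parsing loop (the function name and the synthetic input are made up for illustration) looks like:

```python
import cProfile
import io
import pstats

def parse(lines):
    # Same shape as the OP's loop, minus the file I/O.
    for line in lines:
        a, b, c = [int(s) for s in line.split()]

pr = cProfile.Profile()
pr.enable()
parse(["12 23 34"] * 100000)
pr.disable()

out = io.StringIO()
pstats.Stats(pr, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())  # top five entries show where the time actually goes
```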



split() returns a list, and then you access the first, second and third elements with:

line = [int(i) for i in line]
a = line[0]
b = line[1]
c = line[2]

Instead of that, you can directly say a, b, c = line.split(); then a will contain line[0], b will contain line[1], and c will contain line[2]. This should save you some time.

with open(sample) as f:
    for line in f:
        a, b, c = line.split()
        do_someprocess()

An example:

with open("sample.txt","r") as f:
    for line in f:
        a, b, c = line.split()
        print a, b, c

Contents of sample.txt:

12 34 45
78 67 45

Output:

12 34 45
78 67 45

EDIT: I thought I'd elaborate on this. I have used the timeit module to compare the time taken by the two versions. Please let me know if I'm doing something wrong here. The following is the OP's way of writing the code.

v = """with open("sample.txt","r") as f:
    for line in f:
        line = line.split()
        line = [int(i) for i in line]
        a = line[0]
        b = line[1]
        c = line[2]"""
import timeit
print timeit.timeit(stmt=v, number=100000)

Output:

8.94879606286   ## seconds to complete 100000 times.

The following is my way of writing the code.

s = """with open("sample.txt","r") as f:
    for line in f:
        a, b, c = [int(s) for s in line.split()]"""

import timeit
print timeit.timeit(stmt=s, number=100000)

Output:

7.60287380216 ## seconds to complete same number of times.

Comments

  • Note that this will fail if line has more than 3 elements. Better to say a, b, c = line.split()[:3]
  • OP just mentioned only three data values in his/her example.
  • Your code is skipping the step of converting the values to int.
  • @gnibbler You were right. I have edited my code now.
