Is there a way to read input from a file other than with a for loop? I am using:

import fileinput

data = fileinput.input()
c = [int(i) for i in data]
c.sort()

But for a very large amount of data, it takes too long to process. The input has the following format:

58457907
37850775
19743393
70718573
....
4
  • You are essentially processing the file three times: the first line reads the entire file, then you process each line, and then you sort it, so no wonder it takes a long time. Do you really need to sort it at the end? Try just opening the file and reading through it. Commented Sep 24, 2014 at 17:44
  • Almost any construct has an implicit loop. Why are you avoiding an explicit loop? Commented Sep 24, 2014 at 17:52
  • It's an 888 KB text file with around 100,002 lines, and sorting is required for further processing. Commented Sep 24, 2014 at 17:53
  • fileinput adds some overhead. If you are sensitive to time, you may consider opening the files yourself. Commented Sep 24, 2014 at 18:36
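The overhead mentioned in the last comment can be checked with a rough sketch like the one below. It assumes Python 3; the function names via_fileinput and via_open and the generated test file are hypothetical, chosen only for illustration. No timings are shown because they depend on the machine and interpreter.

```python
import fileinput
import os
import tempfile
import timeit
from random import randint


def via_fileinput(path):
    # FileInput is used directly so no module-level fileinput state is shared.
    with fileinput.FileInput(files=(path,)) as data:
        return sorted(int(line) for line in data)


def via_open(path):
    # Plain open(): iterate over the file object line by line.
    with open(path) as fin:
        return sorted(int(line) for line in fin)


# Hypothetical test data: 100002 random integers, one per line.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as fout:
    for _ in range(100002):
        fout.write("{}\n".format(randint(10, 1000020000)))

try:
    # Both readers should produce exactly the same sorted list.
    same_result = via_fileinput(path) == via_open(path)
    for func in (via_fileinput, via_open):
        seconds = timeit.timeit(lambda: func(path), number=3)
        print("{}: {:.4f} s for 3 runs".format(func.__name__, seconds))
finally:
    os.remove(path)
```

Running it prints one line per reader, so the two totals can be compared directly.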

2 Answers

4

If I create a 'large' file:

from random import randint

with open('/tmp/nums.txt', 'w') as fout:
    # Integer division keeps both bounds ints, as randint requires on Python 3.
    a, b = 100002 // 10000, 100002 * 10000
    for i in range(100002):
        fout.write('{}\n'.format(randint(a, b)))

I can read it, convert it to integers, and sort the data in place like this:

with open('/tmp/nums.txt') as fin:    
    nums=[int(e) for e in fin]
    nums.sort()

The total time for this operation is 50 ms on my computer. Is 50 ms a long time?


With a more formal timing:

def f1():
    with open('/tmp/nums.txt') as fin:    
        nums=[int(e) for e in fin]
        nums.sort()
    return nums

def f2():
    with open('/tmp/nums.txt') as fin:  
        return sorted(map(int, fin))

def f3():
    with open('/tmp/nums.txt') as fin:  
        nums=list(map(int, fin))
        nums.sort()    
    return nums    

if __name__ =='__main__':
    import timeit     
    import sys
    if sys.version_info.major==2:
        from itertools import imap as map

    result=[]    
    for f in (f1, f2, f3):
        fn=f.__name__
        fs="f()"
        ft=timeit.timeit(fs, setup="from __main__ import f", number=3)
        r=eval(fs)
        result.append((ft, fn, str(r[0:5])+'...'+str(r[-6:-1]) ))         

    result.sort(key=lambda t: t[0])    

    for i, t in enumerate(result):
        ft, fn, r = t
        if i==0:
            fr='{}: {:.4f} secs is fastest\n\tf(x)={}\n========'.format(fn, ft, r)   
        else:
            t1=result[0][0]
            dp=(ft-t1)/t1
            fr='{}: {:.4f} secs - {} is {:.2%} faster\n\tf(x)={}'.format(fn, ft, result[0][1], dp, r)

        print(fr)

You can see that the differences between these are not huge (except for PyPy where f3 clearly has an advantage):

Python 2.7.8:

f3: 0.2630 secs is fastest
    f(x)=[3025, 18834, 19637, 29124, 42088]...[999964829, 999970030, 999984585, 1000005692, 1000010131]
========
f2: 0.2641 secs - f3 is 0.41% faster
    f(x)=[3025, 18834, 19637, 29124, 42088]...[999964829, 999970030, 999984585, 1000005692, 1000010131]
f1: 0.2779 secs - f3 is 5.67% faster
    f(x)=[3025, 18834, 19637, 29124, 42088]...[999964829, 999970030, 999984585, 1000005692, 1000010131]

Python 3.4.1:

f2: 0.1873 secs is fastest
    f(x)=[3025, 18834, 19637, 29124, 42088]...[999964829, 999970030, 999984585, 1000005692, 1000010131]
========
f3: 0.1881 secs - f2 is 0.41% faster
    f(x)=[3025, 18834, 19637, 29124, 42088]...[999964829, 999970030, 999984585, 1000005692, 1000010131]
f1: 0.2071 secs - f2 is 10.59% faster
    f(x)=[3025, 18834, 19637, 29124, 42088]...[999964829, 999970030, 999984585, 1000005692, 1000010131]

PyPy:

f3: 0.1300 secs is fastest
    f(x)=[3025, 18834, 19637, 29124, 42088]...[999964829, 999970030, 999984585, 1000005692, 1000010131]
========
f2: 0.1428 secs - f3 is 9.81% faster
    f(x)=[3025, 18834, 19637, 29124, 42088]...[999964829, 999970030, 999984585, 1000005692, 1000010131]
f1: 0.2223 secs - f3 is 70.94% faster
    f(x)=[3025, 18834, 19637, 29124, 42088]...[999964829, 999970030, 999984585, 1000005692, 1000010131]

PyPy3:

f3: 0.2483 secs is fastest
    f(x)=[3025, 18834, 19637, 29124, 42088]...[999964829, 999970030, 999984585, 1000005692, 1000010131]
========
f2: 0.2588 secs - f3 is 4.23% faster
    f(x)=[3025, 18834, 19637, 29124, 42088]...[999964829, 999970030, 999984585, 1000005692, 1000010131]
f1: 0.2878 secs - f3 is 15.88% faster
    f(x)=[3025, 18834, 19637, 29124, 42088]...[999964829, 999970030, 999984585, 1000005692, 1000010131]

4 Comments

Using fileinput.FileInput instead of open added about 65% more time in some tests I did.
Yeah, it's much faster than my previous one, thanks. But here I am using an input file, which will be provided explicitly through parameters.
@AbhishekSharma: Is your program being given the name of a file, or the 100,002 numbers through stdin? fileinput supports either or both.
I first wrote the code with stdin, but for testing purposes I used a test-case file, which took a long time to execute.
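For the case discussed in these comments, where the numbers may arrive either through a file named on the command line or through stdin, one possible approach without fileinput is sketched below. It assumes Python 3; the helper names read_sorted_ints and load are hypothetical and do not come from the answer.

```python
import io
import sys


def read_sorted_ints(stream):
    """Return the integers found one per line in `stream`, sorted."""
    return sorted(map(int, stream))


def load(argv):
    """Read from the file named in argv[1] if given, otherwise from stdin."""
    if len(argv) > 1:
        with open(argv[1]) as fin:
            return read_sorted_ints(fin)
    return read_sorted_ints(sys.stdin)


# Self-check with an in-memory stream standing in for stdin or a file.
sample = io.StringIO("58457907\n37850775\n19743393\n70718573\n")
demo_sorted = read_sorted_ints(sample)
print(demo_sorted)
```

In a script, nums = load(sys.argv) would then cover both "python script.py nums.txt" and "python script.py < nums.txt" (script.py being a placeholder name).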
3

Opening the file with a with statement and applying map to its lines seems more efficient in a test on a file with 200 lines. The timings below are from Python 2, where map returns a list:

In [3]: %%timeit
with open("in.txt",'rb') as f:
    lines = map(int,f)
    lines.sort()
   ...: 
10000 loops, best of 3: 183 µs per loop


In [5]: %%timeit
data = fileinput.input("in.txt")
c = [int(i) for i in data]
c.sort()
   ...: 
1000 loops, best of 3: 443 µs per loop

3 Comments

@Robᵩ, literally identical timings but probably a better idea
On my PC, lines = sorted(itertools.imap(int,f)) is marginally fastest, although lines = sorted(int(x) for x in f) comes close. And I hate using map.
Just tried sorted; it was also identical. Will try with itertools: 180 µs using imap.
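The variants compared in these comments can be reproduced on Python 3 with a sketch along the following lines. On Python 3 the built-in map is already lazy, so it plays the role of itertools.imap here. The function names with_map and with_genexp and the 200-line test file are assumptions for illustration. No timings are shown because they vary by machine.

```python
import os
import tempfile
import timeit
from random import randint

# Hypothetical 200-line input, matching the size used in the answer's test.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as fout:
    for _ in range(200):
        fout.write("{}\n".format(randint(1, 10**8)))


def with_map(p):
    # Lazy map over the file object, sorted in one pass.
    with open(p) as f:
        return sorted(map(int, f))


def with_genexp(p):
    # Generator expression, for those who prefer to avoid map.
    with open(p) as f:
        return sorted(int(x) for x in f)


try:
    # Both variants should return the same sorted list.
    variants_agree = with_map(path) == with_genexp(path)
    for func in (with_map, with_genexp):
        seconds = timeit.timeit(lambda: func(path), number=1000)
        print("{}: {:.1f} us per loop".format(func.__name__, seconds / 1000 * 1e6))
finally:
    os.remove(path)
```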
