Taking input from file into list in python

Question

Is there some way to read input from a file other than using a for loop? I am using:

import fileinput

data = fileinput.input()
c = [int(i) for i in data]
c.sort()

But for a very large amount of data, it takes too long to process. The input is of the format,

You are essentially processing the file 3 times. Do you really need to sort it at the end? Try just opening the file and reading through it. The first line reads the entire file, then you process each line, and then you sort it; no wonder it takes a long time.
– gkusner, Commented Sep 24, 2014 at 17:44
Almost any construct has an implicit loop. Why are you avoiding an explicit loop?
– dawg, Commented Sep 24, 2014 at 17:52
It's an 888 KB text file with around 100002 lines, and sorting is required for further processing.
– Abhishek Sharma, Commented Sep 24, 2014 at 17:53
fileinput adds some overhead. If you are sensitive to time, you may consider opening the files yourself.
– tdelaney, Commented Sep 24, 2014 at 18:36

dawg · Accepted Answer · 2014-09-24 18:32:11Z

4

If I create a 'large' file:

from random import randint 

with open('/tmp/nums.txt', 'w') as fout:
    a, b = 100002 // 10000, 100002 * 10000  # integer division: randint() needs int bounds on Python 3
    for i in range(100002):
        fout.write('{}\n'.format(randint(a,b)))

I can read it, convert each line to an integer, and sort the data in place like this:

with open('/tmp/nums.txt') as fin:    
    nums=[int(e) for e in fin]
    nums.sort()

The total time for this operation is 50 ms on my computer. Is 50 ms a long time?
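For anyone who wants to check a figure like this on their own machine, here is a minimal sketch of a one-off timing using time.perf_counter. The helper names (read_sorted_ints, make_sample_file, time_read) and the temporary file are assumptions made for illustration and are not part of the answer's code. The number you get will depend on your hardware and Python version.

```python
import os
import random
import tempfile
import time


def read_sorted_ints(path):
    """Read one integer per line from path and return them sorted."""
    with open(path) as fin:
        nums = [int(line) for line in fin]
    nums.sort()
    return nums


def make_sample_file(count=100002, seed=0):
    """Write `count` random integers, one per line, to a temporary file."""
    rng = random.Random(seed)
    fd, path = tempfile.mkstemp(suffix=".txt", text=True)
    with os.fdopen(fd, "w") as fout:
        for _ in range(count):
            fout.write("{}\n".format(rng.randint(10, 1000020000)))
    return path


def time_read(path):
    """Return (sorted numbers, elapsed seconds) for a single read-and-sort."""
    start = time.perf_counter()
    nums = read_sorted_ints(path)
    elapsed = time.perf_counter() - start
    return nums, elapsed


if __name__ == "__main__":
    sample = make_sample_file()
    try:
        nums, elapsed = time_read(sample)
        print("{} numbers read and sorted in {:.1f} ms".format(len(nums), elapsed * 1000))
    finally:
        os.remove(sample)
```

A single measurement like this is noisy, which is why the more formal comparison uses timeit with several repetitions.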

With a more formal timing:

def f1():
    with open('/tmp/nums.txt') as fin:    
        nums=[int(e) for e in fin]
        nums.sort()
    return nums

def f2():
    with open('/tmp/nums.txt') as fin:  
        return sorted(map(int, fin))

def f3():
    with open('/tmp/nums.txt') as fin:  
        nums=list(map(int, fin))
        nums.sort()    
    return nums    

if __name__ =='__main__':
    import timeit     
    import sys
    if sys.version_info.major==2:
        from itertools import imap as map

    result=[]    
    for f in (f1, f2, f3):
        fn=f.__name__
        fs="f()"
        ft=timeit.timeit(fs, setup="from __main__ import f", number=3)
        r=eval(fs)
        result.append((ft, fn, str(r[0:5])+'...'+str(r[-6:-1]) ))         

    result.sort(key=lambda t: t[0])    

    for i, t in enumerate(result):
        ft, fn, r = t
        if i==0:
            fr='{}: {:.4f} secs is fastest\n\tf(x)={}\n========'.format(fn, ft, r)   
        else:
            t1=result[0][0]
            dp=(ft-t1)/t1
            fr='{}: {:.4f} secs - {} is {:.2%} faster\n\tf(x)={}'.format(fn, ft, result[0][1], dp, r)

        print(fr)

You can see that the differences between these are not huge (except for PyPy where f3 clearly has an advantage):

Python 2.7.8:

f3: 0.2630 secs is fastest
    f(x)=[3025, 18834, 19637, 29124, 42088]...[999964829, 999970030, 999984585, 1000005692, 1000010131]
========
f2: 0.2641 secs - f3 is 0.41% faster
    f(x)=[3025, 18834, 19637, 29124, 42088]...[999964829, 999970030, 999984585, 1000005692, 1000010131]
f1: 0.2779 secs - f3 is 5.67% faster
    f(x)=[3025, 18834, 19637, 29124, 42088]...[999964829, 999970030, 999984585, 1000005692, 1000010131]

Python 3.4.1:

f2: 0.1873 secs is fastest
    f(x)=[3025, 18834, 19637, 29124, 42088]...[999964829, 999970030, 999984585, 1000005692, 1000010131]
========
f3: 0.1881 secs - f2 is 0.41% faster
    f(x)=[3025, 18834, 19637, 29124, 42088]...[999964829, 999970030, 999984585, 1000005692, 1000010131]
f1: 0.2071 secs - f2 is 10.59% faster
    f(x)=[3025, 18834, 19637, 29124, 42088]...[999964829, 999970030, 999984585, 1000005692, 1000010131]

PyPy:

f3: 0.1300 secs is fastest
    f(x)=[3025, 18834, 19637, 29124, 42088]...[999964829, 999970030, 999984585, 1000005692, 1000010131]
========
f2: 0.1428 secs - f3 is 9.81% faster
    f(x)=[3025, 18834, 19637, 29124, 42088]...[999964829, 999970030, 999984585, 1000005692, 1000010131]
f1: 0.2223 secs - f3 is 70.94% faster
    f(x)=[3025, 18834, 19637, 29124, 42088]...[999964829, 999970030, 999984585, 1000005692, 1000010131]

PyPy3:

f3: 0.2483 secs is fastest
    f(x)=[3025, 18834, 19637, 29124, 42088]...[999964829, 999970030, 999984585, 1000005692, 1000010131]
========
f2: 0.2588 secs - f3 is 4.23% faster
    f(x)=[3025, 18834, 19637, 29124, 42088]...[999964829, 999970030, 999984585, 1000005692, 1000010131]
f1: 0.2878 secs - f3 is 15.88% faster
    f(x)=[3025, 18834, 19637, 29124, 42088]...[999964829, 999970030, 999984585, 1000005692, 1000010131]

edited Sep 24, 2014 at 18:32

answered Sep 24, 2014 at 18:02

dawg


4 Comments

tdelaney Over a year ago

using fileinput.FileInput instead of open added about 65% more time in some tests I did.

Abhishek Sharma Over a year ago

Yeah, it's much faster than my previous one. Thanks. But here I am using an input file, which will be provided explicitly using parameters.

dawg Over a year ago

@AbhishekSharma: Is your program being provided the name of a file or the 100,002 numbers through stdin? Fileinput supports either/both.

Abhishek Sharma Over a year ago

I first wrote the code with stdin, but for testing purposes I used a testcase file, which took a lot of time to execute.
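The comments above raise two points: fileinput adds overhead, and the program may receive either a file name or numbers on stdin. One possible way to handle both without fileinput is sketched below. The function names read_sorted_ints and load are hypothetical, and skipping blank lines is an added assumption about the input, which the question does not show.

```python
import io
import sys


def read_sorted_ints(stream):
    """Return the integers found one per line in `stream`, sorted ascending.

    Blank lines are skipped (an assumption; the real input may not have any).
    """
    return sorted(int(line) for line in stream if line.strip())


def load(path=None):
    """Read from `path` if one is given, otherwise from standard input."""
    if path is None:
        return read_sorted_ints(sys.stdin)
    with open(path) as fin:
        return read_sorted_ints(fin)


# Any iterable of lines works, so an in-memory stream stands in for a file here.
sample = io.StringIO("30\n10\n\n20\n")
print(read_sorted_ints(sample))  # [10, 20, 30]
```

In a script, this could be called as load(sys.argv[1] if len(sys.argv) > 1 else None), so that a file name on the command line takes precedence and stdin is the fallback.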

Padraic Cunningham · Accepted Answer · 2014-09-24 17:57:57Z

3

Opening the file using with and passing it straight to map seems more efficient than fileinput on a test with a 200-line file (Python 2, where map returns a list):

In [3]: %%timeit
   ...: with open("in.txt",'rb') as f:
   ...:     lines = map(int,f)
   ...:     lines.sort()
   ...: 
10000 loops, best of 3: 183 µs per loop


In [5]: %%timeit
   ...: data = fileinput.input("in.txt")
   ...: c = [int(i) for i in data]
   ...: c.sort()
   ...: 
1000 loops, best of 3: 443 µs per loop
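Note that calling .sort() on the result of map only works on Python 2, where map returns a list. On Python 3, map returns a lazy iterator with no sort method. A minimal Python 3 sketch of the same idea follows; the function name sorted_ints_from_file and the temporary demo file are assumptions for illustration only.

```python
import os
import tempfile


def sorted_ints_from_file(path):
    """Read one integer per line from `path` and return them as a sorted list."""
    with open(path) as f:
        # map() is lazy on Python 3; sorted() consumes it and builds the list.
        return sorted(map(int, f))


# Small demonstration with a throwaway file.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as out:
    out.write("5\n3\n9\n1\n")
try:
    print(sorted_ints_from_file(demo_path))  # [1, 3, 5, 9]
finally:
    os.remove(demo_path)
```

The call to sorted must stay inside the with block, because the iterator returned by map reads from the file only when it is consumed.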

edited Sep 24, 2014 at 17:57

answered Sep 24, 2014 at 17:48

Padraic Cunningham


3 Comments

Padraic Cunningham Over a year ago

@Robᵩ, literally identical timings but probably a better idea

Robᵩ Over a year ago

On my PC, lines = sorted(itertools.imap(int,f)) is marginally fastest, although lines = sorted(int(x) for x in f) comes close. And I hate using map.

Padraic Cunningham Over a year ago

Just tried sorted; it was also identical. Will try with itertools: 180 µs using imap.
