
I have a "not so" large file (~2.2GB) which I am trying to read and process...

import os
from collections import defaultdict

graph = defaultdict(dict)
error = open("error.txt", "w")
print "Reading file"
with open("final_edge_list.txt", "r") as f:
    for line in f:
        try:
            line = line.rstrip(os.linesep)
            tokens = line.split("\t")
            if len(tokens) == 3:
                src = long(tokens[0])
                destination = long(tokens[1])
                weight = float(tokens[2])
                # store the edge in both directions
                graph[src][destination] = weight
                graph[destination][src] = weight
            else:
                print "error ", line
                error.write(line + "\n")
        except Exception, e:
            string = str(type(e)) + " " + str(e) + " ==> " + line + "\n"
            error.write(string)
            continue

Am I doing something wrong?

It's been about an hour and the code is still reading the file.

Memory usage is already at 20 GB. Why is it taking so much time and memory?

  • Oh, well, at least you're not having a 50 GB memory leak like the one I had a while ago :D That said, have you ever looked at graph manipulation libraries such as NetworkX? They're probably more efficient! Commented Nov 5, 2013 at 18:48
  • Comment out the dict-building code and see how long it takes to read the file. My guess is that it will run quickly then. My other guess is the same as @DSM's: you're probably creating an enormous number of dicts. Commented Nov 5, 2013 at 18:54
  • I'm not confident enough to post this as an answer, but shouldn't you use f.readlines() first? Commented Nov 5, 2013 at 18:56
  • @Dunno: No. readlines() will make the memory issue worse: it reads the entire file into memory before the loop starts, whereas for line in f: keeps only one line in memory at a time. Commented Nov 5, 2013 at 18:58
  • @bukzor: I just thought for line in f: wouldn't work properly without calling readlines() first. Anyway, thanks and never mind. Commented Nov 5, 2013 at 19:00
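
To make the last two comments concrete, here is a minimal illustration (my own, not from the thread) of the difference between slurping and streaming the file:

with open("final_edge_list.txt") as f:
    lines = f.readlines()   # materializes all ~2.2 GB of lines in memory at once

with open("final_edge_list.txt") as f:
    for line in f:          # buffered iteration: one line in memory at a time
        pass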

4 Answers


To get a rough idea of where the memory is going, you can use the gc.get_objects function. Wrap your above code in a make_graph() function (this is best practice anyway), and then wrap the call to this function in a KeyboardInterrupt exception handler which prints the gc data to a file.

def main():
    try:
        make_graph()
    except KeyboardInterrupt:
        write_gc()

def write_gc():
    from os.path import exists
    fname = 'gc.log.%i'
    i = 0
    while exists(fname % i):
        i += 1
    fname = fname % i
    with open(fname, 'w') as f:
        from pprint import pformat
        from gc import get_objects
        f.write(pformat(get_objects()))


if __name__ == '__main__':
    main()

Now whenever you ctrl+c your program, you'll get a new gc.log. Given a few samples you should be able to see the memory issue.
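
If the raw pformat dump is too noisy to read, a small follow-up sketch (my own, assuming Python 2.7+ for collections.Counter) is to tally the live objects by type; the largest buckets usually point at the culprit:

from collections import Counter
from gc import get_objects

def summarize_gc():
    # Count live objects by type name; for the code in the question,
    # dict, int/long and float should dominate.
    counts = Counter(type(obj).__name__ for obj in get_objects())
    for name, count in counts.most_common(10):
        print "%10d  %s" % (count, name)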


Comments

  • Where does make_graph() come from?
  • @Dualinity: Copy-pasted for your convenience: "Wrap your above code in a make_graph() function (this is best practice anyway)" ...
  • You shouldn't use imports inside a function. Just put them at the top level.
  • @bukzor I mean, I don't understand what it is. Does it belong to a package?
  • @Dualinity At the start of this answer bukzor says "Wrap your above code in a make_graph() function", i.e. make_graph() is defined by wrapping the code from the question in a function.

There are a few things you can do:

  1. Run your code on a subset of the data, measure the time required, and extrapolate to the full size of your data. That will give you an estimate of how long it will run.

    counter = 0
    with open("final_edge_list.txt", "r") as f:
        for line in f:
            counter += 1
            if counter == 200000:
                break
            try:
                ...

    On 1M lines it runs in ~8 seconds on my machine, so a 2.2 GB file with about 100M lines should take ~15 minutes. However, once you exceed available memory, that estimate no longer holds.

  2. Your graph seems to be symmetric:

    graph[src][destination] = weight
    graph[destination][src] = weight
    

    Exploit that symmetry in your graph-processing code and store each edge only once, cutting memory usage in half (see the sketch after this list).

  3. Run profilers on your code using a subset of the data to see what is happening. The simplest option is

    python -m cProfile --sort cumulative yourprogram.py
    

    There is a good article on speed and memory profilers: http://www.huyng.com/posts/python-performance-analysis/
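
As a minimal sketch of point 2 (my own illustration, not part of the answer above), you can key a flat dict by a canonical (low, high) node pair so each undirected edge is stored exactly once:

def add_edge(graph, src, dst, weight):
    # Normalize the key so (a, b) and (b, a) map to the same entry.
    key = (src, dst) if src <= dst else (dst, src)
    graph[key] = weight

def get_weight(graph, src, dst):
    key = (src, dst) if src <= dst else (dst, src)
    return graph.get(key)

graph = {}
add_edge(graph, 42, 7, 0.5)
print get_weight(graph, 7, 42)   # 0.5, looked up via the canonical key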

Comments


Python's numeric types use quite a lot of memory compared to other programming languages. On my setup it appears to be 24 bytes for each number:

>>> import sys
>>> sys.getsizeof(int())
24
>>> sys.getsizeof(float())
24

Given that you have hundreds of millions of lines in that 2.2 GB input file, the reported memory consumption should not come as a surprise.
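
A back-of-envelope check (my own, assuming the ~100M-line estimate from the other answer and the question's two dict entries per edge) shows the weights alone account for several gigabytes before any dict overhead:

import sys

per_number = sys.getsizeof(float())    # ~24 bytes on 64-bit CPython 2
edges = 100 * 10**6                    # assumed line count
entries = edges * 2                    # each edge is stored in both directions
# weights only, ignoring keys, dict entries and hash-table overhead:
print entries * per_number / 2.0**30   # ~4.5 GiB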

To add another point: some versions of the Python interpreter (including CPython 2.6) are known for keeping so-called free lists for allocation performance, especially for objects of type int and float. Once allocated, this memory will not be returned to the operating system until your process terminates. Also have a look at the question I posted when I first discovered this issue.

Suggestions to work around this include:

  • use a subprocess to do the memory-hungry computation, e.g., based on the multiprocessing module (see the sketch after this list)
  • use a library that implements the functionality in C, e.g., numpy, pandas
  • use another interpreter, e.g., PyPy
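
A minimal sketch of the first suggestion (my own; the file name and tab-separated format are taken from the question): parse in a worker process and return only a small summary, so the free-list memory dies with the worker:

from multiprocessing import Pool

def count_edges(path):
    # Runs in a child process; its int/float free lists are returned
    # to the OS when the worker exits.
    graph = {}
    with open(path) as f:
        for line in f:
            tokens = line.rstrip("\n").split("\t")
            if len(tokens) == 3:
                graph[(long(tokens[0]), long(tokens[1]))] = float(tokens[2])
    return len(graph)   # ship back a number, not the graph itself

if __name__ == '__main__':
    pool = Pool(processes=1)
    print pool.apply(count_edges, ("final_edge_list.txt",))
    pool.close()
    pool.join()   # the worker exits here, releasing its memory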

Comments

  • You don't need graph to be a defaultdict(dict); use a plain dict instead: graph[src, destination] = weight and graph[destination, src] = weight will do, or even just one of the two.
  • To reduce memory usage, try storing the resulting dataset in a scipy.sparse matrix; it consumes less memory and can be compressed (see the sketch below).
  • What do you plan to do with your node list afterwards?
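
As a minimal sketch of the scipy.sparse suggestion (my own illustration; it assumes node IDs are reasonably dense non-negative integers, otherwise remap them first):

from scipy.sparse import coo_matrix

rows, cols, weights = [], [], []
with open("final_edge_list.txt") as f:
    for line in f:
        tokens = line.rstrip("\n").split("\t")
        if len(tokens) == 3:
            rows.append(int(tokens[0]))
            cols.append(int(tokens[1]))
            weights.append(float(tokens[2]))

n = max(max(rows), max(cols)) + 1
adj = coo_matrix((weights, (rows, cols)), shape=(n, n))
adj = adj.tocsr()             # compressed sparse row: compact, fast lookups
print adj[rows[0], cols[0]]   # weight of the first edge read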

Comments
