
I have a "not so" large file (~2.2GB) which I am trying to read and process...

import os
from collections import defaultdict

graph = defaultdict(dict)
error = open("error.txt", "w")
print "Reading file"
with open("final_edge_list.txt", "r") as f:
    for line in f:
        try:
            line = line.rstrip(os.linesep)
            tokens = line.split("\t")
            if len(tokens) == 3:
                src = long(tokens[0])
                destination = long(tokens[1])
                weight = float(tokens[2])
                # store the edge in both directions
                graph[src][destination] = weight
                graph[destination][src] = weight
            else:
                print "error ", line
                error.write(line + "\n")
        except Exception, e:
            string = str(type(e)) + " " + str(e) + " ==> " + line + "\n"
            error.write(string)
            continue

Am I doing something wrong?

It's been about an hour and the code is still reading the file.

Memory usage is already at 20 GB. Why is it taking so much time and memory?

  • Oh, well, at least you're not having a 50 GB memory leak like the one I had a while ago :D That said, have you ever looked at graph manipulation libraries such as NetworkX? They're probably more efficient! Commented Nov 5, 2013 at 18:48
  • Comment out the dict-building code and see how long it takes to read the file. My guess is that it will run quickly then. My other guess is the same as @DSM's: you're probably creating an enormous number of dicts. Commented Nov 5, 2013 at 18:54
  • I'm not confident enough to post this as an answer, but shouldn't you use f.readlines() first? Commented Nov 5, 2013 at 18:56
  • @Dunno: No. readlines() will make the memory issue worse: it reads the entire file into memory before the loop starts, whereas for line in f: keeps only one line in memory at a time. Commented Nov 5, 2013 at 18:58
  • @bukzor: I just thought for line in f: wouldn't work properly without calling readlines() first. Anyway, thanks and never mind. Commented Nov 5, 2013 at 19:00
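
To make the last two comments concrete, here is a minimal illustration (my own, not from the thread) of the difference between slurping and streaming the file:

with open("final_edge_list.txt") as f:
    lines = f.readlines()   # materializes all ~2.2 GB of lines in memory at once

with open("final_edge_list.txt") as f:
    for line in f:          # buffered iteration: one line in memory at a time
        pass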

4 Answers


To get a rough idea of where the memory is going, you can use the gc.get_objects function. Wrap your above code in a make_graph() function (this is best practice anyway), and then wrap the call to this function in a KeyboardInterrupt exception handler which prints the gc data to a file.

def main():
    try:
        make_graph()
    except KeyboardInterrupt:
        write_gc()

def write_gc():
    from os.path import exists
    fname = 'gc.log.%i'
    i = 0
    while exists(fname % i):
        i += 1
    fname = fname % i
    with open(fname, 'w') as f:
        from pprint import pformat
        from gc import get_objects
        f.write(pformat(get_objects()))


if __name__ == '__main__':
    main()

Now whenever you ctrl+c your program, you'll get a new gc.log. Given a few samples you should be able to see the memory issue.
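
If the raw pformat dump is too noisy to read, a small follow-up sketch (my own, assuming Python 2.7+ for collections.Counter) is to tally the live objects by type; the largest buckets usually point at the culprit:

from collections import Counter
from gc import get_objects

def summarize_gc():
    # Count live objects by type name; for the code in the question,
    # dict, int/long and float should dominate.
    counts = Counter(type(obj).__name__ for obj in get_objects())
    for name, count in counts.most_common(10):
        print "%10d  %s" % (count, name)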


Comments

  • Where does make_graph() come from?
  • @Dualinity: Copy-pasted for your convenience: "Wrap your above code in a make_graph() function (this is best practice anyway)" ...
  • You shouldn't use imports inside a function. Just put them at the top level.
  • @bukzor I mean, I don't understand what it is. Does it belong to a package?
  • @Dualinity At the start of this answer bukzor says "Wrap your above code in a make_graph() function", i.e. make_graph() is defined by wrapping the code from the question in a function.

There are a few things you can do:

  1. Run your code on a subset of the data, measure the time required, and extrapolate to the full size of your data. That will give you an estimate of how long it will run.

    counter = 0
    with open("final_edge_list.txt", "r") as f:
        for line in f:
            counter += 1
            if counter == 200000:
                break
            try:
                ...

    On 1M lines it runs in ~8 seconds on my machine, so a 2.2 GB file with about 100M lines should take ~15 minutes. However, once you exceed available memory, that estimate no longer holds.

  2. Your graph seems to be symmetric:

    graph[src][destination] = weight
    graph[destination][src] = weight
    

    Exploit that symmetry in your graph-processing code and store each edge only once, cutting memory usage in half (see the sketch after this list).

  3. Run profilers on your code using a subset of the data to see what is happening. The simplest option is

    python -m cProfile --sort cumulative yourprogram.py
    

    There is a good article on speed and memory profilers: http://www.huyng.com/posts/python-performance-analysis/
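
As a minimal sketch of point 2 (my own illustration, not part of the answer above), you can key a flat dict by a canonical (low, high) node pair so each undirected edge is stored exactly once:

def add_edge(graph, src, dst, weight):
    # Normalize the key so (a, b) and (b, a) map to the same entry.
    key = (src, dst) if src <= dst else (dst, src)
    graph[key] = weight

def get_weight(graph, src, dst):
    key = (src, dst) if src <= dst else (dst, src)
    return graph.get(key)

graph = {}
add_edge(graph, 42, 7, 0.5)
print get_weight(graph, 7, 42)   # 0.5, looked up via the canonical key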

Comments


Python's numeric types use quite a lot of memory compared to other programming languages. On my setup it appears to be 24 bytes for each number:

>>> import sys
>>> sys.getsizeof(int())
24
>>> sys.getsizeof(float())
24

Given that you have hundreds of millions of lines in that 2.2 GB input file, the reported memory consumption should not come as a surprise.
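
A back-of-envelope check (my own, assuming the ~100M-line estimate from the other answer and the question's two dict entries per edge) shows the weights alone account for several gigabytes before any dict overhead:

import sys

per_number = sys.getsizeof(float())    # ~24 bytes on 64-bit CPython 2
edges = 100 * 10**6                    # assumed line count
entries = edges * 2                    # each edge is stored in both directions
# weights only, ignoring keys, dict entries and hash-table overhead:
print entries * per_number / 2.0**30   # ~4.5 GiB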

To add another point: some versions of the Python interpreter (including CPython 2.6) are known for keeping so-called free lists for allocation performance, especially for objects of type int and float. Once allocated, this memory will not be returned to the operating system until your process terminates. Also have a look at the question I posted when I first discovered this issue.

Suggestions to work around this include:

  • use a subprocess to do the memory-hungry computation, e.g., based on the multiprocessing module (see the sketch after this list)
  • use a library that implements the functionality in C, e.g., numpy, pandas
  • use another interpreter, e.g., PyPy
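
A minimal sketch of the first suggestion (my own; the file name and tab-separated format are taken from the question): parse in a worker process and return only a small summary, so the free-list memory dies with the worker:

from multiprocessing import Pool

def count_edges(path):
    # Runs in a child process; its int/float free lists are returned
    # to the OS when the worker exits.
    graph = {}
    with open(path) as f:
        for line in f:
            tokens = line.rstrip("\n").split("\t")
            if len(tokens) == 3:
                graph[(long(tokens[0]), long(tokens[1]))] = float(tokens[2])
    return len(graph)   # ship back a number, not the graph itself

if __name__ == '__main__':
    pool = Pool(processes=1)
    print pool.apply(count_edges, ("final_edge_list.txt",))
    pool.close()
    pool.join()   # the worker exits here, releasing its memory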

Comments

  • You don't need graph to be a defaultdict(dict); use a plain dict instead: graph[src, destination] = weight and graph[destination, src] = weight will do, or even just one of the two.
  • To reduce memory usage, try storing the resulting dataset in a scipy.sparse matrix; it consumes less memory and can be compressed (see the sketch below).
  • What do you plan to do with your node list afterwards?
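
As a minimal sketch of the scipy.sparse suggestion (my own illustration; it assumes node IDs are reasonably dense non-negative integers, otherwise remap them first):

from scipy.sparse import coo_matrix

rows, cols, weights = [], [], []
with open("final_edge_list.txt") as f:
    for line in f:
        tokens = line.rstrip("\n").split("\t")
        if len(tokens) == 3:
            rows.append(int(tokens[0]))
            cols.append(int(tokens[1]))
            weights.append(float(tokens[2]))

n = max(max(rows), max(cols)) + 1
adj = coo_matrix((weights, (rows, cols)), shape=(n, n))
adj = adj.tocsr()             # compressed sparse row: compact, fast lookups
print adj[rows[0], cols[0]]   # weight of the first edge read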

Comments
