I have a large-ish XML file (40 MB) and use the following function to parse it into a dict:
import re
from pattern import web  # pattern.web's DOM wrapper

def get_title_year(xml, low, high):
    """
    Given an XML document, extract the title and year of each article.
    Inputs: xml (XML string); low, high (integers) bounding the
    publication years of the records to keep.
    """
    dom = web.Element(xml)
    result = {'title': [], 'publication year': []}
    for article in dom.by_tag('article'):
        # the year is the first quoted token inside the <cpyrt> element
        year = int(re.split('"', article.by_tag('cpyrt')[0].content)[1])
        if low < year < high:
            result['title'].append(article.by_tag('title')[0].content)
            result['publication year'].append(year)
    return result
import pandas as pd

ty_dict = get_title_year(PR_file, 1912, 1970)
ty_df = pd.DataFrame(ty_dict)
print ty_df.head()
publication year title
0 1913 The Velocity of Electrons in the Photo-electri...
1 1913 Announcement of the Transfer of the Review to ...
2 1913 Diffraction and Secondary Radiation with Elect...
3 1913 On the Comparative Absorption of γ and X Rays
4 1913 Study of Resistance of Carbon Contacts
When I run this, it ends up using 2.5 GB of RAM! Two questions:
Where is all this RAM going? It is not the dictionary or the DataFrame: when I save the DataFrame as a UTF-8 CSV, it is only 3.4 MB.
Also, the RAM is not released after the function finishes. Is this normal? I have never paid attention to Python memory usage before, so I cannot say.
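On the second question: CPython often keeps freed memory in its own allocator pools rather than handing it back to the OS, so the process footprint can stay high even after the objects are gone. A common workaround is to run the parse in a throwaway child process, so the OS reclaims everything when the worker exits. A minimal sketch, where parse_job is a hypothetical stand-in for the real parsing function:

```python
import multiprocessing as mp

def parse_job(xml_text):
    # Hypothetical stand-in for get_title_year(); in real use, the big
    # DOM would be built here, inside the child process.
    return {'title': ['Example title'], 'publication year': [1913]}

def parse_in_child(xml_text):
    # Run the parse in a child process. When the child exits, the OS
    # reclaims all memory it allocated; only the small result dict
    # crosses back to the parent via the queue.
    ctx = mp.get_context('fork')  # fork: no pickling of the target needed
    q = ctx.Queue()
    p = ctx.Process(target=lambda: q.put(parse_job(xml_text)))
    p.start()
    result = q.get()  # read before join() to avoid blocking on a full pipe
    p.join()
    return result
```

Note that the `'fork'` start method is POSIX-only; on Windows you would need a module-level, picklable target function instead of the lambda.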
Try running get_title_year() as a separate process: stackoverflow.com/a/22191166/1407427. I'm not sure what web.Element() is, but if it's anything like etree, it's probably allocating multiple Python dictionaries per XML node; that will blow it up pretty quickly. Python is many things, but memory-efficient doesn't tend to be one of them, from what I've seen.
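If switching parsers is an option, the standard library's xml.etree.ElementTree.iterparse streams the document instead of holding the whole DOM in memory, and clearing each element after use keeps the footprint roughly constant. A sketch under the assumption that the year sits directly in the `<cpyrt>` text (the real file would need the same `re.split` as above):

```python
import xml.etree.ElementTree as ET
from io import StringIO

def get_title_year_stream(source, low, high):
    # Stream the document: iterparse fires an 'end' event as each
    # element is completed, so we never hold the full tree at once.
    result = {'title': [], 'publication year': []}
    for event, elem in ET.iterparse(source):
        if elem.tag == 'article':
            # assumes <cpyrt> contains just the year; adapt the
            # extraction to the real file's format
            year = int(elem.findtext('cpyrt'))
            if low < year < high:
                result['title'].append(elem.findtext('title'))
                result['publication year'].append(year)
            elem.clear()  # drop the article's children to free memory
    return result
```

For a file on disk you would pass the path (or an open file object) as `source`; the example below uses StringIO only to keep the sketch self-contained.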