I have a large-ish XML file (40 MB) and use the following function to parse it into a dict:
import re
from pattern import web  # pattern.web's DOM wrapper

def get_title_year(xml, low, high):
    """
    Given an XML document, extract the title and year of each article.
    Inputs: xml (XML string); low, high (integers) bounding the
    publication years of the records to keep.
    """
    dom = web.Element(xml)
    result = {'title': [], 'publication year': []}
    for article in dom.by_tag('article'):
        # the year is the first quoted token inside the <cpyrt> element
        year = int(re.split('"', article.by_tag('cpyrt')[0].content)[1])
        if low < year < high:
            result['title'].append(article.by_tag('title')[0].content)
            result['publication year'].append(year)
    return result
import pandas as pd

ty_dict = get_title_year(PR_file, 1912, 1970)
ty_df = pd.DataFrame(ty_dict)
print ty_df.head()
publication year title
0 1913 The Velocity of Electrons in the Photo-electri...
1 1913 Announcement of the Transfer of the Review to ...
2 1913 Diffraction and Secondary Radiation with Elect...
3 1913 On the Comparative Absorption of γ and X Rays
4 1913 Study of Resistance of Carbon Contacts
When I run this, it ends up using 2.5 GB of RAM! Two questions:
Where is all this RAM going? It is not the dictionary or the DataFrame: when I save the DataFrame as a UTF-8 CSV, it is only 3.4 MB.
Also, the RAM is not released after the function finishes. Is this normal? I have never paid attention to Python memory usage before, so I cannot say.
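On the second question: CPython often keeps freed memory in its own allocator pools rather than handing it back to the OS, so the process footprint can stay high even after the objects are gone. A common workaround is to run the parse in a throwaway child process, so the OS reclaims everything when the worker exits. A minimal sketch, where parse_job is a hypothetical stand-in for the real parsing function:

```python
import multiprocessing as mp

def parse_job(xml_text):
    # Hypothetical stand-in for get_title_year(); in real use, the big
    # DOM would be built here, inside the child process.
    return {'title': ['Example title'], 'publication year': [1913]}

def parse_in_child(xml_text):
    # Run the parse in a child process. When the child exits, the OS
    # reclaims all memory it allocated; only the small result dict
    # crosses back to the parent via the queue.
    ctx = mp.get_context('fork')  # fork: no pickling of the target needed
    q = ctx.Queue()
    p = ctx.Process(target=lambda: q.put(parse_job(xml_text)))
    p.start()
    result = q.get()  # read before join() to avoid blocking on a full pipe
    p.join()
    return result
```

Note that the `'fork'` start method is POSIX-only; on Windows you would need a module-level, picklable target function instead of the lambda.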
Try running get_title_year() as a separate process: stackoverflow.com/a/22191166/1407427. I'm not sure what web.Element() is, but if it's anything like etree, it's probably allocating multiple Python dictionaries per XML node; that will blow it up pretty quickly. Python is many things, but memory-efficient doesn't tend to be one of them, from what I've seen.
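If switching parsers is an option, the standard library's xml.etree.ElementTree.iterparse streams the document instead of holding the whole DOM in memory, and clearing each element after use keeps the footprint roughly constant. A sketch under the assumption that the year sits directly in the `<cpyrt>` text (the real file would need the same `re.split` as above):

```python
import xml.etree.ElementTree as ET
from io import StringIO

def get_title_year_stream(source, low, high):
    # Stream the document: iterparse fires an 'end' event as each
    # element is completed, so we never hold the full tree at once.
    result = {'title': [], 'publication year': []}
    for event, elem in ET.iterparse(source):
        if elem.tag == 'article':
            # assumes <cpyrt> contains just the year; adapt the
            # extraction to the real file's format
            year = int(elem.findtext('cpyrt'))
            if low < year < high:
                result['title'].append(elem.findtext('title'))
                result['publication year'].append(year)
            elem.clear()  # drop the article's children to free memory
    return result
```

For a file on disk you would pass the path (or an open file object) as `source`; the example below uses StringIO only to keep the sketch self-contained.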