
I'm getting a MemoryError and I'm trying to find the best solution to this problem. Basically, I'm downloading a lot of XML files via multiple threads of the same class. My class uses the following call to download the files:

urlretrieve(link, filePath)

I save the path of each downloaded file into a Queue that is shared between the threads.

downloadedFilesQ.put(filePath)
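
Roughly, the download side fits together like this (a simplified sketch; the worker function, the links list and the downloads/ folder are placeholders, not my exact code):

    from queue import Queue
    from urllib.request import urlretrieve

    downloadedFilesQ = Queue()  # shared between the download and parsing threads

    def downloadWorker(links):
        # simplified worker body; each thread runs over its own list of links
        for link in links:
            filePath = "downloads/" + link.rsplit("/", 1)[-1]
            urlretrieve(link, filePath)        # write the XML file to disk
            downloadedFilesQ.put(filePath)     # hand the file path to the parsing threads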

In another class (also running multiple threads) I parse those XML files into Python objects that I will later save in the database. I'm using the following call to parse each file:

    xmldoc = minidom.parse(downloadedFilesQ.get())
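
The parsing side looks roughly like this (again a simplified sketch, not my exact code; the real logic lives in ParseXML.py):

    from queue import Empty
    from xml.dom import minidom

    def parseWorker():
        # simplified worker body; parseXML in ParseXML.py does more than this
        while True:
            try:
                xml_file = downloadedFilesQ.get(timeout=30)
            except Empty:
                break                      # assume no more files are coming
            xmldoc = minidom.parse(xml_file)
            # ... walk xmldoc and build Business objects here ...
            downloadedFilesQ.task_done()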

The download and parsing flows run simultaneously. The download flow finishes after about 2 minutes, while the parsing flow takes about 15 minutes. After 15 minutes I get a MemoryError on the following line:

Exception in thread XMLConverterToObj-21:
Traceback (most recent call last):
  File "C:\Users\myuser\AppData\Local\Programs\Python\Python37-32\lib\threading.py", line 926, in _bootstrap_inner
    self.run()
  File "C:\Users\myuser\AppData\Local\Programs\Python\Python37-32\lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\myuser\PycharmProjects\weat\Parsers\ParseXML.py", line 77, in parseXML
    xmldoc = minidom.parse(xml_file)
  File "C:\Users\myuser\AppData\Local\Programs\Python\Python37-32\lib\xml\dom\minidom.py", line 1958, in parse
    return expatbuilder.parse(file)
  File "C:\Users\myuser\AppData\Local\Programs\Python\Python37-32\lib\xml\dom\expatbuilder.py", line 911, in parse
    result = builder.parseFile(fp)
  File "C:\Users\myuser\AppData\Local\Programs\Python\Python37-32\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
  File "c:\_work\16\s\modules\pyexpat.c", line 417, in StartElement
  File "C:\Users\myuser\AppData\Local\Programs\Python\Python37-32\lib\xml\dom\expatbuilder.py", line 746, in start_element_handler
    _append_child(self.curNode, node)
  File "C:\Users\myuser\AppData\Local\Programs\Python\Python37-32\lib\xml\dom\minidom.py", line 291, in _append_child
    childNodes.append(node)
MemoryError

The download flow downloads about 1700 files (~1.2 GB). Each XML file is between 200 bytes and 9 MB. Up to the memory error, my code manages to create about 500K Python objects, all of the following class:

from sqlalchemy import Column, BIGINT, Integer, TEXT
from base import Base

class Business(Base):

    __tablename__ = 'business'
    id = Column(BIGINT, primary_key=True)
    BName = Column('business_name', TEXT)
    Owner = Column('owner_id', Integer)
    city = Column('city', TEXT)
    address = Column('address', TEXT)

    def __init__(self, BName, owner, city=None, address=None, workingHours=None):
        self.BName = BName
        self.Owner = owner        # was self.owner, which never reached the owner_id column
        self.city = city
        self.address = address
        # workingHours is accepted but not stored; there is no matching column yet

The option I considered is to save the objects to the db once I reach 100K of them and then continue parsing. The problem is that the same business can appear in multiple files, which is why I wanted to parse all the files first and insert the businesses into a set (in order to ignore the repeated businesses).
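
A rough sketch of what I mean (the (BName, Owner) key is just an example; I'm not sure yet which attribute really identifies a duplicate):

    BATCH_SIZE = 100_000
    seenKeys = set()        # identities of businesses already queued or saved
    pendingObjects = []

    def addBusiness(business, session):
        key = (business.BName, business.Owner)   # example identity of a business
        if key in seenKeys:
            return                                # skip a repeated business
        seenKeys.add(key)
        pendingObjects.append(business)
        if len(pendingObjects) >= BATCH_SIZE:
            session.bulk_save_objects(pendingObjects)
            session.commit()
            pendingObjects.clear()                # release the objects from memory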

Are there other things I can try?

  • multiple businesses can repeat - so define the attribute that marks a repeat and skip the duplicated objects. Commented Sep 3, 2019 at 8:55

1 Answer


You appear to keep everything in memory at the same time. RAM, the memory a computer works with, is much more limited than storage (hard disks). So you may easily be able to store a lot of XML documents on disk, but you cannot hold them all in RAM at the same time.

In your case this means that you should change your program fundamentally.

Your program should work in a streaming fashion: load one XML document, parse it, process it somehow, store its results in a database, and then forget about that document again. The last point is vital for freeing the RAM the document occupied.
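
In outline, such a loop could look like this (just a sketch; extract_businesses stands for whatever code turns the parsed DOM into your Business objects):

    from xml.dom import minidom

    def streaming_parse_worker(downloadedFilesQ, session, extract_businesses):
        while True:
            xml_file = downloadedFilesQ.get()
            if xml_file is None:              # sentinel value: no more files coming
                break
            xmldoc = minidom.parse(xml_file)
            businesses = extract_businesses(xmldoc)
            session.add_all(businesses)
            session.commit()                  # persist now, so nothing has to stay in RAM
            xmldoc.unlink()                   # explicitly release the DOM tree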

Now you write that you need to figure out which documents are repeated.

To achieve this I propose not storing the whole documents in memory but just a hash value for each. For this you need a decent hash function which creates a unique hash value for a given document. Then you store just the hash value of each document you processed in a set, and each time you encounter a new document with the same hash value, you will know that it is a repeated document and can handle it accordingly (e.g. ignore it).
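
For example (a sketch; SHA-256 is just one reasonable choice of hash function):

    import hashlib

    seen_hashes = set()

    def is_duplicate(file_path):
        # hash the raw bytes of the document; identical files produce identical digests
        with open(file_path, 'rb') as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest in seen_hashes:
            return True
        seen_hashes.add(digest)
        return False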

While it might be impossible to keep 1700 documents of up to 9 MB each in memory at the same time, it is easily possible to keep 1700 hash values in memory at the same time.


1 Comment

I thought that keeping about 1.2 GB of data in memory isn't such a big issue. The problem in my case is that the same document isn't repeated, but some items inside the XML can be repeated, therefore a hash of the entire file won't be useful here.
