I'm getting a MemoryError and I'm trying to find the best way to solve this problem. Basically, I'm downloading a lot of XML files via multiple threads of the same class. My class uses the following call to download the files:
urlretrieve(link, filePath)
I save the path of each downloaded file into a Queue that is shared between the threads:
downloadedFilesQ.put(filePath)
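For reference, a stripped-down sketch of the download flow looks roughly like this (simplified: the worker structure and the file naming are illustrative, not my exact code):

from queue import Queue
from urllib.request import urlretrieve

downloadedFilesQ = Queue()  # shared between the download and parsing threads

def downloadWorker(links):
    for link in links:
        filePath = 'downloads/' + link.rsplit('/', 1)[-1]  # illustrative naming
        urlretrieve(link, filePath)
        downloadedFilesQ.put(filePath)  # hand the path over to the parsing threads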
In another class (also running multiple threads) I parse those XML files into Python objects that I will later save to the DB. I use the following call to parse a file:
xmldoc = minidom.parse(downloadedFilesQ.get())
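Each parsing thread runs a loop roughly like this (again a simplified sketch; the real parseXML does more work per file):

from xml.dom import minidom

def parseWorker():
    while True:
        xml_file = downloadedFilesQ.get()
        xmldoc = minidom.parse(xml_file)  # the call that eventually raises MemoryError
        # ... walk xmldoc and build Business objects here ...
        downloadedFilesQ.task_done()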
The download and parsing flows run simultaneously. The download flow finishes after about 2 minutes, while the parsing flow takes about 15 minutes. After roughly 15 minutes I get a MemoryError with the following traceback:
Exception in thread XMLConverterToObj-21:
Traceback (most recent call last):
  File "C:\Users\myuser\AppData\Local\Programs\Python\Python37-32\lib\threading.py", line 926, in _bootstrap_inner
    self.run()
  File "C:\Users\myuser\AppData\Local\Programs\Python\Python37-32\lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\myuser\PycharmProjects\weat\Parsers\ParseXML.py", line 77, in parseXML
    xmldoc = minidom.parse(xml_file)
  File "C:\Users\myuser\AppData\Local\Programs\Python\Python37-32\lib\xml\dom\minidom.py", line 1958, in parse
    return expatbuilder.parse(file)
  File "C:\Users\myuser\AppData\Local\Programs\Python\Python37-32\lib\xml\dom\expatbuilder.py", line 911, in parse
    result = builder.parseFile(fp)
  File "C:\Users\myuser\AppData\Local\Programs\Python\Python37-32\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
  File "c:\_work\16\s\modules\pyexpat.c", line 417, in StartElement
  File "C:\Users\myuser\AppData\Local\Programs\Python\Python37-32\lib\xml\dom\expatbuilder.py", line 746, in start_element_handler
    _append_child(self.curNode, node)
  File "C:\Users\myuser\AppData\Local\Programs\Python\Python37-32\lib\xml\dom\minidom.py", line 291, in _append_child
    childNodes.append(node)
MemoryError
The download flow fetches about 1,700 files, ~1.2 GB in total. Each XML file is between 200 bytes and 9 MB. Before the MemoryError, my code manages to create about 500K Python objects, all of the same class:
from sqlalchemy import Column, BIGINT, Integer, TEXT
from base import Base

class Business(Base):
    __tablename__ = 'business'

    id = Column(BIGINT, primary_key=True)
    BName = Column('business_name', TEXT)
    Owner = Column('owner_id', Integer)
    city = Column('city', TEXT)
    address = Column('address', TEXT)

    def __init__(self, BName, owner, city=None, address=None, workingHours=None):
        self.BName = BName
        self.Owner = owner  # Owner is the mapped column attribute
        self.city = city
        self.address = address  # workingHours is accepted but not stored here
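During parsing, every business element becomes one such object, created along these lines (variable names illustrative):

business = Business(BName=name, owner=ownerId, city=city, address=address)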
One option I considered is to save the objects to the DB once I reach 100K of them, and then continue parsing. The problem is that the same business can appear in multiple files, so I wanted to parse all of the files first and only then insert the businesses, collecting them in a set so that repeated businesses are ignored.
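In code, the batching option would look roughly like this (a sketch only: the session setup is abbreviated, and BName stands in for whatever the real deduplication key would be):

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('sqlite:///businesses.db')  # placeholder connection string
session = sessionmaker(bind=engine)()

seenBusinesses = set()   # dedup keys of businesses already queued or saved
pendingObjects = []      # objects waiting for the next bulk save

def addBusiness(business):
    # (with multiple parser threads this would also need a Lock)
    if business.BName in seenBusinesses:
        return  # repeated business, skip it
    seenBusinesses.add(business.BName)
    pendingObjects.append(business)
    if len(pendingObjects) >= 100000:
        session.bulk_save_objects(pendingObjects)
        session.commit()
        pendingObjects.clear()

Keeping seenBusinesses in memory across batches would still let me skip repeats after a flush, but the set itself keeps growing, so I'm not sure this fully solves the memory problem.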
Are there other things I can try?