I've got a script I'm using to build a sitemap - for each of the relevant models I'm generating a number of paged sitemaps with a URL for each object, and each week I intend to run the script again to regenerate the sitemap with fresh data.
However, when I run this script on my Ubuntu server, the memory usage keeps climbing until the process is eventually killed by the OS. Here's the part of the script I'm currently having trouble getting through:
def xml_for_page(object):
    sOutText = "\t<url>\n"
    sURL = object.url()
    sOutText += "\t\t<loc>http://www.mysite.com%s</loc>\n" % sURL
    sLastModified = object.last_status_change.isoformat()
    sOutText += "\t\t<lastmod>%s</lastmod>\n" % sLastModified
    sChangeFreq = "monthly"
    sOutText += "\t\t<changefreq>%s</changefreq>\n" % sChangeFreq
    sOutText += "\t</url>\n"
    return sOutText

def generate_object_map():
    # Do this in chunks of ITEMS_PER_FILE
    bFinished = False
    iOffset = 0
    iCurrentPage = 0
    while not bFinished:
        objResults = PageObject.objects.filter(submission_status=SUBMISSION_STATUS_ACCEPTED).order_by('-popularity')[iOffset:iOffset + ITEMS_PER_FILE]
        if objResults.count() < 1:
            break
        sFilename = "%s/sitemap-objects-%i.xml" % (OUTPUT_DIR, iCurrentPage)
        fObjects = open(sFilename, 'w')
        fObjects.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        fObjects.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for object in objResults:
            fObjects.write(xml_for_page(object))
        fObjects.write('</urlset>')
        fObjects.close()
        iOffset += ITEMS_PER_FILE
        iCurrentPage += 1
So what's going on here is: on each iteration of the while not bFinished loop we create a new file for the current page and query the database for that particular slice of objects. We then iterate through those objects and write the XML for that page to the sitemap file. Once those entries are written, we close the file and start another one. The reason for this paging behaviour is that writing all of the entries out to a single file hit the memory limit very quickly.

This version behaves better, but when I use resource to track memory usage I can still see it climbing after each file is written. There are around 200,000 objects of this type in the database, so ideally I need to make this as scalable as possible. However, I can't see how memory is being held on to after each iteration of the main loop: the QuerySet variable is rebound on each iteration, and the file handle is closed and reopened on each iteration as well. I thought Python's garbage collection would clean up the no-longer-referenced objects once the variable had been rebound. Is that not the case?
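For what it's worth, one quick way to see where the growth is coming from is to log the peak memory figure together with the size of Django's retained query log after each chunk: with DEBUG = True, Django appends every executed query to connection.queries, which is a common source of steady growth in long-running scripts. The sketch below is a diagnostic helper I'm adding for illustration, not part of the original script, and note that ru_maxrss is a high-water mark, so it only ever rises.

import resource
from django import db

def log_chunk_stats(iCurrentPage):
    # Peak resident set size so far; reported in kilobytes on Linux.
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # With DEBUG = True, Django keeps every executed query in this list.
    queries_held = len(db.connection.queries)
    print("page %i: peak RSS %i kB, %i queries retained" % (iCurrentPage, peak_kb, queries_held))
    # Clearing the log between chunks rules it out as the cause of the climb.
    db.reset_queries()

Calling log_chunk_stats(iCurrentPage) at the end of each while iteration would show whether the query log is growing alongside the memory figure.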
Comments:

Could object.url() be doing something? What if you replaced the Django models with just a set of strings and wrote those to a file instead, to see whether the models, or your xml_for_page, have anything to do with it? Keep on cutting the example code down, is what I'm saying.

Have you tried defer, .only, or even .values_list on your models? The fewer DB calls, the better.

(Asker) The url() function is the thing limiting me here - while I could fetch the ID and last_modified_date through values_list, the url function uses values from foreign key fields, so I'll still have to fetch the object itself. Or is there a similar function for fetching the results of one particular member function?

You can follow foreign keys there too: object.values_list('foreignobject__url')

(Asker) After changing the url() function to work on dicts, the script runs several orders of magnitude faster and lighter! If you put that as an answer I'll gladly accept it.
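To make that resolution concrete, here is a rough sketch of the kind of change the last two comments describe: fetch only the columns the sitemap needs, including fields reached through the foreign key via the double-underscore syntax, and build each <url> entry from a plain dict instead of a model instance. The field names ('foreignobject__slug', etc.) are placeholders for whatever url() actually reads, not the real model's fields.

def xml_for_page_dict(row):
    # Same output as xml_for_page, but built from a dict returned by .values()
    # rather than from a full model instance.
    sOutText = "\t<url>\n"
    sOutText += "\t\t<loc>http://www.mysite.com/%s/</loc>\n" % row['foreignobject__slug']
    sOutText += "\t\t<lastmod>%s</lastmod>\n" % row['last_status_change'].isoformat()
    sOutText += "\t\t<changefreq>monthly</changefreq>\n"
    sOutText += "\t</url>\n"
    return sOutText

# Inside generate_object_map(), the query would then fetch only those columns:
objResults = PageObject.objects.filter(
    submission_status=SUBMISSION_STATUS_ACCEPTED
).order_by('-popularity').values(
    'id', 'last_status_change', 'foreignobject__slug'
)[iOffset:iOffset + ITEMS_PER_FILE]

for row in objResults:
    fObjects.write(xml_for_page_dict(row))

Because .values() returns plain dicts rather than model instances, each chunk carries far less per-row overhead and nothing from the related tables has to be instantiated, which matches the speed and memory improvement reported in the final comment.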