
I've got a script that builds a sitemap: for each of the relevant models it generates a number of paged sitemap files with a URL for each object, and I intend to run the script weekly to regenerate the sitemaps with fresh data.

However, when I run this script on my Ubuntu server, memory usage keeps climbing until the process is eventually killed by the OS. Here's the code I'm having trouble with:

def xml_for_page(object):
    sOutText = "\t<url>\n"

    sURL = object.url()
    sOutText += "\t\t<loc>http://www.mysite.com%s</loc>\n" % sURL

    sLastModified = object.last_status_change.isoformat()
    sOutText += "\t\t<lastmod>%s</lastmod>\n" % sLastModified

    sChangeFreq = "monthly"
    sOutText += "\t\t<changefreq>%s</changefreq>\n" % sChangeFreq

    sOutText += "\t</url>\n"

    return sOutText

def generate_object_map():

    # Do this in chunks of ITEMS_PER_FILE
    bFinished = False
    iOffset = 0
    iCurrentPage = 0

    while not bFinished:
        objResults = PageObject.objects.filter(submission_status=SUBMISSION_STATUS_ACCEPTED).order_by('-popularity')[iOffset:iOffset+ITEMS_PER_FILE]
        if objResults.count() < 1:
            break
        sFilename = "%s/sitemap-objects-%i.xml" % (OUTPUT_DIR, iCurrentPage)
        fObjects = open(sFilename, 'w')
        fObjects.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        fObjects.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for object in objResults:
            fObjects.write(xml_for_page(object))
        fObjects.write('</urlset>')
        fObjects.close()
        iOffset += ITEMS_PER_FILE
        iCurrentPage += 1

So what's going on here is: on each iteration of the while not bFinished loop we create a new file for the current page and query the database for that particular slice of objects. We then iterate through those objects and write the XML for each one into the sitemap file. Once those are written, we close the file and start another one.

The reason for this paging behaviour is that writing all entries to a single file hit the memory limit very quickly. Paging behaves better, but when I use resource to track memory usage I can see it climbing after each file is written. There are around 200,000 objects of this type in the database, so ideally I need to make this as scalable as possible.

However, I can't see how memory is being held on to after each iteration of the main loop: the QuerySet variable is reassigned on each iteration, and the file handle is closed and reopened as well. I thought Python's GC would allow the no-longer-used objects to be cleaned up once the variables had been reassigned. Is that not the case?
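For what it's worth, here's roughly how I'm tracking memory, a simplified sketch of my measurement using the stdlib resource module (Unix-only):

```python
# Simplified sketch of the measurement (Unix-only): the stdlib resource
# module reports the process's peak resident set size so far.
# On Linux, ru_maxrss is in kilobytes.
import resource

def peak_rss_kb():
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss_kb()
# ... write one sitemap file here ...
after = peak_rss_kb()
print("peak RSS now %i kB (was %i kB)" % (after, before))
```

It's the `after` value that keeps climbing between files.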

  • Could object.url() be doing something? What if you replaced the Django models with just a set of strings and wrote those to a file instead, to see whether the models, or your xml_for_page, have anything to do with it? Keep on cutting the example code down, is what I'm saying Commented Jan 17, 2014 at 16:08
  • If you're going for scalability, why not make use of either .defer or .only or even .values_list on your models? The fewer DB calls, the better. Commented Jan 17, 2014 at 16:10
  • The url() function is the thing limiting me here - while I could fetch the ID and last_modified_date through values_list the url function uses values from foreign key fields so I'll still have to fetch the object itself. Or is there a similar function for fetching the results of one particular member function? Commented Jan 17, 2014 at 18:00
  • 1
    docs.djangoproject.com/en/dev/ref/models/querysets/#values-list seems like it'd be fine to use object.values_list('foreignobject__url') Commented Jan 17, 2014 at 19:50
  • I didn't realise that was possible! Thanks for the advice, after a lot of fiddly replication of our url() function to use on dicts, the script runs several orders of magnitude faster and lighter! If you put that as an answer I'll gladly accept it Commented Jan 20, 2014 at 13:57
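Roughly, the dict-based rewrite mentioned in the last comment looks like this (field names are illustrative; the url value is assumed to be precomputed from the foreign-key fields rather than produced by the original url() method):

```python
# Hypothetical rewrite of xml_for_page to take a plain dict, as returned
# by QuerySet.values(), instead of a full model instance.
def xml_for_page_dict(row):
    return (
        "\t<url>\n"
        "\t\t<loc>http://www.mysite.com%s</loc>\n"
        "\t\t<lastmod>%s</lastmod>\n"
        "\t\t<changefreq>monthly</changefreq>\n"
        "\t</url>\n"
    ) % (row["url"], row["last_status_change"].isoformat())
```

The main loop then iterates over the dicts from a .values() query rather than over model instances, which is where the speed and memory savings come from.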

1 Answer


The docs seem to suggest that it'd be fine to use object.values_list('foreignobject__url'), since values_list allows related fields via the double-underscore syntax. So if foreignobject is a foreign key on your model, and the related model contains the field url, you'd be safe to use values_list in order to reduce DB calls and avoid materialising full model instances.

