How to Parse a huge xml file (on the go) using Python

Question

I have a huge XML file (the current Wikipedia dump). It is about 45 GB in size and represents the entire data of the current Wikipedia. The first few lines of the file are (output of more):

    <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" xmlns:xsi="http://ww
    w.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/x
    ml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8" xml:la
    ng="en">
      <siteinfo>
        <sitename>Wikipedia</sitename>
        <base>http://en.wikipedia.org/wiki/Main_Page</base>
        <generator>MediaWiki 1.21wmf6</generator>
        <case>first-letter</case>
        <namespaces>
          <namespace key="-2" case="first-letter">Media</namespace>
          <namespace key="-1" case="first-letter">Special</namespace>
          <namespace key="0" case="first-letter" />
          <namespace key="1" case="first-letter">Talk</namespace>
          <namespace key="2" case="first-letter">User</namespace>
          <namespace key="3" case="first-letter">User talk</namespace>
          <namespace key="4" case="first-letter">Wikipedia</namespace>
          <namespace key="5" case="first-letter">Wikipedia talk</namespace>
          <namespace key="6" case="first-letter">File</namespace>
          <namespace key="7" case="first-letter">File talk</namespace>
          <namespace key="8" case="first-letter">MediaWiki</namespace>
          <namespace key="9" case="first-letter">MediaWiki talk</namespace>
          <namespace key="10" case="first-letter">Template</namespace>
          <namespace key="11" case="first-letter">Template talk</namespace>
          <namespace key="12" case="first-letter">Help</namespace>
          <namespace key="13" case="first-letter">Help talk</namespace>
          <namespace key="14" case="first-letter">Category</namespace>
          <namespace key="15" case="first-letter">Category talk</namespace>
          <namespace key="100" case="first-letter">Portal</namespace>
          <namespace key="101" case="first-letter">Portal talk</namespace>
          <namespace key="108" case="first-letter">Book</namespace>
          <namespace key="109" case="first-letter">Book talk</namespace>
          <namespace key="446" case="first-letter">Education Program</namespace>
          <namespace key="447" case="first-letter">Education Program talk</namespace
    >
          <namespace key="710" case="first-letter">TimedText</namespace>
          <namespace key="711" case="first-letter">TimedText talk</namespace>
        </namespaces>
      </siteinfo>
      <page>
        <title>AccessibleComputing</title>
        <ns>0</ns>
        <id>10</id>
        <redirect title="Computer accessibility" />
        <revision>
          <id>381202555</id>
          <parentid>381200179</parentid>
          <timestamp>2010-08-26T22:38:36Z</timestamp>
          <contributor>
            <username>OlEnglish</username>
            <id>7181920</id>
          </contributor>
          <minor />
          <comment>[[Help:Reverting|Reverted]] edits by [[Special:Contributions/76.2
    8.186.133|76.28.186.133]] ([[User talk:76.28.186.133|talk]]) to last version by 
    Gurch</comment>
          <text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from C
    amelCase}}</text>
          <sha1>lo15ponaybcg2sf49sstw9gdjmdetnk</sha1>
          <model>wikitext</model>

...and so on

Notice the page element in the tree. It corresponds to a unique page in Wikipedia. The XML consists of all the pages of Wikipedia in the form of page elements. I need to write a parser that extracts the value of the title entry from every page of Wikipedia and (for simplicity) prints it.

I am trying to build this using Python (although I am open to switching languages if that offers a solution). The only way I know of is to use ElementTree.

However, using the function parse('file.xml') requires the entire document to be parsed completely before any results are output. As is evident, the entire XML consists of page elements. I want the program to begin printing titles WHILE it is parsing the rest of the XML. Is that even possible? If so, how?

EDIT Note: I use the example of extracting titles to keep the question simple. However, I do need real XML parsing features, since I will need them for similar extractions in the future.

Related: stackoverflow.com/questions/3707155/…

– Warren Weckesser, Commented Apr 8, 2013 at 23:57

Jesse Rusak · Accepted Answer · 2013-04-09 00:06:46Z

3

What you want is an event-based XML library, which sends you pieces as it parses incrementally, rather than creating a tree for the whole document. The typical answer is the xml.sax stdlib module, though I'm sure there are many others.
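As a rough illustration of the event-based approach, here is a minimal sketch using xml.sax. The handler class, the stream_titles helper, and the tiny SAMPLE document are made up for this example; they only stand in for the real dump, and the sketch assumes that every title of interest is the text of a title element.

```python
import io
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Hypothetical handler: reports the text of each <title> element."""

    def __init__(self, on_title):
        super().__init__()
        self.on_title = on_title  # callback invoked once per title
        self._in_title = False
        self._chunks = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._chunks = []

    def characters(self, content):
        # characters() may be called several times for a single text node
        if self._in_title:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "title":
            self._in_title = False
            self.on_title("".join(self._chunks))


def stream_titles(source, on_title=print):
    """Parse `source` (a file name or file object), reporting titles as they are seen."""
    xml.sax.parse(source, TitleHandler(on_title))


# Tiny stand-in for the dump (an assumption, not the real file)
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page><title>AccessibleComputing</title><ns>0</ns></page>
  <page><title>Anarchism</title><ns>0</ns></page>
</mediawiki>"""

found = []
stream_titles(io.BytesIO(SAMPLE), found.append)
print(found)
```

With the default callback, stream_titles would print each title as soon as its closing tag is read, so output should start long before the end of the file is reached. No tree is built, so memory use should not grow with the size of the document.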

edited Apr 9, 2013 at 0:06

answered Apr 8, 2013 at 23:58


bbayles · Accepted Answer · 2013-04-09 01:59:19Z

1

I've not attempted to use such a large dataset, but I have found the lxml module to be fast and useful.

The lxml.etree tutorial provides an example that may be instructive.

The key paragraph is:

A very important use case for iterparse() is parsing large generated XML files, e.g. database dumps. Most often, these XML formats only have one main data item element that hangs directly below the root node and that is repeated thousands of times. In this case, it is best practice to let lxml.etree do the tree building and to only intercept exactly on this one Element, using the normal tree API for data extraction.
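A minimal sketch of that pattern follows. To keep it free of third-party dependencies it uses iterparse() from the standard library's xml.etree.ElementTree rather than lxml; lxml.etree.iterparse is called in much the same way. The helper names and the small SAMPLE document are assumptions made for this example.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def iter_titles(source):
    """Yield each page title as soon as its <page> element has been read."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and local_name(elem.tag) == "page":
            for child in elem:
                if local_name(child.tag) == "title":
                    yield child.text
                    break
            # Drop the pages processed so far so the tree does not keep growing
            root.clear()


# Tiny stand-in for the dump (an assumption, not the real file)
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page><title>AccessibleComputing</title><ns>0</ns></page>
  <page><title>Anarchism</title><ns>0</ns></page>
</mediawiki>"""

for title in iter_titles(io.BytesIO(SAMPLE)):
    print(title)
```

Each page is still built as a small tree, so the usual element API is available for extracting more than the title later. Clearing the root after every page is what should keep memory use roughly constant on a very large file.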

answered Apr 9, 2013 at 1:59


Sheng · Accepted Answer · 2013-04-08 23:49:46Z

0

Sure, it is possible. In an ugly way, you could read the file line by line in text mode, and then use a regular expression or a simple string search (with <title> and </title> as keywords) as a filter to get the lines of the form

<title>AccessibleComputing</title>

Then you could extract the titles and do what you want with them.
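A rough sketch of this line-based approach is below. It assumes that every title element sits on a single line, as in the sample output in the question, and it does not validate the XML in any way. The function name and the SAMPLE text are made up for illustration.

```python
import io
import re
from xml.sax.saxutils import unescape

# Matches a complete <title>...</title> pair on one line
TITLE_RE = re.compile(r"<title>(.*?)</title>")


def scan_titles(lines):
    """Yield the title found on each line of an iterable of text lines."""
    for line in lines:
        match = TITLE_RE.search(line)
        if match:
            # Undo XML escaping such as &amp; and &quot; in the raw text
            yield unescape(match.group(1), {"&quot;": '"'})


# Tiny stand-in for the dump (an assumption, not the real file)
SAMPLE = """<page>
    <title>AccessibleComputing</title>
    <ns>0</ns>
</page>
<page>
    <title>AT&amp;T</title>
</page>
"""

titles = list(scan_titles(io.StringIO(SAMPLE)))
print(titles)
```

On the real file, an open text-mode file object could be passed in place of the StringIO. A title split across lines or a title tag written with attributes would be silently missed, which is the kind of fragility that makes this an ugly way.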

answered Apr 8, 2013 at 23:49


2 Comments

Jesse Rusak Over a year ago

There are zillions of pitfalls parsing XML with regexes; especially with that much content from Wikipedia, I would bet you will run into some of them.

Sheng Over a year ago

Yes, that is why this is an ugly way. I am not quite sure, but I think that if the regular expression is good enough, it could work. After all, XML is text-based. But your method is better.
