0

I have a huge XML file (the current Wikipedia dump). It is about 45 GB in size and contains all of the data in the current Wikipedia. The first few lines of the file (output of more) are:

    <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8" xml:lang="en">
      <siteinfo>
        <sitename>Wikipedia</sitename>
        <base>http://en.wikipedia.org/wiki/Main_Page</base>
        <generator>MediaWiki 1.21wmf6</generator>
        <case>first-letter</case>
        <namespaces>
          <namespace key="-2" case="first-letter">Media</namespace>
          <namespace key="-1" case="first-letter">Special</namespace>
          <namespace key="0" case="first-letter" />
          <namespace key="1" case="first-letter">Talk</namespace>
          <namespace key="2" case="first-letter">User</namespace>
          <namespace key="3" case="first-letter">User talk</namespace>
          <namespace key="4" case="first-letter">Wikipedia</namespace>
          <namespace key="5" case="first-letter">Wikipedia talk</namespace>
          <namespace key="6" case="first-letter">File</namespace>
          <namespace key="7" case="first-letter">File talk</namespace>
          <namespace key="8" case="first-letter">MediaWiki</namespace>
          <namespace key="9" case="first-letter">MediaWiki talk</namespace>
          <namespace key="10" case="first-letter">Template</namespace>
          <namespace key="11" case="first-letter">Template talk</namespace>
          <namespace key="12" case="first-letter">Help</namespace>
          <namespace key="13" case="first-letter">Help talk</namespace>
          <namespace key="14" case="first-letter">Category</namespace>
          <namespace key="15" case="first-letter">Category talk</namespace>
          <namespace key="100" case="first-letter">Portal</namespace>
          <namespace key="101" case="first-letter">Portal talk</namespace>
          <namespace key="108" case="first-letter">Book</namespace>
          <namespace key="109" case="first-letter">Book talk</namespace>
          <namespace key="446" case="first-letter">Education Program</namespace>
          <namespace key="447" case="first-letter">Education Program talk</namespace>
          <namespace key="710" case="first-letter">TimedText</namespace>
          <namespace key="711" case="first-letter">TimedText talk</namespace>
        </namespaces>
      </siteinfo>
      <page>
        <title>AccessibleComputing</title>
        <ns>0</ns>
        <id>10</id>
        <redirect title="Computer accessibility" />
        <revision>
          <id>381202555</id>
          <parentid>381200179</parentid>
          <timestamp>2010-08-26T22:38:36Z</timestamp>
          <contributor>
            <username>OlEnglish</username>
            <id>7181920</id>
          </contributor>
          <minor />
          <comment>[[Help:Reverting|Reverted]] edits by [[Special:Contributions/76.28.186.133|76.28.186.133]] ([[User talk:76.28.186.133|talk]]) to last version by Gurch</comment>
          <text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
          <sha1>lo15ponaybcg2sf49sstw9gdjmdetnk</sha1>
          <model>wikitext</model>

...and so on

Notice the page element in the tree. Each page element corresponds to one page in Wikipedia, and the XML consists of all Wikipedia pages in the form of page elements. I need to write a parser that extracts the value of the title element from every page and (for simplicity) prints it.

I am trying to build this in Python (although I am open to switching languages if that offers a solution). The only way I know of is to use ElementTree.

However, calling parse('file.xml') requires the entire document to be parsed before any results are output. Since I know that the whole XML consists of page elements, I want the program to begin printing titles WHILE it is still parsing the rest of the XML. Is that even possible? If so, how?

EDIT Note: I use title extraction as the example here to keep the question simple. However, I do need real XML parsing, since I will need to extract other elements in the future.

1

3 Answers

3

What you want is an event-based XML library, which hands you pieces as it parses incrementally rather than building a tree for the whole document. The typical answer is the xml.sax stdlib module, though I'm sure there are many others.
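As a rough illustration of the event-based approach, here is a minimal sketch using xml.sax. The names TitleHandler and stream_titles are made up for this example, and it assumes the default (non-namespace-aware) SAX mode, in which the element is reported under the name "title" exactly as written in the dump.

```python
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Hypothetical handler: reports the text of every <title> element."""

    def __init__(self, on_title):
        super().__init__()
        self.on_title = on_title  # callback invoked once per completed title
        self._buffer = None       # None means "currently not inside <title>"

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so accumulate them.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title" and self._buffer is not None:
            self.on_title("".join(self._buffer))
            self._buffer = None


def stream_titles(source, on_title=print):
    """Parse `source` (a filename or binary file object) incrementally."""
    xml.sax.parse(source, TitleHandler(on_title))
```

Calling stream_titles("file.xml") would then print each title as soon as its closing tag has been read, without waiting for the rest of the file. Since no tree is built, memory use should stay roughly constant regardless of the dump size.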

1

I haven't worked with such a large dataset, but I have found the lxml module to be fast and useful.

The lxml.etree tutorial provides an example that may be instructive.

The key paragraph is:

A very important use case for iterparse() is parsing large generated XML files, e.g. database dumps. Most often, these XML formats only have one main data item element that hangs directly below the root node and that is repeated thousands of times. In this case, it is best practice to let lxml.etree do the tree building and to only intercept exactly on this one Element, using the normal tree API for data extraction.
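The pattern described in that paragraph can be sketched roughly as follows. To keep the example free of third-party dependencies it uses iterparse() from the standard library's xml.etree.ElementTree rather than lxml; lxml.etree.iterparse offers a similar interface, but treat the exact equivalence as an assumption. The helper names local_name and iter_titles are made up for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_titles(source):
    """Yield each page title as soon as its <page> element is complete."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and local_name(elem.tag) == "page":
            # The page is fully built here, so the normal tree API applies.
            for child in elem:
                if local_name(child.tag) == "title":
                    yield child.text
                    break
            # Discard finished children of the root so memory stays bounded.
            root.clear()
```

Looping with `for title in iter_titles("file.xml"): print(title)` would print titles while the rest of the file is still being read. The root.clear() call is what keeps this workable on a very large file; without it the tree would keep growing until the whole dump was held in memory.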

0

Sure, it is possible. In an ugly way, you could read the file line by line in text mode, and then use a regular expression or a simple string search (with <title> and </title> as the keywords) as a filter to pick out lines of the form

<title>AccessibleComputing</title>

Then you can collect the titles and do what you want with them.
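A minimal sketch of this line-scanning idea is below. It assumes every title sits on a single line, as in the excerpt from the question, and the name scan_titles is made up for the example. It is not a real XML parser: it only undoes the five predefined XML entities and would miss a title split across lines.

```python
import re
from xml.sax.saxutils import unescape

# Non-greedy match of whatever sits between the opening and closing tag.
TITLE_RE = re.compile(r"<title>(.*?)</title>")

# unescape() handles &lt;, &gt; and &amp; itself; quotes are passed explicitly.
EXTRA_ENTITIES = {"&quot;": '"', "&apos;": "'"}


def scan_titles(lines):
    """Yield the title text from every line that contains a <title> element."""
    for line in lines:
        match = TITLE_RE.search(line)
        if match:
            yield unescape(match.group(1), EXTRA_ENTITIES)


# Possible usage, reading the dump lazily one line at a time:
# with open("file.xml", encoding="utf-8") as dump:
#     for title in scan_titles(dump):
#         print(title)
```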

2 Comments

There are zillions of pitfalls parsing XML with regexes; especially with that much content from Wikipedia, I would bet you will run into some of them.
Yes. That is why this is an ugly way. I am not entirely sure, but I think that if the regular expression is good enough, it could work. After all, XML is text-based. But your method is better.
