I am trying to parse an XML file that I produced by exporting a PDF to XML 1.0 with Adobe Acrobat Pro. I am using Python and ElementTree to parse it. The PDF contains a table that spans multiple pages and has several different table headers.
I want to extract the row and column data from the part of the table that begins with the header containing a particular string (e.g. "MECHANICAL") and stops at the next table heading (e.g. "COMPLETED"), excluding all row and column data before and after this section. There is no distinctive tag to key on; the same tag pattern just repeats.
Here is my current Python code:
# Python
import sys
import re  # regular expression
import xml.etree.ElementTree as xml

tree = xml.parse("C:/Documents and Settings/alilly.CORPORATE/Desktop/python xml parse/excerpt.xml")

print("=================== Find Columns ====================")
for node in tree.iter('TR'):
    print("tag=", node.tag)
    # iter() returns an iterator, so materialize it before counting
    count = len(list(node.iter('TD')))
    #if count != 10:
    #    continue
    print("------------")
    for col in node.iter('TD'):
        print(" tag=", col.tag, "attrib=", col.attrib, "text=", col.text)

print("=================== Find Headers ====================")
# find headers: the header text is the tail of each ImageData element
for node in tree.iter('ImageData'):
    print("figure text = ", node.tail)
And here is my XML file:
<?xml version="1.0" encoding="UTF-8" ?>
<!-- Created from PDF via Acrobat SaveAsXML -->
<!-- Mapping Table version: 28-February-2003 -->
<TaggedPDF-doc>
<?xpacket begin='?' id='W5M0MpCehiHzreSzNTczkc9d'?>
<?xpacket begin="?" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c041 52.342996, 2008/05/07-20:48:00 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
<pdf:Producer>GPL Ghostscript 8.70</pdf:Producer>
<pdf:Keywords/>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:xmp="http://ns.adobe.com/xap/1.0/">
<xmp:ModifyDate>2011-03-01T09:36:13-05:00</xmp:ModifyDate>
<xmp:CreateDate>2011-03-01T09:36:13-05:00</xmp:CreateDate>
<xmp:CreatorTool>PDFCreator Version 1.0.2</xmp:CreatorTool>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
<xmpMM:DocumentID>d417764e-466c-11e0-0000-f7ea6a538d79</xmpMM:DocumentID>
<xmpMM:InstanceID>uuid:0c6ada50-6db0-4d59-88e1-fc23aa6ebc14</xmpMM:InstanceID>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:format>xml</dc:format>
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">my pdf file</rdf:li>
</rdf:Alt>
</dc:title>
<dc:creator>
<rdf:Seq>
<rdf:li>ltamm</rdf:li>
</rdf:Seq>
</dc:creator>
<dc:description>
<rdf:Alt>
<rdf:li xml:lang="x-default"/>
<rdf:li xml:lang="x-repair"/>
</rdf:Alt>
</dc:description>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
<?xpacket end='r'?>
<Part>
<H1>Misc </H1>
<Sect>
<H3>This is a test </H3>
<Sect>
<H5>Deletions </H5>
<L>
<LI>
<LI_Title>Special codes </LI_Title>
</LI>
</L>
<Figure>
<ImageData src=""/>
</Figure>
<Figure>
<ImageData src=""/>
Main INTERIOR </Figure>
<Table>
<TR>
<TH>S = Standard O = Optional </TH>
</TR>
<TR>
<TD><Figure>
<ImageData src=""/>
</Figure>
</TD>
<TD>S </TD>
</TR>
</Table>
<Figure>
<ImageData src=""/>
This is the MECHANICAL header</Figure>
<Table>
<TR>
<TH>S = Standard O = Optional </TH>
</TR>
<TR>
<TH>Free Flow </TH>
<TD>Ref. Code </TD>
<TD>DESCRIPTION </TD>
<TD>Rooster </TD>
<TD>747 Dog </TD>
<TD>888 Rabbit </TD>
</TR>
<TR>
<TD>xxx GOgo xxB </TD>
<TD>Beany xxx </TD>
<TD>nothing here xxx </TD>
<TD>xxx B </TD>
<TD>snake ddd </TD>
<TD>Cow fff </TD>
<TD>eee </TD>
</TR>
<TR>
<TH/>
<TD/>
<TD>Squirrel Protection </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
</TR>
<TR>
<TH/>
<TD>J77 </TD>
<TD>Rocket Launcher </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
</TR>
<TR>
<TH/>
<TD/>
<TD>Lunch </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
</TR>
<TR>
<TH/>
<TD>Jss5 </TD>
<TD>Now is the time for all good men </TD>
<TD>-</TD>
<TD>A1 </TD>
<TD>A1 </TD>
<TD>-</TD>
<TD>-</TD>
<TD>-</TD>
<TD>-</TD>
</TR>
<TR>
<TD>Capacity </TD>
<TD/>
<TD>2/3 </TD>
<TD>2/3 </TD>
<TD>2/3 </TD>
</TR>
</Table>
<Figure>
<ImageData src=""/>
Final COMPLETED PAGE 1 OF 2 </Figure>
<Figure>
<ImageData src=""/>
</Figure>
<P>Graphite </P>
<P>painted fun </P>
<P>Control yourself </P>
<Figure>
<ImageData src=""/>
Meaningless Header PAGE 2 OF 2 </Figure>
<Figure>
<ImageData src=""/>
</Figure>
<P>)multi-coat </P>
<P>front</P>
<P>single-slot system </P>
<Figure>
<ImageData src=""/>
Almost Done Header PAGE 1 OF 1 </Figure>
<Figure>
<ImageData src=""/>
</Figure>
<Figure>
<ImageData src=""/>
</Figure>
<Figure>
<ImageData src=""/>
</Figure>
<P>Snow Blizzard. </P>
<P>Done </P>
</Sect>
</Sect>
</Part>
</TaggedPDF-doc>
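To make the goal concrete, here is a minimal sketch of the kind of extraction described above. It is not a definitive solution. It assumes that every section header is the tail text of an ImageData element, as in the file above, and that walking the tree in document order visits a header before the rows of the table that follows it. The function name extract_section_rows and the trimmed SAMPLE string are hypothetical, made up for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for the exported file shown above.
SAMPLE = """<TaggedPDF-doc>
<Part>
<Figure><ImageData src=""/>
Main INTERIOR </Figure>
<Table>
<TR><TD>skipped </TD><TD>S </TD></TR>
</Table>
<Figure><ImageData src=""/>
This is the MECHANICAL header</Figure>
<Table>
<TR><TH/><TD>J77 </TD><TD>Rocket Launcher </TD><TD>S </TD></TR>
<TR><TD>Capacity </TD><TD/><TD>2/3 </TD></TR>
</Table>
<Figure><ImageData src=""/>
Final COMPLETED PAGE 1 OF 2 </Figure>
<Table>
<TR><TD>also skipped </TD></TR>
</Table>
</Part>
</TaggedPDF-doc>"""


def extract_section_rows(root, start_marker, stop_marker):
    """Collect TR rows between the header containing start_marker
    and the next header containing stop_marker (assumed layout)."""
    rows = []
    in_section = False
    # Element.iter() walks the tree in document order.
    for elem in root.iter():
        if elem.tag == "ImageData":
            # Header text follows the empty ImageData tag, so it is its tail.
            header = (elem.tail or "").strip()
            if in_section and stop_marker in header:
                break
            if start_marker in header:
                in_section = True
        elif elem.tag == "TR" and in_section:
            # itertext() also copes with cells that wrap nested elements.
            cells = [
                "".join(cell.itertext()).strip()
                for cell in elem
                if cell.tag in ("TH", "TD")
            ]
            rows.append(cells)
    return rows


root = ET.fromstring(SAMPLE)
rows = extract_section_rows(root, "MECHANICAL", "COMPLETED")
for row in rows:
    print(row)
```

For the real export, ET.fromstring(SAMPLE) would be replaced by ET.parse(path).getroot(), with path pointing at the XML file. Empty cells such as <TH/> and <TD/> come back as empty strings, so column positions within each row are preserved.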