A similar question has been asked here (Python XML Parsing), but I could not get from it to the content I am interested in.

I need to extract all the information enclosed in each patent-classification element whose classification-scheme child has the scheme attribute set to CPC. There are multiple such elements, and they are all enclosed inside the patent-classifications tag.

In the example given below, there are three such values: C 07 K 16 22 I, A 61 K 2039 505 A and C 07 K 2317 92 A

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="/3.0/style/exchange.xsl"?>
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink">
    <ops:meta name="elapsed-time" value="21"/>
    <exchange-documents>
        <exchange-document system="ops.epo.org" family-id="39103486" country="US" doc-number="2009234106" kind="A1">
            <bibliographic-data>
                <publication-reference>
                    <document-id document-id-type="docdb">
                        <country>US</country>
                        <doc-number>2009234106</doc-number>
                        <kind>A1</kind>
                        <date>20090917</date>
                    </document-id>
                    <document-id document-id-type="epodoc">
                        <doc-number>US2009234106</doc-number>
                        <date>20090917</date>
                    </document-id>
                </publication-reference>
                <classifications-ipcr>
                    <classification-ipcr sequence="1">
                        <text>C07K  16/    44            A I                    </text>
                    </classification-ipcr>
                </classifications-ipcr>
                <patent-classifications>
                    <patent-classification sequence="1">
                        <classification-scheme office="" scheme="CPC"/>
                        <section>C</section>
                        <class>07</class>
                        <subclass>K</subclass>
                        <main-group>16</main-group>
                        <subgroup>22</subgroup>
                        <classification-value>I</classification-value>
                    </patent-classification>
                    <patent-classification sequence="2">
                        <classification-scheme office="" scheme="CPC"/>
                        <section>A</section>
                        <class>61</class>
                        <subclass>K</subclass>
                        <main-group>2039</main-group>
                        <subgroup>505</subgroup>
                        <classification-value>A</classification-value>
                    </patent-classification>
                    <patent-classification sequence="7">
                        <classification-scheme office="" scheme="CPC"/>
                        <section>C</section>
                        <class>07</class>
                        <subclass>K</subclass>
                        <main-group>2317</main-group>
                        <subgroup>92</subgroup>
                        <classification-value>A</classification-value>
                    </patent-classification>
                    <patent-classification sequence="1">
                        <classification-scheme office="US" scheme="UC"/>
                        <classification-symbol>530/387.9</classification-symbol>
                    </patent-classification>
                </patent-classifications>
            </bibliographic-data>
        </exchange-document>
    </exchange-documents>
</ops:world-patent-data>

Install BeautifulSoup if you don't have it:

$ pip install beautifulsoup4

Try this:

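from bs4 import BeautifulSoup

# Load the XML posted in the question (saved as example.xml)
with open('example.xml', 'rb') as f:
    xml = f.read()
# Name the parser explicitly; html.parser ships with Python
bs = BeautifulSoup(xml, 'html.parser')

# Find every patent-classification element
patents = bs.find_all('patent-classification')
# Keep only the ones whose classification-scheme is CPC
for pa in patents:
    if pa.find('classification-scheme', {'scheme': 'CPC'}):
        # Join the child values with single spaces, e.g. "C 07 K 16 22 I"
        print(pa.get_text(' ', strip=True))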

3 Comments

Thanks, but where is the xml variable being used?
The xml variable is where you load your XML. To try the exact code, create a file named example.xml and paste in what you posted in the question. I have also edited my answer; I was missing one line. Thanks
@user1140126 Check the answer again; I updated it. I was missing one line.

You can use Python's standard xml module:

import xml.etree.ElementTree as ET

root = ET.parse('a.xml').getroot()

for node in root.iterfind(".//{http://www.epo.org/exchange}classification-scheme[@scheme='CPC']/.."):
    data = []
    # Iterate over the element directly; getchildren() was removed in Python 3.9
    for d in node:
        if d.text:
            data.append(d.text)
    print(' '.join(data))
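
The same idea can also be written with a namespace prefix map instead of spelling out the full {http://www.epo.org/exchange} URI in every path. This is only a sketch: the prefix ex, the helper name cpc_classifications and the trimmed-down SAMPLE string are made up for illustration and are not part of the original XML or of either answer.

```python
import xml.etree.ElementTree as ET

# Default namespace taken from the xmlns attribute in the question's XML.
# The "ex" prefix is an arbitrary local alias (assumption, not in the source).
NS = {"ex": "http://www.epo.org/exchange"}

# Trimmed-down sample (for illustration only): one CPC entry and one UC entry.
SAMPLE = """<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org">
  <exchange-documents>
    <exchange-document>
      <bibliographic-data>
        <patent-classifications>
          <patent-classification sequence="1">
            <classification-scheme office="" scheme="CPC"/>
            <section>C</section>
            <class>07</class>
            <subclass>K</subclass>
            <main-group>16</main-group>
            <subgroup>22</subgroup>
            <classification-value>I</classification-value>
          </patent-classification>
          <patent-classification sequence="1">
            <classification-scheme office="US" scheme="UC"/>
            <classification-symbol>530/387.9</classification-symbol>
          </patent-classification>
        </patent-classifications>
      </bibliographic-data>
    </exchange-document>
  </exchange-documents>
</ops:world-patent-data>"""


def cpc_classifications(root):
    """Return one list of text values per CPC patent-classification element."""
    results = []
    for pc in root.iterfind(".//ex:patent-classification", NS):
        scheme = pc.find("ex:classification-scheme", NS)
        # Skip entries that have no scheme element or use another scheme (e.g. UC).
        if scheme is None or scheme.get("scheme") != "CPC":
            continue
        # Collect the non-empty text of each direct child, in document order.
        fields = [c.text.strip() for c in pc if c.text and c.text.strip()]
        results.append(fields)
    return results


root = ET.fromstring(SAMPLE)
for fields in cpc_classifications(root):
    print(" ".join(fields))  # prints: C 07 K 16 22 I
```

Returning lists rather than printing directly keeps the individual parts (section, class, subclass and so on) available if you need them separately. For a file on disk, ET.parse('a.xml').getroot() can replace the ET.fromstring call.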
