I am using a Jupyter notebook running Python 3. My task is to extract data from an XML file and convert it to JSON (and perhaps save the JSON to an output.dat file). I am using BeautifulSoup to navigate the nodes. I have the following data:

<?xml version='1.0' encoding='UTF-8'?> 
<Terms>   
 <Term>
    <Title>.177 (4.5mm) Airgun</Title>
    <Description>The standard airgun calibre for international target 
                 shooting.</Description>
    <RelatedTerms>
      <Term>
        <Title>Shooting sport equipment</Title>
        <Relationship>Narrower Term</Relationship>
      </Term>
    </RelatedTerms>   
 </Term>
 <Term>
    <Title>1 Kilometre Time Trial</Title>
    <Description>test2</Description>
    <RelatedTerms>
      <Term>
        <Title>1 Kilometre TT</Title>
        <Relationship>Used For</Relationship>
      </Term>
      <Term>
        <Title>1km Time Trial</Title>
        <Relationship>Used For</Relationship>
      </Term>
      <Term>
        <Title>1km Time Trial</Title>
        <Relationship>Used For</Relationship>
      </Term>
      <Term>
        <Title>1km TT</Title>
        <Relationship>Used For</Relationship>
      </Term>
      <Term>
        <Title>One km Time Trial</Title>
        <Relationship>Used For</Relationship>
      </Term>
    </RelatedTerms>
 </Term>
</Terms>

This is the JSON output I expect:

{
    "thesaurus": [
        {
            "Description": "The standard airgun calibre for international target shooting.",
            "RelatedTerms": [
                {
                    "Relationship": "Narrower Term",
                    "Title": "Shooting sport equipment"
                }
            ],
            "Title": ".177 (4.5mm) Airgun"
        },
        {
            "Description": "test2",
            "RelatedTerms": [
                {
                    "Relationship": "Used For",
                    "Title": "1 Kilometre TT"
                },
                {
                    "Relationship": "Used For",
                    "Title": "1km Time Trial"
                },
                {
                    "Relationship": "Used For",
                    "Title": "1km Time Trial"
                },
                {
                    "Relationship": "Used For",
                    "Title": "1km TT"
                },
                {
                    "Relationship": "Used For",
                    "Title": "One km Time Trial"
                }
            ],
            "Title": "1 Kilometre Time Trial"
        }
    ]
}

I am navigating through the tags so that I can create dictionaries as seen in the output example. Since I am new to text scraping, this is quite frustrating.

I was able to extract the "Description" tag with the following code:

from bs4 import BeautifulSoup

xml_file = './xml.xml'
with open(xml_file, encoding="utf8") as f:
    btree = BeautifulSoup(f, "xml")

# Collect the text of every Description element
elements = btree.find_all('Description')
descriptionTag = []
for element in elements:
    descriptionTag.append(element.text)

I am not sure how to do something similar for "RelatedTerms", that is, how to build a list of dictionaries from the information stored inside the "RelatedTerms" tag. Ideally, I would parse all the tags into a dataframe and then convert that data to JSON.

Can someone please help me work out how to extract the information from the "RelatedTerms" tag?

1 Answer

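To extract RelatedTerms, first select the top-level Term elements with btree.select('Terms > Term'). The > combinator matches direct children only, so the Term elements nested inside RelatedTerms are not picked up at this stage. Then loop over the result and, for each term, select the nested Term elements with term.select('RelatedTerms > Term'):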

import json
from bs4 import BeautifulSoup

xml_file = './xml.xml'
# The "xml" parser requires lxml to be installed
with open(xml_file, encoding="utf8") as f:
    btree = BeautifulSoup(f, "xml")

# Direct children only, so Term elements nested in RelatedTerms are skipped
Terms = btree.select('Terms > Term')
jsonObj = {"thesaurus": []}

for term in Terms:
    # find() returns the first match in document order: the term's own tags
    termDetail = {
        "Description": term.find('Description').text,
        "Title": term.find('Title').text
    }
    RelatedTerms = term.select('RelatedTerms > Term')
    if RelatedTerms:
        termDetail["RelatedTerms"] = []
        for rterm in RelatedTerms:
            termDetail["RelatedTerms"].append({
                "Title": rterm.find('Title').text,
                "Relationship": rterm.find('Relationship').text
            })
    jsonObj["thesaurus"].append(termDetail)

# print is a function in Python 3
print(json.dumps(jsonObj, indent=4))
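
The loop above leaves two details from the question open. The expected Description is on a single line although the XML wraps it across two, and the result is supposed to end up in output.dat. Here is a minimal sketch of both, using only the standard library. The helper names normalize_ws and save_json and the sample dictionary are made up for illustration; in the real script you would apply normalize_ws to each .text value and pass jsonObj to save_json.

```python
import json


def normalize_ws(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


def save_json(data, path="output.dat"):
    """Write data to path as indented UTF-8 JSON with sorted keys."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(data, fh, indent=4, sort_keys=True, ensure_ascii=False)


# Hypothetical sample mirroring the structure built by the loop above.
# The raw description imitates the line-wrapped text from the XML file.
raw_description = (
    "The standard airgun calibre for international target \n"
    "                 shooting."
)
sample = {
    "thesaurus": [
        {
            "Title": ".177 (4.5mm) Airgun",
            "Description": normalize_ws(raw_description),
            "RelatedTerms": [
                {
                    "Title": "Shooting sport equipment",
                    "Relationship": "Narrower Term",
                }
            ],
        }
    ]
}

save_json(sample)
```

sort_keys=True reproduces the alphabetical key order (Description, RelatedTerms, Title) shown in the expected output, and ensure_ascii=False writes non-ASCII characters as they are instead of escaping them.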