Create JSON with XML file using BeautifulSoup

Question

I am using a Jupyter notebook running Python 3. My task is to extract data from an XML file and convert it to JSON format (perhaps even save the JSON in an output.dat file). I am using BeautifulSoup to navigate through the nodes. I have the following data:

<?xml version='1.0' encoding='UTF-8'?>
<Terms>
  <Term>
    <Title>.177 (4.5mm) Airgun</Title>
    <Description>The standard airgun calibre for international target 
                 shooting.</Description>
    <RelatedTerms>
      <Term>
        <Title>Shooting sport equipment</Title>
        <Relationship>Narrower Term</Relationship>
      </Term>
    </RelatedTerms>
  </Term>
  <Term>
    <Title>1 Kilometre Time Trial</Title>
    <Description>test2</Description>
    <RelatedTerms>
      <Term>
        <Title>1 Kilometre TT</Title>
        <Relationship>Used For</Relationship>
      </Term>
      <Term>
        <Title>1km Time Trial</Title>
        <Relationship>Used For</Relationship>
      </Term>
      <Term>
        <Title>1km Time Trial</Title>
        <Relationship>Used For</Relationship>
      </Term>
      <Term>
        <Title>1km TT</Title>
        <Relationship>Used For</Relationship>
      </Term>
      <Term>
        <Title>One km Time Trial</Title>
        <Relationship>Used For</Relationship>
      </Term>
    </RelatedTerms>
  </Term>

This is the output that I am expecting in JSON:

{
    "thesaurus": [
        {
            "Description": "The standard airgun calibre for international target shooting.",
            "RelatedTerms": [
                {
                    "Relationship": "Narrower Term",
                    "Title": "Shooting sport equipment"
                }
            ],
            "Title": ".177 (4.5mm) Airgun"
        },
        {
            "Description": "test2",
            "RelatedTerms": [
                {
                    "Relationship": "Used For",
                    "Title": "1 Kilometre TT"
                },
                {
                    "Relationship": "Used For",
                    "Title": "1km Time Trial"
                },
                {
                    "Relationship": "Used For",
                    "Title": "1km Time Trial"
                },
                {
                    "Relationship": "Used For",
                    "Title": "1km TT"
                },
                {
                    "Relationship": "Used For",
                    "Title": "One km Time Trial"
                }
            ],
            "Title": "1 Kilometre Time Trial"
        },

I am navigating through the tags so that I can create dictionaries as seen in the output example. Since I am new to text scraping, this is quite frustrating.

I was able to extract the "Description" tag with the following code:

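from bs4 import BeautifulSoup

xml_file = './xml.xml'
# Parse the file as XML (the "xml" parser requires lxml) and close it afterwards
with open(xml_file, encoding="utf8") as f:
    btree = BeautifulSoup(f, "xml")
# Collect the text of every Description element
elements = btree.find_all('Description')
descriptionTag = []
for element in elements:
    descriptionTag.append(element.text)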

I am not sure how to do the same for the information stored inside the "RelatedTerms" tag, which needs to become a list of dictionaries. Ideally, I would parse all the tags into a dataframe and then convert that data to JSON format.

So, can someone please help me work out how to extract the information from the "RelatedTerms" tag?

ewwink · Accepted Answer · 2018-11-10 12:29:31Z

To extract RelatedTerms, first select the top-level Term elements using btree.select('Terms > Term'). You can then loop over them and extract each Term inside RelatedTerms using term.select('RelatedTerms > Term').

import json
from bs4 import BeautifulSoup

xml_file = './xml.xml'
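# Open with an explicit encoding and close the file once it has been parsed
with open(xml_file, 'r', encoding='utf8') as f:
    btree = BeautifulSoup(f, "xml")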
Terms = btree.select('Terms > Term')
jsonObj = {"thesaurus": []}

for term in Terms:
    termDetail = {
        "Description": term.find('Description').text,
        "Title": term.find('Title').text
    }
    RelatedTerms = term.select('RelatedTerms > Term')
    if RelatedTerms:
        termDetail["RelatedTerms"] = []
        for rterm in RelatedTerms:
            termDetail["RelatedTerms"].append({
                "Title": rterm.find('Title').text,
                "Relationship": rterm.find('Relationship').text
            })
    jsonObj["thesaurus"].append(termDetail)

# print is a function in Python 3
print(json.dumps(jsonObj, indent=4))
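
The question also mentions saving the JSON to an output.dat file, and the expected "Description" has the line break inside the element collapsed to a single space. The following is a minimal sketch of both steps under some assumptions: it uses the standard library's xml.etree.ElementTree rather than BeautifulSoup so that it runs without lxml, it parses a shortened inline sample instead of ./xml.xml, and the helper names clean and terms_to_dict are made up for illustration. The traversal mirrors the loop above: only the direct Term children of Terms are visited, then the Term elements under each RelatedTerms.

```python
import json
import xml.etree.ElementTree as ET

# Shortened inline sample mirroring the structure in the question
# (an assumption for illustration; the real data lives in ./xml.xml).
SAMPLE_XML = """<Terms>
  <Term>
    <Title>.177 (4.5mm) Airgun</Title>
    <Description>The standard airgun calibre for international target
                 shooting.</Description>
    <RelatedTerms>
      <Term>
        <Title>Shooting sport equipment</Title>
        <Relationship>Narrower Term</Relationship>
      </Term>
    </RelatedTerms>
  </Term>
</Terms>
"""


def clean(text):
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    return " ".join((text or "").split())


def terms_to_dict(xml_text):
    """Build the {"thesaurus": [...]} structure from the XML text."""
    root = ET.fromstring(xml_text)  # root is the <Terms> element
    thesaurus = []
    # findall("Term") matches direct children only, so the nested Term
    # elements under RelatedTerms are not picked up by this loop.
    for term in root.findall("Term"):
        detail = {
            "Title": clean(term.findtext("Title")),
            "Description": clean(term.findtext("Description")),
        }
        related = [
            {
                "Title": clean(rterm.findtext("Title")),
                "Relationship": clean(rterm.findtext("Relationship")),
            }
            for rterm in term.findall("RelatedTerms/Term")
        ]
        if related:
            detail["RelatedTerms"] = related
        thesaurus.append(detail)
    return {"thesaurus": thesaurus}


result = terms_to_dict(SAMPLE_XML)

# Write the JSON to the file name mentioned in the question.
with open("output.dat", "w", encoding="utf8") as fh:
    json.dump(result, fh, indent=4)
```

With the BeautifulSoup version, the same two ideas should carry over: apply a whitespace-collapsing helper to each .text value, and replace the final print with json.dump into an open file.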
