How to convert XML to JSON using Python?

Question

I have the XML below saved in a file called movies.xml. I need to convert it to JSON, keeping only some of the values. For a direct conversion I could use xmltodict; I am using etree and etree.XMLParser(). I intend to load the result into Elasticsearch afterwards. I have successfully extracted a single node using attrib.

    <?xml version="1.0" encoding="UTF-8" ?>
    <collection>
    <genre category="Action">
        <decade years="1980s">
            <movie favorite="True" title="Indiana Jones: The raiders of the lost Ark">
                <format multiple="No">DVD</format>
                <year>1981</year>
                <rating>PG</rating>
                <description>
                'Archaeologist and adventurer Indiana Jones 
                is hired by the U.S. government to find the Ark of the 
                Covenant before the Nazis.'
                </description>
            </movie>
               <movie favorite="True" title="THE KARATE KID">
               <format multiple="Yes">DVD,Online</format>
               <year>1984</year>
               <rating>PG</rating>
               <description>None provided.</description>
            </movie>
            <movie favorite="False" title="Back 2 the Future">
               <format multiple="False">Blu-ray</format>
               <year>1985</year>
               <rating>PG</rating>
               <description>Marty McFly</description>
            </movie>
        </decade>
        <decade years="1990s">
            <movie favorite="False" title="X-Men">
               <format multiple="Yes">dvd, digital</format>
               <year>2000</year>
               <rating>PG-13</rating>
               <description>Two mutants come to a private academy for their kind whose resident superhero team must 
               oppose a terrorist organization with similar powers.</description>
            </movie>
            <movie favorite="True" title="Batman Returns">
               <format multiple="No">VHS</format>
               <year>1992</year>
               <rating>PG13</rating>
               <description>NA.</description>
            </movie>
               <movie favorite="False" title="Reservoir Dogs">
               <format multiple="No">Online</format>
               <year>1992</year>
               <rating>R</rating>
               <description>WhAtEvER I Want!!!?!</description>
            </movie>
        </decade>    
    </genre>

    <genre category="Thriller">
        <decade years="1970s">
            <movie favorite="False" title="ALIEN">
                <format multiple="Yes">DVD</format>
                <year>1979</year>
                <rating>R</rating>
                <description>"""""""""</description>
            </movie>
        </decade>
        <decade years="1980s">
            <movie favorite="True" title="Ferris Bueller's Day Off">
                <format multiple="No">DVD</format>
                <year>1986</year>
                <rating>PG13</rating>
                <description>Funny movie about a funny guy</description>
            </movie>
            <movie favorite="FALSE" title="American Psycho">
                <format multiple="No">blue-ray</format>
                <year>2000</year>
                <rating>Unrated</rating>
                <description>psychopathic Bateman</description>
            </movie>
        </decade>
    </genre>
    </collection>

My desired output is below:

First output: {'Action': ['Indiana Jones: The raiders of the lost Ark', 'THE KARATE KID', 'Back 2 the Future', 'X-Men', 'Batman Returns', 'Reservoir Dogs']}
Second output: {'movies': 'description'}
Third output: {'movies': 'year'}

I have tried the basic operations I learned from DataCamp, but I could not get the desired output:

from lxml import etree
parser = etree.XMLParser()
tree= etree.parse('movies.xml', parser)
data= tree.find("genre[@category='Action']")
json= {}
for child in enumerate(data.getchildren()):
    temp = {}
    for content in child[1].getchildren():
        temp[content.attrib.get('title')] =  content.text.strip()
        json[child[0]] = temp.keys()
json

What is the problem? Please show us the code you have written so far. – mzjn, Commented Oct 23, 2018 at 15:33
Thriller? X-Men? They are in different tags. Check your expected output again. – KC., Commented Oct 24, 2018 at 4:32
Have a look at stackoverflow.com/questions/tagged/xml+json+python – Adrian W, Commented Oct 24, 2018 at 17:51

Adrian W · Accepted Answer · 2018-10-24 20:40:43Z

I would recommend using XSLT to transform the XML to JSON:

import json

from lxml import etree

XSL = '''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:fo="http://www.w3.org/1999/XSL/Format">
    <xsl:output method="text"/>

    <xsl:template match="/collection">
        <xsl:text>{</xsl:text>
            <xsl:apply-templates/>
        <xsl:text>}</xsl:text>
    </xsl:template>

    <xsl:template match="genre">
        <xsl:text>"</xsl:text>
            <xsl:value-of select="@category"/>
        <xsl:text>": [</xsl:text>
        <xsl:for-each select="descendant::movie" >
            <xsl:text>"</xsl:text>
                <xsl:value-of select="@title"/>
            <xsl:text>"</xsl:text>
            <xsl:if test="position() != last()">
                <xsl:text>, </xsl:text>
            </xsl:if>
        </xsl:for-each>
        <xsl:text>]</xsl:text>
        <xsl:if test="following-sibling::*">
            <xsl:text>,
</xsl:text>
        </xsl:if>
    </xsl:template>

    <xsl:template match="text()"/>
</xsl:stylesheet>'''

# load input
dom = etree.parse('movies.xml')
# load XSLT
transform = etree.XSLT(etree.fromstring(XSL))

# apply XSLT on loaded dom
json_text = str(transform(dom))

# json_text contains the data converted to JSON format.
# you can use it with the JSON API. Example:
data = json.loads(json_text)
print(data)

Output:

{'Action': ['Indiana Jones: The raiders of the lost Ark', 'THE KARATE KID', 'Back 2 the Future', 'X-Men', 'Batman Returns', 'Reservoir Dogs'], 'Thriller': ['ALIEN', "Ferris Bueller's Day Off", 'American Psycho']}
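For comparison, the same genre-to-titles mapping can probably be built without XSLT. The following is a minimal sketch, not the method used above: it swaps in the standard library's xml.etree.ElementTree and lets json.dumps handle quoting and escaping. The inline SAMPLE string is an abbreviated stand-in for movies.xml, and the helper name titles_by_genre is hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Abbreviated stand-in for movies.xml (assumption: same element layout).
SAMPLE = """<collection>
  <genre category="Action">
    <decade years="1980s">
      <movie favorite="True" title="THE KARATE KID"><year>1984</year></movie>
    </decade>
    <decade years="1990s">
      <movie favorite="False" title="X-Men"><year>2000</year></movie>
    </decade>
  </genre>
  <genre category="Thriller">
    <decade years="1970s">
      <movie favorite="False" title="ALIEN"><year>1979</year></movie>
    </decade>
  </genre>
</collection>"""


def titles_by_genre(root):
    """Map each genre's category to the titles of all movies below it."""
    return {
        # iter("movie") walks every descendant <movie>, across all decades.
        genre.get("category"): [m.get("title") for m in genre.iter("movie")]
        for genre in root.findall("genre")
    }


# For the real file, use: root = ET.parse("movies.xml").getroot()
root = ET.fromstring(SAMPLE)
result = titles_by_genre(root)
json_text = json.dumps(result, indent=2)
print(json_text)
```

Building a dict first and serializing it at the end means titles containing double quotes or backslashes are escaped by json.dumps rather than by hand.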

I don't understand what you want to achieve with "second output" and "third output", though, as these outputs seem to be constants.
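One possible reading, which is only an assumption since the question does not spell it out, is that the second and third outputs should map each movie title to its description and to its year. Under that assumption, a sketch with the standard library's xml.etree.ElementTree could look like this. The SAMPLE string is an abbreviated stand-in for movies.xml, and field_by_title is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in for movies.xml (assumption: same element layout).
SAMPLE = """<collection>
  <genre category="Action">
    <decade years="1980s">
      <movie favorite="True" title="THE KARATE KID">
        <year>1984</year>
        <description>None provided.</description>
      </movie>
      <movie favorite="False" title="Back 2 the Future">
        <year>1985</year>
        <description>Marty
            McFly</description>
      </movie>
    </decade>
  </genre>
</collection>"""


def field_by_title(root, field):
    """Map each movie title to the text of one of its child elements."""
    mapping = {}
    for movie in root.iter("movie"):
        # findtext returns the default when the child element is missing.
        text = movie.findtext(field, default="")
        # Collapse line breaks and indentation inside multi-line text.
        mapping[movie.get("title")] = " ".join(text.split())
    return mapping


# For the real file, use: root = ET.parse("movies.xml").getroot()
root = ET.fromstring(SAMPLE)
descriptions = field_by_title(root, "description")
years = field_by_title(root, "year")
print(descriptions)
print(years)
```

The years stay strings here; converting them with int() would be a further choice that depends on how the data is indexed later.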
