1

Many questions and answers explain how to load a json file from a URL into a pandas data frame, using low level methods for example JSON to pandas DataFrame . I managed to make it work but if possible, I would prefer to use the simple pandas.io.json method to load the data frame directly from a url. The problem is that pandas.io.json.read_json returns a ValueError, which other answers such as ValueError: Expected object or value when reading json as pandas dataframe say can be fixed by setting the orient argument, but what I would need is a different argument that would make it possible to specify that the data frame is one level down in the nested json at the "results" level.

# Sample content of the json file
# {
#     "more": false,
#     "results": [
#         {
#             "id": "all",
#             "text": "All"
#         },
#         {
#             "ID":"4",
#             "text": "Afghanistan"
#         },
#         {
#             "id": "975",
#             "text": "ASEAN"
#         }
#     ]
# }

import urllib.request
import json
import pandas

url_reporter = "https://comtrade.un.org/Data/cache/reporterAreas.json"
url_partner = "https://comtrade.un.org/Data/cache/partnerAreas.json"

# Failed attempt with pandas.io
pandas.io.json.read_json(url_reporter)

# ~/.local/lib/python3.7/site-packages/pandas/io/json/_json.py in _parse_no_numpy(self)
#    1138         if orient == "columns":
#    1139             self.obj = DataFrame(
# -> 1140                 loads(json, precise_float=self.precise_float), dtype=None
#    1141             )
#    1142         elif orient == "split":
#
# ValueError: Expected object or value

pandas.io.json.read_json(url_partner)
# Returns a fattened data frame with json as the cell content in a "results" column

# Successful attempt using lower level packages
req = urllib.request.Request(url=url_reporter)
with urllib.request.urlopen(req) as response:
    json_content = json.load(response)
    df = pandas.json_normalize(json_content['results'])


In [34]: df
Out[34]:
      id                      text
0    all                       All
1      0                     World
2      4               Afghanistan
3    472  Africa CAMEU region, nes
4      8                   Albania
..   ...                       ...
289  879         Western Asia, nes
290  732            Western Sahara
291  887                     Yemen
292  894                    Zambia
293  716                  Zimbabwe

[294 rows x 2 columns]

1 Answer 1

1

There is a BOM character in url_reporter response. Set the encoding like this and you should get your result:

df = pd.io.json.read_json(url_reporter, encoding='utf-8-sig')

Since the whole processing is done in the pd.io.json.read_json method, we cannot select the keys to direct to the actual data that we are after. So you need to run this additional code to get your desired results:

df = pd.json_normalize(df.results.to_list())
Sign up to request clarification or add additional context in comments.

2 Comments

Thank you that fixes the read issue, but the data frame is still formatted with a more and a results column. results is a character column containing {'id': '4', 'text': 'Afghanistan'}. Is there a way to tell read_json to load the id and text columns directly instead?
From URL side there is nothing that could be done. However, after importing the data you can clean it up. I will add the post processing to the solution. @PaulRougieux You can find the solution now.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.