1

I have a JSON file like this:

{"id": "53f43a7bdabfaeb22f497fb8", "name": "Nayara Fernanda Monte", "h_index": 0, "n_pubs": 1, "tags": [], "pubs": [{"i": "53e9bc79b7602d97048f8888", "r": 2}, {"i": "56d8971cdabfae2eee185494", "r": 2}], "n_citation": 0, "orgs": [""]}
{"id": "53f43f5adabfaedf435b9bdf", "name": "J\u00f6rg B\u00e4ssmann", "h_index": 0, "n_pubs": 1, "tags": [{"w": 1, "t": "Vehicle Theft .Immobilisation .Crime Prevention.Crimereduction . Displacement .Motorcycle Theft .Opportunistic Offenders .Professional Offenders . Evaluation.Mixed-Methods Design"}], "pubs": [{"i": "53e9b4a1b7602d9703fad4e7", "r": 0}], "n_citation": 0, "orgs": ["Bingen am Rhein, Germany"]}

I tried reading it using the following code:

import json

with open('path/xyz.json') as f:
    data = json.load(f)

However, it returns an error:

'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

How do I fix this error? Thanks.

5
  • Change the encoding to utf-16, or open the file in 'rb' mode, and retry. Commented Jun 21, 2020 at 4:29
  • If those two lines are really your file, it's not a valid JSON object. It's two JSON objects separated by a newline. Commented Jun 21, 2020 at 4:30
  • 1
    @MarkMeyer - I think you just saved OP from the next stackoverflow question. Commented Jun 21, 2020 at 4:32
  • This file has one dictionary per line, and OP is trying to load all of them at once, which causes the error. It's better to load them one by one and then build a single JSON object. Commented Jun 21, 2020 at 4:45
  • 1
    @sahasrara62 - The most immediate problem is that it's a UTF-16 encoded file. The line-by-line issue is next. Commented Jun 21, 2020 at 4:52

4 Answers

1

If you're stuck with multiple JSON "documents" in a single file, then you could always do this:


import json

json_documents = []
# Assumes the UTF-16 encoding reported for the file in the question
with open('path/to/file', 'r', encoding='utf-16') as fh:
    for line in fh:
        json_documents.append(json.loads(line))

This decodes the string version of each line. Note: this only works if each line is a whole JSON document. If multiple documents are on a single line, or if a single document spans multiple lines, then you'll need to do something fancier.
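For the "something fancier" case, one possible approach is `json.JSONDecoder.raw_decode`, which parses a single value and reports where it stopped. The sketch below is only an illustration: the helper name `iter_json_documents` and the sample text are made up, and it assumes the whole input fits in memory as one string.

```python
import json

def iter_json_documents(text):
    """Yield each JSON value found in `text`, in order (hypothetical helper)."""
    decoder = json.JSONDecoder()
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not skip leading whitespace, so do it here
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # Returns the parsed value and the index just past it
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Two documents on one line, then one document spanning several lines
sample = '{"id": 1} {"id": 2}\n{\n  "id": 3\n}'
documents = list(iter_json_documents(sample))
print(documents)
```

Because the parser tracks its own position, line boundaries no longer matter; the trade-off is that the whole text is held in memory.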


0

UTF-16 encoded files written by Microsoft tools start with a Byte Order Mark (BOM): the bytes FF FE for little-endian data, or FE FF for big-endian data. That is why the error reports byte 0xff at position 0. UTF-16 stores text in two-byte code units. Usually each two-byte unit holds a single Unicode character, but characters outside the Basic Multilingual Plane take four bytes (a surrogate pair).
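To check whether a file really starts with such a BOM, you can inspect its first bytes. This is a rough sketch, not a general encoding detector: the function name `detect_bom` is hypothetical, and the sample bytes are built by hand rather than read from the file in the question.

```python
import codecs

def detect_bom(first_bytes):
    """Return a label for a leading BOM, or None (hypothetical helper)."""
    # The UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM, so test it first
    if first_bytes.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if first_bytes.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    if first_bytes.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    return None

little = codecs.BOM_UTF16_LE + '{"a": 1}'.encode("utf-16-le")
big = codecs.BOM_UTF16_BE + '{"a": 1}'.encode("utf-16-be")
print(detect_bom(little[:4]), detect_bom(big[:4]), detect_bom(b'{"a": 1}'))

# One code unit versus a surrogate pair
print(len("A".encode("utf-16-le")), len("\U0001F600".encode("utf-16-le")))
```

For a real file you would pass in something like `open(path, 'rb').read(4)`. When a BOM is present, opening the file with `encoding="utf-16"` lets Python consume the BOM and pick the byte order for you.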

As mentioned, encoding="UTF-16" should read it. See the Unicode HOWTO.

As a side note, UTF-16 encoded JSON files may not be recognized by all programs. If you plan on sending them over HTTP, for instance, re-encoding to UTF-8 is likely a good choice.
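If you do want to re-encode, a minimal sketch could look like the following. The function names and the sample bytes are illustrative assumptions, and the code assumes the input starts with a BOM so that the `utf-16` codec can work out the byte order.

```python
from pathlib import Path

def utf16_to_utf8(raw: bytes) -> bytes:
    """Decode BOM-prefixed UTF-16 bytes and re-encode them as UTF-8."""
    # The "utf-16" codec reads the BOM, picks the byte order, and drops the BOM
    return raw.decode("utf-16").encode("utf-8")

def reencode_file(src_path: str, dst_path: str) -> None:
    """Hypothetical wrapper: read a UTF-16 file, write a UTF-8 copy."""
    Path(dst_path).write_bytes(utf16_to_utf8(Path(src_path).read_bytes()))

# Hand-built sample: little-endian BOM followed by one JSON line
sample = b"\xff\xfe" + '{"name": "Jörg"}\n'.encode("utf-16-le")
converted = utf16_to_utf8(sample)
print(converted.decode("utf-8"), end="")
```

This reads the whole file into memory, which is fine for small files; for a very large file you would copy it line by line instead.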

import json

with open('path/xyz.json', encoding="UTF-16") as f:
    for line in f:
        data = json.loads(line)

2 Comments

I tried this. It said "the JSON object must be str, bytes or bytearray, not TextIOWrapper".
My mistake, I should have decoded the line. There are multiple JSON objects in the file, one per line. You'll have to figure out how you want to handle that.
0

I think the problem is with the JSON file. A JSON file can have only one top-level set of brackets, with all the data inside it, but you have separated the data into several objects.

You can do something like this:

{
"1": {
    "id": "53f43a7bdabfaeb22f497fb8",
    "name": "Nayara Fernanda Monte",
    "h_index": 0,
    "n_pubs": 1,
    "tags": [],
    "pubs": [{
        "i": "53e9bc79b7602d97048f8888",
        "r": 2
    }, {
        "i": "56d8971cdabfae2eee185494",
        "r": 2
    }],
    "n_citation": 0,
    "orgs": [""]
},
"2": {
    "id": "53f43f5adabfaedf435b9bdf",
    "name": "J\u00f6rg B\u00e4ssmann",
    "h_index": 0,
    "n_pubs": 1,
    "tags": [{
        "w": 1,
        "t": "Vehicle Theft .Immobilisation .Crime Prevention.Crimereduction . Displacement .Motorcycle Theft .Opportunistic Offenders .Professional Offenders . Evaluation.Mixed-Methods Design"
    }],
    "pubs": [{
        "i": "53e9b4a1b7602d9703fad4e7",
        "r": 0
    }],
    "n_citation": 0,
    "orgs": ["Bingen am Rhein, Germany"]
}}
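Rather than editing the file by hand, the same keyed structure can be built from the one-object-per-line input. This is only a sketch: the helper name `json_lines_to_keyed_object` and the shortened sample records are made up for illustration, and numbering the keys "1", "2", ... simply mirrors the layout above.

```python
import json

def json_lines_to_keyed_object(lines):
    """Combine one-JSON-object-per-line input into a single dict (hypothetical helper)."""
    # Skip blank lines, parse the rest
    records = (json.loads(line) for line in lines if line.strip())
    # Keys are "1", "2", ... in input order
    return {str(n): record for n, record in enumerate(records, start=1)}

sample_lines = [
    '{"id": "a", "h_index": 0}\n',
    '\n',
    '{"id": "b", "h_index": 3}\n',
]
combined = json_lines_to_keyed_object(sample_lines)
print(json.dumps(combined, indent=4))
```

Any iterable of strings works as input, including an open file object, so a file opened with the right encoding can be passed in directly.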

1 Comment

The format is called JSON Lines.
0

The JSON you provided is not valid JSON.

You have put multiple JSON objects in the file without any separator and without wrapping them in an array.

As for the encoding issue, it seems the file contents are being converted from binary to str using the wrong encoding.

Try this:

import json

with open('./xyz.json', 'rb') as f:
    data = json.load(f)

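Passing the added parameter 'rb' opens the file in binary mode, so the contents are read as bytes and open() does not attempt to decode them to str. json.load accepts bytes and detects whether they are UTF-8, UTF-16, or UTF-32.

Check this repl: https://repl.it/@SourabhLalwani/FickleBriefOperatingenvironment#xyz.json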
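To see what binary mode does and does not solve, here is a small sketch using hand-built bytes instead of the actual file. The sample data and variable names are assumptions for illustration.

```python
import codecs
import json

# A single UTF-16 document as bytes: json.loads works out the encoding itself
raw_single = codecs.BOM_UTF16_LE + '{"name": "Jörg"}'.encode("utf-16-le")
single = json.loads(raw_single)
print(single)

# Two documents, one per line: the encoding is fine, but the parse still fails
raw_lines = codecs.BOM_UTF16_LE + '{"id": 1}\n{"id": 2}\n'.encode("utf-16-le")
try:
    json.loads(raw_lines)
    parsed_whole = True
except json.JSONDecodeError as exc:
    parsed_whole = False
    print("still fails:", exc.msg)

# Decode once, then parse each non-empty line separately
records = [
    json.loads(line)
    for line in raw_lines.decode("utf-16").splitlines()
    if line.strip()
]
print(records)
```

So 'rb' gets past the decode error, but a file holding one object per line still has to be parsed line by line.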
