
I'm having a hard time loading multiple line-delimited JSON files into a single pandas dataframe. This is the code I'm using:

import os, json
import pandas as pd
import numpy as np
import glob
pd.set_option('display.max_columns', None)

temp = pd.DataFrame()

path_to_json = '/Users/XXX/Desktop/Facebook Data/*' 

json_pattern = os.path.join(path_to_json,'*.json')
file_list = glob.glob(json_pattern)

for file in file_list:
    data = pd.read_json(file, lines=True)
    temp.append(data, ignore_index = True)

It looks like all the files are loading when I look through file_list, but I can't figure out how to get each file into a dataframe. There are about 50 files with a couple of lines in each.

6 Answers

Change the last line to:

temp = temp.append(data, ignore_index = True)

We have to do this because the append does not happen in place: the append method does not modify the data frame it is called on. It just returns a new data frame containing the result of the append operation.
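A minimal illustration of the not-in-place behavior (shown here with pd.concat, since DataFrame.append was removed in pandas 2.0; the frames are made up):

```python
import pandas as pd

a = pd.DataFrame({"x": [1]})
b = pd.DataFrame({"x": [2]})

# Like append, concat returns a brand-new frame; `a` is left untouched.
c = pd.concat([a, b], ignore_index=True)

print(len(a))  # 1 -- the original frame is unchanged
print(len(c))  # 2 -- the result holds both rows
```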

Edit:

Since writing this answer I have learned that you should never use DataFrame.append inside a loop, because it leads to quadratic copying (see this answer). In fact, DataFrame.append was deprecated in pandas 1.4 and removed entirely in pandas 2.0.

What you should do instead is first create a list of data frames and then use pd.concat to concatenate them all in a single operation. Like this:

dfs = [] # an empty list to store the data frames
for file in file_list:
    data = pd.read_json(file, lines=True) # read data frame from json file
    dfs.append(data) # append the data frame to the list

temp = pd.concat(dfs, ignore_index=True) # concatenate all the data frames in the list.

This alternative should be considerably faster.
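For reference, here is the list-then-concat pattern end to end, run against a couple of throwaway files written to a temporary directory (the file names and contents are invented purely for illustration):

```python
import glob
import json
import os
import tempfile

import pandas as pd

# Create two small line-delimited JSON files so the example is self-contained.
tmpdir = tempfile.mkdtemp()
for i in range(2):
    with open(os.path.join(tmpdir, f"part{i}.json"), "w") as f:
        f.write(json.dumps({"id": i, "value": i * 10}) + "\n")
        f.write(json.dumps({"id": i + 100, "value": i}) + "\n")

file_list = glob.glob(os.path.join(tmpdir, "*.json"))

# Build a list of frames, then concatenate once.
dfs = [pd.read_json(path, lines=True) for path in file_list]
temp = pd.concat(dfs, ignore_index=True)
print(temp.shape)  # (4, 2)
```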

from pathlib import Path
import pandas as pd

paths = Path("/home/data").glob("*.json")
df = pd.DataFrame([pd.read_json(p, typ="series") for p in paths])
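Note that typ="series" expects each file to hold a single JSON object. If the files are line-delimited like the asker's, a variant along these lines should work (a sketch; the sample files here are invented so the snippet runs on its own):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Write two tiny line-delimited JSON files to a temp directory for illustration.
root = Path(tempfile.mkdtemp())
(root / "a.json").write_text('{"x": 1}\n{"x": 2}\n')
(root / "b.json").write_text('{"x": 3}\n')

paths = root.glob("*.json")
df = pd.concat((pd.read_json(p, lines=True) for p in paths), ignore_index=True)
print(df.shape)  # (3, 1)
```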


If you need to flatten the JSON, Juan Estevez’s approach won’t work as is. Here is an alternative:

import json

import pandas as pd

dfs = []
for file in file_list:
    with open(file) as f:
        json_data = pd.json_normalize(json.loads(f.read()))
    dfs.append(json_data)
df = pd.concat(dfs, sort=False) # or sort=True depending on your needs

Or if your JSON files are line-delimited (not tested):

import json

import pandas as pd

dfs = []
for file in file_list:
    with open(file) as f:
        for line in f.readlines():
            json_data = pd.json_normalize(json.loads(line))
            dfs.append(json_data)
df = pd.concat(dfs, sort=False) # or sort=True depending on your needs
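Since pd.json_normalize also accepts a list of dicts, each file can be normalized in a single call rather than once per line, avoiding thousands of one-row frames. A sketch (the temp-directory setup is invented so the snippet is self-contained):

```python
import glob
import json
import os
import tempfile

import pandas as pd

# Illustrative setup: one file with two nested, line-delimited records.
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "sample.json"), "w") as f:
    f.write(json.dumps({"a": 1, "b": {"c": 2}}) + "\n")
    f.write(json.dumps({"a": 3, "b": {"c": 4}}) + "\n")
file_list = glob.glob(os.path.join(tmpdir, "*.json"))

dfs = []
for file in file_list:
    with open(file) as f:
        records = [json.loads(line) for line in f]  # parse every line first
    dfs.append(pd.json_normalize(records))          # flatten the whole file in one call
df = pd.concat(dfs, ignore_index=True)
print(list(df.columns))  # ['a', 'b.c']
```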

1 Comment

Thanks @Skippy le Grand Gourou. I think your code needs 'pd.' in front of 'json_normalize' to run.

I combined Juan Estevez's answer with glob. Thanks a lot.

import pandas as pd
import glob

def readFiles(path):
    files = glob.glob(path)
    dfs = [] # an empty list to store the data frames
    for file in files:
        data = pd.read_json(file, lines=True) # read data frame from json file
        dfs.append(data) # append the data frame to the list

    df = pd.concat(dfs, ignore_index=True) # concatenate all the data frames in the list.
    return df


Maybe you should state whether the JSON files were created with pandas' pd.to_json() or in some other way. I used data that was not created with pd.to_json(), and I don't think pd.read_json() can be used in my case. Instead, I wrote a customized for-each loop to load everything into DataFrames.

1 Comment

You probably mean "if the files are valid JSON or not"… ;)

Here's another option:

df = pd.DataFrame()  
for file in file_list:  
    data = pd.read_json(file, lines=True) 
    df = pd.concat([df,data], ignore_index=True)  

1 Comment

Hi AnnA. Welcome to SO and thank you for your contribution. This can work, but as implied by the Edit part of the accepted answer, using pd.concat in such a way inside a for-loop is a bad idea, as "[i]t leads to quadratic copying" (from the linked SO post). Creating a list of dfs first, as in the accepted answer, and then applying pd.concat only once afterwards will show far superior performance.
