Efficient way of creating a dict of dicts from a pandas DataFrame

Question

I have a pandas dataframe of the following structure:

d = {'I': ['A', 'B', 'C', 'D'], 'X': [ 1, 0, 3, 1], 'Y': [0, 1, 2, 1], 'Z': [1, 0, 0, 0], 'W': [3, 2, 0, 0]}
df = pd.DataFrame(data=d, columns=['I','X', 'Y', 'Z', 'W'])
df.set_index('I', inplace=True, drop=True)

I need to create a dict of dict to get data of all existing edges (indicated by nonzero values) between nodes:

{'A': {'X': {1}, 'Z': {1}, 'W': {3}}, 'B': {'Y': {1}, 'W': {2}}, 'C': {'X': {3}, 'Y': {2}}, 'D': {'Y': {1}, 'X': {1}}}

I need it to create a network graph using the NetworkX library and perform some calculations on it. Obviously, I could loop over every cell in the DataFrame, but my data is quite large and that would be inefficient. I'm looking for a better way, possibly using vectorization and/or a comprehension. I've tried a list comprehension but got stuck and cannot make it work. Can anyone suggest a more efficient way to do this?

Try my answer :-). It's a single-loop comprehension plus pandas native indexing. – LMC, commented Jul 6 at 15:16
Is it required to use a pandas DataFrame? Processing the dict directly will probably be easier than converting it to and from a DataFrame. – willeM_ Van Onsem, commented Jul 6 at 16:28
Converting to a dict first, then creating the nested structure should be the fastest method: {k: {k: {v} for k,v in inner.items() if v} for k, inner in df.to_dict(orient="index").items()}. – cottontail, commented Jul 7 at 21:16

Viktor Sbruev · Accepted Answer · 2025-07-06 14:17:45Z

3

You can do this by combining df.iterrows() with a dictionary comprehension. Although iterrows() is not truly vectorized, it's still reasonably efficient for this kind of task and cleaner than using manual nested loops. For example, you could write:

edge_dictionary = {
    node: {attribute: {weight} for attribute, weight in attributes.items() if weight != 0}
    for node, attributes in df.iterrows()
}
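If `iterrows()` turns out to be a bottleneck, one option is `itertuples()`, which yields plain tuples instead of building a Series for every row. The following is only a minimal sketch, assuming the sample DataFrame from the question; `columns` and `edge_dictionary_tuples` are illustrative names that do not appear in the original answer.

```python
import pandas as pd

d = {'I': ['A', 'B', 'C', 'D'], 'X': [1, 0, 3, 1], 'Y': [0, 1, 2, 1],
     'Z': [1, 0, 0, 0], 'W': [3, 2, 0, 0]}
df = pd.DataFrame(data=d).set_index('I')

columns = list(df.columns)

# With name=None each row is a plain tuple: (index_label, value_1, value_2, ...)
edge_dictionary_tuples = {
    row[0]: {col: {weight} for col, weight in zip(columns, row[1:]) if weight != 0}
    for row in df.itertuples(index=True, name=None)
}

print(edge_dictionary_tuples)
```

Whether this is actually faster than `iterrows()` on a given dataset is worth measuring rather than assuming.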

If your DataFrame is very large and you’re concerned about performance, another approach is to first convert it into a plain dictionary of dictionaries using df.to_dict(orient='index') and then filter out the zeros. That would look like this:

data_dictionary = df.to_dict(orient='index')
edge_dictionary = {
    node: {attribute: {weight} for attribute, weight in connections.items() if weight != 0}
    for node, connections in data_dictionary.items()
}
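The `to_dict` approach can be wrapped in a small helper for reuse. This is a sketch under the assumption that the index holds unique node labels; `to_edge_dict` and the extra all-zero row `'E'` are hypothetical additions for illustration. It also shows that a node without any nonzero value still appears in the result, with an empty inner dict, and how such nodes could be dropped if that is not wanted.

```python
import pandas as pd

def to_edge_dict(frame):
    """Return {node: {column: {value}}} for every nonzero cell of `frame`."""
    rows = frame.to_dict(orient='index')  # one plain dict per row
    return {
        node: {attribute: {weight} for attribute, weight in connections.items() if weight != 0}
        for node, connections in rows.items()
    }

# Sample data from the question plus a hypothetical node 'E' with no edges
d = {'I': ['A', 'B', 'C', 'D', 'E'], 'X': [1, 0, 3, 1, 0], 'Y': [0, 1, 2, 1, 0],
     'Z': [1, 0, 0, 0, 0], 'W': [3, 2, 0, 0, 0]}
df = pd.DataFrame(data=d).set_index('I')

edges = to_edge_dict(df)

# Optional: drop nodes that have no edges at all
edges_without_isolated = {node: inner for node, inner in edges.items() if inner}

print(edges)
print(edges_without_isolated)
```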

answered Jul 6 at 14:17

Viktor Sbruev


2 Comments

carpediem Jul 6 at 14:58

Thanks Viktor, your solution works great!

Viktor Sbruev Jul 8 at 10:20

I’m glad I could help!

furas · Answer · 2025-07-06 14:43:55Z

2

My version seems similar to @ViktorSbruev's, but his idea of converting everything to a dictionary first seems better.

I was thinking about using .apply(function, axis=1) to run code on every row and create a column holding the inner dictionaries:

def convert(row):
    data = row.to_dict()

    # skip `0` and convert value to `set()`
    data = {key:{val} for key, val in data.items() if val != 0}  

    return data

df['networkx'] = df.apply(convert, axis=1)

to get

A    {'X': {1}, 'Z': {1}, 'W': {3}}
B              {'Y': {1}, 'W': {2}}
C              {'X': {3}, 'Y': {2}}
D              {'X': {1}, 'Y': {1}}
Name: networkx, dtype: object

And later convert this column to dictionary

result = df['networkx'].to_dict()

which gives the expected result:

{'A': {'X': {1}, 'Z': {1}, 'W': {3}}, 'B': {'Y': {1}, 'W': {2}}, 'C': {'X': {3}, 'Y': {2}}, 'D': {'Y': {1}, 'X': {1}}}
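Assigning the result to `df['networkx']` changes the DataFrame, so running `df.apply(convert, axis=1)` a second time would also process the new column. A minimal sketch that skips the intermediate column, assuming the same sample data as in the question:

```python
import pandas as pd

d = {'I': ['A', 'B', 'C', 'D'], 'X': [1, 0, 3, 1], 'Y': [0, 1, 2, 1],
     'Z': [1, 0, 0, 0], 'W': [3, 2, 0, 0]}
df = pd.DataFrame(data=d).set_index('I')

def convert(row):
    # skip `0` and wrap every remaining value in a set
    return {key: {val} for key, val in row.to_dict().items() if val != 0}

# apply() returns a Series of dicts indexed by node; df itself stays unchanged
result = df.apply(convert, axis=1).to_dict()

print(result)
```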

Full working code that I used to test different versions:

import pandas as pd

d = {'I': ['A', 'B', 'C', 'D'], 'X': [ 1, 0, 3, 1], 'Y': [0, 1, 2, 1], 'Z': [1, 0, 0, 0], 'W': [3, 2, 0, 0]}
df = pd.DataFrame(data=d, columns=['I','X', 'Y', 'Z', 'W'])
df.set_index('I', inplace=True, drop=True)

# for test
expected = {'A': {'X': {1}, 'Z': {1}, 'W': {3}}, 'B': {'Y': {1}, 'W': {2}}, 'C': {'X': {3}, 'Y': {2}}, 'D': {'Y': {1}, 'X': {1}}}

print(df)

def convert(row):
    #print(row)
    data = row.to_dict()
    #data = {row.name: {key:{val} for key, val in data.items() if val != 0}} # version 1
    data = {key:{val} for key, val in data.items() if val != 0}  # version 2
    return data

df['networkx'] = df.apply(convert, axis=1)
print(df['networkx'])

#print(list(df['networkx'].items()))

#result = {name:item[name] for name,item in df['networkx'].items()}  # for version 1
#result = {name:item for name,item in df['networkx'].items()}         # for version 2
result = df['networkx'].to_dict()                                    # for version 2

print('result  :', result)
print('expected:', expected)

answered Jul 6 at 14:43

furas


3 Comments

carpediem Jul 6 at 15:02

Thank you. Great solution. But since Viktor was faster and his solution works also great, I'll have to accept his. I really appreciate your help.

furas Jul 6 at 16:53

his version is nice - short and still readable :)

Viktor Sbruev Jul 8 at 10:26

Thank you! Your solution is nice too! I will look into it with interest and compare it with mine over the weekend.

LMC · Answer · 2025-07-06 23:17:17Z

This comprehension returns the expected result by iterating over the index, applying boolean indexing to each row Series, and returning a dictionary for each Series.

import pandas as pd

d = {'I': ['A', 'B', 'C', 'D'], 'X': [ 1, 0, 3, 1], 'Y': [0, 1, 2, 1], 'Z': [1, 0, 0, 0], 'W': [3, 2, 0, 0]}
df = pd.DataFrame(data=d, columns=['I','X', 'Y', 'Z', 'W'])
df.set_index('I', inplace=True, drop=True)

out = { i: df.loc[i][df.loc[i] != 0].to_dict() for i in df.index}

print(out)

Result

{'A': {'X': 1, 'Z': 1, 'W': 3}, 'B': {'Y': 1, 'W': 2}, 'C': {'X': 3, 'Y': 2}, 'D': {'X': 1, 'Y': 1}}

Wrapping values in a Set

{ i: df.loc[i][df.loc[i] != 0].apply(lambda x: {x}).to_dict() for i in df.index}

{'A': {'X': {1}, 'Z': {1}, 'W': {3}}, 'B': {'Y': {1}, 'W': {2}}, 'C': {'X': {3}, 'Y': {2}}, 'D': {'X': {1}, 'Y': {1}}}
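The boolean indexing above is evaluated once per row. As a hedged alternative, the nonzero mask can be computed once for the whole frame with NumPy, and the nested dict filled from the resulting positions. This is only a sketch, assuming all columns are numeric and the index labels are unique; the names `values`, `row_pos`, `col_pos` and `out_sets` are illustrative, and the approach has not been benchmarked here.

```python
import numpy as np
import pandas as pd

d = {'I': ['A', 'B', 'C', 'D'], 'X': [1, 0, 3, 1], 'Y': [0, 1, 2, 1],
     'Z': [1, 0, 0, 0], 'W': [3, 2, 0, 0]}
df = pd.DataFrame(data=d).set_index('I')

values = df.to_numpy()
index_labels = df.index.tolist()
column_labels = df.columns.tolist()

# Positions of all nonzero cells, in row-major order
row_pos, col_pos = np.nonzero(values)

# Start with an empty inner dict per node so nodes without edges are kept
out_sets = {label: {} for label in index_labels}
for r, c in zip(row_pos.tolist(), col_pos.tolist()):
    # .item() converts the NumPy scalar to a plain Python number
    out_sets[index_labels[r]][column_labels[c]] = {values[r, c].item()}

print(out_sets)
```

The Python-level loop now runs only over the nonzero cells, which may help when the data is sparse.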

Testing

A timing test against the accepted answer on a DataFrame with 2000 rows and 2000 columns (3 runs per method) shows the accepted answer taking about 40% longer.

import pandas as pd
import timeit
import json
import random

def create_json(jpath, n, idx_name):
    data = {}
    for i in range(n):
        data[f'i{i}'] = [random.randint(0, n//5) for _ in range(n)]
    data[idx_name] = list(data.keys())
    with open(jpath, 'w') as j:
        json.dump(data, j)
    return idx_name

def lmc_method(df):
    out = { i: df.loc[i][df.loc[i] != 0].to_dict() for i in df.index}    
    return out

def vs_method(df):
    data_dict = df.to_dict(orient='index')
    edge_dictionary = {
        node: {attribute: {weight} for attribute, weight in connections.items() if weight != 0}
        for node, connections in data_dict.items()
    }
    return edge_dictionary

#d = {'I': ['A', 'B', 'C', 'D'], 'X': [ 1, 0, 3, 1], 'Y': [0, 1, 2, 1], 'Z': [1, 0, 0, 0], 'W': [3, 2, 0, 0]}

n = 2000
jpath = f'/home/lmc/tmp/faker_data_{n}.json'
idx_name = '38cd7657-f731-4ce3-9160-a3fbfc6619dc'

# Edit and uncomment to create test data
#create_json(jpath, n, idx_name)

with open(jpath, 'r') as j:
    data = json.load(j)
    #print(data)
    df = pd.DataFrame(data=data, columns = list(data.keys()))
    df.set_index(idx_name, inplace=True, drop=True)

    t1 = timeit.timeit(lambda: lmc_method(df), setup="pass",number=3)
    print(f"lmc_method: {t1:.2f}")
    
    t2 = timeit.timeit(lambda: vs_method(df), setup="pass",number=3)
    print(f"vs_method : {t2:.2f}, {t2/t1 - 1:.2f}")

lmc_method: 5.22
vs_method : 7.31, 0.40
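The two functions timed above do not return the same structure: `vs_method` wraps each value in a set, while `lmc_method` returns bare values. A like-for-like comparison could look like the sketch below, which assumes in-memory random data instead of the JSON file and a smaller frame so that it runs quickly. `make_frame`, `lmc_sets` and `vs_sets` are hypothetical names, and timings will vary with machine, data size and pandas version.

```python
import random
import timeit

import pandas as pd

def make_frame(n, seed=0):
    """Build an n x n frame of random integers in [0, n // 5] (illustrative data)."""
    rng = random.Random(seed)
    data = {f'i{c}': [rng.randint(0, n // 5) for _ in range(n)] for c in range(n)}
    return pd.DataFrame(data, index=[f'n{r}' for r in range(n)])

def lmc_sets(df):
    # Row-wise boolean indexing, with each value wrapped in a set
    return {i: df.loc[i][df.loc[i] != 0].apply(lambda x: {x}).to_dict() for i in df.index}

def vs_sets(df):
    # Convert to plain dicts first, then filter zeros and wrap values in sets
    rows = df.to_dict(orient='index')
    return {node: {k: {v} for k, v in inner.items() if v != 0} for node, inner in rows.items()}

df = make_frame(200)

# Both variants must agree before their timings are compared
assert lmc_sets(df) == vs_sets(df)

t_lmc = timeit.timeit(lambda: lmc_sets(df), number=3)
t_vs = timeit.timeit(lambda: vs_sets(df), number=3)
print(f"lmc_sets: {t_lmc:.2f}s, vs_sets: {t_vs:.2f}s")
```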

Thank you very much, your solution is very efficient. I really appreciate it.
