4

I have a pandas dataframe of the following structure:

import pandas as pd

d = {'I': ['A', 'B', 'C', 'D'], 'X': [1, 0, 3, 1], 'Y': [0, 1, 2, 1], 'Z': [1, 0, 0, 0], 'W': [3, 2, 0, 0]}
df = pd.DataFrame(data=d, columns=['I', 'X', 'Y', 'Z', 'W'])
df.set_index('I', inplace=True, drop=True)

I need to create a dict of dicts holding all existing edges (indicated by nonzero values) between nodes:

{'A': {'X': {1}, 'Z': {1}, 'W': {3}}, 'B': {'Y': {1}, 'W': {2}}, 'C': {'X': {3}, 'Y': {2}}, 'D': {'Y': {1}, 'X': {1}}}

I need it to build a network graph with the NetworkX library and run some calculations on it. I could loop over every cell in the DataFrame, but my data is quite large and that would be inefficient. I'm looking for a better way, possibly using vectorization and/or a comprehension. I've tried a comprehension but got stuck and could not make it work. Can anyone suggest a more efficient way to do this?

4
  • don't loop but use .apply() Commented Jul 6 at 14:02
  • Try my answer :-). It's a single-loop comprehension plus pandas native indexing. Commented Jul 6 at 15:16
  • Is a pandas DataFrame required? Processing the dict directly would probably be easier than converting it to and from a DataFrame. Commented Jul 6 at 16:28
  • Converting to a dict first, then creating the nested structure should be the fastest method: {k: {k: {v} for k,v in inner.items() if v} for k, inner in df.to_dict(orient="index").items()}. Commented Jul 7 at 21:16
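The suggestion above to skip the DataFrame entirely could look roughly like the sketch below. It assumes the data is still available as the column-oriented dict `d` from the question, with the node labels under the key 'I'; the names `nodes` and `edges` are only illustrative.

```python
# Column-oriented data from the question; 'I' holds the node labels.
d = {'I': ['A', 'B', 'C', 'D'], 'X': [1, 0, 3, 1], 'Y': [0, 1, 2, 1],
     'Z': [1, 0, 0, 0], 'W': [3, 2, 0, 0]}

nodes = d['I']

# For each node (by position), keep only the columns with a nonzero value
# and wrap that value in a set, as in the expected output.
edges = {
    node: {col: {values[pos]} for col, values in d.items()
           if col != 'I' and values[pos] != 0}
    for pos, node in enumerate(nodes)
}

print(edges)
```

This only applies if the data does not already live in a DataFrame for other reasons.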

3 Answers

3

You can do this by combining df.iterrows() with a dictionary comprehension. Although iterrows() is not truly vectorized, it's still reasonably efficient for this kind of task and cleaner than using manual nested loops. For example, you could write:

edge_dictionary = {
    node: {attribute: {weight} for attribute, weight in attributes.items() if weight != 0}
    for node, attributes in df.iterrows()
}

If your DataFrame is very large and you’re concerned about performance, another approach is to first convert it into a plain dictionary of dictionaries using df.to_dict(orient='index') and then filter out the zeros. That would look like this:

data_dict = df.to_dict(orient='index')
edge_dictionary = {
    node: {attribute: {weight} for attribute, weight in connections.items() if weight != 0}
    for node, connections in data_dict.items()
}
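Since the question asks about vectorization, here is a hedged sketch of a different technique from the two above: find the nonzero cells with one array operation and only loop over those. It is not part of the original answer, and it assumes the values are all numeric so that `df.to_numpy()` yields a numeric array. The name `edge_dictionary` is reused purely for illustration.

```python
import pandas as pd

d = {'I': ['A', 'B', 'C', 'D'], 'X': [1, 0, 3, 1], 'Y': [0, 1, 2, 1],
     'Z': [1, 0, 0, 0], 'W': [3, 2, 0, 0]}
df = pd.DataFrame(d).set_index('I')

values = df.to_numpy()
# Row and column positions of every nonzero cell, found in one array operation.
rows, cols = values.nonzero()

edge_dictionary = {}
for r, c in zip(rows, cols):
    node, attribute = df.index[r], df.columns[c]
    # .item() converts the NumPy scalar to a plain Python number.
    edge_dictionary.setdefault(node, {})[attribute] = {values[r, c].item()}

print(edge_dictionary)
```

One difference to be aware of: a node whose row is all zeros does not appear in this result at all, whereas the comprehensions above give it an empty inner dict. Whether this is faster on real data would need to be measured.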

2 Comments

Thanks Viktor, your solution works great!
I’m glad I could help!
2

It seems my version is similar to @VictorSbruev's, but his idea of converting everything to a dictionary first seems better.

I was thinking about using .apply(function, axis=1) to run code on every row and create a column holding the inner dictionaries:

def convert(row):
    data = row.to_dict()

    # skip `0` and convert value to `set()`
    data = {key:{val} for key, val in data.items() if val != 0}  

    return data

df['networkx'] = df.apply(convert, axis=1)

to get

A    {'X': {1}, 'Z': {1}, 'W': {3}}
B              {'Y': {1}, 'W': {2}}
C              {'X': {3}, 'Y': {2}}
D              {'X': {1}, 'Y': {1}}
Name: networkx, dtype: object

Then convert this column to a dictionary:

result = df['networkx'].to_dict()

which gives the expected result:

{'A': {'X': {1}, 'Z': {1}, 'W': {3}}, 'B': {'Y': {1}, 'W': {2}}, 'C': {'X': {3}, 'Y': {2}}, 'D': {'Y': {1}, 'X': {1}}}
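A possible variation on the approach above, offered as a sketch rather than as the answer's own method: call `.to_dict()` directly on the result of `apply` instead of storing it in a new column. That leaves `df` unchanged, so a later `apply` over the rows would not pick up a 'networkx' column as if it were an edge.

```python
import pandas as pd

d = {'I': ['A', 'B', 'C', 'D'], 'X': [1, 0, 3, 1], 'Y': [0, 1, 2, 1],
     'Z': [1, 0, 0, 0], 'W': [3, 2, 0, 0]}
df = pd.DataFrame(d).set_index('I')

def convert(row):
    # Skip zeros and wrap each remaining value in a set.
    return {key: {val} for key, val in row.items() if val != 0}

# apply() returns a Series of dicts indexed by node; no column is added to df.
result = df.apply(convert, axis=1).to_dict()

print(result)
print(list(df.columns))
```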

Full working code where I was testing different versions

import pandas as pd

d = {'I': ['A', 'B', 'C', 'D'], 'X': [ 1, 0, 3, 1], 'Y': [0, 1, 2, 1], 'Z': [1, 0, 0, 0], 'W': [3, 2, 0, 0]}
df = pd.DataFrame(data=d, columns=['I','X', 'Y', 'Z', 'W'])
df.set_index('I', inplace=True, drop=True)

# for test
expected = {'A': {'X': {1}, 'Z': {1}, 'W': {3}}, 'B': {'Y': {1}, 'W': {2}}, 'C': {'X': {3}, 'Y': {2}}, 'D': {'Y': {1}, 'X': {1}}}

print(df)

def convert(row):
    #print(row)
    data = row.to_dict()
    #data = {row.name: {key:{val} for key, val in data.items() if val != 0}} # version 1
    data = {key:{val} for key, val in data.items() if val != 0}  # version 2
    return data

df['networkx'] = df.apply(convert, axis=1)
print(df['networkx'])

#print(list(df['networkx'].items()))

#result = {name:item[name] for name,item in df['networkx'].items()}  # for version 1
#result = {name:item for name,item in df['networkx'].items()}         # for version 2
result = df['networkx'].to_dict()                                    # for version 2

print('result  :', result)
print('expected:', expected)

3 Comments

Thank you. Great solution. But since Viktor was faster and his solution also works great, I'll have to accept his. I really appreciate your help.
his version is nice - short and still readable :)
Thank you! Your solution is nice too! I will look into it with interest and compare it with mine over the weekend.
2

This comprehension returns the expected result by

  • Iterating the index
  • Applying boolean indexing to each series
  • Returning a dictionary for each series

import pandas as pd

d = {'I': ['A', 'B', 'C', 'D'], 'X': [ 1, 0, 3, 1], 'Y': [0, 1, 2, 1], 'Z': [1, 0, 0, 0], 'W': [3, 2, 0, 0]}
df = pd.DataFrame(data=d, columns=['I','X', 'Y', 'Z', 'W'])
df.set_index('I', inplace=True, drop=True)

out = { i: df.loc[i][df.loc[i] != 0].to_dict() for i in df.index}

print(out)

Result

{'A': {'X': 1, 'Z': 1, 'W': 3}, 'B': {'Y': 1, 'W': 2}, 'C': {'X': 3, 'Y': 2}, 'D': {'X': 1, 'Y': 1}}
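To make the three steps of the comprehension easier to follow, here they are spelled out for the single row 'A'. This is only an illustrative breakdown; the intermediate names `row`, `mask` and `selected` are not in the original code.

```python
import pandas as pd

d = {'I': ['A', 'B', 'C', 'D'], 'X': [1, 0, 3, 1], 'Y': [0, 1, 2, 1],
     'Z': [1, 0, 0, 0], 'W': [3, 2, 0, 0]}
df = pd.DataFrame(d).set_index('I')

row = df.loc['A']        # step 1: one index label gives one row as a Series
mask = row != 0          # step 2: boolean Series, True where the value is nonzero
selected = row[mask]     # boolean indexing keeps only the True positions
as_dict = selected.to_dict()  # step 3: plain dict for that row

print(mask.to_dict())
print(as_dict)
```

The comprehension repeats this for every label in `df.index`, with the difference that it looks the row up twice (`df.loc[i]` appears twice) instead of storing it in a variable.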

Wrapping values in a Set

{ i: df.loc[i][df.loc[i] != 0].apply(lambda x: {x}).to_dict() for i in df.index}

Result

{'A': {'X': {1}, 'Z': {1}, 'W': {3}}, 'B': {'Y': {1}, 'W': {2}}, 'C': {'X': {3}, 'Y': {2}}, 'D': {'X': {1}, 'Y': {1}}}

Testing

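Timing against the accepted answer on a 2000 × 2000 DataFrame (total over 3 runs each) shows the accepted answer to be about 40% slower. Note that lmc_method below is the variant that returns plain values, while vs_method also wraps each value in a set.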

import pandas as pd
import timeit
import json
import random

def create_json(jpath, n, idx_name):
    data = {}
    for i in range(n):
        data[f'i{i}'] = [random.randint(0, n//5) for _ in range(n)]
    data[idx_name] = list(data.keys())
    with open(jpath, 'w') as j:
        json.dump(data, j)
    return idx_name

def lmc_method(df):
    out = { i: df.loc[i][df.loc[i] != 0].to_dict() for i in df.index}    
    return out

def vs_method(df):
    data_dict = df.to_dict(orient='index')
    edge_dictionary = {
    node: {attribute: {weight} for attribute, weight in connections.items() if weight != 0}
    for node, connections in data_dict.items()}
    return edge_dictionary

#d = {'I': ['A', 'B', 'C', 'D'], 'X': [ 1, 0, 3, 1], 'Y': [0, 1, 2, 1], 'Z': [1, 0, 0, 0], 'W': [3, 2, 0, 0]}

n = 2000
jpath = f'/home/lmc/tmp/faker_data_{n}.json'
idx_name = '38cd7657-f731-4ce3-9160-a3fbfc6619dc'

# Edit and uncomment to create test data
#create_json(jpath, n, idx_name)

with open(jpath, 'r') as j:
    data = json.load(j)
    #print(data)
    df = pd.DataFrame(data=data, columns = list(data.keys()))
    df.set_index(idx_name, inplace=True, drop=True)

    t1 = timeit.timeit(lambda: lmc_method(df), setup="pass",number=3)
    print(f"lmc_method: {t1:.2f}")
    
    t2 = timeit.timeit(lambda: vs_method(df), setup="pass",number=3)
    print(f"vs_method : {t2:.2f}, {t2/t1 - 1:.2f}")

Result

lmc_method: 5.22
vs_method : 7.31, 0.40
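For anyone who wants to rerun the comparison without a JSON file on disk, here is a minimal self-contained sketch. It is an assumption-laden variant, not the benchmark above: the data is generated in memory with a fixed seed, the frame is much smaller (n = 200), both functions wrap values in sets so they return the same structure, and the names `make_frame`, `boolean_index_method` and `to_dict_method` are made up for this example.

```python
import random
import timeit

import pandas as pd

def make_frame(n, seed=0):
    # n x n frame of random integers in [0, n // 5]; labels i0..i{n-1}.
    rng = random.Random(seed)
    labels = [f'i{k}' for k in range(n)]
    data = {label: [rng.randint(0, n // 5) for _ in range(n)] for label in labels}
    return pd.DataFrame(data, index=labels)

def boolean_index_method(df):
    # Per-row boolean indexing, with each value wrapped in a set.
    return {i: {k: {v} for k, v in df.loc[i][df.loc[i] != 0].items()}
            for i in df.index}

def to_dict_method(df):
    # Convert to a dict of dicts first, then filter out zeros.
    return {node: {k: {v} for k, v in row.items() if v != 0}
            for node, row in df.to_dict(orient='index').items()}

df = make_frame(200)

# Check that both approaches agree before comparing their speed.
same = boolean_index_method(df) == to_dict_method(df)
print('same result:', same)

for fn in (boolean_index_method, to_dict_method):
    elapsed = timeit.timeit(lambda: fn(df), number=3)
    print(f'{fn.__name__}: {elapsed:.3f}s')
```

Timings depend on the pandas version, the data size and the share of zeros, so the ranking seen here may not match the figures reported above.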

1 Comment

Thank you very much, your solution is very efficient. I really appreciate it.
