Create multiple DataFrames in a loop

Question

I have population data and want to create a separate DataFrame for each state and year. The idea is the following:

for i in province_id:
    for j in year:
        sub_data_i_j = data[(data.provid == i) & (data.wave == j)]

However, I am not sure how to generate sub_data_i_j dynamically.

Ken Wei · Accepted Answer · 2017-10-18 09:13:54Z

2

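This should do it when run at module level (inside a function, writing to locals() is not guaranteed to create real variables):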

for i in province_id:
    for j in year:
        locals()['sub_data_{}_{}'.format(i,j)] = data[(data.provid==i) & (data.wave==j)]

I initially suggested using exec, which is not usually considered best practice for safety reasons. That said, if your code is not exposed to anyone with malicious intentions, it should be OK, and I'll leave it here for the sake of completeness (written as a function call so that it also runs on Python 3, where exec is no longer a statement):

for i in province_id:
    for j in year:
        exec("sub_data_{}_{} = data[(data.provid==i) & (data.wave==j)]".format(i, j))

Nevertheless, for most use cases, it's probably better to use a collection of some sort, e.g. a dictionary, because it will be cumbersome to reference dynamically generated variable names in subsequent parts of your code. It's also a one-liner:

data_dict = {key:g for key,g in data.groupby(['provid','wave'])}
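As a minimal sketch of how such a dictionary might be used afterwards: grouping by two columns yields tuple keys, so each sub-DataFrame is looked up by a (provid, wave) pair. The sample data and the pop column are made up for illustration; only the column names provid and wave come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names provid and wave
# come from the question.
data = pd.DataFrame({
    'provid': [1, 1, 2, 2],
    'wave': [2010, 2011, 2010, 2010],
    'pop': [100, 110, 200, 210],
})

# Grouping by a list of two columns yields (provid, wave) tuples as keys.
data_dict = {key: g for key, g in data.groupby(['provid', 'wave'])}

# Look up one sub-DataFrame by its (provid, wave) pair.
sub = data_dict[(2, 2010)]
print(sub)

# Combinations that never occur in the data are simply absent from the dict.
print((2, 2011) in data_dict)
```

Looping over data_dict.items() then gives every (provid, wave) pair together with its DataFrame, which replaces the need to reconstruct variable names later.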

edited Oct 18, 2017 at 9:13

answered Oct 18, 2017 at 9:00

Ken Wei

1 Comment

Yan Song Over a year ago

I agree. The second method is more pythonic. Thanks!

jezrael · Accepted Answer · 2017-10-18 08:53:32Z

2

I think the best approach is to create a dictionary of DataFrames with groupby, after first filtering by boolean indexing:

import pandas as pd

df = pd.DataFrame({'A':list('abcdef'),
                   'wave':[2004,2005,2004,2005,2005,2004],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'provid':list('aaabbb')})

print (df)
   A  C  D  E provid  wave
0  a  7  1  5      a  2004
1  b  8  3  3      a  2005
2  c  9  5  6      a  2004
3  d  4  7  9      b  2005
4  e  2  1  2      b  2005
5  f  3  0  4      b  2004


province_id = ['a','b']
year = [2004]
df = df[(df.provid.isin(province_id)) &(df.wave.isin(year))]
print (df)
   A  C  D  E provid  wave
0  a  7  1  5      a  2004
2  c  9  5  6      a  2004
5  f  3  0  4      b  2004

dfs = {'{0[0]}_{0[1]}'.format(i) : x for i, x in df.groupby(['provid','wave'])}

Another solution:

dfs = dict(tuple(df.groupby(df['provid'] + '_' + df['wave'].astype(str))))

print (dfs)
{'a_2004':    A  C  D  E provid  wave
0  a  7  1  5      a  2004
2  c  9  5  6      a  2004, 'b_2004':    A  C  D  E provid  wave
5  f  3  0  4      b  2004}

Finally, you can select each DataFrame by its key:

print (dfs['b_2004'])
   A  C  D  E provid  wave
5  f  3  0  4      b  2004

Your own loop could be changed to store the results in a dictionary:

sub_data = {}
province_id = ['a','b']
year = [2004]
for i in province_id:
    for j in year:
        sub_data[i + '_' + str(j)] = df[(df.provid == i) & (df.wave == j)]

print (sub_data)
{'a_2004':    A  C  D  E provid  wave
0  a  7  1  5      a  2004
2  c  9  5  6      a  2004, 'b_2004':    A  C  D  E provid  wave
5  f  3  0  4      b  2004}
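If only a few of the sub-DataFrames are ever needed, a possible alternative is to keep the groupby object and fetch groups on demand with get_group rather than building the whole dictionary. This is a sketch on a trimmed version of the sample data above, not a replacement for the dictionary approach.

```python
import pandas as pd

# Trimmed version of the sample data used in this answer.
df = pd.DataFrame({'A': list('abcdef'),
                   'wave': [2004, 2005, 2004, 2005, 2005, 2004],
                   'provid': list('aaabbb')})

grouped = df.groupby(['provid', 'wave'])

# Fetch a single group on demand instead of materialising every sub-DataFrame.
b_2004 = grouped.get_group(('b', 2004))
print(b_2004)

# List which (provid, wave) combinations actually exist in the data.
print(sorted(grouped.groups))
```

Asking get_group for a combination that does not occur raises a KeyError, so checking against grouped.groups first may be useful.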

edited Oct 18, 2017 at 8:53

answered Oct 18, 2017 at 8:34

jezrael

3 Comments

Anton vBR Over a year ago

And by the time I post you already got a big answer... nice +1

Anton vBR Over a year ago

I mean, by the time I finished my "answer" you already got an answer with examples and "other solutions". You are quick

Yan Song Over a year ago

Thanks for the detailed answer!

Anton vBR · Accepted Answer · 2017-10-18 08:47:45Z

1

My suggestion:

import io
import pandas as pd
from collections import defaultdict

string = u"""province_id,wave,value
1,2014,10
1,2014,10
1,2013,10
2,2010,10
3,2010,10"""

df = pd.read_csv(io.StringIO(string))

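# Nested dictionary: province_id -> wave -> DataFrame
d = defaultdict(dict)

# Split the DataFrame by province_id and wave
dfs = df.groupby(["province_id","wave"])

# Loop through the groups and structure them
# (the loop variable is named group so it does not overwrite df)
for ind, group in dfs:
    d[ind[0]][ind[1]] = group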

The resulting dictionary structure looks like this (shown schematically: the real keys are integers and the values are DataFrames):

{
  "1": {
    "2013": "dataframe: 1 2013", 
    "2014": "dataframe: 1 2014"
  }, 
  "2": {
    "2010": "dataframe: 2 2010"
  }, 
  "3": {
    "2010": "dataframe: 3 2010"
  }
}

You can then access an individual DataFrame with, e.g.:

d[1][2013]
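One thing to watch for with defaultdict: indexing a province that does not exist silently creates an empty entry. A possible way around this, sketched below on the same sample data, is to convert the result to a plain dict once it has been built. The names csv_text and nested are introduced here for illustration only.

```python
import io
from collections import defaultdict

import pandas as pd

csv_text = u"""province_id,wave,value
1,2014,10
1,2014,10
1,2013,10
2,2010,10
3,2010,10"""
df = pd.read_csv(io.StringIO(csv_text))

# Build the nested structure: province_id -> wave -> DataFrame.
d = defaultdict(dict)
for (prov, wave), group in df.groupby(["province_id", "wave"]):
    d[prov][wave] = group

# Freeze into a plain dict so that a missing province is not created on access.
nested = dict(d)

print(len(nested[1][2014]))  # number of rows for province 1, wave 2014
print(nested.get(99))        # None, and no empty entry is added
```

With the plain dict, nested[99] raises a KeyError instead of quietly adding an empty province.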

answered Oct 18, 2017 at 8:47

Anton vBR

1 Comment

Yan Song Over a year ago

Thanks for introducing the defaultdict class.
