Nested for loop filtering inner loop based on outer loop and appending dataframe

Question

I am trying to build a dataframe that combines individual dataframes of county-level high school enrollment projections generated in a for loop.

I can do this for a single county, based on this SO question. It works great. My goal now is to do a nested for loop that would take multiple county FIPS codes, filter the inner loop on that, and generate an 11-row dataframe that would then be appended to a master dataframe. For three counties, for example, the final dataframe would be 33 rows.

But I haven't been able to get it right. I've tried to model on this SO question and answer.

This is my starting dataframe:

df = pd.DataFrame({"year": ['2020_21', '2020_21','2020_21'],
    "county_fips" : ['06019','06021','06023'] , 
    "grade11" : [5000,2000,2000],
    "grade12": [5200,2200,2200],
    "grade11_chg": [1.01,1.02,1.03],
    "grade11_12_ratio": [0.9,0.8,0.87]})
df

This is my code with the nested loops. My intent is to run through the county codes in the outer loop and the projection year calculations in the inner loop.

projection_years=['2021_22','2022_23','2023_24','2024_25','2025_26','2026_27','2027_28','2028_29','2029_30','2030_31']

for i in df['county_fips'].unique():
    print(i)
    grade11_change=df.iloc[0]['grade11_chg']
    grade11_12_ratio=df.iloc[0]['grade11_12_ratio']
    full_name=[]
        
    for year in projection_years:
        #print(year)
        df_select=df[df['county_fips']==i]
        lr = df_select.iloc[-1]
        row = {}
        row['year'] = year
        row['county_fips'] = i
        row = {}
        row['grade11'] = int(lr['grade11'] * grade11_change)
        row['grade12'] = int(lr['grade11'] * grade11_12_ratio)
        df_select = df_select.append([row])
        full_name.append(df_select)

df_final=pd.concat(full_name)
df_final=df_final[['year','county_fips','grade11','grade12']]
   
print('Finished processing')

But I end up with NaN values and repeating years. Below is my desired output. (I built this in Excel, and the numbers reflect rounding.) (Update: this corrects the original df_final_goal.)

df_final_goal=pd.DataFrame({'year': {0: '2020_21',  1: '2021_22',  2: '2022_23',  3: '2023_24',  4: '2024_25',  5: '2025_26',
  6: '2026_27',  7: '2027_28',  8: '2028_29',  9: '2029_30',  10: '2030_31',  11: '2020_21',  12: '2021_22',  13: '2022_23',
  14: '2023_24',  15: '2024_25',  16: '2025_26',  17: '2026_27',  18: '2027_28',  19: '2028_29',  20: '2029_30',  21: '2030_31',
  22: '2020_21',  23: '2021_22',  24: '2022_23',  25: '2023_24',  26: '2024_25',  27: '2025_26',  28: '2026_27',  29: '2027_28',
  30: '2028_29',  31: '2029_30',  32: '2030_31'},
 'county_fips': {0: '06019',  1: '06019',  2: '06019',  3: '06019',  4: '06019',  5: '06019',  6: '06019',  7: '06019',  8: '06019',
  9: '06019',  10: '06019',  11: '06021',  12: '06021',  13: '06021',  14: '06021',  15: '06021',  16: '06021',  17: '06021',  18: '06021',
  19: '06021',  20: '06021',  21: '06021',  22: '06023',  23: '06023',  24: '06023',  25: '06023',  26: '06023',  27: '06023',
  28: '06023',  29: '06023',  30: '06023',  31: '06023',  32: '06023'},
'grade11': {0: 5000,  1: 5050,  2: 5101,  3: 5152,  4: 5203,  5: 5255,  6: 5308,  7: 5361,  8: 5414,  9: 5468, 10: 5523,
  11: 2000,  12: 2040,  13: 2081,  14: 2122,  15: 2165,  16: 2208,  17: 2252,  18: 2297,  19: 2343,  20: 2390,  21: 2438,
  22: 2000,  23: 2060,  24: 2122,  25: 2185,  26: 2251,  27: 2319,  28: 2388,  29: 2460,  30: 2534,  31: 2610,  32: 2688},
 'grade12': {0: 5200,  1: 4500,  2: 4545,  3: 4590,  4: 4636,  5: 4683,  6: 4730,  7: 4777,  8: 4825,  9: 4873,  10: 4922,
  11: 2200,  12: 1600,  13: 1632,  14: 1665,  15: 1698,  16: 1732,  17: 1767,  18: 1802,  19: 1838,  20: 1875,  21: 1912,
  22: 2200,  23: 1740,  24: 1792,  25: 1846,  26: 1901,  27: 1958,  28: 2017,  29: 2078,  30: 2140,  31: 2204,  32: 2270}})

Thanks for any assistance.

Can you update the code to make it reproducible? The initial (test) dataframe lacks the columns county_fips (is it county_code?), grade11_chg, and grade11_12_ratio. These three columns are used in the later piece of code. Also note that the projection_years values are not all present in the test dataframe.
– pietroppeter, Commented Jun 10, 2022 at 14:04
I see now why the projection_years should not be in the dataframe: they represent the rows to be added.
– pietroppeter, Commented Jun 10, 2022 at 14:11
My apologies. I had an earlier df version in there. It's updated now.
– JamesMiller, Commented Jun 10, 2022 at 14:12
In your df_final_goal, grade11 has values that don't match the output of your for-loop. E.g. 5000 * 1.01 != 6079 and 5200 * 0.9 != 5417. In fact, the first row per [(year, county_fips)] group should equal your original df, but it doesn't. Is something off with the final? Or the original?
– Ian Thompson, Commented Jun 10, 2022 at 17:35
You are correct, @IanThompson. I botched the df_final_goal numbers in Excel. I updated the post. Thank you.
– JamesMiller, Commented Jun 10, 2022 at 20:32

Ian Thompson · Accepted Answer · 2022-06-11 15:38:19Z

1

Creating a helper function for calculating grade11 helps make this a bit easier.

import pandas as pd


def expand_grade11(
    grade11: int,
    grade11_chg: float,
    len_projection_years: int
) -> list:
    """
    Calculate `grade11` values based on current
    `grade11`, `grade11_chg`, and number of
    `projection_years`.
    """

    list_of_vals = []
    while len(list_of_vals) < len_projection_years:
        grade11 = int(grade11 * grade11_chg)
        list_of_vals.append(grade11)

    return list_of_vals


# initial info
df = pd.DataFrame({
    "year": ['2020_21', '2020_21','2020_21'],
    "county_fips": ['06019','06021','06023'] , 
    "grade11": [5000,2000,2000],
    "grade12": [5200,2200,2200],
    "grade11_chg": [1.01,1.02,1.03],
    "grade11_12_ratio": [0.9,0.8,0.87]
})
projection_years = ['2021_22','2022_23','2023_24','2024_25','2025_26','2026_27','2027_28','2028_29','2029_30','2030_31']

# converting to pd.MultiIndex
prods_index = pd.MultiIndex.from_product((df.county_fips.unique(), projection_years), names=["county_fips", "year"])

# setting index for future grouping/joining
df.set_index(["county_fips", "year"], inplace=True)

# calculate grade11
final = df.groupby([
    "county_fips",
    "year",
]).apply(lambda x: expand_grade11(x.grade11, x.grade11_chg, len(projection_years)))
final = final.explode()
final.index = prods_index
final = final.to_frame("grade11")

# concat with original df to get other columns
final = pd.concat([
    df, final
])
final.sort_index(level=["county_fips", "year"], inplace=True)
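# assign the result back; a chained inplace call may not update `final` in newer pandas
final["grade11_12_ratio"] = final["grade11_12_ratio"].ffill()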

# calculate grade12
grade12 = final.groupby([
    "county_fips"
]).apply(lambda x: x["grade11"] * x["grade11_12_ratio"])
grade12 = grade12.groupby("county_fips").shift(1)
grade12 = grade12.droplevel(0)

# put it all together
# assign the result back; a chained inplace call may not update `final` in newer pandas
final["grade12"] = final["grade12"].fillna(grade12)
final = final[["grade11", "grade12"]]
final = final.astype(int)
final.reset_index(inplace=True)

edited Jun 11, 2022 at 15:38

answered Jun 10, 2022 at 21:08

Ian Thompson


3 Comments

JamesMiller Over a year ago

It's possible that I'll be adding other grades to my script. For this approach, am I correct in thinking that would involve defining functions and adding #calculate grade code blocks for the additional grades? Thanks.

Ian Thompson Over a year ago

That depends on their dependencies. I updated my script to have only one function (for grade11) because it is the only value dependent on itself. grade12 depends on grade11 and grade11_12_ratio, so its transformation can be done directly.

JamesMiller Over a year ago

I'm going to work through this solution. I think it will help me become more experienced with functions.
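
The distinction drawn in the comments above (a series that depends on its own previous value versus one derived from another column) can be sketched with two small helpers. This is an illustrative sketch, not part of the answer: `project_self_dependent` and `derive_from_previous` are hypothetical names, and the truncation with `int()` is assumed to mirror the answer's `expand_grade11`. The numbers are those of county 06019.

```python
def project_self_dependent(start: int, change: float, n_years: int) -> list:
    """Project a series where each value depends on the previous one (like grade11)."""
    values = []
    current = start
    for _ in range(n_years):
        current = int(current * change)  # truncate every year, as in the answer
        values.append(current)
    return values


def derive_from_previous(source: list, ratio: float) -> list:
    """Derive a series from last year's value of another series (like grade12)."""
    # the final source value has no following year, so it is not used
    return [int(value * ratio) for value in source[:-1]]


# base year followed by 10 projected years
grade11 = [5000] + project_self_dependent(5000, 1.01, 10)
grade12 = [5200] + derive_from_previous(grade11, 0.9)
```

Under these assumptions, an added grade that grows from its own prior value would need another call to the first helper, while a grade computed from a different grade's prior year would reuse the second.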

pietroppeter · Accepted Answer · 2022-06-13 06:44:39Z

0

There are some bugs in the code. The code below seems to produce the result you expect (the final dataframe is currently not consistent with the initial one):

projection_years = ['2021_22','2022_23','2023_24','2024_25','2025_26','2026_27','2027_28','2028_29','2029_30','2030_31']

full_name = []
for i in df['county_fips'].unique():
    print(i)
    df_select = df[df['county_fips']==i]
    grade11_change = df_select.iloc[0]['grade11_chg']
    grade11_12_ratio = df_select.iloc[0]['grade11_12_ratio']
        
    for year in projection_years:
        #print(year)
        lr = df_select.iloc[-1]
        row = {}
        row['year'] = year
        row['county_fips'] = i
        row['grade11'] = int(lr['grade11'] * grade11_change)
        row['grade12'] = int(lr['grade11'] * grade11_12_ratio)
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        df_select = pd.concat([df_select, pd.DataFrame([row])])
    full_name.append(df_select)

df_final = pd.concat(full_name)
df_final = df_final[['year','county_fips','grade11','grade12']].reset_index()
   
print('Finished processing')

fixes:

full_name initialized before the outer loop
do not redefine df_select in the inner loop
row was initialized twice inside the inner loop
full_name.append moved outside of the inner loop and after it
added reset_index() to df_final (mostly cosmetic)
(edit) grade change variables (grade11_change and grade11_12_ratio) are now computed from the first row of df_select (the current county), not from df
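
The fixes listed above can be combined with one further change: collecting plain dict rows and building the dataframe once at the end, instead of growing a dataframe inside the loop. The following is a minimal sketch rather than the answer's code. The names `rows`, `base`, `g11`, and `g12` are illustrative, it assumes one starting row per county, and it keeps the same `int()` truncation as the loop above.

```python
import pandas as pd

df = pd.DataFrame({
    "year": ["2020_21", "2020_21", "2020_21"],
    "county_fips": ["06019", "06021", "06023"],
    "grade11": [5000, 2000, 2000],
    "grade12": [5200, 2200, 2200],
    "grade11_chg": [1.01, 1.02, 1.03],
    "grade11_12_ratio": [0.9, 0.8, 0.87],
})
projection_years = ["2021_22", "2022_23", "2023_24", "2024_25", "2025_26",
                    "2026_27", "2027_28", "2028_29", "2029_30", "2030_31"]

rows = []
for base in df.itertuples(index=False):
    # keep the observed base-year row for this county
    rows.append({"year": base.year, "county_fips": base.county_fips,
                 "grade11": base.grade11, "grade12": base.grade12})
    g11 = base.grade11
    for year in projection_years:
        # grade12 uses the previous year's grade11, so compute it before updating g11
        g12 = int(g11 * base.grade11_12_ratio)
        g11 = int(g11 * base.grade11_chg)
        rows.append({"year": year, "county_fips": base.county_fips,
                     "grade11": g11, "grade12": g12})

# build the dataframe once, after all rows are collected
df_final = pd.DataFrame(rows)
```

Because each county's rates are read from its own `base` row, the rates cannot leak from one county to the next.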

the final result (print(df_final.to_markdown())) with the above code is:

	index	year	county_fips	grade11	grade12
0	0	2020_21	06019	5000	5200
1	0	2021_22	06019	5050	4500
2	0	2022_23	06019	5100	4545
3	0	2023_24	06019	5151	4590
4	0	2024_25	06019	5202	4635
5	0	2025_26	06019	5254	4681
6	0	2026_27	06019	5306	4728
7	0	2027_28	06019	5359	4775
8	0	2028_29	06019	5412	4823
9	0	2029_30	06019	5466	4870
10	0	2030_31	06019	5520	4919
11	1	2020_21	06021	2000	2200
12	0	2021_22	06021	2040	1600
13	0	2022_23	06021	2080	1632
14	0	2023_24	06021	2121	1664
15	0	2024_25	06021	2163	1696
16	0	2025_26	06021	2206	1730
17	0	2026_27	06021	2250	1764
18	0	2027_28	06021	2295	1800
19	0	2028_29	06021	2340	1836
20	0	2029_30	06021	2386	1872
21	0	2030_31	06021	2433	1908
22	2	2020_21	06023	2000	2200
23	0	2021_22	06023	2060	1740
24	0	2022_23	06023	2121	1792
25	0	2023_24	06023	2184	1845
26	0	2024_25	06023	2249	1900
27	0	2025_26	06023	2316	1956
28	0	2026_27	06023	2385	2014
29	0	2027_28	06023	2456	2074
30	0	2028_29	06023	2529	2136
31	0	2029_30	06023	2604	2200
32	0	2030_31	06023	2682	2265

note: edited to address the comments

edited Jun 13, 2022 at 6:44

answered Jun 10, 2022 at 14:57

pietroppeter


5 Comments

JamesMiller Over a year ago

For the first county_fips (06019), the above code gives values that closely match what I get doing the math in Excel. Numbers for the other two counties are not what I expect. For county_fips 06021 in year 2030-31, for example, I get 2438 for grade11 and 1912 for grade12 in Excel. The above code returns 2205 and 1965.

Ian Thompson Over a year ago

@JamesMiller I get 2433 for grade 11, year 2030-31, county 06021. If you do all the calculations without converting to int, you get 2437.9888... Maybe Excel is rounding at the end instead of throughout the series?
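
The gap between 2433 and 2438 described above can be reproduced with a few lines of plain Python. This is a small sketch for illustration only: the function names are hypothetical, and "rounding at the end" is an assumption about what the Excel sheet does.

```python
def truncate_each_year(start: int, change: float, n_years: int) -> int:
    """Apply int() after every year, as the for-loop code does."""
    value = start
    for _ in range(n_years):
        value = int(value * change)
    return value


def round_at_end(start: int, change: float, n_years: int) -> int:
    """Compound without truncation and round only the final value."""
    return round(start * change ** n_years)


# county 06021: grade11 = 2000, grade11_chg = 1.02, 10 projection years
truncated = truncate_each_year(2000, 1.02, 10)
rounded = round_at_end(2000, 1.02, 10)
print(truncated, rounded)  # prints: 2433 2438
```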

JamesMiller Over a year ago

Right, thanks @IanThompson . Both 2433 or ~2437 are quite a bit different from 2205. But I think I see what might be going on -- it appears that the grade11_change and grade11_12_ratio variables are not updating in the outer for loop. In df, each county has different values for those variables. But it seems the outer loop is taking the values for the first county (06019), and using those to generate all three counties' projections in the inner loop. My goal is for grade11_change and grade11_12_ratio variables to update on each loop.

Ian Thompson Over a year ago

@JamesMiller That can be solved by moving the grade11_change and grade11_12_ratio assignments into the second loop. But as a word of advice, I'd avoid using for-loops as much as possible when working with pandas. It has a lot of features that help get away from looping, and they tend to be a lot faster due to vectorization.

pietroppeter Over a year ago

Right, there was another issue in the code; see the edited code with a new fix. Now I am getting results consistent with what is reported by @IanThompson.
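
The advice above about avoiding for-loops can be sketched with NumPy broadcasting. Note that this swaps in a different technique from either answer: it compounds without truncating each year and rounds only at the end, so it is expected to track the Excel-style numbers (for example 2438 rather than 2433) instead of the loop output. The names `all_years`, `steps`, `g11`, `g12`, and `df_vec` are illustrative, and values that land exactly on .5 may differ from Excel's rounding by one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "year": ["2020_21", "2020_21", "2020_21"],
    "county_fips": ["06019", "06021", "06023"],
    "grade11": [5000, 2000, 2000],
    "grade12": [5200, 2200, 2200],
    "grade11_chg": [1.01, 1.02, 1.03],
    "grade11_12_ratio": [0.9, 0.8, 0.87],
})
projection_years = ["2021_22", "2022_23", "2023_24", "2024_25", "2025_26",
                    "2026_27", "2027_28", "2028_29", "2029_30", "2030_31"]
all_years = [df["year"].iloc[0]] + projection_years  # base year + 10 projections

# exponents 0..10; broadcasting yields a (counties x years) grid of grade11 values
steps = np.arange(len(all_years))
chg = df["grade11_chg"].to_numpy()[:, None]
g11 = df["grade11"].to_numpy()[:, None] * chg ** steps

# grade12 in a projected year is the previous year's grade11 times the ratio;
# the base year keeps the observed grade12
g12 = np.empty_like(g11)
g12[:, 0] = df["grade12"].to_numpy()
g12[:, 1:] = g11[:, :-1] * df["grade11_12_ratio"].to_numpy()[:, None]

# flatten row by row so each county's 11 years stay together
df_vec = pd.DataFrame({
    "year": np.tile(all_years, len(df)),
    "county_fips": np.repeat(df["county_fips"].to_numpy(), len(all_years)),
    "grade11": np.round(g11).astype(int).ravel(),
    "grade12": np.round(g12).astype(int).ravel(),
})
```

If the year-by-year truncation of the loop versions is required, this closed form does not apply, because each truncated value feeds into the next year's calculation.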
