
I am trying to build a dataframe that combines individual dataframes of county-level high school enrollment projections generated in a for loop.

I can do this for a single county, based on this SO question. It works great. My goal now is to do a nested for loop that would take multiple county FIPS codes, filter the inner loop on that, and generate an 11-row dataframe that would then be appended to a master dataframe. For three counties, for example, the final dataframe would be 33 rows.

But I haven't been able to get it right. I've tried to model on this SO question and answer.

This is my starting dataframe:

import pandas as pd

df = pd.DataFrame({"year": ['2020_21', '2020_21', '2020_21'],
    "county_fips": ['06019', '06021', '06023'],
    "grade11": [5000, 2000, 2000],
    "grade12": [5200, 2200, 2200],
    "grade11_chg": [1.01, 1.02, 1.03],
    "grade11_12_ratio": [0.9, 0.8, 0.87]})
df

This is my code with the nested loops. My intent is to run through the county codes in the outer loop and the projection year calculations in the inner loop.

projection_years=['2021_22','2022_23','2023_24','2024_25','2025_26','2026_27','2027_28','2028_29','2029_30','2030_31']

for i in df['county_fips'].unique():
    print(i)
    grade11_change=df.iloc[0]['grade11_chg']
    grade11_12_ratio=df.iloc[0]['grade11_12_ratio']
    full_name=[]
        
    for year in projection_years:
        #print(year)
        df_select=df[df['county_fips']==i]
        lr = df_select.iloc[-1]
        row = {}
        row['year'] = year
        row['county_fips'] = i
        row = {}
        row['grade11'] = int(lr['grade11'] * grade11_change)
        row['grade12'] = int(lr['grade11'] * grade11_12_ratio)
        df_select = df_select.append([row])
        full_name.append(df_select)

df_final=pd.concat(full_name)
df_final=df_final[['year','county_fips','grade11','grade12']]
   
print('Finished processing')

But I end up with NaN values and repeating years. Below is my desired output. (I built this in Excel, so the numbers reflect rounding. Update: this corrects the original df_final_goal.)

df_final_goal=pd.DataFrame({'year': {0: '2020_21',  1: '2021_22',  2: '2022_23',  3: '2023_24',  4: '2024_25',  5: '2025_26',
  6: '2026_27',  7: '2027_28',  8: '2028_29',  9: '2029_30',  10: '2030_31',  11: '2020_21',  12: '2021_22',  13: '2022_23',
  14: '2023_24',  15: '2024_25',  16: '2025_26',  17: '2026_27',  18: '2027_28',  19: '2028_29',  20: '2029_30',  21: '2030_31',
  22: '2020_21',  23: '2021_22',  24: '2022_23',  25: '2023_24',  26: '2024_25',  27: '2025_26',  28: '2026_27',  29: '2027_28',
  30: '2028_29',  31: '2029_30',  32: '2030_31'},
 'county_fips': {0: '06019',  1: '06019',  2: '06019',  3: '06019',  4: '06019',  5: '06019',  6: '06019',  7: '06019',  8: '06019',
  9: '06019',  10: '06019',  11: '06021',  12: '06021',  13: '06021',  14: '06021',  15: '06021',  16: '06021',  17: '06021',  18: '06021',
  19: '06021',  20: '06021',  21: '06021',  22: '06023',  23: '06023',  24: '06023',  25: '06023',  26: '06023',  27: '06023',
  28: '06023',  29: '06023',  30: '06023',  31: '06023',  32: '06023'},
'grade11': {0: 5000,  1: 5050,  2: 5101,  3: 5152,  4: 5203,  5: 5255,  6: 5308,  7: 5361,  8: 5414,  9: 5468, 10: 5523,
  11: 2000,  12: 2040,  13: 2081,  14: 2122,  15: 2165,  16: 2208,  17: 2252,  18: 2297,  19: 2343,  20: 2390,  21: 2438,
  22: 2000,  23: 2060,  24: 2122,  25: 2185,  26: 2251,  27: 2319,  28: 2388,  29: 2460,  30: 2534,  31: 2610,  32: 2688},
 'grade12': {0: 5200,  1: 4500,  2: 4545,  3: 4590,  4: 4636,  5: 4683,  6: 4730,  7: 4777,  8: 4825,  9: 4873,  10: 4922,
  11: 2200,  12: 1600,  13: 1632,  14: 1665,  15: 1698,  16: 1732,  17: 1767,  18: 1802,  19: 1838,  20: 1875,  21: 1912,
  22: 2200,  23: 1740,  24: 1792,  25: 1846,  26: 1901,  27: 1958,  28: 2017,  29: 2078,  30: 2140,  31: 2204,  32: 2270}})

Thanks for any assistance.

  • can you update the code in order to make it reproducible? the initial (test) dataframe lacks columns county_fips (is it county_code?), grade11_chg or grade11_12_ratio columns. These 3 columns are used in the latter piece of code. Also note that the projection_years values are not all present in test dataframe Commented Jun 10, 2022 at 14:04
  • I see now why the projection_years should not be the dataframe, they represent the rows to be added. Commented Jun 10, 2022 at 14:11
  • My apologies. I had an earlier df version in there. It's updated now. Commented Jun 10, 2022 at 14:12
  • in your df_final_goal you have grade11 equaling values that don't match the output of your for-loop. E.g. 5000 * 1.01 != 6079 and 5200 * 0.9 != 5417. In fact the first row per [(year, county_fips)] group should be equal to your original df but they aren't. Is something off with the final? Or the original? Commented Jun 10, 2022 at 17:35
  • You are correct @IanThompson . I botched doing the df_final_goal numbers in Excel. I updated the post. Thank you. Commented Jun 10, 2022 at 20:32

2 Answers

1

Creating a helper function for calculating grade11 helps make this a bit easier.

import pandas as pd


def expand_grade11(
    grade11: int,
    grade11_chg: float,
    len_projection_years: int
) -> list:
    """
    Calculate `grade11` values based on current
    `grade11`, `grade11_chg`, and number of
    `projection_years`.
    """

    list_of_vals = []
    while len(list_of_vals) < len_projection_years:
        grade11 = int(grade11 * grade11_chg)
        list_of_vals.append(grade11)

    return list_of_vals


# initial info
df = pd.DataFrame({
    "year": ['2020_21', '2020_21','2020_21'],
    "county_fips": ['06019','06021','06023'] , 
    "grade11": [5000,2000,2000],
    "grade12": [5200,2200,2200],
    "grade11_chg": [1.01,1.02,1.03],
    "grade11_12_ratio": [0.9,0.8,0.87]
})
projection_years = ['2021_22','2022_23','2023_24','2024_25','2025_26','2026_27','2027_28','2028_29','2029_30','2030_31']

# converting to pd.MultiIndex
prods_index = pd.MultiIndex.from_product((df.county_fips.unique(), projection_years), names=["county_fips", "year"])

# setting index for future grouping/joining
df.set_index(["county_fips", "year"], inplace=True)

# calculate grade11
final = df.groupby([
    "county_fips",
    "year",
]).apply(lambda x: expand_grade11(x.grade11, x.grade11_chg, len(projection_years)))
final = final.explode()
final.index = prods_index
final = final.to_frame("grade11")

# concat with original df to get other columns
final = pd.concat([
    df, final
])
final.sort_index(level=["county_fips", "year"], inplace=True)
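# assign the result back; chained inplace calls may not modify `final` in newer pandas
final["grade11_12_ratio"] = final["grade11_12_ratio"].ffill()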

# calculate grade12
grade12 = final.groupby([
    "county_fips"
]).apply(lambda x: x["grade11"] * x["grade11_12_ratio"])
grade12 = grade12.groupby("county_fips").shift(1)
grade12 = grade12.droplevel(0)

# put it all together
final["grade12"] = final["grade12"].fillna(grade12)
final = final[["grade11", "grade12"]]
final = final.astype(int)
final.reset_index(inplace=True)

3 Comments

It's possible that I'll be adding other grades to my script. For this approach, am I correct in thinking that would involve defining functions and adding #calculate grade code blocks for the additional grades? Thanks.
That depends on their dependencies. I updated my script to have only one function (for grade11) because it is the only value that depends on itself. grade12 depends on grade11 and grade11_12_ratio, so its transformation can be done directly.
I'm going to work through this solution. I think it will help me become more experienced with functions.
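
To illustrate the dependency point from the comments above, here is a minimal sketch for a single county. grade11 is built step by step because each value depends on the previous one, while grade12 is derived directly from the prior year's grade11 with a shift. The function name `project_county` and its parameters are hypothetical and not part of the answer's code.

```python
import pandas as pd


def project_county(grade11, grade12, grade11_chg, ratio, years):
    """Hypothetical helper: project grade11 and grade12 for one county."""
    # grade11 depends on its own previous value, so build it iteratively,
    # truncating with int() at each step as the loop-based code does.
    g11 = [grade11]
    for _ in years[1:]:
        g11.append(int(g11[-1] * grade11_chg))

    out = pd.DataFrame({"year": years, "grade11": g11})
    # grade12 depends only on the prior year's grade11 and the ratio,
    # so it can be computed directly; the base year keeps its known value.
    g12 = (out["grade11"].shift(1) * ratio).fillna(grade12)
    out["grade12"] = g12.astype(int)
    return out


years = ["2020_21", "2021_22", "2022_23", "2023_24"]
result = project_county(5000, 5200, 1.01, 0.9, years)
print(result)
```

Under this pattern, another grade would probably need its own iterative step only if it depends on its own previous value; otherwise a shifted multiplication like the one for grade12 should be enough.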
0

There are some bugs in the code. The version below seems to produce the result you expect (the final dataframe is currently not consistent with the initial one):

projection_years = ['2021_22','2022_23','2023_24','2024_25','2025_26','2026_27','2027_28','2028_29','2029_30','2030_31']

full_name = []
for i in df['county_fips'].unique():
    print(i)
    df_select = df[df['county_fips']==i]
    grade11_change = df_select.iloc[0]['grade11_chg']
    grade11_12_ratio = df_select.iloc[0]['grade11_12_ratio']
        
    for year in projection_years:
        #print(year)
        lr = df_select.iloc[-1]
        row = {}
        row['year'] = year
        row['county_fips'] = i
        row['grade11'] = int(lr['grade11'] * grade11_change)
        row['grade12'] = int(lr['grade11'] * grade11_12_ratio)
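        # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result here
        df_select = pd.concat([df_select, pd.DataFrame([row])])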
    full_name.append(df_select)

df_final = pd.concat(full_name)
df_final = df_final[['year','county_fips','grade11','grade12']].reset_index()
   
print('Finished processing')

fixes:

  • full_name initialized before the outer loop
  • do not redefine df_select in the inner loop
  • row was initialized twice inside the inner loop
  • full_name.append moved outside of the inner loop and after it
  • added reset_index() to df_final (mostly cosmetic)
  • (edit) the grade change variables (grade11_change and grade11_12_ratio) are now read from the current county's row in df_select (and not from the first row of df)

the final result (print(df_final.to_markdown())) with the above code is:

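|    | index | year    | county_fips | grade11 | grade12 |
|---:|------:|:--------|:------------|--------:|--------:|
|  0 |     0 | 2020_21 | 06019       |    5000 |    5200 |
|  1 |     0 | 2021_22 | 06019       |    5050 |    4500 |
|  2 |     0 | 2022_23 | 06019       |    5100 |    4545 |
|  3 |     0 | 2023_24 | 06019       |    5151 |    4590 |
|  4 |     0 | 2024_25 | 06019       |    5202 |    4635 |
|  5 |     0 | 2025_26 | 06019       |    5254 |    4681 |
|  6 |     0 | 2026_27 | 06019       |    5306 |    4728 |
|  7 |     0 | 2027_28 | 06019       |    5359 |    4775 |
|  8 |     0 | 2028_29 | 06019       |    5412 |    4823 |
|  9 |     0 | 2029_30 | 06019       |    5466 |    4870 |
| 10 |     0 | 2030_31 | 06019       |    5520 |    4919 |
| 11 |     1 | 2020_21 | 06021       |    2000 |    2200 |
| 12 |     0 | 2021_22 | 06021       |    2040 |    1600 |
| 13 |     0 | 2022_23 | 06021       |    2080 |    1632 |
| 14 |     0 | 2023_24 | 06021       |    2121 |    1664 |
| 15 |     0 | 2024_25 | 06021       |    2163 |    1696 |
| 16 |     0 | 2025_26 | 06021       |    2206 |    1730 |
| 17 |     0 | 2026_27 | 06021       |    2250 |    1764 |
| 18 |     0 | 2027_28 | 06021       |    2295 |    1800 |
| 19 |     0 | 2028_29 | 06021       |    2340 |    1836 |
| 20 |     0 | 2029_30 | 06021       |    2386 |    1872 |
| 21 |     0 | 2030_31 | 06021       |    2433 |    1908 |
| 22 |     2 | 2020_21 | 06023       |    2000 |    2200 |
| 23 |     0 | 2021_22 | 06023       |    2060 |    1740 |
| 24 |     0 | 2022_23 | 06023       |    2121 |    1792 |
| 25 |     0 | 2023_24 | 06023       |    2184 |    1845 |
| 26 |     0 | 2024_25 | 06023       |    2249 |    1900 |
| 27 |     0 | 2025_26 | 06023       |    2316 |    1956 |
| 28 |     0 | 2026_27 | 06023       |    2385 |    2014 |
| 29 |     0 | 2027_28 | 06023       |    2456 |    2074 |
| 30 |     0 | 2028_29 | 06023       |    2529 |    2136 |
| 31 |     0 | 2029_30 | 06023       |    2604 |    2200 |
| 32 |     0 | 2030_31 | 06023       |    2682 |    2265 |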

note: edited to address the comments
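
As a variation on the same fixes, here is a sketch that collects plain dict rows and builds the dataframe once at the end, so it does not grow a dataframe inside the loop. It assumes the same starting df as in the question; the names `rows`, `base` and `prev_grade11` are illustrative and not from the original code.

```python
import pandas as pd

df = pd.DataFrame({
    "year": ["2020_21"] * 3,
    "county_fips": ["06019", "06021", "06023"],
    "grade11": [5000, 2000, 2000],
    "grade12": [5200, 2200, 2200],
    "grade11_chg": [1.01, 1.02, 1.03],
    "grade11_12_ratio": [0.9, 0.8, 0.87],
})
# Builds '2021_22' through '2030_31'.
projection_years = [f"20{y}_{y + 1}" for y in range(21, 31)]

rows = []
for base in df.itertuples(index=False):
    # Each county starts from its own base-year row and uses its own rates.
    rows.append({"year": base.year, "county_fips": base.county_fips,
                 "grade11": base.grade11, "grade12": base.grade12})
    prev_grade11 = base.grade11
    for year in projection_years:
        # Both projections are driven by the previous year's grade11.
        grade11 = int(prev_grade11 * base.grade11_chg)
        grade12 = int(prev_grade11 * base.grade11_12_ratio)
        rows.append({"year": year, "county_fips": base.county_fips,
                     "grade11": grade11, "grade12": grade12})
        prev_grade11 = grade11

df_final = pd.DataFrame(rows)
print(df_final)
```

This gives 33 rows (11 per county) with a clean 0 to 32 index, so no reset_index() is needed.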

5 Comments

For the first county_fips - 06019 - the above code gives values that closely match what I get doing the math in Excel. Numbers for the other two counties are not what I expect. For county_fips 06021 in year 2030-31, for example, I get 2438 for grade11 and 1912 for grade12 in Excel. The above code returns 2205 and 1965.
@JamesMiller I get 2433 for grade 11, year 2030-31, county 06021. If you do all the calculations without converting to int, you get 2437.9888... Maybe Excel is rounding at the end instead of throughout the series?
Right, thanks @IanThompson . Both 2433 or ~2437 are quite a bit different from 2205. But I think I see what might be going on -- it appears that the grade11_change and grade11_12_ratio variables are not updating in the outer for loop. In df, each county has different values for those variables. But it seems the outer loop is taking the values for the first county (06019), and using those to generate all three counties' projections in the inner loop. My goal is for grade11_change and grade11_12_ratio variables to update on each loop.
@JamesMiller That can be solved by moving the grade11_change and grade11_12_ratio assignments into the second loop. But as a word of advice, I'd avoid using for-loops as much as possible when working with pandas. It has a lot of features that help get away from looping, and they tend to be a lot faster due to vectorization.
right, there was another issue in the code, see the edited code with a new fix. Now I am getting results consistent with what is reported by @IanThompson.
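
The rounding difference discussed in the comments above can be checked with a short sketch for county 06021's grade11 (start 2000, change 1.02, ten projection years). The variable names here are illustrative only.

```python
start, rate, n_years = 2000, 1.02, 10

# Truncate to an integer after every year, as the loop code does with int().
truncated = start
for _ in range(n_years):
    truncated = int(truncated * rate)

# Compound in floating point and only look at the result at the end.
compounded = start * rate ** n_years

print(truncated)             # 2433
print(round(compounded, 4))  # 2437.9888
```

Rounding the compounded value at the end gives 2438, which matches the Excel figure quoted in the comments, while per-step truncation gives 2433.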
