I am trying to read files in a loop and append them all into one dataset. The files seem to be read in fine, but the loop is not appending the data to a dataframe. Instead, the result contains only one of the imported datasets (the final_access_hr dataframe).

What is wrong with my loop? Why aren't my looped files being appended? My dataframe access_HR_attestaion has 77 records, but I am expecting 2639 records since I am reading in 3 files.

for file in files_path:
    mainframe_access_HR = pd.read_pickle(file)
    mainframe_access_HR = mainframe_access_HR.astype(str)
    
    if mainframe_access_HR.shape[0]: 
        
        application = mainframe_access_HR['Owner'].unique()[0]
    

        filtered_attestation_data = attestation_data[attestation_data['cleaned_MAL_CODE']==application]


        final_access_hr = pd.DataFrame()
        column_list = pd.DataFrame(['HRACF2']) 
        for column in range(len(column_list)):
            mainframe_access_HR_new = mainframe_access_HR.copy()

            #Drop rows containing NAN for column c_ACF2_ID for new merge
            mainframe_access_HR_new.dropna(subset=[column_list.iloc[column,0]], inplace = True)
        
            #Creating a new column for merge
            mainframe_access_HR_new['ID'] = mainframe_access_HR_new[column_list.iloc[column,0]]
            
            #case folding
            mainframe_access_HR_new['ID'] = mainframe_access_HR_new['ID'].str.strip().str.upper()
        
            #Merge data
            merged_data = pd.merge(filtered_attestation_data, mainframe_access_HR_new, how='right', left_on=['a','b'], right_on =['a','b'])

        
            #Concatenating all data together
            final_access_hr = final_access_hr.append(merged_data)

        #Remove duplicates
        access_HR_attestaion = final_access_hr.drop_duplicates()
  • append is unfortunately deprecated but that just means one should collect the parts into a list and then concat them all at the end. Commented Jun 26, 2022 at 17:28
  • @creanion thanks! I didn't know it was deprecated. Any chance you could help with the concat list part? Commented Jun 26, 2022 at 17:39
  • It's like this pd.concat(list_of_dataframes, axis=0) Commented Jun 26, 2022 at 17:39
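
The pattern described in these comments can be sketched as follows. This is only a minimal illustration: the three small frames and the name `parts` are hypothetical stand-ins for the per-file results, not data from the question.

```python
import pandas as pd

# Hypothetical stand-ins for the dataframes produced for each file
parts = [
    pd.DataFrame({"ID": ["A1", "A2"], "Owner": ["app1", "app1"]}),
    pd.DataFrame({"ID": ["B1"], "Owner": ["app2"]}),
    pd.DataFrame({"ID": ["C1", "C2", "C3"], "Owner": ["app3"] * 3}),
]

# One concat at the end instead of repeated DataFrame.append calls;
# ignore_index=True gives the result a fresh 0..n-1 index
combined = pd.concat(parts, axis=0, ignore_index=True)

print(combined.shape)
```

Collecting the pieces in a plain Python list and concatenating once also avoids copying the growing dataframe on every iteration.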

1 Answer

I think the bug is that you are initializing final_access_hr for every file you read, so it gets reset on each iteration and only the last file's data survives.

Can you move the following line out of the files_path loop:

final_access_hr = pd.DataFrame()

and comment if it solves your problem?
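
As a rough sketch of that restructuring, assuming small in-memory frames in place of the pickled files (the `frames_by_file` dict and the list name `collected` are hypothetical, and the merge against attestation_data is left out for brevity), the accumulator is created once before the outer loop and concatenated once after it:

```python
import pandas as pd

# Hypothetical in-memory stand-ins for pd.read_pickle(file) over files_path
frames_by_file = {
    "file_1.pkl": pd.DataFrame({"HRACF2": [" ab1 ", "cd2"], "Owner": ["app1", "app1"]}),
    "file_2.pkl": pd.DataFrame({"HRACF2": ["ef3"], "Owner": ["app2"]}),
    "file_3.pkl": pd.DataFrame({"HRACF2": ["gh4", "gh4"], "Owner": ["app3", "app3"]}),
}

collected = []  # created once, BEFORE the loop over files, so it is never reset

for file, mainframe_access_HR in frames_by_file.items():
    mainframe_access_HR = mainframe_access_HR.astype(str)
    if mainframe_access_HR.empty:
        continue  # skip files with no rows

    mainframe_access_HR_new = mainframe_access_HR.copy()
    # Same case folding as in the question (the merge step is omitted here)
    mainframe_access_HR_new["ID"] = (
        mainframe_access_HR_new["HRACF2"].str.strip().str.upper()
    )
    collected.append(mainframe_access_HR_new)

# Concatenate and de-duplicate once, AFTER all files have been processed
final_access_hr = pd.concat(collected, ignore_index=True)
access_HR_attestaion = final_access_hr.drop_duplicates()

print(len(final_access_hr), len(access_HR_attestaion))
```

Note that pd.concat raises a ValueError when given an empty list, so if it is possible that no file yields any rows, the list should be checked before concatenating.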

1 Comment

I have put it before the files_path loop, and now I'm getting NameError: name 'final_access_hr' is not defined
