
I am not very experienced with coding, but I am creating a customtkinter-style application where a user can input a specific type of HTML file that contains diagnostic addresses and various information attributed to each address. The script parses the file and returns the selected address/information as a dictionary for further use.

The code works, but the HTML files can range from ~10,000 to ~70,000 lines, and it takes over a minute to read through the larger ones. I know there are inefficiencies in my code, so I am trying to figure out ways to reduce the waiting time while the script runs. I figure my biggest bottlenecks are the repeated creation of dataframes and the nested for loop afterwards. I have considered creating only one dataframe and iterating through it, but I am unsure of the impact it would make.
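Before guessing at bottlenecks, it can help to measure them. Here is a minimal sketch using the standard-library cProfile; `parse_protocol` is a stand-in for the function below, and the sample file is dummy data so the sketch runs on its own:

```python
import cProfile
import io
import pstats
import tempfile

def parse_protocol(file_name):
    # Stand-in for the real parsing function shown below
    with open(file_name, "r") as file:
        return file.read()

# Write a small sample file so the sketch runs standalone
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False) as tmp:
    tmp.write("<p>ECU: 01</p>")
    sample_path = tmp.name

# Profile a single run and capture the slowest calls
profiler = cProfile.Profile()
profiler.enable()
contents = parse_protocol(sample_path)
profiler.disable()

buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time makes the calls that dominate the runtime (e.g. repeated DataFrame creation) show up at the top of the report.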

How can I write this in a way that improves the runtime?

Here is the function:
# Imports used by the function
import io
import re

import pandas as pd
from bs4 import BeautifulSoup

# Clear list of previous selections
fv.info_values_to_add.clear()

# Read user's info selections
filter_info_selections()

# Open and read protocol file
with open(file_name, 'r') as file:
    contents = file.read()

# Global variable to be used in export functions
global Length_Of_Info_1
Length_Of_Info_1 = len(fv.info_values_to_add)

# Create a soup object from the protocol
parsed_protocol = BeautifulSoup(contents, "html.parser")


for address, address_value in fv.protocol_values_1.items():


    # String to be used to find the correct section of the html
    string_address = "ECU: " + address

    try:
        
        # Find the header for the parsed address (re.escape guards against
        # regex metacharacters in the address string)
        table = parsed_protocol.find('p', string=re.compile(re.escape(string_address)))
        
        # Select the correct table for the information
        data_table = table.find_all_next('table')

        # Create a dataframe from the second table in the section
        data_frame = pd.read_html(io.StringIO(str(data_table)))[1]

        # Drop the unused third column
        df_clean = data_frame.drop(columns=2)
        
        # Save selected data to variables to be used
        sw_version = df_clean.iloc[1,1]
        hw_part_number = df_clean.iloc[2,1]
        hw_version = df_clean.iloc[3,1]
        vehicle_vin = df_clean.iloc[20,1]
        fazit_id = df_clean.iloc[21,1]
        coding = df_clean.iloc[7,1]
        vw_part_number = df_clean.iloc[0,1]


        # List to store variables to be added to the fv.protocol_values_1 dictionary
        temp_list = []

        
        # Iterate through the info list and add the selected variables
        for key in fv.info_values_to_add:

            if key == "Software Version":
                temp_list.append(sw_version)
            

            elif key == "Hardware part number":
                temp_list.append(hw_part_number)
            

            elif key == "Hardware Version":
                temp_list.append(hw_version)
            

            elif key == "Fazit ID":
                temp_list.append(fazit_id)
            

            elif key == "VIN Number":
                temp_list.append(vehicle_vin)
            

            elif key == "Coding":
                temp_list.append(coding)
            

            elif key == "VW part number":
                temp_list.append(vw_part_number)


            else:
                pass
            
                
        # Add values to the address in the dictionary
        fv.protocol_values_1[address] = temp_list

    except AttributeError:
        # Header for this address was not found in the protocol; skip it
        continue

1 Answer


If anybody reads this: through some more googling and AI, I found line_profiler to benchmark the function and was able to identify the bottlenecks. My find_all_next call was taking up most of the time, so I reworked the table lookup to use find_next instead. I also changed the parser for the Soup object from html.parser to lxml. I ended up rewriting the code for the dataframe variables to be:

# Extract values once using a dictionary for clarity
value_map = {
    "Software Version": df_clean.iloc[1, 1],
    "Hardware part number": df_clean.iloc[2, 1],
    "Hardware Version": df_clean.iloc[3, 1],
    "VIN Number": df_clean.iloc[20, 1],
    "Fazit ID": df_clean.iloc[21, 1],
    "Coding": df_clean.iloc[7, 1],
    "VW part number": df_clean.iloc[0, 1],
}

# Use list comprehension for speed and readability
temp_list = [value_map[key] for key in fv.info_values_to_add if key in value_map]
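The find_next rework mentioned above can be sketched like this; the HTML string here is dummy data standing in for the real protocol format:

```python
import re
from bs4 import BeautifulSoup

# Dummy stand-in for the real protocol HTML
html = """
<p>ECU: 01</p>
<table><tr><td>VW part number</td><td>5Q0-123-456</td></tr></table>
"""

# The answer swaps "html.parser" for the faster "lxml" parser;
# html.parser is used here so the sketch runs without lxml installed
soup = BeautifulSoup(html, "html.parser")

# find_next stops at the first matching table, instead of collecting
# every table after the header the way find_all_next does
header = soup.find("p", string=re.compile(re.escape("ECU: 01")))
data_table = header.find_next("table")
first_cell = data_table.find("td").get_text()
```

Because find_all_next walks the entire rest of the document for every address, switching to find_next turns each lookup from O(remaining document) into a short local search.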

These changes brought the runtime for the larger HTML files down from over a minute to ~10 seconds.
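For anyone adapting this, the value_map pattern can be exercised standalone with dummy values; the `selections` list stands in for fv.info_values_to_add:

```python
# Dummy values standing in for the DataFrame lookups above
value_map = {
    "Software Version": "0456",
    "Hardware part number": "5Q0-907-530",
    "Hardware Version": "H11",
}

# User selections, in order; unknown keys are silently skipped,
# matching the original if/elif chain's final else: pass
selections = ["Hardware Version", "Software Version", "Unknown key"]

temp_list = [value_map[key] for key in selections if key in value_map]
# temp_list is ["H11", "0456"]
```

One dictionary lookup per key replaces the seven-branch if/elif chain, and keeps the output in the same order the user selected.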
