
I am not very experienced with coding, but I am creating a customtkinter-style application where a user can input a specific type of HTML file that contains diagnostic addresses and various information attributed to each address. The script parses the file and returns the selected address/information as a dictionary for further use.

The code works, but the HTML files can range from ~10,000 to ~70,000 lines, and it takes over a minute to read through the larger ones. I know there are inefficiencies in my code, so I am trying to figure out ways to reduce the waiting time while the script runs. I figure my biggest bottlenecks are the repeated creation of dataframes and the nested for loop afterwards. I have considered creating only one dataframe and iterating through it, but I am unsure of the impact it would make.
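Before guessing at bottlenecks, it can help to measure them. Here is a minimal sketch using the standard-library cProfile; `parse_protocol` is a stand-in for the function below, and the sample file is dummy data so the sketch runs on its own:

```python
import cProfile
import io
import pstats
import tempfile

def parse_protocol(file_name):
    # Stand-in for the real parsing function shown below
    with open(file_name, "r") as file:
        return file.read()

# Write a small sample file so the sketch runs standalone
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False) as tmp:
    tmp.write("<p>ECU: 01</p>")
    sample_path = tmp.name

# Profile a single run and capture the slowest calls
profiler = cProfile.Profile()
profiler.enable()
contents = parse_protocol(sample_path)
profiler.disable()

buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time makes the calls that dominate the runtime (e.g. repeated DataFrame creation) show up at the top of the report.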

How can I write this in a way that improves the runtime?

Here is the function:
# Imports used by the function
import io
import re

import pandas as pd
from bs4 import BeautifulSoup

# Clear list of previous selections
fv.info_values_to_add.clear()

# Read user's info selections
filter_info_selections()

# Open and read protocol file
with open(file_name, 'r') as file:
    contents = file.read()

# Global variable to be used in export functions
global Length_Of_Info_1
Length_Of_Info_1 = len(fv.info_values_to_add)

# Create a soup object from the protocol
parsed_protocol = BeautifulSoup(contents, "html.parser")


for address, address_value in fv.protocol_values_1.items():


    # String to be used to find the correct section of the html
    string_address = "ECU: " + address

    try:
        
        # Find the header for the parsed address (re.escape guards against
        # regex metacharacters in the address string)
        table = parsed_protocol.find('p', string=re.compile(re.escape(string_address)))
        
        # Select the correct table for the information
        data_table = table.find_all_next('table')

        # Create a dataframe from the second table in the section
        data_frame = pd.read_html(io.StringIO(str(data_table)))[1]

        # Drop the unused third column
        df_clean = data_frame.drop(columns=2)
        
        # Save selected data to variables to be used
        sw_version = df_clean.iloc[1,1]
        hw_part_number = df_clean.iloc[2,1]
        hw_version = df_clean.iloc[3,1]
        vehicle_vin = df_clean.iloc[20,1]
        fazit_id = df_clean.iloc[21,1]
        coding = df_clean.iloc[7,1]
        vw_part_number = df_clean.iloc[0,1]


        # List to store variables to be added to the fv.protocol_values_1 dictionary
        temp_list = []

        
        # Iterate through the info list and add the selected variables
        for key in fv.info_values_to_add:

            if key == "Software Version":
                temp_list.append(sw_version)
            

            elif key == "Hardware part number":
                temp_list.append(hw_part_number)
            

            elif key == "Hardware Version":
                temp_list.append(hw_version)
            

            elif key == "Fazit ID":
                temp_list.append(fazit_id)
            

            elif key == "VIN Number":
                temp_list.append(vehicle_vin)
            

            elif key == "Coding":
                temp_list.append(coding)
            

            elif key == "VW part number":
                temp_list.append(vw_part_number)


            else:
                pass
            
                
        # Add values to the address in the dictionary
        fv.protocol_values_1[address] = temp_list

    except AttributeError:
        # Header for this address was not found in the protocol; skip it
        continue

1 Answer


If anybody reads this: through some more googling and AI, I found line_profiler to benchmark the function and was able to identify the bottlenecks. My find_all_next call was taking up most of the time, so I reworked the table lookup to use find_next instead. I also changed the parser for the Soup object from html.parser to lxml. I ended up rewriting the code for the dataframe variables to be:

# Extract values once using a dictionary for clarity
value_map = {
    "Software Version": df_clean.iloc[1, 1],
    "Hardware part number": df_clean.iloc[2, 1],
    "Hardware Version": df_clean.iloc[3, 1],
    "VIN Number": df_clean.iloc[20, 1],
    "Fazit ID": df_clean.iloc[21, 1],
    "Coding": df_clean.iloc[7, 1],
    "VW part number": df_clean.iloc[0, 1],
}

# Use list comprehension for speed and readability
temp_list = [value_map[key] for key in fv.info_values_to_add if key in value_map]
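The find_next rework mentioned above can be sketched like this; the HTML string here is dummy data standing in for the real protocol format:

```python
import re
from bs4 import BeautifulSoup

# Dummy stand-in for the real protocol HTML
html = """
<p>ECU: 01</p>
<table><tr><td>VW part number</td><td>5Q0-123-456</td></tr></table>
"""

# The answer swaps "html.parser" for the faster "lxml" parser;
# html.parser is used here so the sketch runs without lxml installed
soup = BeautifulSoup(html, "html.parser")

# find_next stops at the first matching table, instead of collecting
# every table after the header the way find_all_next does
header = soup.find("p", string=re.compile(re.escape("ECU: 01")))
data_table = header.find_next("table")
first_cell = data_table.find("td").get_text()
```

Because find_all_next walks the entire rest of the document for every address, switching to find_next turns each lookup from O(remaining document) into a short local search.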

These changes brought the runtime for the larger HTML files down from over a minute to ~10 seconds.
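For anyone adapting this, the value_map pattern can be exercised standalone with dummy values; the `selections` list stands in for fv.info_values_to_add:

```python
# Dummy values standing in for the DataFrame lookups above
value_map = {
    "Software Version": "0456",
    "Hardware part number": "5Q0-907-530",
    "Hardware Version": "H11",
}

# User selections, in order; unknown keys are silently skipped,
# matching the original if/elif chain's final else: pass
selections = ["Hardware Version", "Software Version", "Unknown key"]

temp_list = [value_map[key] for key in selections if key in value_map]
# temp_list is ["H11", "0456"]
```

One dictionary lookup per key replaces the seven-branch if/elif chain, and keeps the output in the same order the user selected.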
