I currently have two tables, let's say symbol_data and cve_data.
symbol_data is structured as follows:
 #   Column           Non-Null Count   Dtype
---  ------           --------------   -----
 0   package_name     543080 non-null  object
 1   version          543080 non-null  object
 2   visiblename      543080 non-null  object
 3   cve_numbers      486737 non-null  object
 4   cve_numbers_all  543080 non-null  object
 5   family           543080 non-null  object
 6   lib_so           543080 non-null  object
 7   symbol           543080 non-null  object
 8   symbol_type      543080 non-null  object
cve_data is structured as follows:
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   cve_number      217542 non-null  object
 1   severity        217542 non-null  object
 2   description     217542 non-null  object
 3   score           217542 non-null  object
 4   cvss_version    217542 non-null  object
 5   cvss_vector     217542 non-null  object
 6   data_source     217542 non-null  object
 7   published_date  217542 non-null  object
 8   last_modified   217542 non-null  object
cve_numbers_all holds a list of CVE numbers per row. I want to explode cve_numbers_all and join the two tables on cve_number. However, I don't need all of the resulting records: I only want to keep a record if the symbol string from symbol_data appears anywhere inside the description string from cve_data.
When I explode cve_numbers_all the kernel crashes, probably because the exploded data becomes enormous.
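Roughly what I ran was something like this (a simplified sketch; at this point cve_number is still a regular column of cve_data):

exploded = symbol_data.explode('cve_numbers_all')
merged = exploded.merge(cve_data, left_on='cve_numbers_all', right_on='cve_number', how='inner')
# row-wise substring check: keep rows whose symbol occurs in the description
mask = [sym in desc for sym, desc in zip(merged['symbol'], merged['description'])]
filtered_df = merged[mask]

The kernel dies at the explode step, so it never even reaches the filter.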
So instead I tried filtering row by row (with cve_data indexed by cve_number):
import pandas as pd
from tqdm import tqdm

filtered_rows = []
for index, row in tqdm(symbol_data.iterrows(), total=symbol_data.shape[0], desc="Expanding Symbols over CVEs"):
    # drop CVE ids that have no entry in cve_data
    relevant_cves = [cve_num for cve_num in row['cve_numbers_all'] if cve_num in cve_data.index]
    for cve_num in relevant_cves:
        # keep the row only if the symbol appears in the CVE description
        if row['symbol'] in cve_data.at[cve_num, 'description']:
            filtered_rows.append(row)
filtered_df = pd.DataFrame(filtered_rows)
However, this runs extremely slowly.
I am using Python, and my data originally lives in a PostgreSQL database; I pull it into DataFrames to process it because I'm a noob and Python is easier for me. I am open to both Python and PostgreSQL solutions, please help.
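For reference, this is roughly how I pull the tables into pandas (the connection string is a placeholder, not my real one):

import pandas as pd
from sqlalchemy import create_engine

# placeholder connection string; the real one points at my database
engine = create_engine('postgresql://user:password@localhost:5432/mydb')
symbol_data = pd.read_sql('SELECT * FROM symbol_data', engine)
cve_data = pd.read_sql('SELECT * FROM cve_data', engine)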