I currently have two tables, let's say symbol_data and cve_data.
symbol_data is structured as follows:
 #   Column           Non-Null Count   Dtype
---  ------           --------------   -----
 0   package_name     543080 non-null  object
 1   version          543080 non-null  object
 2   visiblename      543080 non-null  object
 3   cve_numbers      486737 non-null  object
 4   cve_numbers_all  543080 non-null  object
 5   family           543080 non-null  object
 6   lib_so           543080 non-null  object
 7   symbol           543080 non-null  object
 8   symbol_type      543080 non-null  object
cve_data is structured as follows:
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   cve_number      217542 non-null  object
 1   severity        217542 non-null  object
 2   description     217542 non-null  object
 3   score           217542 non-null  object
 4   cvss_version    217542 non-null  object
 5   cvss_vector     217542 non-null  object
 6   data_source     217542 non-null  object
 7   published_date  217542 non-null  object
 8   last_modified   217542 non-null  object
cve_numbers_all holds a list of CVE numbers per row. I want to explode cve_numbers_all and join the two tables on cve_number. However, I don't need all of the resulting records: I only want to keep a record if the symbol string from symbol_data appears anywhere inside the description string from cve_data.
When I explode cve_numbers_all the kernel crashes, probably because the exploded data becomes enormous.
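Roughly what I ran was something like this (a simplified sketch; at this point cve_number is still a regular column of cve_data):

exploded = symbol_data.explode('cve_numbers_all')
merged = exploded.merge(cve_data, left_on='cve_numbers_all', right_on='cve_number', how='inner')
# row-wise substring check: keep rows whose symbol occurs in the description
mask = [sym in desc for sym, desc in zip(merged['symbol'], merged['description'])]
filtered_df = merged[mask]

The kernel dies at the explode step, so it never even reaches the filter.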
So instead I tried filtering row by row (with cve_data indexed by cve_number):
import pandas as pd
from tqdm import tqdm

filtered_rows = []
for index, row in tqdm(symbol_data.iterrows(), total=symbol_data.shape[0], desc="Expanding Symbols over CVEs"):
    # drop CVE ids that have no entry in cve_data
    relevant_cves = [cve_num for cve_num in row['cve_numbers_all'] if cve_num in cve_data.index]
    for cve_num in relevant_cves:
        # keep the row only if the symbol appears in the CVE description
        if row['symbol'] in cve_data.at[cve_num, 'description']:
            filtered_rows.append(row)
filtered_df = pd.DataFrame(filtered_rows)
However, this runs extremely slowly.
I am using Python, and my data originally lives in a PostgreSQL database; I pull it into DataFrames to process it because I'm a noob and Python is easier for me. I am open to both Python and PostgreSQL solutions, please help.
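For reference, this is roughly how I pull the tables into pandas (the connection string is a placeholder, not my real one):

import pandas as pd
from sqlalchemy import create_engine

# placeholder connection string; the real one points at my database
engine = create_engine('postgresql://user:password@localhost:5432/mydb')
symbol_data = pd.read_sql('SELECT * FROM symbol_data', engine)
cve_data = pd.read_sql('SELECT * FROM cve_data', engine)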