
Currently I have two tables, let's say symbol_data and cve_data. symbol_data is structured as below:

 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   package_name     543080 non-null  object
 1   version          543080 non-null  object
 2   visiblename      543080 non-null  object
 3   cve_numbers      486737 non-null  object
 4   cve_numbers_all  543080 non-null  object
 5   family           543080 non-null  object
 6   lib_so           543080 non-null  object
 7   symbol           543080 non-null  object
 8   symbol_type      543080 non-null  object

cve_data is structured as below:

Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   cve_number      217542 non-null  object
 1   severity        217542 non-null  object
 2   description     217542 non-null  object
 3   score           217542 non-null  object
 4   cvss_version    217542 non-null  object
 5   cvss_vector     217542 non-null  object
 6   data_source     217542 non-null  object
 7   published_date  217542 non-null  object
 8   last_modified   217542 non-null  object

cve_numbers_all holds a list of CVE numbers. I want to explode cve_numbers_all and join these two tables on cve_number. However, I don't need all of the resulting records: I only want to keep a record if the symbol string from the symbol_data table appears anywhere inside the description string of the cve_data table.
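
In pandas terms, the operation I have in mind is roughly the following (just a sketch using the column names above):

import pandas as pd

# Explode the list column so each CVE number gets its own row
exploded = symbol_data.explode('cve_numbers_all')

# Join against cve_data on the CVE number
merged = exploded.merge(cve_data, left_on='cve_numbers_all', right_on='cve_number', how='inner')

# Keep only the rows whose symbol appears somewhere in the CVE description
keep = [sym in desc for sym, desc in zip(merged['symbol'], merged['description'])]
filtered = merged[keep]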

When I explode cve_numbers_all, the kernel crashes, probably because the exploded data becomes enormous.

I tried filtering row by row:

filtered_rows = []
for index, row in tqdm(symbol_data.iterrows(), total=symbol_data.shape[0], desc="Expanding Symbols over CVEs"):
    # cve_data is indexed by cve_number here
    relevant_cves = [cve_num for cve_num in row['cve_numbers_all'] if cve_num in cve_data.index]
    for cve_num in relevant_cves:
        if row['symbol'] in cve_data.at[cve_num, 'description']:
            filtered_rows.append(row)
filtered_df = pd.DataFrame(filtered_rows)

However, it runs extremely slowly.

I am using Python, and my data originally lives in a PostgreSQL database. I pull it into a DataFrame to process it because I'm a beginner and Python is easier for me. I am open to both Python and PostgreSQL solutions, please help.

1 Answer


This would be fairly simple and effective in native SQL.

select * -- or whatever expressions you actually need
from symbol_data as sd
cross join lateral unnest(sd.cve_numbers_all) as cven
inner join cve_data as cd on cd.cve_number = cven
where position(sd.symbol in cd.description) <> 0;    

I assume that the data type of cve_numbers_all is an array.
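
Since you already work from Python, you can also run this query via pandas so that PostgreSQL does the unnest, join, and substring filtering, and only the matching rows ever reach your DataFrame. A minimal sketch, assuming a SQLAlchemy connection with the psycopg2 driver (the connection string is a placeholder, and the selected columns are just an example):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string -- substitute your real user/password/host/database
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/mydb")

query = """
select sd.*, cd.severity, cd.score, cd.description
from symbol_data as sd
cross join lateral unnest(sd.cve_numbers_all) as cven
inner join cve_data as cd on cd.cve_number = cven
where position(sd.symbol in cd.description) <> 0
"""

# The filtering happens inside PostgreSQL, so only the matching rows are pulled into Python
filtered_df = pd.read_sql(query, engine)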

Note: It would be better to include the tables' DDL in your question.
