There are approximately 13,000 values in a given column. The function below takes a list of strings as input and performs NER tagging for each word in the list; on average each list has around 300 words across those 13,000 values. The function currently takes more than an hour to process the column, so I would like a solution that processes it faster. I am running on an Azure ML notebook with a standard CPU compute.
Function:
def perform_ner_batch(texts):
    if not texts:  # Check if texts is empty
        return []
    # Perform NER on the provided texts, one string at a time
    list_entity = []
    for text in texts:
        ner_result = ner_pipeline(text)
        if not ner_result:
            # No entities found for this string
            list_entity.append('O')
        for result in ner_result:
            list_entity.append(result['entity_group'])
    return list_entity
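If ner_pipeline is a Hugging Face transformers token-classification pipeline (an assumption; the post does not show how it is created), the loop above runs one forward pass per string. The pipeline also accepts a list of strings together with a batch_size argument, letting it batch the forward passes internally. A minimal sketch of the same function written that way, keeping the 'O' placeholder for strings with no entities:

def perform_ner_batch(texts, batch_size=32):
    if not texts:  # Check if texts is empty
        return []
    # One call over the whole list: the pipeline batches the forward
    # passes and returns one result list per input string.
    batched_results = ner_pipeline(texts, batch_size=batch_size)
    list_entity = []
    for ner_result in batched_results:
        if not ner_result:
            list_entity.append('O')  # no entities found for this string
        for result in ner_result:
            list_entity.append(result['entity_group'])
    return list_entity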
Calling the function:
df['entities'] = df['Tokenized_Abstract_list'].apply(perform_ner_batch)
pandas is irrelevant here; the bottleneck is the ner_pipeline(i) call.
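Since the cost is per ner_pipeline call, a further option (again a sketch, assuming the same transformers pipeline) is to flatten every row's list into one long list, run a single batched pass over the whole column, and then split the results back per row, so batching is amortized across all 13,000 rows instead of restarting for each one:

# Hypothetical sketch: one batched pass over the entire column.
all_texts = [t for row in df['Tokenized_Abstract_list'] for t in row]
all_results = ner_pipeline(all_texts, batch_size=64)

# Split the flat result list back into per-row entity lists.
entities = []
pos = 0
for row in df['Tokenized_Abstract_list']:
    row_entities = []
    for ner_result in all_results[pos:pos + len(row)]:
        if not ner_result:
            row_entities.append('O')
        for result in ner_result:
            row_entities.append(result['entity_group'])
    pos += len(row)
    entities.append(row_entities)
df['entities'] = entities

On CPU the best batch_size is workload-dependent, so it is worth benchmarking a few values; the larger win usually comes from avoiding roughly 300 separate pipeline calls per row.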