
There are approximately 13,000 values in a given column. The function below takes a list of strings as input and performs NER tagging on each word in the list; on average a list contains about 300 words, across all 13,000 values. The function currently takes more than an hour to process the column, so I would like a solution that processes it faster. I am running on an Azure ML notebook with standard CPU compute.

Function:

def perform_ner_batch(texts):
    if not texts:  # Nothing to tag
        return []
    # Run NER on each string and collect the entity labels
    list_entity = []
    for text in texts:
        # One model call per string: this is the slow part
        ner_result = ner_pipeline(text)
        if not ner_result:
            # No entities found: tag as outside any entity
            list_entity.append('O')
        for result in ner_result:
            list_entity.append(result['entity_group'])
    return list_entity

Calling the function:

df['entities'] = df['Tokenized_Abstract_list'].apply(perform_ner_batch)
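
Assuming ner_pipeline is a Hugging Face transformers pipeline, the usual fix is to stop calling it once per string and instead pass the whole list in a single call with a batch_size, so the model batches its forward passes. A minimal sketch; batch_size=32 is an assumed starting value to tune:

def perform_ner_batch(texts, batch_size=32):
    if not texts:
        return []
    # One call over the whole list: the pipeline batches the forward
    # passes internally instead of running the model once per string.
    # batch_size=32 is an assumption; tune it for your hardware.
    ner_results = ner_pipeline(texts, batch_size=batch_size)
    list_entity = []
    for ner_result in ner_results:
        if not ner_result:
            list_entity.append('O')
        for result in ner_result:
            list_entity.append(result['entity_group'])
    return list_entity

Larger gains are often possible by flattening all rows into one list (or a datasets.Dataset), running a single batched pass over the whole column, and splitting the results back per row afterwards.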

  • Please read this notice: minimal reproducible example Commented Feb 14, 2024 at 10:39
  • Maybe you could try swifter Commented Feb 14, 2024 at 11:53
  • Have you tried with a GPU? (See the sketch after these comments.) Commented Feb 14, 2024 at 11:55
  • If I get it right, pandas is irrelevant here; the bottleneck is ner_pipeline(i). Commented Feb 14, 2024 at 12:33
  • On GPU it works faster, but it still takes hours for more than about 8k rows. 1k rows take just 14 minutes, which is a lot better, though the ner_pipeline logic still makes it slow. Commented Feb 14, 2024 at 16:52
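
Following the GPU suggestion above, a hedged sketch of constructing the pipeline on GPU. The model name is a placeholder assumption, not taken from the question:

from transformers import pipeline

# Placeholder checkpoint (an assumption); substitute your own model.
# device=0 puts the model on the first GPU; device=-1 (default) stays on CPU.
# aggregation_strategy="simple" produces the 'entity_group' key used above.
ner_pipeline = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",
    device=0,
)

As for the swifter suggestion: it parallelizes the .apply itself across cores, but since the bottleneck is the model call (as noted in the comments), batching and GPU placement usually matter more.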
