I have a pandas DataFrame with an HTML text field from which I want to derive two columns: a count of the tags it contains, and the clean text with all tags stripped. I am using BeautifulSoup for both. For example:

df_ads['content_elements_cnt'] = df_ads['content'].apply(lambda x: dict(Counter([element.name for element in BeautifulSoup(x).html if element.name is not None])))
df_ads['content_refined'] = df_ads['content'].apply(lambda x : BeautifulSoup(x).text)

Is it possible to encapsulate these two statements in one function and call it through apply to generate both columns, so that BeautifulSoup is instantiated and the content is parsed only once per row? In other words, is there a more efficient way of doing these two operations?

    Can you provide a minimal reproducible example of the dataset? Commented Jan 23, 2022 at 22:09

1 Answer

You could use a helper function and return a Series:

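from collections import Counter

import pandas as pd
from bs4 import BeautifulSoup

def bs_extract(x):
    # Parse the HTML once and reuse the soup for both results
    soup = BeautifulSoup(x)
    return pd.Series({'content_elements_cnt': dict(Counter([element.name for element in soup.html if element.name is not None])),
                      'content_refined': soup.text})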

df_ads[['content_elements_cnt', 'content_refined']] = df_ads['content'].apply(bs_extract)

NB: this code is untested, since no input data was provided.
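
Since no sample input was given, here is a small runnable sketch of the same idea on made-up data. The two sample rows are hypothetical. It also makes two assumptions that differ from the code above: it names the parser explicitly ("html.parser") instead of relying on BeautifulSoup's default, and it counts every tag at any depth with find_all(True) rather than iterating over the direct children of soup.html, on the guess that a full tag count is what is wanted.

```python
from collections import Counter

import pandas as pd
from bs4 import BeautifulSoup


def bs_extract(html):
    # Parse once and reuse the soup for both derived values.
    # "html.parser" is an assumption; the original relies on the default parser.
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) yields every tag in the document, at any nesting depth.
    tag_counts = dict(Counter(tag.name for tag in soup.find_all(True)))
    return pd.Series({"content_elements_cnt": tag_counts,
                      "content_refined": soup.get_text()})


# Hypothetical sample data, not taken from the question.
df_ads = pd.DataFrame({"content": ["<p>Hello <b>world</b></p>",
                                   "<div><p>One</p><p>Two</p></div>"]})

# apply() returns a DataFrame here because bs_extract returns a Series per row.
df_ads[["content_elements_cnt", "content_refined"]] = df_ads["content"].apply(bs_extract)
```

With this data, the first row ends up with {'p': 1, 'b': 1} and "Hello world". Building a Series for every row adds some overhead, so on a large frame it may be worth timing this against returning plain tuples.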
