I am running the following code in a Jupyter notebook. It checks the strings of text in nametest_df['text'] and returns persons' names. I managed to get this working and would like to write each name to the corresponding row of nametest_df['name'], where all values are currently NaN.

I tried the Series.replace() method, but every entry in the 'name' column ends up showing the same name.

Any clue how I can do this efficiently?

for word in nametest_df['text']:

    for sent in nltk.sent_tokenize(word):
        tokens = nltk.tokenize.word_tokenize(sent)
        tags = st.tag(tokens)

        for tag in tags:
            if tag[1]=='PERSON':
                name = tag[0]
                print(name)

    nametest_df.name = nametest_df.name.replace({"NaN": name})

Sample nametest_df

      **text**                    **name**
0   His name is John                NaN
1   I went to the beach             NaN
2   My friend is called Fred        NaN

Expected output

      **text**                    **name**
0   His name is John                John                
1   I went to the beach             NaN
2   My friend is called Fred        Fred      
1 Comment

post a sample df and expected df. Commented Nov 2, 2018 at 9:30

1 Answer

Don't try to fill series values one by one. This is inefficient and prone to error. A better idea is to create a list of names and assign it directly.

L = []
for word in nametest_df['text']:
    for sent in nltk.sent_tokenize(word):
        tokens = nltk.tokenize.word_tokenize(sent)
        tags = st.tag(tokens)
        for tag in tags:
            if tag[1]=='PERSON':
                L.append(tag[0])

nametest_df.loc[nametest_df['name'].isnull(), 'name'] = L
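
The assignment above needs exactly one value per null row, so it fails with a length mismatch when a text contains no name or more than one. A minimal sketch of one way around this is to compute one result per row and fall back to NaN when nothing is tagged as PERSON. The names first_person, toy_tag and KNOWN_NAMES are hypothetical and exist only for this illustration: toy_tag stands in for st.tag so the example runs without a tagger model, and str.split() stands in for the nltk tokenizers.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for st.tag(tokens): returns (token, label) pairs.
# Swap in the real tagger; KNOWN_NAMES is made up for this demo only.
KNOWN_NAMES = {"John", "Fred"}


def toy_tag(tokens):
    return [(tok, "PERSON" if tok in KNOWN_NAMES else "O") for tok in tokens]


def first_person(text, tag=toy_tag):
    """Return the first token tagged PERSON in text, or NaN if there is none."""
    # str.split() is a simplification of nltk.sent_tokenize / word_tokenize.
    for token, label in tag(text.split()):
        if label == "PERSON":
            return token
    return np.nan


nametest_df = pd.DataFrame({
    "text": [
        "His name is John",
        "I went to the beach",
        "My friend is called Fred",
    ],
    "name": np.nan,
})

# apply() yields exactly one value per row, so the lengths always match.
nametest_df["name"] = nametest_df["text"].apply(first_person)
print(nametest_df)
```

This sketch keeps only the first name found in each row, which is a design choice rather than something the question specifies. If some rows already hold a name that should be preserved, passing the same per-row result to nametest_df['name'].fillna(...) should fill only the null entries.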

4 Comments

This gives the following error: ValueError: Length of values does not match length of index. I suspect this happens because some text fields do not contain a name. How can I keep the NaN values for fields where the algorithm doesn't pick up a name?
@MarkTrapani, See update, you can use loc to mask for null values. You have to ensure that the number of values produced by your logic equals the number of null values. Otherwise, you have to fix your for loop.
Which for loop should be changed to push a NaN value to the name column where no names are found?
@MarkTrapani, Not sure, I'm not familiar with the nltk library. You should be able to work this out with your data by using print.
