Replacing specific values within a dataframe column

Question

I am running the following code in a Jupyter notebook. It checks the strings in nametest_df['text'] and returns persons' names. I managed to get this working and would now like to write each name to the corresponding row of nametest_df['name'], where all values are currently NaN.

I tried the Series.replace() method, but every entry in the 'name' column ends up showing the same name.

Any idea how I can do this efficiently?

for word in nametest_df['text']:

    for sent in nltk.sent_tokenize(word):
        tokens = nltk.tokenize.word_tokenize(sent)
        tags = st.tag(tokens)

        for tag in tags:
            if tag[1]=='PERSON':
                name = tag[0]
                print(name)

    nametest_df.name = nametest_df.name.replace({"NaN": name})

Sample nametest_df

      **text**                    **name**
0   His name is John                NaN
1   I went to the beach             NaN
2   My friend is called Fred        NaN

Expected output

      **text**                    **name**
0   His name is John                John                
1   I went to the beach             NaN
2   My friend is called Fred        Fred

Please post a sample df and the expected df.

– Pyd, Commented Nov 2, 2018 at 9:30

jpp · Accepted Answer · 2018-11-02 13:42:18Z

1

Don't try to fill series values one by one. This is inefficient and prone to error. A better idea is to create a list of names and assign it directly.

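L = []
for word in nametest_df['text']:  # each entry is the full text of one row
    for sent in nltk.sent_tokenize(word):
        tokens = nltk.tokenize.word_tokenize(sent)
        tags = st.tag(tokens)  # `st` is the tagger from the question
        for tag in tags:
            if tag[1] == 'PERSON':
                L.append(tag[0])

# Assign only to rows where 'name' is null; len(L) must equal the number of null rows
nametest_df.loc[nametest_df['name'].isnull(), 'name'] = L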
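
If some rows contain no name, the flat list above ends up shorter than the column and the assignment fails. One possible way around this is to produce exactly one value per row and fall back to NaN when nothing is found. The following is only a minimal sketch under assumptions: `first_person` is a hypothetical helper, and `toy_tag` is a made-up stand-in for the question's `st.tag` so that the example runs without any NER model. It also uses a plain `str.split()` in place of the nltk tokenizers.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's tagger `st.tag`: it labels a few
# hard-coded words as PERSON so the sketch runs without an NER model.
KNOWN_NAMES = {"John", "Fred"}

def toy_tag(tokens):
    return [(tok, "PERSON" if tok in KNOWN_NAMES else "O") for tok in tokens]

def first_person(text, tag=toy_tag):
    """Return the first PERSON token found in `text`, or NaN if there is none."""
    tokens = text.split()  # assumption: simple whitespace tokenization
    for token, label in tag(tokens):
        if label == "PERSON":
            return token
    return np.nan

nametest_df = pd.DataFrame({
    "text": ["His name is John", "I went to the beach", "My friend is called Fred"],
    "name": np.nan,
})

# One result per row, so the lengths always match and missing names stay NaN.
nametest_df["name"] = nametest_df["text"].apply(first_person)
print(nametest_df)
```

With the real tagger, the same idea would presumably work by passing `tag=st.tag` and replacing the `split()` call with the sentence and word tokenization from the loop above. Note that this sketch keeps only the first name found in each row.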

edited Nov 2, 2018 at 13:42

answered Nov 2, 2018 at 13:21


4 Comments

Mtra Over a year ago

This gives the following error: ValueError: Length of values does not match length of index. I suspect this happens because some text fields do not contain a name. How can I keep the NaN values for fields where the algorithm doesn't pick up a name?

jpp Over a year ago

@MarkTrapani, See update, you can use loc to mask for null values. You have to ensure that the number of values produced by your logic equals the number of null values. Otherwise, you have to fix your for loop.

Mtra Over a year ago

Which for loop should be changed to push a NaN value to the name column where no names are found?

jpp Over a year ago

@MarkTrapani, Not sure; I'm not familiar with the nltk library. You should be able to work this out with your data by using print.
