
This is a follow up on this question: How to create new column based on substrings in other column in a pandas dataframe?

The dataframe has the following structure

import numpy as np
import pandas as pd

test = pd.DataFrame({
    'Other input': ['Text A', 'Text B', 'Text C', 'Text D', 'Text E'],
    'Substance': ['(NPK) 20/10/6', np.nan, '46%N / O%P2O5 (Urea)', '46%N / O%P2O5 (Urea)', '(NPK) DAP Diammonphosphat; 18/46/0'],
    'value': [0.2, np.nan, 0.6, 0.8, 0.9]
})

  Other input                           Substance  value
0      Text A                       (NPK) 20/10/6    0.2
1      Text B                                 NaN    NaN
2      Text C                46%N / O%P2O5 (Urea)    0.6
3      Text D                46%N / O%P2O5 (Urea)    0.8
4      Text E  (NPK) DAP Diammonphosphat; 18/46/0    0.9

It was created by merging two DataFrames with a left join, so some rows have no substance and no value. I need to replace the substance with a short name. Before the dataset contained missing values, the following code worked:

test['Short Name'] = test['Substance'].apply(lambda x: 'Urea' if 'Urea' in x else 'DAP' if 'DAP' in x else '(NPK)')

How can I make this work with NaN (or 0 if that is easier)? Is there something equivalent to na_action=None that apparently works with applymap?

  • What do you want to do with the missing values? Ignore them? Commented Dec 11, 2021 at 18:15
  • yes, just keep the nan Commented Dec 11, 2021 at 18:17

2 Answers


If you want to skip rows containing NaN, add a call to dropna() before apply(). That creates a temporary copy of the dataframe from which every row containing NaN in any column has been removed.

test['Short Name'] = test.dropna()['Substance'].apply(lambda x: 'Urea' if 'Urea' in x else 'DAP' if 'DAP' in x else '(NPK)')

Output:

>>> test
  Other input                           Substance  value Short Name
0      Text A                       (NPK) 20/10/6    0.2      (NPK)
1      Text B                                 NaN    NaN        NaN
2      Text C                46%N / O%P2O5 (Urea)    0.6       Urea
3      Text D                46%N / O%P2O5 (Urea)    0.8       Urea
4      Text E  (NPK) DAP Diammonphosphat; 18/46/0    0.9        DAP

This works because assigning a Series to a DataFrame column aligns the two on their indexes. You can see this if you inspect the return value of the apply() call after adding dropna():

>>> test.dropna()['Substance'].apply(lambda x: 'Urea' if 'Urea' in x else 'DAP' if 'DAP' in x else '(NPK)')
0    (NPK)
2     Urea
3     Urea
4      DAP
Name: Substance, dtype: object

Notice how the index skips from 0 to 2. The row at index 1 was removed, but the remaining rows kept their original index labels, which is what we want here: on assignment, row 1 has no matching label and is filled with NaN.
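To make the index alignment concrete, here is a minimal sketch that rebuilds the example frame (named test, as in the question) and drops missing values from the Substance column only, rather than from the whole frame. The helper name short_name and the variable labels are made up for illustration; the rule itself is the lambda from the question.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    'Other input': ['Text A', 'Text B', 'Text C', 'Text D', 'Text E'],
    'Substance': ['(NPK) 20/10/6', np.nan, '46%N / O%P2O5 (Urea)',
                  '46%N / O%P2O5 (Urea)', '(NPK) DAP Diammonphosphat; 18/46/0'],
    'value': [0.2, np.nan, 0.6, 0.8, 0.9],
})


def short_name(text):
    # Hypothetical helper: same rule as the lambda in the question.
    return 'Urea' if 'Urea' in text else 'DAP' if 'DAP' in text else '(NPK)'


# Drop NaN from this one column only; surviving rows keep labels 0, 2, 3, 4.
labels = test['Substance'].dropna().apply(short_name)

# Assignment aligns on the index, so row 1 finds no match and becomes NaN.
test['Short Name'] = labels
print(test[['Substance', 'Short Name']])
```

Calling dropna() on the single column should also avoid discarding rows whose only NaN sits in some unrelated column, which test.dropna() would remove.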


You can do:

df = df.assign(
    short_name = df.Substance.apply(
        lambda x:
            # pd.notna() is a safer missing-value test than `x is not np.nan`
            do_this_if_x_is_not_NaN(x) if pd.notna(x)
            else do_this_if_x_is_NaN(x)))

with these functions (define them before running the call above):

def do_this_if_x_is_not_NaN(x):
    return 'Urea' if 'Urea' in x else 'DAP' if 'DAP' in x else '(NPK)'

def do_this_if_x_is_NaN(x):
    return np.nan # keeping the NaN, or whatever you want to return if x is NaN

df = df.assign(col_name = ...) is just another way of writing df['col_name'] = ...; assign() returns a new DataFrame instead of modifying df in place, hence the reassignment.

Your df will become:

  Other input                           Substance  value short_name
0      Text A                       (NPK) 20/10/6    0.2      (NPK)
1      Text B                                 NaN    NaN        NaN
2      Text C                46%N / O%P2O5 (Urea)    0.6       Urea
3      Text D                46%N / O%P2O5 (Urea)    0.8       Urea
4      Text E  (NPK) DAP Diammonphosphat; 18/46/0    0.9        DAP
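The question asks about na_action. As a hedged alternative, Series.map accepts na_action='ignore', which passes missing values through without calling the function on them. The sketch below rebuilds only the Substance column; the helper name short_name is made up for illustration and applies the same rule as above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Substance': ['(NPK) 20/10/6', np.nan, '46%N / O%P2O5 (Urea)',
                  '46%N / O%P2O5 (Urea)', '(NPK) DAP Diammonphosphat; 18/46/0'],
})


def short_name(text):
    # Hypothetical helper: only ever called with non-missing strings here.
    return 'Urea' if 'Urea' in text else 'DAP' if 'DAP' in text else '(NPK)'


# na_action='ignore' leaves NaN entries as NaN and skips the function call.
df['short_name'] = df['Substance'].map(short_name, na_action='ignore')
print(df)
```

This keeps the mapping rule free of any NaN handling, at the cost of relying on map rather than apply.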
