I am trying to write a function with if-else logic that modifies two columns in my data frame, but it's not working. Here is my function:

def get_comment_status(df):
    if df['address'] == 'NY':
        df['comment'] = 'call tomorrow'
        df['selection_status'] = 'interview scheduled'
        return df['comment'] 
        return df['selection_status']
    else:
        df['comment'] = 'Dont call'
        df['selection_status'] = 'application rejected'
        return df['comment']
        return df['selection_status']

I then execute the function as:

df[['comment', 'selection_status']] = df.apply(get_comment_status, axis = 1)

But I am getting an error. What am I doing wrong? My guess is that the df.apply() syntax is wrong.

Error Message:

TypeError: 'str' object cannot be interpreted as an integer
KeyError: ('address', 'occurred at index 0')

Sample dataframe:

import pandas as pd

df = pd.DataFrame({'address': ['NY', 'CA', 'NJ', 'NY', 'WS', 'OR', 'OR'],
                   'name1': ['john', 'mayer', 'dylan', 'bob', 'mary', 'jake', 'rob'],
                   'name2': ['mayer', 'dylan', 'mayer', 'bob', 'bob', 'tim', 'ben'],
                   'comment': ['n/a', 'n/a', 'n/a', 'n/a', 'n/a', 'n/a', 'n/a'],
                   'score': [90, 8, 88, 72, 34, 95, 50],
                   'selection_status': ['inprogress', 'inprogress', 'inprogress', 'inprogress', 'inprogress', 'inprogress', 'inprogress']})

I also thought of using a lambda function, but that doesn't work because I was trying to assign values to the 'comment' and 'selection_status' columns using '=', and assignment is not allowed inside a lambda.

Note: I have checked this question, which has a similar title, but it doesn't solve my problem.

5 Comments
  • It's useful if you list the error as well Commented Jun 12, 2018 at 22:37
  • Look at your return statements: only the first one in each branch gets executed. You'll need to return something else, essentially both values at the same time. Commented Jun 12, 2018 at 22:38
  • Can you post your desired output? Commented Jun 12, 2018 at 22:38
  • Note that .apply doesn't work on a dataframe, but on a row. For your code, it doesn't matter, but the naming of your variable df in your function implies you're thinking incorrectly about apply, which will cause confusion later on. Commented Jun 12, 2018 at 22:39
  • @9769953 - that was a very useful note. Gracias. Commented Jun 12, 2018 at 22:54

2 Answers

2

What you are trying to do is not very consistent with the Pandas philosophy of operating on whole columns at once. Also, apply is inefficient, since it calls your function once per row. You should probably use NumPy's where:

import numpy as np
df['comment'] = np.where(df['address'] == 'NY',
                  'call tomorrow', 'Dont call')
df['selection_status'] = np.where(df['address'] == 'NY',
                           'interview scheduled', 'application rejected')

Or boolean indexing:

df.loc[df['address'] == 'NY', ['comment', 'selection_status']] \
         = 'call tomorrow', 'interview scheduled'
df.loc[df['address'] != 'NY', ['comment', 'selection_status']] \
         = 'Dont call', 'application rejected'
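
As a rough check that the two approaches above give the same result, here is a minimal self-contained sketch. It uses a trimmed copy of the sample dataframe from the question; the names is_ny, via_where and via_loc are made up for illustration, and lists are used instead of bare tuples on the right-hand side of the .loc assignments.

```python
import numpy as np
import pandas as pd

# Trimmed copy of the sample dataframe from the question.
df = pd.DataFrame({
    'address': ['NY', 'CA', 'NJ', 'NY', 'WS', 'OR', 'OR'],
    'comment': ['n/a'] * 7,
    'score': [90, 8, 88, 72, 34, 95, 50],
    'selection_status': ['inprogress'] * 7,
})

# Boolean mask, computed once and reused (hypothetical helper name).
is_ny = df['address'] == 'NY'

# Approach 1: np.where, one column at a time.
via_where = df.copy()
via_where['comment'] = np.where(is_ny, 'call tomorrow', 'Dont call')
via_where['selection_status'] = np.where(
    is_ny, 'interview scheduled', 'application rejected')

# Approach 2: boolean indexing with .loc, two columns per assignment.
cols = ['comment', 'selection_status']
via_loc = df.copy()
via_loc.loc[is_ny, cols] = ['call tomorrow', 'interview scheduled']
via_loc.loc[~is_ny, cols] = ['Dont call', 'application rejected']

# Compare the two updated columns value by value.
print(via_where[cols].values.tolist() == via_loc[cols].values.tolist())
```

Computing the mask once also avoids repeating the df['address'] == 'NY' comparison for every assignment.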

2 Comments

This is what I understand so far: if I need to return more than one column, writing a function is not useful. I have used the df.loc method before, but here I wanted to return both columns at the same time instead of dealing with them separately using np.where or df.loc. I guess that wasn't the right approach.
@singularity2047, Pandas is based on series arrays (columns). Updating each series individually in a vectorised fashion will usually be faster than updating them together via pd.DataFrame.apply (which is just a very inefficient loop).
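
One way to check the speed claim on your own machine is a small timing sketch like the one below. The frame big, its size, and the helper names with_where and with_apply are assumptions made up for illustration; actual timings depend on the pandas version and hardware, so no numbers are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger frame (10,000 rows) so the difference is measurable.
big = pd.DataFrame({'address': ['NY', 'CA', 'NJ', 'OR'] * 2_500})

def with_where(frame):
    # Vectorised: a single comparison over the whole column.
    return np.where(frame['address'] == 'NY', 'call tomorrow', 'Dont call')

def with_apply(frame):
    # Row-wise: the lambda is called once for every row.
    return frame.apply(
        lambda row: 'call tomorrow' if row['address'] == 'NY' else 'Dont call',
        axis=1,
    )

# Time three runs of each approach.
t_where = timeit.timeit(lambda: with_where(big), number=3)
t_apply = timeit.timeit(lambda: with_apply(big), number=3)
print(f'np.where: {t_where:.4f}s  apply: {t_apply:.4f}s')
```
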
2

You should use numpy.where as per DyZ's solution. A principal benefit of Pandas is vectorised computations. However, below I'll show you how you would use pd.DataFrame.apply. Points to note:

  • Row data feeds your function one row at a time, not the entire dataframe in one go. Therefore, you should name arguments accordingly.
  • Two consecutive return statements will not work. A function exits at the first return it reaches, so the second one is never executed.
  • Instead, you need to return a list of results, then use pd.Series.values.tolist to unpack.

Here's a working example.

def get_comment_status(row):
    if row['address'] == 'NY':
        return ['call tomorrow', 'interview scheduled']
    else:
        return ['Dont call', 'application rejected']

df[['comment', 'selection_status']] = df.apply(get_comment_status, axis=1).values.tolist()

print(df)

  address  name1  name2        comment  score      selection_status
0      NY   john  mayer  call tomorrow     90   interview scheduled
1      CA  mayer  dylan      Dont call      8  application rejected
2      NJ  dylan  mayer      Dont call     88  application rejected
3      NY    bob    bob  call tomorrow     72   interview scheduled
4      WS   mary    bob      Dont call     34  application rejected
5      OR   jake    tim      Dont call     95  application rejected
6      OR    rob    ben      Dont call     50  application rejected
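
As a possible alternative to .values.tolist(), pandas 0.23 and later let apply expand the returned lists into columns through its result_type parameter. The sketch below assumes such a version and uses a shortened copy of the sample data; the variable name expanded is made up for illustration.

```python
import pandas as pd

# Shortened copy of the sample dataframe from the question.
df = pd.DataFrame({
    'address': ['NY', 'CA', 'NJ', 'NY'],
    'comment': ['n/a'] * 4,
    'selection_status': ['inprogress'] * 4,
})

def get_comment_status(row):
    # `row` is a single row (a pandas Series), not the whole dataframe.
    if row['address'] == 'NY':
        return ['call tomorrow', 'interview scheduled']
    return ['Dont call', 'application rejected']

# result_type='expand' turns each returned list into columns 0 and 1.
expanded = df.apply(get_comment_status, axis=1, result_type='expand')
expanded.columns = ['comment', 'selection_status']

# Write both columns back in a single assignment.
df[['comment', 'selection_status']] = expanded

print(df)
```

This still calls the function once per row, so it is no faster than the version above; it only changes how the results are unpacked.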

1 Comment

This is immensely helpful for me. Although I will lean towards np.where() from now on, I'd still like to learn different methods of doing same thing.
