apply custom function to an existing column to output multiple columns

Question

Here is my starting df:

import numpy as np
import pandas as pd

df = pd.DataFrame(['alpha', 'beta'], columns = ['text'])
df
    text
0   alpha
1   beta

Here is the end result I want:

    text    first           second          third
0   alpha   alpha-first     alpha-second    alpha-third
1   beta    beta-first      beta-second     beta-third

I have written the custom function parse(), no issue there:

def parse(text):
    return [text + '-first', text + '-second', text + '-third']

Now I try to apply parse() to the initial df, which is where errors arise:

1) If I try the following:

df = df.reindex(columns = list(df.columns) + ['first', 'second', 'third']) # Create empty columns    
df[['first', 'second', 'third']] = df.text.apply(parse)

I get:

ValueError: Must have equal len keys and value when setting with an ndarray

2) Slightly different version:

df = df.reindex(columns = list(df.columns) + ['first', 'second', 'third']).astype(object) # Create empty columns of "object" type    
df[['first', 'second', 'third']] = df.text.apply(parse)

I get:

ValueError: shape mismatch: value array of shape (2,) could not be broadcast 
to indexing result of shape (3,2)

Where am I going wrong?

EDIT:

I should clarify that parse() is a much more complicated function in the real-world problem I'm trying to solve: it takes a paragraph, finds 3 specific types of strings in it, and outputs those strings as a list of length 3. In the code above, I made up a simple definition of parse() as a stand-in, to avoid getting bogged down in details unrelated to the two errors I'm getting.

jpp · Accepted Answer · 2018-02-04 00:00:55Z

2

No need for apply:

import pandas as pd

df = pd.DataFrame(['alpha', 'beta'], columns = ['text'])

for i in ['first', 'second', 'third']:
    df[i] = df.text + '-' + i

#     text       first       second       third
# 0  alpha  alpha-first  alpha-second  alpha-third
# 1   beta   beta-first   beta-second   beta-third

In general the hierarchy of "process type" to choose for your calculations should be:

1) Vectorised calculations, such as above.
2) pd.Series.apply
3) pd.DataFrame.apply
4) pd.DataFrame.iterrows
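
As a rough illustration of that ordering, the sketch below builds the same 'first' column four ways, one per option in the list. The variable names are made up for the example, and nothing here measures speed; it only shows what each option looks like on the sample data.

```python
import pandas as pd

df = pd.DataFrame({'text': ['alpha', 'beta']})

# 1) Vectorised: one operation over the whole column.
vectorised = df['text'] + '-first'

# 2) Series.apply: calls the Python function once per value.
via_series_apply = df['text'].apply(lambda s: s + '-first')

# 3) DataFrame.apply with axis=1: calls the function once per row.
via_frame_apply = df.apply(lambda row: row['text'] + '-first', axis=1)

# 4) iterrows: an explicit Python loop over the rows.
via_iterrows = pd.Series(
    [row['text'] + '-first' for _, row in df.iterrows()],
    index=df.index,
)
```

All four produce the same values; they differ in how much of the per-row work happens in explicit Python code.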

edited Feb 4, 2018 at 0:00

answered Feb 3, 2018 at 23:43


7 Comments

gnotnek Over a year ago

Thanks, I learned something from your last statement. But I have questions. First, how is using a regular for-loop vectorized? I thought explicit loops were the very slowest (or tied for slowest) process type? Second, assume the function I am applying is arbitrary. All you know about it is it takes one string and outputs a list of three strings. (please see my EDIT in the original post, which was after your reply).

jpp Over a year ago

(1) Using a regular for loop is not vectorised. Using pandas in-built functionality, e.g. df.text + '-' + i is vectorised. (2) You are right, my 4th option, df.iterrows, is an explicit loop and it is slowest. (3) If your function is complex and not vectorisable, then pd.Series.apply or pd.DataFrame.apply is your best bet. Which one depends on how much data you need for your function (data from one column or all columns for each row).

gnotnek Over a year ago

Ok re: (2) and (3). Re: (1), the vectorized pandas functionality you used was still wrapped in an explicit for loop. Doesn't that count as a slow process then? Is it ok only because the loop itself iterates over only 3 columns while the code inside it implicitly iterates over (potentially) thousands or millions of rows? Just trying to reach full understanding. Thanks!

jpp Over a year ago

@gnotnek, i'm confused, where's the "vectorized pandas functionality you used was still wrapped in an explicit for loop" - no for loop in my code above!

gnotnek Over a year ago

for i in ['first', 'second', 'third']: is an explicit for loop, right?
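
On the point raised in these comments: the for statement is an explicit Python loop, but it runs once per new column, not once per row. A minimal sketch, assuming an arbitrary row count of 100,000 and a hypothetical counter named iterations:

```python
import pandas as pd

# Arbitrary size chosen for illustration only.
df = pd.DataFrame({'text': ['alpha', 'beta'] * 50_000})  # 100,000 rows

iterations = 0
for suffix in ['first', 'second', 'third']:
    # Each pass is a single whole-column operation; pandas handles
    # the per-row work internally.
    df[suffix] = df['text'] + '-' + suffix
    iterations += 1

# iterations is 3 here, regardless of how many rows df has.
```

The Python-level loop count depends on the number of columns being created, which is why it stays cheap as the row count grows.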


cmaher · Accepted Answer · 2018-02-04 00:11:26Z

1

Here's a one-liner with pd.DataFrame.assign:

df.assign(**{x: df['text']+'-'+x for x in ['first', 'second', 'third']})

#     text        first        second        third
# 0  alpha  alpha-first  alpha-second  alpha-third
# 1   beta   beta-first   beta-second   beta-third
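
One thing worth noting about this approach: assign returns a new DataFrame and leaves df itself untouched, so the result has to be bound to a name to be kept. A short sketch, where result is a name chosen for the example:

```python
import pandas as pd

df = pd.DataFrame(['alpha', 'beta'], columns=['text'])

# assign builds and returns a new frame; df is not modified in place.
result = df.assign(**{x: df['text'] + '-' + x for x in ['first', 'second', 'third']})

# df still has only the 'text' column; result has all four.
```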

edited Feb 4, 2018 at 0:11

answered Feb 3, 2018 at 23:53


MaxU - stand with Ukraine · Accepted Answer · 2018-02-04 00:21:43Z

1

This can be done in several ways:

Option 1:

def f(s):
    return pd.DataFrame(np.repeat(s, 3).values.reshape(len(s), -1),
                        columns=['first','second','third']) \
             .apply(lambda c: c+'-'+c.name)


In [183]: df[['first','second','third']] = f(df.text)

In [184]: df
Out[184]:
    text        first        second        third
0  alpha  alpha-first  alpha-second  alpha-third
1   beta   beta-first   beta-second   beta-third
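
Since f() is a single chained statement, it may help to see it unpacked. The sketch below is one possible step-by-step reading of it; the intermediate names (repeated, grid, frame, result) are invented for the example, and it uses to_numpy() where the original uses .values.

```python
import numpy as np
import pandas as pd

s = pd.Series(['alpha', 'beta'], name='text')

# Step 1: repeat every value three times
# (alpha, alpha, alpha, beta, beta, beta).
repeated = np.repeat(s.to_numpy(), 3)

# Step 2: fold the flat array into one row per original value (2 x 3).
grid = repeated.reshape(len(s), -1)

# Step 3: label the three columns.
frame = pd.DataFrame(grid, columns=['first', 'second', 'third'], index=s.index)

# Step 4: append "-<column name>" to each column.
result = frame.apply(lambda col: col + '-' + col.name)
```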

edited Feb 4, 2018 at 0:21

answered Feb 3, 2018 at 23:43


1 Comment

gnotnek Over a year ago

Thanks, but I'm looking more for a clarification of why I'm even getting these errors and a very minimal tweak of my code, not a wholesale change. I'm having some difficulty parsing your function since it's all one statement. Also please see my "EDIT" above re: the function itself being arbitrary, which I believe I made after your post.
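
On the "why" asked about here: df.text.apply(parse) returns a single Series of length 2 whose values are lists, not a 2 x 3 table, so it cannot be spread across three target columns as-is. That matches the "value array of shape (2,)" in the second error message, though the exact wording may vary between pandas versions. A minimal tweak, assuming parse() always returns a list of exactly three strings, is to expand the lists into a DataFrame before assigning. The names parsed and expanded are made up for this sketch:

```python
import pandas as pd

def parse(text):
    # Stand-in for the real function: one string in, a list of three strings out.
    return [text + '-first', text + '-second', text + '-third']

df = pd.DataFrame(['alpha', 'beta'], columns=['text'])

# A Series of length 2; each element is a Python list of three strings.
parsed = df['text'].apply(parse)

# Expand the lists into a 2 x 3 frame aligned on df's index, then assign.
expanded = pd.DataFrame(parsed.tolist(), index=df.index,
                        columns=['first', 'second', 'third'])
df[['first', 'second', 'third']] = expanded
```

With this approach the reindex step that pre-creates empty columns should not be needed, since the assignment creates the columns.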

thomas.mac · Accepted Answer · 2018-02-04 02:05:44Z

0

Check this:

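import pandas as pd

lst = ['text','first','second','third']
df = pd.DataFrame([['alpha']*len(lst),['beta']*len(lst)],columns=lst)

# Append "-<column name>" to every column, then strip the suffix from 'text'
final = df.apply(lambda x: x+'-'+x.name)
final.text = final.text.str.split('-').str[0]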

edited Feb 4, 2018 at 2:05

answered Feb 4, 2018 at 1:58
