
Here is my starting df:

import numpy as np
import pandas as pd

df = pd.DataFrame(['alpha', 'beta'], columns = ['text'])
df
    text
0   alpha
1   beta

Here is the end result I want:

    text    first           second          third
0   alpha   alpha-first     alpha-second    alpha-third
1   beta    beta-first      beta-second     beta-third

I have written the custom function parse(), no issue there:

def parse(text):
    return [text + '-first', text + '-second', text + '-third']

Now I try to apply parse() to the initial df, which is where errors arise:

1) If I try the following:

df = df.reindex(columns = list(df.columns) + ['first', 'second', 'third']) # Create empty columns    
df[['first', 'second', 'third']] = df.text.apply(parse)

I get:

ValueError: Must have equal len keys and value when setting with an ndarray

2) Slightly different version:

df = df.reindex(columns = list(df.columns) + ['first', 'second', 'third']).astype(object) # Create empty columns of "object" type    
df[['first', 'second', 'third']] = df.text.apply(parse)

I get:

ValueError: shape mismatch: value array of shape (2,) could not be broadcast 
to indexing result of shape (3,2)

Where am I going wrong?

EDIT:

I should clarify that parse() itself is a much more complicated function in the real-world problem I'm trying to solve: it takes a paragraph, finds 3 specific types of strings in it, and outputs those strings as a list of length 3. In my code above, I made up a simple definition of parse() as a substitute, to avoid getting bogged down in details unrelated to the two errors I'm getting.

4 Answers

No need for apply:

import pandas as pd

df = pd.DataFrame(['alpha', 'beta'], columns = ['text'])

for i in ['first', 'second', 'third']:
    df[i] = df.text + '-' + i

#     text       first       second       third
# 0  alpha  alpha-first  alpha-second  alpha-third
# 1   beta   beta-first   beta-second   beta-third

In general, the order of preference for the "process type" you choose for your calculations should be:

  1. Vectorised calculations, such as above.
  2. pd.Series.apply
  3. pd.DataFrame.apply
  4. pd.DataFrame.iterrows
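
For a function that cannot be vectorised (option 2 in the list above), a minimal sketch follows. It assumes parse() returns a list of exactly three strings for every row, as described in the question; the intermediate names parsed and expanded are illustrative only. pd.Series.apply returns a Series holding one list per row rather than a table with three columns, so the lists are first expanded into their own DataFrame and then joined back on the index.

```python
import pandas as pd

def parse(text):
    # Stand-in for an arbitrary, non-vectorisable function
    # that returns a list of exactly three strings.
    return [text + '-first', text + '-second', text + '-third']

df = pd.DataFrame(['alpha', 'beta'], columns=['text'])

# One list of length 3 per row: a Series of lists, not a 2 x 3 table.
parsed = df['text'].apply(parse)

# Expand each list into three columns, keeping the original row index.
expanded = pd.DataFrame(parsed.tolist(),
                        index=df.index,
                        columns=['first', 'second', 'third'])

# Attach the new columns to the original frame.
df = df.join(expanded)
print(df)
```

Building the intermediate DataFrame explicitly gives the three-column shape that the multi-column assignment expects.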

7 Comments

Thanks, I learned something from your last statement. But I have questions. First, how is using a regular for-loop vectorized? I thought explicit loops were the very slowest (or tied for slowest) process type? Second, assume the function I am applying is arbitrary. All you know about it is it takes one string and outputs a list of three strings. (please see my EDIT in the original post, which was after your reply).
(1) Using a regular for loop is not vectorised. Using pandas in-built functionality, e.g. df.text + '-' + i is vectorised. (2) You are right, my 4th option, df.iterrows, is an explicit loop and it is slowest. (3) If your function is complex and not vectorisable, then pd.Series.apply or pd.DataFrame.apply is your best bet. Which one depends on how much data you need for your function (data from one column or all columns for each row).
Ok re: (2) and (3). Re: (1), the vectorized pandas functionality you used was still wrapped in an explicit for loop. Doesn't that count as a slow process then? Is it ok only because the loop itself iterates over only 3 columns while the code inside it implicitly iterates over (potentially) thousands or millions of rows? Just trying to reach full understanding. Thanks!
@gnotnek, I'm confused: where is the "vectorized pandas functionality you used was still wrapped in an explicit for loop"? There is no for loop in my code above!
for i in ['first', 'second', 'third']: is an explicit for loop, right?

Here's a one-liner with pd.DataFrame.assign:

df.assign(**{x: df['text']+'-'+x for x in ['first', 'second', 'third']})

#     text        first        second        third
# 0  alpha  alpha-first  alpha-second  alpha-third
# 1   beta   beta-first   beta-second   beta-third


This can be done in several ways:

Option 1:

def f(s):
    return pd.DataFrame(np.repeat(s, 3).values.reshape(len(s), -1),
                        columns=['first','second','third']) \
             .apply(lambda c: c+'-'+c.name)


In [183]: df[['first','second','third']] = f(df.text)

In [184]: df
Out[184]:
    text        first        second        third
0  alpha  alpha-first  alpha-second  alpha-third
1   beta   beta-first   beta-second   beta-third
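
The single statement in f() can be broken into separate steps, which may make it easier to follow. This is only a sketch of the same idea; the intermediate names (names, repeated, grid, wide, result) are illustrative and not part of the answer's code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(['alpha', 'beta'], columns=['text'])
names = ['first', 'second', 'third']

# Step 1: repeat every value once per new column.
repeated = np.repeat(df['text'].to_numpy(), len(names))

# Step 2: fold the flat array into one row per original row.
grid = repeated.reshape(len(df), -1)

# Step 3: wrap the grid in a DataFrame with the target column names.
wide = pd.DataFrame(grid, columns=names, index=df.index)

# Step 4: append '-<column name>' to every value, one column at a time.
wide = wide.apply(lambda c: c + '-' + c.name)

# Step 5: attach the new columns to the original frame.
result = df.join(wide)
print(result)
```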

1 Comment

Thanks, but I'm looking more for a clarification of why I'm even getting these errors and a very minimal tweak of my code, not a wholesale change. I'm having some difficulty parsing your function since it's all one statement. Also please see my "EDIT" above re: the function itself being arbitrary, which I believe I made after your post.

Check this:

lst = ['text','first','second','third']
df = pd.DataFrame([['alpha']*len(lst),['beta']*len(lst)],columns=lst)

final = df.apply(lambda x: x+'-'+x.name)
final.text = final.text.str.split('-').str[0]  # keep the part before '-' in every row
