How to string join one column with another column - pandas

Question

I just came across this question: how do I use str.join with the values of one column as the separator for the values of another column? Here is my DataFrame:

>>> df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
   a      b
0  a  hello
1  b   good
2  c  great
3  d   nice

I would like each value in column a to join the characters of the corresponding value in column b, so my desired output is:

   a          b
0  a  haealalao
1  b    gbobobd
2  c  gcrcecact
3  d    ndidcde

How would I go about that?

Hopefully the pattern is clear. Here is the first row done in plain Python:

>>> 'a'.join('hello')
'haealalao'
>>>

Just like in the desired output.

I think it is useful to know how two columns can interact. join might not be the best example, but the same idea applies to other string methods: for example, using split to split one column on the values of another, or using replace to substitute characters based on another column.
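To illustrate the split and replace cases mentioned above, here is a minimal sketch. The frame and its column names (sep, text) are hypothetical and chosen only for this example; the row-wise pairing uses zip over the two columns.

```python
import pandas as pd

# Hypothetical frame: 'sep' holds a separator, 'text' holds the string to process
df2 = pd.DataFrame({'sep': ['-', '_', ' '],
                    'text': ['a-b-c', 'x_y', 'hello world']})

# Split each value of 'text' on the separator from the same row of 'sep'
parts = [t.split(s) for s, t in zip(df2['sep'], df2['text'])]

# Replace the row's separator in 'text' with a plus sign
replaced = [t.replace(s, '+') for s, t in zip(df2['sep'], df2['text'])]

print(parts)
print(replaced)
```

The same pattern should work for any str method that takes a second string as its argument.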

P.S. I have a self-answer below.

U13-Forward · Accepted Answer · 2020-12-20 01:55:01Z

3

TL;DR

The code below is the fastest solution I could find for this question:

it = iter(df['a'])
df['b'] = [next(it).join(i) for i in df['b']]

The code above first creates an iterator over column a. Inside the list comprehension, next fetches the next value of a for each value of b, and str.join combines the two strings.
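The pairing described above can be traced by hand with plain lists. This is only a sketch; the names seps and words are made up for the illustration and stand in for the two columns.

```python
# Stand-ins for the first two values of columns a and b
seps = ['a', 'b']
words = ['hello', 'good']

it = iter(seps)

# Each call to next() hands back the following separator
first = next(it).join(words[0])   # same as 'a'.join('hello')
second = next(it).join(words[1])  # same as 'b'.join('good')

# The list comprehension does the same thing in one pass
it2 = iter(seps)
result = [next(it2).join(w) for w in words]

print(first, second, result)
```

Because the iterator advances once per element of the comprehension, both sequences need to have the same length, which is always the case for two columns of one DataFrame.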

Long answer:

Here are my solutions:

Solution 1:

Use a list comprehension and an iterator:

it = iter(df['a'])
df['b'] = [next(it).join(i) for i in df['b']]
print(df)

Solution 2:

Group by the index, then apply str.join to the two columns' values in each group:

df['b'] = df.groupby(df.index).apply(lambda x: x['a'].item().join(x['b'].item()))
print(df)

Solution 3:

Use a list comprehension that iterates through both columns and calls str.join:

df['b'] = [x.join(y) for x, y in df.values.tolist()]
print(df)

All of these snippets output:

   a          b
0  a  haealalao
1  b    gbobobd
2  c  gcrcecact
3  d    ndidcde

Timing:

Now for timing with the timeit module. Here is the code used:

from timeit import timeit
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})

# Solution 1: iterator over column a, consumed inside the comprehension
def u11_1():
    it = iter(df['a'])
    df['b'] = [next(it).join(i) for i in df['b']]

# Solution 2: group by the index and join within each single-row group
def u11_2():
    df['b'] = df.groupby(df.index).apply(lambda x: x['a'].item().join(x['b'].item()))

# Solution 3: iterate over the rows as [a, b] lists
def u11_3():
    df['b'] = [x.join(y) for x, y in df.values.tolist()]

print('Solution 1:', timeit(u11_1, number=5))
print('Solution 2:', timeit(u11_2, number=5))
print('Solution 3:', timeit(u11_3, number=5))

Output:

Solution 1: 0.007374127670871819
Solution 2: 0.05485127553865618
Solution 3: 0.05787154087587698

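So the first solution, which uses an iterator, is the quickest in this run. Bear in mind that every function overwrites df['b'], so each call works on longer strings than the one before it.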


6 Comments

sammywemmy Over a year ago

As an aside, Series have an iter.

U13-Forward Over a year ago

@sammywemmy Oh yeah, that's true. You can post that as an answer if you want.

sammywemmy Over a year ago

Not really. Learning from your code. You could also try it with larger data and see if the speeds change.

U13-Forward Over a year ago

@sammywemmy Yeah.

Akash Ranjan Over a year ago

I think your functions aren't getting the same df to process.


Akash Ranjan · Accepted Answer · 2020-12-20 02:31:39Z

I tried achieving the output using df.apply:

>>> df.apply(lambda x: x['a'].join(x['b']), axis=1)
0    haealalao
1      gbobobd
2    gcrcecact
3      ndidcde
dtype: object

Timing it for a performance comparison:

from timeit import timeit
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})

def u11_1():
    it = iter(df['a'])
    df['b'] = [next(it).join(i) for i in df['b']]

def u11_2():
    df['b'] = df.groupby(df.index).apply(lambda x: x['a'].item().join(x['b'].item()))

def u11_3():
    df['b'] = [x.join(y) for x, y in df.values.tolist()]

def u11_4():
    df['c'] = df.apply(lambda x: x['a'].join(x['b']), axis=1)

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 1:', timeit(u11_1, number=5))
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 2:', timeit(u11_2, number=5))
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 3:', timeit(u11_3, number=5))
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 4:', timeit(u11_4, number=5))

Note that I reinitialize df before each timing call so that all the functions start from the same DataFrame. The same can be achieved by passing df as a parameter to each function.
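Passing the DataFrame as a parameter could look roughly like the sketch below. The function names are hypothetical, and the functions return the joined strings instead of assigning them, which is an assumption made here so that the input frame is never modified between runs. The groupby variant is left out for brevity.

```python
from timeit import timeit
import pandas as pd

def join_iter(df):
    # Iterator over column a, consumed inside the comprehension
    it = iter(df['a'])
    return [next(it).join(i) for i in df['b']]

def join_values(df):
    # Iterate over the rows as [a, b] lists
    return [x.join(y) for x, y in df.values.tolist()]

def join_apply(df):
    # Row-wise apply
    return df.apply(lambda x: x['a'].join(x['b']), axis=1).tolist()

base = pd.DataFrame({'a': ['a', 'b', 'c', 'd'],
                     'b': ['hello', 'good', 'great', 'nice']})

# Every function receives the same, unmodified frame
for fn in (join_iter, join_values, join_apply):
    print(fn.__name__, timeit(lambda: fn(base), number=5))
```

Since nothing is written back to base, every one of the five repetitions sees the original strings, so the timings are comparable across functions.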

Mayank Porwal · Accepted Answer · 2020-12-20 06:58:55Z

2

Here's another solution using zip and a list comprehension. It should be faster than df.apply:

In [1576]: df.b = [i.join(j) for i,j in zip(df.a, df.b)]

In [1578]: df
Out[1578]: 
   a          b
0  a  haealalao
1  b    gbobobd
2  c  gcrcecact
3  d    ndidcde
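As a quick sanity check, the zip result can be compared against the df.apply result on the sample data. This is a sketch only; the column name joined is hypothetical and is used here so that the original b column stays intact.

```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'],
                   'b': ['hello', 'good', 'great', 'nice']})

# zip pairs the values of a and b row by row
via_zip = [i.join(j) for i, j in zip(df['a'], df['b'])]

# Row-wise apply for comparison
via_apply = df.apply(lambda row: row['a'].join(row['b']), axis=1).tolist()

same = via_zip == via_apply

# Store the result in a new column instead of overwriting b
df['joined'] = via_zip
print(same)
print(df)
```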
