How can I use the values in one column as the separator for str.join on another column? Here is my DataFrame:

>>> import pandas as pd
>>> df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
>>> df
   a      b
0  a  hello
1  b   good
2  c  great
3  d   nice

I would like each value in column a to act as the separator that joins the characters of the matching value in column b, so my desired output is:

   a          b
0  a  haealalao
1  b    gbobobd
2  c  gcrcecact
3  d    ndidcde

How would I go about that?

To make the relationship clear, here is the first row done in plain Python:

>>> 'a'.join('hello')
'haealalao'

Just like in the desired output.

I think it is useful to know how two columns can interact row by row. join might not be the best example, but the same pattern applies to other string methods: for instance, splitting one column on the values of another, or replacing the characters found in another column with something else.
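As a rough illustration of that idea, the same row-wise pattern carries over to str.split and str.replace. The sketch below uses a small made-up frame whose column names sep and text are assumptions, not part of the question:

```python
import pandas as pd

# Hypothetical data: 'sep' holds a per-row delimiter, 'text' the string to process.
frame = pd.DataFrame({'sep': [',', ';'], 'text': ['a,b', 'c;d;e']})

# Split each text on the delimiter stored in the same row.
split_parts = [text.split(sep) for sep, text in zip(frame['sep'], frame['text'])]

# Replace each row's delimiter with a dash.
replaced = [text.replace(sep, '-') for sep, text in zip(frame['sep'], frame['text'])]

print(split_parts)
print(replaced)
```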

P.S. I have a self-answer below.

3 Answers

TL;DR

The code below is the fastest solution I found for this question:

it = iter(df['a'])
df['b'] = [next(it).join(i) for i in df['b']]

The code above first creates an iterator over the a column. Inside the list comprehension, next pulls the next separator from that iterator for each value in b, and str.join combines the two strings.
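To see the mechanism in isolation, here is a minimal sketch that uses plain lists in place of the two columns (a Series can be iterated in the same way). The variable names are made up for illustration:

```python
separators = ['a', 'b']    # stands in for df['a']
words = ['hello', 'good']  # stands in for df['b']

# iter() returns an iterator; each next() call yields the following element.
it = iter(separators)
first = next(it)
second = next(it)

# In the comprehension, a fresh iterator is advanced once per word,
# so each word is paired with the separator at the same position.
it = iter(separators)
joined = [next(it).join(word) for word in words]
print(joined)
```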

Long answer:

Here are my three solutions:

Solution 1:

Use a list comprehension with an iterator:

it = iter(df['a'])
df['b'] = [next(it).join(i) for i in df['b']]
print(df)

Solution 2:

Group by the index, so that each group is a single row, then apply str.join to that row's a and b values:

df['b'] = df.groupby(df.index).apply(lambda x: x['a'].item().join(x['b'].item()))
print(df)

Solution 3:

Use a list comprehension that iterates over both columns row by row and calls str.join:

df['b'] = [x.join(y) for x, y in df.values.tolist()]
print(df)
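One caveat with this solution: the x, y unpacking assumes every row holds exactly two values, so it fails once the frame has another column. A minimal sketch, assuming a hypothetical extra column c, that selects the two columns explicitly first:

```python
import pandas as pd

# Column 'c' is hypothetical and only here to show the effect of an extra column.
frame = pd.DataFrame({'a': ['a', 'b'], 'b': ['hello', 'good'], 'c': [1, 2]})

# Selecting ['a', 'b'] keeps each row at two values, so the unpacking still works.
joined = [x.join(y) for x, y in frame[['a', 'b']].values.tolist()]
print(joined)
```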

All three solutions output:

   a          b
0  a  haealalao
1  b    gbobobd
2  c  gcrcecact
3  d    ndidcde

Timing:

Now for timing with the timeit module. Here is the code used:

import pandas as pd
from timeit import timeit
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
def u11_1():
    it = iter(df['a'])
    df['b'] = [next(it).join(i) for i in df['b']]
    
def u11_2():
    df['b'] = df.groupby(df.index).apply(lambda x: x['a'].item().join(x['b'].item()))
    
def u11_3():
    df['b'] = [x.join(y) for x, y in df.values.tolist()]

print('Solution 1:', timeit(u11_1, number=5))
print('Solution 2:', timeit(u11_2, number=5))
print('Solution 3:', timeit(u11_3, number=5))

Output:

Solution 1: 0.007374127670871819
Solution 2: 0.05485127553865618
Solution 3: 0.05787154087587698

So in these timings the first solution, which uses an iterator, is the quickest.
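One thing to keep in mind when reading these numbers: every function assigns back to df['b'], so each repetition joins strings that the previous repetition already lengthened. A small sketch of that effect, reusing the logic of the first solution (the helper name join_in_place is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})

def join_in_place():
    # Same logic as Solution 1: overwrites column b on every call.
    it = iter(df['a'])
    df['b'] = [next(it).join(i) for i in df['b']]

# Record the length of the first string before and after each of five calls.
lengths = [len(df['b'].iloc[0])]
for _ in range(5):
    join_in_place()
    lengths.append(len(df['b'].iloc[0]))

print(lengths)  # each call turns a string of length n into one of length 2n - 1
```

By the time the second and third functions are timed, they start from much longer strings, so the three numbers are not measured on the same input.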

Comments

As an aside, Series have an __iter__ method.
@sammywemmy Oh yeah, that's true. You can post that as an answer if you want.
Not really, I'm just learning from your code. You could also try it with a larger dataset and see whether the speeds change.
@sammywemmy Yeah.
I think your functions aren't all getting the same df to process.

I tried achieving the output using df.apply:

>>> df.apply(lambda x: x['a'].join(x['b']), axis=1)
0    haealalao
1      gbobobd
2    gcrcecact
3      ndidcde
dtype: object

Timing it for a performance comparison:

import pandas as pd
from timeit import timeit
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})

def u11_1():
    it = iter(df['a'])
    df['b'] = [next(it).join(i) for i in df['b']]

def u11_2():
    df['b'] = df.groupby(df.index).apply(lambda x: x['a'].item().join(x['b'].item()))

def u11_3():
    df['b'] = [x.join(y) for x, y in df.values.tolist()]

def u11_4():
    df['c'] = df.apply(lambda x: x['a'].join(x['b']), axis=1)

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 1:', timeit(u11_1, number=5))
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 2:', timeit(u11_2, number=5))
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 3:', timeit(u11_3, number=5))
df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})
print('Solution 4:', timeit(u11_4, number=5))

Note that I reinitialize df before each timeit call so that every function starts from the same DataFrame. This can also be done by passing df to the function as a parameter.
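A rough sketch of the parameter-passing variant, assuming each call should see an untouched frame. The function names and the use of df.copy() are my own choices rather than part of the answers, and the copy itself is included in the measured time:

```python
import pandas as pd
from timeit import timeit

base = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})

def join_with_iter(frame):
    # Iterator-based approach, applied to whatever frame is passed in.
    it = iter(frame['a'])
    frame['b'] = [next(it).join(i) for i in frame['b']]
    return frame

def join_with_apply(frame):
    # Row-wise df.apply approach, applied to whatever frame is passed in.
    frame['b'] = frame.apply(lambda x: x['a'].join(x['b']), axis=1)
    return frame

# Each call receives a fresh copy, so no run sees strings lengthened by an earlier run.
print('iter :', timeit(lambda: join_with_iter(base.copy()), number=5))
print('apply:', timeit(lambda: join_with_apply(base.copy()), number=5))
```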

Here's another solution using zip and a list comprehension. It should be faster than df.apply:

In [1576]: df.b = [i.join(j) for i,j in zip(df.a, df.b)]

In [1578]: df
Out[1578]: 
   a          b
0  a  haealalao
1  b    gbobobd
2  c  gcrcecact
3  d    ndidcde
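A closely related variation, not taken from the answer above: because str.join can be called as str.join(separator, iterable), the built-in map can pair the two columns without an explicit comprehension. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['hello', 'good', 'great', 'nice']})

# map takes one value from each column per step and calls str.join(sep, text).
# list() consumes the map fully before the result is assigned back to column b.
df['b'] = list(map(str.join, df['a'], df['b']))
print(df)
```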
