
I want to convert a number to binary and store it in multiple columns of a Pandas DataFrame using Python. Here is an example.

import pandas as pd

df = pd.DataFrame([['a', 1], ['b', 2], ['c', 0]], columns=["Col_A", "Col_B"])

for i in range(0, len(df)):
    df.loc[i, 'Col_C'], df.loc[i, 'Col_D'] = list(bin(df.loc[i, 'Col_B'])[2:].zfill(2))

I am trying to convert a number to binary and store it in multiple columns of the DataFrame. After converting the number to binary, the output has to contain 2 digits. The code above works fine.

Question: if my dataset contains thousands of records, I can see a performance difference. How do I improve the performance of the above code? I tried the following single-line version, which didn't work for me.

df[['Col_C','Col_D']] = list( (bin(df['Col_B']).zfill(2) ) )
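This fails because bin() only accepts a single integer; passing the whole Series raises a TypeError rather than converting element-wise. A minimal sketch (assuming the same df as above):

import pandas as pd

df = pd.DataFrame([['a', 1], ['b', 2], ['c', 0]], columns=["Col_A", "Col_B"])

# bin() needs one int; a whole Series cannot be interpreted as an integer,
# so this raises a TypeError instead of converting each row.
try:
    bin(df['Col_B'])
except TypeError as exc:
    print(type(exc).__name__)  # prints: TypeError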

2 Answers

If performance is important, use numpy with this solution:

import numpy as np

d = df['Col_B'].values
m = 2
df[['Col_C', 'Col_D']] = pd.DataFrame(((d[:, None] & (1 << np.arange(m))) > 0).astype(int))
print(df)
  Col_A  Col_B  Col_C  Col_D
0     a      1      1      0
1     b      2      0      1
2     c      0      0      0
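To make the bitwise trick easier to follow, here is a small breakdown (a sketch; the intermediate names powers and bits are only illustrative). Note that np.arange(m) puts the least significant bit in Col_C; reverse the powers if you want the most significant bit first, matching the zfill-based output:

import numpy as np

d = np.array([1, 2, 0])            # same values as df['Col_B'].values
m = 2                              # number of bits / output columns

powers = 1 << np.arange(m)         # array([1, 2]): bit 0, then bit 1
bits = (d[:, None] & powers) > 0   # broadcasted AND -> (3, 2) boolean mask
print(bits.astype(int))
# [[1 0]
#  [0 1]
#  [0 0]]

# For most-significant-bit first (same digit order as bin(x)[2:].zfill(2)),
# use 1 << np.arange(m)[::-1] instead.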

Performance (about 1000 times faster):

df = pd.DataFrame([['a', 1], ['b', 2], ['c', 0]], columns=["Col_A", "Col_B"])

# repeat the 3 sample rows to get 3000 rows for the benchmark
df = pd.concat([df] * 1000, ignore_index=True)

In [162]: %%timeit
     ...: df[['Col_C','Col_D']] = df['Col_B'].apply(lambda x: pd.Series(list(bin(x)[2:].zfill(2))))
     ...: 
609 ms ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [163]: %%timeit
     ...: d = df['Col_B'].values
     ...: m = 2
     ...: df[['Col_C','Col_D']]  = pd.DataFrame((((d[:,None] & (1 << np.arange(m)))) > 0).astype(int))
     ...: 
618 µs ± 26.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
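If the values may need more than two bits, a variation on the same idea (a sketch, not part of the original answer; the Bit_0, Bit_1, ... column names are made up) derives the width from the largest value and builds the columns dynamically:

import numpy as np
import pandas as pd

df = pd.DataFrame([['a', 1], ['b', 2], ['c', 5]], columns=["Col_A", "Col_B"])

d = df['Col_B'].values
m = max(int(d.max()).bit_length(), 1)     # enough bits for the largest value
cols = [f'Bit_{i}' for i in range(m)]     # hypothetical column names
df[cols] = pd.DataFrame(((d[:, None] & (1 << np.arange(m))) > 0).astype(int))
print(df)   # 1 -> 1 0 0, 2 -> 0 1 0, 5 -> 1 0 1 (Bit_0 is the least significant bit)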

apply is the method you are looking for.

df[['Col_C','Col_D']] = df['Col_B'].apply(lambda x: pd.Series(list(bin(x)[2:].zfill(2))))

does the trick.

I benchmarked it on 3000 rows and it is faster than the for-loop method you mention (0.5 seconds vs. 3 seconds). In general, though, it won't be dramatically faster, since apply still calls the function for each row separately.

from time import time
start = time()
for i in range(0,len(df)):
    df.loc[i,'Col_C'],df.loc[i,'Col_D'] = list( (bin(df.loc[i,'Col_B'])[2:].zfill(2) ) )
print(time() - start)
# 3.4339962005615234

start = time()
df[['Col_C','Col_D']] = df['Col_B'].apply(lambda x: pd.Series(list(bin(x)[2:].zfill(2))))
print(time() - start)
# 0.5619983673095703

Note: I am using Python 3, where bin(1) returns '0b1', so I use bin(x)[2:] to strip the '0b' prefix.
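As an aside, format(x, '02b') produces the same zero-padded binary string as bin(x)[2:].zfill(2), so the apply line can also be written without the slice (just an equivalent spelling, not something benchmarked above):

df[['Col_C', 'Col_D']] = df['Col_B'].apply(lambda x: pd.Series(list(format(x, '02b'))))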

1 Comment

@jezrael, your solution worked. This is really faster. I processed 50K records: using your solution it took nearly 13 s, while Matej's solution took less than 1 s. I need to process huge data, so I want to go with performance.
