pandas - move multiple columns with the same name and different missing data into single column then delete duplicate columns

Question

I have a dataframe that looks like this:

Col1  | Col2  | Col1  | Col3  | Col1  | Col4
  a   |   d   |       |   h   |   a   |   p
  b   |   e   |   b   |   i   |   b   |   l
      |   l   |   a   |   l   |       |   a
  l   |   r   |   l   |   a   |   l   |   x
  a   |   i   |   a   |   w   |       |   i
      |   c   |       |   i   |   r   |   c
  d   |   o   |   d   |   e   |   d   |   o

Col1 is repeated multiple times in the dataframe. In each Col1, there is missing information. I need to create a new column that has all of the information from each Col1 occurrence.

How can I create a column with the complete information and then delete the previous duplicate columns?

Some information may be missing from multiple columns. This script is also meant to be used in the future when there could be one, three, five, or any number of duplicated Col1 columns.

The desired output looks like this:

Col2  | Col3  | Col4  | Col5
  d   |   h   |   p   |   a
  e   |   i   |   l   |   b
  l   |   l   |   a   |   a
  r   |   a   |   x   |   l
  i   |   w   |   i   |   a
  c   |   i   |   c   |   r
  o   |   e   |   o   |   d

I have been looking over this question but it is not clear to me how I could keep the desired Col1 with complete values. I could delete multiple columns of the same name but I need to first create a column with complete information.

You need to give more info. How do you come up with the values of Col1 and Col5 in the desired output? Why is Col1 of the output the same as Col4 of the sample data?
– Andy L., Commented Dec 18, 2019 at 1:42
Are you certain that when the column is duplicated, the values in each row are always the same when not missing? It's groupby + first in that case.
– ALollz, Commented Dec 18, 2019 at 1:45
I can't get a dataframe to have columns with the same name; it renames them to Col1.1, Col1.2, etc.
– oppressionslayer, Commented Dec 18, 2019 at 1:49
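
On the renaming mentioned in the last comment: pd.read_csv appends .1, .2, and so on to repeated header names. A minimal sketch of one way to undo that, assuming the numeric suffixes come only from that renaming (a column genuinely named something like "v1.2" would be clipped too); the CSV text here is made up for illustration:

```python
import io

import pandas as pd

# Hypothetical CSV text with a repeated "Col1" header.
csv_text = "Col1,Col2,Col1\na,d,\n,e,b\n"

df = pd.read_csv(io.StringIO(csv_text))
mangled = list(df.columns)  # read_csv has renamed the second Col1 to Col1.1

# Strip the trailing ".<number>" so the duplicated names match again.
df.columns = df.columns.str.replace(r"\.\d+$", "", regex=True)

# Group the same-named columns and keep the first non-missing value per row.
combined = df.T.groupby(level=0, sort=False).first().T
print(combined)
```

Empty CSV fields are already read as NaN, so no separate replace step is needed in this case.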

Santosh M. · Accepted Answer · 2019-12-18 01:51:21Z

2

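First, replace the empty values in your columns with NaN:

import numpy as np
# Turn empty or whitespace-only strings into NaN so they count as missing
df = df.replace(r'^\s*$', np.nan, regex=True)

Then group the columns by name with groupby and take the first non-missing value in each row with first():

# Note: the axis=1 argument of groupby is deprecated in newer pandas releases;
# df.T.groupby(level=0).first().T gives the same result there.
df = df.groupby(level=0, axis=1).first()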
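
To make the two steps above concrete, here is a small self-contained sketch that rebuilds the sample frame from the question and applies them. It assumes empty cells are stored as empty strings, and it uses the transpose form of the groupby so that it does not depend on the axis=1 argument; treat it as an illustration rather than the answerer's exact code.

```python
import numpy as np
import pandas as pd

# Sample data from the question; "" marks a missing cell (an assumption).
rows = [
    ["a", "d", "",  "h", "a", "p"],
    ["b", "e", "b", "i", "b", "l"],
    ["",  "l", "a", "l", "",  "a"],
    ["l", "r", "l", "a", "l", "x"],
    ["a", "i", "a", "w", "",  "i"],
    ["",  "c", "",  "i", "r", "c"],
    ["d", "o", "d", "e", "d", "o"],
]
df = pd.DataFrame(rows, columns=["Col1", "Col2", "Col1", "Col3", "Col1", "Col4"])

# Step 1: empty or whitespace-only strings become NaN.
df = df.replace(r"^\s*$", np.nan, regex=True)

# Step 2: group same-named columns and keep the first non-missing value per row.
combined = df.T.groupby(level=0).first().T

print(combined)
# Col1 now reads a, b, a, l, a, r, d
```

Because first() takes the first non-missing value, this relies on the duplicated columns agreeing wherever they both have data, as ALollz points out in the comments.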


This seemed to work, but I want to note that it reordered the columns. I am accepting this answer because the original question did not specify that the column order must remain the same.
– adin
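
The reordering happens because groupby sorts the group keys by default. If the original order matters, passing sort=False keeps each name in the position where it first appeared. A minimal sketch, using a hypothetical helper name (combine_duplicate_columns) and a made-up two-row frame:

```python
import numpy as np
import pandas as pd


def combine_duplicate_columns(df):
    """Collapse same-named columns into one, keeping first-appearance order.

    Hypothetical helper for illustration; not part of pandas.
    """
    cleaned = df.replace(r"^\s*$", np.nan, regex=True)
    # sort=False stops groupby from sorting the column names alphabetically.
    return cleaned.T.groupby(level=0, sort=False).first().T


# Col2 comes before Col1 here, so a sorted result would move it.
df = pd.DataFrame(
    [["d", "a", "h", ""],
     ["e", "",  "i", "b"]],
    columns=["Col2", "Col1", "Col3", "Col1"],
)

out = combine_duplicate_columns(df)
print(out)
# Column order is Col2, Col1, Col3
```

The same helper works for any number of repeated names, since the grouping is driven by the column labels rather than by a fixed list.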

moys · Answer · 2019-12-18 02:17:33Z

0

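Maybe something like this is what you are looking for.

import pandas as pd

# Unique column names (a set, so their original order is not preserved)
col_list = list(set(df.columns))
dicts = {}
for col in col_list:
    # Distinct non-empty values from every column whose name contains `col`
    val = list(filter(None, set(df.filter(like=col).stack().reset_index()[0].str.strip(' ').tolist())))
    dicts[col] = val
# Length of the longest value list; shorter columns are padded with NaN below
max_len = max(len(v) for v in dicts.values())
pd.DataFrame({k: pd.Series(v[:max_len]) for k, v in dicts.items()})

Output (distinct values per column name; row alignment and order are not preserved):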

   Col3     Col4    Col1    Col2
0   h          i    d       d
1   w          l    b       r
2   i          c    r       i
3   l          x    l       l
4   a          p    a       o
5   e          o    NaN     c
6   NaN        a    NaN     e
