0

I have a dataframe that looks like this:

Col1  | Col2  | Col1  | Col3  | Col1  | Col4
  a   |   d   |       |   h   |   a   |   p
  b   |   e   |   b   |   i   |   b   |   l
      |   l   |   a   |   l   |       |   a
  l   |   r   |   l   |   a   |   l   |   x
  a   |   i   |   a   |   w   |       |   i
      |   c   |       |   i   |   r   |   c
  d   |   o   |   d   |   e   |   d   |   o

Col1 is repeated multiple times in the dataframe. In each Col1, there is missing information. I need to create a new column that has all of the information from each Col1 occurrence.

How can I create a column with the complete information and then delete the previous duplicate columns?

Some information may be missing from multiple columns. This script is also meant to be used in the future when there could be one, three, five, or any number of duplicated Col1 columns.

The desired output looks like this:

Col2  | Col3  | Col4  | Col5
  d   |   h   |   p   |   a
  e   |   i   |   l   |   b
  l   |   l   |   a   |   a
  r   |   a   |   x   |   l
  i   |   w   |   i   |   a
  c   |   i   |   c   |   r
  o   |   e   |   o   |   d

I have been looking over this question but it is not clear to me how I could keep the desired Col1 with complete values. I could delete multiple columns of the same name but I need to first create a column with complete information.

7
  • They all have the same name? Commented Dec 18, 2019 at 1:29
  • 2
    You need to give more info. How do you come up with values of Col1 and Col5 of the desired output? why is Col1 of the output the same as Col4 of the sample data? Commented Dec 18, 2019 at 1:42
  • 2
    Are you certain that when the column is duplicated the values in each row are always the same when not missing? It's groupby + first in that case. Commented Dec 18, 2019 at 1:45
  • I can't get a dataframe to have the same name columns, it renames the to Col1.1 and Col1.2 , etc Commented Dec 18, 2019 at 1:49
  • 1
    @ALollz that worked, thx! Commented Dec 18, 2019 at 2:02

2 Answers 2

2

First replace empty values in your columns with nan as below:

import numpy as np
df = df.replace(r'^\s*$', np.nan, regex=True)

Then, you could use groupby and then first()

df.groupby(level = 0, axis = 1).first() 
Sign up to request clarification or add additional context in comments.

1 Comment

This seemed to work but I want to note it shifted reordered the columns. I am accepting this answer because in the original question I did not specify that column order must remain the same.
0

May be something like this is what you are looking for.

col_list = list(set(df.columns))
dicts={}
for col in col_list:
    val = list(filter(None,set(df.filter(like=col).stack().reset_index()[0].str.strip(' ').tolist())))
    dicts[col]= val
max_len=max([len(k) for k in dicts.values()])
pd.DataFrame({k:pd.Series(v[:max_len]) for k,v in dicts.items()})

output

   Col3     Col4    Col1    Col2
0   h          i    d       d
1   w          l    b       r
2   i          c    r       i
3   l          x    l       l
4   a          p    a       o
5   e          o    NaN     c
6   NaN        a    NaN     e

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.