
I have a dataframe in Pandas with duplicate values in Col1:

Col1
a
a
b
a
a
b
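
For reference, a minimal way to build this sample frame (assuming the single column shown above):

import pandas as pd
df = pd.DataFrame({'Col1': ['a', 'a', 'b', 'a', 'a', 'b']})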

What I want to do is split this df into separate dataframes, each containing only unique Col1 values.

DF1:

Col1
a
b

DF2:

Col1
a
b

DF3:

Col1
a

DF4:

Col1
a

Any suggestions?

1 Answer

I don't think you can achieve this in a fully vectorized way.

One possibility is to use a custom function that iterates over the items and keeps track of the values already seen, starting a new group whenever a duplicate appears. The result can then be used as a key to split with groupby:

import pandas as pd

def cum_uniq(s):
    # group counter: incremented each time a value repeats within the current group
    i = 0
    seen = set()
    out = []
    for x in s:
        if x in seen:
            i += 1
            seen = set()  # reset: start tracking a new group
        out.append(i)
        seen.add(x)
    return pd.Series(out, index=s.index)

out = [g for _,g in df.groupby(cum_uniq(df['Col1']))]

output:

[  Col1
 0    a,
   Col1
 1    a
 2    b,
   Col1
 3    a,
   Col1
 4    a
 5    b]

intermediate:

cum_uniq(df['Col1'])

0    0
1    1
2    1
3    2
4    3
5    3
dtype: int64
If order doesn't matter

Let's add a Col2 to the example:

  Col1  Col2
0    a     0
1    a     1
2    b     2
3    a     3
4    a     4
5    b     5
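
Col2 here is just a running index; if you are following along, one way to add it is:

df['Col2'] = range(len(df))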

The previous code gives:

[  Col1  Col2
 0    a     0,
   Col1  Col2
 1    a     1
 2    b     2,
   Col1  Col2
 3    a     3,
   Col1  Col2
 4    a     4
 5    b     5]

If order does not matter, you can vectorize it:

out = [g for _,g in df.groupby(df.groupby('Col1').cumcount())]

output:

[  Col1  Col2
0    a     0
2    b     2,
   Col1  Col2
1    a     1
5    b     5,
   Col1  Col2
3    a     3,
   Col1  Col2
4    a     4]
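
For reference, the intermediate grouping key here is each row's cumulative count within its Col1 value:

df.groupby('Col1').cumcount()

0    0
1    1
2    0
3    2
4    3
5    1
dtype: int64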
