
I have a DataFrame that looks like this:

prod_id, prod_name, col_1, col_2, type
101, electronic, 10, 10, old
102, hardware, 2, 4, old
101, electronic, 10, 10, new
102, hardware, 2, 1, new
103, other, 22, 13, new

I am trying to update my DataFrame so that, for each product, it keeps the row with type=old when all other columns are the same, and otherwise uses the values from the type=new row.

Final output:

prod_id, prod_name, col_1, col_2, type
101, electronic, 10, 10, old
102, hardware, 2, 1, new
103, other, 22, 13, new
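For reference, the sample frame can be built like this (column names as above):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'prod_id':   [101, 102, 101, 102, 103],
    'prod_name': ['electronic', 'hardware', 'electronic', 'hardware', 'other'],
    'col_1':     [10, 2, 10, 2, 22],
    'col_2':     [10, 4, 10, 1, 13],
    'type':      ['old', 'old', 'new', 'new', 'new'],
})
print(df)
```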
  • did you look at the Dataframe.drop_duplicates method? It takes a subset parameter Commented Aug 18, 2020 at 14:08
  • @PaulH, I did check out the drop_duplicates method. What I'm having trouble with is applying drop_duplicates based on values in a column Commented Aug 18, 2020 at 14:10
  • did you read about the subset parameter? Commented Aug 18, 2020 at 14:10
  • also, check your example output. you have a "new" electronic row but you included the "old" row, which seems to contradict your problem statement Commented Aug 18, 2020 at 14:15
  • @PaulH, maybe I did not put this correctly in the first place. If all rows (except type) are the same, I would prefer to take the first occurrence. If the values in any of the columns mismatch, I would like to take the latest row. On the other question, df.drop_duplicates(subset=['col_1','col_2']) would perform the duplicate elimination, but I am trying to check the type column before applying drop_duplicates Commented Aug 18, 2020 at 14:18

3 Answers


From what I understand, you can try two boolean masks: one keeping rows that are duplicated (ignoring type) and have type='old', and another keeping type='new' rows that are not duplicated:

u = df.drop(columns="type")            # compare everything except 'type'
c = ((u.duplicated(keep=False) & df['type'].eq('old')) |
     (df['type'].eq('new') & ~u.duplicated(keep=False)))
out = df[c].copy()

   prod_id   prod_name  col_1  col_2 type
0      101  electronic     10     10  old
3      102    hardware      2      1  new
4      103       other     22     13  new
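For what it's worth, the masks hinge on duplicated(keep=False), which flags every member of a duplicate group rather than just the repeats. A self-contained run on the sample data (frame rebuilt here for convenience):

```python
import pandas as pd

df = pd.DataFrame({
    'prod_id':   [101, 102, 101, 102, 103],
    'prod_name': ['electronic', 'hardware', 'electronic', 'hardware', 'other'],
    'col_1':     [10, 2, 10, 2, 22],
    'col_2':     [10, 4, 10, 1, 13],
    'type':      ['old', 'old', 'new', 'new', 'new'],
})

u = df.drop(columns='type')            # compare everything except 'type'
dup = u.duplicated(keep=False)         # True for both 101 rows, False elsewhere
c = (dup & df['type'].eq('old')) | (df['type'].eq('new') & ~dup)
out = df[c].copy()
print(out)
```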

As I see it, you want the result to contain a single row for each prod_id (more precisely, the last row of each group).

The type column should stay 'old' only when the group's rows are identical in every column except type; otherwise it becomes 'new'.

To get this result, define the following function:

def grpRes(grp):
    res = grp.iloc[-1].copy()          # last row of the group
    # 'old' only if an 'old' row exists and all rows match apart from 'type'
    same = grp.drop(columns='type').drop_duplicates().shape[0] == 1
    res['type'] = 'old' if same and grp['type'].eq('old').any() else 'new'
    return res

Then apply this function to each group:

result = df.groupby('prod_id').apply(grpRes).reset_index(drop=True)

The result is:

   prod_id   prod_name  col_1  col_2 type
0      101  electronic     10     10  old
1      102    hardware      2      1  new
2      103       other     22     13  new
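A self-contained run of the groupby route on the sample data; this sketch compares whole rows (via drop_duplicates) rather than individual cell values, so it also works when col_1 and col_2 legitimately differ within a matching pair:

```python
import pandas as pd

df = pd.DataFrame({
    'prod_id':   [101, 102, 101, 102, 103],
    'prod_name': ['electronic', 'hardware', 'electronic', 'hardware', 'other'],
    'col_1':     [10, 2, 10, 2, 22],
    'col_2':     [10, 4, 10, 1, 13],
    'type':      ['old', 'old', 'new', 'new', 'new'],
})

def grp_res(grp):
    res = grp.iloc[-1].copy()          # last row of the group
    # 'old' only if an 'old' row exists and all rows match apart from 'type'
    same = grp.drop(columns='type').drop_duplicates().shape[0] == 1
    res['type'] = 'old' if same and grp['type'].eq('old').any() else 'new'
    return res

result = df.groupby('prod_id').apply(grp_res).reset_index(drop=True)
print(result)
```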


There is a simple solution, provided the type='old' row always comes before its type='new' counterpart within each group:

columns = list(df.columns)
columns.remove('type')
# first drop exact duplicates, keeping the earlier ('old') row
df = df.drop_duplicates(subset=columns, keep='first')
# then keep the most recent remaining row for each product
df = df.drop_duplicates(subset=['prod_id'], keep='last')
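A self-contained check (frame rebuilt inline). Note that a second pass keyed on prod_id is needed, because the old and new rows of prod_id 102 differ and therefore both survive the first drop_duplicates:

```python
import pandas as pd

df = pd.DataFrame({
    'prod_id':   [101, 102, 101, 102, 103],
    'prod_name': ['electronic', 'hardware', 'electronic', 'hardware', 'other'],
    'col_1':     [10, 2, 10, 2, 22],
    'col_2':     [10, 4, 10, 1, 13],
    'type':      ['old', 'old', 'new', 'new', 'new'],
})

cols = [c for c in df.columns if c != 'type']
step1 = df.drop_duplicates(subset=cols, keep='first')         # drops the duplicate 'new' 101 row
out = step1.drop_duplicates(subset=['prod_id'], keep='last')  # drops the stale 'old' 102 row
print(out)
```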
