Removing duplicate rows in a DataFrame in Python

Question

I have a dataframe with 27949 rows and 7 columns; the first few rows look like this: https://i.sstatic.net/1Pipf.png

Task: The dataframe has a 'title' column containing many duplicate titles, which I want to remove. (By duplicate title I mean that almost the whole title is the same except for 1 or 2 words.)

Pseudocode: I want to compare the 1st row with all other rows and remove any row that is a duplicate. Then I want to compare the 2nd row with all other rows and remove any duplicates, and so on for all rows, i.e. i = 1st line to last line, j = i+1 to last line.

My code:

for i in range(0,27950):
    for j in range(1,27950):
        a = data_sorted['title'].iloc[i].split()
        b = data_sorted['title'].iloc[j].split()
        if len(a)-len(b)<=2:
            data_sorted.drop(b)
            j=j
        else:
            j+=1
    i+=1

Error: IndexError: single positional indexer is out-of-bounds

Can anyone please help me with my code? Thanks in advance.

Does a duplicate title mean a duplicated row? If the title is duplicated but the rest of the row is not, that can lead to issues.
– Mox, Commented Aug 11, 2018 at 5:56
Anyway, the reason you are getting the positional index error is that you try to drop elements inside the loop; setting j=j does not shrink the range of indices you are looping over.
– Mox, Commented Aug 11, 2018 at 6:03
I added j=j because when the row at j (i+1) is dropped, the row after it becomes the new j-th row.
– Vnay, Commented Aug 11, 2018 at 6:13
But your j=j has no effect, and there is no need for j+=1 and i+=1: in Python the for loop advances the variable automatically, so those manual increments are overwritten at the start of the next iteration. I hope my explanation is clear.
– Mox, Commented Aug 11, 2018 at 6:16
It seems a bit strange to drop rows of a DataFrame dynamically, since I mostly treat DataFrames as append-only data structures. For dedup applications I would use groupby and apply to create a new DataFrame.
– zaxliu, Commented Aug 11, 2018 at 6:24
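
The points raised in the comments can be sketched on a small example. The three-row frame below is a hypothetical stand-in for data_sorted, not the asker's data. It illustrates that `.iloc` accepts positions 0 to len(df) - 1 only (so `range(0, 27950)` on 27949 rows reaches one position too far), that reassigning a `for` loop variable inside the body does not change which values the loop visits, and that `drop()` returns a new frame by default.

```python
import pandas as pd

# Hypothetical stand-in for data_sorted; only the 'title' column matters here.
df = pd.DataFrame(
    {"title": ["red cotton shirt", "blue cotton shirt", "steel water bottle"]}
)

# Valid positions are 0 .. len(df) - 1, so position len(df) is out of bounds.
try:
    df["title"].iloc[len(df)]
    raised = False
except IndexError:
    raised = True

# Reassigning the loop variable has no effect on the next iteration:
# the for statement rebinds it from the range each time.
visited = []
for j in range(3):
    visited.append(j)
    j += 1  # overwritten at the start of the next iteration

# drop() returns a new frame by default; df itself is unchanged
# unless the result is assigned back.
smaller = df.drop(df.index[1])
```

Because of the last point, a loop that calls `drop()` without assigning the result never shortens the frame it is iterating over.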

Shgidi · Accepted Answer · 2018-08-11 06:49:59Z


I would suggest the following approach:

Build a difference matrix of your titles, where element (i, j) is the number of differing words between the i-th and j-th titles.

Like so:

    import numpy as np
    from itertools import product

    l = list(data_sorted['title'])

    def diff_words(text_1, text_2):
        # return the number of different words between two texts
        words_1 = text_1.split()
        words_2 = text_2.split()
        diff = max(len(words_1),len(words_2))-len(np.intersect1d(words_1, words_2))
        return diff


    differences = [diff_words(i,j) for i,j in product(l,l)]
    # differences: a flat list of integers; entry i*len(l) + j is the word difference between titles i and j
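
One way the flat list could then be used to drop near-duplicates is sketched below on a hypothetical four-title frame. The threshold of 2 words comes from the question; the helper name `drop_near_duplicates`, its parameters, and the greedy "keep the first title, drop later ones close to it" rule are assumptions for illustration, not part of the answer above. Comparing all pairs of 27949 titles means roughly 781 million calls to `diff_words`, so this is a sketch of the idea rather than something tuned for the full data.

```python
import numpy as np
import pandas as pd
from itertools import product


def diff_words(text_1, text_2):
    # Number of differing words between two texts (same as in the answer).
    words_1 = text_1.split()
    words_2 = text_2.split()
    return max(len(words_1), len(words_2)) - len(np.intersect1d(words_1, words_2))


def drop_near_duplicates(df, column="title", max_diff=2):
    # Hypothetical helper: keep the first of each group of near-identical titles.
    titles = list(df[column])
    n = len(titles)
    flat = [diff_words(a, b) for a, b in product(titles, titles)]
    matrix = np.array(flat).reshape(n, n)  # matrix[i, j] = diff of titles i and j

    keep = np.ones(n, dtype=bool)
    for i in range(n):
        if not keep[i]:
            continue  # already dropped as a duplicate of an earlier row
        for j in range(i + 1, n):
            if keep[j] and matrix[i, j] <= max_diff:
                keep[j] = False
    # Build a new frame instead of dropping rows while iterating.
    return df[keep].reset_index(drop=True)


# Made-up titles, for illustration only.
data_sorted = pd.DataFrame({"title": [
    "mens red cotton round neck shirt",
    "mens blue cotton round neck shirt",
    "stainless steel water bottle 1 litre",
    "mens red cotton v neck shirt",
]})
deduped = drop_near_duplicates(data_sorted)
```

Selecting rows with a boolean mask at the end avoids the shifting-index problem that comes from dropping rows inside the loop.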

