I have a dataframe with 27949 rows and 7 columns; the first few rows look like this: https://i.sstatic.net/1Pipf.png

Task: The dataframe has a 'title' column containing many near-duplicate titles that I want to remove (near-duplicate: the titles are the same except for 1 or 2 words).

Pseudo code: I want to compare the 1st row with all the rows after it and remove any that are duplicates of it. Then I want to compare the 2nd row with all the rows after it and do the same, and so on for every row, i.e. i = first row to last row, j = i+1 to last row.

My code:

for i in range(0,27950):
    for j in range(1,27950):
        a = data_sorted['title'].iloc[i].split()
        b = data_sorted['title'].iloc[j].split()
        if len(a)-len(b)<=2:
            data_sorted.drop(b)
            j=j
        else:
            j+=1
    i+=1

Error: IndexError: single positional indexer is out-of-bounds

Can anyone please help me fix my code? Thanks in advance.

  • Does a duplicate title mean a duplicated row? If the title is duplicated but the rest of the row is not, that can lead to issues. Commented Aug 11, 2018 at 5:56
  • Anyway, the reason you are getting the positional index error is that you try to drop elements inside the loop; setting j=j does not shrink the range of indices you are looping over. Commented Aug 11, 2018 at 6:03
  • I added j=j because when the row at j (i+1) is dropped, the row after it becomes the j-th row. Commented Aug 11, 2018 at 6:13
  • But your j=j has no effect, and there is no need for j+=1 and i+=1: in Python the for-loop advances the variable automatically, and anything you assign to i or j inside the body is overwritten at the start of the next iteration. I hope my explanation is clear. Commented Aug 11, 2018 at 6:16
  • Seems a bit strange to dynamically drop rows of a DataFrame, since DataFrames are mostly append-only data structures when I use them. I would use groupby and apply to create a new DataFrame in dedup applications. Commented Aug 11, 2018 at 6:24
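
To make the points in the comments concrete, here is a small sketch (not the asker's data; the three-element `titles` Series is a made-up stand-in) showing how a `for` loop over `range` treats reassignment of the loop variable, and how `iloc` reacts when the range runs one past the last row, as `range(0, 27950)` does for a 27949-row frame:

```python
import pandas as pd

# Reassigning the loop variable inside the body does not change which
# value the loop hands out next: range() supplies the next value anyway.
visited = []
for i in range(0, 5):
    visited.append(i)
    i += 1  # rebinds i only for the rest of this iteration
print(visited)

# Hypothetical stand-in for data_sorted['title'].
titles = pd.Series(["a", "b", "c"])
n = len(titles)  # valid positions are 0 .. n-1

# Asking for position n (one past the end) raises IndexError.
try:
    titles.iloc[n]
except IndexError as exc:
    message = str(exc)
print(message)
```

Looping with `range(len(data_sorted))` instead of a hard-coded bound avoids the off-by-one. Note also that `DataFrame.drop` returns a new frame by default, so its result has to be assigned for the drop to take effect.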

1 Answer

I would suggest the following approach:

Build a difference matrix of your titles, where element (i, j) is the number of words that differ between the i-th and j-th titles.

Like so:

    import numpy as np
    from itertools import product

    l = list(data_sorted['title'])

    def diff_words(text_1, text_2):
        # return the number of different words between two texts
        words_1 = text_1.split()
        words_2 = text_2.split()
        diff = max(len(words_1),len(words_2))-len(np.intersect1d(words_1, words_2))
        return diff


    differences = [diff_words(i,j) for i,j in product(l,l)]
    # differences: a flat list of integers; entry i*len(l) + j is the word difference between titles i and j
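
The answer stops at the difference values, so the following is only one possible way to finish the job, not the answerer's method. It is a minimal sketch under these assumptions: two titles count as duplicates when they differ by at most 2 words (the `max_diff` parameter is made up for illustration), the first occurrence is the one kept, and `drop_near_duplicates` and the sample titles are hypothetical. It uses plain Python sets in place of `np.intersect1d` and compares each title only against the titles already kept, which avoids materialising all pairs.

```python
import pandas as pd

def diff_words(text_1, text_2):
    # Number of differing words, counting each distinct word once.
    words_1 = set(text_1.split())
    words_2 = set(text_2.split())
    return max(len(words_1), len(words_2)) - len(words_1 & words_2)

def drop_near_duplicates(df, column="title", max_diff=2):
    # Keep a row only if its title differs by more than max_diff words
    # from every title kept so far; build a new frame instead of
    # dropping rows while iterating.
    titles = df[column].tolist()
    kept = []  # positions of the rows kept so far
    for pos, title in enumerate(titles):
        if all(diff_words(title, titles[k]) > max_diff for k in kept):
            kept.append(pos)
    return df.iloc[kept].reset_index(drop=True)

# Made-up sample data standing in for data_sorted.
sample = pd.DataFrame({
    "title": [
        "blue cotton shirt for men size small",
        "blue cotton shirt for men size large",
        "red leather wallet",
        "blue cotton shirt for men size medium",
        "steel water bottle 1 litre",
    ]
})

deduped = drop_near_duplicates(sample)
print(deduped["title"].tolist())
```

This is still quadratic in the worst case, so on 27949 rows it may be slow when few titles are duplicates; grouping titles by a cheap key first (for example their first word) would be one way to cut the number of comparisons.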