I'm working on a function that detects whether a row is a duplicate based on multiple conditions (square meters, images and price). It works fine until it finds a duplicate and removes that row from the DataFrame, which disturbs my for loop and raises IndexError: single positional indexer is out-of-bounds.

def image_duplicate(df):
    # Detecting duplicates based on the publications' images, m2 and price.
    for index1 in df.index:
        for index2 in df.index:
            if index1 == index2:
                continue
            print('index1: {} \t index2: {}'.format(index1, index2))

            img1 = Image.open(requests.get(df['img_url'].iloc[index1], stream=True).raw).resize((213, 160))
            img2 = Image.open(requests.get(df['img_url'].iloc[index2], stream=True).raw).resize((213, 160))
            img1 = np.array(img1).astype(float)
            img2 = np.array(img2).astype(float)

            ssim_result = ssim(img1, img2, multichannel=True)

            ssim_result_percentage = (1+ssim_result)/2

            if ssim_result_percentage > 0.80 and df['m2'].iloc[index1] == df['m2'].iloc[index2] \
                    and df['Price'].iloc[index1] == df['Price'].iloc[index2]:
                df.drop(df.iloc[index2], inplace=True).reindex()


image_duplicate(full_df)

What would be a good solution to this issue?

EDIT: Sample: [image of the sample DataFrame]

Expected output: Remove One Bedroom row [2] from the DataFrame.

  • Whenever possible you should not loop over the dataframe; you can likely use the apply method or vector operations. Here you are changing the dataframe while looping through it, which can't work. Please provide an example of your data and the expected output. Commented Jul 27, 2021 at 14:06
  • Is there a way to utilise two rows, or all rows of the DataFrame, while comparing to one? I constructed a sample, please let me know if that is enough @mozway Commented Jul 27, 2021 at 14:16
  • Can't you just create a new empty DataFrame and copy the rows you would not delete into it? If a row is not a duplicate you copy it to the new DataFrame; if you find a duplicate, you just move on to the next row. Commented Jul 27, 2021 at 14:17
  • So here you want to compare the images for all combinations of rows? Then I suggest you first create a function that takes two images (or filenames) and returns True/False depending on whether they are similar enough: similar(img1, img2) -> True/False. Then you can apply it more easily. Commented Jul 27, 2021 at 14:24
  • @CaptainCsaba The issue with that is that I will get multiple entries, as the code iterates over the whole df multiple times. Commented Jul 27, 2021 at 14:42

1 Answer

From your question it seems (correct me if I'm wrong) that you need to iterate over all pairs of indexes (their Cartesian product) and drop the second index of each duplicate pair (index2 in your example) from the original dataframe.

I would recommend something like this to solve your issue:

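import itertools

import numpy as np
import requests
from PIL import Image
from skimage.metrics import structural_similarity as ssim  # assumed source of `ssim`


def image_duplicate(df):
    # Detecting duplicates based on the publications' images, m2 and price.
    indexes_to_drop = []
    for index1, index2 in itertools.product(df.index, df.index):
        # Skip self-comparisons and rows already marked as duplicates, so the
        # first row of each duplicate group is kept instead of both being dropped.
        if index1 == index2 or index1 in indexes_to_drop:
            continue
        print("index1: {} \t index2: {}".format(index1, index2))

        # df.index yields labels, so use label-based .loc rather than positional .iloc.
        img1 = Image.open(requests.get(df["img_url"].loc[index1], stream=True).raw).resize((213, 160))
        img2 = Image.open(requests.get(df["img_url"].loc[index2], stream=True).raw).resize((213, 160))
        img1 = np.array(img1).astype(float)
        img2 = np.array(img2).astype(float)

        # Newer scikit-image releases replace multichannel=True with channel_axis=-1
        # and may require data_range for float images.
        ssim_result = ssim(img1, img2, multichannel=True)
        ssim_result_percentage = (1 + ssim_result) / 2

        if (
            ssim_result_percentage > 0.80
            and df["m2"].loc[index1] == df["m2"].loc[index2]
            and df["Price"].loc[index1] == df["Price"].loc[index2]
        ):
            indexes_to_drop.append(index2)

    # Remove repeated labels, then drop all marked rows in one call.
    indexes_to_drop = list(set(indexes_to_drop))
    return df.drop(indexes_to_drop)


output_df = image_duplicate(full_df)  # `output_df` should contain the expected output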

Explanation:

  1. Iterate through the index pairs (I prefer to use itertools in such cases, but feel free to keep your nested for loops).
  2. Create an indexes_to_drop list and, instead of dropping rows inside the loop, append the indexes to that list.
  3. Get the unique indexes to drop (the same index may end up in the list several times). list(set(indexes_to_drop)) is a simple way to remove duplicates, because a set cannot contain duplicates.
  4. Drop those indexes all at once after the loop. (I'm not sure why you used .reindex in your example; drop(..., inplace=True) returns None, so chaining .reindex() onto it would fail anyway.)
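The steps above can be tried without downloading any images. The following is only a minimal sketch: the toy values and the non-default index labels are made up for illustration, and the image comparison is left out so that only m2 and Price are checked. The `index1 in indexes_to_drop` check is there so that the first row of each duplicate group survives.

```python
import itertools

import pandas as pd

# Toy data standing in for the real listings (values are made up).
df = pd.DataFrame(
    {"m2": [50, 50, 70, 50], "Price": [1000, 1000, 1500, 1000]},
    index=[10, 11, 12, 13],  # deliberately not 0..n-1
)

indexes_to_drop = []
for index1, index2 in itertools.product(df.index, df.index):
    # Skip self-comparisons and rows that are already marked as duplicates.
    if index1 == index2 or index1 in indexes_to_drop:
        continue
    same_m2 = df.loc[index1, "m2"] == df.loc[index2, "m2"]
    same_price = df.loc[index1, "Price"] == df.loc[index2, "Price"]
    if same_m2 and same_price:
        indexes_to_drop.append(index2)  # only record the label, do not drop yet

# Deduplicate the labels and drop them in one call after the loop.
unique_to_drop = sorted(set(indexes_to_drop))
deduplicated = df.drop(unique_to_drop)
print(deduplicated)
```

Because df is never modified while the loop runs, every label produced by df.index stays valid for the whole iteration.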

There may be other ways to improve your code. For example, you could skip the image comparison when index2 is already in the indexes_to_drop list (check if index2 in indexes_to_drop and continue if True), or you could even turn this into a function to be used with apply (the iteration over index2 would then happen inside apply), but neither is necessary.
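Here is one way those improvements might look; treat it as a sketch under assumptions rather than a drop-in replacement. `find_duplicate_labels`, `load_image` and `similar` are hypothetical names: the caller supplies the image loader and the similarity test (for instance the SSIM check above). Each unordered pair is visited once with itertools.combinations, rows already marked are skipped, the cheap m2 and Price checks run before the expensive image comparison, and functools.lru_cache makes sure each URL is loaded only once.

```python
import itertools
from functools import lru_cache

import pandas as pd


def find_duplicate_labels(df, load_image, similar):
    """Return the index labels of rows that duplicate an earlier row.

    `load_image(url)` and `similar(img_a, img_b)` are hypothetical callables
    supplied by the caller; they are not part of pandas.
    """
    cached_load = lru_cache(maxsize=None)(load_image)  # load each URL once
    to_drop = set()
    for index1, index2 in itertools.combinations(df.index, 2):
        if index1 in to_drop or index2 in to_drop:
            continue  # already decided, skip the expensive comparison
        row1, row2 = df.loc[index1], df.loc[index2]
        if row1["m2"] != row2["m2"] or row1["Price"] != row2["Price"]:
            continue  # cheap checks first
        if similar(cached_load(row1["img_url"]), cached_load(row2["img_url"])):
            to_drop.add(index2)
    return sorted(to_drop)


# Usage with stand-ins: the "image" is just the URL string, and two images
# count as similar when the strings are equal.
listings = pd.DataFrame(
    {
        "img_url": ["a.jpg", "a.jpg", "b.jpg"],
        "m2": [50, 50, 70],
        "Price": [1000, 1000, 1500],
    }
)
loads = []


def fake_load(url):
    loads.append(url)  # record every real (non-cached) load
    return url


labels = find_duplicate_labels(listings, fake_load, lambda a, b: a == b)
cleaned = listings.drop(labels)
print(cleaned)
```

Passing the loader and the comparison in as arguments keeps the pairing logic testable without network access.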

1 Comment

Hey, I was on holiday with no Internet access, hence the late reply. I was actually able to resolve the issue with try/except and by caching the images, as the column was getting too big. Your way is actually much cleaner!
