I have a set of binarized images of forms. Each image follows one of N layouts, apart from a few outliers that follow no layout and contain arbitrary text and images.

The distance between two images can be calculated from the number of intersecting black pixels: high overlap means the images are more likely to depict the same form.
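To make the measure concrete, here is a minimal sketch, assuming each image is a boolean NumPy array of identical shape with `True` marking a black pixel. The raw overlap count is a similarity rather than a distance, so the sketch also shows one possible normalization (Jaccard distance). The function names `overlap` and `jaccard_distance` are made up for illustration.

```python
import numpy as np

def overlap(a, b):
    """Count the pixels that are black (True) in both images."""
    return int(np.count_nonzero(a & b))

def jaccard_distance(a, b):
    """1 - |a AND b| / |a OR b|: 0 for identical images, 1 for disjoint ones."""
    union = np.count_nonzero(a | b)
    if union == 0:
        return 0.0  # two blank images are treated as identical
    return 1.0 - overlap(a, b) / union

# Two toy 4x4 images sharing 2 of their 4 black pixels each.
a = np.zeros((4, 4), dtype=bool)
a[0, :] = True
b = np.zeros((4, 4), dtype=bool)
b[0, :2] = True
b[1, :2] = True

shared = overlap(a, b)
dist = jaccard_distance(a, b)
```

Dividing by the union keeps forms with a lot of ink from looking similar to everything, which the raw count alone would not prevent.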

Are there any algorithms that can cluster the images without computing all pairwise distances, i.e. iteratively or online? I would like to cluster the images by the forms used in each image. Outliers should be detected and not end up within any cluster.

Ideally in Python, using scipy.

  • The first thing that comes to mind is k-means. It compares each element to the cluster centroids only, not to every other element, so if the number of clusters is much smaller than the number of elements it can be a lot faster. SciPy technically has a k-means implementation, but I would really steer you toward sklearn's KMeans implementation. Commented 2 days ago
  • The other idea I would suggest is perceptual hashes. If you can reduce the data in each image to 64 bits, then comparing every image against every other image is not so bad. Commented 2 days ago
  • Thank you @NickODell, I will give it a try. Perceptual hashes sound interesting! Commented 2 days ago
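The perceptual-hash idea from the comments could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested solution for real forms: it uses a simple block-average hash (one of several perceptual hash variants), compares each image only to the first member ("leader") of each cluster found so far, and treats clusters smaller than `min_size` as outliers. The names `average_hash`, `hamming` and `leader_cluster`, and the thresholds `max_dist=10` and `min_size=2`, are hypothetical and would need tuning.

```python
import numpy as np

def average_hash(img, size=8):
    """Reduce a 2D boolean image to a size*size-bit integer hash."""
    h, w = img.shape
    # Crop so both dimensions divide evenly into size x size blocks.
    img = img[: h - h % size, : w - w % size].astype(float)
    blocks = img.reshape(size, img.shape[0] // size, size, img.shape[1] // size)
    small = blocks.mean(axis=(1, 3))           # fraction of black per block
    bits = (small > small.mean()).ravel()      # threshold at the mean
    return sum(1 << i for i, bit in enumerate(bits) if bit)

def hamming(x, y):
    """Number of differing bits between two integer hashes."""
    return bin(x ^ y).count("1")

def leader_cluster(hashes, max_dist=10, min_size=2):
    """One pass over the data; each hash is compared to cluster leaders only."""
    leaders, labels = [], []
    for h in hashes:
        for k, leader in enumerate(leaders):
            if hamming(h, leader) <= max_dist:
                labels.append(k)
                break
        else:  # no leader was close enough: start a new cluster
            leaders.append(h)
            labels.append(len(leaders) - 1)
    counts = np.bincount(labels)
    # Clusters that stay tiny are reported as outliers (-1).
    return [k if counts[k] >= min_size else -1 for k in labels]

# Toy data: two "layouts", one noisy copy of each, and one unrelated image.
form_a = np.zeros((16, 16), dtype=bool)
form_a[:8, :] = True                          # top half black
form_b = np.zeros((16, 16), dtype=bool)
form_b[:, :8] = True                          # left half black
noisy_a = form_a.copy()
noisy_a[15, 15] = True
noisy_b = form_b.copy()
noisy_b[0, 15] = True
rows, cols = np.indices((16, 16))
other = ((rows // 2 + cols // 2) % 2) == 0    # checkerboard of 2x2 blocks

images = [form_a, form_b, noisy_a, noisy_b, other]
labels = leader_cluster([average_hash(im) for im in images])
```

The result depends on the order in which images arrive, and real forms that differ only in filled-in fields may need a larger hash or a different threshold. For the k-means suggestion, scikit-learn's `MiniBatchKMeans` supports incremental fitting via `partial_fit`, but it assigns every sample to a cluster, so outliers would still need a separate distance cutoff.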
