
I am currently trying to cluster a list of sequences based on their similarity using Python.

ex:

DFKLKSLFD
DLFKFKDLD
LDPELDKSL
...

The way I preprocess my data is by computing the pairwise distances, using for example the Levenshtein distance. After calculating all the pairwise distances and building the distance matrix, I want to use it as input for the clustering algorithm.
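A minimal sketch of that preprocessing step, assuming the python-Levenshtein package (the names seqs and dist_matrix are placeholders of my own):

import numpy as np
from Levenshtein import distance

seqs = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]
n = len(seqs)
dist_matrix = np.zeros((n, n), dtype=int)
for i in range(n):
    for j in range(i + 1, n):
        # Levenshtein distance is symmetric, so fill both halves at once
        dist_matrix[i, j] = dist_matrix[j, i] = distance(seqs[i], seqs[j])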

I have already tried Affinity Propagation, but its convergence is a bit unpredictable and I would like to work around this problem.

Does anyone have any suggestions regarding other suitable clustering algorithms for this case?

Thank you!!

  • There's a whole list of algorithms at scikit-learn.org/stable/modules/clustering.html, but without stating your aim or intended outcome, I think it's really hard to provide a concrete answer here.
  • These are some of the targets/conditions that I have: I do not know the number of clusters that I want; I want to get rid of outliers; and I want to cluster the sequences subject to a maximum distance (i.e. the distance between any pair within a cluster cannot be greater than some x).
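For constraints like those (unknown cluster count, a hard cap on within-cluster distance), one option not raised in this thread is complete-linkage agglomerative clustering with a distance threshold: with complete linkage, no pair inside a finished cluster can end up farther apart than the cutoff. A hedged sketch, reusing the dist_matrix from the question above:

from sklearn.cluster import AgglomerativeClustering

# n_clusters=None lets distance_threshold decide how many clusters form;
# linkage="complete" is what enforces the maximum pairwise distance cap.
# (Older scikit-learn versions spell the metric keyword "affinity".)
clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=5,   # hypothetical cutoff x
    metric="precomputed",
    linkage="complete",
).fit(dist_matrix)
print(clustering.labels_)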

3 Answers


sklearn actually shows an example of this using DBSCAN, just as Luke once answered here.

This is based on that example, using !pip install python-Levenshtein. If you have pre-calculated all the distances, you can instead change the custom metric, as shown further below.

from Levenshtein import distance

import numpy as np
from sklearn.cluster import dbscan

data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]

def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])     # extract indices
    return distance(data[i], data[j])

X = np.arange(len(data)).reshape(-1, 1)

core_samples, labels = dbscan(X, metric=lev_metric, eps=5, min_samples=2)

And if you pre-calculated the distances, you could define pre_lev_metric(x, y) along the lines of

def pre_lev_metric(x, y):
    i, j = int(x[0]), int(y[0])     # extract indices
    return DISTANCES[i, j]          # look up the precomputed distance matrix
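Alternatively, the DBSCAN estimator accepts the matrix directly via metric="precomputed", which avoids the index trick entirely; a short sketch, assuming dist_matrix holds the precomputed Levenshtein distances:

from sklearn.cluster import DBSCAN

# label -1 marks points DBSCAN considers noise/outliers
labels = DBSCAN(eps=5, min_samples=2, metric="precomputed").fit_predict(dist_matrix)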

An alternative answer, based on K-Medoids using sklearn_extra.cluster.KMedoids. K-Medoids is not yet that well known, but it also only needs distances.

I had to install it like this:

!pip uninstall -y enum34
!pip install scikit-learn-extra

Then I was able to create clusters with:

from sklearn_extra.cluster import KMedoids
import numpy as np
from Levenshtein import distance

data = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]

def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])     # extract indices
    return distance(data[i], data[j])

X = np.arange(len(data)).reshape(-1, 1)
kmedoids = KMedoids(n_clusters=2, random_state=0, metric=lev_metric).fit(X)

The labels/centers are in

kmedoids.labels_
kmedoids.cluster_centers_
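KMedoids also accepts a precomputed matrix, which saves recomputing Levenshtein distances inside the metric callback. A sketch, assuming dist_matrix from earlier; note that with a precomputed metric the medoids are exposed via medoid_indices_ rather than cluster_centers_:

kmedoids = KMedoids(n_clusters=2, metric="precomputed", random_state=0).fit(dist_matrix)
print(kmedoids.labels_, kmedoids.medoid_indices_)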

2 Comments

Thank you a lot for your answer. I managed to implement DBSCAN and the results so far are quite promising. I do have one doubt about the algorithm: given that it selects a random point to begin the clustering, and taking into account that I am passing it the pairwise distances as the basis for the clustering, how much can the clustering final outcome be influenced by the first random point that it selects to start the process?
Indeed, it can vary slightly. If that is a problem, search for deterministic variations of DBSCAN, but in practice it is usually not an issue. Clustering is usually done to get some summary of the data; one edge point belonging to cluster A or B doesn't impact the project too much.

Try this.

import numpy as np
from sklearn.cluster import AffinityPropagation
import distance    # pip install distance
    
words = 'XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL'.split(',') #Replace this line
words = np.asarray(words) #So that indexing with a list will work
# AffinityPropagation expects similarities, so negate the Levenshtein distances
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])

affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_==cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))

Results:

 - *LDPELDKSL:* LDPELDKSL
 - *DFKLKSLFD:* DFKLKSLFD
 - *XYZ:* ABC, XYZ
 - *DLFKFKDLD:* DLFKFKDLD
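Since the question mentions that Affinity Propagation's convergence can be unpredictable, note that the estimator exposes a couple of knobs that sometimes help; the values below are assumptions to tune on your data, not part of the original answer:

# Heavier damping and a larger iteration budget can stabilize convergence
affprop = AffinityPropagation(affinity="precomputed", damping=0.9, max_iter=1000)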

1 Comment

What about sentences? If I have a list of sentences, will it be possible to use Levenshtein similarity, or should I use some other technique?
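The snippet below presupposes a fitted KMeans model and a words array mapping feature indices to tokens, neither of which is defined in this answer. One hypothetical setup that would make it runnable (the TF-IDF character features are my assumption, not part of the original):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["DFKLKSLFD", "DLFKFKDLD", "LDPELDKSL"]   # placeholder data
vectorizer = TfidfVectorizer(analyzer="char")    # character-level features
X = vectorizer.fit_transform(docs)
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
words = vectorizer.get_feature_names_out()       # feature index -> token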
common_words = kmeans.cluster_centers_.argsort()[:,-1:-11:-1]   # top-10 features per centroid
for num, centroid in enumerate(common_words):
    print(str(num) + ' : ' + ', '.join(words[word] for word in centroid))

