If you:
- do know the canonical names
- do not know the mapping "variation: canonical" in advance
- have implemented an "is_similar()" function that takes two strings and computes their similarity
You can cross join your table of names with the canonical names and use your "is_similar()" function to score every pair of strings. Then you replace the names in your first table with the remapped names and compute the aggregation.
Here's an example that gives you an idea of the SQL operations involved:
-- IS_SIMILAR, threshold and <canonical_names_list> are placeholders for your own
-- similarity function, cutoff value and list of canonical names
WITH remapping AS (
    SELECT all_names.string AS original_string,
           COALESCE(MAX(CASE WHEN IS_SIMILAR(all_names.string, canonical_names.string) > threshold
                             THEN canonical_names.string
                        END),
                    all_names.string) AS new_string
    FROM (SELECT DISTINCT string FROM tb1) AS all_names
    CROSS JOIN (SELECT string
                FROM tb1
                WHERE string IN <canonical_names_list>) AS canonical_names
    GROUP BY all_names.string
)
SELECT remapping.new_string,
       SUM(tb1.val) AS total
FROM tb1
LEFT JOIN remapping
       ON tb1.string = remapping.original_string
GROUP BY remapping.new_string
There's a degree of complexity in actually defining the is_similar function, and since you would carry out this operation on all combinations of strings, it can easily become very slow with lots of data (and semantic similarity is next to impossible in SQL). You could instead significantly reduce the runtime by exploiting ready-to-use tools in Python, including:
- the cdifflib library, a faster implementation of the difflib library. It lets you compute the syntactic similarity between two strings through its main class CSequenceMatcher, and comes with several helper functions. You can also improve the results with a bit of preprocessing (lowercasing, stemming/lemmatizing, reordering the words of multi-word names, and so on).
from cdifflib import CSequenceMatcher
CSequenceMatcher(None, string1, string2).ratio()  # similarity score in [0, 1]
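For instance, here is a minimal sketch of that preprocessing idea wrapped into the is_similar() helper mentioned above (the normalize step and the 0.8 cutoff are just illustrative assumptions):
from cdifflib import CSequenceMatcher

def normalize(s):
    # illustrative preprocessing: lowercase and reorder the words of multi-word names
    return " ".join(sorted(s.lower().split()))

def is_similar(a, b, threshold=0.8):
    # CSequenceMatcher.ratio() returns a similarity score in [0, 1]
    return CSequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

is_similar("Apple Inc.", "inc APPLE")   # True: same words after normalization (barring the ".")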
- the rapidfuzz library, even faster than cdifflib, since it can vectorize and parallelize the comparison of one string against a whole array (bringing the number of explicit calls down from O(n²) to O(n); you can push further towards O(1) if you generate all the combinations in advance, at the expense of memory). Its similarity scores differ slightly from the previous library (it uses different scorers), and I wouldn't say they are necessarily more effective.
from rapidfuzz import process, fuzz
process.extract(string1, strings2, scorer=fuzz.WRatio)  # top matches of string1 within the strings2 list
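As a rough sketch of how this could drive the remapping step (the lists and the 90 cutoff are placeholders; WRatio scores run from 0 to 100):
from rapidfuzz import process, fuzz

canonical_names = ["Apple", "Microsoft"]                 # the known canonical names
all_names = ["apple inc.", "Microsoft Corp", "Tesla"]    # every distinct name in your table

remapping = {}
for name in all_names:
    best = process.extractOne(name, canonical_names,
                              scorer=fuzz.WRatio, score_cutoff=90)
    # extractOne returns (choice, score, index), or None if nothing reaches the cutoff
    remapping[name] = best[0] if best else name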
- the sentence_transformers library: you can pick one of the available pretrained transformer models, generate semantic embeddings for the names you have, and use the similarity functions that come with the library to compare the embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')
embs1 = model.encode(strings1)   # accepts a single string or a list of strings
embs2 = model.encode(strings2)
util.cos_sim(embs1, embs2)       # pairwise cosine similarity matrix; wrap it in float() for a single pair
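And a sketch of how the embeddings could drive the same remapping, reusing the all_names and canonical_names lists from the previous sketch (the 0.6 cosine threshold is an assumption to tune on your data):
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')
emb_all = model.encode(all_names)             # embeddings of every distinct name in your table
emb_canon = model.encode(canonical_names)     # embeddings of the canonical names

sims = util.cos_sim(emb_all, emb_canon)       # len(all_names) x len(canonical_names) matrix
remapping = {}
for i, name in enumerate(all_names):
    j = int(sims[i].argmax())                 # index of the closest canonical name
    remapping[name] = canonical_names[j] if float(sims[i][j]) > 0.6 else name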
These are simple examples, but you should really dig into the docs and get creative with the parameters and functions.
In general I wouldn't throw semantic similarity at this use case unless the names you have are well-formed ("AAPL" may not be semantically similar to "Apple", but would rather have a higher syntactic similarity). If you really want both, you can use a weighted average of the two and fine-tune the weights according to the results you see.
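A minimal sketch of that combination, assuming the cdifflib and sentence_transformers pieces above and a weight w that you tune by inspecting the results:
from cdifflib import CSequenceMatcher
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')

def combined_similarity(a, b, w=0.7):
    # w weights the syntactic score, 1 - w the semantic one; both values here are assumptions
    syntactic = CSequenceMatcher(None, a.lower(), b.lower()).ratio()    # in [0, 1]
    semantic = float(util.cos_sim(model.encode(a), model.encode(b)))    # cosine, roughly in [-1, 1]
    return w * syntactic + (1 - w) * max(semantic, 0.0)                 # clamp negatives to stay in [0, 1]
Encoding one pair at a time like this is slow; for real data you'd precompute the embeddings once, as in the previous sketch.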