1

How can I perform a GROUP BY in SQL when the group_name values are similar but not exactly the same?

In my dataset, the group_name values may differ slightly (e.g., "Apple Inc.", "AAPL", "Apple"), but conceptually they refer to the same entity. The similarity might not be obvious or consistent, so I might need to define a custom rule or function like is_similar() to cluster them.

For simple cases, I can extract a common pattern using regex or string functions (e.g., strip suffixes, lowercase, take prefixes). But how should I handle more complex scenarios, like fuzzy or semantic similarity?

Case:

group_name val
'Apple Inc.' 100
'AAPL' 50
'Apple' 30
'Microsoft' 80
'MSFT' 70

What I want to achieve:

new_group_name total_val
'Apple' 180
'Microsoft' 150

What are the best approaches to achieve this in SQL? And how would I write a query like this:

SELECT some_characteristic(group_name) AS new_group_name,
       SUM(val)
FROM tb1
GROUP BY new_group_name;

Is it possible in Python, for example using gensim?

4
  • is this relevant? stackoverflow.com/questions/6101404/sql-group-by-like Commented May 15 at 7:26
  • Why spamming python tag? Commented May 15 at 7:27
  • Is your question how to code is_similar() or how to use it assuming you've already coded it? The first part is too wishy-washy to be answered... Commented May 15 at 7:40
  • You must define what is similar for you and what not, otherwise it's impossible to answer your question. Commented May 15 at 13:55

3 Answers

2

Create a separate table (group_mapping) that explicitly maps each variation of group_name to its desired canonical name (canonical_name).

Populate this table with entries like:

variation canonical_name
'Apple Inc.' 'Apple'
'AAPL' 'Apple'
'Apple' 'Apple'
'Microsoft' 'Microsoft'
'MSFT' 'Microsoft'

JOIN your data table (tb1) with this mapping table on group_name = variation, and then GROUP BY canonical_name from the mapping table.

SELECT
    gm.canonical_name AS new_group_name,
    SUM(t.val) AS total_val
FROM tb1 t
JOIN group_mapping gm ON t.group_name = gm.variation

GROUP BY gm.canonical_name
ORDER BY new_group_name;
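To try the mapping-table approach end to end without a database server, here is a minimal sketch using Python's built-in sqlite3 module and the sample data from the question. The table and column names (tb1, group_mapping, variation, canonical_name) follow the query above; the in-memory database and the helper name total_by_canonical_name are assumptions made only for illustration.

```python
import sqlite3


def total_by_canonical_name():
    """Build the sample tables in memory and run the mapping-table query."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE tb1 (group_name TEXT, val INTEGER);
        CREATE TABLE group_mapping (variation TEXT PRIMARY KEY, canonical_name TEXT);
        """
    )
    # Sample data from the question.
    conn.executemany(
        "INSERT INTO tb1 VALUES (?, ?)",
        [("Apple Inc.", 100), ("AAPL", 50), ("Apple", 30),
         ("Microsoft", 80), ("MSFT", 70)],
    )
    # One row per known variation, pointing at its canonical name.
    conn.executemany(
        "INSERT INTO group_mapping VALUES (?, ?)",
        [("Apple Inc.", "Apple"), ("AAPL", "Apple"), ("Apple", "Apple"),
         ("Microsoft", "Microsoft"), ("MSFT", "Microsoft")],
    )
    rows = conn.execute(
        """
        SELECT gm.canonical_name AS new_group_name,
               SUM(t.val) AS total_val
        FROM tb1 t
        JOIN group_mapping gm ON t.group_name = gm.variation
        GROUP BY gm.canonical_name
        ORDER BY new_group_name
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(total_by_canonical_name())
```

Note that the inner JOIN silently drops any group_name that has no row in the mapping table, so new variations have to be added to the mapping before they are counted.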

1

If you:

  • do know the canonical names
  • do not know the mapping "variation: canonical" in advance
  • have implemented the "is_similar()" function, that takes two strings and computes the similarity

You can compute a cross join between your table of names and the canonical names, and use your "is_similar" function to compute the similarity of each pair. Then you replace the names from your first table with the remapped names and compute the aggregation.

Here's an example that gives you an idea of the SQL operations involved (threshold and <canonical_names_list> are placeholders you have to fill in):

WITH remapping AS (
    -- Pair every distinct name with every canonical name. Keep the canonical
    -- name when the similarity exceeds the threshold, otherwise keep the
    -- original name. If several canonical names match, MAX picks one of them.
    SELECT all_names.group_name AS original_name,
           COALESCE(MAX(CASE WHEN IS_SIMILAR(all_names.group_name, canonical_names.canonical_name) > threshold
                             THEN canonical_names.canonical_name
                        END),
                    all_names.group_name) AS new_name
    FROM       (SELECT DISTINCT group_name FROM tb1) AS all_names
    CROSS JOIN (SELECT DISTINCT group_name AS canonical_name
                FROM tb1
                WHERE group_name IN <canonical_names_list>) AS canonical_names
    GROUP BY all_names.group_name
)
SELECT remapping.new_name,
       SUM(tb1.val) AS total
FROM      tb1
LEFT JOIN remapping
       ON tb1.group_name = remapping.original_name
GROUP BY remapping.new_name

There's a degree of complexity in actually defining the is_similar function, and since you would carry out these operations on all combinations of strings, it can easily become very slow with lots of data (and semantic similarity is next to impossible in SQL). You could significantly reduce the running time by using ready-made tools in Python instead, including:

  • the cdifflib library, a faster implementation of the difflib library. It lets you compute the syntactic similarity between words through its main class, CSequenceMatcher (the counterpart of difflib's SequenceMatcher), and difflib comes with several helper functions. You can also improve the results with a bit of preprocessing (lowercasing, stemming/lemmatizing, changing the order of words in multi-word names, etc.).
from cdifflib import CSequenceMatcher

CSequenceMatcher(None, string1, string2).ratio()
  • the rapidfuzz library, even faster than cdifflib, as it can compare one string against a whole array in a single call. This replaces a Python-level loop with one optimized call per string, although the total number of comparisons is still quadratic; you can go further and precompute the scores of all combinations in advance, at the expense of memory. Its similarity scores differ slightly from those of the previous library because it uses different scorers; I wouldn't say they are necessarily more effective.
from rapidfuzz import process, fuzz

process.extract(string1, strings2, scorer=fuzz.WRatio)
  • the sentence_transformers library: you select one of the available pretrained models, generate semantic embeddings for the words you have, and compare the embeddings with the similarity function that comes with the library.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')

embs1 = model.encode(strings1)
embs2 = model.encode(strings2)

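# Pairwise cosine similarities: one row per entry of strings1,
# one column per entry of strings2.
similarities = util.cos_sim(embs1, embs2)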

These are simple examples, but you should really go in depth with the docs and get creative with the parameters and functions.
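To make the remapping idea concrete without installing anything, here is a small sketch that uses difflib from the standard library (the module cdifflib speeds up) as the similarity function. The helper names similarity, remap and aggregate, the 0.5 threshold and the list of canonical names are assumptions chosen for this sample data, not recommended settings.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive syntactic similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def remap(name, canonical_names, threshold=0.5):
    """Return the closest canonical name, or the name itself if none is close enough."""
    best = max(canonical_names, key=lambda c: similarity(name, c))
    return best if similarity(name, best) >= threshold else name


def aggregate(rows, canonical_names, threshold=0.5):
    """Sum the values per remapped name (the GROUP BY step)."""
    totals = defaultdict(int)
    for name, val in rows:
        totals[remap(name, canonical_names, threshold)] += val
    return dict(totals)


# Sample data from the question; the canonical names are assumed to be known.
rows = [("Apple Inc.", 100), ("AAPL", 50), ("Apple", 30),
        ("Microsoft", 80), ("MSFT", 70)]
totals = aggregate(rows, ["Apple", "Microsoft"])
print(totals)
```

On real data the threshold needs tuning: too low and unrelated names get merged, too low a score for abbreviations such as tickers and they stay unmerged.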

In general I wouldn't throw in semantic similarity for this use case unless the names you have are well-formed: "AAPL" may not be semantically similar to "Apple", whereas its syntactic similarity is likely to be higher. If you really want both, you can use a weighted average of the two and fine-tune the weights according to the results you see.
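The weighted average could look roughly like the sketch below. The syntactic part uses difflib from the standard library; the semantic part is passed in as a function, so you can plug in whatever embedding-based scorer you choose. The names combined_score and dummy_semantic_score and the default weight of 0.7 are illustrative assumptions, and the dummy scorer is only a stand-in so the example runs on its own.

```python
from difflib import SequenceMatcher


def syntactic_score(a, b):
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def combined_score(a, b, semantic_score, w_syntactic=0.7):
    """Weighted average of a syntactic and a semantic similarity score.

    semantic_score is any callable (a, b) -> float in [0, 1].
    """
    if not 0.0 <= w_syntactic <= 1.0:
        raise ValueError("w_syntactic must be between 0 and 1")
    return (w_syntactic * syntactic_score(a, b)
            + (1.0 - w_syntactic) * semantic_score(a, b))


def dummy_semantic_score(a, b):
    """Placeholder only: replace with a real embedding-based similarity."""
    return 1.0 if a.lower() == b.lower() else 0.0


if __name__ == "__main__":
    print(combined_score("AAPL", "Apple", dummy_semantic_score))
```

Keeping the semantic scorer as a parameter makes it easy to compare different weights, or to switch the semantic part off entirely with w_syntactic=1.0.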


0

I don't know your RDBMS, but Oracle has the SOUNDEX() function and the UTL_MATCH package functions EDIT_DISTANCE() and EDIT_DISTANCE_SIMILARITY(), which could help.
Both could also be done in Python using fuzzy.Soundex() and levenshteinDistance().
Below is one way it could be solved with Oracle:

Create Table group_names (group_name VARCHAR2(32));
--      S a m p l e    D a t a :
Insert Into group_names
Select 'Apple Inc.' From Dual Union All 
Select 'AAPL' From Dual Union All 
Select 'Apple' From Dual Union All 
Select 'Microsoft' From Dual Union All 
Select 'MSFT' From Dual;
--      S Q L : 
WITH
  grps AS
    ( Select     g0.group_name as G0_GRP_NAME, 
                 g1.group_name as G1_GRP_NAME,
                 SOUNDEX(g0.group_name) as G0_SOUNDEX,
                 SOUNDEX(g1.group_name) as G1_SOUNDEX,
                 UTL_MATCH.EDIT_DISTANCE(Upper(g0.group_name), Upper(g1.group_name)) as DISTANCE, 
                 UTL_MATCH.EDIT_DISTANCE_SIMILARITY(Upper(g0.group_name), Upper(g1.group_name)) as DISTANCE_PCT
      From       group_names g0
      Inner Join group_names g1 ON( Upper(g0.group_name) != Upper(g1.group_name) )
      Where      SubStr(SOUNDEX(g0.group_name), 1, 2) = SubStr(SOUNDEX(g1.group_name),1, 2)
      Order By   SOUNDEX(g0.group_name), UTL_MATCH.EDIT_DISTANCE(Upper(g0.group_name), Upper(g1.group_name))
   )
Select      Max(G0_GRP_NAME) as group_name
From        grps
Group By    SubStr(G0_SOUNDEX, 1, 2)

R e s u l t :

GROUP_NAME
Apple Inc.
Microsoft

The result above is derived from the grps CTE, which returns:

G0_GRP_NAME  G1_GRP_NAME  G0_SOUNDEX  G1_SOUNDEX  DISTANCE  DISTANCE_PCT
Apple        AAPL         A140        A140        2         60
AAPL         Apple        A140        A140        2         60
Apple        Apple Inc.   A140        A145        5         50
AAPL         Apple Inc.   A140        A145        7         30
Apple Inc.   Apple        A145        A140        5         50
Apple Inc.   AAPL         A145        A140        7         30
MSFT         Microsoft    M213        M262        5         45
Microsoft    MSFT         M262        M213        5         45

NOTE:
Not all columns are needed - I put them here as a sample ... You should find your own way to do the extraction that will fit your actual context.
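If you want to experiment with the same two building blocks outside Oracle, here is a plain-Python sketch of classic American Soundex and of Levenshtein edit distance. These are textbook versions written for illustration (the function names soundex and levenshtein are my own), not Oracle's implementations, so results may differ on edge cases; for the sample names they give the same codes and distances as the table above.

```python
def soundex(name):
    """Classic American Soundex: first letter plus three digits."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    # Keep ASCII letters only, so "Apple Inc." is treated as "APPLEINC".
    letters = [c for c in name.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code is kept after them
    return (result + "000")[:4]


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    for name in ("Apple Inc.", "AAPL", "Apple", "Microsoft", "MSFT"):
        print(name, soundex(name))
    print(levenshtein("APPLE", "AAPL"))
```

As in the Oracle query, comparing only the first two characters of the Soundex code is what puts "MSFT" (M213) and "Microsoft" (M262) in the same group.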

