If you:
- do know the canonical names
- do not know the mapping "variation: canonical" in advance
- have implemented an "is_similar()" function that takes two strings and computes their similarity
You can cross join your table of names with the canonical names and use your "is_similar()" function to score every pair of strings. Then you replace the names in your first table with the remapped names and compute the aggregation.
Here's an example that gives you an idea of the SQL operations involved:
-- IS_SIMILAR, threshold and <canonical_names_list> are placeholders for your own
-- similarity function, cutoff value and list of canonical names
WITH remapping AS (
    SELECT all_names.string AS original_string,
           COALESCE(MAX(CASE WHEN IS_SIMILAR(all_names.string, canonical_names.string) > threshold
                             THEN canonical_names.string
                        END),
                    all_names.string) AS new_string
    FROM (SELECT DISTINCT string FROM tb1) AS all_names
    CROSS JOIN (SELECT string
                FROM tb1
                WHERE string IN <canonical_names_list>) AS canonical_names
    GROUP BY all_names.string
)
SELECT remapping.new_string,
       SUM(tb1.val) AS total
FROM tb1
LEFT JOIN remapping
       ON tb1.string = remapping.original_string
GROUP BY remapping.new_string
There's a degree of complexity in actually defining the is_similar function, and since you would carry out this operation on all combinations of strings, it can easily become very slow with lots of data (and semantic similarity is next to impossible in SQL). You could instead significantly reduce the runtime by exploiting ready-to-use tools in Python, including:
- the cdifflib library, a faster implementation of the difflib library. It lets you compute the syntactic similarity between two strings through its main class CSequenceMatcher, and comes with several helper functions. You can also improve the results with a bit of preprocessing (lowercasing, stemming/lemmatizing, reordering the words of multi-word names, and so on).
from cdifflib import CSequenceMatcher
CSequenceMatcher(None, string1, string2).ratio()  # similarity score in [0, 1]
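For instance, here is a minimal sketch of that preprocessing idea wrapped into the is_similar() helper mentioned above (the normalize step and the 0.8 cutoff are just illustrative assumptions):
from cdifflib import CSequenceMatcher

def normalize(s):
    # illustrative preprocessing: lowercase and reorder the words of multi-word names
    return " ".join(sorted(s.lower().split()))

def is_similar(a, b, threshold=0.8):
    # CSequenceMatcher.ratio() returns a similarity score in [0, 1]
    return CSequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

is_similar("Apple Inc.", "inc APPLE")   # True: same words after normalization (barring the ".")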
- the rapidfuzz library, even faster than cdifflib, since it can vectorize and parallelize the comparison of one string against a whole array (bringing the number of explicit calls down from O(n²) to O(n); you can push further towards O(1) if you generate all the combinations in advance, at the expense of memory). Its similarity scores differ slightly from the previous library (it uses different scorers), and I wouldn't say they are necessarily more effective.
from rapidfuzz import process, fuzz
process.extract(string1, strings2, scorer=fuzz.WRatio)  # top matches of string1 within the strings2 list
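As a rough sketch of how this could drive the remapping step (the lists and the 90 cutoff are placeholders; WRatio scores run from 0 to 100):
from rapidfuzz import process, fuzz

canonical_names = ["Apple", "Microsoft"]                 # the known canonical names
all_names = ["apple inc.", "Microsoft Corp", "Tesla"]    # every distinct name in your table

remapping = {}
for name in all_names:
    best = process.extractOne(name, canonical_names,
                              scorer=fuzz.WRatio, score_cutoff=90)
    # extractOne returns (choice, score, index), or None if nothing reaches the cutoff
    remapping[name] = best[0] if best else name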
- the sentence_transformers library: you can pick one of the available pretrained transformer models, generate semantic embeddings for the names you have, and use the similarity functions that come with the library to compare the embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')
embs1 = model.encode(strings1)   # accepts a single string or a list of strings
embs2 = model.encode(strings2)
util.cos_sim(embs1, embs2)       # pairwise cosine similarity matrix; wrap it in float() for a single pair
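And a sketch of how the embeddings could drive the same remapping, reusing the all_names and canonical_names lists from the previous sketch (the 0.6 cosine threshold is an assumption to tune on your data):
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')
emb_all = model.encode(all_names)             # embeddings of every distinct name in your table
emb_canon = model.encode(canonical_names)     # embeddings of the canonical names

sims = util.cos_sim(emb_all, emb_canon)       # len(all_names) x len(canonical_names) matrix
remapping = {}
for i, name in enumerate(all_names):
    j = int(sims[i].argmax())                 # index of the closest canonical name
    remapping[name] = canonical_names[j] if float(sims[i][j]) > 0.6 else name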
These are simple examples, but you should really dig into the docs and get creative with the parameters and functions.
In general I wouldn't throw semantic similarity at this use case unless the names you have are well-formed ("AAPL" may not be semantically similar to "Apple", but would rather have a higher syntactic similarity). If you really want both, you can use a weighted average of the two and fine-tune the weights according to the results you see.
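A minimal sketch of that combination, assuming the cdifflib and sentence_transformers pieces above and a weight w that you tune by inspecting the results:
from cdifflib import CSequenceMatcher
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')

def combined_similarity(a, b, w=0.7):
    # w weights the syntactic score, 1 - w the semantic one; both values here are assumptions
    syntactic = CSequenceMatcher(None, a.lower(), b.lower()).ratio()    # in [0, 1]
    semantic = float(util.cos_sim(model.encode(a), model.encode(b)))    # cosine, roughly in [-1, 1]
    return w * syntactic + (1 - w) * max(semantic, 0.0)                 # clamp negatives to stay in [0, 1]
Encoding one pair at a time like this is slow; for real data you'd precompute the embeddings once, as in the previous sketch.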