SQL count the number of comma-separated string match in another string [closed]

Question


I have a table in SQL Server with two columns, as shown here. All numbers are comma-separated:

IF OBJECT_ID('tempdb..#TempTable') IS NOT NULL
     DROP TABLE #TempTable

CREATE TABLE #TempTable 
(
     [Sample] varchar(33), 
     Benchmark varchar(33)
)

INSERT INTO #TempTable VALUES ('9,55,66', '55,44');
INSERT INTO #TempTable VALUES ('88,23,2,3,4', '23,88');

I need to match the data in the Sample column against the Benchmark column, then calculate the ratio of missing numbers.

Let me explain, using the first row:

Column C: the number 55 exists in both Sample and Benchmark
Column D: the number of matches, here 1 (only 55)
Column E: the total count of numbers in Benchmark, here 2
Column F: calculated as (2 - 1) / 2 = 50% of the benchmark numbers are missing

I don't need columns C, D and E stored in the table; they are only there for explanation. What I need is column F.

I am using SQL Server 2019+.

As per the question guide, please do not post images of code, data, error messages, etc. Copy or type the text into the question. Please reserve the use of images for diagrams or demonstrating rendering bugs, things that are impossible to describe accurately via text.
– Dale K, Commented Nov 30, 2023 at 1:11
Attach some sample data that row constructors can be built from, and this might be a fun question to work on.
– Xedni, Commented Nov 30, 2023 at 1:37
The solution would be to normalize your database. Anything other than that is just a workaround to handle a poor database design. Read "Is storing a delimited list in a database column really that bad?", where you will see a lot of reasons why the answer to this question is "Absolutely yes!"
– Zohar Peled, Commented Nov 30, 2023 at 8:00
I agree with @ZoharPeled 10000%. If you fix your design by normalizing it, then your query will be simple.
– Sean Lange, Commented Nov 30, 2023 at 14:18

Dale K · Accepted Answer · 2023-11-30 04:19:45Z

First things first: you're going to need a way to uniquely identify each of your rows. I could assume that Sample uniquely identifies the rows, but I'm guessing there might be something like an identity column on the table. If there isn't one, make one using row_number(). I simply added an identity called RID to your temp table.

The reason you'll need this is that you're going to split the samples and the benchmarks into two CTEs, call string_split on each of them, and then join them back together on the split value and the row identifier (RID in my example, but again, if Sample is a valid key, that works too).
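The answer does not show how the identity column was added. A minimal sketch, assuming the question's sample data is reloaded into a temp table named #tmp (the name the queries in this answer use) with an identity column RID; the exact definition is an assumption, not the author's script:

```sql
-- Hypothetical setup: the question's two sample rows in #tmp, plus an identity column.
IF OBJECT_ID('tempdb..#tmp') IS NOT NULL
    DROP TABLE #tmp;

CREATE TABLE #tmp
(
    RID       int IDENTITY(1, 1) NOT NULL,  -- row identifier used for the join
    [Sample]  varchar(33),
    Benchmark varchar(33)
);

-- RID is generated automatically, so only the two list columns are supplied.
INSERT INTO #tmp ([Sample], Benchmark)
VALUES ('9,55,66', '55,44'),
       ('88,23,2,3,4', '23,88');
```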

The samples CTE is going to look like this:

select RID, Value = trim(b.value)
from #tmp a
cross apply string_split(a.sample, ',') b

and the benchmarks CTE is going to look like this:

select RID, Value = trim(b.value)
from #tmp a
cross apply string_split(a.benchmark, ',') b

I'm showing you these separately so you can see what each CTE contains. I'm also trimming the value in case there is whitespace in there. You could go a step further and cast them to int if you want/need, but I'll leave that as an exercise for the reader.
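As a sketch of that extra step, assuming every list item is meant to be an integer: TRY_CAST converts each trimmed value and returns NULL, rather than raising an error, when a value such as 'abc' cannot be converted. A NULL never satisfies the equality in the later join, so such an item simply counts as unmatched. Comparing as int also lets '055' match '55', which a string comparison would not.

```sql
-- Samples CTE body with values converted to int
-- (sketch; assumes the #tmp table with RID used in this answer).
select a.RID, Value = try_cast(trim(b.value) as int)
from #tmp a
cross apply string_split(a.sample, ',') b;
```

The benchmarks CTE would need the same conversion so that both sides of the join have the same type.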

Finally, left join everything back together on RID and Value, grouping by RID (i.e. the original row you were testing). The Benchmark count is just the total count of rows (or you could use bm.value; it's the same thing since benchmarks is your left table here) and the match count is the count of s.value. This works, because if there isn't a match, you'll get a null, and count(s.value) will skip any s.value where it's null.

;with samples as
(
    select RID, Value = trim(b.value)
    from #tmp a
    cross apply string_split(a.sample, ',') b
), benchmarks as
(
    select RID, Value = trim(b.value)
    from #tmp a
    cross apply string_split(a.benchmark, ',') b
)
select
    bm.RID,
    BenchmarkCount = count(1),
    MatchCount = count(s.value),
    MissingRatio = 100 - convert(decimal(9,3), (100.0 * count(s.Value)) / nullif(count(1), 0))
from benchmarks bm
left outer join samples s
    on bm.RID = s.RID
        and bm.Value = s.Value
group by bm.RID

From there, you can format your results however you like.

Squirrel · Accepted Answer · 2023-11-30 04:18:54Z (edited by Dale K)


Use string_split() to split the sample and benchmark CSV values into rows. Then left join benchmark to sample.

with 
sample as
(
  select id, t.sample, t.benchmark, s.value
  from   your_table t
         cross apply string_split(t.sample, ',') s
),
benchmark as
(
  select id, t.sample, t.benchmark, s.value
  from   your_table t
         cross apply string_split(t.benchmark, ',') s
)
select b.sample, b.benchmark,
       [ratio] = (count(*) - count(s.value)) * 100.0 / count(*)
from   benchmark b
       left join sample s   on  b.id    = s.id
                            and b.value = s.value
group by b.sample, b.benchmark

Demo

edited Nov 30, 2023 at 4:18

Dale K


answered Nov 30, 2023 at 1:52

Squirrel


4 Comments

PyBoss Over a year ago

Your code doesn't work. I attached a sample table: if OBJECT_ID('tempdb..#TempTable') is not null DROP TABLE #TempTable create table #TempTable ([Sample] varchar(33), Benchmark varchar(33)) insert into #TempTable values ('9,55,66','55,44'); insert into #TempTable values ('88,23,2,3,4','23,88');

Squirrel Over a year ago

What do you mean by doesn't work?

PyBoss Over a year ago

It works now. One question: how does this CROSS APPLY generate s.value? Is it a special window function?

Squirrel Over a year ago

That is the result of string_split(). Refer to the documentation for further detail.
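To make that concrete, here is a small sketch against the question's #TempTable. STRING_SPLIT is a table-valued function (not a window function) that returns one row per list item in a single column named value, and CROSS APPLY runs it once for each row of the outer table. The alias s is only for illustration.

```sql
-- One output row per list item; each keeps the original row's columns alongside it.
SELECT t.[Sample], s.value
FROM #TempTable t
CROSS APPLY STRING_SPLIT(t.[Sample], ',') s;
```

For the row '9,55,66' this returns three rows whose value column holds 9, 55 and 66. The order of the returned rows is not guaranteed.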

Anshu · Accepted Answer · 2023-11-30 05:24:17Z


I used common table expressions (CTEs) to compare the values in the two comma-separated columns (Sample and Benchmark) of your_table. The STRING_SPLIT function breaks the comma-separated values into individual rows. The query then counts the distinct sample values that also appear among the benchmark values, and the distinct values in the Benchmark column, in both cases across all rows of the table. Finally, it divides the match count by the benchmark total and returns the result as a percentage in the missing_ratio column. The result set includes all columns of your_table along with that calculated ratio.

WITH cte_sample AS (
    SELECT value AS sample_value
    FROM your_table
    CROSS APPLY STRING_SPLIT(Sample, ',')
),
cte_benchmark AS (
    SELECT value AS benchmark_value
    FROM your_table
    CROSS APPLY STRING_SPLIT(Benchmark, ',')
),
cte_matches AS (
    SELECT COUNT(DISTINCT s.sample_value) AS num_matches
    FROM cte_sample s
    WHERE EXISTS (
        SELECT 1
        FROM cte_benchmark b
        WHERE b.benchmark_value = s.sample_value
    )
),
cte_total AS (
    SELECT COUNT(DISTINCT benchmark_value) AS total_number
    FROM cte_benchmark
),
cte_final AS (
    SELECT
        yt.*,
        CAST(num_matches AS DECIMAL(10, 2)) / NULLIF(total_number, 0) * 100 AS missing_ratio
    FROM your_table yt
    CROSS JOIN cte_matches
    CROSS JOIN cte_total
)
SELECT *
FROM cte_final;
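As written, the CTEs pool the split values from every row of your_table, so each row receives the same table-wide figure, and the final expression is matches divided by total (the matched share, not the missing share). A per-row variant could look like the sketch below. It assumes the question's #TempTable and keeps the DISTINCT counting used above; the alias c and the use of a single CROSS APPLY are illustrative choices, not part of the original answer.

```sql
SELECT
    t.[Sample],
    t.Benchmark,
    -- (total - matches) / total, as a percentage; NULLIF guards against an empty list
    missing_ratio =
        CAST(c.total_number - c.num_matches AS decimal(10, 2))
        / NULLIF(c.total_number, 0) * 100
FROM #TempTable t
CROSS APPLY
(
    -- Evaluated once per row of #TempTable, using only that row's two lists.
    SELECT
        total_number = COUNT(DISTINCT b.value),
        num_matches  = COUNT(DISTINCT CASE WHEN s.value IS NOT NULL THEN b.value END)
    FROM STRING_SPLIT(t.Benchmark, ',') b
    LEFT JOIN STRING_SPLIT(t.[Sample], ',') s
        ON s.value = b.value
) c;
```

With the question's sample data this gives 50 for the first row (one of two benchmark values is missing) and 0 for the second.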

edited Nov 30, 2023 at 5:24

answered Nov 30, 2023 at 1:56

Anshu


2 Comments

Anshu Over a year ago

@DaleK Thanks, I'll add an explanation. When should we use CAST explicitly? Is multiplying by 1.0 a good approach, or is CAST better?

Anshu Over a year ago

Thanks, @DaleK, I got your point and updated the query.

gordy · Accepted Answer · 2023-11-30 04:38:16Z


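select *,
   -- share of benchmark values not found in Sample, as a fraction (0.5 = 50%)
   [missing ratio] = (1.0 * benchmark_count - match_count) / benchmark_count
from TempTable
-- per row: the values present in both lists, and how many there are
cross apply (
    select string_agg(s.value, ','), count(*)
    from string_split(Sample, ',') s, string_split(Benchmark, ',') b
    where s.value = b.value) m(match, match_count)
-- per row: the total number of values in Benchmark
cross apply (
    select count(*) from string_split(Benchmark, ',')) b(benchmark_count)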

http://sqlfiddle.com/#!18/a805fa/7/0

answered Nov 30, 2023 at 4:38

gordy


3 Comments

Dale K Over a year ago

A good answer contains an explanation in addition to a solution.

gordy Over a year ago

lovely thing about SQL is that it's self explanatory ;)

Dale K Over a year ago

Not to a beginner... hence the need to add an explanation
