I have a table in SQL Server with two columns, as shown here. All numbers are comma-separated:

IF OBJECT_ID('tempdb..#TempTable') IS NOT NULL
     DROP TABLE #TempTable

CREATE TABLE #TempTable 
(
     [Sample] varchar(33), 
     Benchmark varchar(33)
)

INSERT INTO #TempTable VALUES ('9,55,66', '55,44');
INSERT INTO #TempTable VALUES ('88,23,2,3,4', '23,88');

I need to match the data in the Sample column against the Benchmark column, then calculate the ratio of missing numbers:

[Image: expected result, with columns C to F as explained below]

Let me explain, taking the first row as an example:

  • Column C lists the numbers that exist in both Sample and Benchmark (here, 55)
  • Column D is the number of matches: 1 (only 55)
  • Column E is the total count of numbers in Benchmark: 2
  • Column F is calculated as (2 - 1) / 2 = 50%, the share of Benchmark numbers that are missing from Sample

I don't need columns C, D, and E to be stored in the table; they are for explanation only. What I need is column F.

I am using SQL Server 2019+.

5 Comments

  • As per the question guide, please do not post images of code, data, error messages, etc. - copy or type the text into the question. Please reserve the use of images for diagrams or demonstrating rendering bugs, things that are impossible to describe accurately via text. Commented Nov 30, 2023 at 1:11
  • Attach some sample data to make row constructors of and this might be a fun question to work on. Commented Nov 30, 2023 at 1:37
  • @Xedni I attached a sample table. please help Commented Nov 30, 2023 at 2:20
  • The solution would be to normalize your database. Anything other than that is just a workaround to handle a poor database design. Read Is storing a delimited list in a database column really that bad?, where you will see a lot of reasons why the answer to this question is Absolutely yes! Commented Nov 30, 2023 at 8:00
  • I agree with @ZoharPeled 10000%. If you fix your design by normalizing it then your query will be simple. Commented Nov 30, 2023 at 14:18

4 Answers

First things first: you need a way to uniquely identify each of your rows. I could assume that Sample uniquely identifies the rows, but I'm guessing there might be something like an identity column on the table. If there isn't one, make one using row_number(). I simply added an identity column called RID to your temp table (named #tmp in the queries below).
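As a minimal sketch of that setup, assuming the temp table is named #tmp and the identity column is named RID (both names follow this answer's queries, not the question's #TempTable):

```sql
-- Assumed setup: a copy of the question's data with an identity column RID.
IF OBJECT_ID('tempdb..#tmp') IS NOT NULL
    DROP TABLE #tmp;

CREATE TABLE #tmp
(
    RID int IDENTITY(1, 1) PRIMARY KEY,  -- row identifier used in the joins
    [Sample] varchar(33),
    Benchmark varchar(33)
);

INSERT INTO #tmp ([Sample], Benchmark)
VALUES ('9,55,66', '55,44'),
       ('88,23,2,3,4', '23,88');

-- Alternative when the rows already exist without a key:
-- generate a row number while copying them (the ordering is arbitrary).
-- SELECT RID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), [Sample], Benchmark
-- INTO #tmp
-- FROM #TempTable;
```

Either route only needs to give each row a stable, unique value for the duration of the query.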

The reason you need this is that you're going to split the samples and the benchmarks into two CTEs, call string_split on each of them, and then join them back together on the split value and the row identifier (in my example RID, but again, if Sample is a valid key, that works too).

The samples CTE is going to look like this:

select RID, Value = trim(b.value)
from #tmp a
cross apply string_split(a.sample, ',') b

and the benchmarks CTE is going to look like this:

select RID, Value = trim(b.value)
from #tmp a
cross apply string_split(a.benchmark, ',') b

I'm showing you these separately so you can see what each CTE contains. I'm also trimming the value in case there is whitespace in there. You could go a step further and cast them to int if you want/need, but I'll leave that as an exercise for the reader.
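For the cast to int mentioned above, one possible sketch uses try_cast, which returns NULL for a token that is not a valid integer instead of raising an error. It assumes the same #tmp table and RID column as the rest of this answer:

```sql
-- Samples split into rows, with each token converted to int.
-- try_cast yields NULL for tokens that cannot be converted.
select RID, Value = try_cast(trim(b.value) as int)
from #tmp a
cross apply string_split(a.sample, ',') b
```

A NULL Value never satisfies the equality in the join, so an unconvertible token simply counts as unmatched. Apply the same conversion in the benchmarks CTE so both sides compare as integers.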

Finally, left join benchmarks to samples on RID and Value, grouping by RID (i.e. the original row you were testing). The benchmark count is just the total count of rows (or you could use count(bm.value); it's the same thing, since benchmarks is your left table here) and the match count is the count of s.value. This works because if there isn't a match you'll get a null, and count(s.value) skips any s.value that is null.

;with samples as
(
    select RID, Value = trim(b.value)
    from #tmp a
    cross apply string_split(a.sample, ',') b
), benchmarks as
(
    select RID, Value = trim(b.value)
    from #tmp a
    cross apply string_split(a.benchmark, ',') b
)
select
    bm.RID,
    BenchmarkCount = count(1),
    MatchCount = count(s.value),
    MissingRatio = 100 - convert(decimal(9,3), (100.0 * count(s.Value)) / nullif(count(1), 0))
from benchmarks bm
left outer join samples s
    on bm.RID = s.RID
        and bm.Value = s.Value
group by bm.RID

From there, you can format your results however you like.

Use string_split() to split the Sample and Benchmark CSV values into rows, then left join benchmark to sample. The query assumes your_table has a unique id column.

with 
sample as
(
  select id, t.sample, t.benchmark, s.value
  from   your_table t
         cross apply string_split(t.sample, ',') s
),
benchmark as
(
  select id, t.sample, t.benchmark, s.value
  from   your_table t
         cross apply string_split(t.benchmark, ',') s
)
select b.sample, b.benchmark,
       [ratio] = (count(*) - count(s.value)) * 100.0 / count(*)
from   benchmark b
       left join sample s   on  b.id    = s.id
                            and b.value = s.value
group by b.sample, b.benchmark

4 Comments

Your code doesn't work. I attached a sample table: if OBJECT_ID('tempdb..#TempTable') is not null DROP TABLE #TempTable create table #TempTable ([Sample] varchar(33), Benchmark varchar(33)) insert into #TempTable values ('9,55,66','55,44'); insert into #TempTable values ('88,23,2,3,4','23,88');
What do you mean by "doesn't work"?
It works now. One question: how does this CROSS APPLY generate s.value? Is it a special window function?
That is the result of string_split(). Refer to the documentation for further detail.
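
To illustrate the point in the last comment: STRING_SPLIT is a table-valued function, not a window function. It returns one row per token in a single column named value, and CROSS APPLY runs it once for each outer row. A small sketch, using an inline VALUES list as stand-in data:

```sql
-- On its own, STRING_SPLIT returns a one-column table (column name: value).
SELECT value
FROM STRING_SPLIT('55,44', ',');
-- two rows: 55 and 44 (row order is not guaranteed)

-- CROSS APPLY evaluates the function per outer row and pairs each
-- resulting token with the row it came from.
SELECT t.Benchmark, s.value
FROM (VALUES ('55,44'), ('23,88')) AS t(Benchmark)
CROSS APPLY STRING_SPLIT(t.Benchmark, ',') AS s;
-- four rows: two tokens for each of the two Benchmark strings
```

The alias s in the answer's query names that returned table, which is why the column is referenced as s.value.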

I have used Common Table Expressions (CTEs) to compare the values in the two comma-separated columns (Sample and Benchmark) of your_table. The STRING_SPLIT function breaks the comma-separated values into individual rows, and each split value keeps the id of the row it came from (id is an assumed unique key on your_table; substitute your own). For each row, the query then counts the distinct values in the Benchmark column and how many of them also appear in the Sample column. Finally, it computes the missing ratio as a percentage: the share of values in the Benchmark column that have no match in the Sample column. The result set includes all columns from your_table along with the calculated missing ratio.

WITH cte_sample AS (
    -- One row per distinct Sample value, tagged with its source row.
    SELECT DISTINCT yt.id, TRIM(s.value) AS sample_value
    FROM your_table yt
    CROSS APPLY STRING_SPLIT(yt.Sample, ',') s
),
cte_benchmark AS (
    -- One row per distinct Benchmark value, tagged with its source row.
    SELECT DISTINCT yt.id, TRIM(b.value) AS benchmark_value
    FROM your_table yt
    CROSS APPLY STRING_SPLIT(yt.Benchmark, ',') b
),
cte_counts AS (
    -- COUNT(s.sample_value) ignores the NULLs produced by unmatched rows.
    SELECT
        b.id,
        COUNT(s.sample_value) AS num_matches,
        COUNT(*) AS total_number
    FROM cte_benchmark b
    LEFT JOIN cte_sample s
        ON s.id = b.id
       AND s.sample_value = b.benchmark_value
    GROUP BY b.id
)
SELECT
    yt.*,
    CAST(c.total_number - c.num_matches AS DECIMAL(10, 2)) / NULLIF(c.total_number, 0) * 100 AS missing_ratio
FROM your_table yt
LEFT JOIN cte_counts c
    ON c.id = yt.id;

2 Comments

@DaleK Thanks, I'll add an explanation. When should we use Cast explicitly? Is multiplying by 1.0 a good way, or is a Cast better?
Thanks, @DaleK, I got your point and updated the query.
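
The difference raised in these comments can be sketched with literals. This is only an illustration of integer versus decimal division in SQL Server, not the query from the answer:

```sql
-- Both operands are int, so the division truncates.
SELECT 1 / 2 AS int_division;                              -- 0

-- Multiplying by 1.0 first implicitly converts the expression to decimal.
SELECT 1.0 * 1 / 2 AS implicit_decimal;                    -- 0.5 as a decimal

-- An explicit CAST also avoids truncation and states the precision and scale.
SELECT CAST(1 AS DECIMAL(10, 2)) / 2 AS explicit_decimal;  -- 0.5 as a decimal
```

Both forms avoid the truncation; the explicit CAST makes the intended type visible to the reader, while the 1.0 trick leaves it to implicit conversion rules.
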
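-- m: the values present in both lists for this row, and how many there are.
-- b: the total number of values in Benchmark for this row.
-- The ratio is returned as a fraction (0.5 means 50%).
select *,
   [missing ratio] = (1.0 * benchmark_count - match_count) / benchmark_count
from TempTable
cross apply (
    select string_agg(s.value, ','), count(*)
    from string_split(Sample, ',') s, string_split(Benchmark, ',') b
    where s.value = b.value) m(match, match_count)
cross apply (
    select count(*) from string_split(Benchmark, ',')) b(benchmark_count)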

http://sqlfiddle.com/#!18/a805fa/7/0

3 Comments

A good answer contains an explanation in addition to a solution.
The lovely thing about SQL is that it's self-explanatory ;)
Not to a beginner... hence the need to add an explanation
