I want to generate some sample data with random values.

I have a mini table with 5 rows, ids from 1 to 5 with some text for every row.

Then I want to generate 65536 rows: the first column has the value 1 for every row, and the second column is a random number between 1 and 5, with no NULL values.

Then I want to join these two tables. With the ROW_NUMBER() % 5 approach, both INNER JOIN and OUTER JOIN return 65536 rows.

Instead of this pseudo-random column, I want to use RAND seeded by NEWID.

LEFT JOIN returns 65536 rows as I expected, but INNER JOIN returns a different row count on every call.

When the LFINAL table is materialized into a temp table, INNER JOIN works and returns 65536 rows too.

Can somebody explain to me why INNER JOIN does not return 65536 rows as I expected?

WITH Names AS 
(
    SELECT id, row_name 
    FROM 
        (SELECT 1, 'Row 1' UNION ALL 
         SELECT 2, 'Row 2' UNION ALL
         SELECT 3, 'Row 3' UNION ALL
         SELECT 4, 'Row 4' UNION ALL
         SELECT 5, 'Row 5') AS D (id, row_name)
),
L0 AS 
(
    SELECT c 
    FROM 
        (SELECT 1 UNION ALL SELECT 1) AS D(c)
),  --2^1
L1 AS 
(
    SELECT 1 AS c 
    FROM L0 AS A 
    CROSS JOIN L0 AS B
),          --2^2
L2 AS 
(
    SELECT 1 AS c 
    FROM L1 AS A 
    CROSS JOIN L1 AS B
),          --2^4
L3 AS 
(
    SELECT 1 AS c 
    FROM L2 AS A 
    CROSS JOIN L2 AS B
),          --2^8
L4 AS 
(
    SELECT 1 AS c 
    FROM L3 AS A 
    CROSS JOIN L3 AS B
),          --2^16 = 65536
LFINAL AS 
(
    SELECT 
        c, 
        --ROW_NUMBER() OVER (ORDER BY c) % 5 + 1 AS rnd FROM L4)
        FLOOR(RAND(CONVERT(VARBINARY, NEWID()))*5) + 1 AS rnd 
    FROM 
        L4
)
SELECT * 
FROM LFINAL lf
LEFT JOIN Names n ON n.id = lf.rnd
  • "but INNER JOIN returns different row count for every call": fewer, I assume (you don't tell us). If so, the reason is clear: some of the JOINs failed to find a related row, so fewer rows were returned. Commented Mar 7, 2022 at 9:35
  • For the inner join you get different rows. Commented Mar 7, 2022 at 9:50
  • Change LEFT to INNER: even more than 65536 rows can be returned if you run the query several times. Commented Mar 7, 2022 at 10:05
  • My row counts with the inner join were 65659, 65743, 65374, 65398, 65825, 65491, 65374. I think this is some bug in the SQL optimizer. Commented Mar 7, 2022 at 10:13
  • Per Salman's answer, this is not a bug and it's not going to be fixed. The optimizer does not guarantee every row in a CTE is evaluated only once, since they're interpolated into the query as though they were subqueries. That means using non-deterministic expressions can have unexpected results which will depend on the execution plan chosen. As an aside, a shorter and probably better way of generating a random number is CRYPT_GEN_RANDOM(1) % 5 + 1 (though note this is still non-deterministic). Commented Mar 7, 2022 at 11:01

1 Answer

I was able to reproduce the problem. This is expected behavior because:

  • NEWID() is not generated once per row; it is generated once per invocation of the expression
  • The optimizer is free to evaluate a row more than once

When tested, SQL Server seems to do the following behind the scenes:

for (row 1 ... 5 in dbo.names)
    for (row 1 ... 65536 in lfinal)
        add the pair {dbo.names.id, rand(newid...) as rnd} to selection
filter (where dbo.names.id = rnd)

Notice that each row in the set of 327680 rows (5 × 65536) has a 20% probability of matching the filter. You will get around 65536 rows in total, but not exactly that many.
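As a rough check on the numbers, here is the arithmetic under the assumption (mine, not something the execution plan guarantees) that each of the 327680 evaluations is independent and uniform over 1 to 5, so the row count N follows a binomial distribution:

```latex
\begin{aligned}
n &= 5 \times 65536 = 327680, \qquad p = \tfrac{1}{5} \\
\mathbb{E}[N] &= n\,p = 327680 \times 0.2 = 65536 \\
\sigma_N &= \sqrt{n\,p\,(1-p)} = \sqrt{327680 \times 0.2 \times 0.8} = \sqrt{52428.8} \approx 229
\end{aligned}
```

The counts reported in the comments (from 65374 up to 65825) all lie within roughly 1.3 standard deviations of 65536, which is consistent with this model.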

I would suggest inserting the random numbers into a temporary table (or table variable) so that the numbers are materialized, and then joining against that.
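A minimal sketch of that suggestion, reusing the CTEs from the question. The temp table name #LFinal is hypothetical, and the Names rows are inlined with VALUES for brevity. Because rnd is stored once by SELECT ... INTO, the later join reads fixed values instead of re-evaluating RAND(NEWID()), so the INNER JOIN should return 65536 rows, matching what the question reports for the materialized case.

```sql
-- Step 1: generate the 65536 rows and store the random column once.
-- #LFinal is a hypothetical temp table name.
WITH L0 AS (SELECT c FROM (VALUES (1), (1)) AS D(c)),         -- 2^1
     L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),   -- 2^2
     L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),   -- 2^4
     L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),   -- 2^8
     L4 AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B)    -- 2^16 = 65536
SELECT c,
       FLOOR(RAND(CONVERT(VARBINARY, NEWID())) * 5) + 1 AS rnd
INTO #LFinal
FROM L4;

-- Step 2: join against the stored values; rnd can no longer change
-- between evaluations, so every row finds exactly one match.
WITH Names AS
(
    SELECT id, row_name
    FROM (VALUES (1, 'Row 1'), (2, 'Row 2'), (3, 'Row 3'),
                 (4, 'Row 4'), (5, 'Row 5')) AS D (id, row_name)
)
SELECT COUNT(*) AS joined_rows
FROM #LFinal AS lf
INNER JOIN Names AS n ON n.id = lf.rnd;

-- Clean up the temp table.
DROP TABLE #LFinal;
```

The trade-off is an extra write to tempdb, which is small at this row count.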
