
Using Power BI DirectQuery against a Databricks SQL warehouse, I see that the queries generated to compute an average look like this:

SELECT 
    SUM(CAST(int_field AS DOUBLE)), COUNT(int_field) 
FROM
    fact

Since my fact table is so large, is there a way to eliminate the CAST(int_field AS DOUBLE)? Or at least to move it outside the SUM?

Would CAST(SUM(int_field) AS DOUBLE) perform better?
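In other words, something roughly like this, with the cast applied once after the aggregation (a hand-written sketch, not a query Power BI actually generates for me):

SELECT 
    CAST(SUM(int_field) AS DOUBLE), COUNT(int_field) 
FROM
    fact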

1 Answer

It would most likely not be faster, and it could even be incorrect.

(1) likely not correct:

If you sum a billion integers as integers, you are likely to run into integer overflow. If that happens, your result is nonsense; alternatively, some part of your stack may notice the overflow and throw an error instead. Either way the query fails you. Casting the input values to DOUBLE avoids this problem, because double-precision floating-point numbers have an enormous range.
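As a rough sketch of the two behaviours (hypothetical: the actual accumulator type and overflow behaviour depend on the SQL engine and its settings):

SELECT 
    SUM(int_field),                 -- may overflow: wraps around or raises an error, depending on the engine
    SUM(CAST(int_field AS DOUBLE))  -- no overflow: DOUBLE covers roughly ±1.8e308
FROM
    fact

The DOUBLE version trades the overflow risk for floating-point rounding, which is normally acceptable for an average.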

(2) likely not faster:

Latency Numbers Every Programmer Should Know puts a random read from an SSD at roughly a hundred thousand times slower than a typical CPU instruction. So your problem is completely dominated by disk; the tiny amount of compute your approach might save is irrelevant.

The question of latency vs. throughput came up in the comments, and throughput is the more relevant number here. The gap is smaller, but the CPU still outpaces the disk by a lot. A really fast SSD may read at 7 GB/s. Under ideal conditions, where your table contains only the numbers you want to average and no book-keeping information, other columns, ignored rows, or free space, your 2 billion int32 values would be read in a bit more than a second; a real database will be much, much slower. On the CPU side, 2 billion instructions can run in about one second on an ARM Cortex-A8 from 2005.
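For reference, the back-of-the-envelope arithmetic behind the "bit more than a second" estimate, assuming 4 bytes per int32 and the 7 GB/s figure above:

$$
2 \times 10^{9} \text{ values} \times 4 \,\text{bytes} = 8 \,\text{GB},
\qquad
\frac{8 \,\text{GB}}{7 \,\text{GB/s}} \approx 1.1 \,\text{s}
$$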

4 Comments

You're citing latency numbers in a throughput scenario
That is correct; my reasoning is flawed. I will try to find better numbers, but my intuition is that the conclusion will not change.
I've put in some throughput numbers. I still think disk is the slowest part. My intuition that the CPU is not the bottleneck here stems from analyzing arithmetic intensity in high-performance computing; the latency numbers were always given as an explanation of why arithmetic intensity matters, and since they explain the "obvious", I never gave them much scrutiny.
I have also reordered my answer; the concerns about correctness come first now. If I believed the cast-once solution were correct, you could argue it is worth trying out and benchmarking, since theoretical predictions of performance are often wrong. But if it is incorrect, it does not even matter whether it is fast.
