
Using Power BI DirectQuery against a Databricks SQL warehouse, I see that the queries generated to compute an average look like this:

SELECT 
    SUM(CAST(int_field AS DOUBLE)), COUNT(int_field) 
FROM
    fact

Since my fact table is so large, is there a way to eliminate the CAST(int_field AS DOUBLE)? Or at least to move it outside the SUM?

Would CAST(SUM(int_field) AS DOUBLE) perform better?
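In other words, something roughly like this, with the cast applied once after the aggregation (a hand-written sketch, not a query Power BI actually generates for me):

SELECT 
    CAST(SUM(int_field) AS DOUBLE), COUNT(int_field) 
FROM
    fact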

1 Answer

It would most likely not be faster, and it could even be incorrect.

(1) likely not correct:

If you sum a billion integers as integers, you are likely to run into integer overflow. If that happens, your result is nonsense; alternatively, some part of your stack may notice the overflow and throw an error instead. Either way the query fails you. Casting the input values to DOUBLE avoids this problem, because double-precision floating-point numbers have an enormous range.
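As a rough sketch of the two behaviours (hypothetical: the actual accumulator type and overflow behaviour depend on the SQL engine and its settings):

SELECT 
    SUM(int_field),                 -- may overflow: wraps around or raises an error, depending on the engine
    SUM(CAST(int_field AS DOUBLE))  -- no overflow: DOUBLE covers roughly ±1.8e308
FROM
    fact

The DOUBLE version trades the overflow risk for floating-point rounding, which is normally acceptable for an average.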

(2) likely not faster:

Latency Numbers Every Programmer Should Know puts a random read from an SSD at roughly a hundred thousand times slower than a typical CPU instruction. So your problem is completely dominated by disk; the tiny amount of compute your approach might save is irrelevant.

The question of latency vs. throughput came up in the comments, and throughput is the more relevant number here. The gap is smaller, but the CPU still outpaces the disk by a lot. A really fast SSD may read at 7 GB/s. Under ideal conditions, where your table contains only the numbers you want to average and no book-keeping information, other columns, ignored rows, or free space, your 2 billion int32 values would be read in a bit more than a second; a real database will be much, much slower. On the CPU side, 2 billion instructions can run in about one second on an ARM Cortex-A8 from 2005.
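For reference, the back-of-the-envelope arithmetic behind the "bit more than a second" estimate, assuming 4 bytes per int32 and the 7 GB/s figure above:

$$
2 \times 10^{9} \text{ values} \times 4 \,\text{bytes} = 8 \,\text{GB},
\qquad
\frac{8 \,\text{GB}}{7 \,\text{GB/s}} \approx 1.1 \,\text{s}
$$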

4 Comments

You're citing latency numbers in a throughput scenario
That is correct; my reasoning is flawed. I will try to find better numbers, but my intuition is that the conclusion will not change.
I've put in some throughput numbers. I still think disk is the slowest part. My intuition that the CPU is not the bottleneck here stems from analyzing arithmetic intensity in high-performance computing; the latency numbers were always given as an explanation of why arithmetic intensity matters, and since they explain the "obvious", I never gave them much scrutiny.
I have also reordered my answer; the concerns about correctness come first now. If I believed the cast-once solution were correct, you could argue it is worth trying out and benchmarking, since theoretical predictions of performance are often wrong. But if it is incorrect, it does not even matter whether it is fast.
