Distinct cumulative count in PostgreSQL window function

Question

I have a table with many columns, among them product_id, territory_id and quarter_num (the number of a quarter, from 1 to 28 for instance). The other columns aren't needed for this query. I need to count the number of distinct products in every territory for every cumulative quarter: quarter 1 only for the first, 1+2 for the second, 1+2+3 for the third, and so on up to 1 through 28. Previously this was implemented in Qlik Sense with a loop. Now I need to rewrite it in PostgreSQL as a single query (ideally as one CTE within a longer query) using standard SQL, with no loops etc.

It would simply be something like this:

select *
    ,count(distinct product_id) 
        filter(where some_condition) 
        over(partition by territory_id order by quarter_num) 
    as cum_filtered_product_count
from some_table

That would work if distinct were implemented for window functions, but it isn't. I've racked my brain, read and tried a lot of the advice here, but still haven't found a correct solution. Any help will be appreciated.

PS: The approach with two subqueries, where the first counts distinct products per quarter within a group and the second sums those results cumulatively in a window function, doesn't work, because the second subquery may count the same product more than once.

A little update on this issue. I've tried both methods, and both work (jsonb is faster). Unfortunately, neither works on my production server: I can't install any extensions there (intarray included), and its version doesn't support the jsonb functions; it seems to be older than version 12. I can't do anything about that at the moment. Is there maybe a third way to force window functions to use distinct, without intarray and jsonb?
– ginfonic, Commented Sep 10, 2024 at 6:38
I guess you could always select from a subquery that does the DISTINCT filtering?
– Bergi, Commented Sep 10, 2024 at 6:42
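
One possible reading of this suggestion, sketched here as an assumption rather than something the comment spells out: reduce each (territory_id, product_id) pair to the first quarter in which it appears, then count how many pairs have appeared by each quarter. It needs only GROUP BY and a join, so it should not depend on intarray, on the jsonb functions, or on a recent PostgreSQL version. The sample rows are territories 1 and 3 from the answer's demo table; the CTE names (sample, first_seen, quarters) are made up for illustration, and product_id <> 13 stands in for some_condition.

```sql
-- Sample rows copied from the demo table in the answer (territories 1 and 3).
with sample(territory_id, quarter_num, product_id) as (
  values (1,1,1),(1,1,1),(1,1,13),
         (1,2,3),(1,2,5),(1,2,7),
         (1,3,8),(1,3,8),(1,3,11),
         (3,1,3),(3,1,11),(3,1,13),
         (3,2,0),(3,2,3),(3,2,12),
         (3,3,0),(3,3,9),(3,3,13)
),
-- One row per territory and product: the first quarter in which
-- the product passes the filter.
first_seen as (
  select territory_id, product_id, min(quarter_num) as first_quarter
  from sample
  where product_id <> 13              -- stands in for some_condition
  group by territory_id, product_id
),
-- Every (territory, quarter) that should show up in the result,
-- including quarters in which no row passes the filter.
quarters as (
  select distinct territory_id, quarter_num
  from sample
)
select q.territory_id,
       q.quarter_num,
       count(f.product_id) as cum_filtered_product_count
from quarters q
left join first_seen f
       on f.territory_id = q.territory_id
      and f.first_quarter <= q.quarter_num
group by q.territory_id, q.quarter_num
order by 1, 2;
-- counts for territory 1: 1, 4, 6; for territory 3: 2, 4, 5
```

Because each product contributes only once, at its first quarter, repeats in later quarters cannot be counted again, which is the problem with summing per-quarter distinct counts.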
Before v12 you can still use the recursive CTE, or you can run array_agg() as arr as a window function in a subquery/CTE, then (select count(distinct e) from unnest(arr)e) on that.
– Zegarek, Commented Sep 10, 2024 at 7:19
Something like this? select count(distinct arv) from (select unnest(art.arr) from (select array_agg(val) over () as arr from test.array) art) arv; It works OK, but I'm confused that if I omit distinct it returns the number of rows squared. Am I doing anything wrong?
– ginfonic, Commented Sep 10, 2024 at 14:31
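
The squared row count has a plain explanation: array_agg(val) over () attaches the complete N-element array to each of the N input rows, so unnesting it in the outer query produces N*N rows, and count(distinct ...) merely hides that. A sketch of the pattern from Zegarek's comment, with the count done per row in a scalar subquery so the arrays are never multiplied out; it assumes the question's some_table and uses product_id <> 13 for some_condition, as the answer's demo does. The alias s is arbitrary.

```sql
select distinct
       territory_id,
       quarter_num,
       -- Count the distinct elements of this row's cumulative array.
       -- unnest() of a NULL array yields no rows, so the count is then 0.
       (select count(distinct e) from unnest(arr) as e)
         as cum_filtered_product_count
from (
  select territory_id,
         quarter_num,
         -- Default window frame: all rows of the partition up to and
         -- including the current quarter, so every row of one quarter
         -- carries the same array.
         array_agg(product_id)
           filter (where product_id <> 13)
           over (partition by territory_id order by quarter_num) as arr
  from some_table
) s
order by 1, 2;
```

The FILTER clause needs PostgreSQL 9.4 or later. On an older server, array_agg(case when product_id <> 13 then product_id end) should serve the same purpose, since count(distinct e) ignores the NULLs that this puts into the array.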

Zegarek · Accepted Answer · 2024-06-19 10:51:40Z

You can use intarray to emulate the missing count(distinct x)over() using sets (demo at db<>fiddle):

select distinct quarter_num
  ,territory_id
  ,#uniq(sort(array_agg(product_id) 
              filter(where product_id<>13) 
              over(partition by territory_id 
                   order by quarter_num)))
    as cum_filtered_product_count
from some_table
order by 1,2;

By turning the aggregated array into a set, you only keep distinct elements, and # tells you how many you got. You can also use -'{}' as a trick to quietly turn the array into a set, but while shorter, the operation still involves the uniq(sort()), plus the empty subtraction.
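
To see the individual steps in isolation, here is a small sketch on a literal array. It assumes the intarray extension can be installed; the column aliases are made up for illustration.

```sql
create extension if not exists intarray;

select sort('{7,1,3,1,5}'::int[])               as sorted,        -- {1,1,3,5,7}
       -- uniq() only removes adjacent duplicates, hence the sort() first
       uniq(sort('{7,1,3,1,5}'::int[]))         as deduplicated,  -- {1,3,5,7}
       -- unary # returns the number of elements
       #uniq(sort('{7,1,3,1,5}'::int[]))        as how_many,      -- 4
       -- icount() is the function form of #
       icount(uniq(sort('{7,1,3,1,5}'::int[]))) as same_count;    -- 4
```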

If product_id isn't an int and you don't want to map it, you can lean on the fact that the keys of a jsonb object also form a set, so aggregating into a jsonb object will by nature only keep unique keys:

select distinct quarter_num
  ,territory_id
  ,jsonb_array_length(jsonb_path_query_array(cum_filtered_product,'$.*'))
   as cum_filtered_product_count
from (
select *,jsonb_object_agg(product_id,0) 
           filter(where product_id<>13) 
           over(partition by territory_id order by quarter_num) 
           as cum_filtered_product
from some_table)_
order by 1,2;

The two jsonb functions just extract the object's values, one per unique key, and count them.
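
A small sketch of the key deduplication on literal values; the names v, s, product_set and the two count aliases are made up for illustration. jsonb_path_query_array was added in PostgreSQL 12, so the second count shows a possible alternative for older servers using jsonb_object_keys, on the assumption that jsonb_object_agg itself is available there.

```sql
select product_set,                                   -- {"a": 0, "b": 0, "c": 0}
       -- '$.*' returns the value of every key; the array has one entry per key
       jsonb_array_length(jsonb_path_query_array(product_set, '$.*'))
         as via_jsonpath,                             -- 3
       -- same count without jsonpath: one row per key
       (select count(*) from jsonb_object_keys(product_set))
         as via_object_keys                           -- 3
from (
  -- 'a' is aggregated twice but kept as a single key
  select jsonb_object_agg(product_id, 0) as product_set
  from (values ('a'), ('b'), ('a'), ('c')) as v(product_id)
) s;
```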

You can take a closer look at what this does in the demo, where some_condition is product_id<>13 (t is territory_id and q is quarter_num). Note how it discards duplicates from both the same and earlier quarters for a given territory:

t	q	products	cumulative	filtered	distinct	count_distinct
1	1	{1,1,13}	{1,1,13}	{1,1}	{1}	1
1	2	{3,5,7}	{1,1,3,5,7,13}	{1,1,3,5,7}	{1,3,5,7}	4
1	3	{8,8,11}	{1,1,3,5,7,8,8,11,13}	{1,1,3,5,7,8,8,11}	{1,3,5,7,8,11}	6
2	1	{1,6,11}	{1,6,11}	{1,6,11}	{1,6,11}	3
2	2	{4,6,7}	{1,4,6,6,7,11}	{1,4,6,6,7,11}	{1,4,6,7,11}	5
2	3	{12,13,14}	{1,4,6,6,7,11,12,13,14}	{1,4,6,6,7,11,12,14}	{1,4,6,7,11,12,14}	7
3	1	{3,11,13}	{3,11,13}	{3,11}	{3,11}	2
3	2	{0,3,12}	{0,3,3,11,12,13}	{0,3,3,11,12}	{0,3,11,12}	4
3	3	{0,9,13}	{0,0,3,3,9,11,12,13,13}	{0,0,3,3,9,11,12}	{0,3,9,11,12}	5

edited Jun 19, 2024 at 10:51

answered Jun 19, 2024 at 9:53

Zegarek


4 Comments

Bergi Over a year ago

What's the # in #uniq(…)?

Zegarek Over a year ago

That's intarray's way of doing array_length(x,1). I accidentally posted too early, so the links and the explanation were missing, but that's added now.

Bergi Over a year ago

Oh it's two functions, # of uniq(…)! I thought it was misspelled. I've used intarray myself but never came across the unary # operator

ginfonic Over a year ago

Many thanks. It's genius! I've tried intarray but ended up realizing it doesn't work with text. The idea of using IDs instead I couldn't have come up with by myself. Shame on me! Thanks again.
