How to remove duplicate rows when using ARRAY_AGG in BigQuery

Question

I have the following data:

WITH data as (
  SELECT 18 AS value, 1 AS id, "A" AS other_value,
  UNION ALL SELECT 20 AS value, 1 AS id, "B",
  UNION ALL SELECT 22 AS value, 2 AS id, "C"
  UNION ALL SELECT 30 AS value, 3 AS id, "A"
  UNION ALL SELECT 37 AS value, 2 AS id, "B"
  UNION ALL SELECT 31 AS value, 2 AS id, "C"
  UNION ALL SELECT 42 AS value, 1 AS id, "D"
)

I am using the following query:

select
   FIRST_VALUE(id) over w1 as id
 , ARRAY_AGG(value) over w1  as data
 , FIRST_VALUE(other_value) over w1 as first_other_data
 , LAST_VALUE(other_value) over w1  as last_other_data
from data
WINDOW w1 as (PARTITION BY id order by value ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

And I get:

id  data  first_other_data  last_other_data 
1   18    A                 D 
    20
    42
1   18    A                 D
    20
    42
1   18    A                 D
    20
    42
2   22    C                 B
    31
    37
2   22    C                 B
    31
    37
2   22    C                 B
    31
    37
3   30    A                 A

But I am getting duplicate rows that I don't want. I was thinking of using the DISTINCT keyword, but BigQuery does not like it. My expected result is:

id  data  first_other_data  last_other_data     
1   18     A                 D
    20
    42
2   22     C                 B
    31
    37
3   30     A                 A

I have found similar questions, but none that covers exactly this case. Thanks.

EDIT: In my attempt to simplify the scenario for this SO question I took out some essential components. I have replaced it with a more accurate version of my problem.

Martin Weitzmann · Accepted Answer · 2020-02-11 15:53:32Z

3

I think you still want an aggregation with grouping :) Since value and other_value seem to be related (you order by value in the window to select other_value), I'd simply put them into the array, too:

WITH data as (
  SELECT 18 AS value, 1 AS id, "A" AS other_value,
  UNION ALL SELECT 20 AS value, 1 AS id, "B",
  UNION ALL SELECT 22 AS value, 2 AS id, "C"
  UNION ALL SELECT 30 AS value, 3 AS id, "A"
  UNION ALL SELECT 37 AS value, 2 AS id, "B"
  UNION ALL SELECT 31 AS value, 2 AS id, "C"
  UNION ALL SELECT 42 AS value, 1 AS id, "D"
)

select 
  id,
  array_agg(struct(value, other_value)) as data
from data
group by id

That way, if you need the individual fields later, you can get them with a subquery over the array. Or you can do it all in one query, with the aggregation as an intermediate step:

WITH data as (
  SELECT 18 AS value, 1 AS id, "A" AS other_value,
  UNION ALL SELECT 20 AS value, 1 AS id, "B",
  UNION ALL SELECT 22 AS value, 2 AS id, "C"
  UNION ALL SELECT 30 AS value, 3 AS id, "A"
  UNION ALL SELECT 37 AS value, 2 AS id, "B"
  UNION ALL SELECT 31 AS value, 2 AS id, "C"
  UNION ALL SELECT 42 AS value, 1 AS id, "D"
), 
temp as (
  select 
    id,
    array_agg(struct(value, other_value)) as data
  from data
  group by id
)
select 
  id,
  array(select value from unnest(data) order by value) as data,  -- order by keeps the values sorted, as in the expected result
  (select other_value from unnest(data) order by value ASC limit 1) first_other_data,
  (select other_value from unnest(data) order by value DESC limit 1) last_other_data
from temp
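
As a side note (this is not part of the answer above, just a minimal sketch): if the only goal is one row per id, the same shape can probably be produced in a single GROUP BY step, using ORDER BY and LIMIT inside ARRAY_AGG. It assumes other_value is never NULL, since ARRAY_AGG raises an error when the resulting array would contain a NULL element.

```sql
WITH data AS (
  SELECT 18 AS value, 1 AS id, "A" AS other_value
  UNION ALL SELECT 20, 1, "B"
  UNION ALL SELECT 22, 2, "C"
  UNION ALL SELECT 30, 3, "A"
  UNION ALL SELECT 37, 2, "B"
  UNION ALL SELECT 31, 2, "C"
  UNION ALL SELECT 42, 1, "D"
)
SELECT
  id,
  -- all values of the group, smallest first
  ARRAY_AGG(value ORDER BY value) AS data,
  -- other_value of the row with the smallest value
  ARRAY_AGG(other_value ORDER BY value ASC LIMIT 1)[OFFSET(0)] AS first_other_data,
  -- other_value of the row with the largest value
  ARRAY_AGG(other_value ORDER BY value DESC LIMIT 1)[OFFSET(0)] AS last_other_data
FROM data
GROUP BY id
ORDER BY id
```

Because this uses a plain aggregation rather than window functions, each id appears once and there is nothing to deduplicate afterwards.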

edited Feb 11, 2020 at 15:53

answered Feb 11, 2020 at 12:42


3 Comments

DarioB Over a year ago

Hi Martin, my bad. In my attempt to simplify the query for this SO question I took out some essential components. I have modified it with a more accurate representation of my problem. But given the scenario originally presented, you're right: your query would solve the previous problem.

DarioB Over a year ago

Yes, it works, thank you for your reply. I have upvoted your answer and comments; however, I will approve Mikhail's answer as it looks neater. Thank you.

Martin Weitzmann Over a year ago

it does?! I wouldn't expect it to scale well

Mikhail Berlyant · Accepted Answer · 2020-02-11 17:34:12Z

2

The easiest way to achieve your goal without changing your initial query is to wrap it in an extra SELECT, as in the example below:

#standardSQL
SELECT ANY_VALUE(t).* FROM (
  SELECT
     FIRST_VALUE(id) OVER w1 AS id
   , ARRAY_AGG(value) OVER w1  AS data
   , FIRST_VALUE(other_value) OVER w1 AS first_other_data
   , LAST_VALUE(other_value) OVER w1  AS last_other_data
  FROM data
  WINDOW w1 AS (PARTITION BY id ORDER BY value ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
) t
GROUP BY FORMAT('%t', t)

with the result being one row per id, as in the expected output in the question.
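
A possible alternative to the wrapper (a sketch only, not from this answer): keep the window functions and let QUALIFY keep a single row per partition. The extra ROW_NUMBER() window is spelled out instead of reusing w1, because ROW_NUMBER does not accept a window frame clause. WHERE TRUE is added defensively, on the assumption that some BigQuery versions require WHERE, GROUP BY or HAVING alongside QUALIFY.

```sql
WITH data AS (
  SELECT 18 AS value, 1 AS id, "A" AS other_value
  UNION ALL SELECT 20, 1, "B"
  UNION ALL SELECT 22, 2, "C"
  UNION ALL SELECT 30, 3, "A"
  UNION ALL SELECT 37, 2, "B"
  UNION ALL SELECT 31, 2, "C"
  UNION ALL SELECT 42, 1, "D"
)
SELECT
    id
  , ARRAY_AGG(value) OVER w1 AS data
  , FIRST_VALUE(other_value) OVER w1 AS first_other_data
  , LAST_VALUE(other_value) OVER w1 AS last_other_data
FROM data
WHERE TRUE
-- keep only the first row of each id partition; the window columns above
-- are computed over the whole partition before this filter is applied
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY value) = 1
WINDOW w1 AS (PARTITION BY id ORDER BY value ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
```

This avoids turning every row into a string just to group on it, at the cost of one more window function.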

answered Feb 11, 2020 at 17:34


2 Comments

DarioB Over a year ago

Interesting, thank you. What exactly does FORMAT in the GROUP BY do?

Mikhail Berlyant Over a year ago

FORMAT('%t', t) generates a kind of fingerprint of the row, which you can then use to deduplicate your result. Please consider voting up the answer if it was helpful :o)
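
To make the fingerprint idea concrete, here is a small sketch on hand-written rows (the CTE t and its three rows are made up for illustration; they only mimic the shape of the windowed result). The first two rows are identical, so they render to the same string and land in the same group; the third row forms a group of its own.

```sql
WITH t AS (
  SELECT 1 AS id, [18, 20, 42] AS data, "A" AS first_other_data, "D" AS last_other_data
  UNION ALL
  SELECT 1, [18, 20, 42], "A", "D"
  UNION ALL
  SELECT 2, [22, 31, 37], "C", "B"
)
SELECT
  FORMAT('%t', t) AS fingerprint,  -- string rendering of the whole row, arrays included
  COUNT(*) AS copies               -- how many rows share that rendering
FROM t
GROUP BY fingerprint
```

The string form is what makes grouping possible here, since an ARRAY column cannot be used directly in GROUP BY or SELECT DISTINCT. One caveat, as far as I can tell: '%t' prints strings without quotes, so two different rows whose text fields contain commas could in principle render identically; '%T' quotes strings and may be the safer choice in that case.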
