How to aggregate on multiple columns using SQL or spark SQL

Question

I have the following table:

Id col1 col2
1  a    1   
1  b    2   
1  c    3   
2  a    1   
2  e    3   
2  f    4

Expected output is:

Id col3
1  a1b2c3
2  a1e3f4

The aggregation involves 2 columns. Is this supported in SQL?

I think an aggregation can only operate on one column, so I need to combine the 2 columns to form a new column and then aggregate on that new column.
– Xiaoyong Guo, Commented Jul 1, 2022 at 1:11

ZygD · Accepted Answer · 2022-07-01 06:03:19Z


In Spark SQL you can do it like this:

SELECT Id, aggregate(list, '', (acc, x) -> concat(acc, x)) col3
FROM (SELECT Id, array_sort(collect_list(concat(col1, col2))) list
      FROM df
      GROUP BY Id )

or in one select:

SELECT Id, aggregate(array_sort(collect_list(concat(col1, col2))), '', (acc, x) -> concat(acc, x)) col3
FROM df
GROUP BY Id

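This example uses the higher-order function aggregate. The array_sort call fixes the order of the concatenated values, because collect_list does not guarantee element order.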

aggregate(expr, start, merge, finish) - Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state. The final state is converted into the final result by applying a finish function.
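To make the fold concrete, here is a minimal plain-Python sketch of what the one-select query computes for the sample table. It is not Spark code: the `aggregate` helper and the `rows` and `result` names are hypothetical stand-ins that only mirror the semantics described above, assuming string concatenation as the merge step.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter

# Sample rows from the question: (Id, col1, col2)
rows = [
    (1, "a", 1), (1, "b", 2), (1, "c", 3),
    (2, "a", 1), (2, "e", 3), (2, "f", 4),
]

def aggregate(values, start, merge, finish=lambda acc: acc):
    """Hypothetical helper: fold values into one state, then apply finish."""
    return finish(reduce(merge, values, start))

result = {}
# GROUP BY Id: groupby needs its input sorted by the grouping key.
by_id = itemgetter(0)
for key, group in groupby(sorted(rows, key=by_id), key=by_id):
    # concat(col1, col2) per row, then collect_list + array_sort
    pairs = sorted(f"{col1}{col2}" for _, col1, col2 in group)
    # aggregate(list, '', (acc, x) -> concat(acc, x))
    result[key] = aggregate(pairs, "", lambda acc, x: acc + x)

print(result)  # {1: 'a1b2c3', 2: 'a1e3f4'}
```

The accumulator starts as the empty string and each sorted element is appended in turn, which is why the start value in the SQL is '' rather than NULL.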

