1

I would like to aggregate some columns into an array or json object on redash. The data table is on presto database. I need to query it from pyspark hive. The data table is large and I need to keep its size as small as possible so that I can save the dataframe to s3 faster and then read it as parquet from s3 efficiently.

I am not sure what the best data structure should be for this? (json object? array of array ?)

The original table (> 10^9 rows, some columns (e.g. obj_desc) may have more than 30 English words):

id.    cat_name.   cat_desc.       obj_name.    obj_desc.   obj_num
1.     furniture    living office   desk         4 corners    1.5.      
1      furniture.   living office.  chair.       4 legs.      0.8
1.     furniture.   restroom.       tub.         white wide.  2.7
1.     cloth.       fashion.        T-shirt.     black large. 1.1

I want (this may not be the best data structure):

id.     cat_item_aggregation
 1.     [['furniture', ['living office', ['desk', '4 corners', '1.5'], ['chair', '4 legs', '0.8']], ['restroom', [['tub', 'white wide', '2.7']], ['cloth', ['fashion', ['T-shirt', 'black', '1.1']]]]

I have tried array_agg from PostgreSQL: Efficiently aggregate array columns as part of a group by Postgres - aggregate two columns into one item

also json_build_object from Return as array of JSON objects in SQL (Postgres) How to group multiple columns into a single array or similar?

but they do not work in redash.

Could anybody let me know what the best data structure should be for this kind of table ? and how to build it ?

json may be better than array of array because it is hard to decompose the elements from array of array ?

thanks

2
  • What do you mean by they do not work? Do you get an error? Id so the what is the exact and complete error message? Do they just not produce what you expect? If so then exactly what do they produce and exactly what do you expect? Further having 30 (or more) English words means nothing do you actually mean populated columns? Would be best if you posted the actual table definition (ddl). Commented Jan 22, 2022 at 19:17
  • "the best data structure", if any, highly depends on the way you want to query this aggregate data. Can you give examples of typical queries/criteria to query this data ? Also, did you investigate the VIEW as a solution for your need ? Commented Jan 22, 2022 at 20:38

0

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.