Using spark dataframe transpose

Question

I am trying to figure out how to solve this use case using spark dataframe.

In the below google sheet, I have the source data where the questions from the survey answered by the people will be stored. Also the question columns will be more than 1000 columns approx, and is more dynamic and not fixed.

There is a metadata table, which explains about the question, its description and the choices it can contain.

Output table should be the like the one I had mentioned in the sheet. Any suggestions or ideas on how this can be achieved ?

https://docs.google.com/spreadsheets/d/1BAY8XWaio1DbzcQeQgru6PuNfT9A7Uhf650x_-PAjqo/edit#gid=0

vilalabinot · Accepted Answer · 2022-08-14 20:01:48Z

1

Let's assume your main table is called df:

+---------+-----------+-----------+------+------+------+
|survey_id|response_id|person_name|Q1D102|Q1D103|Q1D105|
+---------+-----------+-----------+------+------+------+
|xyz      |xyz        |john       |1     |2     |1     |
|abc      |abc        |foo        |3     |1     |1     |
|def      |def        |bar        |2     |2     |2     |
+---------+-----------+-----------+------+------+------+

and the mapping table is called df2:

+-----------+-------------+-------------------+---------+-----------+
|question_id|question_name|question_text      |choice_id|choice_desc|
+-----------+-------------+-------------------+---------+-----------+
|Q1D102     |Gender       |What is your gender|1        |Male       |
|Q1D102     |Gender       |What is your gender|2        |Female     |
|Q1D102     |Gender       |What is your gender|3        |Diverse    |
|Q1D103     |Age          |What is your age   |1        |20 - 50    |
|Q1D103     |Age          |What is your age   |2        |50 >       |
|Q1D105     |work_status  |Do you work        |1        |Yes        |
|Q1D105     |work_status  |Do you work        |2        |No         |
+-----------+-------------+-------------------+---------+-----------+

We can construct a dynamic unpivot expression as below:

val columns = df.columns.filter(c => c.startsWith("Q1D"))

val data = columns.map(c => s"'$c', $c").mkString(",")

val finalExpr = s"stack(${columns.length}, $data) as (question_id, choice_id)"

With 3 questions, we get the following expression (Q1D102, Q1D103 and Q1D105): stack(3, 'Q1D102', Q1D102,'Q1D103', Q1D103,'Q1D105', Q1D105) as (question_id, choice_id)

Finally, we use the constructed variable:

df = df
  .selectExpr("survey_id", "response_id", "person_name", finalExpr)
  .join(df2, Seq("question_id", "choice_id"), "left")

You get this result:

+-----------+---------+---------+-----------+-----------+-------------+-------------------+-----------+
|question_id|choice_id|survey_id|response_id|person_name|question_name|question_text      |choice_desc|
+-----------+---------+---------+-----------+-----------+-------------+-------------------+-----------+
|Q1D102     |1        |xyz      |xyz        |john       |Gender       |What is your gender|Male       |
|Q1D102     |2        |def      |def        |bar        |Gender       |What is your gender|Female     |
|Q1D102     |3        |abc      |abc        |foo        |Gender       |What is your gender|Diverse    |
|Q1D103     |1        |abc      |abc        |foo        |Age          |What is your age   |20 - 50    |
|Q1D103     |2        |xyz      |xyz        |john       |Age          |What is your age   |50 >       |
|Q1D103     |2        |def      |def        |bar        |Age          |What is your age   |50 >       |
|Q1D105     |1        |xyz      |xyz        |john       |work_status  |Do you work        |Yes        |
|Q1D105     |1        |abc      |abc        |foo        |work_status  |Do you work        |Yes        |
|Q1D105     |2        |def      |def        |bar        |work_status  |Do you work        |No         |
+-----------+---------+---------+-----------+-----------+-------------+-------------------+-----------+

Which I think is what you need (just unordered), good luck!

edited Aug 14, 2022 at 20:01

answered Aug 14, 2022 at 19:25

vilalabinot

1,6216 silver badges21 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Hemanth Kumar Over a year ago

Hey, thanks much for your answer. I am sorry as i forgot to mention this. In the source table, the question columns will be more than 1000 columns approx, and is more dynamic and not fixed(varies from survey to survey). Hence i was thinking to use transpose, but not sure if that is the right approach. I will update this information in my question above. thanks.!!

vilalabinot Over a year ago

Do you mean there are 1000+ columns as Q1D102, Q1D103, Q1D105, etc. ?

Hemanth Kumar Over a year ago

yes exactly. it is more than 1000 columns like Q1D102, Q1D103, Q1D105, Q1D106, Q1D107 etc.

vilalabinot Over a year ago

Edited the post, good luck!

Collectives™ on Stack Overflow

Using spark dataframe transpose

1 Answer 1

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related