
I am trying to figure out how to solve this use case with Spark DataFrames.

The Google Sheet linked below contains the source data, which stores the answers people gave to the survey questions. There are roughly 1000 or more question columns, and the set of columns is dynamic rather than fixed.

There is also a metadata table that describes each question, its description, and the choices it can contain.

The output table should look like the one shown in the sheet. Any suggestions or ideas on how this can be achieved?

https://docs.google.com/spreadsheets/d/1BAY8XWaio1DbzcQeQgru6PuNfT9A7Uhf650x_-PAjqo/edit#gid=0

1 Answer


Let's assume your main table is called df:

+---------+-----------+-----------+------+------+------+
|survey_id|response_id|person_name|Q1D102|Q1D103|Q1D105|
+---------+-----------+-----------+------+------+------+
|xyz      |xyz        |john       |1     |2     |1     |
|abc      |abc        |foo        |3     |1     |1     |
|def      |def        |bar        |2     |2     |2     |
+---------+-----------+-----------+------+------+------+

and the mapping table is called df2:

+-----------+-------------+-------------------+---------+-----------+
|question_id|question_name|question_text      |choice_id|choice_desc|
+-----------+-------------+-------------------+---------+-----------+
|Q1D102     |Gender       |What is your gender|1        |Male       |
|Q1D102     |Gender       |What is your gender|2        |Female     |
|Q1D102     |Gender       |What is your gender|3        |Diverse    |
|Q1D103     |Age          |What is your age   |1        |20 - 50    |
|Q1D103     |Age          |What is your age   |2        |50 >       |
|Q1D105     |work_status  |Do you work        |1        |Yes        |
|Q1D105     |work_status  |Do you work        |2        |No         |
+-----------+-------------+-------------------+---------+-----------+

We can construct a dynamic unpivot expression as below:

val columns = df.columns.filter(c => c.startsWith("Q1D"))

val data = columns.map(c => s"'$c', $c").mkString(",")

val finalExpr = s"stack(${columns.length}, $data) as (question_id, choice_id)"

With 3 questions (Q1D102, Q1D103 and Q1D105), we get the following expression:

stack(3, 'Q1D102', Q1D102,'Q1D103', Q1D103,'Q1D105', Q1D105) as (question_id, choice_id)
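Since the question columns vary from survey to survey, the prefix filter may not always apply. As a minimal sketch, the same expression could be built by excluding the known identifier columns instead. The object name `StackExpr`, the `build` helper, and its parameters are hypothetical, and the sketch assumes column names contain no quotes or backticks; it wraps each column reference in backticks so that unusual names still parse.

```scala
object StackExpr {
  // Hypothetical helper: builds a stack(...) expression that unpivots every
  // column NOT listed in idColumns (assumed to be the non-question columns).
  def build(allColumns: Seq[String], idColumns: Set[String]): String = {
    val questionColumns = allColumns.filterNot(idColumns.contains)
    require(questionColumns.nonEmpty, "no question columns to unpivot")

    // One ('label', `column`) pair per question column.
    val pairs = questionColumns.map(c => s"'$c', `$c`").mkString(", ")

    s"stack(${questionColumns.length}, $pairs) as (question_id, choice_id)"
  }
}
```

With the example table, `StackExpr.build(df.columns.toSeq, Set("survey_id", "response_id", "person_name"))` would yield the same three-question expression as above, only with backticked column references. Note that `stack` expects all unpivoted columns to share one data type, so mixed-type question columns would need a cast first.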

Finally, we use the constructed variable:

val result = df
  .selectExpr("survey_id", "response_id", "person_name", finalExpr)
  .join(df2, Seq("question_id", "choice_id"), "left")

You get this result:

+-----------+---------+---------+-----------+-----------+-------------+-------------------+-----------+
|question_id|choice_id|survey_id|response_id|person_name|question_name|question_text      |choice_desc|
+-----------+---------+---------+-----------+-----------+-------------+-------------------+-----------+
|Q1D102     |1        |xyz      |xyz        |john       |Gender       |What is your gender|Male       |
|Q1D102     |2        |def      |def        |bar        |Gender       |What is your gender|Female     |
|Q1D102     |3        |abc      |abc        |foo        |Gender       |What is your gender|Diverse    |
|Q1D103     |1        |abc      |abc        |foo        |Age          |What is your age   |20 - 50    |
|Q1D103     |2        |xyz      |xyz        |john       |Age          |What is your age   |50 >       |
|Q1D103     |2        |def      |def        |bar        |Age          |What is your age   |50 >       |
|Q1D105     |1        |xyz      |xyz        |john       |work_status  |Do you work        |Yes        |
|Q1D105     |1        |abc      |abc        |foo        |work_status  |Do you work        |Yes        |
|Q1D105     |2        |def      |def        |bar        |work_status  |Do you work        |No         |
+-----------+---------+---------+-----------+-----------+-------------+-------------------+-----------+

I think this is what you need (the rows are just unordered). Good luck!


4 Comments

Hey, thanks very much for your answer. I am sorry, I forgot to mention this: in the source table there are roughly 1000 or more question columns, and they are dynamic rather than fixed (they vary from survey to survey). Hence I was thinking of using a transpose, but I am not sure if that is the right approach. I will update this information in my question above. Thanks!
Do you mean there are 1000+ columns like Q1D102, Q1D103, Q1D105, etc.?
Yes, exactly. There are more than 1000 columns like Q1D102, Q1D103, Q1D105, Q1D106, Q1D107, etc.
Edited the post, good luck!
