
I am new to Spark and need help transposing the input dataframe below into the desired output dataframe (rows to columns) using PySpark or Spark SQL.

Input Dataframe-

A    B    C    D
1    2    3    4
10   11   12   13
...

Required Output (transposed) data

A  1
B  2
C  3
D  4
A  10
B  11
C  12
D  13
...

Ideally, I want to be able to pivot whichever of the input columns our requirements call for.

3 Answers


Use the stack() function in Spark.

Example:

df.show()
#+---+---+---+---+
#|  A|  B|  C|  D|
#+---+---+---+---+
#|  1|  2|  3|  4|
#| 10| 11| 12| 13|
#+---+---+---+---+

from pyspark.sql.functions import *

df.selectExpr("stack(4,'A',A,'B',B,'C',C,'D',D) as (key,value)").show()
#+---+-----+
#|key|value|
#+---+-----+
#|  A|    1|
#|  B|    2|
#|  C|    3|
#|  D|    4|
#|  A|   10|
#|  B|   11|
#|  C|   12|
#|  D|   13|
#+---+-----+
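Typing the stack arguments by hand gets tedious as the column count grows; the expression string passed to selectExpr can instead be generated from the column list. A minimal sketch in plain Python (build_stack_expr is a hypothetical helper, not part of the PySpark API):

```python
def build_stack_expr(columns):
    """Build the stack() expression string used with selectExpr
    from a list of column names."""
    # Each column contributes a "'label',column" pair to the argument list.
    pairs = ",".join(f"'{c}',{c}" for c in columns)
    return f"stack({len(columns)},{pairs}) as (key,value)"

expr = build_stack_expr(["A", "B", "C", "D"])
# expr == "stack(4,'A',A,'B',B,'C',C,'D',D) as (key,value)"
# df.selectExpr(expr).show() would then print the same key/value table as above.
```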


You can make a generalized function like the one below (inspired by my previous answer here):

def stack_multiple(data, cols=None, output_columns=["col", "values"]):
    """Stacks multiple columns in a dataframe;
       takes all columns by default unless passed a list of column names."""

    cols = data.columns if cols is None else cols
    cols = [cols] if isinstance(cols, str) else cols

    # Build the stack() argument list, e.g. "A",`A`,"B",`B`
    pairs = ",".join(map(",".join,
                         zip([f'"{c}"' for c in cols], [f"`{c}`" for c in cols])))

    return data.selectExpr(
        f"stack({len(cols)},{pairs}) as ({','.join(output_columns)})")

Sample Runs:

stack_multiple(df).show()

+---+------+
|col|values|
+---+------+
|  A|     1|
|  B|     2|
|  C|     3|
|  D|     4|
|  A|    10|
|  B|    11|
|  C|    12|
|  D|    13|
+---+------+

stack_multiple(df,['A','B'],output_columns=['A','B']).show()

+---+---+
|  A|  B|
+---+---+
|  A|  1|
|  B|  2|
|  A| 10|
|  B| 11|
+---+---+
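The nested zip/map/join expression inside stack_multiple is dense; pulled out on its own (plain Python, no Spark needed), it just interleaves a double-quoted label with a backtick-escaped column reference for each column:

```python
cols = ["A", "B"]
# Pair each label string literal with its escaped column name,
# join each pair with a comma, then join all pairs with commas.
inner = ",".join(map(",".join,
                     zip([f'"{c}"' for c in cols], [f"`{c}`" for c in cols])))
# inner == '"A",`A`,"B",`B`'
```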


from functools import reduce
from pyspark.sql import functions as F

df = spark.createDataFrame([[1, 2, 3, 4], [10, 11, 12, 13]]).toDF('A', 'B', 'C', 'D')
df.show()
+---+---+---+---+
|  A|  B|  C|  D|
+---+---+---+---+
|  1|  2|  3|  4|
| 10| 11| 12| 13|
+---+---+---+---+


df.select(
    F.explode(
        F.create_map(
            *reduce(lambda x, y: x + y,
                [[F.lit(col), col] for col in df.columns]
            )
        )
    )
).show()
+---+-----+
|key|value|
+---+-----+
|  A|    1|
|  B|    2|
|  C|    3|
|  D|    4|
|  A|   10|
|  B|   11|
|  C|   12|
|  D|   13|
+---+-----+
