
I have the following data frame which has the following structure and values

root
 |-- Row_number: integer (nullable = false)
 |-- Factor: string (nullable = false)
 |-- Country: string (nullable = false)
 |-- Date: date (nullable = false)
 |-- Amount: integer (nullable = false)

+----------+---------------------+
|Row_number|Factor               |
+----------+---------------------+
|         1|[EN2_1, EN2_2, EN3_3]|
|         2|[EN2_1, EN2_2, EN3_3]|
|         3|[EN2_1, EN2_2, EN3_3]|
+----------+---------------------+

I want to convert it into the following data frame:

1, EN2_1
1, EN2_2
1, EN3_3
2, EN2_1
2, EN2_2
2, EN3_3
3, EN2_1
3, EN2_2
3, EN3_3

I tried to read the column as ArrayType, but it gives an error.

2 Answers


A combination of split and explode should work:

import pyspark.sql.functions as F

df = df.withColumn(
    "New_Factor",
    F.explode(F.split(F.regexp_replace(F.col("Factor"), r"(^\[)|(\]$)", ""), ", "))
)
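For intuition, here is what the regexp_replace and split steps do to a single cell value, mirrored in plain Python (the sample value is assumed to match the format shown in the question):

```python
import re

# A sample cell value as it would appear in the string column (assumed format)
raw = "[EN2_1, EN2_2, EN3_3]"

# Mirror F.regexp_replace(col, r"(^\[)|(\]$)", ""): strip the enclosing brackets
stripped = re.sub(r"(^\[)|(\]$)", "", raw)

# Mirror F.split(..., ", "): produce the array that explode() turns into rows
factors = stripped.split(", ")
print(factors)  # ['EN2_1', 'EN2_2', 'EN3_3']
```

explode then emits one row per element of that array, repeating the other columns (here, Row_number) for each element.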



You can first remove the square brackets using trim(both '[]' from Factor), then split on ', ' and explode the resulting array into rows:

import pyspark.sql.functions as F

df2 = df.withColumn(
    'Factor',
    F.explode(F.split(F.expr("trim(both '[]' from Factor)"), ', '))
)

df2.show()
+----------+------+
|Row_number|Factor|
+----------+------+
|         1| EN2_1|
|         1| EN2_2|
|         1| EN3_3|
|         2| EN2_1|
|         2| EN2_2|
|         2| EN3_3|
|         3| EN2_1|
|         3| EN2_2|
|         3| EN3_3|
+----------+------+
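The trim step here maps directly onto Python's str.strip, which may help if the SQL syntax is unfamiliar; a minimal plain-Python mirror of the same transformation (using an assumed sample value):

```python
# Mimic SQL trim(both '[]' from Factor) followed by split(..., ', ')
raw = "[EN2_1, EN2_2, EN3_3]"

# str.strip("[]") removes any leading/trailing '[' or ']' characters,
# just as trim(both '[]' from ...) does in Spark SQL
parts = raw.strip("[]").split(", ")
print(parts)  # ['EN2_1', 'EN2_2', 'EN3_3']
```

Both answers rely on the column being a plain string that merely looks like an array; if it were a true ArrayType, F.explode(F.col('Factor')) alone would suffice.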

