
I have a CSV file which, when read into a Spark DataFrame, shows the following in printSchema():

 |-- list_values: string (nullable = true)

the values in the column list_values are something like:

[[[167, 109, 80, ...]]]

Is it possible to convert this to array type instead of string?

I tried splitting it, using code I found online for similar problems:

from pyspark.sql.functions import split, col

df_1 = df.select('list_values', split(col("list_values"), ",\s*").alias("list_values"))

but if I run the above code, the array I get skips values from the original array, i.e.

output of the above code is:

[, 109, 80, 69, 5...

which is different from the original array (167 is missing), i.e.

[[[167, 109, 80, ...]]] 

Since I am new to Spark, I don't know much about how this is done (in plain Python I could have used ast.literal_eval, but Spark has no equivalent for that).

So I'll repeat the question:

How can I convert/cast an array stored as a string to an actual array, i.e.

'[]' to [] conversion
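
(For reference, this is roughly what I mean by the ast.literal_eval route in plain Python; just a sketch with a made-up sample string:)

import ast

# ast.literal_eval safely parses the bracketed string into real nested Python lists
s = "[[[167, 109, 80]]]"
parsed = ast.literal_eval(s)
print(parsed[0][0][0])  # 167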

1 Answer


Suppose your DataFrame was the following:

df.show()
#+----+------------------+
#|col1|              col2|
#+----+------------------+
#|   a|[[[167, 109, 80]]]|
#+----+------------------+

df.printSchema()
#root
# |-- col1: string (nullable = true)
# |-- col2: string (nullable = true)

You could use pyspark.sql.functions.regexp_replace to remove the leading and trailing square brackets. Once that's done, you can split the resulting string on ", ":

from pyspark.sql.functions import split, regexp_replace

df2 = df.withColumn(
    "col3",
    split(regexp_replace("col2", r"(^\[\[\[)|(\]\]\]$)", ""), ", ")
)
df2.show()

#+----+------------------+--------------+
#|col1|              col2|          col3|
#+----+------------------+--------------+
#|   a|[[[167, 109, 80]]]|[167, 109, 80]|
#+----+------------------+--------------+

df2.printSchema()
#root
# |-- col1: string (nullable = true)
# |-- col2: string (nullable = true)
# |-- col3: array (nullable = true)
# |    |-- element: string (containsNull = true)

If you wanted the column as an array of integers, you could use cast:

from pyspark.sql.functions import col
df2 = df2.withColumn("col3", col("col3").cast("array<int>"))
df2.printSchema()
#root
# |-- col1: string (nullable = true)
# |-- col2: string (nullable = true)
# |-- col3: array (nullable = true)
# |    |-- element: integer (containsNull = true)
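
As an aside, if you wanted to keep the full [[[...]]] nesting rather than a flat array, from_json can parse the string directly on Spark 2.4+ (a sketch, assuming the values are valid JSON arrays):

from pyspark.sql.functions import from_json
from pyspark.sql.types import ArrayType, IntegerType

# Triple-nested schema to match strings like "[[[167, 109, 80]]]"
nested_schema = ArrayType(ArrayType(ArrayType(IntegerType())))
df3 = df.withColumn("col3", from_json("col2", nested_schema))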

6 Comments

The method (the regex replacement works like a charm) followed by casting to an array of integers works for me. Upvoted the answer for correctly solving the issue. Thanks
@kunal there's only one method here. The casting is an optional 2nd step if you wanted to transform the resultant split array from an array of strings into an array of ints. You could also combine them into one step: split(regexp_replace("col2", r"(^\[\[\[)|(\]\]\]$)", ""), ", ").cast("array<int>"). Again, it's up to you depending on what you want to do with the data.
Hey pault, am I right that the regex takes care of the replacement, and the second step casts the newly created column to an array of integers?
@kunal yes. If you notice, I overwrote col3 in the last step. It's doing the cast on the result of the regex+split.
I guess trim(BOTH '[]' FROM col2) would be more efficient than regexp_replace
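
For what it's worth, a minimal sketch of that trim-based variant (assuming the same df as above; TRIM(BOTH '[]' FROM col2) strips any leading or trailing '[' and ']' characters):

from pyspark.sql.functions import expr, split

# Strip the brackets with SQL TRIM instead of regexp_replace, then split and cast
df4 = df.withColumn(
    "col3",
    split(expr("trim(BOTH '[]' FROM col2)"), ", ").cast("array<int>")
)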