
I have a CSV file which, when read into a Spark DataFrame, shows the following in printSchema():

 |-- list_values: string (nullable = true)

the values in the column list_values are something like:

[[[167, 109, 80, ...]]]

Is it possible to convert this to array type instead of string?

I tried splitting it, using code I found online for similar problems:

from pyspark.sql.functions import split, col

df_1 = df.select('list_values', split(col("list_values"), ",\s*").alias("list_values"))

but if I run the above code, the array I get skips values from the original array, i.e.

output of the above code is:

[, 109, 80, 69, 5...

which is different from the original array (167 is missing), i.e.

[[[167, 109, 80, ...]]] 

Since I am new to Spark, I don't know much about how this is done (in plain Python I could have used ast.literal_eval, but Spark has no equivalent for that).

So I'll repeat the question:

How can I convert/cast an array stored as a string to an actual array, i.e.

'[]' to [] conversion
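
(For reference, this is roughly what I mean by the ast.literal_eval route in plain Python; just a sketch with a made-up sample string:)

import ast

# ast.literal_eval safely parses the bracketed string into real nested Python lists
s = "[[[167, 109, 80]]]"
parsed = ast.literal_eval(s)
print(parsed[0][0][0])  # 167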

1 Answer


Suppose your DataFrame was the following:

df.show()
#+----+------------------+
#|col1|              col2|
#+----+------------------+
#|   a|[[[167, 109, 80]]]|
#+----+------------------+

df.printSchema()
#root
# |-- col1: string (nullable = true)
# |-- col2: string (nullable = true)

You could use pyspark.sql.functions.regexp_replace to remove the leading and trailing square brackets. Once that's done, you can split the resulting string on ", ":

from pyspark.sql.functions import split, regexp_replace

df2 = df.withColumn(
    "col3",
    split(regexp_replace("col2", r"(^\[\[\[)|(\]\]\]$)", ""), ", ")
)
df2.show()

#+----+------------------+--------------+
#|col1|              col2|          col3|
#+----+------------------+--------------+
#|   a|[[[167, 109, 80]]]|[167, 109, 80]|
#+----+------------------+--------------+

df2.printSchema()
#root
# |-- col1: string (nullable = true)
# |-- col2: string (nullable = true)
# |-- col3: array (nullable = true)
# |    |-- element: string (containsNull = true)

If you wanted the column as an array of integers, you could use cast:

from pyspark.sql.functions import col
df2 = df2.withColumn("col3", col("col3").cast("array<int>"))
df2.printSchema()
#root
# |-- col1: string (nullable = true)
# |-- col2: string (nullable = true)
# |-- col3: array (nullable = true)
# |    |-- element: integer (containsNull = true)
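
As an aside, if you wanted to keep the full [[[...]]] nesting rather than a flat array, from_json can parse the string directly on Spark 2.4+ (a sketch, assuming the values are valid JSON arrays):

from pyspark.sql.functions import from_json
from pyspark.sql.types import ArrayType, IntegerType

# Triple-nested schema to match strings like "[[[167, 109, 80]]]"
nested_schema = ArrayType(ArrayType(ArrayType(IntegerType())))
df3 = df.withColumn("col3", from_json("col2", nested_schema))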

6 Comments

The method (the regex replacement works like a charm) followed by casting to an array of integers works for me. Upvoted the answer for correctly solving the issue. Thanks
@kunal there's only one method here. The casting is an optional 2nd step if you wanted to transform the resultant split array from an array of strings into an array of ints. You could also combine them into one step: split(regexp_replace("col2", r"(^\[\[\[)|(\]\]\]$)", ""), ", ").cast("array<int>"). Again, it's up to you depending on what you want to do with the data.
Hey pault, am I right that the regex takes care of the replacement, and the second step casts the newly created column to an array of integers?
@kunal yes. If you notice, I overwrote col3 in the last step. It's doing the cast on the result of the regex+split.
I guess trim(BOTH '[]' FROM col2) would be more efficient than regexp_replace
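
For what it's worth, a minimal sketch of that trim-based variant (assuming the same df as above; TRIM(BOTH '[]' FROM col2) strips any leading or trailing '[' and ']' characters):

from pyspark.sql.functions import expr, split

# Strip the brackets with SQL TRIM instead of regexp_replace, then split and cast
df4 = df.withColumn(
    "col3",
    split(expr("trim(BOTH '[]' FROM col2)"), ", ").cast("array<int>")
)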