I have a table in this format:
| name | fruits | apple | banana | orange |
|---|---|---|---|---|
| Alice | ["apple","banana","orange"] | 5 | 8 | 3 |
| Bob | ["apple"] | 2 | 9 | 1 |
I want to add a new column that contains a JSON object in this format, where each key is an element of the array and each value is taken from the column of the same name:
| name | fruits | apple | banana | orange | new_col |
|---|---|---|---|---|---|
| Alice | ["apple","banana","orange"] | 5 | 8 | 3 | {"apple":5, "banana":8, "orange":3} |
| Bob | ["apple"] | 2 | 9 | 1 | {"apple":2} |
Any thoughts on how to proceed? I'm assuming a UDF, but I can't get the right syntax.
This is as far as I've got with the code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import MapType, StringType
# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()
# Sample data
data = [("Alice", ["apple", "banana", "orange"], 5, 8, 3),
("Bob", ["apple"], 2, 9, 1)]
# Define the schema
schema = ["name", "fruits", "apple", "banana", "orange"]
# Create a DataFrame
df = spark.createDataFrame(data, schema=schema)
# Show the initial DataFrame
print("Initial DataFrame:")
display(df)
# Define a UDF to create a dictionary
@udf(MapType(StringType(), StringType()))
def json_map(fruits):
    result = {}
    for i in fruits:
        # This is the part I can't get right: col(i) just builds a Column
        # expression, it doesn't give me the row's value for that column
        result[i] = col(i)
    return result
# Apply the UDF to the 'fruits' column
new_df = df.withColumn('new_col', json_map(col('fruits')))
# Display the updated DataFrame
display(new_df)
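One direction I've been poking at (no idea if it's the idiomatic way) is to also pass the value columns into the UDF as a struct, so the row's values are actually available inside the function. A rough sketch, where fruit_cols and fruits_to_map are just placeholder names and I'm assuming the value columns are always apple, banana and orange:

from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType, IntegerType

# Assumed here: the per-fruit value columns are known up front
fruit_cols = ["apple", "banana", "orange"]

# Pass the fruits array plus a struct of the value columns, so each row's
# values are visible inside the UDF (the struct arrives as a Row)
@F.udf(MapType(StringType(), IntegerType()))
def fruits_to_map(fruits, values):
    # Keep only the fruits listed in the array, looked up by field name
    return {f: values[f] for f in fruits}

new_df = df.withColumn("new_col", fruits_to_map(F.col("fruits"), F.struct(*fruit_cols)))
display(new_df)

# If new_col needs to be an actual JSON string rather than a map:
# new_df = new_df.withColumn("new_col", F.to_json(F.col("new_col")))

I think this gives a map column rather than a JSON string; if I need an actual string like in the table above, presumably to_json on the map would finish the job. Is that roughly the right approach, or is there a cleaner way to do this without a UDF?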