I have a PySpark DataFrame that looks like this:

Subscription_id Subscription parameters
5516            ["'catchupNotificationsEnabled': True","'newsNotificationsEnabled': True","'autoDownloadsEnabled': False"]

I need the output DataFrame to look like this:

Subscription_id catchupNotificationsEnabled newsNotificationsEnabled autoDownloadsEnabled
5516            True                        True                     False

How can I achieve this in PySpark? I have tried several approaches using UDFs but couldn't get any of them to work.

Any help is greatly appreciated.

  • Do you know the keys ahead of time? Commented Oct 16, 2018 at 22:45
  • @pault Yes, there are only these 3 parameters: catchupNotificationsEnabled, newsNotificationsEnabled and autoDownloadsEnabled, with different values of True and False for different records. Commented Oct 16, 2018 at 22:55
  • Could you provide the schema of the DataFrame? Is "Subscription parameters" of type StructType() or ArrayType() (or something else)? Commented Oct 17, 2018 at 9:18

2 Answers

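You can use something like the following. Note that it parses only the first collected row (data[0]) and cross-joins the result, so as written it assumes a single-row DataFrame: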

>>> df.show()
+---------------+-----------------------+
|Subscription_id|Subscription_parameters|
+---------------+-----------------------+
|           5516|   ["'catchupNotific...|
+---------------+-----------------------+

>>> 
>>> df1 = df.select('Subscription_id')
>>> 
>>> data = df.select('Subscription_parameters').rdd.map(list).collect()
>>> data = [i[0][1:-1].split(',') for i in data]
>>> data = {i.split(':')[0][2:-1]:i.split(':')[1].strip()[:-1] for i in data[0]}
>>> 
>>> df2 = spark.createDataFrame(sc.parallelize([data]))
>>> 
>>> df3 = df1.crossJoin(df2)
>>> 
>>> df3.show()
+---------------+--------------------+---------------------------+------------------------+
|Subscription_id|autoDownloadsEnabled|catchupNotificationsEnabled|newsNotificationsEnabled|
+---------------+--------------------+---------------------------+------------------------+
|           5516|               False|                       True|                    True|
+---------------+--------------------+---------------------------+------------------------+
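If there is more than one row, the slicing above would have to run per record rather than on data[0] only. A minimal plain-Python sketch of that per-record parsing step follows, assuming the column holds the list as a string exactly like the one in the question. The name parse_params is hypothetical and not part of the answer above; a function like this could be wrapped in a UDF or mapped over the RDD.

```python
import ast

def parse_params(raw):
    """Turn the list-like value into a dict such as {'autoDownloadsEnabled': False}."""
    # Accept either the raw string form or an already-parsed list of strings.
    items = ast.literal_eval(raw) if isinstance(raw, str) else raw
    parsed = {}
    for item in items:
        # Each item looks like "'name': True"; wrapping it in braces
        # makes it a valid Python dict literal.
        parsed.update(ast.literal_eval("{" + item + "}"))
    return parsed

example = """["'catchupNotificationsEnabled': True","'newsNotificationsEnabled': True","'autoDownloadsEnabled': False"]"""
print(parse_params(example))
```

ast.literal_eval only evaluates literals, so it avoids the risks of eval, but it will raise an error on records that do not follow this exact format.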

1 Comment

Thanks for the help guys. Both the solutions work for me!

Let's suppose your "Subscription parameters" column is of type ArrayType().

from pyspark.sql import functions as F
from pyspark.sql import Row
from pyspark.sql import SparkSession

# createDataFrame is a method of SparkSession, not SparkContext
spark = SparkSession.builder.getOrCreate()

First, create the DataFrame:

df = spark.createDataFrame([Row(Subscription_id=5516,
                                Subscription_parameters=["'catchupNotificationsEnabled': True",
                                                         "'newsNotificationsEnabled': True",
                                                         "'autoDownloadsEnabled': False"])])

Split this array into three columns by simple indexing:

df = df.select("Subscription_id", 
      F.col("Subscription_parameters")[0].alias("catchupNotificationsEnabled"),
      F.col("Subscription_parameters")[1].alias("newsNotificationsEnabled"),
      F.col("Subscription_parameters")[2].alias("autoDownloadsEnabled"))

Now your DataFrame is properly split. Each new column contains a string such as "'catchupNotificationsEnabled': True":

+---------------+---------------------------+------------------------+--------------------+
|Subscription_id|catchupNotificationsEnabled|newsNotificationsEnabled|autoDownloadsEnabled|
+---------------+---------------------------+------------------------+--------------------+
|           5516|       'catchupNotificat...|    'newsNotification...|'autoDownloadsEna...|
+---------------+---------------------------+------------------------+--------------------+

Then I suggest replacing each column's value with a boolean by checking whether the string contains "True":

df = df.withColumn('catchupNotificationsEnabled',
                  F.when(F.col("catchupNotificationsEnabled").contains("True"), True).otherwise(False))\
        .withColumn('newsNotificationsEnabled',
                   F.when(F.col("newsNotificationsEnabled").contains("True"), True).otherwise(False))\
        .withColumn('autoDownloadsEnabled',
                   F.when(F.col("autoDownloadsEnabled").contains("True"), True).otherwise(False))

The resulting DataFrame is as expected:

+---------------+---------------------------+------------------------+--------------------+
|Subscription_id|catchupNotificationsEnabled|newsNotificationsEnabled|autoDownloadsEnabled|
+---------------+---------------------------+------------------------+--------------------+
|           5516|                       true|                    true|               false|
+---------------+---------------------------+------------------------+--------------------+

PS: if the column is not of type ArrayType(), you might have to modify this code a little. See this question for example.
