1

I will change number in column date with value in monthList array.

monthList array

monthList = ["None","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"]

pyspark code

d = df.select(col('InvoiceDate'),col('TotalSales')/1000000).groupBy(month('InvoiceDate')).sum()
d = d.select(col('month(InvoiceDate)').alias('date'),col('sum((TotalSales / 1000000))').alias('value')).orderBy('date')
d = d.select(col('date'),round(col('value'),2).alias('value'))
d.show()

result

+----+-----+
|date|value|
+----+-----+
|   1|19.75|
|   2|15.51|
|   3|20.66|
+----+-----+

I will try this but it not working. It's error 'DataFrame' object has no attribute 'apply'

d.date = d.select('date').apply(lambda x: monthList[x]) 

Thank you for your helping.

2 Answers 2

2

You can try this alternative-

df1.selectExpr(
      s"element_at(array('None','Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'), date) as date",
      "value"
    ).show(false)

    /**
      * +----+-----+
      * |date|value|
      * +----+-----+
      * |None|19.75|
      * |Jan |15.51|
      * |Feb |20.66|
      * +----+-----+
      */
Sign up to request clarification or add additional context in comments.

Comments

2

One approach can be creating dataframe(date_lookup) upfront with the monthlist. This df can be broadcasted for performance. Then you can perform left join with actual df.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

 spark = SparkSession.builder \
.appName('practice')\
.getOrCreate()

 sc= spark.sparkContext

monthList = ["None","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"]

lookup_list = []

for i in range(len(monthList)):
   lookup_list.append((i,monthList[i]))


date_lookup = sc.parallelize(lookup_list).toDF(["date_num", "date_label"])

date_lookup.show()

 +--------+----------+
 |date_num|date_label|
 +--------+----------+
 |       0|      None|
 |       1|       Jan|
 |       2|       Feb|
 |       3|       Mar|
 |       4|       Apr|
 |       5|       May|
 |       6|       Jun|
 |       7|       Jul|
 |       8|       Aug|
 |       9|       Sep|
 |      10|       Oct|
 |      11|       Nov|
 |      12|       Dec|
 +--------+----------+

 df= sc.parallelize([
 (1,19.75), (2,15.51)]).toDF(["date", "value"])

 +----+-----+
 |date|value|
 +----+-----+
 |   1|19.75|
 |   2|15.51|
 +----+-----+

 df1  = df.join(F.broadcast(date_lookup),df.date==date_lookup.date_num, how='left').select('date_label','value')

 df1.show()

 +----------+-----+
 |date_label|value|
 +----------+-----+
 |       Jan|19.75|
 |       Feb|15.51|
 +----------+-----+

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.