
I have the below PySpark dataframe:

import pandas as pd

a = ['480s','480s','499s','499s','650s','650s','702s','702s','736s','736s','736s','737s','737s']
b = ['North','West','East','North','East','North','North','West','North','South','West','North','West']

df = pd.DataFrame(dict(dcode=a, zone=b))



dcode   zone
0   480s  North
1   480s   West
2   499s   East
3   499s  North
4   650s   East
5   650s  North
6   702s  North
7   702s   West
8   736s  North
9   736s  South
10  736s   West
11  737s  North
12  737s   West

I want my dataframe to look like this:

 dcode   zone output
0   480s  North     NW
1   480s   West     NW
2   499s   East       
3   499s  North     NW
4   650s   East       
5   650s  North     NW
6   702s  North       
7   702s   West       
8   736s  North       
9   736s  South       
10  736s   West       
11  737s  North       
12  737s   West  

To achieve this I am using the logic below, but it's not giving the desired results.

  df_ = df.withColumn("output", F.when((F.col("Zone") == "North") | (F.col("Zone") == "West") & (F.col("dcode") != "702s") | (F.col("dcode") != "736s") | (F.col("dcode") != "737s"), "NW"))

I want NW in the output column only when zone is North or West and dcode is not in 736s, 737s, 702s.

3 Answers


Please first convert your pandas df to a Spark one, since you are using PySpark syntax. Then I would advise rewriting your condition in a more concise and clearer way, using isin:

from pyspark.sql import functions as F
df = spark.createDataFrame(df)

df_ = df.withColumn("output", F.when(
    F.col("zone").isin("North", "West")
    & ~F.col("dcode").isin("736s", "737s", "702s"),
    "NW").otherwise(""))

>>> df_.show(truncate=False)

+-----+-----+------+
|dcode|zone |output|
+-----+-----+------+
|480s |North|NW    |
|480s |West |NW    |
|499s |East |      |
|499s |North|NW    |
|650s |East |      |
|650s |North|NW    |
|702s |North|      |
|702s |West |      |
|736s |North|      |
|736s |South|      |
|736s |West |      |
|737s |North|      |
|737s |West |      |
+-----+-----+------+
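If Spark isn't to hand, the same isin/mask logic can be sanity-checked directly on the original pandas frame; numpy.where here is my stand-in for F.when/otherwise:

```python
import numpy as np
import pandas as pd

a = ['480s','480s','499s','499s','650s','650s','702s','702s','736s','736s','736s','737s','737s']
b = ['North','West','East','North','East','North','North','West','North','South','West','North','West']
df = pd.DataFrame(dict(dcode=a, zone=b))

# Same predicate as the Spark version:
# zone in {North, West} AND dcode not in {736s, 737s, 702s}
mask = df["zone"].isin(["North", "West"]) & ~df["dcode"].isin(["736s", "737s", "702s"])
df["output"] = np.where(mask, "NW", "")

print(df)
```

Only the 480s, 499s-North, and 650s-North rows get NW, matching the show() output above.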

You can directly use SQL style expressions (expr function).

import pyspark.sql.functions as F
......
df = df.withColumn('output', F.expr("case when zone in ('North', 'West') and dcode not in ('736s', '737s', '702s') then 'NW' end"))
......
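The CASE expression inside expr() is plain SQL, so its semantics can be verified with any SQL engine. A quick sanity check using the stdlib sqlite3 as a stand-in for Spark SQL (the table name t and the row subset are mine, for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (dcode TEXT, zone TEXT)")
rows = [('480s', 'North'), ('480s', 'West'), ('499s', 'East'),
        ('702s', 'North'), ('736s', 'West')]
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

# The same CASE WHEN passed to F.expr() above; with no ELSE branch,
# non-matching rows yield NULL, just as in the Spark version.
result = conn.execute("""
    SELECT dcode, zone,
           CASE WHEN zone IN ('North', 'West')
                 AND dcode NOT IN ('736s', '737s', '702s')
                THEN 'NW' END AS output
    FROM t
""").fetchall()
print(result)
```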


Just check your parentheses — and note that the three != tests must be joined with &, not |: dcode can equal at most one of the three values, so (dcode != '702s') | (dcode != '736s') | (dcode != '737s') is always true and excludes nothing.

By the way, df = pd.DataFrame(dict(dcode=a, zone=b)) is not PySpark

from pyspark.sql import functions as F
import pandas as pd


a = ['480s','480s','499s','499s','650s','650s','702s','702s','736s','736s','736s','737s','737s']
b = ['North','West','East','North','East','North','North','West','North','South','West','North','West']


df = pd.DataFrame(dict(dcode=a, zone=b))

df_ = spark.createDataFrame(df)

df_ = df_.withColumn("output", F.when(
    ((F.col("zone") == "North") | (F.col("zone") == "West"))
    & (F.col("dcode") != "702s")
    & (F.col("dcode") != "736s")
    & (F.col("dcode") != "737s"), "NW"))

df_.show()

+-----+-----+------+
|dcode| zone|output|
+-----+-----+------+
| 480s|North|    NW|
| 480s| West|    NW|
| 499s| East|  null|
| 499s|North|    NW|
| 650s| East|  null|
| 650s|North|    NW|
| 702s|North|  null|
| 702s| West|  null|
| 736s|North|  null|
| 736s|South|  null|
| 736s| West|  null|
| 737s|North|  null|
| 737s| West|  null|
+-----+-----+------+
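Why the question's | between the != tests can never work: any single dcode value differs from at least two of the three literals, so the OR of the inequalities is always true and the "not in" part filters nothing. A plain-Python check of that, plus the De Morgan identity that gives the correct form:

```python
codes = ['480s', '702s', '736s', '737s']

# The OR of the three != tests is true for every value, even the excluded ones.
or_form = [(d != '702s') or (d != '736s') or (d != '737s') for d in codes]
print(or_form)  # True for all four codes

# De Morgan: NOT (a or b or c) == (not a) and (not b) and (not c),
# so "dcode not in (...)" needs AND between the != tests (& in PySpark),
# or equivalently ~F.col("dcode").isin(...).
and_form = [(d != '702s') and (d != '736s') and (d != '737s') for d in codes]
print(and_form)  # True only for '480s'
```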
