
I have 500 columns in my PySpark DataFrame. Some are of string type, some int, and some boolean (100 boolean columns). All of the boolean columns have two distinct levels, Yes and No, and I want to convert those into 1/0.

For the string columns I have three values: passed, failed, and null. How do I replace those nulls with 0? fillna(0) works only with integers.

 c1 | c2  | c3     | c4  | c5 ... | c500
 yes| yes | passed | 45  | ...
 No | Yes | failed | 452 | ...
 Yes| No  | None   | 32  | ...

When I do

df.replace('yes', 1)

I get the following error:

ValueError: Mixed type replacements are not supported

4 Answers


For the string columns I have three values: passed, failed, and null. How do I replace those nulls with 0? fillna(0) works only with integers.

First, import when, lit, and col:

from pyspark.sql.functions import when, lit, col

Assuming your DataFrame has these columns

# Reconstructing my DataFrame based on your assumptions
# cols are Columns in the DataFrame
cols = ['name', 'age', 'col_with_string']

# Similarly the values
vals = [
     ('James', 18, 'passed'),
     ('Smith', 15, 'passed'),
     ('Albie', 32, 'failed'),
     ('Stacy', 33, None),
     ('Morgan', 11, None),
     ('Dwight', 12, None),
     ('Steve', 16, 'passed'), 
     ('Shroud', 22, 'passed'),
     ('Faze', 11,'failed'),
     ('Simple', 13, None)
]

# This will create a DataFrame using 'cols' and 'vals'
# spark is an object of SparkSession
df = spark.createDataFrame(vals, cols)

# We have the following DataFrame
df.show()

+------+---+---------------+
|  name|age|col_with_string|
+------+---+---------------+
| James| 18|         passed|
| Smith| 15|         passed|
| Albie| 32|         failed|
| Stacy| 33|           null|
|Morgan| 11|           null|
|Dwight| 12|           null|
| Steve| 16|         passed|
|Shroud| 22|         passed|
|  Faze| 11|         failed|
|Simple| 13|           null|
+------+---+---------------+

You can use:

  • withColumn() - to specify the column you want to modify.
  • isNull() - a filter that evaluates to true iff the attribute it is applied to is null.
  • lit() - creates a column of literal values.
  • when(), otherwise() - used to check a condition against the column and pick one value when it holds and another otherwise.

You can then replace the null values with '0':

df = df.withColumn('col_with_string',
                   when(df.col_with_string.isNull(), lit('0'))
                   .otherwise(df.col_with_string))

# We have replaced nulls with a '0'
df.show()

+------+---+---------------+
|  name|age|col_with_string|
+------+---+---------------+
| James| 18|         passed|
| Smith| 15|         passed|
| Albie| 32|         failed|
| Stacy| 33|              0|
|Morgan| 11|              0|
|Dwight| 12|              0|
| Steve| 16|         passed|
|Shroud| 22|         passed|
|  Faze| 11|         failed|
|Simple| 13|              0|
+------+---+---------------+

Part 1 of your question: Yes/No boolean values - you mentioned that there are 100 Boolean columns. For these, I generally reconstruct the table with updated values or create a UDF that returns 1 or 0 for Yes or No.
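A minimal sketch of such a UDF, assuming the Yes/No values are strings as in your example (the name yes_no_to_int is only illustrative):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Maps 'Yes'/'yes' to 1 and anything else (including 'No' and null) to 0
yes_no_to_int = udf(lambda v: 1 if v is not None and v.lower() == 'yes' else 0, IntegerType())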

I am adding two more columns can_vote and can_lotto to the DataFrame (df)

df = df.withColumn("can_vote", col('Age') >= 18)
df = df.withColumn("can_lotto", col('Age') > 16) 

# Updated DataFrame will be
df.show()

+------+---+---------------+--------+---------+
|  name|age|col_with_string|can_vote|can_lotto|
+------+---+---------------+--------+---------+
| James| 18|         passed|    true|     true|
| Smith| 15|         passed|   false|    false|
| Albie| 32|         failed|    true|     true|
| Stacy| 33|              0|    true|     true|
|Morgan| 11|              0|   false|    false|
|Dwight| 12|              0|   false|    false|
| Steve| 16|         passed|   false|    false|
|Shroud| 22|         passed|    true|     true|
|  Faze| 11|         failed|   false|    false|
|Simple| 13|              0|   false|    false|
+------+---+---------------+--------+---------+

Assuming you have columns similar to can_vote and can_lotto (boolean columns whose values stand for Yes/No):

You can use the following line of code to fetch the columns in the DataFrame having boolean type

col_with_bool = [item[0] for item in df.dtypes if item[1].startswith('boolean')]

This returns a list

['can_vote', 'can_lotto']

You can create a UDF and iterate over each column in this list, mapping each one to 1 (Yes) or 0 (No), as sketched below.
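As a rough sketch of that iteration (the column names c1 and c2 are only illustrative): real true/false boolean columns can simply be cast to integer to get 1/0, while the Yes/No string columns from your question can be fed through the UDF sketched above.

from pyspark.sql.functions import col

# Cast true/false boolean columns to 1/0
for c in col_with_bool:
    df = df.withColumn(c, col(c).cast('integer'))

# Map hypothetical Yes/No string columns to 1/0 with the UDF from earlier
for c in ['c1', 'c2']:
    df = df.withColumn(c, yes_no_to_int(col(c)))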




I tried to replicate your issue with the data below:

import pandas as pd
df_test = pd.DataFrame([['yes', 'pass', 1.2], ['No', 'pass', 34], ['yes', None, 0.4], [0, 1, 'No'], ['No', 1, True], ['NO', 'YES', 1]])

Then I just use:

df_test.replace('yes', 1)

2 Comments

I don't get any error, but I don't get the output either. I am using pyspark.
I think you're not seeing output since there's no collect() or show(); try the above with either of those at the end, or just do df2 = df_test.replace('yes', 1) and then df2.show()
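As a minimal PySpark sketch of that suggestion (assuming df_test is a Spark DataFrame rather than a pandas one): the replacement value must have the same type as the column, so replace the string 'yes' with the string '1' to avoid the mixed-type error from the question.

df2 = df_test.replace('yes', '1')
df2.show()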

You should try using df.na.fill(), but make the distinction between columns in the argument you pass to fill.

You would have something like:

df_test.na.fill({"value": "", "c4": 0}).show()
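Applied to the columns from the question (where c3 holds passed/failed/null and c4 looks numeric), a sketch might be:

# Fill string nulls in c3 with "0" and numeric nulls in c4 with 0 in one call
df.na.fill({"c3": "0", "c4": 0}).show()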



You can use Koalas to do pandas-like operations in Spark. However, you need to respect the schema of a given DataFrame. Using Koalas you could do the following:

df = df.replace('yes','1')

Once you have replaced all strings with digits, you can cast the column to int. If you want to replace certain empty values with NaNs, I can recommend doing the following:

df = df.replace(['?'], None)
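As a minimal Koalas sketch of the replace-then-cast step, assuming the databricks.koalas package is installed (the column name c1 is illustrative):

import databricks.koalas as ks

kdf = ks.DataFrame({'c1': ['yes', 'No', 'Yes']})
kdf['c1'] = kdf['c1'].replace('yes', '1').replace('Yes', '1').replace('No', '0')
kdf['c1'] = kdf['c1'].astype(int)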
