170
>>> a
DataFrame[id: bigint, julian_date: string, user_id: bigint]
>>> b
DataFrame[id: bigint, quan_created_money: decimal(10,0), quan_created_cnt: bigint]
>>> a.join(b, a.id==b.id, 'outer')
DataFrame[id: bigint, julian_date: string, user_id: bigint, id: bigint, quan_created_money: decimal(10,0), quan_created_cnt: bigint]

There are two id: bigint columns and I want to delete one. How can I do this?

9 Answers

207

Reading the Spark documentation I found an easier solution.

Since Spark 1.4 there is a drop(col) function which can be used in PySpark on a DataFrame.

You can use it in two ways:

  1. df.drop('age')
  2. df.drop(df.age)

PySpark Documentation - Drop
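Applied to the join from the question, a minimal sketch (assuming Spark 1.4+ and the a and b DataFrames shown above) could look like this:

joined = a.join(b, a.id == b.id, 'outer')
# dropping by Column reference removes only b's copy of the ambiguous id
deduped = joined.drop(b.id)   # keeps a.id, drops b.id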


2 Comments

When the data size is large, collect() might cause a heap space error. You can also create a new dataframe dropping the extra field with ndf = df.drop('age')
There is absolutely no reason to use collect for this operation, so I removed it from this answer.
171

Adding to @Patrick's answer, you can use the following to drop multiple columns

columns_to_drop = ['id', 'id_copy']
df = df.drop(*columns_to_drop)
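For example, a quick sketch with a hypothetical DataFrame (assumes an active SparkSession named spark):

df = spark.createDataFrame([(1, 1, "a")], ["id", "id_copy", "name"])
columns_to_drop = ['id', 'id_copy']
df = df.drop(*columns_to_drop)   # the * unpacks the list into separate arguments
print(df.columns)                # ['name']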

5 Comments

I had to reassign the drop results back to the dataframe: df = df.drop(*columns_to_drop)
Note that you will not get an error if the column does not exist
I get an error saying TreeNodeException: Binding attribute, tree: _gen_alias_34#34 after I drop a column, and use .show()
What does the asterisk * mean in *columns_to_drop?
The * is to unpack the list. (*[a,b,c]) becomes (a,b,c)
36

An easy way to do this is to use select and realize you can get a list of all columns for the dataframe, df, with df.columns

drop_list = ['a column', 'another column', ...]

df.select([column for column in df.columns if column not in drop_list])
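Note that select returns a new DataFrame rather than modifying df in place, so assign the result if you want to keep it; a minimal sketch with hypothetical column names:

drop_list = ['b', 'c']
df = df.select([column for column in df.columns if column not in drop_list])
df.show()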

1 Comment

Thank you, this works great for me for removing duplicate columns with the same name as another column, where I use df.select([df.columns[column_num] for column_num in range(len(df.columns)) if column_num!=2]), where the column I want to remove has index 2.
25

You can do it in two ways:

1: You just keep the necessary columns:

drop_column_list = ["drop_column"]
df = df.select([column for column in df.columns if column not in drop_column_list])  

2: This is the more elegant way.

df = df.drop("col_name")

You should avoid the collect() version, because it sends the complete dataset to the driver, which takes a big computing effort!


14

You could either explicitly name the columns you want to keep, like so:

keep = [a.id, a.julian_date, a.user_id, b.quan_created_money, b.quan_created_cnt]

Or in a more general approach you'd include all columns except for a specific one via a list comprehension. For example like this (excluding the id column from b):

keep = [a[c] for c in a.columns] + [b[c] for c in b.columns if c != 'id']

Finally you make a selection on your join result:

d = a.join(b, a.id==b.id, 'outer').select(*keep)

4 Comments

I think I got the answer. Select needs to take a list of strings NOT a list of columns. So do this: keep = [c for c in a.columns] + [c for c in b.columns if c != 'id'] d = a.join(b, a.id==b.id, 'outer').select(*keep)
Well, that should do exactly the same thing as my answer, as I'm pretty sure that select accepts either strings OR columns (spark.apache.org/docs/latest/api/python/…). Btw, in your line keep = ... there's no need to use a list comprehension for a: a.columns + [c for c in b.columns if c != 'id'] should achieve the exact same thing, as a.columns is already a list of strings.
@deusxmach1na Actually the column selection based on strings cannot work for the OP, because that would not solve the ambiguity of the id column. In that case you have to use the Column instances in select.
All good points. I tried your solution in Spark 1.3 and got errors, so what I posted actually worked for me. And to resolve the id ambiguity I renamed my id column before the join then dropped it after the join using the keep list. HTH anyone else that was stuck like I was.
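A minimal sketch of that rename-before-join workaround (hypothetical, using the a and b DataFrames from the question):

b_renamed = b.withColumnRenamed('id', 'b_id')
d = a.join(b_renamed, a.id == b_renamed.b_id, 'outer').drop('b_id')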
4

Maybe a little bit off topic, but here is the solution using Scala. Make an Array of column names from your oldDataFrame and delete the columns that you want to drop ("colExclude"). Then pass the Array[Column] to select and unpack it.

import org.apache.spark.sql.{Column, DataFrame}

val columnsToKeep: Array[Column] = oldDataFrame.columns.diff(Array("colExclude"))
                                               .map(x => oldDataFrame.col(x))
val newDataFrame: DataFrame = oldDataFrame.select(columnsToKeep: _*)


2

Yes, it is possible to drop/select columns by slicing like this:

slice = data.columns[a:b]

data.select(slice).show()

Example:

newDF = spark.createDataFrame([(1, "a", "4", 0),
                               (2, "b", "10", 3),
                               (7, "b", "4", 1),
                               (7, "d", "4", 9)],
                              ("id", "x1", "x2", "y"))


slice = newDF.columns[1:3]
newDF.select(slice).show()

Use the select method to get the feature columns:

features = newDF.columns[:-1]
newDF.select(features).show()

Use the drop method to get the last column:

last_col = newDF.drop(*features)
last_col.show()


0

You can delete a column like this:

df.drop("column Name").columns

In your case :

df.drop("id").columns

If you want to drop more than one column you can do:

dfWithLongColName.drop("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME")

3 Comments

Spark 2.4 (and earlier versions) doesn't accept more than one column name.
Is it possible to drop columns by index ?
@seufagner it does, just pass it as a list
-1

Consider two DataFrames:

>>> aDF.show()
+---+----+
| id|datA|
+---+----+
|  1|  a1|
|  2|  a2|
|  3|  a3|
+---+----+

and

>>> bDF.show()
+---+----+
| id|datB|
+---+----+
|  2|  b2|
|  3|  b3|
|  4|  b4|
+---+----+

To accomplish what you are looking for, there are two ways:

1. Different joining condition. Instead of saying aDF.id == bDF.id

aDF.join(bDF, aDF.id == bDF.id, "outer")

Write this:

aDF.join(bDF, "id", "outer").show()
+---+----+----+
| id|datA|datB|
+---+----+----+
|  1|  a1|null|
|  3|  a3|  b3|
|  2|  a2|  b2|
|  4|null|  b4|
+---+----+----+

This will automatically get rid of the extra id column, so there is no separate dropping step needed.

2. Use aliasing: note that you will lose the B-specific id values this way (they show up as null in the result).

>>> from pyspark.sql.functions import col
>>> aDF.alias("a").join(bDF.alias("b"), aDF.id == bDF.id, "outer").drop(col("b.id")).show()

+----+----+----+
|  id|datA|datB|
+----+----+----+
|   1|  a1|null|
|   3|  a3|  b3|
|   2|  a2|  b2|
|null|null|  b4|
+----+----+----+

