
I am working with PySpark DataFrames and would like to know how to create and populate a new column using existing columns.

Let's say I have a DataFrame that looks like this:

+-----+---+---+
|   _1| _2| _3|
+-----+---+---+
|x1-y1|  3| z1|
|x2-y2|  2| z2|
|x3-y3|  1| z3|
+-----+---+---+

I am looking for a way to create a DataFrame that looks like this:

+-----+---+---+----+--------+
|   _1| _2| _3|  _4|      _5|
+-----+---+---+----+--------+
|x1-y1|  3| z1|x1y1|x1=y1=z1|
|x2-y2|  2| z2|x2y2|x2=y2=z2|
|x3-y3|  1| z3|x3y3|x3=y3=z3|
+-----+---+---+----+--------+

_4 is just _1 with the '-' removed, and _5 combines the values of _1 and _3, separated by '='.

  • I am using Spark 2.3.3 and Python 2.7

Thanks!

1 Answer

You can use pyspark.sql.functions to achieve this:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

sqlContext = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()
data = [('x1-y1', 3,'z1'),
        ('x2-y2', 2,'z2'),
        ('x3-y3', 1,'z3')]
test_df = sqlContext.createDataFrame(data, schema=['_1', '_2', '_3'])

# _4: drop the '-' from _1; _5: swap '-' for '=' in _1, then append '=' and _3
test_df = test_df.withColumn('_4', F.regexp_replace('_1', '-', ''))
test_df = test_df.withColumn('_5', F.concat(F.regexp_replace('_1', '-', '='), F.lit('='), F.col('_3')))
test_df.show()

+-----+---+---+----+--------+
|   _1| _2| _3|  _4|      _5|
+-----+---+---+----+--------+
|x1-y1|  3| z1|x1y1|x1=y1=z1|
|x2-y2|  2| z2|x2y2|x2=y2=z2|
|x3-y3|  1| z3|x3y3|x3=y3=z3|
+-----+---+---+----+--------+
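
As a side note (not part of the original answer), both columns can also be added in a single select. A minimal equivalent sketch, starting from the original three-column test_df, i.e. before the two withColumn calls; F.translate and F.concat_ws are both available in Spark 2.3:

# translate drops every '-'; concat_ws joins the '='-separated pieces of _1 with _3
test_df = test_df.select(
    '_1', '_2', '_3',
    F.translate('_1', '-', '').alias('_4'),
    F.concat_ws('=', F.regexp_replace('_1', '-', '='), '_3').alias('_5'),
)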

5 Comments

Thanks for the quick response, but I still get this error when I try to use your solution: AttributeError: 'PipelinedRDD' object has no attribute 'withColumn'
@ShanbogShashank That's strange. Does my example run correctly when you copy it as-is?
When I copy-paste your code, it works. When I try to adapt it to my code, I get the error: rdd1 = sc.parallelize([('x1-y1', 3, 'z1'), ('x2-y2', 2, 'z2'), ('x3-y3', 1, 'z3')]) df1 = sqlContext.createDataFrame(rdd1) df2 = df2.withColumn('_4', F.regexp_replace('_1', '-', '')) df2 = df2.withColumn('_5', F.concat(F.regexp_replace('_1', '-', '='),F.lit('='),F.col('_3'))) df2.show()
@ShanbogShashank Maybe the problem is how Spark is initialized. Since SparkSession is the newer entry point, I recommended creating it that way.
Great! It worked when I changed sc = pyspark.SparkContext.getOrCreate(); sqlContext = SQLContext(sc) to: sqlContext = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate(). Thanks so much!
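
For anyone who hits the same AttributeError: sc.parallelize returns an RDD, and withColumn exists only on DataFrames, so the RDD has to be converted first. A minimal sketch of the likely fix, reusing the rdd1/df1 names from the comment above (the explicit schema argument is an assumption):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("test").getOrCreate()
sc = spark.sparkContext

# parallelize() yields an RDD, not a DataFrame; RDDs have no withColumn method
rdd1 = sc.parallelize([('x1-y1', 3, 'z1'), ('x2-y2', 2, 'z2'), ('x3-y3', 1, 'z3')])

# convert to a DataFrame first (the column names here are an assumption)
df1 = spark.createDataFrame(rdd1, schema=['_1', '_2', '_3'])

df2 = df1.withColumn('_4', F.regexp_replace('_1', '-', ''))
df2 = df2.withColumn('_5', F.concat(F.regexp_replace('_1', '-', '='), F.lit('='), F.col('_3')))
df2.show()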
