
I am working with PySpark DataFrames and would like to know how to create and populate a new column using existing columns.

Let's say I have a DataFrame that looks like this:

+-----+---+---+
|   _1| _2| _3|
+-----+---+---+
|x1-y1|  3| z1|
|x2-y2|  2| z2|
|x3-y3|  1| z3|
+-----+---+---+

I am looking for a way to create a DataFrame that looks like this:

+-----+---+---+----+--------+
|   _1| _2| _3|  _4|      _5|
+-----+---+---+----+--------+
|x1-y1|  3| z1|x1y1|x1=y1=z1|
|x2-y2|  2| z2|x2y2|x2=y2=z2|
|x3-y3|  1| z3|x3y3|x3=y3=z3|
+-----+---+---+----+--------+

_4 is just _1 with the '-' removed, and _5 combines the values of _1 and _3, separated by '='.

  • I am using Spark 2.3.3 and Python 2.7

Thanks!

1 Answer

You can use pyspark.sql.functions to achieve this:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

sqlContext = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()
data = [('x1-y1', 3,'z1'),
        ('x2-y2', 2,'z2'),
        ('x3-y3', 1,'z3')]
test_df = sqlContext.createDataFrame(data, schema=['_1', '_2', '_3'])

# _4: drop the '-' from _1; _5: swap '-' for '=' in _1, then append '=' and _3
test_df = test_df.withColumn('_4', F.regexp_replace('_1', '-', ''))
test_df = test_df.withColumn('_5', F.concat(F.regexp_replace('_1', '-', '='), F.lit('='), F.col('_3')))
test_df.show()

+-----+---+---+----+--------+
|   _1| _2| _3|  _4|      _5|
+-----+---+---+----+--------+
|x1-y1|  3| z1|x1y1|x1=y1=z1|
|x2-y2|  2| z2|x2y2|x2=y2=z2|
|x3-y3|  1| z3|x3y3|x3=y3=z3|
+-----+---+---+----+--------+
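
As a side note (not part of the original answer), both columns can also be added in a single select. A minimal equivalent sketch, starting from the original three-column test_df, i.e. before the two withColumn calls; F.translate and F.concat_ws are both available in Spark 2.3:

# translate drops every '-'; concat_ws joins the '='-separated pieces of _1 with _3
test_df = test_df.select(
    '_1', '_2', '_3',
    F.translate('_1', '-', '').alias('_4'),
    F.concat_ws('=', F.regexp_replace('_1', '-', '='), '_3').alias('_5'),
)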

5 Comments

Thanks for the quick response, but I still get this error when I try to use your solution: AttributeError: 'PipelinedRDD' object has no attribute 'withColumn'
@ShanbogShashank That's strange. Does my example run correctly when you copy it as-is?
When I copy-paste your code, it works. When I try to adapt it to my code, I get the error: rdd1 = sc.parallelize([('x1-y1', 3, 'z1'), ('x2-y2', 2, 'z2'), ('x3-y3', 1, 'z3')]) df1 = sqlContext.createDataFrame(rdd1) df2 = df2.withColumn('_4', F.regexp_replace('_1', '-', '')) df2 = df2.withColumn('_5', F.concat(F.regexp_replace('_1', '-', '='),F.lit('='),F.col('_3'))) df2.show()
@ShanbogShashank Maybe the problem is how Spark is initialized. Since SparkSession is the newer entry point, I recommended creating it that way.
Great! It worked when I changed sc = pyspark.SparkContext.getOrCreate(); sqlContext = SQLContext(sc) to: sqlContext = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate(). Thanks so much!
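
For anyone who hits the same AttributeError: sc.parallelize returns an RDD, and withColumn exists only on DataFrames, so the RDD has to be converted first. A minimal sketch of the likely fix, reusing the rdd1/df1 names from the comment above (the explicit schema argument is an assumption):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("test").getOrCreate()
sc = spark.sparkContext

# parallelize() yields an RDD, not a DataFrame; RDDs have no withColumn method
rdd1 = sc.parallelize([('x1-y1', 3, 'z1'), ('x2-y2', 2, 'z2'), ('x3-y3', 1, 'z3')])

# convert to a DataFrame first (the column names here are an assumption)
df1 = spark.createDataFrame(rdd1, schema=['_1', '_2', '_3'])

df2 = df1.withColumn('_4', F.regexp_replace('_1', '-', ''))
df2 = df2.withColumn('_5', F.concat(F.regexp_replace('_1', '-', '='), F.lit('='), F.col('_3')))
df2.show()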
