PySpark - String matching to create new column

Question

I have a dataframe like:

ID             Notes
2345          Checked by John
2398          Verified by Stacy
3983          Double Checked on 2/23/17 by Marsha

Let's say for example there are only 3 employees to check: John, Stacy, or Marsha. I'd like to make a new column like so:

ID                Notes                              Employee
2345          Checked by John                          John
2398         Verified by Stacy                        Stacy
3983     Double Checked on 2/23/17 by Marsha          Marsha

Is regex or grep better here? What kind of function should I try? Thanks!

EDIT: I've been trying a bunch of solutions, but nothing seems to work. Should I give up and instead create columns for each employee, with a binary value? IE:

ID                Notes                             John       Stacy    Marsha
2345          Checked by John                        1            0       0
2398         Verified by Stacy                       0            1       0
3983     Double Checked on 2/23/17 by Marsha         0            0       1

Here is a fundamental problem. If the employee name can appear anywhere in the Notes column, and the column can contain arbitrary text (for example "Checked by John" or "Double Checked on 2/23/17 by Marsha"), there is no way to find the employee name unless you have a regex that covers every possible pattern. In theory there could be infinitely many patterns, and whenever a new one appears you would need to find a correct regex for it as well.
– Avishek Bhattacharya, Commented Oct 3, 2017 at 16:34
Can you split the string by "BY" and take the last element of the returned array?
– Sanchit Grover, Commented Oct 6, 2017 at 14:53

Community · Accepted Answer · 2020-06-20 09:12:55Z

68

+100

In short:

regexp_extract(col('Notes'), r'(.)(by)(\s+)(\w+)', 4)

This expression extracts the employee name from any position in the text column (col('Notes')) where the name follows by and one or more whitespace characters.

In Detail:

Create a sample dataframe

data = [('2345', 'Checked by John'),
('2398', 'Verified by Stacy'),
('2328', 'Verified by Srinivas than some random text'),        
('3983', 'Double Checked on 2/23/17 by Marsha')]

df = sc.parallelize(data).toDF(['ID', 'Notes'])

df.show()

+----+--------------------+
|  ID|               Notes|
+----+--------------------+
|2345|     Checked by John|
|2398|   Verified by Stacy|
|2328|Verified by Srini...|
|3983|Double Checked on...|
+----+--------------------+

Do the needed imports

from pyspark.sql.functions import regexp_extract, col

On df extract Employee name from column using regexp_extract(column_name, regex, group_number).

Here the regex '(.)(by)(\s+)(\w+)' means

(.) - Any single character (except newline)
(by) - The literal text by
(\s+) - One or more whitespace characters
(\w+) - One or more alphanumeric or underscore characters

and group_number is 4 because the group (\w+) is the 4th capture group in the expression

result = df.withColumn('Employee', regexp_extract(col('Notes'), r'(.)(by)(\s+)(\w+)', 4))

result.show()

+----+--------------------+--------+
|  ID|               Notes|Employee|
+----+--------------------+--------+
|2345|     Checked by John|    John|
|2398|   Verified by Stacy|   Stacy|
|2328|Verified by Srini...|Srinivas|
|3983|Double Checked on...|  Marsha|
+----+--------------------+--------+
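
The same pattern can be checked without a Spark session using Python's built-in re module. This is only a minimal sketch and not part of the original answer: the helper name extract_employee is hypothetical, and it assumes the pattern behaves the same in Python's re as in the Java regex engine that Spark uses, which should hold for an expression this simple. Like regexp_extract, the helper returns an empty string when nothing matches.

```python
import re

# Same four-group pattern as in the answer, written as a raw string.
PATTERN = re.compile(r'(.)(by)(\s+)(\w+)')


def extract_employee(note):
    """Return capture group 4 of the first match, or '' if there is no match."""
    match = PATTERN.search(note)
    return match.group(4) if match else ''


notes = [
    'Checked by John',
    'Verified by Stacy',
    'Verified by Srinivas than some random text',
    'Double Checked on 2/23/17 by Marsha',
]

employees = [extract_employee(note) for note in notes]
print(employees)
```

Running the pattern locally like this is a quick way to try new Notes formats before applying the expression to the full dataframe.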

Databricks notebook

Note:

regexp_extract(col('Notes'), r'.by\s+(\w+)', 1) is a much cleaner version, since it needs only one capture group.
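
One thing worth checking with the shorter pattern is that a leading . lets by match at the end of a longer word. The sketch below, again in plain Python re rather than Spark, compares it with a variant that uses a word boundary (\bby), as the regex in another answer on this page does. The sample strings and the helper first_group are made up for illustration.

```python
import re

loose = re.compile(r'.by\s+(\w+)')     # pattern from the note above
strict = re.compile(r'\bby\s+(\w+)')   # variant: "by" must start at a word boundary


def first_group(pattern, text):
    """Return capture group 1 of the first match, or '' if there is no match."""
    match = pattern.search(text)
    return match.group(1) if match else ''


samples = [
    'Checked by John',
    'Left on standby for review',  # "by" is only the tail of "standby"
    'by Marsha',                   # "by" is the very first word
]

for text in samples:
    print(repr(text), '->', repr(first_group(loose, text)), repr(first_group(strict, text)))
```

Whether the stricter variant is preferable depends on how the real notes are written, so it is best treated as an option to test against the data.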

edited Jun 20, 2020 at 9:12

answered Oct 3, 2017 at 15:01

mrsrinivas

4 Comments

Norah Jones Over a year ago

How can I include curly braces in the second group, e.g. (by{)?

AJR Over a year ago

@mrsrinivas - Can you please check "added Code" in my question and tell why Folder_num is not displaying any data in my frame? stackoverflow.com/questions/64602504/…

mrsrinivas Over a year ago

@AJR: Thanks for your post here. Please consider accepting/feedback answers to your previous questions.

Arun Mohan Over a year ago

How can I extract only 'C21618616' from 'ADRIANOPICCININIC216186162022-07-27 09:36:33Z'? Any suggestions would be helpful.

ctwheels · Accepted Answer · 2017-10-02 20:29:27Z

Brief

In its simplest form, and according to the example provided, this answer should suffice, albeit the OP should post more samples if other samples exist where the name should be preceded by any word other than by.

Code

See code in use here

Regex

^(\w+)[ \t]*(.*\bby[ \t]+(\w+)[ \t]*.*)$

Replacement

\1\t\2\t\3

Results

Input

2345          Checked by John
2398          Verified by Stacy
3983          Double Checked on 2/23/17 by Marsha

Output

2345    Checked by John John
2398    Verified by Stacy   Stacy
3983    Double Checked on 2/23/17 by Marsha     Marsha

Note: The above output separates each column by the tab \t character, so it may not appear to be correct to the naked eye, but simply using an online regex parser and inserting \t into the regex match section should show you where each column begins/ends.

Explanation

Regex

^ Assert position at the beginning of the line
(\w+) Capture one or more word characters (a-zA-Z0-9_) into group 1
[ \t]* Match any number of spaces or tab characters ([ \t] can be replaced with \h in some regex flavours such as PCRE)
(.*\bby[ \t]+(\w+)[ \t]*.*) Capture the following into group 2
- .* Match any character (except newline unless the s modifier is used)
- \bby Match a word boundary \b, followed by by literally
- [ \t]+ Match one or more spaces or tab characters
- (\w+) Capture one or more word characters (a-zA-Z0-9_) into group 3
- [ \t]* Match any number of spaces or tab characters
- .* Match any character any number of times
$ Assert position at the end of the line

Replacement

\1 Matches the same text as most recently matched by the 1st capturing group
\t Tab character
\2 Matches the same text as most recently matched by the 2nd capturing group
\t Tab character
\3 Matches the same text as most recently matched by the 3rd capturing group
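
As a rough illustration of the regex and replacement described above, the following sketch applies them line by line with Python's re.sub. It assumes each row is available as a plain text line with the ID first, which is how the sample input is laid out; the variable names are hypothetical.

```python
import re

# Regex from this answer: ID in group 1, the whole note in group 2, the name in group 3.
LINE = re.compile(r'^(\w+)[ \t]*(.*\bby[ \t]+(\w+)[ \t]*.*)$')

rows = [
    '2345          Checked by John',
    '2398          Verified by Stacy',
    '3983          Double Checked on 2/23/17 by Marsha',
]

# The replacement joins the three groups with tab characters.
converted = [LINE.sub(r'\1\t\2\t\3', row) for row in rows]

for line in converted:
    print(line)
```

Because the leading .* is greedy, the name is taken from the last by in the line when more than one is present.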

Matschek · Accepted Answer · 2017-10-05 06:28:15Z

2

Reading the question again, the OP may be referring to a fixed list of employees ("Let's say for example there are only 3 employees to check: John, Stacy, or Marsha"). If this really is a known list, then the simplest way is to check against this list of names with word boundaries:

regexp_extract(col('Notes'), r'\b(John|Stacy|Marsha)\b', 1)
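
A small sketch of how such a pattern could be built from a list of names and tried out locally with Python's re module. The names EMPLOYEES, find_employee and indicator_row are hypothetical helpers, and the indicator row only illustrates the binary-column layout mentioned in the question's edit. Note that the pattern is written with raw strings: in an ordinary Python string, \b is a backspace character rather than a word boundary.

```python
import re

EMPLOYEES = ['John', 'Stacy', 'Marsha']

# Build \b(John|Stacy|Marsha)\b from the list, escaping any special characters.
pattern_text = r'\b(' + '|'.join(re.escape(name) for name in EMPLOYEES) + r')\b'
NAME = re.compile(pattern_text)


def find_employee(note):
    """Return the first known employee name in the note, or '' if none is found."""
    match = NAME.search(note)
    return match.group(1) if match else ''


def indicator_row(note):
    """Return a 0/1 flag per employee, as in the question's binary-column layout."""
    found = find_employee(note)
    return {name: int(name == found) for name in EMPLOYEES}


print(pattern_text)
print(find_employee('Double Checked on 2/23/17 by Marsha'))
print(indicator_row('Verified by Stacy'))
```

The generated pattern_text string can then be passed to regexp_extract in place of the hand-written pattern.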

answered Oct 5, 2017 at 6:28

Matschek

Avishek Bhattacharya · Accepted Answer · 2017-09-25 18:19:11Z

0

Something like this should work

import org.apache.spark.sql.functions._
dataFrame.withColumn("Employee", substring_index(col("Notes"), " ", -1))

If you want to use a regex to extract the proper value, you need something like

 dataFrame.withColumn("Employee", regexp_extract(col("Notes"), "<regex>", <groupId>))

edited Sep 25, 2017 at 18:19

answered Sep 25, 2017 at 18:00

Avishek Bhattacharya

2 Comments

Ashley O Over a year ago

What if the employee name could be at the start, middle, or end of the string? Would this still work?

Avishek Bhattacharya Over a year ago

No, in that case you need to use a regex. My solution strictly selects the last word as the name of the employee. There has to be some pattern, though. If you want to use a pattern, you can use regexp_extract(col("Notes"), <regex>, <groupNumber>)
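
The limitation described in this comment can be seen with a plain-Python stand-in for the "take the last word" idea. This is only a sketch: last_token is a hypothetical helper that keeps everything after the final space, which is what substring_index does with a count of -1.

```python
def last_token(note, delimiter=' '):
    """Return the text after the final delimiter (the whole note if there is none)."""
    return note.rsplit(delimiter, 1)[-1]


notes = [
    'Checked by John',                             # name is the last word
    'Verified by Srinivas than some random text',  # name is in the middle
]

for note in notes:
    print(repr(note), '->', repr(last_token(note)))
```

The second note yields the trailing word rather than the name, which is the case where a regex is needed.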
