
Can someone let me know why the regular expression

df = df2.withColumn("extracted", F.regexp_extract("title", "[Pp]ython", 0))

can find the pattern 'Python' or 'python' in the following column called title

title
A fast PostgreSQL client library for Python: 3x faster than psycopg2
A project template for data science in Python
A simple python framework to build/train LUIS models
An Introduction to Stock Market Data Analysis with Python (Part 1)
Asynchronous Python
Cubr  A Rubiks Cube Solver Written in Python and using Webcam Input (2013)
Python 4 Kids: Python for Kids: Python 3  Project 10

but the same regular expression can't find the pattern 'Python' or 'python' in the following?

title
Python Core Development Sprint 2016: 3.6 and beyond
Hypothesis.works articles: 3.5.0 and 3.5.1 Releases of Hypothesis for Python
Total pip packages downloaded, separated by Python versions (June  August 2016)
PEP 530: Asynchronous Comprehensions in Python 3.6
Python 2.7 still reigns supreme in pip installs
CheckiO  games for Python and JavaScript coders. ClassRoom support is included
VR Zero, Virtual Reality on the RaspberryPi, in Python

Thanks

  • Any error messages? What output do you get? Commented Jul 15, 2021 at 21:16
  • Hi, I don't get any error messages; the pattern is simply not found. When I run the equivalent code in SQL (%ython%), all the patterns are found. Very strange. Commented Jul 15, 2021 at 21:20
  • If you were to run the same PySpark regular expression, you would see what I mean. Commented Jul 15, 2021 at 21:25

1 Answer


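Use a case-insensitive regex.

Putting the inline flag (?i) at the start of the pattern turns case-insensitive mode on, so 'Python', 'python' and any other capitalisation all match.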

Data

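from pyspark.sql import functions as F
from pyspark.sql.functions import col

# Assumes an active SparkSession named `spark` (as in a notebook or the pyspark shell).
data = [
    (1, "Python Core Development Sprint 2016: 3.6 and beyond"),
    (2, "Hypothesis.works articles: 3.5.0 and 3.5.1 Releases of Hypothesis for Python"),
    (3, "CheckiO  games for python and JavaScript coders. ClassRoom support is included"),
]
df = spark.createDataFrame(data, ['id', 'title'])
df.show(truncate=False)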

Solution

df.withColumn('extract', F.regexp_extract(col('title'), '(?i)python', 0)).show()

Outcome

+---+--------------------+-------+
| id|               title|extract|
+---+--------------------+-------+
|  1|Python Core Devel...| Python|
|  2|Hypothesis.works ...| Python|
|  3|CheckiO  games fo...| python|
+---+--------------------+-------+
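
The effect of the (?i) flag can also be checked without a Spark session. The following is only a rough sketch using Python's built-in re module, on the assumption that it treats this simple pattern the same way as the Java regex engine behind regexp_extract. The helper extract_first is a hypothetical name, written to mimic regexp_extract returning an empty string when nothing matches, and the all-caps title is made up for illustration.

```python
import re


def extract_first(text, pattern):
    """Return the first match of pattern in text, or '' when nothing matches."""
    match = re.search(pattern, text)
    return match.group(0) if match else ""


titles = [
    "Python Core Development Sprint 2016: 3.6 and beyond",
    "CheckiO  games for python and JavaScript coders. ClassRoom support is included",
    "ASYNCHRONOUS PYTHON",  # hypothetical title, not from the question
]

# Character class: matches only 'Python' or 'python'.
case_sensitive = [extract_first(t, "[Pp]ython") for t in titles]

# Inline flag: matches any capitalisation.
case_insensitive = [extract_first(t, "(?i)python") for t in titles]

print(case_sensitive)
print(case_insensitive)
```

Both patterns agree on the first two titles; they differ only on the all-caps one, where the character class finds nothing and the flagged pattern still matches.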