I need to write a REGEXP_REPLACE query for a spark.sql() job. If the value matches the pattern below, only the text before the first hyphen should be extracted and assigned to the target column 'name'; if the pattern doesn't match, the entire 'name' should be returned.
Pattern:
- Values should be hyphen-delimited. Anything can appear before the first hyphen (digits, letters, special characters, or even spaces).
- The first hyphen should be followed by exactly 2 words separated by a hyphen. Each word can contain only digits and letters (special characters and blanks are not allowed).
- The two words should be followed by a hyphen and one or more digits, followed by another hyphen.
- The last portion should consist only of one or more digits.
For example:
if name = abc45-dsg5-gfdvh6-9890-7685, output of REGEXP_REPLACE = abc45
if name = abc, output of REGEXP_REPLACE = abc
if name = abc-gf5-dfg5-asd5-98-00, output of REGEXP_REPLACE = abc-gf5-dfg5-asd5-98-00
I have
spark.sql("SELECT REGEXP_REPLACE(name , '-[^-]+-\\w{2}-\\d+-\\d+$','',1,1,'i') AS name").show();
But it does not work.
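One possible way to express the rules above is a single anchored pattern with a capture group for the prefix, replacing the whole value with that group. The sketch below checks this logic locally with Python's re module rather than Spark; the function name extract_name, the table name people, and the choice to require at least one character before the first hyphen are assumptions for illustration. The Spark query string is untested here; it relies on Spark's REGEXP_REPLACE(str, regexp, rep) form and on Java-style $1 group references, and it uses [0-9] and [A-Za-z0-9] instead of \d and \w to sidestep backslash escaping in SQL string literals.

```python
import re

# Rules from the question, anchored to the whole value:
#   prefix (anything except '-') - word - word - digits - digits
PATTERN = r"^([^-]+)-[A-Za-z0-9]+-[A-Za-z0-9]+-[0-9]+-[0-9]+$"


def extract_name(name: str) -> str:
    """Return the prefix if the full pattern matches, else the value unchanged."""
    # When the pattern does not match, re.sub leaves the input as it is.
    return re.sub(PATTERN, r"\1", name)


# Possible Spark SQL equivalent (untested sketch; `people` is a hypothetical table).
SPARK_QUERY = (
    "SELECT REGEXP_REPLACE(name, "
    "'^([^-]+)-[A-Za-z0-9]+-[A-Za-z0-9]+-[0-9]+-[0-9]+$', '$1') AS name "
    "FROM people"
)

if __name__ == "__main__":
    for value in ("abc45-dsg5-gfdvh6-9890-7685", "abc", "abc-gf5-dfg5-asd5-98-00"):
        print(value, "->", extract_name(value))
```

Because the pattern is anchored with ^ and $, a value that fails any rule is not touched at all, which gives the "return the entire name" behaviour without a separate CASE expression.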
Use ^(?:[^-\s]+(?=-\w+-\w+\d+-\d+-\d+$)|\S+) to match the desired strings. See regex101.com/r/BeCTRF/1
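The pattern in the comment above extracts the wanted text instead of replacing the rest, so in Spark it would presumably be used with an extract-style function rather than REGEXP_REPLACE. A minimal local check with Python's re module, assuming the lookahead and character classes behave the same way in both engines (extract_with_lookahead is a hypothetical helper name):

```python
import re

# Pattern copied from the comment above, unchanged.
COMMENT_PATTERN = r"^(?:[^-\s]+(?=-\w+-\w+\d+-\d+-\d+$)|\S+)"


def extract_with_lookahead(name: str) -> str:
    """Return the text matched at the start of the value, or the value itself."""
    match = re.match(COMMENT_PATTERN, name)
    return match.group(0) if match else name


if __name__ == "__main__":
    for value in ("abc45-dsg5-gfdvh6-9890-7685", "abc", "abc-gf5-dfg5-asd5-98-00"):
        print(value, "->", extract_with_lookahead(value))
```

This reproduces the three examples from the question. One difference from the stated rules: the pattern excludes whitespace, so a value with a space before the first hyphen, such as "a b-x1-y2-3-4", is cut at the first space and yields "a".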