I need to write a REGEXP_REPLACE query for a spark.sql() job. If the value follows the pattern below, only the part before the first hyphen should be extracted and assigned to the target column 'name'; if the pattern doesn't match, the entire 'name' should be returned.

Pattern:

  1. Values should be hyphen-delimited. Any characters can appear before the first hyphen (digits, letters, special characters, or even spaces).
  2. The first hyphen should be followed by exactly 2 words separated by a hyphen. Each word may contain only digits and letters (special characters and blanks are not allowed).
  3. The two words should be followed by a group of one or more digits, followed by a hyphen.
  4. The last portion should consist only of one or more digits.

For Example:

if name = abc45-dsg5-gfdvh6-9890-7685, output of REGEXP_REPLACE = abc45

if name = abc, output of REGEXP_REPLACE = abc

if name = abc-gf5-dfg5-asd5-98-00, output of REGEXP_REPLACE = abc-gf5-dfg5-asd5-98-00

I have

spark.sql("SELECT REGEXP_REPLACE(name , '-[^-]+-\\w{2}-\\d+-\\d+$','',1,1,'i')  AS name").show();

But it does not work.

1 Answer

Use

^([^-]*)(-[a-zA-Z0-9]+){2}-[0-9]+-[0-9]+$

Replace the match with $1. If $1 does not work, use \1; if \1 does not work, use \\1.
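
As a rough sketch of how the pattern and the $1 replacement could be plugged into the spark.sql() call, the snippet below only builds the SQL text. The table name `people` and the helper `build_query` are hypothetical, since the question does not name the source table. Spark's REGEXP_REPLACE is backed by Java regular expressions, so `$1` is the form most likely to work there. The pattern also contains no backslashes, so no extra escaping should be needed inside the SQL string literal.

```python
# Pattern from the answer above, written without backslashes.
PATTERN = "^([^-]*)(-[a-zA-Z0-9]+){2}-[0-9]+-[0-9]+$"


def build_query(table: str) -> str:
    """Build the SELECT statement; `table` is a hypothetical table or view name."""
    return (
        f"SELECT REGEXP_REPLACE(name, '{PATTERN}', '$1') AS name "
        f"FROM {table}"
    )


query = build_query("people")
print(query)

# With an active SparkSession named `spark` (assumed, not created here),
# the query could then be run as:
# spark.sql(query).show()
```

Note that the query in the question has no FROM clause, so a table or view exposing the 'name' column has to be supplied either way.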

EXPLANATION

--------------------------------------------------------------------------------
  ^                        the beginning of the string
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [^-]*                    any character except: '-' (0 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  (                        group and capture to \2 (2 times):
--------------------------------------------------------------------------------
    -                        '-'
--------------------------------------------------------------------------------
    [a-zA-Z0-9]+             any character of: 'a' to 'z', 'A' to
                             'Z', '0' to '9' (1 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
  ){2}                     end of \2 (NOTE: because you are using a
                           quantifier on this capture, only the LAST
                           repetition of the captured pattern will be
                           stored in \2)
--------------------------------------------------------------------------------
  -                        '-'
--------------------------------------------------------------------------------
  [0-9]+                   any character of: '0' to '9' (1 or more
                           times (matching the most amount possible))
--------------------------------------------------------------------------------
  -                        '-'
--------------------------------------------------------------------------------
  [0-9]+                   any character of: '0' to '9' (1 or more
                           times (matching the most amount possible))
--------------------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
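
The three examples from the question can be checked against this pattern without a Spark cluster. The sketch below uses Python's `re` module as a stand-in for the Java regex engine behind Spark SQL. The two engines should treat this particular pattern the same way, but that is an assumption to verify in Spark. The function name `extract_name` is made up for illustration.

```python
import re

# Same pattern as in the answer; group 1 is everything before the first hyphen.
PATTERN = re.compile(r"^([^-]*)(-[a-zA-Z0-9]+){2}-[0-9]+-[0-9]+$")


def extract_name(name: str) -> str:
    """Return group 1 when the whole value matches, else the value unchanged."""
    # Python spells the back-reference as \1 where Java/Spark uses $1.
    return PATTERN.sub(r"\1", name)


examples = [
    "abc45-dsg5-gfdvh6-9890-7685",  # matches: prefix, two words, digits, digits
    "abc",                          # no hyphens at all: left unchanged
    "abc-gf5-dfg5-asd5-98-00",      # three words before the digits: left unchanged
]

for value in examples:
    print(f"{value} -> {extract_name(value)}")
```

Because the pattern is anchored with ^ and $, a value that does not match in full is returned untouched, which gives the "report the entire name" behaviour asked for.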
1 Comment

That seems a better solution +1
