I have strings in a dataframe column in the following format:

abc.T01.xyz
abc.def.T01.xyz
abc.def.ghi.xyz

I need to filter the rows where this string has values matching this expression.

[a-zA-Z].T[0-9].[a-zA-Z]

I used the following command, but it also returns strings of the form [a-zA-Z].[a-zA-Z].T[0-9].[a-zA-Z], which I don't want in my result.

mydf2 = mydf1.where('col1 rlike ".*\.T.*\..*"')
mydf2.show()

I am missing something in my regex.

1 Answer

Translate your requirements directly into the pattern instead of using a dot-star soup, and add anchors:

# [a-zA-Z].T[0-9].[a-zA-Z]
mydf2 = mydf1.where('col1 rlike "^[a-zA-Z.]+\.T[0-9]+\.[a-zA-Z.]+$"')
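To see why the dot-star version over-matches, here is a minimal sketch using Python's `re` module rather than Spark. The capture groups are added purely for illustration and are not part of the question's pattern. Spark's `rlike` uses Java regular expressions, but for constructs this simple the behaviour should be the same.

```python
import re

# The question's pattern, with capture groups added (illustrative only)
# so we can inspect what each ".*" actually consumed.
loose = re.compile(r"(.*)\.T(.*)\.(.*)")

# search() looks for a match anywhere in the string, like rlike does.
groups = loose.search("abc.def.T01.xyz").groups()
print(groups)  # ('abc.def', '01', 'xyz')
```

The first `.*` swallows `abc.def`, dot included, because `.` matches any character. Explicit character classes plus `^` and `$` are what stop that from happening.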

See a demo on regex101.com.
Please note that I have also added the dot to both character classes (is this a requirement?); without it, your second string, abc.def.T01.xyz, won't be matched. If that is not what you want, delete the dot from both classes.
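The effect of the dot in the character classes can be checked quickly outside Spark. This is a small sketch with Python's `re` module against the three sample strings from the question; the helper name `matching` and the variable names are made up for this example.

```python
import re

samples = ["abc.T01.xyz", "abc.def.T01.xyz", "abc.def.ghi.xyz"]

# Variant from the answer: the dot is allowed inside the letter classes.
with_dot = re.compile(r"^[a-zA-Z.]+\.T[0-9]+\.[a-zA-Z.]+$")
# Stricter variant: the dot is removed from both classes.
without_dot = re.compile(r"^[a-zA-Z]+\.T[0-9]+\.[a-zA-Z]+$")

def matching(pattern, values):
    """Return the values in which the pattern finds a match."""
    return [v for v in values if pattern.search(v)]

with_dot_hits = matching(with_dot, samples)
without_dot_hits = matching(without_dot, samples)

print(with_dot_hits)     # ['abc.T01.xyz', 'abc.def.T01.xyz']
print(without_dot_hits)  # ['abc.T01.xyz']
```

Going by the question, the stricter variant looks like the one wanted. In PySpark it could presumably also be passed through the Column API as a raw string, e.g. `mydf1.where(col("col1").rlike(r"^[a-zA-Z]+\.T[0-9]+\.[a-zA-Z]+$"))` with `col` imported from `pyspark.sql.functions`, which avoids having to think about how backslashes inside a SQL string literal are processed.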
