I need a Python regex that removes illegal characters from a word.

The conditions are as follows:

  1. The first character must be a-z only
  2. All characters in the word should only be a-z (lower case) plus apostrophe ' and hyphen -
  3. The last character must be a-z or apostrophe ' only
  4. You can assume that the word is always lower-case

Test data:

 s = "there is' -potato 'all' around- 'the 'farm-"

Expected output:

>>> print(s)
there is' potato all' around the farm

This is my current code, but it doesn't work correctly:

newLine = re.findall(r'[a-z][-\'a-z]*[\'a-z]?', s)

Any assistance would be greatly appreciated! Thanks!

4 Answers

Just match the characters you don't want and remove them with re.sub:

>>> import re
>>> s = """potato
-potato
'human'
potatoes-"""
>>> m = re.sub(r"(?m)^['-]|-$", r'', s)
>>> print(m)
potato
potato
human'
potatoes

OR

>>> m = re.sub(r"(?m)^(['-])?([a-z'-]*?)-?$", r'\2', s)
>>> print(m)
potato
potato
human'
potatoes


Try this:

>>> b=re.findall(r'[a-z][-\'a-z]*[\'a-z]',a)
>>> for i in b: print(i)
... 
potato
potato
human'
potatoes

3 Comments

I tried using your regex code but it doesn't produce the expected output you have written
I tested it on the test data you provided and it worked fine. Still, try this version, which makes the final character optional: b=re.findall(r'[a-z][-\'a-z]*[\'a-z]?',a)
Switch to double quotes to eliminate the escaping: b=re.findall(r"[a-z][-'a-z]*['a-z]?",a)

You can try:

[a-z][a-z'\-]*[a-z]|[a-z]

1 Comment

Thanks! It's almost accurate, but when I ran some sample code I found a case it doesn't catch: a word that begins with an apostrophe '.

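Assuming every word is separated by a space, you could find all the valid words with a regex like this:

(?<![^ ])[a-z](?:(?:[\-\'a-z]+)?[\'a-z])?(?= |$)

(Python's re module only accepts fixed-width lookbehinds, so "preceded by a space or the start of the string" is written as (?<![^ ]) rather than (?<= |^).)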
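
As a rough illustration, here is how a pattern of this kind might be applied to the question's test string with re.findall. The function name find_valid_words is made up for this sketch, the pattern is a slightly simplified equivalent of the one above, and the left boundary is written as the negative lookbehind (?<![^ ]) so that Python's re module accepts it.

```python
import re

# Hypothetical helper for illustration; not part of the original answer.
# A valid word starts with a-z, may contain a-z, ' and - in the middle,
# and ends with a-z or '. It must be bounded by spaces or the string ends.
VALID_WORD = re.compile(r"(?<![^ ])[a-z](?:[-'a-z]*['a-z])?(?= |$)")

def find_valid_words(text):
    """Return only the words that already satisfy all the conditions."""
    return VALID_WORD.findall(text)

s = "there is' -potato 'all' around- 'the 'farm-"
print(find_valid_words(s))  # ['there', "is'"]
```

Note that words with a stray leading or trailing character are skipped entirely rather than repaired.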

But if you want to eliminate illegal characters, I guess you're better off finding the illegal characters and removing them. Again, we assume the string contains only words separated by spaces and nothing else.

So first of all we can sub all invalid characters out of the string: [^a-z-' ]

After doing this, the only things that could still be invalid are a ' or - at the beginning of a word, or a - at the end of a word.

So we sub those out with this regex: (?<![^ ])['-]+|-+(?= |$)
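
The two substitution steps could be sketched in Python roughly as follows. The function name clean_words is an assumption made for this example, and both word boundaries are written as negative lookarounds, (?<![^ ]) and (?![^ ]), because Python's re module rejects variable-width lookbehinds such as (?<= |^).

```python
import re

def clean_words(text):
    """Hypothetical helper: strip illegal characters from space-separated words."""
    # Step 1: drop everything that is not a-z, apostrophe, hyphen or space.
    text = re.sub(r"[^a-z' -]", "", text)
    # Step 2: drop ' or - at the start of a word and - at the end of a word.
    # (?<![^ ]) means "at the string start or after a space";
    # (?![^ ]) means "at the string end or before a space".
    return re.sub(r"(?<![^ ])['-]+|-+(?![^ ])", "", text)

s = "there is' -potato 'all' around- 'the 'farm-"
print(clean_words(s))  # there is' potato all' around the farm
```

On the question's test string this gives the expected output. A word made up only of apostrophes and hyphens would be removed completely, leaving two adjacent spaces behind.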
