I am reading a line from a file and want to split it into words, where words are delimited by non-alphanumeric ASCII characters or a <br> tag. I am using re.split, but I am having trouble writing the correct pattern. The code below yields:

split = re.split(r'(<br>)|(\W+)', 'I code<br>A project.')
split = ['', None, 'I', '', None, 'code', '', None, '<', '', None, 'br',
         '',None, '>', '', None, 'A', '', None, 'project.']

I believed the pattern above would match either a <br> tag or a run of non-alphanumeric characters, but clearly it is incorrect. I am having trouble understanding regex, so any help fixing this would be appreciated. I would like the result of the split to look like this:

split = ['I', 'code', 'A', 'project']

1 Answer

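You don't need the capturing groups (). When the pattern contains capturing groups, re.split also returns the text matched by each group (or None for a group that did not take part in the match), which is where the extra entries come from. Without the groups: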

>>> re.split(r'<br>|\W+', 'I code<br>A    project.')
['I', 'code', 'A', 'project', '']
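
To make the difference concrete, here is a minimal sketch comparing the two patterns on the same input. The variable names are only illustrative, and the final list comprehension is one possible way to drop the empty string left behind by the trailing '.', not something the answer above prescribes.

```python
import re

text = 'I code<br>A    project.'

# With capturing groups, re.split also returns what each group matched
# (or None when that group did not take part in the match).
with_groups = re.split(r'(<br>)|(\W+)', text)

# Without groups, only the text between delimiters is returned.
without_groups = re.split(r'<br>|\W+', text)

# The trailing '.' is a delimiter at the very end of the string, so the
# split ends with an empty string; filter empty strings out.
words = [w for w in without_groups if w]
print(words)  # ['I', 'code', 'A', 'project']
```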

3 Comments

Mark, in HTML5 the <br> tag is an empty tag which means that it has no end tag.
The result I am getting is ['I', 'code', 'A', 'br', 'project', '']. How do I remove the br?
@ZachGittelman, hmm, the code snippet above is exactly what I get, what version of Python are you using? That was tested on 2.7.
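
It is not clear from the comments what input or pattern produced the stray 'br'. As a hedged sketch only, here are two ways a 'br' can survive the split: putting \W+ before <br> in the alternation, or feeding in a tag variant such as <br/> that the literal <br> does not match. The <br\s*/?> pattern is an assumption about which variants might appear, not something taken from the question.

```python
import re

def split_words(pattern, text):
    """Split text on pattern and drop empty strings."""
    return [w for w in re.split(pattern, text) if w]

text = 'I code<br>A project.'

# Alternation is tried left to right: with \W+ first, '<' and '>' are
# consumed as separate delimiters and 'br' is left over as a "word".
print(split_words(r'\W+|<br>', text))   # ['I', 'code', 'br', 'A', 'project']
print(split_words(r'<br>|\W+', text))   # ['I', 'code', 'A', 'project']

# A self-closing variant is not matched by the literal '<br>' either.
variant = 'I code<br/>A project.'
print(split_words(r'<br>|\W+', variant))         # ['I', 'code', 'br', 'A', 'project']
print(split_words(r'<br\s*/?>|\W+', variant))    # ['I', 'code', 'A', 'project']
```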
