Simple Regex Python

Question

I am reading a line from a file and want to split it into words that are delimited by non-alphanumeric ASCII characters or a <br> tag using re.split, but I am having trouble working out the correct pattern. The code below yields:

split = re.split(r'(<br>)|(\W+)', 'I code<br>A project.')
split = ['', None, 'I', '', None, 'code', '', None, '<', '', None, 'br',
         '',None, '>', '', None, 'A', '', None, 'project.']

I believed the pattern above would recognize either a <br> tag or a non-alphanumeric character, but clearly it is incorrect. I am having trouble understanding regex, so any help fixing this would be appreciated. I would like the result to look like the following once it is split properly:

split = ['I', 'code', 'A', 'project']

This is a good tutorial on Python re: developers.google.com/edu/python/regular-expressions
– bastelflp, Commented Nov 30, 2015 at 1:08

Mark · Accepted Answer · 2015-11-30 01:11:39Z


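You don't need the group syntax (). When the pattern contains capturing groups, re.split also returns the text matched by each group (and None for a group that did not take part in the match), which is where the extra entries come from: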

>>> re.split(r'<br>|\W+', 'I code<br>A    project.')
['I', 'code', 'A', 'project', '']
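To make the difference concrete, here is a minimal sketch comparing the grouped and ungrouped patterns on the string from the question. The variable names are only illustrative, and the final list comprehension is one possible way (not part of the original answer) to drop the empty string left behind when the input ends with a delimiter such as '.'.

```python
import re

text = 'I code<br>A project.'

# With capturing groups, re.split also returns what each group matched;
# a group that did not take part in a given match contributes None.
with_groups = re.split(r'(<br>)|(\W+)', text)

# Without groups, only the text between delimiters is returned.
without_groups = re.split(r'<br>|\W+', text)

# The trailing '.' is a delimiter, so an empty string follows it.
# Filtering on truthiness removes such empty pieces.
words = [w for w in without_groups if w]
print(words)
```

If the delimiters need to be grouped for some other reason, a non-capturing group such as (?:<br>|\W+) keeps them out of the result as well.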

edited Nov 30, 2015 at 1:11

answered Nov 30, 2015 at 1:02


3 Comments

Stefan Gruenwald Over a year ago

Mark, in HTML5 the <br> tag is an empty tag which means that it has no end tag.

Zach Gittelman Over a year ago

The result I am getting is ['I', 'code', 'A', 'br', 'project', '']. How do I remove the br?

Mark Over a year ago

@ZachGittelman, hmm, the output shown above is exactly what I get. What version of Python are you using? That was tested on 2.7.
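The comment does not show the exact pattern or input that produced the stray 'br', so the following is only a guess at one possible cause: the order of the alternatives. Python tries alternatives left to right, so if \W+ comes before <br>, the '<' and '>' are consumed as separate delimiters and 'br' survives as a word. A small sketch under that assumption:

```python
import re

text = 'I code<br>A project.'

# <br> listed first: the whole tag is consumed as a single delimiter.
tag_first = [w for w in re.split(r'<br>|\W+', text) if w]

# \W+ listed first: '<' and '>' each match \W+ on their own,
# so the letters 'br' between them are returned as a word.
nonword_first = [w for w in re.split(r'\W+|<br>', text) if w]

print(tag_first)
print(nonword_first)
```

Keeping the more specific alternative (<br>) first avoids this.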
