Regex to parse CSS selector

Question

I want to parse this CSS Selector (and others of a similar form): div.class1#myid.class2[key=value]

and have it match ".class1" and ".class2", but I can't figure out what regex to use.

example: http://www.rubular.com/r/3dxpzyJLeK

In an ideal world, I'd also want to extract the:

type (i.e. div)
class (i.e. a list of classes)
id (i.e. myid)
key (i.e. key)
operator (i.e. =)
value (i.e. value)

but I can't get the basics going!

Any help would be massively appreciated :)

Thanks!

If you want all that info, you're better off using something like pyparsing. It also looks like there are a couple of libraries doing this already -- cthedot.de/cssutils and code.google.com/p/css-py -- although it's not clear how complete they are.
– BrenBarn, Commented Jun 23, 2012 at 20:11
In theory, there could be more than one [key=value], either using separate lists for key and value, or using an attribute list that contains key-value pairs. And "tag" might be more appropriate than "type".
– Matt Coughlin, Commented Jun 23, 2012 at 20:17
Plus, there are more variations for an attribute, with and without quotes for the attribute values: [type], [type^=value], [type$=value], etc. If that matters, it may be necessary to store the attribute operator as well.
– Matt Coughlin, Commented Jun 23, 2012 at 20:23
Study the grammar: w3.org/TR/CSS21/grammar.html and take a look at existing regex-for-CSS-selectors questions: stackoverflow.com/questions/tagged/regex+css-selectors
– BoltClock, Commented Jun 23, 2012 at 20:27
By the way, the "key", "operator" and "value" shouldn't be parsed separately. Parse them together as an attribute selector, and capture the operator/value optionally.
– BoltClock, Commented Jun 23, 2012 at 20:28
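
As a rough illustration of the last few comments, an attribute selector can be matched as one unit, with the operator and value captured optionally. This is only a sketch: the names `ATTRIBUTE` and `parse_attributes` are made up for this example, it covers the operators =, ~=, |=, ^=, $= and *= with simple quoted or unquoted values, and it does not try to follow the full CSS grammar.

```python
import re

# One attribute selector: [name], [name=value], [name^="value"], ...
# The operator/value part is a single optional group.
ATTRIBUTE = re.compile(r'''
    \[\s*
    (?P<name>[\w-]+)\s*
    (?:
        (?P<operator>[~|^$*]?=)\s*
        (?P<quote>["']?)(?P<value>[^"'\]]*)(?P=quote)\s*
    )?
    \]
''', re.VERBOSE)


def parse_attributes(selector):
    """Return (name, operator, value) for each [...] part; missing parts are None."""
    return [(m.group('name'), m.group('operator'), m.group('value'))
            for m in ATTRIBUTE.finditer(selector)]


print(parse_attributes('div.class1#myid.class2[key=value]'))
# [('key', '=', 'value')]
print(parse_attributes('input[type][name^="user"][data-id$=42]'))
# [('type', None, None), ('name', '^=', 'user'), ('data-id', '$=', '42')]
```

Because each `[...]` is matched on its own, a selector with several attribute parts simply yields several tuples.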

John · Accepted Answer · 2012-06-26 11:17:41Z

2

Thanks all very much for your suggestions and help. I tied it all together into the following two regex patterns.

This one parses a single CSS selector token (e.g. div#myid.myclass[attr=1,fred=3]): http://www.rubular.com/r/2L0N5iWPEJ

import re

# Parts must appear in this order: type, #id, .classes, [attributes]
cssSelector = re.compile(r'^(?P<type>[\*\w\-]+)?(?P<id>#[\w\-]+)?(?P<classes>\.[\w\-\.]+)*(?P<data>\[.+\])*$')

>>> cssSelector.match("table#john.test.test2[hello]").groups()
('table', '#john', '.test.test2', '[hello]')
>>> cssSelector.match("table").groups()
('table', None, None, None)
>>> cssSelector.match("table#john").groups()
('table', '#john', None, None)
>>> cssSelector.match("table.test.test2[hello]").groups()
('table', None, '.test.test2', '[hello]')
>>> cssSelector.match("table#john.test.test2").groups()
('table', '#john', '.test.test2', None)
>>> cssSelector.match("*#john.test.test2[hello]").groups()
('*', '#john', '.test.test2', '[hello]')
>>> cssSelector.match("*").groups()
('*', None, None, None)
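
The groups come back as raw strings ('#john', '.test.test2'), so a little post-processing is needed to get a plain id and a list of classes. The helper below is a minimal sketch of that step; `split_token` is a hypothetical name, not part of the answer, and the pattern is written without the `|` characters because they are not needed inside `[...]`.

```python
import re

# Pattern from the answer; '|' is not needed inside [...] so it is omitted here.
cssSelector = re.compile(
    r'^(?P<type>[\*\w\-]+)?(?P<id>#[\w\-]+)?'
    r'(?P<classes>\.[\w\-\.]+)*(?P<data>\[.+\])*$')


def split_token(token):
    """Hypothetical helper: turn one token into a dict of plain Python values."""
    m = cssSelector.match(token)
    if m is None:
        return None
    ident = m.group('id')
    classes = m.group('classes')
    return {
        'type': m.group('type'),
        # '#john' -> 'john'
        'id': ident[1:] if ident else None,
        # '.test.test2' -> ['test', 'test2']
        'classes': [c for c in classes.split('.') if c] if classes else [],
        'data': m.group('data'),
    }


print(split_token('table#john.test.test2[hello]'))
# {'type': 'table', 'id': 'john', 'classes': ['test', 'test2'], 'data': '[hello]'}
```

Note that `split_token('div.class1#myid.class2[key=value]')` returns None, because this pattern expects the id before the classes.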

And this one parses the attributes (e.g. [link,key~=value]): http://www.rubular.com/r/2L0N5iWPEJ

attribSelector = re.compile(r'(?P<word>\w+)\s*(?P<operator>[^\w,]{0,2})\s*(?P<value>\w+)?\s*[,\]]')

>>> a = attribSelector.findall("[link, ds9 != test, bsdfsdf]")
>>> for x in a: print(x)
('link', '', '')
('ds9', '!=', 'test')
('bsdfsdf', '', '')

A couple of things to note: 1) Attributes are comma-delimited (since I am not using strict CSS). 2) Each selector must follow the order: tag, id, classes, attributes.
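
For downstream code, named fields are often easier to work with than positional tuples. The following is a small sketch, assuming the comma-delimited attribute format described above; `parse_attribs` is a hypothetical helper, and empty or missing groups are normalised to None.

```python
import re

attribSelector = re.compile(
    r'(?P<word>\w+)\s*(?P<operator>[^\w,]{0,2})\s*(?P<value>\w+)?\s*[,\]]')


def parse_attribs(data):
    """Hypothetical helper: '[link, key~=value]' -> list of dicts."""
    return [
        {
            'key': m.group('word'),
            # A bare attribute such as [link] has an empty operator and no value
            'operator': m.group('operator') or None,
            'value': m.group('value'),
        }
        for m in attribSelector.finditer(data)
    ]


for attrib in parse_attribs('[link, ds9 != test, bsdfsdf]'):
    print(attrib)
# {'key': 'link', 'operator': None, 'value': None}
# {'key': 'ds9', 'operator': '!=', 'value': 'test'}
# {'key': 'bsdfsdf', 'operator': None, 'value': None}
```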

The first regex works on individual tokens, i.e. the whitespace- and '>'-separated parts of a selector string, so a full selector has to be split into tokens first. This is because I wanted to use it to check against my own object graph :)
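
One way to get those tokens is to split the selector string on its combinators. This is only a sketch under a simplifying assumption: no token contains whitespace or '>' itself (so attribute lists with spaces, such as "[ds9 != test]", would need extra handling). `COMBINATOR` and `tokenize` are names invented for this example.

```python
import re

# Split on '>' (child) or plain whitespace (descendant), keeping the combinator.
COMBINATOR = re.compile(r'\s*(>)\s*|\s+')


def tokenize(selector):
    """Hypothetical helper: return (combinator, token) pairs from left to right."""
    parts = COMBINATOR.split(selector.strip())
    # The first token has no combinator in front of it.
    tokens = [(None, parts[0])]
    # After that, parts alternates: captured combinator, token, ...
    # The capture is None when the separator was plain whitespace.
    for combinator, token in zip(parts[1::2], parts[2::2]):
        tokens.append((combinator or ' ', token))
    return tokens


print(tokenize('table#john > tr.row td.cell[hello]'))
# [(None, 'table#john'), ('>', 'tr.row'), (' ', 'td.cell[hello]')]
```

Each token can then be matched on its own with the first regex.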

Thanks again!


2 Comments

Niels Over a year ago

This is really helpful. Is it easy to add the pseudo part, like :first-child? It would really help me out.

Sachin Sarola Over a year ago

@John I want the same for PHP. Would you please help me?

atomAltera · Accepted Answer · 2012-06-23 20:26:36Z

1

I think you need something like this:

(?P<tag>[a-zA-Z]+)?(\.(?P<class>[a-zA-Z0-9_-]+)?)?(#(?P<id>[a-zA-Z0-9_-]+)?)?\W*\[((?P<name>[a-zA-Z0-9_-]+?)=(?P<value>[a-zA-Z0-9_-]+?))*\]

Sorry if it doesn't work; I haven't tested it.


Chris Hanson · Accepted Answer · 2012-06-23 20:29:02Z

Definitely don't try to do this with a single regexp. Regular expressions are notoriously difficult to read and debug, so when you have finished the first 80% of this task and go back to fix a bug, the code is going to be a nightmare.

Instead, try writing functions or even a class that will allow you to do the things you want to do. Then you can use a relatively simple regexp for each specific task and use a much more intuitive syntax in your implementations.

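import re

class css_parser:

  def __init__(self):
    # Raw string so the backslashes reach the regex engine unchanged
    self.class_regexp = re.compile(r'\.[\w\-]+') # This is insufficient, but it's a start...

  def get_class(self, selector):
    # findall scans the whole selector; match only looks at the very start
    return self.class_regexp.findall(selector)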

You'll want to consult the W3C CSS spec, particularly section 4.
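
To show where the one-small-regexp-per-task idea can lead, here is a sketch that handles the selector from the question, with the id sitting between two classes. The class name `SimpleSelector` and its attributes are assumptions made for this example; it handles a single compound selector only (no combinators, no pseudo-classes) and is not a complete implementation of the spec.

```python
import re


class SimpleSelector:
    """Hypothetical sketch: one small regex per part of a compound selector."""

    TAG = re.compile(r'^(?:\*|[A-Za-z][\w-]*)')
    ID = re.compile(r'#([\w-]+)')
    CLASS = re.compile(r'\.([\w-]+)')
    ATTRIBUTE = re.compile(r'\[([\w-]+)(?:([~|^$*]?=)([^\]]*))?\]')

    def __init__(self, selector):
        # Pull the [...] parts out first so that '.' or '#' inside an
        # attribute value is not mistaken for a class or an id.
        self.attributes = self.ATTRIBUTE.findall(selector)
        rest = self.ATTRIBUTE.sub('', selector)
        tag = self.TAG.match(rest)
        self.tag = tag.group(0) if tag else None
        ident = self.ID.search(rest)
        self.id = ident.group(1) if ident else None
        self.classes = self.CLASS.findall(rest)


s = SimpleSelector('div.class1#myid.class2[key=value]')
print(s.tag, s.id, s.classes, s.attributes)
# div myid ['class1', 'class2'] [('key', '=', 'value')]
```

Since each part has its own pattern, the order of id and classes does not matter, and a new part can be added without touching the existing patterns.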
