I want to parse this CSS selector (and others of a similar form): div.class1#myid.class2[key=value]

and have it match ".class1" and ".class2", but I can't figure out what regex to use.

example: http://www.rubular.com/r/3dxpzyJLeK

In an ideal world, I'd also want to extract the:

  • type (i.e. div)
  • class (i.e. a list of classes)
  • id (i.e. myid)
  • key (i.e. key)
  • operator (i.e. =)
  • value (i.e. value)

but I can't get the basics going!

Any help would be massively appreciated :)

Thanks!

  • If you want all that info, you're better off using something like pyparsing. It also looks like there are a couple libraries doing this already -- cthedot.de/cssutils and code.google.com/p/css-py -- although it's not clear how complete they are. Commented Jun 23, 2012 at 20:11
  • In theory, there could be more than one [key=value], either using separate lists for key and value, or using an attribute list that contains key-value pairs. And "tag" might be more appropriate than "type". Commented Jun 23, 2012 at 20:17
  • Plus, there are more variations for an attribute, with and without quotes for the attribute values: [type], [type^=value], [type$=value], etc, if that matters, such that it may be necessary to store the attribute operator as well. Commented Jun 23, 2012 at 20:23
  • Study the grammar: w3.org/TR/CSS21/grammar.html and take a look at existing regex-for-CSS-selectors questions: stackoverflow.com/questions/tagged/regex+css-selectors Commented Jun 23, 2012 at 20:27
  • By the way, the "key", "operator" and "value" shouldn't be parsed separately - parse them together as an attribute selector, and capture the operator/value optionally. Commented Jun 23, 2012 at 20:28

3 Answers

Thanks all very much for your suggestions and help. I tied it all together into the following two regex patterns.

This one parses the CSS selector string (e.g. div#myid.myclass[attr=1,fred=3]): http://www.rubular.com/r/2L0N5iWPEJ

import re

cssSelector = re.compile(r'^(?P<type>[*\w-]+)?(?P<id>#[\w-]+)?(?P<classes>\.[\w.-]+)*(?P<data>\[.+\])*$')

>>> cssSelector.match("table#john.test.test2[hello]").groups()
('table', '#john', '.test.test2', '[hello]')
>>> cssSelector.match("table").groups()
('table', None, None, None)
>>> cssSelector.match("table#john").groups()
('table', '#john', None, None)
>>> cssSelector.match("table.test.test2[hello]").groups()
('table', None, '.test.test2', '[hello]')
>>> cssSelector.match("table#john.test.test2").groups()
('table', '#john', '.test.test2', None)
>>> cssSelector.match("*#john.test.test2[hello]").groups()
('*', '#john', '.test.test2', '[hello]')
>>> cssSelector.match("*").groups()
('*', None, None, None)
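To get the list of classes the question asks for, the matched groups can be post-processed. The following is a minimal sketch, assuming the same pattern shape as above (written with plain character classes, since `|` inside `[...]` is a literal character rather than alternation). The helper name `split_selector` and the dictionary layout are my own choices for illustration, not part of the original answer.

```python
import re

# Same shape as the pattern above; the character classes list their members directly.
CSS_SELECTOR = re.compile(
    r'^(?P<type>[*\w-]+)?(?P<id>#[\w-]+)?(?P<classes>\.[\w.-]+)*(?P<data>\[.+\])*$'
)


def split_selector(token):
    """Return the parts of one selector token, or None if it does not match."""
    m = CSS_SELECTOR.match(token)
    if m is None:
        return None
    classes = m.group('classes')
    return {
        'type': m.group('type'),
        # Drop the leading '#' from the id, if there is one.
        'id': m.group('id')[1:] if m.group('id') else None,
        # '.test.test2' -> ['test', 'test2']
        'classes': [c for c in classes.split('.') if c] if classes else [],
        # Raw bracketed block, left for a separate attribute pass.
        'data': m.group('data'),
    }
```

With this sketch, split_selector("table#john.test.test2[hello]") gives type 'table', id 'john' and classes ['test', 'test2']. A selector that puts a class before the id, such as div.class1#myid.class2[key=value], returns None because the pattern expects the order tag, id, classes, attributes.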

And this one parses the attributes (e.g. [link,key~=value]): http://www.rubular.com/r/2L0N5iWPEJ

attribSelector = re.compile(r'(?P<word>\w+)\s*(?P<operator>[^\w,]{0,2})\s*(?P<value>\w+)?\s*[,\]]')

>>> a = attribSelector.findall("[link, ds9 != test, bsdfsdf]")
>>> for x in a: print(x)
('link', '', '')
('ds9', '!=', 'test')
('bsdfsdf', '', '')
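If named fields are more convenient than tuples, the same attribute pattern can be used with finditer. This is only a sketch under the comma-delimited, non-strict syntax of this answer; the function name `parse_attributes` and the key/operator/value dictionary keys are hypothetical names chosen for the example.

```python
import re

# Attribute pattern from this answer, written with plain character classes.
ATTRIB = re.compile(
    r'(?P<word>\w+)\s*(?P<operator>[^\w,]{0,2})\s*(?P<value>\w+)?\s*[,\]]'
)


def parse_attributes(data):
    """Turn '[a, b != c]' into a list of key/operator/value dictionaries."""
    return [
        {
            'key': m.group('word'),
            # The operator group matches an empty string when absent; map that to None.
            'operator': m.group('operator') or None,
            # The value group is optional, so group() already yields None when absent.
            'value': m.group('value'),
        }
        for m in ATTRIB.finditer(data)
    ]
```

For the input "[link, ds9 != test, bsdfsdf]" this yields three dictionaries, with operator '!=' and value 'test' on the second entry and None for the missing parts of the other two.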

A couple of things to note: 1) This parses attributes using comma delimitation (since I am not using strict CSS). 2) This requires each selector to follow the order: tag, id, classes, attributes.

The first regex works on individual tokens, i.e. the parts of a selector string separated by whitespace or '>'. This is because I wanted to use it to check against my own object graph :)
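The tokenizing step itself is not shown in this answer, so here is one way it might look. This is an assumption on my part: the function name `tokenize` is hypothetical, the sketch throws away the difference between the '>' and whitespace combinators, and it assumes there is no whitespace inside the square brackets.

```python
import re


def tokenize(selector):
    """Split a selector string into tokens on '>' or on runs of whitespace."""
    # '\s*>\s*' handles 'a > b' and 'a>b'; '\s+' handles plain descendant spacing.
    parts = re.split(r'\s*>\s*|\s+', selector.strip())
    # Drop empty strings that appear when the input starts or ends with '>'.
    return [p for p in parts if p]
```

Each resulting token (for example 'div#main' or 'ul.nav') can then be passed to the first regex. If the combinator matters for matching against an object graph, the split would need to keep it, for instance by capturing the separator.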

Thanks again!

2 Comments

This is really helpful. Is it easy to add the pseudo part, like :first-child? It would really help me out.
@John I want to do the same in PHP. Would you please help me?

I think you need something like this.

(?P<tag>[a-zA-Z]+)?(\.(?P<class>[a-zA-Z0-9_-]+)?)?(#(?P<id>[a-zA-Z0-9_-]+)?)?\W*\{((?P<name>[a-zA-Z0-9_-]+?)=(?P<value>[a-zA-Z0-9_-]+?))*\}

Sorry if it doesn't work; I haven't tested it.


Definitely don't try to do this with a single regexp. Regular expressions are notoriously difficult to read and debug, so when you finish the first 80% of this task and go back to fix a bug, the code is going to be a nightmare.

Instead, try writing functions or even a class that will allow you to do the things you want to do. Then you can use a relatively simple regexp for each specific task and use a much more intuitive syntax in your implementations.

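import re

class css_parser:

  def __init__(self):
    # Raw string, so the backslashes reach the regex engine unchanged.
    self.class_regexp = re.compile(r'\.[\w\-]+') # This is insufficient, but it's a start...

  def get_class(self, selector):
    # search() finds the first class anywhere in the selector; match() only looks at the start.
    m = self.class_regexp.search(selector)
    return m.group(0) if m else None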

You'll want to consult the W3C CSS spec, particularly section 4.
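To make the "one simple regexp per task" idea concrete, here is a rough sketch that handles the selector from the question, including a class that appears before the id. It is not a full CSS parser: the function name `parse_simple_selector` and the four patterns are my own assumptions, it covers only a single compound selector (no combinators or pseudo-classes), and it expects standard `[key]` or `[key op value]` attribute blocks.

```python
import re

TAG = re.compile(r'^(?:\*|[A-Za-z][\w-]*)')   # leading tag name or '*'
CLASS = re.compile(r'\.([\w-]+)')             # every '.name'
ID = re.compile(r'#([\w-]+)')                 # first '#name'
# [key], [key=value], [key~="value"], ...; operator and value are optional.
ATTR = re.compile(r'\[\s*([\w-]+)\s*(?:([~|^$*]?=)\s*"?([^"\]]*)"?\s*)?\]')


def parse_simple_selector(selector):
    """Split one compound selector into tag, id, classes and attributes."""
    # Collect attribute blocks first, then remove them so that dots or
    # hashes inside attribute values are not mistaken for classes or ids.
    attributes = ATTR.findall(selector)
    rest = ATTR.sub('', selector)
    tag = TAG.match(rest)
    id_match = ID.search(rest)
    return {
        'tag': tag.group(0) if tag else None,
        'id': id_match.group(1) if id_match else None,
        'classes': CLASS.findall(rest),
        # List of (key, operator, value); missing parts are empty strings.
        'attributes': attributes,
    }
```

Calling parse_simple_selector('div.class1#myid.class2[key=value]') returns tag 'div', id 'myid', classes ['class1', 'class2'] and attributes [('key', '=', 'value')]. Because each part has its own small pattern, adding something like pseudo-classes later means adding one more pattern rather than reworking a single large one.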
