How to write regex to extract specific key format and value from CSS file?

Question

I have a CSS file generated by some tool, and it's formatted like this:

@font-face {
    font-family: 'icomoon';
    src:url('fonts/icomoon.eot?4px1bm');
    src:url('fonts/icomoon.eot?#iefix4px1bm') format('embedded-opentype'),
        url('fonts/icomoon.woff?4px1bm') format('woff'),
        url('fonts/icomoon.ttf?4px1bm') format('truetype'),
        url('fonts/icomoon.svg?4px1bm#icomoon') format('svg');
    font-weight: normal;
    font-style: normal;
}

[class^="icon-"], [class*=" icon-"] {
    font-family: 'icomoon';
    speak: none;
    font-style: normal;
    font-weight: normal;
    font-variant: normal;
    text-transform: none;
    line-height: 1;

    /* Better Font Rendering =========== */
    -webkit-font-smoothing: antialiased;
    -moz-osx-font-smoothing: grayscale;
}

.icon-pya:before {
    content: "\e60d";
}
.icon-pyp:before {
    content: "\e60b";
}
.icon-tomb:before {
    content: "\e600";
}
.icon-right:before {
    content: "\e601";
}

I want to use a regular expression in Python to extract every CSS selector that starts with .icon- together with its associated value, e.g.:

{key: '.icon-right:before', value: 'content: "\e601";'}

I only have basic knowledge of regular expressions, so I wrote this: \^.icon.*\, but it only matches the keys, not the values.

In which language will you apply this regex? Is it JavaScript?
– Shiplu Mokaddim, Commented May 31, 2014 at 3:51
Actually, in Python. But I think it shouldn't matter, right?
– LeoShi, Commented May 31, 2014 at 4:04
Hey Leo, did one of the answers help with the problem, or are you still wrestling with it?
– zx81, Commented Jun 1, 2014 at 22:03

Zero Piraeus · Accepted Answer · 2014-05-31 04:58:42Z

If you're using Python, this regex works:

(\.icon-[^\{]*?)\s*\{\s*([^\}]*?)\s*\}

Example:

>>> css = r"""
... /* ... etc ... */
... .icon-right:before {
...     content: "\e601";
... }
... """
>>> import re
>>> pattern = re.compile(r"(\.icon-[^\{]*?)\s*\{\s*([^\}]*?)\s*\}")
>>> re.findall(pattern, css)
[
    ('.icon-pya:before', 'content: "\\e60d";'),
    ('.icon-pyp:before', 'content: "\\e60b";'),
    ('.icon-tomb:before', 'content: "\\e600";'),
    ('.icon-right:before', 'content: "\\e601";')
]

You can then convert that to a dictionary easily:

>>> dict(re.findall(pattern, css))
{
    '.icon-right:before': 'content: "\\e601";',
    '.icon-pya:before': 'content: "\\e60d";',
    '.icon-tomb:before': 'content: "\\e600";',
    '.icon-pyp:before': 'content: "\\e60b";'
}

This is usually a more sensible data structure than a sequence of {'key': ..., 'value': ...} dictionaries - if you must have the latter, I'll assume you have enough Python to work out how to get it.
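If the list-of-dictionaries shape from the question is what you need, a minimal sketch could look like the following. The names css, pattern, and records are only illustrative, and the sample CSS is a shortened copy of the file in the question.

```python
import re

# Shortened sample of the CSS from the question.
# A raw string keeps the backslashes in the content values intact.
css = r"""
.icon-tomb:before {
    content: "\e600";
}
.icon-right:before {
    content: "\e601";
}
"""

pattern = re.compile(r"(\.icon-[^\{]*?)\s*\{\s*([^\}]*?)\s*\}")

# One dictionary per rule, in the order the rules appear in the file.
records = [{"key": key, "value": value} for key, value in pattern.findall(css)]
```

Unlike the plain dict, this keeps duplicate selectors if the file happens to define the same one twice.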

Okay, that was a pretty complex regex, so taking it piece by piece:

(\.icon-[^\{]*?)

This is the first capturing group, delimited by parentheses. Inside those, we've got \.icon-, followed by [^\{]*? - which is a sequence of 0 or more (*) but as few as possible (?) of anything but a '{' ([^\{]).

Then, there's a non-captured section:

\s*\{\s*

This means any amount of whitespace (\s*), followed by a '{' (\{), followed by any amount of whitespace (\s*).

Next, our second capturing group, again enclosed in parentheses:

([^\}]*?)

... which is 0 or more (*) but as few as possible (?) of anything but a '}' ([^\}]).

Finally, the last non-captured section:

\s*\}

... which is any amount of whitespace (\s*), followed by a '}' (\}).
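The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows comments. This is only a sketch of the pattern above in a different layout; the sample css string is an assumption made for illustration.

```python
import re

# The regex from above, split into its four pieces with a comment on each.
pattern = re.compile(r"""
    (\.icon-[^\{]*?)   # group 1: text starting with .icon-, as little as possible
    \s*\{\s*           # whitespace, opening brace, whitespace (not captured)
    ([^\}]*?)          # group 2: the declarations, as little as possible
    \s*\}              # whitespace, closing brace (not captured)
""", re.VERBOSE)

# Illustrative one-rule sample taken from the question's CSS.
css = r"""
.icon-right:before {
    content: "\e601";
}
"""

matches = pattern.findall(css)
```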

In case you're wondering, the reason for using *? (0 or more but as few as possible - known as a non-greedy match) is so that the match for \s* (any amount of whitespace) can consume as much whitespace as possible, and it won't end up inside the captured groups.

Hi @zero, thanks for the detailed explanation. I still don't understand why you use \s*? instead of \s*, because I tried \s* and it works as well. Can you give me an example where only \s*? works and \s* doesn't?
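Assuming the comment is about the lazy *? on the two character classes (the \s* parts are greedy in both versions), a small comparison may help. Both patterns match, which is probably why the greedy form seemed to work; the difference is in what ends up inside the captured groups. The names lazy and greedy are just labels for this sketch.

```python
import re

# One rule from the question's CSS, as a raw string.
css = r"""
.icon-right:before {
    content: "\e601";
}
"""

# Original pattern: lazy quantifiers inside both groups.
lazy = re.compile(r"(\.icon-[^\{]*?)\s*\{\s*([^\}]*?)\s*\}")
# Same pattern with greedy quantifiers inside both groups.
greedy = re.compile(r"(\.icon-[^\{]*)\s*\{\s*([^\}]*)\s*\}")

lazy_result = lazy.findall(css)
greedy_result = greedy.findall(css)

# repr() makes the trailing space and newline in the greedy captures visible.
print(repr(lazy_result))
print(repr(greedy_result))
```

With the greedy version, the character classes swallow the whitespace before the braces, so the selector ends with a space and the value ends with a newline.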

zx81 · Accepted Answer · 2014-05-31 04:41:27Z

With your current content, this regex would work:

(\.icon-[^\s{]+)\s*{\s*([^;]*;)

See demo (look at the substitutions at the bottom)

The name would get captured to Group 1, and the rule to Group 2.

To output in the format you specified, you have several options.

For instance, tweak the regex slightly and replace with

{key: '\1', value: '\2' }

This assumes only one rule per set of braces.
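As a rough sketch of the replacement approach: one possible tweak (an assumption, since the answer does not spell it out) is to also match the closing brace so the substitution consumes it. The braces are escaped here for clarity, and the variable names and one-rule sample are illustrative only.

```python
import re

# One rule from the question's CSS, as a raw string.
css = r'''.icon-right:before {
    content: "\e601";
}'''

# The regex above, extended with \s*\} so the closing brace is part of the match.
pattern = re.compile(r"(\.icon-[^\s{]+)\s*\{\s*([^;]*;)\s*\}")

# \1 and \2 in the replacement refer to the selector and the first declaration.
result = pattern.sub(r"{key: '\1', value: '\2' }", css)
print(result)
```

A rule with more than one declaration would not match this tweaked pattern at all, which is the one-rule-per-block assumption in practice.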

A better option is to find all the matches, then for each match output the string you want, concatenating from the Group 1 and Group 2 captures.

Here is a start:

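import re

reobj = re.compile(r"(\.icon-[^\s{]+)\s*{\s*([^;]*;)")
# subject is assumed to hold the CSS text read from the file
for match in reobj.finditer(subject):
    key = match.group(1)    # Group 1: the selector
    value = match.group(2)  # Group 2: the rule
    print({"key": key, "value": value})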

edited May 31, 2014 at 4:41

answered May 31, 2014 at 4:36

zx81
