1

I'm parsing CSS in a Java app to extract the URLs from linked style sheets. (I tried CSSParser ( http://cssparser.sourceforge.net/ ), but it silently drops many of the rules when it parses.)

So I'm just using a regex. I'd like one that captures only the URLs and is robust enough to handle real CSS from the wild:

background-image: url('test/test.gif');
background: url("test2/test2.gif");
background-image: url(test3/test3.gif);
background: url   ( test4/ test4.gif );
background: url( " test5/test5.gif"   );

You get the idea. This is in Java's regex implementation (not my favorite).

3
  • The last two examples are invalid, at least if I read the spec correctly. Whitespace is only allowed directly after the opening paren and directly before the closing one. Commented Jan 10, 2011 at 23:44
  • Probably invalid according to spec, but all the browsers will handle them. Commented Jan 10, 2011 at 23:45
  • Do you only want background image URLs? Those aren't the only places where the url() CSS function occurs. Commented Jan 11, 2011 at 7:49

2 Answers

6

The problem with regexes is that they are often stricter than you need. If you had shown us your current, imperfect regex, I could have helped you more.

Regarding the first comment: browsers tolerate most HTML/CSS mistakes (but NOT JavaScript mistakes, since JavaScript is a programming language rather than a markup language).

You could start with the background(-image)? token to lock the first part. How to proceed? Very difficult...

You always have a colon, so you can add it to the constant part of the pattern. Then, judging from your examples (not from the CSS spec), comes a variable number of whitespace characters followed by the url token. A variable number of whitespace characters is [\s]* (not [\w]*, which matches word characters), and this becomes part of our regex.

I tried this with RegexBuddy

background(-image)?: url[\s]*\([\s]*(?<url>[^\)]*)\);

Unfortunately, it captures whitespace inside the parentheses along with the URL:

Matched text: background-image: url('test/test.gif');
Match offset: 0
Match length: 39
Backreference 1: -image
Backreference 1 offset: 10
Backreference 1 length: 6
Backreference 2: 'test/test.gif'
Backreference 2 offset: 22
Backreference 2 length: 15

Matched text: background: url   ( test4/ test4.gif );
Match offset: 119
Match length: 39
Backreference 1: 
Backreference 1 offset: -1
Backreference 1 length: 0
Backreference 2:  test4/ test4.gif 
Backreference 2 offset: 138
Backreference 2 length: 18

So, when you get the URL with this regex, you must trim the string. I couldn't exclude whitespace from the url group because of example 4, where the space inside the URL has to be matched. That example is probably wrong anyway, unless you really have a file named %20test4.gif.

[Edit] I prefer the following version of the regex

background(-image)?: url[\s]*\([\s]*(?<url>[^\)]*)[\s]*\)[\s]*;

It tolerates more whitespace around the closing parenthesis.
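For what it's worth, here is a minimal Java sketch of how the regex above might be used together with the trimming step. The class name CssUrlExtractor and the helper extractUrls are made up for illustration, and the named group needs Java 7 or later. Because [^\)]* is greedy, trailing whitespace and any quotes still end up in the url group, so the sketch trims and strips one pair of matching quotes afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CssUrlExtractor {
    // The regex from this answer, escaped for a Java string literal.
    private static final Pattern BACKGROUND_URL = Pattern.compile(
            "background(-image)?: url[\\s]*\\([\\s]*(?<url>[^\\)]*)[\\s]*\\)[\\s]*;");

    // Hypothetical helper: returns the cleaned URLs found in a block of CSS.
    public static List<String> extractUrls(String css) {
        List<String> urls = new ArrayList<>();
        Matcher m = BACKGROUND_URL.matcher(css);
        while (m.find()) {
            // The group may still carry surrounding whitespace and quotes.
            String raw = m.group("url").trim();
            boolean quoted = raw.length() >= 2
                    && (raw.charAt(0) == '"' || raw.charAt(0) == '\'')
                    && raw.charAt(raw.length() - 1) == raw.charAt(0);
            if (quoted) {
                // Drop one pair of matching quotes, then trim again
                // to handle input such as url( " test5/test5.gif" ).
                raw = raw.substring(1, raw.length() - 1).trim();
            }
            urls.add(raw);
        }
        return urls;
    }

    public static void main(String[] args) {
        String css = "background-image: url('test/test.gif');\n"
                + "background: url   ( test4/ test4.gif );\n"
                + "background: url( \" test5/test5.gif\"   );\n";
        for (String url : extractUrls(css)) {
            System.out.println(url);
        }
    }
}
```

Note that whitespace inside the URL (example 4) is kept as is; only the outer whitespace and quotes are removed.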

1

Do you have to use ONLY regexes? Your life would be much easier if you used string functions to remove all the whitespace first; then you can write a regex that doesn't have to worry about it.

Here's a quick one, might not work very well:

background(-image)?:url\(["']?(.*?)["']?\);

The second capture group should give you what you want.

The .* should probably be replaced with a character class that contains all the characters a valid path can contain.
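A rough Java sketch of this idea, in case it helps. The class name StrippedCssUrls and the method extractUrls are invented for the example, and the character class [^"'()]* is only an assumption about what a path may contain (anything except quotes and parentheses), not something taken from the CSS spec.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StrippedCssUrls {
    // Assumption: a URL is any run of characters other than quotes and parentheses.
    private static final Pattern URL_AFTER_STRIP = Pattern.compile(
            "background(-image)?:url\\([\"']?([^\"'()]*)[\"']?\\);");

    // Hypothetical helper: strips all whitespace, then collects capture group 2.
    public static List<String> extractUrls(String css) {
        // Remove every whitespace character before matching.
        String stripped = css.replaceAll("\\s+", "");
        List<String> urls = new ArrayList<>();
        Matcher m = URL_AFTER_STRIP.matcher(stripped);
        while (m.find()) {
            urls.add(m.group(2)); // group 1 is the optional "-image"
        }
        return urls;
    }

    public static void main(String[] args) {
        String css = "background: url(\"test2/test2.gif\");\n"
                + "background: url   ( test4/ test4.gif );\n";
        for (String url : extractUrls(css)) {
            System.out.println(url);
        }
    }
}
```

One consequence of stripping first: whitespace inside a URL disappears too, so "test4/ test4.gif" comes back as "test4/test4.gif".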
