(Java) RegEx to get the URLs from CSS?

Question

I'm parsing CSS to get the URLs out of linked style sheets. This is a Java app. (I tried using the CSSParser ( http://cssparser.sourceforge.net/ ), however, it is silently dropping many of the rules when it parses.)

So I'm just using regex. I'd like a regex that gets me just the URLs and is robust enough to deal with real CSS from the wild:

background-image: url('test/test.gif');
background: url("test2/test2.gif");
background-image: url(test3/test3.gif);
background: url   ( test4/ test4.gif );
background: url( " test5/test5.gif"   );

You get the idea. This is in Java's regex implementation (not my favorite).

The last two examples are invalid, at least if I read the spec correctly. Whitespace is only allowed directly after the opening paren and directly before the closing one.
– Joey, Commented Jan 10, 2011 at 23:44
Probably invalid according to spec, but all the browsers will handle them.
– mtyson, Commented Jan 10, 2011 at 23:45
Do you only want background image URLs? Those aren't the only places where the url() CSS function occurs.
– BoltClock, Commented Jan 11, 2011 at 7:49

usr-local-ΕΨΗΕΛΩΝ · Accepted Answer · 2011-01-11 00:06:24Z

The problem with regexes is that they are sometimes stricter than you need. If you had shown us your current, not-quite-working regex, I would have been able to help you more.

First comment: browsers tend to tolerate the majority of HTML/CSS mistakes (but NOT JavaScript mistakes, since JavaScript is a programming language and not a markup language).

You could start with the background(-image)? token to lock the first part. How to proceed? Very difficult...

You always have a colon, so you can add it to the constant part of the token. Then, judging from your examples (not from the CSS specs), comes a variable number of whitespace characters around the url token. A variable number of whitespace characters is [\s]* (not [\w]*, which would match word characters), and this becomes part of our regex.

I tried this with RegexBuddy

background(-image)?: url[\s]*\([\s]*(?<url>[^\)]*)\);

Unfortunately, it also captures whitespace inside the url group:

Matched text: background-image: url('test/test.gif');
Match offset: 0
Match length: 39
Backreference 1: -image
Backreference 1 offset: 10
Backreference 1 length: 6
Backreference 2: 'test/test.gif'
Backreference 2 offset: 22
Backreference 2 length: 15

Matched text: background: url   ( test4/ test4.gif );
Match offset: 119
Match length: 39
Backreference 1: 
Backreference 1 offset: -1
Backreference 1 length: 0
Backreference 2:  test4/ test4.gif 
Backreference 2 offset: 138
Backreference 2 length: 18

So when you extract the URL with this regex, you must trim the string. The captured group also still contains any surrounding quotes, as Backreference 2 of the first match shows. I couldn't exclude whitespace from the url group because of example 4: the regex has to match a URL with a space in it, even though that URL is only correct in this example if you really have a file named %20test4.gif.

[Edit] I prefer the following version of the regex

background(-image)?: url[\s]*\([\s]*(?<url>[^\)]*)[\s]*\)[\s]*;

It tolerates more whitespace, for example between the closing parenthesis and the semicolon.
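For reference, here is a minimal sketch of how the regex above might be used from Java. The class name CssUrlExtractor and the method extractUrls are made up for illustration; the quote stripping is an extra step that the regex itself does not do. Named groups like (?<url>...) need Java 7 or later, so on an older JVM you would have to use a numbered group instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class CssUrlExtractor {
    // The regex from the answer, with backslashes doubled for a Java string literal.
    private static final Pattern BACKGROUND_URL = Pattern.compile(
            "background(-image)?: url[\\s]*\\([\\s]*(?<url>[^\\)]*)[\\s]*\\)[\\s]*;");

    /** Returns every captured URL, trimmed and with one pair of matching quotes removed. */
    static List<String> extractUrls(String css) {
        List<String> urls = new ArrayList<>();
        Matcher m = BACKGROUND_URL.matcher(css);
        while (m.find()) {
            // The group still contains surrounding whitespace and quotes, so clean it up.
            String raw = m.group("url").trim();
            if (raw.length() >= 2) {
                char first = raw.charAt(0);
                char last = raw.charAt(raw.length() - 1);
                if ((first == '\'' || first == '"') && first == last) {
                    raw = raw.substring(1, raw.length() - 1).trim();
                }
            }
            urls.add(raw);
        }
        return urls;
    }

    public static void main(String[] args) {
        // The five declarations from the question.
        String css = String.join("\n",
                "background-image: url('test/test.gif');",
                "background: url(\"test2/test2.gif\");",
                "background-image: url(test3/test3.gif);",
                "background: url   ( test4/ test4.gif );",
                "background: url( \" test5/test5.gif\"   );");
        for (String url : extractUrls(css)) {
            System.out.println(url);
        }
    }
}
```

Note that the space inside example 4 is kept (test4/ test4.gif), because the sketch only trims the ends of the captured group.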

DanielGibbs · Accepted Answer · 2011-01-11 07:29:00Z


Do you have to use ONLY regexes? Your life could be made so much easier if you used string functions to remove all the whitespace first; then you can write a regex that doesn't have to worry about it.

Here's a quick one, might not work very well:

background(-image)?:url\(["']?(.*)["']?\);

The second capture group should give you what you want.

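The .* is greedy, so it will also swallow a closing quote, and once all the whitespace (including line breaks) is gone it can run on to the last ); in the input. It should probably be replaced with a character class that contains all the characters a valid path can contain.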
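A rough sketch of this approach in Java follows. The class name StrippedCssUrls and the method extractUrls are hypothetical, and the character class [^"')]* is just one possible stand-in for "characters a valid path can contain", chosen so the match stops at the closing quote or parenthesis. Be aware that removing all whitespace also removes spaces that are legitimately part of a quoted URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class StrippedCssUrls {
    // Same shape as the regex above, but .* is replaced by [^"')]* (an assumption)
    // so a match cannot run past the closing quote or parenthesis.
    private static final Pattern URL = Pattern.compile(
            "background(-image)?:url\\([\"']?([^\"')]*)[\"']?\\);");

    /** Strips all whitespace from the CSS, then collects the second capture group. */
    static List<String> extractUrls(String css) {
        // String functions first: drop spaces, tabs and line breaks.
        String compact = css.replaceAll("\\s+", "");
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(compact);
        while (m.find()) {
            urls.add(m.group(2));
        }
        return urls;
    }
}
```

With the five declarations from the question, example 4 comes out as test4/test4.gif, because the space inside the URL is removed along with all the other whitespace.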

