I'd like to construct a regex that will check for a "path" and a "foo" parameter (non-negative integer). "foo" is optional. It should:

MATCH

path?foo=67                 # path found, foo = 67
path?foo=67&bar=hello       # path found, foo = 67
path?bar=bye&foo=1&baz=12   # path found, foo = 1
path?bar=123                # path found, foo = ''
path                        # path found, foo = ''

DO NOT MATCH

path?foo=37signals          # foo is not integer
path?foo=-8                 # foo cannot be negative
something?foo=1             # path not found

Also, I'd like to get the value of foo, without performing an additional match.

What would be the simplest regex to achieve this?

2 Comments

  • Just a note: you should put your attempts in the question as well next time :) Commented Sep 19, 2014 at 6:58
  • Should path?foo=&bar=1 match? Commented Sep 20, 2014 at 6:28

10 Answers

23
+100

The Answer

Screw your hard work, I just want the answer! Okay, here you go...

var regex = /^path(?:(?=\?)(?:[?&]foo=(\d*)(?=[&#]|$)|(?![?&]foo=)[^#])+)?(?=#|$)/,
    URIs = [
      'path',                 // valid!
      'pathbreak',            // invalid path
      'path?foo=123',         // valid!
      'path?foo=-123',        // negative
      'invalid?foo=1',        // invalid path
      'path?foo=123&bar=abc', // valid!
      'path?bar=abc&foo=123', // valid!
      'path?bar=foo',         // valid!
      'path?foo',             // valid!
      'path#anchor',          // valid!
      'path#foo=bar',         // valid!
      'path?foo=123#bar',     // valid!
      'path?foo=123abc',      // not an integer
    ];
      
for(var i = 0; i < URIs.length; i++) {
    var URI = URIs[i],
        match = regex.exec(URI);

    if(match) {
        var foo = match[1] ? match[1] : 'null';
        console.log(URI + ' matched, foo = ' + foo);
    } else {
        console.log(URI + ' is invalid...');
    }
}


Research

Your bounty request asks for "credible and/or official sources", so I'll quote the RFC on query strings.

The query component contains non-hierarchical data that, along with data in the path component (Section 3.3), serves to identify a resource within the scope of the URI's scheme and naming authority (if any). The query component is indicated by the first question mark ("?") character and terminated by a number sign ("#") character or by the end of the URI.

This seems pretty vague on purpose: a query string starts with the first ? and is terminated by a # (start of anchor) or the end of the URI (or string/line in our case). They go on to mention that most data sets are in key=value pairs, which seems to be what you expect to parse (so let's assume that is the case).

However, as query components are often used to carry identifying information in the form of "key=value" pairs and one frequently used value is a reference to another URI, it is sometimes better for usability to avoid percent-encoding those characters.

With all this in mind, let's assume a few things about your URIs:

  1. Your examples start with the path, so the path will be from the beginning of the string until a ? (query string), # (anchor), or the end of the string.
  2. The query string is the iffy part, since the RFC doesn't really define a "norm". A browser tends to expect a query string to be generated from a form submission and be a list of key=value pairs joined by & characters. Keeping this mentality:
     • A key cannot be null, will be preceded by a ? or &, and cannot contain a =, & or #.
     • A value is optional, will be preceded by key=, and cannot contain a & or #.
  3. Anything after a # character is the anchor.

Let's Begin!

Let's start by mapping out our basic URI structure. You have a path, which is the run of characters from the beginning of the string up until a ?, #, or the end of the string. You have an optional query string, which starts at a ? and goes until a # or the end of the string. And you have an optional anchor, which starts at a # and goes until the end of the string.

^
([^?#]+)
(?:
  \?
  ([^#]+)
)?
(?:
  #
  (.*)
)?
$
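
As a quick sanity check, the structure above can be collapsed onto one line and tried in JavaScript. This is only a sketch: splitUri is a hypothetical helper name introduced for illustration, and the expression is the one shown above with the whitespace removed.

```javascript
// The three-part expression from above on a single line:
// group 1 = path, group 2 = query string (without the ?), group 3 = anchor (without the #).
var uriParts = /^([^?#]+)(?:\?([^#]+))?(?:#(.*))?$/;

// Hypothetical helper: returns the parts, or null when the string does not fit the structure.
function splitUri(uri) {
    var m = uriParts.exec(uri);
    if (!m) return null;
    return { path: m[1], query: m[2], anchor: m[3] };
}

console.log(splitUri('path?foo=1&bar=2#top'));
console.log(splitUri('path#top'));
console.log(splitUri('path'));
```

Parts that are absent come back as undefined, which is how JavaScript reports an optional group that did not participate in the match.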

Let's do some clean up before digging into the query string. You can easily require the path to equal a certain value by replacing the first capture group. Whatever you replace it with (path), will have to be followed by an optional query string, an optional anchor, and the end of the string (no more, no less). Since you don't need to parse the anchor, the capturing group can be replaced by ending the match at either a # or the end of the string (which is the end of the query parameter).

^path
(?:
  \?
  ([^#]+)
)?
(?=#|$)
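
A small sketch of how this cleaned-up version behaves, assuming the character class in the second group is [^#]+ (one or more non-# characters). queryOf is a hypothetical helper name used only for this illustration.

```javascript
// Fixed path, optional query string, then a lookahead for # or the end of the string.
var fixedPath = /^path(?:\?([^#]+))?(?=#|$)/;

// Hypothetical helper: null when the path does not match,
// otherwise the raw query string ('' when there is none).
function queryOf(uri) {
    var m = fixedPath.exec(uri);
    if (!m) return null;
    return m[1] === undefined ? '' : m[1];
}

['path', 'path?foo=1', 'path?foo=1#bar', 'pathbreak', 'something?foo=1'].forEach(function (uri) {
    console.log(uri + ' -> ' + JSON.stringify(queryOf(uri)));
});
```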

Stop Messing Around

Okay, I've been doing a lot of setup without really worrying about your specific example. The next example will match a specific path (path) and optionally match a query string while capturing the value of a foo parameter. This means you could stop here and check for a valid match: if the match is valid, then the first capture group must be null or a non-negative integer. But that wasn't your question, was it? This got a lot more complicated, so I'm going to explain the expression inline:

^            (?# match beginning of the string)
path         (?# match path literally)
(?:          (?# begin optional non-capturing group)
 (?=\?)      (?# lookahead for a literal ?)
 (?:         (?# begin repeating non-capturing group)
   [?&]      (?# keys are preceded by ? or &)
   foo       (?# match key literally)
   (?:       (?# begin optional non-capturing group)
    =        (?# values are preceded by =)
    ([^&#]*) (?# values are 0+ length and do not contain & or #)
   )         (?# end optional non-capturing group)
  |          (?# OR)
   [^#]      (?# query strings are non-# characters)
 )+          (?# end repeating non-capturing group)
)?           (?# end optional non-capturing group)
(?=#|$)      (?# lookahead for a literal # or end of the string)

Some key takeaways here:

  • Javascript doesn't support lookbehinds (at the time of writing), so you can't look behind for a ? or & before the key foo; you actually have to match one of those characters. That in turn means the start of your query string (which looks for a ?) has to be a lookahead, so that the ? itself is not consumed. It also means that your query string will always be at least one character (the ?), so you want to repeat the query string [^#] 1+ times.
  • The query string now repeats one character at a time in a non-capturing group, unless it sees the key foo, in which case it captures the value and continues repeating.
  • Since this non-capturing query string group repeats all the way until the anchor or end of the URI, a second foo value (path?foo=123&foo=bar) would overwrite the initial captured value, meaning you can't fully rely on the above solution.
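
To make the last point concrete, here is a minimal sketch that runs the expression above (collapsed onto one line, exactly as written) against a URI with two foo keys. lastFoo is a hypothetical helper name. Note that nothing validates the value yet, so a non-integer is captured as-is.

```javascript
// The intermediate expression from above, with comments and whitespace removed.
var captureFoo = /^path(?:(?=\?)(?:[?&]foo(?:=([^&#]*))|[^#])+)?(?=#|$)/;

// Hypothetical helper: returns whatever ended up in capture group 1,
// or null when the string does not match at all.
function lastFoo(uri) {
    var m = captureFoo.exec(uri);
    return m ? m[1] : null;
}

console.log(lastFoo('path?foo=123&foo=bar'));     // 'bar' (the second foo wins)
console.log(lastFoo('path?bar=1&foo=37signals')); // '37signals' (captured, but not an integer)
console.log(lastFoo('something?foo=1'));          // null (path does not match)
```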

Final Solution?

Okay, now that I've captured the foo value, it's time to kill the match on values that are not non-negative integers.

^            (?# match beginning of the string)
path         (?# match path literally)
(?:          (?# begin optional non-capturing group)
 (?=\?)      (?# lookahead for a literal ?)
 (?:         (?# begin repeating non-capturing group)
   [?&]      (?# keys are preceded by ? or &)
   foo       (?# match key literally)
   =         (?# values are preceded by =)
   (\d*)     (?# value must be a non-negative integer)
   (?=       (?# begin lookahead)
     [&#]    (?# literally match & or #)
    |        (?# OR)
     $       (?# match end of the string)
   )         (?# end lookahead)
  |          (?# OR)
   (?!       (?# begin negative lookahead)
    [?&]     (?# literally match ? or &)
    foo=     (?# literally match foo=)
   )         (?# end negative lookahead)
   [^#]      (?# query strings are non-# characters)
 )+          (?# end repeating non-capturing group)
)?           (?# end optional non-capturing group)
(?=#|$)      (?# lookahead for a literal # or end of the string)

Let's take a closer look at some of the juju that went into that expression:

  • After finding foo=\d*, we use a lookahead to ensure that it is followed by a &, #, or the end of the string (the end of a query string value).
  • However, if there is more after foo=\d* (for example foo=123abc), the alternation would fall back to a generic [^#] match right at the [?&] before foo. This isn't good, because it will continue to match! So before you look for a generic query string character ([^#]), you must make sure you are not looking at a foo (that must be handled by the first alternative). This is where the negative lookahead (?![?&]foo=) comes in handy.
  • This will work with multiple foo keys, since they will all have to equal non-negative integers. This lets foo be optional (or equal null) as well.
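
The comments on this answer report that, in JavaScript, the capture comes back empty when other parameters follow foo. The sketch below is one way to reproduce that. The explanation in the code comments is my understanding of ECMAScript semantics (captures inside a quantified group are cleared at the start of each repetition) and should be treated as an assumption to verify in your own engine; the variable names are illustrative only.

```javascript
// The final expression from above on a single line.
var finalRegex = /^path(?:(?=\?)(?:[?&]foo=(\d*)(?=[&#]|$)|(?![?&]foo=)[^#])+)?(?=#|$)/;

// foo is the last parameter: the final repetition of the group is the one
// that captures, so the value survives.
var fooLast = finalRegex.exec('path?bar=abc&foo=123');
console.log(fooLast[1]);  // '123'

// foo is followed by another parameter: later repetitions go through the
// [^#] branch, and the capture is reported as undefined even though the
// string as a whole still matches.
var fooFirst = finalRegex.exec('path?foo=123&bar=abc');
console.log(fooFirst[1]); // undefined

// Validation itself is unaffected: an invalid foo still kills the match.
console.log(finalRegex.exec('path?foo=-123')); // null
```

So the expression is still usable as a validator in JavaScript, but the captured value is only dependable when foo is the last parameter in the query string.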

Disclaimer: Most Regex101 demos use PHP for better syntax highlighting and include \n in negative character classes since there are multiple lines of examples.


5 Comments

Just re-read the question and saw that OP wants to capture the value of foo. If you just throw a capture group around \d+, it will be the first and only capture group and will contain the value of foo or null on a successful match. The initial "answer" section has been updated...
I also just realized I was using \d+ when I meant to be using \d* (since the parameter is optional). Also, @SteveChambers added some good tests to the list.
+1 for showing all the workings that got you there. Only criticism would be it doesn't capture the foo value in a single capturing group - if you could update it to do that it deserves the bounty for sure :-)
It seems the capture group isn't working if there are more query parameters after foo. Appears as if this is because the alternation forgets the captured group (only in JS). I'm banging my head on this one, off to Google.
Seems this is a known bug in the JavaScript Engine and easy to reproduce (JS vs. PCRE).
5

Nice question! Seems fairly simple at first...but there are a lot of gotchas. Would advise checking any claimed solution will handle the following:

ADDITIONAL MATCH TESTS

path?                  # path found, foo = ''
path#foo               # path found, foo = ''
path#bar               # path found, foo = ''
path?foo=              # path found, foo = ''
path?bar=1&foo=        # path found, foo = ''
path?foo=&bar=1        # path found, foo = ''
path?foo=1#bar         # path found, foo = 1
path?foo=1&foo=2       # path found, foo = 2
path?foofoo=1          # path found, foo = ''
path?bar=123&foofoo=1  # path found, foo = ''

ADDITIONAL DO NOT MATCH TESTS

pathbar?               # path not found
pathbar?foo=1          # path not found
pathbar?bar=123&foo=1  # path not found
path?foo=a&foofoo=1    # not an integer
path?foofoo=1&foo=a    # not an integer

The simplest regex I could come up with that works for all these additional cases is:

path(?=(\?|$|#))(\?(.+&)?foo=(\d*)(&|#|$)|((?![?&]foo=).)*$)

However, I would advise adding ?: to the unused capturing groups so they are ignored and you can easily get the foo value from Group 1 - see Debuggex Demo

path(?=(?:\?|$|#))(?:\?(?:.+&)?foo=(\d*)(?:&|#|$)|(?:(?![?&]foo=).)*$)
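
A table-driven sketch for checking this expression against a selection of the test cases listed above. fooOf, cases and failures are hypothetical names introduced for this illustration; an unmatched group is normalised to '' so the results line up with the "foo = ''" notation used in the lists.

```javascript
var candidate = /path(?=(?:\?|$|#))(?:\?(?:.+&)?foo=(\d*)(?:&|#|$)|(?:(?![?&]foo=).)*$)/;

// [input, expected foo]; null means "must not match".
var cases = [
    ['path?',               ''],
    ['path#foo',            ''],
    ['path?foo=',           ''],
    ['path?foo=&bar=1',     ''],
    ['path?foo=1#bar',      '1'],
    ['path?foo=1&foo=2',    '2'],
    ['path?foofoo=1',       ''],
    ['pathbar?foo=1',       null],
    ['path?foo=a&foofoo=1', null],
    ['path?foofoo=1&foo=a', null]
];

// Hypothetical helper: null for no match, otherwise group 1 with undefined mapped to ''.
function fooOf(regex, uri) {
    var m = regex.exec(uri);
    if (!m) return null;
    return m[1] === undefined ? '' : m[1];
}

var failures = cases.filter(function (c) {
    return fooOf(candidate, c[0]) !== c[1];
});
console.log(failures.length === 0 ? 'all cases pass' : failures);
```

The same cases array can be reused to compare any of the other expressions on this page by passing a different regex to fooOf.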


Comments

4
^path\b(?!.*[?&]foo=(?!\d+(?=&|#|$)))(?:.*[?&]foo=(\d+)(?=&|#|$))?

Basically I just broke it down into three parts

^path\b                         # starts with path
(?!.*[?&]foo=(?!\d+(?=&|#|$)))  # not followed by foo with an invalid value
(?:.*[?&]foo=(\d+)(?=&|#|$))?   # possibly followed by foo with a valid value

see validation here http://regexr.com/39i7g

Caveats:

will match path#bar=1&foo=27

will not match path?foo=

The OP didn't mention these requirements and since he wants a simple regex (oxymoron?) I did not attempt to solve them.
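
For anyone who wants to try it outside regexr, here is a small sketch that runs the expression against the question's examples and the two caveats. fooFrom is a hypothetical helper name; it returns null for no match and undefined when foo is absent.

```javascript
var threePart = /^path\b(?!.*[?&]foo=(?!\d+(?=&|#|$)))(?:.*[?&]foo=(\d+)(?=&|#|$))?/;

// Hypothetical helper: null when the string is rejected, otherwise capture group 1.
function fooFrom(uri) {
    var m = threePart.exec(uri);
    return m ? m[1] : null;
}

console.log(fooFrom('path?foo=67&bar=hello'));     // '67'
console.log(fooFrom('path?bar=bye&foo=1&baz=12')); // '1'
console.log(fooFrom('path?bar=123'));              // undefined (matched, no foo)
console.log(fooFrom('path?foo=37signals'));        // null
console.log(fooFrom('path?foo=-8'));               // null

// The two caveats mentioned above:
console.log(fooFrom('path#bar=1&foo=27'));         // '27'
console.log(fooFrom('path?foo='));                 // null
```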

1 Comment

+1 for keeping it simple and avoiding some unmentioned items.
2
path.+?(?:foo=(\d+))(?![a-zA-Z\d])|path((?!foo).)*$

You can try this. See demo.

http://regex101.com/r/jT3pG3/10

Comments

2

You can try the following regex:

path(?:.*?foo=(\d+)\b|()(?!.*foo))

regex101 demo

There are two possible matches after path:

.*?foo=(\d+)\b i.e. foo followed by digits.

OR

()(?!.*foo) an empty string if there is no foo ahead.

Add some word boundaries (\b) if you don't want the regex to interpret other words (e.g. another parameter named barfoobar) around the foos.

path(?:.*?\bfoo=(\d+)\b|()(?!.*\bfoo\b))

12 Comments

Note: In JS, there's no way to have the value of foo and the empty string (if matched) in the same group. I would have used path(?|.*?foo=(\d+)\b|()(?!.*foo)) otherwise from PCRE.
Or if path can be compromised as well... path\b(?|.*?\bfoo=(\d+)\b|()(?!.*\bfoo\b))
Since there is no branch reset, the second group doesn't really have to be there. It returns the null match for both group anyway: ["", ""]
@Unihedron Well, in JS, you get undefined if there is no match. Granted, the second capture group is not necessary if you use a simple if like this.
Uhh, the loop exits when there's no match and I'm not JS savvy enough to solve that. Oh well.
1

You can check for the existence of the 3rd matched group. If it is not there, the foo value would be null; otherwise, it is the group itself:

/^(path)(?:$|\?(?:(?=.*\b(foo=)(\d+)\b.*$)|(?!foo=).*?))/gm

An example on regex101: http://regex101.com/r/oP6lU7/1

4 Comments

+1 but there's no need to capture path and foo=
@M42 Ah, yes. I was first testing it in PHP with if-then-else cases.
Looks like it's working, but to be honest I expected to see a simpler regex :)
I found many situations where it didn't work, most notably when foo is not the first parameter and negative.
1

Working around the JavaScript regex engine, despite everything it lacks in comparison with PCRE, is somehow enjoyable!

I made this RegEx, simple and understandable:

^(?=path\?).*foo=(\d*)(?:&|$)|path$

Explanations

^(?=path\?)             # A positive lookahead to ensure we have "path?" at the very beginning
.*foo=(\d*)(?:&|$)      # Look for foo= followed by zero or more digits, then a "&" character or the end of the string
|                       # OR
path$                   # Just "path" itself

Runnable snippet:

var re = /^(?=path\?).*foo=(\d*)(?:&|$)|path$/gm; 
var str = 'path?foo=67\npath?foo=67&bar=hello\npath?bar=bye&foo=1&baz=12\npath\npathtest\npath?foo=37signals\npath?foo=-8\nsomething?foo=1';
var m, n = [];
 
while ((m = re.exec(str)) != null) {
    if (m.index === re.lastIndex) {
        re.lastIndex++;
    }
    n.push(m[0]);
}

alert( JSON.stringify(n) );

Or a Live demo for more details

Comments

1
path(?:\?(?:[^&]*&)*foo=([0-9]+)(?:[&#]|$))?

This is as short as most, and reads more straightforwardly, since things that appear once in the string appear once in the RE.

We match:

  1. the initial path
  2. a question mark, (or skip to end)
  3. some blocks terminated by ampersands
  4. our parameter assignment
  5. a closing confirmation, either starting the next syntactic element, or ending the line

Unfortunately it matches foo to None rather than '' when the foo parameter is omitted, but in Python (my language of choice) that is considered more appropriate. You could complain if you wanted, or just append or '' to the result afterwards.

Comments

0

Based on the OP's data here is my attempt pattern

^(path)\b(?:[^f]+|f(?!oo=))(?!\bfoo=(?!\d+\b))(?:\bfoo=(\d+)\b)?

if path is found: sub-pattern #1 will contain "path"
if foo is valid: sub-pattern #2 will contain "foo value if any"

Demo

  • ^(path)\b "path"
  • (?:[^f]+|f(?!oo=)) followed by anything but "foo="
  • (?!\bfoo=(?!\d+\b)) if "foo=" is found it must not see anything but \d+\b
  • (?:\bfoo=(\d+)\b)? if valid "foo=" is found, capture "foo" value

Comments

-1
t = 'path?foo=67&bar=hello';
console.log(t.match(/\b(foo|path)\=\d+\b/))

regex /\b(foo|path)\=\d+\b/

1 Comment

This doesn't check for path.
