JavaScript RegEx Syntax

Question

I'm writing C# code to parse JavaScript into tokens, and my knowledge of JavaScript is not 100%.

One thing that threw me is that JavaScript regular expressions are not enclosed in quotes. So how would a parser detect when they start and end? It looks like they start with / but then can contain nearly any character after that.

Note that I am not asking about syntax needed to match certain characters, which is what all the results from my Google searches were about. I just want to know the rules for determining how I know where a regular expression starts and where it ends.

I know it's fun to write parsers, but depending on your requirements, you should be aware that there are ECMAScript parsers out there. If you want to count JScript, there's even a native one built into .NET. (I know, I know, but I've used it to build and run sizzle.js, so I think it's pretty compliant). Take a look at the Coco/R compiler generator, which has a C# implementation and can build parsers from BNF-style grammars. — harpo
– harpo, Commented May 11, 2011 at 17:43
Oh boy this is a tough one. I spent almost a whole year learning to parse JavaScript. — ChaosPandion
– ChaosPandion, Commented May 11, 2011 at 17:44
@harpo: Thanks, but what's the fun in using someone else's code? :-) — Jonathan Wood
– Jonathan Wood, Commented May 11, 2011 at 17:46
@ChaoesPandion: Why do you say that? Writing my tokenizer seemed very straight forward to me. I'm just not sure about regular expressions because I don't really understand the syntax. — Jonathan Wood
– Jonathan Wood, Commented May 11, 2011 at 17:47
@Jonathan - Well that year also included writing a full runtime but I also might simply be a dummy. It was a lot of fun though. — ChaosPandion
– ChaosPandion, Commented May 11, 2011 at 17:48

ChaosPandion · Accepted Answer · 2011-05-13 00:16:05Z

2

I would consider the following RegExp a reasonable approximation.

/(\\/|[^/])+/([a-zA-Z])*

The rules are formally defined:

RegularExpressionLiteral ::  See 7.8.5 
    / RegularExpressionBody / RegularExpressionFlags 

RegularExpressionBody ::  See 7.8.5 
    RegularExpressionFirstChar RegularExpressionChars 

RegularExpressionChars ::  See 7.8.5 
    [empty] 
    RegularExpressionChars RegularExpressionChar 

RegularExpressionFirstChar ::  See 7.8.5 
    RegularExpressionNonTerminator but not one of * or \ or / or [ 
    RegularExpressionBackslashSequence 
    RegularExpressionClass 

RegularExpressionChar ::  See 7.8.5 
    RegularExpressionNonTerminator but not \ or / or [ 
    RegularExpressionBackslashSequence 
    RegularExpressionClass 

RegularExpressionBackslashSequence ::  See 7.8.5 
    \ RegularExpressionNonTerminator 

RegularExpressionNonTerminator ::  See 7.8.5 
    SourceCharacter but not LineTerminator 

RegularExpressionClass ::  See 7.8.5 
    [ RegularExpressionClassChars ] 

RegularExpressionClassChars ::   See 7.8.5 
    [empty] 
    RegularExpressionClassChars RegularExpressionClassChar 

RegularExpressionClassChar ::   See 7.8.5 
    RegularExpressionNonTerminator but not ] or \ 
    RegularExpressionBackslashSequence

RegularExpressionFlags ::  See 7.8.5 
    [empty] 
    RegularExpressionFlags IdentifierPart

Full Specification

Here is some quick and dirty code that might get you started.

class CharStream
{
    private readonly Stack<int> _states;
    private readonly string _input;
    private readonly int _length;
    private int _index;

    public char Current
    {
        get { return _input[_index]; }
    }

    public CharStream(string input)
    {
        _states = new Stack<int>();
        _input = input;
        _length = input.Length;
        _index = -1;
    }

    public bool Next()
    {
        if (_index < 0)
            _index++;
        if (_index == _length)
            return false;
        _index++;
        return true;
    }

    public bool ExpectNext(char c)
    {
        if (_index == _length)
            return false;
        if (_input[_index + 1] != c)
            return false;
        _index++;
        return true;
    }

    public bool Back()
    {
        if (_index == 0)
            return false;
        _index--;
        return true;
    }

    public void PushState()
    {
        _states.Push(_index);
    }

    public T PopState<T>()
    {
        _index = _states.Pop();
        return default(T);
    }
}

static string ParseRegularExpressionLiteral(CharStream cs)
{
    string body, flags;
    cs.PushState();
    if (!cs.ExpectNext('/'))
        return cs.PopState<string>();
    if ((body = ParseRegularExpressionBody(cs)) == null)
        return cs.PopState<string>();
    if (!cs.ExpectNext('/'))
        return cs.PopState<string>();
    if ((flags = ParseRegularExpressionFlags(cs)) == null)
        return cs.PopState<string>();
    return "/" + body + "/" + flags;
}

static string ParseRegularExpressionBody(CharStream cs)
{
    string firstChar, chars;
    cs.PushState();
    if ((firstChar = ParseRegularExpressionFirstChar(cs)) == null)
        return cs.PopState<string>();
    if ((chars = ParseRegularExpressionChars(cs)) == null)
        return cs.PopState<string>();
    return firstChar + chars;
}

static string ParseRegularExpressionChars(CharStream cs)
{
    var sb = new StringBuilder();
    string @char;
    while ((@char = ParseRegularExpressionChar(cs)) != null)
        sb.Append(@char);
    return sb.ToString();
}

static string ParseRegularExpressionFirstChar(CharStream cs)
{
    return null;
}

static string ParseRegularExpressionChar(CharStream cs)
{
    return null;
}

static string ParseRegularExpressionBackslashSequence(CharStream cs)
{
    return null;
}

static string ParseRegularExpressionNonTerminator(CharStream cs)
{
    return null;
}

static string ParseRegularExpressionClass(CharStream cs)
{
    return null;
}

static string ParseRegularExpressionClassChars(CharStream cs)
{
    return null;
}

static string ParseRegularExpressionClassChar(CharStream cs)
{
    return null;
}

static string ParseRegularExpressionFlags(CharStream cs)
{
    return null;
}

As to how you find the end of the literal? Well the trick is to recursively follow the productions I have listed. Consider the production RegularExpressionBody. Simply reading the production tells me that it requires RegularExpressionFirstChar followed by RegularExpressionChars. Notice how RegularExpressionChars has either [empty] or RegularExpressionChars RegularExpressionChar. Essentially it is defined by itself. Once that production terminates with [empty] you know that the only valid character should be the closing /. If that is not found this is not a valid literal.

edited May 13, 2011 at 0:16

answered May 11, 2011 at 17:46

ChaosPandion

78.4k18 gold badges122 silver badges159 bronze badges

Sign up to request clarification or add additional context in comments.

14 Comments

ChaosPandion Over a year ago

@Kobi - Nothing all that relevant unless you are building a full implementation of ECMAScript. What I posted here is a portion of the lexical grammar.

Jonathan Wood Over a year ago

@ChaosPandion: I appreciate the detailed information but must admit I'm having trouble making heads or tails of it. If you asked me, for example, which characters can be included in an integer, I might say something like "0123456789". Nice and simple. Where in either the formal rules above or your code example does it tell me which characters are legal in the regular expression and which are legal in the flags portion? I must be missing something here.

Kobi Over a year ago

@Jonathan - You would be wrong. JavaScript also supports negative numbers (-12), octal literals (010 = 8), hexadecimal literals (0xa = 10), and floats (2e3 = 2000). It's really fun. Looking at the formal grammar of the language is a very good idea.

ChaosPandion Over a year ago

@Jonathan - There are a couple productions missing which I'll add for you.

ChaosPandion Over a year ago

@Jonathan - I decided to add a full link to the specification so I don't need to manually format everything. You most likely want to look at the 7.6 Identifier Names and Identifiers as this portion of the grammar is used for RegExp flags.

|

Shaz · Accepted Answer · 2011-05-11 18:05:12Z

2

var test = new RegExp("\\d/\\d/\\d", "g"); (flags are a second param)
test = /\d\/\d\/\d/;
test = /\d\/\d\/\d/g (flags are after the last /)

Use a / to escape characters in the second one. Explained:
/      - Start of RegExp
\d    - Digit character
\/    - Escaped / (which matches the actual / character)
\d    - Digit character
\/    - Escaped / (which matches the actual / character)
\d    - Digit character
/      - End of RegExp

This would match 1/2/3.

edited May 11, 2011 at 18:05

answered May 11, 2011 at 17:47

Shaz

15.9k4 gold badges44 silver badges60 bronze badges

5 Comments

mVChr Over a year ago

Backslashes have to be doubled for new RegExp("\\d/\\d/\\d")

Shaz Over a year ago

@mVChr: Whoops, forgot to add that in. Thanks! :)

BestPractices Over a year ago

dont forget that javascript isn't that strict of a language. If you are parsing javascript that is not coded in a consistent manner, there would be more test cases than just 1/2/3. Folks could stick in quotes around their regexs or forget to put in the semicolon, etc.

Shaz Over a year ago

@BestPractices: Like what I did in #3 :(

Jonathan Wood Over a year ago

Thanks, but I'm not entirely clear how to tell if I've found the end of the expression or if the / simply signifies the start of the flags.

BestPractices · Accepted Answer · 2011-05-11 17:44:57Z

1

Literal Javascript regular expressions can look like this:

/myregularexpressionliteral/

or

/yregularexpressionlitera/myregex flags

See more here: http://lawrence.ecorp.net/inet/samples/regexp-intro.php

answered May 11, 2011 at 17:44

BestPractices

12.9k31 gold badges102 silver badges143 bronze badges

3 Comments

Jonathan Wood Over a year ago

So, when I encounter the second /, how do I know if I've found the end of the expression or the start of the flags?

BestPractices Over a year ago

You would likely find a ";" at the end of the line. It would depend how the regexs are coded though since someone could plce them into javascript in any manner of ways. For example: myregex = /test/; or myregex = "/test/";

Jonathan Wood Over a year ago

Couldn't they also be passed as arguments? If so, they would not be immediately followed by ;. I guess I'm looking for something a little more definitive than "likely".

artificialidiot · Accepted Answer · 2011-05-11 17:45:57Z

1

a literal / is escaped like \/ for the match characters so you shouldn't have any difficulties finding the end /. After that comes the flags and there are finite number of them.

answered May 11, 2011 at 17:45

artificialidiot

5,37933 silver badges27 bronze badges

1 Comment

Jonathan Wood Over a year ago

Other answers seem to suggest the flags are optional. If so, how do I know if the second / is the end or the start of the flags?

Jonathan Wood · Accepted Answer · 2011-05-22 17:34:07Z

1

The way I ended up resolving this was to rely on the previous token.

The main problem was distinguishing between a regular expression literal and the divide symbol. The only way for me to do this is by looking at the context in which it appears.

So, for example, if the previous token was a number, a forward slash could only be a division operator.

There are still some cases where this is not 100% reliable. But for my purposes, it seems to be the best solution.

answered May 22, 2011 at 17:34

Jonathan Wood

68.1k86 gold badges308 silver badges542 bronze badges

Collectives™ on Stack Overflow

JavaScript RegEx Syntax

5 Answers 5

14 Comments

5 Comments

3 Comments

1 Comment

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

14 Comments

5 Comments

3 Comments

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Related