2

I'm writing C# code to parse JavaScript into tokens, and my knowledge of JavaScript is not 100%.

One thing that threw me is that JavaScript regular expressions are not enclosed in quotes. So how would a parser detect when they start and end? It looks like they start with / but then can contain nearly any character after that.

Note that I am not asking about syntax needed to match certain characters, which is what all the results from my Google searches were about. I just want to know the rules for determining how I know where a regular expression starts and where it ends.

10
  • I know it's fun to write parsers, but depending on your requirements, you should be aware that there are ECMAScript parsers out there. If you want to count JScript, there's even a native one built into .NET. (I know, I know, but I've used it to build and run sizzle.js, so I think it's pretty compliant). Take a look at the Coco/R compiler generator, which has a C# implementation and can build parsers from BNF-style grammars. Commented May 11, 2011 at 17:43
  • Oh boy this is a tough one. I spent almost a whole year learning to parse JavaScript. Commented May 11, 2011 at 17:44
  • @harpo: Thanks, but what's the fun in using someone else's code? :-) Commented May 11, 2011 at 17:46
  • @ChaoesPandion: Why do you say that? Writing my tokenizer seemed very straight forward to me. I'm just not sure about regular expressions because I don't really understand the syntax. Commented May 11, 2011 at 17:47
  • @Jonathan - Well that year also included writing a full runtime but I also might simply be a dummy. It was a lot of fun though. Commented May 11, 2011 at 17:48

5 Answers 5

2

I would consider the following RegExp a reasonable approximation.

/(\\/|[^/])+/([a-zA-Z])*

The rules are formally defined:

RegularExpressionLiteral ::  See 7.8.5 
    / RegularExpressionBody / RegularExpressionFlags 

RegularExpressionBody ::  See 7.8.5 
    RegularExpressionFirstChar RegularExpressionChars 

RegularExpressionChars ::  See 7.8.5 
    [empty] 
    RegularExpressionChars RegularExpressionChar 

RegularExpressionFirstChar ::  See 7.8.5 
    RegularExpressionNonTerminator but not one of * or \ or / or [ 
    RegularExpressionBackslashSequence 
    RegularExpressionClass 

RegularExpressionChar ::  See 7.8.5 
    RegularExpressionNonTerminator but not \ or / or [ 
    RegularExpressionBackslashSequence 
    RegularExpressionClass 

RegularExpressionBackslashSequence ::  See 7.8.5 
    \ RegularExpressionNonTerminator 

RegularExpressionNonTerminator ::  See 7.8.5 
    SourceCharacter but not LineTerminator 

RegularExpressionClass ::  See 7.8.5 
    [ RegularExpressionClassChars ] 

RegularExpressionClassChars ::   See 7.8.5 
    [empty] 
    RegularExpressionClassChars RegularExpressionClassChar 

RegularExpressionClassChar ::   See 7.8.5 
    RegularExpressionNonTerminator but not ] or \ 
    RegularExpressionBackslashSequence

RegularExpressionFlags ::  See 7.8.5 
    [empty] 
    RegularExpressionFlags IdentifierPart

Full Specification

Here is some quick and dirty code that might get you started.

class CharStream
{
    private readonly Stack<int> _states;
    private readonly string _input;
    private readonly int _length;
    private int _index;

    public char Current
    {
        get { return _input[_index]; }
    }

    public CharStream(string input)
    {
        _states = new Stack<int>();
        _input = input;
        _length = input.Length;
        _index = -1;
    }

    public bool Next()
    {
        if (_index < 0)
            _index++;
        if (_index == _length)
            return false;
        _index++;
        return true;
    }

    public bool ExpectNext(char c)
    {
        if (_index == _length)
            return false;
        if (_input[_index + 1] != c)
            return false;
        _index++;
        return true;
    }

    public bool Back()
    {
        if (_index == 0)
            return false;
        _index--;
        return true;
    }

    public void PushState()
    {
        _states.Push(_index);
    }

    public T PopState<T>()
    {
        _index = _states.Pop();
        return default(T);
    }
}

static string ParseRegularExpressionLiteral(CharStream cs)
{
    string body, flags;
    cs.PushState();
    if (!cs.ExpectNext('/'))
        return cs.PopState<string>();
    if ((body = ParseRegularExpressionBody(cs)) == null)
        return cs.PopState<string>();
    if (!cs.ExpectNext('/'))
        return cs.PopState<string>();
    if ((flags = ParseRegularExpressionFlags(cs)) == null)
        return cs.PopState<string>();
    return "/" + body + "/" + flags;
}

static string ParseRegularExpressionBody(CharStream cs)
{
    string firstChar, chars;
    cs.PushState();
    if ((firstChar = ParseRegularExpressionFirstChar(cs)) == null)
        return cs.PopState<string>();
    if ((chars = ParseRegularExpressionChars(cs)) == null)
        return cs.PopState<string>();
    return firstChar + chars;
}

static string ParseRegularExpressionChars(CharStream cs)
{
    var sb = new StringBuilder();
    string @char;
    while ((@char = ParseRegularExpressionChar(cs)) != null)
        sb.Append(@char);
    return sb.ToString();
}

static string ParseRegularExpressionFirstChar(CharStream cs)
{
    return null;
}

static string ParseRegularExpressionChar(CharStream cs)
{
    return null;
}

static string ParseRegularExpressionBackslashSequence(CharStream cs)
{
    return null;
}

static string ParseRegularExpressionNonTerminator(CharStream cs)
{
    return null;
}

static string ParseRegularExpressionClass(CharStream cs)
{
    return null;
}

static string ParseRegularExpressionClassChars(CharStream cs)
{
    return null;
}

static string ParseRegularExpressionClassChar(CharStream cs)
{
    return null;
}

static string ParseRegularExpressionFlags(CharStream cs)
{
    return null;
}

As to how you find the end of the literal? Well the trick is to recursively follow the productions I have listed. Consider the production RegularExpressionBody. Simply reading the production tells me that it requires RegularExpressionFirstChar followed by RegularExpressionChars. Notice how RegularExpressionChars has either [empty] or RegularExpressionChars RegularExpressionChar. Essentially it is defined by itself. Once that production terminates with [empty] you know that the only valid character should be the closing /. If that is not found this is not a valid literal.

Sign up to request clarification or add additional context in comments.

14 Comments

@Kobi - Nothing all that relevant unless you are building a full implementation of ECMAScript. What I posted here is a portion of the lexical grammar.
@ChaosPandion: I appreciate the detailed information but must admit I'm having trouble making heads or tails of it. If you asked me, for example, which characters can be included in an integer, I might say something like "0123456789". Nice and simple. Where in either the formal rules above or your code example does it tell me which characters are legal in the regular expression and which are legal in the flags portion? I must be missing something here.
@Jonathan - You would be wrong. JavaScript also supports negative numbers (-12), octal literals (010 = 8), hexadecimal literals (0xa = 10), and floats (2e3 = 2000). It's really fun. Looking at the formal grammar of the language is a very good idea.
@Jonathan - There are a couple productions missing which I'll add for you.
@Jonathan - I decided to add a full link to the specification so I don't need to manually format everything. You most likely want to look at the 7.6 Identifier Names and Identifiers as this portion of the grammar is used for RegExp flags.
|
2
  1. var test = new RegExp("\\d/\\d/\\d", "g"); (flags are a second param)
  2. test = /\d\/\d\/\d/;
  3. test = /\d\/\d\/\d/g (flags are after the last /)

Use a / to escape characters in the second one. Explained:
/      - Start of RegExp
\d    - Digit character
\/    - Escaped / (which matches the actual / character)
\d    - Digit character
\/    - Escaped / (which matches the actual / character)
\d    - Digit character
/      - End of RegExp

This would match 1/2/3.

5 Comments

Backslashes have to be doubled for new RegExp("\\d/\\d/\\d")
@mVChr: Whoops, forgot to add that in. Thanks! :)
dont forget that javascript isn't that strict of a language. If you are parsing javascript that is not coded in a consistent manner, there would be more test cases than just 1/2/3. Folks could stick in quotes around their regexs or forget to put in the semicolon, etc.
@BestPractices: Like what I did in #3 :(
Thanks, but I'm not entirely clear how to tell if I've found the end of the expression or if the / simply signifies the start of the flags.
1

Literal Javascript regular expressions can look like this:

/myregularexpressionliteral/

or

/yregularexpressionlitera/myregex flags

See more here: http://lawrence.ecorp.net/inet/samples/regexp-intro.php

3 Comments

So, when I encounter the second /, how do I know if I've found the end of the expression or the start of the flags?
You would likely find a ";" at the end of the line. It would depend how the regexs are coded though since someone could plce them into javascript in any manner of ways. For example: myregex = /test/; or myregex = "/test/";
Couldn't they also be passed as arguments? If so, they would not be immediately followed by ;. I guess I'm looking for something a little more definitive than "likely".
1

a literal / is escaped like \/ for the match characters so you shouldn't have any difficulties finding the end /. After that comes the flags and there are finite number of them.

1 Comment

Other answers seem to suggest the flags are optional. If so, how do I know if the second / is the end or the start of the flags?
1

The way I ended up resolving this was to rely on the previous token.

The main problem was distinguishing between a regular expression literal and the divide symbol. The only way for me to do this is by looking at the context in which it appears.

So, for example, if the previous token was a number, a forward slash could only be a division operator.

There are still some cases where this is not 100% reliable. But for my purposes, it seems to be the best solution.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.