Converting Javascript RegEx to C# Regex

Question

I have a JavaScript regex that tokenizes a sentence into words:

/\\[^]|\.+|\w+|[^\w\s]/g

For example, if the sentence Hello World. is entered, the above regex will tokenize it into:

Hello, World, .
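For reference, here is a minimal JavaScript sketch of how the original regex behaves. The helper name tokenize and the extra sample strings are illustrative assumptions, not part of the question:

```javascript
// The tokenizer regex from the question, unchanged.
const tokenizer = /\\[^]|\.+|\w+|[^\w\s]/g;

// Hypothetical helper: returns every match as an array of strings.
function tokenize(sentence) {
  return sentence.match(tokenizer) || [];
}

console.log(tokenize("Hello World."));   // [ 'Hello', 'World', '.' ]
console.log(tokenize("Wait... what?"));  // [ 'Wait', '...', 'what', '?' ]
// A backslash plus the following char is kept together as one token.
console.log(tokenize("a\\tb"));          // [ 'a', '\\t', 'b' ]
```

The four alternatives are tried left to right: an escaped character, a run of dots, a run of word characters, and finally any single non-word, non-whitespace character. Whitespace matches none of them, so it is skipped.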

I am trying to convert the above regex to C#, but it does not tokenize the input the same way. I have tried removing the / from the beginning and the /g from the end to make it compatible with the .NET regex engine, but it's still not working.

Below is the C# code I am trying:

public static void Main()
{
        string pattern = @"\\[^]|\.+|\w+|[^\w\s]";
        string input = @"hello world.";

        foreach (Match m in Regex.Matches(input, pattern, RegexOptions.ECMAScript))
        {
            Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
        }
}

Can anyone help me convert the above regex to C#?

Wiktor Stribiżew · Accepted Answer · 2018-04-27 06:47:46Z


Note that RegexOptions.ECMAScript mainly makes sure shorthand character classes (here, \w and \s) only match ASCII characters: \w becomes [a-zA-Z_0-9] and \s becomes [ \f\n\r\t\v]. You can't expect this option to "convert" a whole JavaScript pattern for use in the .NET regex library.

Here, the [^] construct is used in the JS regex to match any char, including a newline. Instead of [^], you may use . with the RegexOptions.Singleline option, or just use [\s\S] to match any char. If you use Singleline, you will have to remove the RegexOptions.ECMAScript option: ECMAScript cannot be combined with Singleline, and passing both throws an ArgumentOutOfRangeException.

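using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        // \\. is a backslash followed by any char; Singleline lets . match a newline too.
        string pattern = @"\\.|\.+|\w+|[^\w\s]";
        string input = @"hello world.";

        foreach (Match m in Regex.Matches(input, pattern, RegexOptions.Singleline))
        {
            Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
        }
    }
}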

Output of the C# demo:

'hello' found at index 0.
'world' found at index 6.
'.' found at index 11.
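To see why . needs Singleline (or has to be swapped for [\s\S]) to stand in for [^], here is a small JavaScript sketch comparing the constructs on a backslash followed by a newline. The variable names are illustrative, and the s flag (ES2018) plays the role that RegexOptions.Singleline plays in .NET:

```javascript
// Sample input: a backslash immediately followed by a newline.
const sample = "\\\n";

const results = {
  emptyNegatedClass: /\\[^]/.test(sample),   // JS "any char" idiom
  whitespaceOrNot: /\\[\s\S]/.test(sample),  // portable "any char" class
  plainDot: /\\./.test(sample),              // . does not match a newline
  dotWithSFlag: /\\./s.test(sample),         // s flag lets . match a newline
};

console.log(results);
// { emptyNegatedClass: true, whitespaceOrNot: true, plainDot: false, dotWithSFlag: true }
```

Only the plain dot fails, which is the case the Singleline option (or the [\s\S] class) is there to cover.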

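NOTE: \w and \s are Unicode-aware in .NET regex: \w also matches all Unicode letters and digits, including some diacritics, and \s matches all Unicode whitespace. If you want the character classes to behave as they do in JavaScript (ASCII-only \w, and the JavaScript whitespace set for \s), use:

string pattern = @"\\.|\.+|[A-Za-z0-9_]+|[^A-Za-z0-9_\f\n\r\t\v\u0020\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]";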
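As a rough illustration of the JavaScript behavior that the explicit ASCII classes are meant to reproduce, the sketch below tokenizes a string containing non-ASCII letters. The variable names and the sample text are assumptions made for this example:

```javascript
// Original tokenizer, relying on JavaScript's ASCII-only \w.
const jsShorthand = /\\[^]|\.+|\w+|[^\w\s]/g;
// Same tokenizer with \w spelled out as an explicit ASCII class.
const explicitAscii = /\\[^]|\.+|[A-Za-z0-9_]+|[^A-Za-z0-9_\s]/g;

// "héllo wörld." written with escapes so the code points are unambiguous.
const text = "h\u00e9llo w\u00f6rld.";

const shorthandTokens = text.match(jsShorthand);
const explicitTokens = text.match(explicitAscii);

console.log(shorthandTokens); // [ 'h', 'é', 'llo', 'w', 'ö', 'rld', '.' ]
console.log(explicitTokens);  // [ 'h', 'é', 'llo', 'w', 'ö', 'rld', '.' ]
```

In JavaScript the accented letters are not word characters, so each one splits its word and becomes a token of its own. With the default Unicode-aware \w in .NET, the same letters count as word characters and each word stays whole, which is the difference the ASCII-only pattern removes.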

More details

Word Character: \w in .NET regex
White-Space Character: \s in .NET regex

edited Apr 27, 2018 at 6:47

answered Apr 27, 2018 at 6:26


10 Comments

Kunal Mukherjee Over a year ago

It's not working; I am getting a System.ArgumentOutOfRangeException: Specified argument was out of the range of valid values.

Wiktor Stribiżew Over a year ago

I did not check the code, but the regex is OK. Let me add a demo. Here is a C# demo

Kunal Mukherjee Over a year ago

Yeah, do I need to change my Regex to make it .NET compatible?

Wiktor Stribiżew Over a year ago

To tokenize a sentence like this, you may use a regex like the one you have. It will work a bit differently now, as \w and \s are Unicode-aware in the .NET regex library. If you only want to handle ASCII, use:

string pattern = @"\\.|\.+|[A-Za-z0-9_]+|[^A-Za-z0-9_\f\n\r\t\v\u0020\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]";

Wiktor Stribiżew Over a year ago

@KunalMukherjee Try this C# demo solution. The @"[-+]?\d*\.?\d+(\d[-+]?\d+)?|\w+|[^\w\s]" pattern will tokenize into numbers, words and single punctuation/symbol chars.
