I have a JavaScript regex that tokenizes a sentence into words:

/\\[^]|\.+|\w+|[^\w\s]/g

For example, given the sentence Hello World. the regex above produces these tokens:

Hello, World, .

I am trying to convert this regex to C#, but it fails to tokenize the input. I tried removing the leading / and the trailing /g to make the pattern compatible with the .NET regex engine, but it still does not work.

Below is the C# code I am trying:

public static void Main()
{
        string pattern = @"\\[^]|\.+|\w+|[^\w\s]";
        string input = @"hello world.";

        foreach (Match m in Regex.Matches(input, pattern, RegexOptions.ECMAScript))
        {
            Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
        }
}

Can anyone help me convert the above regex to C#?

1 Answer

Note that RegexOptions.ECMAScript mainly restricts the shorthand character classes (here, \w and \s) to ASCII letters, digits, underscore and whitespace. You can't expect this option to "convert" a whole JavaScript pattern for use with the .NET regex library.

In the JS regex, the [^] construct matches any character, including a newline. In .NET, you may use . with the RegexOptions.Singleline option instead (in which case you must remove RegexOptions.ECMAScript, since the two options cannot be combined), or just use [\s\S] to match any character:

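using System;
using System.Text.RegularExpressions;

public static class Program
{
    public static void Main()
    {
        // "." with RegexOptions.Singleline replaces the JS-only [^] construct
        string pattern = @"\\.|\.+|\w+|[^\w\s]";
        string input = @"hello world.";

        foreach (Match m in Regex.Matches(input, pattern, RegexOptions.Singleline))
        {
            Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
        }
    }
}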

Output of the C# demo:

'hello' found at index 0.
'world' found at index 6.
'.' found at index 11.
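
The other option mentioned above, [\s\S], can be sketched as follows. Because no Singleline option is needed, RegexOptions.ECMAScript can stay in place. This is only a minimal sketch: the TokenizerSketch class, the Tokenize helper and the sample input are hypothetical names chosen for illustration, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenizerSketch
{
    // [\s\S] stands in for the JS-only [^] ("any character, newlines included").
    private const string Pattern = @"\\[\s\S]|\.+|\w+|[^\w\s]";

    // Hypothetical helper: returns the tokens in order of appearance.
    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        // RegexOptions.ECMAScript keeps \w and \s ASCII-only in .NET.
        foreach (Match m in Regex.Matches(input, Pattern, RegexOptions.ECMAScript))
        {
            tokens.Add(m.Value);
        }
        return tokens;
    }

    public static void Main()
    {
        // The input contains a backslash escape and a run of dots.
        foreach (string token in Tokenize(@"a\.b ..."))
        {
            Console.WriteLine("'{0}'", token);
        }
    }
}
```

For the sample input, the tokens should be a, \., b and ... (the escaped dot stays attached to its backslash, and the run of dots is kept as one token).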

NOTE: \w and \s are Unicode-aware in .NET regex, so \w also matches all Unicode letters and some diacritics. If you only want to handle ASCII, use

string pattern = @"\\.|\.+|[A-Za-z0-9_]+|[^A-Za-z0-9_\f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]";
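
To see the difference the note describes, the following sketch compares \w+ with and without RegexOptions.ECMAScript on an input containing accented letters. The WordClassSketch class, its two helpers and the sample text are hypothetical and only serve as an illustration; note that the ECMAScript option is a different route to ASCII-only matching than the explicit character classes in the pattern above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordClassSketch
{
    // Collects all \w+ matches under the given options.
    private static List<string> Words(string input, RegexOptions options)
    {
        var words = new List<string>();
        foreach (Match m in Regex.Matches(input, @"\w+", options))
        {
            words.Add(m.Value);
        }
        return words;
    }

    // Default .NET behaviour: \w is Unicode-aware.
    public static List<string> UnicodeWords(string input)
    {
        return Words(input, RegexOptions.None);
    }

    // With RegexOptions.ECMAScript, \w behaves like [a-zA-Z_0-9].
    public static List<string> AsciiWords(string input)
    {
        return Words(input, RegexOptions.ECMAScript);
    }

    public static void Main()
    {
        // "naïve café", written with escapes so the source file encoding does not matter.
        string input = "na\u00EFve caf\u00E9";
        Console.WriteLine(UnicodeWords(input).Count); // 2 words: the accented letters are word chars
        Console.WriteLine(AsciiWords(input).Count);   // 3 fragments: "na", "ve", "caf"
    }
}
```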

10 Comments

It's not working; I am getting a System.ArgumentOutOfRangeException: Specified argument was out of the range of valid values.
I did not check the code, but the regex is OK. Let me add a C# demo.
Yeah, do I need to change my Regex to make it .NET compatible?
To tokenize a sentence like this, you may use a regex like the one you have. It will work a bit differently now, as \w and \s are Unicode aware in .NET regex library. If you only want to handle ASCII, use string pattern = @"\\.|\.+|[A-Za-z0-9_]+|[^A-Za-z0-9_\f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]";
@KunalMukherjee Try this C# demo solution. The @"[-+]?\d*\.?\d+(\d[-+]?\d+)?|\w+|[^\w\s]" pattern will tokenize into numbers, words and single punctuation/symbol chars.
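
One plausible cause of the System.ArgumentOutOfRangeException reported in the comments, assuming RegexOptions.ECMAScript was kept while RegexOptions.Singleline was added, is that .NET rejects that combination of options when the regex is constructed. The sketch below checks this; the OptionCheckSketch class and the IsRejected helper are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OptionCheckSketch
{
    // Returns true when constructing a Regex with these options throws
    // ArgumentOutOfRangeException, i.e. the option combination is invalid.
    public static bool IsRejected(RegexOptions options)
    {
        try
        {
            new Regex(@"\w+", options);
            return false;
        }
        catch (ArgumentOutOfRangeException)
        {
            return true;
        }
    }

    public static void Main()
    {
        // ECMAScript together with Singleline is not an allowed combination.
        Console.WriteLine(IsRejected(RegexOptions.ECMAScript | RegexOptions.Singleline));
        // Each option on its own is accepted.
        Console.WriteLine(IsRejected(RegexOptions.ECMAScript));
        Console.WriteLine(IsRejected(RegexOptions.Singleline));
    }
}
```

If the first check reports a rejection, dropping one of the two options (and using [\s\S] when ECMAScript is kept) avoids the exception.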