Unicode REGEX in Sql Server CLR function

Question

I have a REGEX SQL CLR function:

var rule1 = new Regex("شماره\\s?\\d{1,10}");

Calling it on SQL Server 2016, however, returns this error:

System.ArgumentException: parsing "?????\s?\d{1,10}" - Quantifier {x,y} following nothing.
at System.Text.RegularExpressions.Regex..ctor(String pattern)

It seems that my unicode characters are changed to question marks, which makes the whole Regex wrong.

Seems like the parameter is either not an nvarchar, or the return type isn't; at a guess. No SQL or code to debug here, so impossible to suggest more.
– Thom A ♦, Commented Jan 4, 2020 at 13:10
Can you share some more code of the actual CLR method, as well as how you call the method?
– Niels Berglund, Commented Jan 4, 2020 at 15:03
Seems like a RightToLeft issue: {1,10} would precede \d in this case.
– AlwaysLearning, Commented Jan 4, 2020 at 23:55
@AlwaysLearning the problem is that several question marks are not a valid regex, but why are they here in the first place?
– mohas, Commented Jan 5, 2020 at 13:06

Solomon Rutzky · Accepted Answer · 2020-01-05 20:01:11Z

This issue has nothing to do with datatypes, whether for input parameters or return values, as the code provided, while sparse on detail, does show enough to see that:

there is no input parameter being used (the string is hard-coded).
the error is being thrown by System.Text.RegularExpressions.Regex, so has nothing to do with T-SQL or return values / types.

Also, while the error message does mention "Quantifier {x,y}", and there is indeed a {1,10} quantifier being used in the Regular Expression, it is a false correlation (albeit a rather understandable one) that the error message is referring to that specific quantifier. If you shorten the Regular Expression down to just "شماره", you will get the same error, except it will report the Regular Expression as being just "?????". Hence, "Quantifier {x,y}" actually refers to the first "?" in the expression shown in the error message (you will get the same error even if the Regular Expression is nothing more than "ش"). I figure that "Quantifier {x,y}" is the generalized way of looking at the ?, +, and * quantifiers as they can also be expressed as {0,1}, {1,}, and {0,}, respectively (or at least they should be).
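
The point above can be checked outside SQL Server. The following is a minimal sketch, not the O.P.'s code: the class and method names (QuantifierDemo, IsRejected) are made up for illustration, and the exact exception message is deliberately not asserted because its wording differs between .NET versions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Hypothetical helper: returns true when the Regex constructor rejects the pattern.
    public static bool IsRejected(string pattern)
    {
        try
        {
            new Regex(pattern);
            return false;
        }
        catch (ArgumentException)
        {
            // Pattern parse errors surface as ArgumentException (or a subclass of it).
            return true;
        }
    }

    public static void Main()
    {
        // A leading '?' has nothing to quantify, so the pattern is invalid.
        Console.WriteLine(IsRejected("?????"));
        Console.WriteLine(IsRejected("?"));

        // The quantifiers from the original expression are valid on their own.
        Console.WriteLine(IsRejected("\\s?\\d{1,10}"));
    }
}
```

The first two patterns are rejected and the third is accepted, which is consistent with the error coming from the leading question marks rather than from {1,10}.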

This issue has nothing to do with SQL Server, or even Regular Expressions. This is an encoding issue, and RegEx is reporting the problem because it is being given ????? instead of شماره.

TL;DR: Check your source code file's encoding. You might need to go to "Save As...", click on the down-arrow to the right of the word "Save" on the "Save" button, select "Save with Encoding...", and then select "Unicode (UTF-8 with signature) - Codepage 65001".

There is a problem with the project configuration and/or the compiler. I placed the following string in both a Console Application and a Database Project:

"-😈-ŏ-א---\U0001F608-\u014F-\u05D0-"

(The second half of that test string, after the ---, is merely the escape sequences for the same three characters as appear in the first half, and in the same order.)

I compiled both and inspected the compiled output (meaning: it hasn't been deployed to SQL Server yet). That string appears in the EXE file (Console App) as:

2D003DD808DE2D004F012D00D0052D002D002D003DD808DE2D004F012D00D0052D00
which is the UTF-16 LE encoding for: -😈-ŏ-א---😈-ŏ-א-

Yet, it appears in the DLL file (SQLCLR Assembly) as:

2D003F003F002D003F002D003F002D002D002D003DD808DE2D004F012D00D0052D00
which is the UTF-16 LE encoding for: -??-?-?---😈-ŏ-א-

I even changed the output type of the Console App project to be "Class Library" and the string still got embedded correctly in that DLL file. So, for some reason the literal characters are being turned into literal question marks when compiled into a SQLCLR Assembly. I haven't yet figured out what is causing this as a quick look at the config settings and command-line flags for csc.exe seems to show them being effectively the same.
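
The hex strings shown above can be reproduced for any string with a few lines of C#. This is only a sketch: the helper name ToUtf16LeHex is an assumption, and the test characters are written as escape sequences so that the example does not depend on how its own source file is saved.

```csharp
using System;
using System.Text;

public static class Utf16HexDemo
{
    // Hypothetical helper: hex dump of a string's UTF-16 LE bytes.
    public static string ToUtf16LeHex(string s)
    {
        byte[] bytes = Encoding.Unicode.GetBytes(s); // Encoding.Unicode is UTF-16 LE
        return BitConverter.ToString(bytes).Replace("-", "");
    }

    public static void Main()
    {
        // First half of the test string, as intended.
        Console.WriteLine(ToUtf16LeHex("-\U0001F608-\u014F-\u05D0-"));
        // 2D003DD808DE2D004F012D00D0052D00

        // The same span after the characters were replaced with question marks.
        Console.WriteLine(ToUtf16LeHex("-??-?-?-"));
        // 2D003F003F002D003F002D003F002D00
    }
}
```

Searching the compiled assembly for either byte sequence shows which version of the literal the compiler actually received.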

In either case, it should be clear that specifying the Arabic characters via escape sequences, while cumbersome, will at least work, hence providing a (hopefully short-term) work-around so that you can move forward on this. I will continue looking to see what could be causing this difference in behavior.
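
As a sketch of that work-around, the O.P.'s expression could be written with \u escape sequences for the five Arabic letters (U+0634, U+0645, U+0627, U+0631, U+0647). The class name and the sample input are assumptions made for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapedPatternDemo
{
    // The original pattern with its Arabic letters spelled as escape sequences,
    // so the literal survives regardless of the source file's encoding.
    public const string Pattern = "\u0634\u0645\u0627\u0631\u0647\\s?\\d{1,10}";

    public static void Main()
    {
        var rule1 = new Regex(Pattern);

        // Sample input (an assumption): the same word followed by a space and digits.
        string input = "\u0634\u0645\u0627\u0631\u0647 12345";
        Console.WriteLine(rule1.IsMatch(input)); // True
    }
}
```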

UPDATE

In order to determine if the string was being converted to an 8-bit encoding or something else, I added two characters to the test string (one in both Windows-1252 and ISO-8859-1, and one only in Windows-1252):

§ = 0xA7 in CP-1252, 0xA7 in ISO-8859-1, and 0x00A7 in UTF-16
œ = 0x9C in CP-1252, not in ISO-8859-1, and 0x0153 in UTF-16

The new test string is:

"-😈-ŏ-א-§-œ---\U0001F608-\u014F-\u05D0-\x00A7-\x0153-"

That string appears in the EXE file (Console App) as:

2D003DD808DE2D004F012D00D0052D00A7002D0053012D002D002D003DD808DE2D004F012D00D0052D00A7002D0053012D00
which is the UTF-16 LE encoding for: -😈-ŏ-א-§-œ---😈-ŏ-א-§-œ-

Yet, it appears in the DLL file (SQLCLR Assembly) as:

2D003F003F002D003F002D003F002D00A7002D0053012D002D002D003DD808DE2D004F012D00D0052D00A7002D0053012D00
which is the UTF-16 LE encoding for: -??-?-?-§-œ---😈-ŏ-א-§-œ-

So, because both § and œ came through correctly in the SQLCLR Assembly, the encoding is clearly not ISO-8859-1. It is either Code Page Windows-1252 or some other code page that supports both of those characters (CP-1252 being the most likely given that my system is using it).
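
The same probe can be approximated in code by round-tripping a string through each candidate code page and seeing which characters survive. This is a sketch under stated assumptions: the helper name RoundTrip is made up, an explicit "?" replacement fallback is used (a real editor's conversion may behave differently for some characters), and the emoji is left out of the probe string. The RegisterProvider call is needed on .NET Core / .NET 5+; on .NET Framework, code page 1252 is available without it.

```csharp
using System;
using System.Text;

public static class CodePageProbeDemo
{
    // Hypothetical helper: encodes text to the given code page and decodes it back,
    // turning every character the code page cannot represent into '?'.
    public static string RoundTrip(string text, int codePage)
    {
        Encoding enc = Encoding.GetEncoding(
            codePage,
            new EncoderReplacementFallback("?"),
            new DecoderReplacementFallback("?"));
        return enc.GetString(enc.GetBytes(text));
    }

    public static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, where Windows code pages
        // must be registered first. On .NET Framework this line can be dropped.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // ŏ (U+014F), א (U+05D0), § (U+00A7), œ (U+0153), written as escapes.
        string probe = "-\u014F-\u05D0-\u00A7-\u0153-";

        // Windows-1252 keeps § and œ but not the other two.
        Console.WriteLine(RoundTrip(probe, 1252) == "-?-?-\u00A7-\u0153-");  // True

        // ISO-8859-1 (code page 28591) keeps § but loses œ as well.
        Console.WriteLine(RoundTrip(probe, 28591) == "-?-?-\u00A7-?-");      // True
    }
}
```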

Still investigating the root cause...

UPDATE 2

Ok, I feel kinda silly. Sometimes it helps to close a file (or the entire solution sometimes) and reopen it. Doing so I noticed that my test string now appeared as:

"-??-?-?-?-?---\U0001F608-\u014F-\u05D0-\x00A7-\x0153-"

Funny, I don't remember pasting that in ;-). So, I checked the file encoding that Visual Studio was saving it as and sure enough it was "Western European (Windows) - Codepage 1252". And just to be extra special certain, I checked the file for the Console App and it was correctly set to "Unicode (UTF-8 with signature) - Codepage 65001". D'oh! Changing the file encoding under "Save As..." to "Unicode (UTF-8 with signature) - Codepage 65001", I then replaced both the test string and the O.P.'s Regular Expression. Both came through perfectly, no errors or question marks.
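
One way to check a source file without opening Visual Studio is to look for the UTF-8 signature (the bytes EF BB BF) at the start of the file. The sketch below is an illustration only: HasUtf8Signature is a hypothetical helper, and the temporary file stands in for a real .cs file. Note that a missing signature does not by itself prove the file is in code page 1252; it only means the "UTF-8 with signature" option was not used.

```csharp
using System;
using System.IO;
using System.Text;

public static class SourceEncodingDemo
{
    // Hypothetical helper: true when the file begins with the UTF-8 signature EF BB BF.
    public static bool HasUtf8Signature(string path)
    {
        byte[] head = new byte[3];
        using (FileStream fs = File.OpenRead(path))
        {
            int read = fs.Read(head, 0, head.Length);
            return read == 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF;
        }
    }

    public static void Main()
    {
        // A temporary file stands in for a real source file.
        string path = Path.GetTempFileName();
        string source = "var s = \"\u0634\u0645\u0627\u0631\u0647\";";

        // Saved as "UTF-8 with signature".
        File.WriteAllText(path, source, new UTF8Encoding(encoderShouldEmitUTF8Identifier: true));
        Console.WriteLine(HasUtf8Signature(path)); // True

        // Saved as UTF-8 without the signature.
        File.WriteAllText(path, source, new UTF8Encoding(encoderShouldEmitUTF8Identifier: false));
        Console.WriteLine(HasUtf8Signature(path)); // False

        File.Delete(path);
    }
}
```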

edited Jan 5, 2020 at 20:01

answered Jan 5, 2020 at 8:36

5 Comments

mohas Over a year ago

Thank you, I hate people demanding more source code, etc. while failing to understand the nature of the problem

Thom A Over a year ago

You mean you hate people asking you to provide an MRE @mohas ..?

Solomon Rutzky Over a year ago

@mohas While some frustration is understandable, please try to keep in mind that people here are also just trying to help and are unpaid volunteers. So perhaps "hate" is a bit too strong of a word to use in this context, especially when it's unfortunately not uncommon for people to ask for help while not providing enough info to go on. Such was not the case here, but it happens often enough that sometimes it even helps to add a reason or two to the question as to why you are not providing more info (i.e. "This is clearly a .NET error due to...").

Solomon Rutzky Over a year ago

@mohas Also, I believe I have figured it out. Please review the updates I have made. Hopefully the cause of this for you is as simple as it was for me :-).

Solomon Rutzky Over a year ago

@Larnu To be fair, the O.P. kinda did provide enough info to go on. I could see from the single line of code that no input parameter was used. And I could see from the error message that the error occurred in the RegEx class within .NET, so this really had nothing to do with SQL Server or return types. Just to be sure, I took that single line of code and added it to a new / empty SQLCLR stored procedure (no input params, result set, or return value), executed it, and received the same error.
