Essentially, my question boils down to observing these two query plans (from SQL Server 2019):
- Plan using a SQL-defined scalar function.
- Plan using a CLR-defined scalar function.
I defined a scalar function to parse out a string representation of an IP address into the binary(16) value of the IPv6 address (mapping IPv4 addresses over to IPv6). I first implemented it in SQL, but then I also implemented it in C# using the built-in IPAddress class to parse the value. I'm attempting to use this function to join a table containing IP address strings to a table with a list of parsed out CIDR blocks (clustered index on [Start] and [End] binary(16) values).
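For context, the parsing behavior the function relies on (IPv4 mapped into the IPv6 space so every address becomes a fixed 16-byte value) can be sketched outside SQL. This is a minimal Python illustration using the standard ipaddress module, not the actual CLR implementation:

```python
import ipaddress

def parse_ip(text: str) -> bytes:
    """Parse an IPv4 or IPv6 string into 16 bytes, mapping IPv4 into IPv6."""
    addr = ipaddress.ip_address(text)
    if isinstance(addr, ipaddress.IPv4Address):
        # IPv4-mapped IPv6 form ::ffff:a.b.c.d -> 10 zero bytes, 0xFFFF, then the IPv4 bytes.
        addr = ipaddress.IPv6Address(b"\x00" * 10 + b"\xff\xff" + addr.packed)
    return addr.packed  # always 16 bytes, comparable as binary(16)

print(parse_ip("192.0.2.1").hex())  # 00000000000000000000ffffc0000201
```

Because the IPv4 range sorts into a contiguous block under this mapping, the resulting binary(16) values compare correctly against the [Start]/[End] columns.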
There were two ways that I wrote the SQL query:
- Use the scalar function directly in the JOIN criteria.
SELECT
    *
FROM
    [dbo].[values] val
    LEFT JOIN [dbo].[cidr] cidr
        ON [dbo].[fn_ParseIP](val.[IpAddress]) BETWEEN cidr.[Start] AND cidr.[End]
;
- Compute the scalar function in an APPLY clause before referencing it in the JOIN.
SELECT
    val.*, cidr.*
FROM
    [dbo].[values] val
    CROSS APPLY
    (
        SELECT [ParsedIpAddress] = [dbo].[fn_ParseIP](val.[IpAddress])
    ) calc
    LEFT JOIN [dbo].[cidr] cidr
        ON calc.[ParsedIpAddress] BETWEEN cidr.[Start] AND cidr.[End]
;
In my testing, [dbo].[values] contained 17 rows (and was defined using VALUES rather than being an actual table) and [dbo].[cidr] contained 986,320 rows.
When using a scalar function defined in SQL, query 1 takes about 7.5 minutes to run and query 2 takes under 1 second.
When using a CLR scalar function, both queries take about 2.5 minutes to run, but query 2 has an extra node in the query plan to compute the scalar function after the join.
The difference is ultimately that when referencing the scalar function that was defined in SQL, I'm able to get it to generate the first plan where it computes the results of the scalar function first and then uses those as the seek predicate into the [dbo].[cidr] clustered index when performing the join. But when using the CLR function, it always performs the calculation as part of the clustered index seek (and filter), so it is running the function significantly more often to get the results.
My assumption is that this behavior could be due to the query planner believing the function to be nondeterministic, but I have the CLR function implemented with the following attributes:
[SqlFunction(
    DataAccess = DataAccessKind.None,
    IsDeterministic = true,
    IsPrecise = true
)]
[return: SqlFacet(IsFixedLength = true, IsNullable = true, MaxSize = 16)]
public static SqlBinary fn_CLR_ParseIP([SqlFacet(MaxSize = 50)] SqlString ipAddress)
{
    // Body omitted for brevity; a minimal sketch using System.Net.IPAddress
    // (MapToIPv6 pads IPv4 addresses to the 16-byte ::ffff:a.b.c.d form):
    if (ipAddress.IsNull)
        return SqlBinary.Null;
    return new SqlBinary(IPAddress.Parse(ipAddress.Value).MapToIPv6().GetAddressBytes());
}
My hope was that I could rely on the .NET standard library to deal with the IP address parsing for me in SQL. Currently, we have some processes that work with IPv4 addresses only, and I need to update them to work with IPv6 as well. In some of our large databases, this processing is very slow, so I was hoping that the parsing logic in .NET would be more efficient. It seems like the CLR function itself is faster than my SQL implementation, but the effect on the query plan is significantly worse.
I can likely rewrite the query to use temporary tables to parse the IP addresses out first, and that should resolve this issue. I can also get decent results when defining an equivalent CLR table-valued function that returns just a single row.
However, I'd like to know if there is something I'm missing that would make it easier to use a CLR scalar function as part of a filter predicate. Is it just a bad idea and I should proceed with one of these alternatives, or is there something I could do to make the CLR function work as a drop-in replacement for the SQL function?
For anybody who is interested, here is the final query that performs well, using the approach from T N's answer.
SELECT
    *
FROM
    [dbo].[values] val
    CROSS APPLY
    (
        SELECT [IpAddress] = [dbo].[fn_CLR_ParseIP](val.[IpAddress])
    ) parsed
    OUTER APPLY
    (
        SELECT TOP (1)
            *
        FROM
            [dbo].[cidr] _cidr
        WHERE
            _cidr.[range_start] <= parsed.[IpAddress]
            AND _cidr.[range_end] >= parsed.[IpAddress]
        ORDER BY
            _cidr.[range_start] DESC
    ) cidr
;
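The TOP (1) ... ORDER BY [range_start] DESC pattern works because the ranges are sorted and non-overlapping: seek to the last range that starts at or before the address, then verify the address falls inside it. A Python sketch of that lookup, with hypothetical in-memory data and bisect standing in for the clustered-index seek:

```python
import bisect

# Sorted, non-overlapping ranges as (range_start, range_end) 16-byte values
# (hypothetical sample data; real rows come from [dbo].[cidr]).
ranges = [
    (bytes.fromhex("00000000000000000000ffffc0000200"),   # 192.0.2.0
     bytes.fromhex("00000000000000000000ffffc00002ff")),  # 192.0.2.255
    (bytes.fromhex("20010db8000000000000000000000000"),   # 2001:db8::/32
     bytes.fromhex("20010db8ffffffffffffffffffffffff")),
]
starts = [r[0] for r in ranges]

def lookup(ip: bytes):
    """Seek the last range whose start <= ip, then verify ip <= end."""
    i = bisect.bisect_right(starts, ip) - 1
    if i >= 0 and ip <= ranges[i][1]:
        return ranges[i]
    return None  # IP falls outside every range
```

This mirrors what the OUTER APPLY subquery asks the clustered index to do for each row: one descending seek on [range_start] plus a residual check on [range_end], rather than scanning every candidate range.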


Comments:

- Does the cidr table have rows with overlapping start/end ranges, or are the ranges distinct and non-overlapping? If the ranges are non-overlapping, with at most one match for any given IP address, it should be possible to use the OUTER APPLY (SELECT TOP 1 ...) pattern to efficiently select a single candidate row and then verify the range.
- Changing the LEFT JOIN to an OUTER APPLY works efficiently with the CLR function, so that's a good solution I hadn't considered, and it would probably be the one I use for my queries if I go with the CLR function. However, my broader question is ultimately whether there's some sort of magic I'm missing for deploying or defining CLR functions that would play nicer with the query planner, or if this is just emblematic of the difficulties of using CLR functions.
- I think a big part of the 11-second runtime from yesterday was that I was also running a resource-intensive process on my computer while testing (I'm testing on a local instance). Today it is closer to 3 seconds, and it is the same when doing the subquery in the OUTER APPLY like you were suggesting. And yes, that condition holds true: there are no overlaps in the CIDR ranges in the table.