0

In SQL Server I have a table of bank transactions, and a table of payees. I want to link the transaction to the correct payee. I have used CHARINDEX (PayeeName, TxnDescription,1) > 0 to find the links.

It works OK for some transactions, but I have hit a problem with this situation:

There are 2 payees, "Tesco" and "Tesco Pet Insurance" – when the transaction description contains the words "Tesco Pet Insurance" it matches on both payees, and I only want a match on payee "Tesco Pet Insurance".

Can someone suggest an efficient way to achieve this?

This is my SQL statement:

SELECT
    tx.Description
    , py.Payee
FROM tblTxns TX
INNER JOIN vwPayeeNames py
    ON CHARINDEX(py.Payee, tx.Description, 1) > 0
WHERE tx.Description LIKE 'Tesco%'
AND LEN(tx.Description) >= LEN(py.Payee)
GROUP BY tx.Description
    , py.Payee
4
  • 1
    If you provide some sample data as DDL+DML (or a DBFiddle) you make it much easier for people to assist you. Commented Aug 6 at 2:11
  • 10
    Just to voice an opinion: Fuzzy matching of unstructured data is no way to write a financial application. I suggest that you take a step back and see if you can analyze the layout of your transaction text values to identify a rule or rules that can be used to reliably separate the payee name from other data. If you can do that, update your data design to include dedicated columns for each data element of interest, and then rewrite your query to perform exact matches. Even then, you may have ambiguities if you have similarly named payees and no distinct identifier (account #) to work with. Commented Aug 6 at 3:58
  • 2
    @TN I think you should make your comment an answer because I don't think there is a better solution than doing what you described. Commented Aug 6 at 5:45
  • You haven't shown the definitions of your tables and indexes, without which it's hard to give a good answer. See "Why should I provide an MCVE for what seems to me to be a very simple SQL query?" Commented Aug 6 at 10:58

3 Answers

2

You can use OUTER APPLY to:

  • match all possible payees in the description,

  • rank them by length (or by CHARINDEX position, if needed), and

  • pick the longest (most specific) match.

DDL:

-- Transactions table
CREATE TABLE tblTxns (
    TxnID INT IDENTITY(1,1) PRIMARY KEY,
    Description NVARCHAR(255)
);

-- Payees table or view
CREATE TABLE vwPayeeNames (
    PayeeID INT IDENTITY(1,1) PRIMARY KEY,
    Payee NVARCHAR(100)
);

INSERT INTO vwPayeeNames (Payee)
VALUES 
    ('Tesco'),
    ('Tesco Pet Insurance'),
    ('Amazon'),
    ('Amazon Marketplace'),
    ('Shell'),
    ('Starbucks');

INSERT INTO tblTxns (Description)
VALUES 
    ('Payment to Tesco'),
    ('Tesco Pet Insurance monthly premium'),
    ('Amazon Marketplace Order #123'),
    ('Amazon Prime subscription'),
    ('Fuel from Shell garage'),
    ('Coffee at Starbucks'),
    ('Tesco Groceries'),
    ('Tesco Bank Credit Card Payment');

And the query:

SELECT
    tx.TxnID,
    tx.Description,
    bestMatch.Payee
FROM tblTxns tx
OUTER APPLY (
    SELECT TOP 1 py.Payee
    FROM vwPayeeNames py
    WHERE CHARINDEX(py.Payee, tx.Description) > 0
    ORDER BY LEN(py.Payee) DESC
) bestMatch
ORDER BY tx.TxnID;

DB fiddle
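
A possible variation on the query above, offered as a sketch rather than a drop-in replacement. OUTER APPLY returns every transaction and leaves Payee as NULL when no payee matches, whereas CROSS APPLY returns only the transactions that have at least one matching payee. The second sort key, py.PayeeID, is an assumption added here so that two matching payees of equal length resolve the same way on every run. The sketch reuses the tblTxns and vwPayeeNames tables from the DDL above.

```sql
-- Sketch: same longest-match idea, but unmatched transactions are dropped.
SELECT
    tx.TxnID,
    tx.Description,
    bestMatch.Payee
FROM tblTxns tx
CROSS APPLY (
    SELECT TOP (1) py.Payee
    FROM vwPayeeNames py
    WHERE CHARINDEX(py.Payee, tx.Description) > 0
    ORDER BY LEN(py.Payee) DESC,   -- longest (most specific) payee first
             py.PayeeID            -- assumed tie-breaker for equal lengths
) bestMatch
ORDER BY tx.TxnID;
```

Which of the two to use depends on whether unmatched transactions should still show up in the result, for example so that they can be reviewed by hand.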


6 Comments

but does this work if I only search for "Tesco"?
Yes, of course :)
@BendingRodriguez The join uses partial matches. For filtering, you can use strict equality against the Payee column.
Yes, this will work for equality filter - not if LIKE is used. OP, for some reason, asked about using LIKE operator (Tesco%) and the issue of getting two payees instead of one.
I had not come across OUTER APPLY before. I will try this out tomorrow
1

I am answering this on the assumption that you are not writing a financial application, but just want something for personal use and you have data in a certain form, which it is not worth reworking.

Essentially, what you want is a "best match". Tesco and Tesco Pet Insurance both match in your current query, but you want only the best fit. One way to get it is to select a third column that replaces the Payee inside the Description with an empty string. The row whose resulting column is shortest (i.e. the one where the Payee removed the most text) is the best fit.

Using this technique, something like the following should do the trick:

DECLARE @tblTxns TABLE ([Description] nvarchar(100), Amount decimal(10,2));
DECLARE @tblPayee TABLE (Payee nvarchar(100));

INSERT INTO @tblTxns VALUES
('Tesco Pet Insurance Dog Health Care Year Premium', 250.0),
('MyFitness Gym Monthly fee', 30.0);

INSERT INTO @tblPayee VALUES
('Tesco'),
('Tesco Pet Insurance'),
('MyFitness');

-- CTE: every payee found in the description, plus the description with that payee removed
WITH CTE AS
(SELECT
    tx.[Description], py.Payee, REPLACE(tx.[Description], py.Payee, '') AS NoPayee
FROM @tblTxns tx
INNER JOIN @tblPayee py
    ON CHARINDEX(py.Payee, tx.[Description], 1) > 0),
-- CTE2: rank the candidates per description, shortest remainder first
CTE2 AS
(SELECT c.[Description], c.Payee, ROW_NUMBER() OVER(PARTITION BY c.[Description] ORDER BY LEN(c.NoPayee)) AS rn
FROM CTE c)
SELECT c2.[Description], c2.Payee
FROM CTE2 c2
WHERE c2.rn = 1;
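
As a hedged alternative, the ranking can also be done on the length of the payee name itself. REPLACE removes every occurrence of the payee, so a short name that appears several times in one description could leave a shorter remainder than the more specific name; ordering by LEN(Payee) sidesteps that case. The table variables @txns and @payees are hypothetical names used only to keep this sketch self-contained.

```sql
-- Hypothetical sample data, mirroring the table variables used above.
DECLARE @txns TABLE ([Description] nvarchar(100));
DECLARE @payees TABLE (Payee nvarchar(100));

INSERT INTO @txns VALUES
('Tesco Pet Insurance Dog Health Care Year Premium'),
('MyFitness Gym Monthly fee');

INSERT INTO @payees VALUES
('Tesco'),
('Tesco Pet Insurance'),
('MyFitness');

-- Rank each matching payee per description, longest payee name first.
WITH Ranked AS
(
    SELECT
        tx.[Description],
        py.Payee,
        ROW_NUMBER() OVER (PARTITION BY tx.[Description]
                           ORDER BY LEN(py.Payee) DESC) AS rn
    FROM @txns tx
    INNER JOIN @payees py
        ON CHARINDEX(py.Payee, tx.[Description]) > 0
)
SELECT r.[Description], r.Payee
FROM Ranked r
WHERE r.rn = 1;
```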

For future reference, when asking a database question, please provide table definitions and sample data along the lines that I have used. Just as an illustration, I am using table variables, as they don't have to be deleted, but CREATE TABLE would be quite acceptable. Sample data in the form of INSERT statements is desirable. Why? Simply so that people here are spared a bit of time and effort, in trying to provide you with a workable answer.


0

Besides the fact that you should not make your data dependent on such joins, there is an option that may make the match more reliable: try the DIFFERENCE() function to help you fetch the rows you want. It returns an integer from 0 to 4 that measures how similar the SOUNDEX() values of two character expressions are, with 4 meaning the most similar.
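
Before relying on these scores, it may help to look at what the two functions return for your own data. The following is a minimal probe; the three description strings are hypothetical examples. A SOUNDEX code is only four characters long and is built from the start of the string, so a description that merely contains the payee somewhere in the middle may not score as high as expected.

```sql
-- Inspect the SOUNDEX codes and the DIFFERENCE score for a lookup value
-- against a few hypothetical descriptions.
SELECT
    d.Description,
    SOUNDEX('Tesco')                   AS lookup_code,
    SOUNDEX(d.Description)             AS description_code,
    DIFFERENCE('Tesco', d.Description) AS similarity   -- 0 (least) to 4 (most similar)
FROM (VALUES
        ('Tesco'),
        ('Tesco Pet Insurance'),
        ('Payed by Tesco')
     ) AS d(Description);
```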

WITH    --  S a m p l e    D a t a :
  lookUp AS 
    ( Select 'Tesco' as lkp ), --  testing lookup value
  vwPayeeNames  AS
    ( Select 1 as id, 'Tesco' as payee Union All 
      Select 2, 'Tesco pet insurance' Union All
      Select 3, 'Sellers Co.' 
    ), 
  tblTxns AS 
    ( Select 1001 as id, 'Bill 123 - Tesco - payed' as Description Union All
      Select 1002, 'Is this - Tesco pet insurance - last payment' Union All
      Select 1003, 'What on Earth did Sellers Co. payed this for' Union All
      Select 1004, 'Bill 123 - Tesco - again' Union All
      Select 1005, 'Payed by Tesco'
    )

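Main SQL, which continues the WITH clause above. It computes two row numbers for each row: rn_payee (numbered per payee, ordered by the difference) and rn_sndx (numbered per difference value, ordered by payee length). A row is fetched only when the two numbers are equal.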

SELECT   x.tx_id, x.Description, x.payee_id, x.Payee, x.rn_payee, x.rn_sndx 
FROM   ( SELECT      tx.id as tx_id, tx.Description, py.id as payee_id, py.Payee, 
                     Row_Number() Over(Partition By py.Payee
                                       Order By DIFFERENCE(lp.lkp, tx.Description)) as rn_payee,
                     Row_Number() Over(Partition By DIFFERENCE(lp.lkp, tx.Description)
                                       Order By DIFFERENCE(lp.lkp, tx.Description), LEN(py.Payee)) as rn_sndx,
                     DIFFERENCE(lp.lkp, tx.Description) as diff
         FROM        tblTxns tx
         INNER JOIN  lookUp lp ON 1 = 1
         INNER JOIN  vwPayeeNames py ON CHARINDEX(py.Payee, tx.Description, 1) > 0
         WHERE       py.Payee Like Concat(lp.lkp, '%') And 
                     LEN(py.Payee) >= LEN(lp.lkp)
       ) x
WHERE   x.rn_payee = x.rn_sndx

Result: ( 'Tesco' )

tx_id  Description               payee_id  Payee  rn_payee  rn_sndx
1001   Bill 123 - Tesco - payed  1         Tesco  1         1
1004   Bill 123 - Tesco - again  1         Tesco  2         2
1005   Payed by Tesco            1         Tesco  3         3

fiddle

NOTE: Once more, this is not a bulletproof solution. Consider restructuring your data.
