I have three tables in an Excel workbook that I query with SQL.
There is an Inscriptions table that holds AGENT_ID and MLS_ID, a PHOTOS table that holds all the photos that arrived in the latest feed for each MLS_ID, and a PHOTOS_CURRENT table that holds all the photos currently in the system for each MLS_ID.
The goal is to find out whether the new feed contains photos that are not yet in the system.

I tried both a NOT EXISTS and a NOT IN approach. Both take too long to run (sometimes 2 minutes per AGENT_ID).

NOT EXISTS approach:

sqlQuery = "SELECT DISTINCT INSCR.MLS_ID FROM [INSCRIPTIONS_CURRENT$] INSCR, [PHOTOS$] P1 " & _
                "WHERE INSCR.AGENT_ID = " & inpAgentId & _
                " AND INSCR.MLS_ID = P1.MLS_ID AND NOT exists (select 1 from [PHOTOS_CURRENT$] PC1 where PC1.MLS_ID = P1.MLS_ID and PC1.PHOTO_ID = P1.PHOTO_ID)"

NOT IN approach:

sqlQuery = "SELECT DISTINCT INSCR.MLS_ID FROM [INSCRIPTIONS_CURRENT$] INSCR, [PHOTOS$] P1 " & _
                "WHERE INSCR.AGENT_ID = " & inpAgentId & _
                " AND INSCR.MLS_ID = P1.MLS_ID AND INSCR.MLS_ID NOT IN (select MLS_ID from [PHOTOS_CURRENT$] PC1 where PC1.MLS_ID = P1.MLS_ID and PC1.PHOTO_ID = P1.PHOTO_ID)"

DB connection is done as follows:

Sub Connect()

    Set objConnection = CreateObject("ADODB.Connection")
    objConnection.CommandTimeout = 120

End Sub

The query is sent to the procedure for processing as follows:

Function select_query(sqlQuery As String) As ADODB.Recordset

    Dim objRecordset As ADODB.Recordset

    Const adOpenStatic = 3
    Const adLockOptimistic = 3
    Const adCmdText = &H1

    Set objRecordset = CreateObject("ADODB.Recordset")

    objConnection.Open "Provider=Microsoft.Jet.OLEDB.4.0;" & _
    "Data Source=" & ThisWorkbook.FullName & _
    ";Extended Properties=""Excel 8.0;HDR=Yes;IMEX=1"";"

    objRecordset.Open sqlQuery, objConnection, adOpenStatic, adLockOptimistic, _
        adCmdText

    Set select_query = objRecordset

End Function

Any suggestions to improve the performance?

  • How large are the tables? Commented Jul 24, 2019 at 15:21
  • @TimWilliams Around 20,000 records in PHOTOS and PHOTOS_CURRENT, around 2,000 in INSCRIPTIONS Commented Jul 24, 2019 at 15:59
  • Please provide a fuller code block, not line snippets, so we can see the whole process. Otherwise, remember that Excel is not a database, so use an actual one like its sibling MS Access, which can index fields for faster table scans! Commented Jul 24, 2019 at 16:13
  • @Parfait there is not much sense in adding anything else to what I already provided. If the query finds any records, they are inserted into the output array and sent back to the calling routine. I know Excel is not a real database; I work within the constraints that the company imposes on me. This Excel workbook has about 20 worksheets and is used as a database, take it or leave it. I made my proposal for improvement to the company, but until they make the right decision I have to make this piece of code work. Commented Jul 24, 2019 at 16:23
  • If you're running this in a loop, it seems like you could create a table out of your NOT EXISTS query and join on it instead of repeating that query for every agent. Commented Jul 24, 2019 at 17:35

2 Answers

Consider the following tips that may help:

  • Explicit JOIN: Your queries use the older implicit join syntax, matching IDs in the WHERE clause, rather than the current standard explicit JOIN clause. In most database engines this should not change performance, but anecdotal evidence suggests that in specific cases it can:

    sqlQuery = "SELECT DISTINCT INSCR.MLS_ID " & _
               "FROM [INSCRIPTIONS_CURRENT$] INSCR " & _
               "INNER JOIN [PHOTOS$] P1 ON INSCR.MLS_ID = P1.MLS_ID " & _
               "WHERE INSCR.AGENT_ID = " & inpAgentId & _
               " AND NOT EXISTS (SELECT 1 FROM [PHOTOS_CURRENT$] PC1 " & _
               "WHERE PC1.MLS_ID = P1.MLS_ID AND PC1.PHOTO_ID = P1.PHOTO_ID)"
    
  • GROUP BY vs DISTINCT: This is a recurring debate in SQL, since database engines process de-duplicating queries differently. In theory there should be no difference in performance, but anecdotal evidence suggests otherwise. Therefore, consider an equivalent GROUP BY version:

    sqlQuery = "SELECT INSCR.MLS_ID " & _
               "FROM [INSCRIPTIONS_CURRENT$] INSCR " & _
               "INNER JOIN [PHOTOS$] P1 ON INSCR.MLS_ID = P1.MLS_ID " & _
               "WHERE INSCR.AGENT_ID = " & inpAgentId & _
               " AND NOT EXISTS (SELECT 1 FROM [PHOTOS_CURRENT$] PC1 " & _
               "WHERE PC1.MLS_ID = P1.MLS_ID AND PC1.PHOTO_ID = P1.PHOTO_ID) " & _
               "GROUP BY INSCR.MLS_ID"
    
  • DAO Connection: Since querying workbooks uses the JET/ACE SQL engine, consider DAO, an interface specific to this engine that can exploit many of its advantages, rather than ADO, a more generalized interface that works across any data source (Oracle, SQL Server, Postgres, etc.).

    ' ADD REFERENCE: Microsoft Office #.# Access Database Engine Object Library
    Dim conn As New DAO.DBEngine, db As DAO.Database, qdef As DAO.QueryDef, rst As DAO.Recordset
    
    Set db = conn.OpenDatabase("C:\Path\To\Workbook.xls", False, True, "Excel 8.0;HDR=Yes;")
    Set rst = db.OpenRecordset(sqlQuery)
    
    ...    
    
    rst.Close: db.Close
    Set rst = Nothing: Set db = Nothing: Set conn = Nothing
    
  • OLEDB (ACE) Connection: Consider the newer OLEDB provider which should still work with any version of Excel (.xls or .xlsx, .xlsb, .xlsm). Check available providers with this PowerShell script.

    objConnection.Open "Provider=Microsoft.ACE.OLEDB.12.0;" ...
    
    objConnection.Open "Provider=Microsoft.ACE.OLEDB.16.0;" ...
    
  • ODBC Connection: Query performance can differ between connection interfaces, and anecdotal evidence does not always match theory. Therefore, consider replacing the OLEDB provider with an ODBC driver connection:

    ' DRIVER VERSION
    objConnection.Open "DRIVER={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};" _
                           & "DBQ=C:\Path\To\Excel.xls;"
    
    ' DSN VERSION
    objConnection.Open "DSN=Excel Files;DBQ=C:\Path\To\Excel.xls;"
    
  • Cursor/Lock Types: Experiment with cursor types, since performance can vary, such as adOpenForwardOnly vs adOpenStatic, and with lock types, such as adLockOptimistic vs adLockReadOnly.
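
As a rough illustration of the cursor/lock tip, a read-only variant of the question's select_query might look like the sketch below. The name select_query_readonly is hypothetical, and the sketch assumes objConnection is the module-level connection from the question and is already open. Whether this is actually faster against a workbook would need to be measured.

```vba
' Sketch only: select_query_readonly is a hypothetical name, and objConnection
' is assumed to be the question's module-level ADODB connection, already open.
Function select_query_readonly(sqlQuery As String) As Object

    ' ADO enum values declared locally so the code also works with late binding
    Const adOpenForwardOnly = 0
    Const adLockReadOnly = 1
    Const adCmdText = &H1

    Dim objRecordset As Object
    Set objRecordset = CreateObject("ADODB.Recordset")

    ' Forward-only, read-only: rows can only be walked once, front to back
    objRecordset.Open sqlQuery, objConnection, adOpenForwardOnly, adLockReadOnly, _
        adCmdText

    Set select_query_readonly = objRecordset

End Function
```

Note that a forward-only recordset reports RecordCount as -1 and cannot move backwards, so calling code should loop with Do Until rs.EOF rather than rely on a row count.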

2 Comments

Thank you @Parfait. Your suggestion would help if I had the power to change the whole system. Unfortunately, as I said, I am bound to the existing Connect-Select-Disconnect approach, which is used by hundreds of queries in this Excel database. Full regression testing is out of the question. I'd rather wait for them to approve the move to a real DB and make it work there.
I do not understand your comment. All of these solutions can be handled in Excel.

Thanks @TimWilliams, your comment was most helpful in solving this problem. What I ended up doing is writing a separate routine that, during feed load, creates a table of all photos that were changed like this:

sqlQuery = "INSERT INTO [PHOTO_UPDATES$] SELECT P1.* " & _
                "FROM [PHOTOS$] P1 LEFT JOIN [PHOTOS_CURRENT$] PC1 " & _
                "ON P1.MLS_ID = PC1.MLS_ID AND P1.PHOTO_ID = PC1.PHOTO_ID WHERE PC1.PHOTO_ID is NULL"

Then, when creating the worklist per agent, the following is done:

sqlQuery = "SELECT DISTINCT INSCR.MLS_ID " & _
                "FROM [PHOTO_UPDATES$] PU1 , [INSCRIPTIONS_CURRENT$] INSCR " & _
                "WHERE INSCR.AGENT_ID = " & inpAgentId & " " & _
                "AND PU1.MLS_ID = INSCR.MLS_ID "

Both routines take less than 1 second to run.
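
If the worklists are still built in a loop over agents, one possible further step is to fetch the agent/listing pairs for every agent in a single query and split them in VBA. This is an untested sketch: BuildAllAgentsWorklistQuery is a hypothetical helper name, and it assumes [PHOTO_UPDATES$] has already been populated as described above.

```vba
' Sketch only: hypothetical helper that returns one query covering all agents,
' assuming the PHOTO_UPDATES sheet was filled during feed load.
Function BuildAllAgentsWorklistQuery() As String

    BuildAllAgentsWorklistQuery = _
        "SELECT DISTINCT INSCR.AGENT_ID, INSCR.MLS_ID " & _
        "FROM [PHOTO_UPDATES$] PU1 INNER JOIN [INSCRIPTIONS_CURRENT$] INSCR " & _
        "ON PU1.MLS_ID = INSCR.MLS_ID " & _
        "ORDER BY INSCR.AGENT_ID"

End Function
```

The result could then be passed to the existing select_query function once, with the rows grouped by AGENT_ID while iterating the recordset.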
