I have a table with the following columns:

  • ID: identifies an imported document (think filename). It is unique within a combination of ImportId and ImportTime.
  • Value: some data column. In the real table there are more columns.
  • ImportId: an ID in the format YYYY-MM-DD, e.g. "2022-05-14" (this is a string column)
  • ImportTime: the date and time the import was done (this is a string column)

The RowNum is NOT part of the table but used here to be able to reference records/rows.

RowNum  ID  Value                             ImportId    ImportTime
1       A   Doc A content as of May 11, 2022  2022-05-11  2022-05-11 13:00
2       B   Doc B content as of May 11, 2022  2022-05-11  2022-05-11 13:00
3       A   Doc A content as of May 11, 2022  2022-05-11  2022-05-11 17:00
4       B   Doc B content as of May 11, 2022  2022-05-11  2022-05-11 17:00
5       A   Doc A content as of May 14, 2022  2022-05-14  2022-05-17 08:00
6       B   Doc B content as of May 14, 2022  2022-05-14  2022-05-17 08:00
7       A   Doc A content as of May 14, 2022  2022-05-14  2022-05-17 10:00
8       B   Doc B content as of May 14, 2022  2022-05-14  2022-05-17 10:00
9       A   Doc A content as of May 11, 2022  2022-05-11  2022-05-18 15:00
10      B   Doc B content as of May 11, 2022  2022-05-11  2022-05-18 15:00

  • In the table above there were three imports for May 11 (ImportId = "2022-05-11") and two imports for data from May 14 (ImportId = "2022-05-14").
  • The latest import run (ImportTime) was at 2022-05-18 15:00
  • The latest ImportTime does not necessarily correlate with the latest import data. In my example above, someone ran an import on May 18 at 15:00 but imported the state of the catalog as it was on May 11 (ImportId = "2022-05-11").

Challenge: I need to get the records with the newest ImportId (which would be "2022-05-14") and, among those, the latest ImportTime (which would be "2022-05-17 10:00", not the overall latest "2022-05-18 15:00").

For the example above, the result should contain the two rows with ImportId "2022-05-14" and ImportTime "2022-05-17 10:00" (row numbers 7 and 8).

What I tried:

Approach 1

I used arg_max() on ImportTime:

T
| summarize arg_max(ImportTime, *) by ID

This returns the last two rows (9 and 10), where ImportId is "2022-05-11". That's not what I'm after because the newest ImportId is "2022-05-14".

Approach 2

If I use arg_max(ImportId, *) by ID instead, I am getting the ones for "2022-05-14" (rows 5 and 6), but not the ones with the latest ImportTime.

Approach 3

I combined ImportTime and ImportId into an extended column and applied arg_max() on that. This seems to work but I'm unsure if it's correct in all cases?

T
| extend Combined = strcat(ImportId, ImportTime)
| summarize arg_max(Combined, *) by ID

This returns the expected rows 7 and 8 for "2022-05-14" at the import time of "2022-05-17 10:00".

Are there better options?

  • It's clear that you put some effort into writing this post; however, the description of the scenario is very confusing, and the reader has to piece things together to understand what you want. "there were three imports for May 11" - why? Because there are 3 distinct ImportTime values for the ImportId "2022-05-11"? Is an "import", by this definition, a combination of ImportTime and ImportId regardless of Value? From the use of Value in your code, it seems that Value is the ID of an import, and not just "some content". Commented May 20, 2022 at 15:22
  • Following that, you introduce a new term, "import run", without defining it. You also don't supply the requested results; you only comment on the results of your attempts. Commented May 20, 2022 at 15:23
  • Sorry if it's not 100% clear. Sometimes it's hard to create an abstraction. In the real table, there's not just "Value" but many columns with various pieces of data. And you are right, Value can be seen as a unique ID for the row and it will be the same for every import. The import ID is in the format of "yyyy-mm-dd". That's why I'm saying there's an import for "May 11, 2022" because that's "2022-05-11". An import is indeed a combination of ImportID + ImportTime, yes. In the actual table, there will obviously be additional columns. Commented May 20, 2022 at 16:00
  • Edited the question to make things clearer. Hopefully. Commented May 20, 2022 at 16:13
  • The role of ID is still not clear in the context of your desired results. If RowNum 6 & 8 were removed from the data sample, what would the desired result be then? RowNum 7? 7 & 10? Other? Commented May 20, 2022 at 16:39

2 Answers

Check out the top-nested operator:

datatable(Value:string, ImportId:datetime, ImportTime:datetime)
[
    "A",    datetime(2022-05-11),   datetime(2022-05-11 13:00),
    "B",    datetime(2022-05-11),   datetime(2022-05-11 13:00),
    "A",    datetime(2022-05-11),   datetime(2022-05-11 17:00),
    "B",    datetime(2022-05-11),   datetime(2022-05-11 17:00),
    "A",    datetime(2022-05-14),   datetime(2022-05-17 08:00),
    "B",    datetime(2022-05-14),   datetime(2022-05-17 08:00),
    "A",    datetime(2022-05-14),   datetime(2022-05-17 10:00),
    "B",    datetime(2022-05-14),   datetime(2022-05-17 10:00),
    "A",    datetime(2022-05-11),   datetime(2022-05-18 15:00),
    "B",    datetime(2022-05-11),   datetime(2022-05-18 15:00)
]
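// Level 1: keep every Value (max(1) is a dummy aggregate, returned as "ignore")
| top-nested of Value by ignore=max(1),
// Level 2: within each Value, keep only the newest ImportId
  top-nested 1 of ImportId by max(ImportId),
// Level 3: within that ImportId, keep only the latest ImportTime
  top-nested 1 of ImportTime by max(ImportTime)
| project Value, ImportId, ImportTime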
Value  ImportId                     ImportTime
A      2022-05-14 00:00:00.0000000  2022-05-17 10:00:00.0000000
B      2022-05-14 00:00:00.0000000  2022-05-17 10:00:00.0000000

3 Comments

Interesting. I didn't know this operator. However, it fails on the data types in my table: "top-nested expects only aggregate scalar functions that produces numeric values." ImportId and ImportTime are strings.
You can try casting ImportTime to datetime using todatetime(ImportTime). BTW, another alternative would be using the partition operator, e.g. T | where ImportId == toscalar(T | summarize max(ImportId)) | partition by Value ( top 1 by ImportTime desc )
Yoni & Yifat describe two different logics. If you change all values of Value to A for ImportId = 2022-05-14, you'll see that you get different results. Please edit your post and clarify what you are asking.
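
The partition-based alternative from the comment above can be written out against string-typed columns like those in the question. This is a minimal, untested sketch: it uses the question's schema (ID is the document key, which the answer's datatable calls Value), the names T and NewestImportId are illustrative, and it assumes both strings are zero-padded (yyyy-MM-dd and yyyy-MM-dd HH:mm) so that string order matches chronological order.

```kusto
// Sample data from the question, with ImportId and ImportTime kept as strings.
let T = datatable(ID:string, Value:string, ImportId:string, ImportTime:string)
[
    "A", "Doc A content as of May 11, 2022", "2022-05-11", "2022-05-11 13:00",
    "B", "Doc B content as of May 11, 2022", "2022-05-11", "2022-05-11 13:00",
    "A", "Doc A content as of May 11, 2022", "2022-05-11", "2022-05-11 17:00",
    "B", "Doc B content as of May 11, 2022", "2022-05-11", "2022-05-11 17:00",
    "A", "Doc A content as of May 14, 2022", "2022-05-14", "2022-05-17 08:00",
    "B", "Doc B content as of May 14, 2022", "2022-05-14", "2022-05-17 08:00",
    "A", "Doc A content as of May 14, 2022", "2022-05-14", "2022-05-17 10:00",
    "B", "Doc B content as of May 14, 2022", "2022-05-14", "2022-05-17 10:00",
    "A", "Doc A content as of May 11, 2022", "2022-05-11", "2022-05-18 15:00",
    "B", "Doc B content as of May 11, 2022", "2022-05-11", "2022-05-18 15:00"
];
// Newest ImportId across the whole table (string comparison).
let NewestImportId = toscalar(T | summarize max(ImportId));
T
| where ImportId == NewestImportId
// Within that import, keep the latest run per document.
| partition by ID ( top 1 by ImportTime desc )
```

For the sample data this should leave the two rows of import "2022-05-14" with ImportTime "2022-05-17 10:00". Note that it pins one global ImportId, whereas the top-nested query picks the newest ImportId per document, so the two can differ when a document is missing from the newest import.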

You can try this approach as well using the unlimited partition operator:

datatable(Value:string, ImportId:datetime, ImportTime:datetime)
[
    "A",    datetime(2022-05-11),   datetime(2022-05-11 13:00),
    "B",    datetime(2022-05-11),   datetime(2022-05-11 13:00),
    "A",    datetime(2022-05-11),   datetime(2022-05-11 17:00),
    "B",    datetime(2022-05-11),   datetime(2022-05-11 17:00),
    "A",    datetime(2022-05-14),   datetime(2022-05-17 08:00),
    "B",    datetime(2022-05-14),   datetime(2022-05-17 08:00),
    "A",    datetime(2022-05-14),   datetime(2022-05-17 10:00),
    "B",    datetime(2022-05-14),   datetime(2022-05-17 10:00),
    "A",    datetime(2022-05-11),   datetime(2022-05-18 15:00),
    "B",    datetime(2022-05-11),   datetime(2022-05-18 15:00)
]
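// Process each Value separately
| partition hint.strategy = native by Value
(
    // For each ImportId of that Value, keep the row with the latest ImportTime
    partition hint.strategy = native by ImportId
    (
        top 1 by ImportTime
    )
    // Of those rows, keep the one with the newest ImportId
    | top 1 by ImportId
)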
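
For string-typed columns like those in the question, the same per-document logic might also be expressed with two arg_max() passes instead of nested partitions. This is a minimal, untested sketch: it uses the question's schema (ID is the document key, which the datatable above calls Value), T is an illustrative name, and it assumes zero-padded date strings so that string order equals chronological order.

```kusto
// Sample data from the question, with ImportId and ImportTime kept as strings.
let T = datatable(ID:string, Value:string, ImportId:string, ImportTime:string)
[
    "A", "Doc A content as of May 11, 2022", "2022-05-11", "2022-05-11 13:00",
    "B", "Doc B content as of May 11, 2022", "2022-05-11", "2022-05-11 13:00",
    "A", "Doc A content as of May 11, 2022", "2022-05-11", "2022-05-11 17:00",
    "B", "Doc B content as of May 11, 2022", "2022-05-11", "2022-05-11 17:00",
    "A", "Doc A content as of May 14, 2022", "2022-05-14", "2022-05-17 08:00",
    "B", "Doc B content as of May 14, 2022", "2022-05-14", "2022-05-17 08:00",
    "A", "Doc A content as of May 14, 2022", "2022-05-14", "2022-05-17 10:00",
    "B", "Doc B content as of May 14, 2022", "2022-05-14", "2022-05-17 10:00",
    "A", "Doc A content as of May 11, 2022", "2022-05-11", "2022-05-18 15:00",
    "B", "Doc B content as of May 11, 2022", "2022-05-11", "2022-05-18 15:00"
];
T
// Pass 1: latest run for every (document, ImportId) pair.
| summarize arg_max(ImportTime, *) by ID, ImportId
// Pass 2: of those rows, the one with the newest ImportId per document.
| summarize arg_max(ImportId, *) by ID
```

Unlike concatenating the two strings, this keeps the two orderings separate, so it does not depend on ImportId having a fixed length relative to ImportTime; it still relies on each column sorting correctly as a string.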
