How to summarize data with arg_max() in KQL using two columns?

Question

I have a table with the following columns:

ID: identifies an imported document (think filename). It is unique within a combination of ImportId and ImportTime.
Value: some data column. In the real table there are more columns.
ImportId: an ID in the format YYYY-MM-DD, e.g. "2022-05-14" (this is a string column)
ImportTime: the date and time the import was done (this is a string column)

RowNum is NOT part of the table; it is only used here to reference rows.

RowNum	ID	Value	ImportId	ImportTime
1	A	Doc A content as of May 11, 2022	2022-05-11	2022-05-11 13:00
2	B	Doc B content as of May 11, 2022	2022-05-11	2022-05-11 13:00
3	A	Doc A content as of May 11, 2022	2022-05-11	2022-05-11 17:00
4	B	Doc B content as of May 11, 2022	2022-05-11	2022-05-11 17:00
5	A	Doc A content as of May 14, 2022	2022-05-14	2022-05-17 08:00
6	B	Doc B content as of May 14, 2022	2022-05-14	2022-05-17 08:00
7	A	Doc A content as of May 14, 2022	2022-05-14	2022-05-17 10:00
8	B	Doc B content as of May 14, 2022	2022-05-14	2022-05-17 10:00
9	A	Doc A content as of May 11, 2022	2022-05-11	2022-05-18 15:00
10	B	Doc B content as of May 11, 2022	2022-05-11	2022-05-18 15:00

In the table above there were three imports for May 11 (ImportId = "2022-05-11") and two imports for data from May 14 (ImportId = "2022-05-14").
The latest import run (ImportTime) was at 2022-05-18 15:00
The latest ImportTime does not necessarily correlate with the latest import data. In my example above, someone ran an import on May 18 at 15:00 but imported the state of the catalog as it was on May 11 (ImportId = "2022-05-11").

Challenge: I need to get the records with the newest ImportId (which would be "2022-05-14") and, within that ImportId, the latest ImportTime (which would be "2022-05-17 10:00").

For the example above, the result should contain the two rows with ImportId "2022-05-14" and ImportTime "2022-05-17 10:00" (row numbers 7 and 8).

What I tried:

Approach 1

I used arg_max() on ImportTime:

T
| summarize arg_max(ImportTime, *) by ID

This returns the last two rows (9 and 10), where ImportId is "2022-05-11". That's not what I'm after because the newest ImportId is "2022-05-14".

Approach 2

If I use arg_max(ImportId, *) by ID instead, I am getting the ones for "2022-05-14" (rows 5 and 6), but not the ones with the latest ImportTime.

Approach 3

I combined ImportId and ImportTime into an extended column and applied arg_max() on that. This seems to work, but I'm unsure whether it's correct in all cases.

T
| extend Combined = strcat(ImportId, ImportTime)
| summarize arg_max(Combined, *) by ID

This returns the expected rows 7 and 8 for "2022-05-14" at the import time of "2022-05-17 10:00".
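
For reference, here is a self-contained sketch of Approach 3 on the sample data with string columns. The `let T = datatable(...)` wrapper and the shortened Value strings are assumptions added for illustration. The comparison on the concatenated string only orders rows correctly if both columns are fixed-width, zero-padded strings (yyyy-MM-dd and yyyy-MM-dd HH:mm), which is an assumption about the real data.

```kusto
// Sample data from the question, with ImportId and ImportTime kept as strings.
let T = datatable(ID:string, Value:string, ImportId:string, ImportTime:string)
[
    "A", "Doc A as of May 11", "2022-05-11", "2022-05-11 13:00",
    "B", "Doc B as of May 11", "2022-05-11", "2022-05-11 13:00",
    "A", "Doc A as of May 11", "2022-05-11", "2022-05-11 17:00",
    "B", "Doc B as of May 11", "2022-05-11", "2022-05-11 17:00",
    "A", "Doc A as of May 14", "2022-05-14", "2022-05-17 08:00",
    "B", "Doc B as of May 14", "2022-05-14", "2022-05-17 08:00",
    "A", "Doc A as of May 14", "2022-05-14", "2022-05-17 10:00",
    "B", "Doc B as of May 14", "2022-05-14", "2022-05-17 10:00",
    "A", "Doc A as of May 11", "2022-05-11", "2022-05-18 15:00",
    "B", "Doc B as of May 11", "2022-05-11", "2022-05-18 15:00"
];
T
// ImportId comes first so it dominates the string comparison;
// ImportTime only breaks ties between rows with the same ImportId.
| extend Combined = strcat(ImportId, ImportTime)
| summarize arg_max(Combined, *) by ID
| project-away Combined
```

On the sample this should return the rows numbered 7 and 8. Note that it picks the newest ImportId per ID, not the newest ImportId across the whole table.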

Are there better options?

It's clear that you put some effort into writing this post, but the description of the scenario is very confusing, and the reader has to piece it together to understand what you want. "there were three imports for May 11" - why? Because there are 3 distinct ImportTime values for ImportId "2022-05-11"? Is an "import" by this definition a combination of ImportTime and ImportId, regardless of Value? From the use of Value in your code, it seems that Value is the ID of an import and not just "some content".
– David דודו Markovitz, Commented May 20, 2022 at 15:22
Following that, you introduce a new term, "import run", without defining it. You also don't supply the requested results; you only comment on the results of your attempts.
– David דודו Markovitz, Commented May 20, 2022 at 15:23
Sorry if it's not 100% clear. Sometimes it's hard to create an abstraction. In the real table, there's not just "Value" but many columns with various pieces of data. And you are right, Value can be seen as a unique ID for the row and it will be the same for every import. The import ID is in the format of "yyyy-mm-dd". That's why I'm saying there's an import for "May 11, 2022" because that's "2022-05-11". An import is indeed a combination of ImportId + ImportTime, yes. In the actual table, there will obviously be additional columns.
– Krumelur, Commented May 20, 2022 at 16:00
The role of ID is still not clear in the context of your desired results. If RowNum 6 & 8 were removed from the data sample, what would be the desired result then? RowNum 7? 7 & 10? Other?
– David דודו Markovitz, Commented May 20, 2022 at 16:39

yifats · Accepted Answer · 2022-05-20 14:17:43Z

3

Check out the top-nested operator:

datatable(Value:string, ImportId:datetime, ImportTime:datetime)
[
    "A",    datetime(2022-05-11),   datetime(2022-05-11 13:00),
    "B",    datetime(2022-05-11),   datetime(2022-05-11 13:00),
    "A",    datetime(2022-05-11),   datetime(2022-05-11 17:00),
    "B",    datetime(2022-05-11),   datetime(2022-05-11 17:00),
    "A",    datetime(2022-05-14),   datetime(2022-05-17 08:00),
    "B",    datetime(2022-05-14),   datetime(2022-05-17 08:00),
    "A",    datetime(2022-05-14),   datetime(2022-05-17 10:00),
    "B",    datetime(2022-05-14),   datetime(2022-05-17 10:00),
    "A",    datetime(2022-05-11),   datetime(2022-05-18 15:00),
    "B",    datetime(2022-05-11),   datetime(2022-05-18 15:00)
]
| top-nested of Value by ignore=max(1),
  top-nested 1 of ImportId by max(ImportId),
  top-nested 1 of ImportTime by max(ImportTime)
| project Value, ImportId, ImportTime

Value	ImportId	ImportTime
A	2022-05-14 00:00:00.0000000	2022-05-17 10:00:00.0000000
B	2022-05-14 00:00:00.0000000	2022-05-17 10:00:00.0000000

3 Comments

Krumelur Over a year ago

Interesting. I didn't know this operator. However, it fails on the data types my data has: "top-nested expects only aggregate scalar functions that produces numeric values." ImportId and ImportTime are strings.

Yoni L. Over a year ago

You can try casting ImportTime to datetime using todatetime(ImportTime). BTW, another alternative would be to use the partition operator, e.g. T | where ImportId == toscalar(T | summarize max(ImportId)) | partition by Value ( top 1 by ImportTime desc )
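
A minimal sketch of the casting suggestion, applied to the top-nested query from the answer. It assumes the string columns hold values that todatetime() can parse (as the sample values do); the `let T = datatable(...)` wrapper is added only to make the example self-contained.

```kusto
// Same sample as in the answer, but with string columns as in the real table.
let T = datatable(Value:string, ImportId:string, ImportTime:string)
[
    "A", "2022-05-11", "2022-05-11 13:00",
    "B", "2022-05-11", "2022-05-11 13:00",
    "A", "2022-05-11", "2022-05-11 17:00",
    "B", "2022-05-11", "2022-05-11 17:00",
    "A", "2022-05-14", "2022-05-17 08:00",
    "B", "2022-05-14", "2022-05-17 08:00",
    "A", "2022-05-14", "2022-05-17 10:00",
    "B", "2022-05-14", "2022-05-17 10:00",
    "A", "2022-05-11", "2022-05-18 15:00",
    "B", "2022-05-11", "2022-05-18 15:00"
];
T
// Replace both string columns with datetime values so that max() is accepted by top-nested.
| extend ImportId = todatetime(ImportId), ImportTime = todatetime(ImportTime)
| top-nested of Value by ignore = max(1),
  top-nested 1 of ImportId by max(ImportId),
  top-nested 1 of ImportTime by max(ImportTime)
| project Value, ImportId, ImportTime
```

With the cast in place, the query has the same shape as the one in the answer, so it should produce the same two rows for this sample.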

David דודו Markovitz Over a year ago

Yoni & Yifat describe two different logics. If you change all values of Value to A for ImportId = 2022-05-14, you'll see that you get different results. Please edit your post and clarify what you are asking.
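
The difference between the two logics can be sketched on the modified sample described in this comment, where every row with ImportId 2022-05-14 has Value A. This is an illustration under assumptions: the names GlobalNewest, PerValueNewest, and Logic are hypothetical, summarize arg_max() is swapped in for partition/top in the first logic, and the string-concatenation trick from Approach 3 stands in for top-nested in the second.

```kusto
// Modified sample: all rows for ImportId 2022-05-14 now belong to Value A.
let T = datatable(Value:string, ImportId:string, ImportTime:string)
[
    "A", "2022-05-11", "2022-05-11 13:00",
    "B", "2022-05-11", "2022-05-11 13:00",
    "A", "2022-05-11", "2022-05-11 17:00",
    "B", "2022-05-11", "2022-05-11 17:00",
    "A", "2022-05-14", "2022-05-17 08:00",
    "A", "2022-05-14", "2022-05-17 08:00",
    "A", "2022-05-14", "2022-05-17 10:00",
    "A", "2022-05-14", "2022-05-17 10:00",
    "A", "2022-05-11", "2022-05-18 15:00",
    "B", "2022-05-11", "2022-05-18 15:00"
];
// Logic 1: newest ImportId across the whole table, then latest ImportTime per Value.
let GlobalNewest =
    T
    | where ImportId == toscalar(T | summarize max(ImportId))
    | summarize arg_max(ImportTime, *) by Value;
// Logic 2: newest ImportId per Value, then latest ImportTime within it.
// Relies on fixed-width, zero-padded strings, as in Approach 3.
let PerValueNewest =
    T
    | extend Combined = strcat(ImportId, " ", ImportTime)
    | summarize arg_max(Combined, *) by Value
    | project-away Combined;
union
    (GlobalNewest | extend Logic = "global newest ImportId"),
    (PerValueNewest | extend Logic = "newest ImportId per Value")
| project Logic, Value, ImportId, ImportTime
```

Under the first logic only Value A should survive, because B has no rows for 2022-05-14. Under the second, B should also appear, with its own newest ImportId (2022-05-11) and that ImportId's latest ImportTime (2022-05-18 15:00).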

Ziad Hamod · Accepted Answer · 2022-05-22 10:22:20Z

You can try this approach as well using the unlimited partition operator:

datatable(Value:string, ImportId:datetime, ImportTime:datetime)
[
    "A",    datetime(2022-05-11),   datetime(2022-05-11 13:00),
    "B",    datetime(2022-05-11),   datetime(2022-05-11 13:00),
    "A",    datetime(2022-05-11),   datetime(2022-05-11 17:00),
    "B",    datetime(2022-05-11),   datetime(2022-05-11 17:00),
    "A",    datetime(2022-05-14),   datetime(2022-05-17 08:00),
    "B",    datetime(2022-05-14),   datetime(2022-05-17 08:00),
    "A",    datetime(2022-05-14),   datetime(2022-05-17 10:00),
    "B",    datetime(2022-05-14),   datetime(2022-05-17 10:00),
    "A",    datetime(2022-05-11),   datetime(2022-05-18 15:00),
    "B",    datetime(2022-05-11),   datetime(2022-05-18 15:00)
]
| partition hint.strategy = native by Value
(
    partition hint.strategy = native by ImportId
    (
        top 1 by ImportTime
    )
    | top 1 by ImportId
)
