
I have a list of items that contains duplicates, and I want to keep the last item of each group of duplicates. The native DistinctBy LINQ operator keeps the first item from each group, so it doesn't solve my problem. I tried to implement a custom DistinctLastBy operator like this:

public static IEnumerable<TSource> DistinctLastBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer = default)
{
    ArgumentNullException.ThrowIfNull(source);
    ArgumentNullException.ThrowIfNull(keySelector);
    return source
        .GroupBy(keySelector, comparer)
        .Select(g => g.Last());
}

...but it's not working correctly. It does return the last item of each group, but the items are not in the same order as in the source sequence. Here is a minimal demonstration of the desired behavior:

string[] source = ["giraffe", "Elephant", "Cat", "Eagle", "Gazelle", "Cow", "chicken"];
IEnumerable<string> distinct = source
    .DistinctLastBy(x => x.Substring(0, 1), StringComparer.OrdinalIgnoreCase);
Console.WriteLine(String.Join(", ", distinct));

The animals are considered duplicates if they start with the same letter. The desired output is:

Eagle, Gazelle, chicken

The actual output of my flawed implementation is:

Gazelle, Eagle, chicken

This output is incorrect because "Eagle" appears before "Gazelle" in the input, so it should also appear before "Gazelle" in the output.
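For reference, a minimal sketch of why the GroupBy-based attempt reorders the results: GroupBy emits its groups in the order of each key's first occurrence, so g.Last() picks the correct elements but yields them in first-occurrence order:

```csharp
using System;
using System.Linq;

string[] source = ["giraffe", "Elephant", "Cat", "Eagle", "Gazelle", "Cow", "chicken"];

// GroupBy emits groups in order of each key's FIRST occurrence:
// "g" (giraffe), "E" (Elephant), "C" (Cat).
var lasts = source
    .GroupBy(x => x.Substring(0, 1), StringComparer.OrdinalIgnoreCase)
    .Select(g => g.Last());

Console.WriteLine(string.Join(", ", lasts)); // Gazelle, Eagle, chicken
```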

Online demo.

How can I fix my DistinctLastBy operator so that it produces the correct output?

I searched for a duplicate and I didn't find any.


6 Answers


Store the last index seen for each key, then yield the stored elements ordered by that index:

public static IEnumerable<TSource> DistinctLastBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey>? comparer)
{
    ArgumentNullException.ThrowIfNull(source);
    ArgumentNullException.ThrowIfNull(keySelector);

    var set = new Dictionary<TKey, (TSource element, int lastIndex)>(comparer);
    var lastIndex = 0;

    foreach (TSource element in source)
    {
        TKey key = keySelector(element);
        set[key] = (element, lastIndex++);
    }

    foreach (var el in set.Values.OrderBy(x => x.lastIndex).Select(x => x.element))
    {
        yield return el;
    }
}

7 Comments

Thanks Martheen! Your answer solves my problem.
In retrospect this solution has the drawback of enumerating the source eagerly, during the construction of the query, not lazily during the execution of the query. This can be a problem if the source is not ready to be enumerated yet.
@TheodorZoulias I added another approach adapted from github.com/dotnet/runtime/blob/…
Martheen unfortunately your second implementation has the same issue with the first. The eager enumeration of the source is still here. It is enumerated in the line var distinct = source.DistinctLastBy(..., not in the line Console.WriteLine(String.Join(", ", distinct));.
@TheodorZoulias Yeah. I'd say his approach is more elegant too, without wasting a hashtable.

Based on the task description you can just do a double Reverse:

public static IEnumerable<TSource> DistinctLastBy<TSource, TKey>(
  this IEnumerable<TSource> source,
  Func<TSource, TKey> keySelector,
  IEqualityComparer<TKey> comparer = default)
{
  return source
    .Reverse()
    .DistinctBy(keySelector, comparer)
    .Reverse();
}
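Inlining the two Reverse calls on the question's sample input (nothing new assumed, just the built-in DistinctBy from .NET 6+) shows the desired ordering:

```csharp
using System;
using System.Linq;

string[] source = ["giraffe", "Elephant", "Cat", "Eagle", "Gazelle", "Cow", "chicken"];

var distinct = source
    .Reverse()                // walk from the end, so the last duplicate wins
    .DistinctBy(x => x.Substring(0, 1), StringComparer.OrdinalIgnoreCase)
    .Reverse();               // restore the original relative order

Console.WriteLine(string.Join(", ", distinct)); // Eagle, Gazelle, chicken
```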

And version via GroupBy just for funsies:

public static IEnumerable<TSource> DistinctLastBy<TSource, TKey>(
  this IEnumerable<TSource> source,
  Func<TSource, TKey> keySelector,
  IEqualityComparer<TKey> comparer = default)
{
  ArgumentNullException.ThrowIfNull(source);
  ArgumentNullException.ThrowIfNull(keySelector);

  return source
    .Select((el, i) => (el, i)) // merge with original index
    .GroupBy(t => keySelector(t.el), comparer) // group
    .Select(g => g.Last()) // select the last value with the last index
    .OrderBy(t => t.i) // order
    .Select(t => t.el); // get the result
}

Demo fiddle - https://dotnetfiddle.net/6TVPC1

Update

There is one problem with using DistinctBy in the double-Reverse approach: according to the docs, it is not guaranteed to return correctly ordered values (the same applies to Distinct):

The DistinctBy<TSource,TKey>(IEnumerable, ..) method returns an unordered sequence that contains no duplicate values

Though the current implementation actually does return data in a "stable" manner (i.e. the first encountered element is returned). I doubt that this will change, but as usual it's better to be safe than sorry, so you may want to guard against it in advance; simply copying the current implementation does the trick:

public static class Exts
{
    extension<TSource>(IEnumerable<TSource> source) 
    {
        public IEnumerable<TSource> DistinctLastBy<TKey>(Func<TSource, TKey> keySelector, IEqualityComparer<TKey>? comparer = default)
        {
            return source
                .Reverse()
                .DistinctByStable(keySelector, comparer)
                .Reverse();
        }

        private IEnumerable<TSource> DistinctByStable<TKey>(Func<TSource, TKey> keySelector, IEqualityComparer<TKey>? comparer)
        {
            using IEnumerator<TSource> enumerator = source.GetEnumerator();
            if (enumerator.MoveNext())
            {
                var set = new HashSet<TKey>(comparer); // TODO - play with size?
                do
                {
                    TSource element = enumerator.Current;
                    if (set.Add(keySelector(element)))
                    {
                        yield return element;
                    }
                }
                while (enumerator.MoveNext());
            }
        }
    }   
}

Updated fiddle - https://dotnetfiddle.net/Ze2tva

6 Comments

Thanks Guru Stron for the answer. Both solutions in your answer solve my problem correctly. They are also, respectively, the slowest and the fastest solutions that have been posted so far. The .Reverse().DistinctBy(...).Reverse() is by far the fastest implementation. If you reverse the order of the two solutions and put the second first, I'll change my 'Accept' vote and I'll accept this answer.
Yes, sure. Updated.
@TheodorZoulias Bear in mind that if source isn't a collection then Reverse needs to cache the whole thing first, then Distinct caches again. The final reverse should be without an extra cache though.
@Charlieface I am testing for performance with a non-collection source. Basically I am hiding the type of my source array using the Hide LINQ operator.
Please check out the update - found interesting caveat in the docs)
I hate it when Microsoft opts for behavioral vagueness, for the sake of potential future optimizations, that rarely occur in practice.
  1. Store the grouped last elements into a hashset.
  2. Filter the source with the hashset to maintain the original sequence.

public static IEnumerable<TSource> DistinctLastBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer = default)
{
    ArgumentNullException.ThrowIfNull(source);
    ArgumentNullException.ThrowIfNull(keySelector);

    var hashset = source.GroupBy(keySelector, comparer)
        .Select(g => g.Last())
        .ToHashSet();

    return source.Where(x => hashset.Contains(x));
}

Demo @ .NET Fiddle

1 Comment

Thanks Yong Shun for the answer. Your answer solves my problem as well, but it has the drawback of enumerating the source more than once. It is also considerably slower than Martheen's answer (on my PC it's about 2.5 times slower).

Here is a DistinctLastBy version that implements the same strategy as Guru Stron's .Reverse().DistinctBy().Reverse() solution, but more explicitly:

public static IEnumerable<TSource> DistinctLastBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer = default)
{
    ArgumentNullException.ThrowIfNull(source);
    ArgumentNullException.ThrowIfNull(keySelector);
    TSource[] array = source.ToArray();
    if (array.Length == 0) yield break;
    List<int> indices = new();
    {
        // Local scope to shorten the lifespan of the HashSet.
        HashSet<TKey> set = new(comparer);
        for (int i = array.Length - 1; i >= 0; i--)
            if (set.Add(keySelector(array[i])))
                indices.Add(i);
    }
    for (int i = indices.Count - 1; i >= 0; i--)
        yield return array[indices[i]];
}

Online demo.

I expected to see some performance improvement over using the stock .NET LINQ methods, but I didn't see any. It performs more or less the same.

Edit: Preallocating a bool[] selected instead of an expandable List<int> indices is slightly faster, presumably because it avoids growing the backing array of the list. A BitArray selected can also be used instead, for increased memory efficiency at the cost of slightly worse performance.

public static IEnumerable<TSource> DistinctLastBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer = default)
{
    ArgumentNullException.ThrowIfNull(source);
    ArgumentNullException.ThrowIfNull(keySelector);
    TSource[] array = source.ToArray();
    if (array.Length == 0) yield break;
    bool[] selected = new bool[array.Length];
    //BitArray selected = new(array.Length);
    {
        // Local scope to shorten the lifespan of the HashSet.
        HashSet<TKey> set = new(comparer);
        for (int i = array.Length - 1; i >= 0; i--)
            if (set.Add(keySelector(array[i])))
                selected[i] = true;
    }
    for (int i = 0; i < array.Length; i++)
        if (selected[i]) yield return array[i];
}

Comments


I suggest implementing an overload of the current DistinctBy implementation (let's solve the problem in the general case).

Let's add a Func&lt;TItem, TItem, TItem&gt;? resolver which takes the original and the current duplicate and returns the value we want to keep:

(original, current) => original; // Business as usual
(original, current) => current;  // Take last (our case)

The implementation can be

public static IEnumerable<TItem> DistinctBy<TItem, TSelector>(
  this IEnumerable<TItem> source, 
  Func<TItem, TSelector> selector,
  IEqualityComparer<TSelector>? comparer = default, 
  Func<TItem, TItem, TItem>? resolver = default) 
{
  comparer ??= EqualityComparer<TSelector>.Default;
  resolver ??= (TItem original, TItem current) => original;

  var distinct = new Dictionary<TSelector, int>(comparer);

  List<TItem> result = new();

  foreach (var item in source) {
    var key = selector(item);

    if (distinct.TryGetValue(key, out var currentIndex))
      result[currentIndex] = resolver(result[currentIndex], item);
    else {
      distinct.Add(key, result.Count);
      result.Add(item);
    } 
  }

  return result;
}

Then we can use it as follow:

string[] source = [
  "giraffe", "Elephant", "Cat", "Eagle", "Gazelle", "Cow", "chicken"
];

var distinct = source.DistinctBy(
  x => x.Substring(0, 1), 
  StringComparer.OrdinalIgnoreCase,
  (original, current) => current); // <- Extra resolver

Console.WriteLine(String.Join(", ", distinct));

Fiddle

1 Comment

Thanks Dmitrii for the answer. I don't think that there is much generality to add to this problem. It's either the first or the last, and the first is already implemented natively by DistinctBy. So we just need a variant that returns the last. Besides that, your solution has the same drawback as Martheen's solution: it enumerates the source eagerly instead of lazily. LINQ operators are supposed to be lazy.

A more efficient solution might be to use an OrderedDictionary (available in .NET 9+). This means you don't need to sort at the end. All we need to do is check whether the key already exists; if so, remove it and re-add it.

public static IEnumerable<TSource> DistinctLastBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector, IEqualityComparer<TKey>? comparer)
{
    ArgumentNullException.ThrowIfNull(source);
    ArgumentNullException.ThrowIfNull(keySelector);
    var dict = new OrderedDictionary<TKey, TSource>(comparer);
    foreach (TSource element in source)
    {
        var key = keySelector(element);
        var index = dict.IndexOf(key);
        if (index >= 0)
            dict.RemoveAt(index);

        dict.Add(key, element);
    }

    foreach (var item in dict.Values)
        yield return item;    // need to yield to get lazy enumeration
}

If it's likely that very few duplicates are found then pre-sizing the OrderedDictionary might be wise.
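As a sketch of that pre-sizing (the DistinctLastByPresized name is mine, not from the answer), one option is TryGetNonEnumeratedCount (.NET 6+) combined with the capacity-taking OrderedDictionary constructor from .NET 9; when the count isn't cheaply available, this falls back to the default capacity:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PresizedExtensions
{
    // Hypothetical pre-sized variant of the answer's DistinctLastBy.
    public static IEnumerable<TSource> DistinctLastByPresized<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        IEqualityComparer<TKey>? comparer)
        where TKey : notnull
    {
        ArgumentNullException.ThrowIfNull(source);
        ArgumentNullException.ThrowIfNull(keySelector);

        // Pre-size only when the count is known without enumerating the source.
        int capacity = source.TryGetNonEnumeratedCount(out int count) ? count : 0;
        var dict = new OrderedDictionary<TKey, TSource>(capacity, comparer);

        foreach (TSource element in source)
        {
            TKey key = keySelector(element);
            int index = dict.IndexOf(key);
            if (index >= 0)
                dict.RemoveAt(index);   // drop the earlier duplicate...
            dict.Add(key, element);     // ...and append the later one at the end
        }

        foreach (var item in dict.Values)
            yield return item;
    }
}
```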

7 Comments

Thanks Charlieface for the answer. It solves my problem correctly but it's veeeeery slow. It seems that there is some O(N^2) operation inside your implementation. It also has the same issue as Martheen's and Dmitrii's answers: the source is enumerated eagerly during the construction of the query, instead of lazily during the execution of the query.
What percentage of values are duplicates?
I am testing for performance with half of the values being duplicates. This is an arbitrary percentage. In my actual program where I intend to use the DistinctLastBy, duplicates are expected to be very rare.
Because I suspect (without profiling) that it's the RemoveAt which is slow, which only happens with duplicates.
From the docs: "Operations on the collection have algorithmic complexities that are similar to that of the List<T> class, except with lookups by key similar in complexity to that of Dictionary<TKey,TValue>."
