
I have a list of items that contains duplicates, and I want to keep the last item of each group of duplicates. The native DistinctBy LINQ operator keeps the first item from each group, so it doesn't solve my problem. I tried to implement a custom DistinctLastBy operator like this:

public static IEnumerable<TSource> DistinctLastBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer = default)
{
    ArgumentNullException.ThrowIfNull(source);
    ArgumentNullException.ThrowIfNull(keySelector);
    return source
        .GroupBy(keySelector, comparer)
        .Select(g => g.Last());
}

...but it's not working correctly. It does return the last item of each group, but the items are not in the same order as in the source sequence. Here is a minimal demonstration of the desired behavior:

string[] source = ["giraffe", "Elephant", "Cat", "Eagle", "Gazelle", "Cow", "chicken"];
IEnumerable<string> distinct = source
    .DistinctLastBy(x => x.Substring(0, 1), StringComparer.OrdinalIgnoreCase);
Console.WriteLine(String.Join(", ", distinct));

The animals are considered duplicates if they start with the same letter. The desired output is:

Eagle, Gazelle, chicken

The actual output of my flawed implementation is:

Gazelle, Eagle, chicken

This output is incorrect because "Eagle" appears before "Gazelle" in the input, so it should also appear before "Gazelle" in the output.
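For reference, a minimal sketch of why the GroupBy-based attempt reorders the results: GroupBy emits its groups in the order of each key's first occurrence, so g.Last() picks the correct elements but yields them in first-occurrence order:

```csharp
using System;
using System.Linq;

string[] source = ["giraffe", "Elephant", "Cat", "Eagle", "Gazelle", "Cow", "chicken"];

// GroupBy emits groups in order of each key's FIRST occurrence:
// "g" (giraffe), "E" (Elephant), "C" (Cat).
var lasts = source
    .GroupBy(x => x.Substring(0, 1), StringComparer.OrdinalIgnoreCase)
    .Select(g => g.Last());

Console.WriteLine(string.Join(", ", lasts)); // Gazelle, Eagle, chicken
```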

Online demo.

How can I fix my DistinctLastBy operator so that it produces the correct output?

I searched for a duplicate and I didn't find any.


6 Answers


Store the last index seen for each key, then yield the stored elements ordered by that index:

public static IEnumerable<TSource> DistinctLastBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey>? comparer)
{
    ArgumentNullException.ThrowIfNull(source);
    ArgumentNullException.ThrowIfNull(keySelector);

    var set = new Dictionary<TKey, (TSource element, int lastIndex)>(comparer);
    var lastIndex = 0;

    foreach (TSource element in source)
    {
        TKey key = keySelector(element);
        set[key] = (element, lastIndex++);
    }

    foreach (var el in set.Values.OrderBy(x => x.lastIndex).Select(x => x.element))
    {
        yield return el;
    }
}

7 Comments

Thanks Martheen! Your answer solves my problem.
In retrospect this solution has the drawback of enumerating the source eagerly, during the construction of the query, not lazily during the execution of the query. This can be a problem if the source is not ready to be enumerated yet.
@TheodorZoulias I added another approach adapted from github.com/dotnet/runtime/blob/…
Martheen unfortunately your second implementation has the same issue with the first. The eager enumeration of the source is still here. It is enumerated in the line var distinct = source.DistinctLastBy(..., not in the line Console.WriteLine(String.Join(", ", distinct));.
@TheodorZoulias Yeah. I'd say his approach is more elegant too, without wasting a hashtable.

Based on the task description you can just do a double Reverse:

public static IEnumerable<TSource> DistinctLastBy<TSource, TKey>(
  this IEnumerable<TSource> source,
  Func<TSource, TKey> keySelector,
  IEqualityComparer<TKey> comparer = default)
{
  return source
    .Reverse()
    .DistinctBy(keySelector, comparer)
    .Reverse();
}
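Inlining the two Reverse calls on the question's sample input (nothing new assumed, just the built-in DistinctBy from .NET 6+) shows the desired ordering:

```csharp
using System;
using System.Linq;

string[] source = ["giraffe", "Elephant", "Cat", "Eagle", "Gazelle", "Cow", "chicken"];

var distinct = source
    .Reverse()                // walk from the end, so the last duplicate wins
    .DistinctBy(x => x.Substring(0, 1), StringComparer.OrdinalIgnoreCase)
    .Reverse();               // restore the original relative order

Console.WriteLine(string.Join(", ", distinct)); // Eagle, Gazelle, chicken
```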

And version via GroupBy just for funsies:

public static IEnumerable<TSource> DistinctLastBy<TSource, TKey>(
  this IEnumerable<TSource> source,
  Func<TSource, TKey> keySelector,
  IEqualityComparer<TKey> comparer = default)
{
  ArgumentNullException.ThrowIfNull(source);
  ArgumentNullException.ThrowIfNull(keySelector);

  return source
    .Select((el, i) => (el, i)) // merge with original index
    .GroupBy(t => keySelector(t.el), comparer) // group
    .Select(g => g.Last()) // select the last value with the last index
    .OrderBy(t => t.i) // order
    .Select(t => t.el); // get the result
}

Demo fiddle - https://dotnetfiddle.net/6TVPC1

Update

There is one problem with using DistinctBy in the double-Reverse approach: according to the docs, it is not guaranteed to return correctly ordered values (the same applies to Distinct):

The DistinctBy<TSource,TKey>(IEnumerable, ..) method returns an unordered sequence that contains no duplicate values

Though the current implementation actually does return data in a "stable" manner (i.e. the first encountered element is returned). I doubt that this will change, but as usual it's better to be safe than sorry, so you may want to guard against it in advance; simply copying the current implementation does the trick:

public static class Exts
{
    extension<TSource>(IEnumerable<TSource> source) 
    {
        public IEnumerable<TSource> DistinctLastBy<TKey>(Func<TSource, TKey> keySelector, IEqualityComparer<TKey>? comparer = default)
        {
            return source
                .Reverse()
                .DistinctByStable(keySelector, comparer)
                .Reverse();
        }

        private IEnumerable<TSource> DistinctByStable<TKey>(Func<TSource, TKey> keySelector, IEqualityComparer<TKey>? comparer)
        {
            using IEnumerator<TSource> enumerator = source.GetEnumerator();
            if (enumerator.MoveNext())
            {
                var set = new HashSet<TKey>(comparer); // TODO - play with size?
                do
                {
                    TSource element = enumerator.Current;
                    if (set.Add(keySelector(element)))
                    {
                        yield return element;
                    }
                }
                while (enumerator.MoveNext());
            }
        }
    }   
}

Updated fiddle - https://dotnetfiddle.net/Ze2tva

6 Comments

Thanks Guru Stron for the answer. Both solutions in your answer solve my problem correctly. They are also, respectively, the slowest and the fastest solutions that have been posted so far. The .Reverse().DistinctBy(...).Reverse() is by far the fastest implementation. If you reverse the order of the two solutions and put the second first, I'll change my 'Accept' vote and I'll accept this answer.
Yes, sure. Updated.
@TheodorZoulias Bear in mind that if source isn't a collection then Reverse needs to cache the whole thing first, then Distinct caches again. The final reverse should be without an extra cache though.
@Charlieface I am testing for performance with a non-collection source. Basically I am hiding the type of my source array using the Hide LINQ operator.
Please check out the update - found interesting caveat in the docs)
I hate it when Microsoft opts for behavioral vagueness, for the sake of potential future optimizations, that rarely occur in practice.
  1. Store the grouped last elements into a hashset.
  2. Filter the source with the hashset to maintain the original sequence.

public static IEnumerable<TSource> DistinctLastBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer = default)
{
    ArgumentNullException.ThrowIfNull(source);
    ArgumentNullException.ThrowIfNull(keySelector);

    var hashset = source.GroupBy(keySelector, comparer)
        .Select(g => g.Last())
        .ToHashSet();

    return source.Where(x => hashset.Contains(x));
}

Demo @ .NET Fiddle

1 Comment

Thanks Yong Shun for the answer. Your answer solves my problem as well, but it has the drawback of enumerating the source more than once. It is also considerably slower than Martheen's answer (on my PC it's about 2.5 times slower).

Here is a DistinctLastBy version that implements the same strategy as Guru Stron's .Reverse().DistinctBy().Reverse() solution, but more explicitly:

public static IEnumerable<TSource> DistinctLastBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer = default)
{
    ArgumentNullException.ThrowIfNull(source);
    ArgumentNullException.ThrowIfNull(keySelector);
    TSource[] array = source.ToArray();
    if (array.Length == 0) yield break;
    List<int> indices = new();
    {
        // Local scope to shorten the lifespan of the HashSet.
        HashSet<TKey> set = new(comparer);
        for (int i = array.Length - 1; i >= 0; i--)
            if (set.Add(keySelector(array[i])))
                indices.Add(i);
    }
    for (int i = indices.Count - 1; i >= 0; i--)
        yield return array[indices[i]];
}

Online demo.

I expected to see some performance improvement over using the stock .NET LINQ methods, but I didn't see any. It performs more or less the same.

Edit: Preallocating a bool[] selected instead of an expandable List<int> indices is slightly faster, presumably because it avoids growing the backing array of the list. A BitArray selected can also be used instead, for increased memory efficiency at the cost of slightly worse performance.

public static IEnumerable<TSource> DistinctLastBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer = default)
{
    ArgumentNullException.ThrowIfNull(source);
    ArgumentNullException.ThrowIfNull(keySelector);
    TSource[] array = source.ToArray();
    if (array.Length == 0) yield break;
    bool[] selected = new bool[array.Length];
    //BitArray selected = new(array.Length);
    {
        // Local scope to shorten the lifespan of the HashSet.
        HashSet<TKey> set = new(comparer);
        for (int i = array.Length - 1; i >= 0; i--)
            if (set.Add(keySelector(array[i])))
                selected[i] = true;
    }
    for (int i = 0; i < array.Length; i++)
        if (selected[i]) yield return array[i];
}

Comments


I suggest implementing an overload of the current DistinctBy implementation (let's solve the problem in the general case).

Let's add a Func&lt;TItem, TItem, TItem&gt;? resolver which takes the original and the current duplicate and returns the value we want to keep:

(original, current) => original; // Business as usual
(original, current) => current;  // Take last (our case)

The implementation can be

public static IEnumerable<TItem> DistinctBy<TItem, TSelector>(
  this IEnumerable<TItem> source, 
  Func<TItem, TSelector> selector,
  IEqualityComparer<TSelector>? comparer = default, 
  Func<TItem, TItem, TItem>? resolver = default) 
{
  comparer ??= EqualityComparer<TSelector>.Default;
  resolver ??= (TItem original, TItem current) => original;

  var distinct = new Dictionary<TSelector, int>(comparer);

  List<TItem> result = new();

  foreach (var item in source) {
    var key = selector(item);

    if (distinct.TryGetValue(key, out var currentIndex))
      result[currentIndex] = resolver(result[currentIndex], item);
    else {
      distinct.Add(key, result.Count);
      result.Add(item);
    } 
  }

  return result;
}

Then we can use it as follow:

string[] source = [
  "giraffe", "Elephant", "Cat", "Eagle", "Gazelle", "Cow", "chicken"
];

var distinct = source.DistinctBy(
  x => x.Substring(0, 1), 
  StringComparer.OrdinalIgnoreCase,
  (original, current) => current); // <- Extra resolver

Console.WriteLine(String.Join(", ", distinct));

Fiddle

1 Comment

Thanks Dmitrii for the answer. I don't think that there is much generality to add to this problem. It's either the first or the last, and the first is already implemented natively by DistinctBy. So we just need a variant that returns the last. Besides that, your solution has the same drawback as Martheen's solution: it enumerates the source eagerly instead of lazily. LINQ operators are supposed to be lazy.

A more efficient solution might be to use an OrderedDictionary (available in .NET 9+). This means you don't need to sort at the end. All we need to do is check whether the key already exists; if so, remove it and re-add it.

public static IEnumerable<TSource> DistinctLastBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector, IEqualityComparer<TKey>? comparer)
{
    ArgumentNullException.ThrowIfNull(source);
    ArgumentNullException.ThrowIfNull(keySelector);
    var dict = new OrderedDictionary<TKey, TSource>(comparer);
    foreach (TSource element in source)
    {
        var key = keySelector(element);
        var index = dict.IndexOf(key);
        if (index >= 0)
            dict.RemoveAt(index);

        dict.Add(key, element);
    }

    foreach (var item in dict.Values)
        yield return item;    // need to yield to get lazy enumeration
}

If it's likely that very few duplicates are found then pre-sizing the OrderedDictionary might be wise.
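As a sketch of that pre-sizing (the DistinctLastByPresized name is mine, not from the answer), one option is TryGetNonEnumeratedCount (.NET 6+) combined with the capacity-taking OrderedDictionary constructor from .NET 9; when the count isn't cheaply available, this falls back to the default capacity:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PresizedExtensions
{
    // Hypothetical pre-sized variant of the answer's DistinctLastBy.
    public static IEnumerable<TSource> DistinctLastByPresized<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        IEqualityComparer<TKey>? comparer)
        where TKey : notnull
    {
        ArgumentNullException.ThrowIfNull(source);
        ArgumentNullException.ThrowIfNull(keySelector);

        // Pre-size only when the count is known without enumerating the source.
        int capacity = source.TryGetNonEnumeratedCount(out int count) ? count : 0;
        var dict = new OrderedDictionary<TKey, TSource>(capacity, comparer);

        foreach (TSource element in source)
        {
            TKey key = keySelector(element);
            int index = dict.IndexOf(key);
            if (index >= 0)
                dict.RemoveAt(index);   // drop the earlier duplicate...
            dict.Add(key, element);     // ...and append the later one at the end
        }

        foreach (var item in dict.Values)
            yield return item;
    }
}
```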

7 Comments

Thanks Charlieface for the answer. It solves my problem correctly but it's veeeeery slow. It seems that there is some O(N^2) operation inside your implementation. It also has the same issue as Martheen's and Dmitrii's answers: the source is enumerated eagerly during the construction of the query, instead of lazily during the execution of the query.
What percentage of values are duplicates?
I am testing for performance with half of the values being duplicates. This is an arbitrary percentage. In my actual program where I intend to use the DistinctLastBy, duplicates are expected to be very rare.
Because I suspect (without profiling) that it's the RemoveAt which is slow, which only happens with duplicates.
From the docs: "Operations on the collection have algorithmic complexities that are similar to that of the List<T> class, except with lookups by key similar in complexity to that of Dictionary<TKey,TValue>."
