22

FINAL EDIT:

I've chosen Timothy's answer, but if you want a cuter implementation that leverages the C# yield statement, check Eamon's answer: https://stackoverflow.com/a/19825659/145757


By default LINQ queries are lazily streamed.

ToArray/ToList give full buffering, but they're eager, and with an infinite sequence they never complete.

Is there any way to combine both behaviors: streaming and buffering values on the fly as they are generated, so that subsequent enumerations don't re-trigger the generation of elements that have already been produced?

Here is a basic use-case:

static IEnumerable<int> Numbers
{
    get
    {
        int i = -1;

        while (true)
        {
            Console.WriteLine("Generating {0}.", i + 1);
            yield return ++i;
        }
    }
}

static void Main(string[] args)
{
    IEnumerable<int> evenNumbers = Numbers.Where(i => i % 2 == 0);

    foreach (int n in evenNumbers)
    {
        Console.WriteLine("Reading {0}.", n);
        if (n == 10) break;
    }

    Console.WriteLine("==========");

    foreach (int n in evenNumbers)
    {
        Console.WriteLine("Reading {0}.", n);
        if (n == 10) break;
    }
}

Here is the output:

Generating 0.
Reading 0.
Generating 1.
Generating 2.
Reading 2.
Generating 3.
Generating 4.
Reading 4.
Generating 5.
Generating 6.
Reading 6.
Generating 7.
Generating 8.
Reading 8.
Generating 9.
Generating 10.
Reading 10.
==========
Generating 0.
Reading 0.
Generating 1.
Generating 2.
Reading 2.
Generating 3.
Generating 4.
Reading 4.
Generating 5.
Generating 6.
Reading 6.
Generating 7.
Generating 8.
Reading 8.
Generating 9.
Generating 10.
Reading 10.

The generation code is triggered 22 times.

I'd like it to be triggered 11 times, the first time the enumerable is iterated.

Then the second iteration would benefit from the already generated values.

It would be something like:

IEnumerable<int> evenNumbers = Numbers.Where(i => i % 2 == 0).Buffer();

For those familiar with Rx it's a behavior similar to a ReplaySubject.
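To make the Rx comparison concrete, here is a minimal sketch of that replay behavior (assuming the System.Reactive NuGet package is referenced; `Replay()` routes the source through a `ReplaySubject`):

```csharp
using System;
using System.Reactive.Linq;

class ReplayDemo
{
    static void Main()
    {
        // Replay() wraps the source in a ReplaySubject: the side-effecting
        // generation below runs once, and later subscribers get cached values.
        var evenNumbers = Observable.Range(0, 11)
            .Do(i => Console.WriteLine("Generating {0}.", i))
            .Where(i => i % 2 == 0)
            .Replay();

        evenNumbers.Connect(); // generation happens here, once

        evenNumbers.Subscribe(n => Console.WriteLine("Reading {0}.", n));
        evenNumbers.Subscribe(n => Console.WriteLine("Reading {0}.", n)); // replayed; no new "Generating" lines
    }
}
```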

7 Comments

  • It's not really LINQ that needs caching but the IEnumerable, and there are a few examples of that already on the internet. Commented Nov 6, 2013 at 23:29
  • This was on reddit yesterday (here) with this exact scenario. I'd rather not steal that author's solution. Commented Nov 6, 2013 at 23:30
  • @ScottChamberlain: thanks for the link, Google was not my friend on this one. Commented Nov 6, 2013 at 23:38
  • @AustinSalonen: crazy coincidence and thanks for the link. :) Commented Nov 6, 2013 at 23:38
  • The general term for this is "memoization". Note that many of the implementations here handle some of the simple cases, but don't handle multiple enumerators enumerating the result before one has finished completely, don't handle parallelized enumeration of different enumerators, don't dispose of the underlying enumerable if the whole sequence isn't iterated, etc. To handle these more complex issues you're best off using an existing library implementation. Commented Nov 8, 2013 at 15:49

8 Answers

16

IEnumerable<T>.Buffer() extension method

public static class EnumerableExtensions
{
    public static BufferEnumerable<T> Buffer<T>(this IEnumerable<T> source)
    {
        return new BufferEnumerable<T>(source);
    }
}

public class BufferEnumerable<T> : IEnumerable<T>, IDisposable
{
    IEnumerator<T> source;
    List<T> buffer;
    public BufferEnumerable(IEnumerable<T> source)
    {
        this.source = source.GetEnumerator();
        this.buffer = new List<T>();
    }
    public IEnumerator<T> GetEnumerator()
    {
        return new BufferEnumerator<T>(source, buffer);
    }
    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
    public void Dispose()
    {
        source.Dispose();
    }
}

public class BufferEnumerator<T> : IEnumerator<T>
{
    IEnumerator<T> source;
    List<T> buffer;
    int i = -1;
    public BufferEnumerator(IEnumerator<T> source, List<T> buffer)
    {
        this.source = source;
        this.buffer = buffer;
    }
    public T Current
    {
        get { return buffer[i]; }
    }
    object System.Collections.IEnumerator.Current
    {
        get { return Current; }
    }
    public bool MoveNext()
    {
        i++;
        if (i < buffer.Count)
            return true;
        if (!source.MoveNext())
            return false;
        buffer.Add(source.Current);
        return true;
    }
    public void Reset()
    {
        i = -1;
    }
    public void Dispose()
    {
    }
}

Usage

using (var evenNumbers = Numbers.Where(i => i % 2 == 0).Buffer())
{
    ...
}

Comments

The key point here is that the IEnumerable<T> source given as input to the Buffer method only has GetEnumerator called once, regardless of how many times the result of Buffer is enumerated. All enumerators for the result of Buffer share the same source enumerator and internal list.


17 Comments

It immediately evaluates Numbers completely, even before evenNumbers is ever used
Well Timothy, as I said, on an infinite sequence ToList takes quite a long time. ;)
@sinelaw: as you say "completely", even if there is no completion ;)
@Pragmateek I missed that point. I figured out what you want and have updated the answer.
@TimothyShields: thanks for your implementation. I really hoped there was a standard way of doing this but nothing is perfect. You get this one. :)
8

You can use the Microsoft.FSharp.Collections.LazyList<> type from the F# power pack (yep, from C# without F# installed - no problem!) for this. It's in Nuget package FSPowerPack.Core.Community.

In particular, you want to call LazyListModule.ofSeq(...) which returns a LazyList<T> that implements IEnumerable<T> and is lazy and cached.

In your case, usage is just a matter of...

var evenNumbers = LazyListModule.ofSeq(Numbers.Where(i => i % 2 == 0));
var cachedEvenNumbers = LazyListModule.ofSeq(evenNumbers);

Though I personally prefer var in all such cases, note that this does mean the compile-time type will be more specific than just IEnumerable<> - not that this is likely to ever be a downside. Another advantage of the F# non-interface types is that they expose some efficient operations you can't do efficiently with plain IEnumerables, such as LazyListModule.skip.

I'm not sure whether LazyList is thread-safe, but I suspect it is.


Another alternative pointed out in the comments below (if you have F# installed) is SeqModule.Cache (namespace Microsoft.FSharp.Collections, it'll be in GACed assembly FSharp.Core.dll) which has the same effective behavior. Like other .NET enumerables, Seq.cache doesn't have a tail (or skip) operator you can efficiently chain.

Thread-safe: unlike other solutions to this question Seq.cache is thread-safe in the sense that you can have multiple enumerators running in parallel (each enumerator is not thread safe).

Performance: I did a quick benchmark, and the LazyList enumerable has at least 4 times more overhead than the SeqModule.Cache variant, which in turn has at least three times more overhead than the custom-implementation answers. So, while the F# variants work, they're not quite as fast. Note that 3-12 times slower still isn't very slow compared to an enumerable that does (say) I/O or any non-trivial computation, so this probably won't matter most of the time, but it's good to keep in mind.

TL;DR If you need an efficient, thread-safe cached enumerable, just use SeqModule.Cache.
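For completeness, here is a sketch of calling Seq.cache from C# (this assumes a reference to FSharp.Core; SeqModule.Cache is the compiled name of Seq.cache):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.FSharp.Collections;

class SeqCacheDemo
{
    static int generated = 0;

    // Infinite generator that counts how many elements it has produced.
    static IEnumerable<int> Numbers()
    {
        int i = -1;
        while (true) { generated++; yield return ++i; }
    }

    static void Main()
    {
        IEnumerable<int> evenNumbers =
            SeqModule.Cache(Numbers().Where(i => i % 2 == 0));

        Console.WriteLine(string.Join(",", evenNumbers.Take(3))); // pulls from the generator
        Console.WriteLine(generated);
        Console.WriteLine(string.Join(",", evenNumbers.Take(3))); // served from the cache
        Console.WriteLine(generated); // unchanged: the generator was not re-run
    }
}
```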

2 Comments

Thanks Eamon, F# is full of surprise. :) +1
@Pragmateek Yeah - this is just Seq.cache in F#
7

Building upon Eamon's answer above, here's another functional solution (no new types) that also works with simultaneous evaluation. This demonstrates that a general pattern (iteration with shared state) underlies this problem.

First we define a very general helper method, meant to allow us to simulate the missing feature of anonymous iterators in C#:

public static IEnumerable<T> Generate<T>(Func<Func<Tuple<T>>> generator)
{
    var tryGetNext = generator();
    while (true)
    {
        var result = tryGetNext();
        if (null == result)
        {
            yield break;
        }
        yield return result.Item1;
    }
}

Generate is like an aggregator with state. It accepts a factory function that sets up per-enumeration state and returns a tryGetNext function: the body that would have been an anonymous iterator with yield return in it, if C# allowed that. The state set up by the factory is per-enumeration, while more global state (shared between all enumerations) can be maintained by the caller of Generate, e.g. in closure variables, as we'll show below.

Now we can use this for the "buffered Enumerable" problem:

public static IEnumerable<T> Cached<T>(IEnumerable<T> enumerable)
{
    var cache = new List<T>();
    var enumerator = enumerable.GetEnumerator();

    return Generate<T>(() =>
    {
        int pos = -1;
        return () => {
            pos += 1;
            if (pos < cache.Count)
            {
                return new Tuple<T>(cache[pos]);
            }
            if (enumerator.MoveNext())
            {
                cache.Add(enumerator.Current);
                return new Tuple<T>(enumerator.Current);
            }
            return null;
        };
    });
}

16 Comments

Thanks for this one sinelaw. :) +1
The use of Tuple<T> as an optional T is actually something I had never thought of before. A great trick for sure. +1
@TimothyShields Hmm, I don't think that's such a good trick - it's somewhat misleading. If you want an optional value, why not make the (trivial) class OptionalValue or OptionalReference - well-chosen names help code maintainability.
@sinelaw: I like the idea, but you're being unnecessarily creative with your parameter passing: you can avoid the "reference to int via array" trick using a closure (i.e. the Generate parameter could be Func<Func<Tuple<T>>> then); and you might want to name the concept of the generator state (i.e. the Generate parameter could be Func<Func<ValueOrEnd>>).
Nice answer, thanks. I started to use this code as a jumping off point, and was writing some tests for it. My test exposed the fact that 'MoveNext' is called on the original enumerator once for each re-use of the buffered results (when the 'end' is reached). This will almost never be a problem as you'd imagine most implementations of IEnumerator will have some state and know they are finished, but I'm not sure if that's guaranteed. If the intention is to replay exactly what happened the first time then there should arguably be another state variable in the closure, e.g. bool completed
7

I hope this answer combines the brevity and clarity of sinelaw's answer and the support for multiple enumerations of Timothy's answer:

public static IEnumerable<T> Cached<T>(this IEnumerable<T> enumerable) {
    return CachedImpl(enumerable.GetEnumerator(), new List<T>());
}

static IEnumerable<T> CachedImpl<T>(IEnumerator<T> source, List<T> buffer) {
    int pos = 0;
    while (true) {
        if (pos == buffer.Count)
            if (source.MoveNext())
                buffer.Add(source.Current);
            else
                yield break;
        yield return buffer[pos++];
    }
}

Key ideas are to use the yield return syntax to make for a short enumerable implementation, but you still need a state-machine to decide whether you can get the next element from the buffer, or whether you need to check the underlying enumerator.

Limitations: This makes no attempt to be thread-safe, nor does it dispose of the underlying enumerator (which, in general, is quite tricky to do, as the underlying uncached enumerator must remain undisposed as long as any cached enumerable might still be used).

14 Comments

Nice. It also passes the Zip test.
Yeah. Shame that it needs a pointless wrapper method as you point out, but still nicer than all that manual interface implementation stuff.
I've added another solution that's longer but uses a general pattern for simulating anonymous iterators, so a bit more pretty.
It's generally a good idea to use braces around your if when you have a dangling else, as you have here.
@Sevvy that's a matter of taste. I contend that with automatic indenting, the purpose of those braces is almost nil - the indent, if you're used to it, makes it immediately obvious which if the else applies to (and the auto-indent means that making a mistake here is extremely unlikely): I've been programming .NET almost since day 1 and I've never ever seen this get confusingly messed up. Additionally: in an online forum like this I find it's helpful to have only the critical long blurbs of code; the extra visual distraction of extra braces isn't ideal.
4

As far as I know there is no built-in way to do this, which - now that you mention it - is slightly surprising (my guess is, given the frequency with which one would want to use this option, it was probably not worth the effort needed to analyse the code to make sure that the generator gives the exact same sequence every time).

You can however implement it yourself. The easy way would be on the call-site, as

var evenNumbers = Numbers.Where(i => i % 2 == 0);
var startOfList = evenNumbers.Take(10).ToList();

// use startOfList instead of evenNumbers in the loop.

More generally and accurately, you could do it in the generator: create a List<int> cache and every time you generate a new number add it to the cache before you yield return it. Then when you loop through again, first serve up all the cached numbers. E.g.

List<int> cachedEvenNumbers = new List<int>();
IEnumerable<int> EvenNumbers
{
  get
  {
    int i = -1;

    foreach(int cached in cachedEvenNumbers)
    {
      i = cached;
      yield return cached;
    }

    // Note: this while loop now starts from the last cached value
    while (true) 
    {
        Console.WriteLine("Generating {0}.", i + 1);
        yield return ++i;
    }
  }
}

I guess if you think about this long enough you could come up with a general implementation of an IEnumerable<T>.Buffered() extension method - again, the requirement is that the enumeration doesn't change between calls, and the question is whether it's worth it.

3 Comments

My answer supplies the general-purpose "Buffered" method you're talking about.
Thanks for your answer CompuChip, and yes this is a generic solution that I'm seeking. Anyway +1. :)
@TimothyShields I see you edited your answer after I posted mine. Nice one, thanks!
3

Here's an incomplete yet compact 'functional' implementation (no new types defined).

The bug is that it does not allow simultaneous enumeration.


Original description: The first function should have been an anonymous lambda inside the second, but C# does not allow yield in anonymous lambdas:

// put these in some extensions class

private static IEnumerable<T> EnumerateAndCache<T>(IEnumerator<T> enumerator, List<T> cache)
{
    while (enumerator.MoveNext())
    {
        var current = enumerator.Current;
        cache.Add(current);
        yield return current;
    }
}
public static IEnumerable<T> ToCachedEnumerable<T>(this IEnumerable<T> enumerable)
{
    var enumerator = enumerable.GetEnumerator();
    var cache = new List<T>();
    return cache.Concat(EnumerateAndCache(enumerator, cache));
}

Usage:

var enumerable = Numbers.ToCachedEnumerable();

4 Comments

This is buggy: it doesn't support multiple simultaneous iterations. E.g. cached.Zip(cached.Skip(1), Tuple.Create) would crash - and note that that's a particularly interesting case to support, because such caching simultaneously ensures the list is evaluated just once while staying lazy.
Also, there's no need for the double-nested func's - you're evaluating em right away anyhow.
Oops, that double anonymous lambda slipped through. Fixed.
You're also right about the bug. I'll leave this answer as a "how not to do it"
3

Full credit to Eamon Nerbonne and sinelaw for their answers, just a couple of tweaks! First, to release the enumerator when it is completed. Secondly to protect the underlying enumerator with a lock so the enumerable can be safely used on multiple threads.

// This is just the same as @sinelaw's Generator but I didn't like the name
public static IEnumerable<T> AnonymousIterator<T>(Func<Func<Tuple<T>>> generator)
{
    var tryGetNext = generator();
    while (true)
    {
        var result = tryGetNext();
        if (null == result)
        {
            yield break;
        }
        yield return result.Item1;
    }
}

// Cached/Buffered/Replay behaviour
public static IEnumerable<T> Buffer<T>(this IEnumerable<T> self)
{
    // Rows are stored here when they've been fetched once
    var cache = new List<T>();

    // This counter is thread-safe in that it is incremented after the item has been added to the list,
    // hence it will never give a false positive. It may give a false negative, but that falls through
    // to the code which takes the lock so it's ok.
    var count = 0;

    // The enumerator is retained until it completes, then it is discarded.
    var enumerator = self.GetEnumerator();

    // This lock protects the enumerator only. The enumerable could be used on multiple threads
    // and the enumerator would then be shared among them, but enumerators are inherently not
    // thread-safe so a) we must protect that with a lock and b) we don't need to try and be
    // thread-safe in our own enumerator
    var lockObject = new object();

    return AnonymousIterator<T>(() =>
    {
        int pos = -1;
        return () =>
        {
            pos += 1;
            if (pos < count)
            {
                return new Tuple<T>(cache[pos]);
            }
            // Only take the lock when we need to
            lock (lockObject)
            {
                // The counter could have been updated between the check above and this one,
                // so now we have the lock we must check again
                if (pos < count)
                {
                    return new Tuple<T>(cache[pos]);
                }

                // Enumerator is set to null when it has completed
                if (enumerator != null)
                {
                    if (enumerator.MoveNext())
                    {
                        cache.Add(enumerator.Current);
                        count += 1;
                        return new Tuple<T>(enumerator.Current);
                    }
                    else
                    {
                        enumerator = null;
                    }
                }
            }
            return null;
        };
    });
}

2 Comments

There is a race condition which keeps this code from being thread safe. Two threads try to get the last item in the list. Thread A checks pos < count to see if there's a cached result for it; there isn't. Thread B checks pos < count to see if there's a cached result for it; there isn't. Thread B moves to the last item and returns it. Thread B tries to get the next item, encounters the end of the list, and sets enumerator=null. Thread A checks enumerator != null, sees that it is null and return null instead of returning the last item.
You were right there was, thanks! I've edited the code to remove the outer check on the enumerator, which I think resolves the problem. Do you agree?
0

I use the following extension method.

This way, the input is read at maximum speed, and the consumer processes at maximum speed.

public static IEnumerable<T> Buffer<T>(this IEnumerable<T> input)
{
    var blockingCollection = new BlockingCollection<T>();

    //read from the input
    Task.Factory.StartNew(() =>
    {
        foreach (var item in input)
        {
            blockingCollection.Add(item);
        }

        blockingCollection.CompleteAdding();
    });

    foreach (var item in blockingCollection.GetConsumingEnumerable())
    {
        yield return item;
    }
}

Example Usage

This example has a fast producer (find files), and a slow consumer (upload files).

long uploaded = 0;
long total = 0;

Directory
    .EnumerateFiles(inputFolder, "*.jpg", SearchOption.AllDirectories)
    .Select(filename =>
    {
        total++;
        return filename;
    })
    .Buffer()
    .ForEach(filename => // note: ForEach on IEnumerable<T> is not standard LINQ; assumes a custom extension method
    {
        //pretend to do something slow, like upload the file.
        Thread.Sleep(1000);
        uploaded++;

        Console.WriteLine($"Uploaded {uploaded:N0}/{total:N0}");
    });


4 Comments

Have you measured this to see if your assertion is correct? My experience with a ConcurrentQueue is that the locking will make this much slower.
This also will ramp up the CPU. The yield return loop just spins on the CPU if the input is slow.
Thanks @Enigmativity, I changed it from ConcurrentQueue to a BlockingCollection
Sorry, any form of concurrent or blocking collection is the same.
