Interlocked.* code section guard with minimal inter-core interference?

Question

In order to guard a code section against repeat or concurrent execution we can use Interlocked functionality. Guarding against repeat execution is necessary for things like Dispose(), and guarding against concurrent execution is a fundamental part of high-performance multithreading.

Originally I used Interlocked.Increment() because it effectively records the total number of calls and thus has greater diagnostic power than the options based on exchange operations. However, I was wondering whether Interlocked.CompareExchange() might be a better option with regard to minimising CPU bus noise and inter-core interference caused by inter-core cache update operations.

When guarding against repeat execution there is probably not much difference because the expected number of calls is exactly 1 in non-pathological cases and hence Interlocked.CompareExchange() will need to do a volatile write just like the Interlocked.Exchange() or Interlocked.Increment(). Also, the total number of executions in use cases like Dispose() will be comparatively small, so who cares.

However, things are different when guarding against concurrent execution, like for a non-blocking critical section. In that case there can be huge numbers of unsuccessful execution attempts and for these there should be no volatile write if Interlocked.CompareExchange() is used, only the volatile read that has to be performed in any case. This should give the CAS operation an edge over the other two options. Also, in this case Interlocked.Increment() has the additional disadvantage that even failed acquisition attempts need a volatile undo operation in the form of Interlocked.Decrement().

From this I conclude that it makes sense to use Interlocked.CompareExchange() for guarding against repeat or concurrent execution if there are no other, overriding concerns at play.

Is this reasoning sound or am I overlooking something?

The objective is to choose the best option for the general case, as a coding convention, and also to be aware when further, case-specific analysis may be in order (like doing a volatile read before the CAS as indicated by Ahmed AEK in a comment).
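A minimal sketch of the read-before-CAS variant mentioned above, assuming a hypothetical helper class named TryGuard (the name and its members are illustrative and not part of any library). The plain volatile read lets a failed attempt return without issuing an atomic read-modify-write; whether that pays off on given hardware would need to be measured.

```csharp
using System.Threading;

// Hypothetical helper type; name and shape are assumptions for illustration.
public sealed class TryGuard
{
    private int m_active; // 0 = free, 1 = held

    public bool TryEnter()
    {
        // Optimistic read: if the guard is visibly held, give up
        // without attempting the atomic compare-and-swap at all.
        if (Volatile.Read(ref m_active) != 0)
            return false;

        // The flag looked free; the CAS decides the race between contenders.
        return Interlocked.CompareExchange(ref m_active, 1, 0) == 0;
    }

    public void Exit()
    {
        // Release store, as in the CompareExchange candidate in this question.
        Volatile.Write(ref m_active, 0);
    }
}
```

Callers would wrap the guarded section in try/finally and call Exit() only after a successful TryEnter().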

P.S.: I originally cut my teeth in/on Turbo Pascal and Delphi, which did not have anything comparable to Volatile.Write() and offered interlocked operations only as imports from the Win32 API. This may have led to an undue preference for solutions involving interlocked increment and decrement rather than CompareExchange with release via a volatile write.

Concrete examples as requested by multiple comment respondents

Candidate patterns for preventing repeat execution:

if (Interlocked.Increment(ref m_disposed) == 1)
{
    // ... disposal code ...
}

versus (current preference)

if (Interlocked.CompareExchange(ref m_disposed, 1, 0) == 0)
{
    // ... disposal code ...
}
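To make the CompareExchange variant concrete, here is a small self-contained sketch. The class OnceDisposable and its CleanupRuns counter are hypothetical and exist only so that the "runs exactly once" property can be observed; the increment stands in for the real disposal code.

```csharp
using System;
using System.Threading;

// Hypothetical resource type used only to illustrate the guard.
public sealed class OnceDisposable : IDisposable
{
    private int m_disposed;    // 0 = live, 1 = disposed
    private int m_cleanupRuns; // demo-only counter of cleanup executions

    public int CleanupRuns => Volatile.Read(ref m_cleanupRuns);

    public void Dispose()
    {
        // Only the caller that flips the flag from 0 to 1 runs the cleanup;
        // every later or concurrent caller sees 1 and skips it.
        if (Interlocked.CompareExchange(ref m_disposed, 1, 0) == 0)
        {
            Interlocked.Increment(ref m_cleanupRuns); // stands in for disposal code
        }
    }
}
```

Calling Dispose() from many threads at once should leave CleanupRuns at 1.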

Candidates for preventing concurrent execution:

var times_entered = Interlocked.Increment(ref m_active);

try
{
    if (times_entered == 1)  // no-one else here
    {
        // ... guarded code section ...
    }
}
finally
{
    Interlocked.Decrement(ref m_active);  // undo the increment on the shared counter, not on the local copy
}

versus (current preference, but may need amendment as indicated by Ahmed AEK)

if (Interlocked.CompareExchange(ref m_active, 1, 0) == 0)
{
    try
    {
        // ... guarded code section ...
    }
    finally
    {
        // release only if we actually acquired the guard
        Volatile.Write(ref m_active, 0);
    }
}

Actually, people will do an atomic read and then a CAS, as in this question: stackoverflow.com/q/79717187/15649230. Mutexes, however, generally go for the CAS first, on the assumption that they are unlocked most of the time.
– Ahmed AEK, Commented Aug 9 at 13:11
FYI, the question statement can be improved marginally: figuring out which code you are talking about in the text is painful. Just add a piece of code that prevents repeated or concurrent execution to the question and ask "is this better or is that better?"
– Ahmed AEK, Commented Aug 9 at 13:15
Also relevant: stackoverflow.com/questions/5339769/… At least on x86 there is no difference between an RMW and a CAS, so just go for the CAS, or for whatever results in fewer atomic operations. But as I said above, people will do an optimistic atomic read first to avoid locking the cache line.
– Ahmed AEK, Commented Aug 9 at 14:19
If your question boils down to "Which is faster in .NET on ARM when optimized: Interlocked.CompareExchange(), Interlocked.Exchange() or Interlocked.Increment()?", I can only recommend you check out "Which is faster?" by Eric Lippert. Make a test case yourself and use BenchmarkDotNet to test it. Honestly, I suspect the difference will be minuscule compared to the overhead of the actual work of disposal, the .NET runtime, and whatever else you are doing.
– dbc, Commented Aug 9 at 14:29
@dbc: exchange is how compilers implement C++ std::atomic_flag's test_and_set, at least on x86 where it's a good choice; I haven't checked ARM, and IDK if that's the right tuning choice for ARM. On modern ARMv8.1 using single-instruction atomics, it's probably like x86 where either option needs the cache line in Exclusive state. Possibly ARM CAS could avoid dirtying it on failure, unlike x86. With old LL/SC, which compilers probably only use on old CPUs, CAS potentially gives you an early-out on false.
– Peter Cordes, Commented Aug 9 at 16:38

Theodor Zoulias · Accepted Answer · 2025-08-09 16:20:31Z

3

Below is what I consider the standard way to mimic the Monitor.TryEnter/Monitor.Exit methods:

if (Interlocked.CompareExchange(ref m_active, 1, 0) == 0)
{
    try
    {
        // ... guarded code section ...
    }
    finally
    {
        Interlocked.Exchange(ref m_active, 0);
    }
}

The reason to prefer Interlocked.Exchange over Volatile.Write is to prevent instructions that follow the finally block from being reordered and moved before the release. Interlocked.Exchange is a full memory barrier, while Volatile.Write is only a half barrier (release semantics): it keeps earlier operations from moving after the write, but allows later operations to move before it. Allowing this reordering could increase the number of instructions executed inside the guarded section, delaying the release of this improvised "lock". What effect such a delay would have is not clear, because we don't have a broad view of your program.
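As a rough illustration of how the pattern above might be packaged for reuse, here is a sketch with a hypothetical wrapper class NonBlockingSection and a hypothetical TryRun method (neither is a .NET API). It returns false when the section is already held instead of blocking, and releases the flag even if the body throws.

```csharp
using System;
using System.Threading;

// Hypothetical wrapper around the CompareExchange/Exchange pattern.
public sealed class NonBlockingSection
{
    private int m_active; // 0 = free, 1 = held

    // Runs body if the section is free; returns false without running it otherwise.
    public bool TryRun(Action body)
    {
        if (Interlocked.CompareExchange(ref m_active, 1, 0) != 0)
            return false; // someone else is inside; do not touch the flag

        try
        {
            body();
            return true;
        }
        finally
        {
            // Full-barrier release, as recommended in this answer.
            Interlocked.Exchange(ref m_active, 0);
        }
    }
}
```

Note that the flag is not reentrant: a nested TryRun call made from inside the body on the same thread is rejected just like a call from another thread.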



DarthGizka Aug 9 at 16:25

Thank you, the danger of stuff potentially being reordered before the release in the case of a volatile write had not even occurred to me. Although it cannot affect correctness, it can affect performance, as you said. Kudos!

Peter Cordes Aug 9 at 16:45

@DarthGizka: Having the unlock reorder with later instructions probably won't happen statically (at compile time), and letting it happen at runtime won't slow down the unlock, it just gets useful work done while waiting instead of spending extra time on an atomic RMW and full memory barrier. CPUs generally try to complete the oldest-ready thing first, so memory reordering with later stores should only happen if your spinlock unlock misses in cache. And StoreLoad reordering with later loads is essential for CPUs to pipeline operations and hide latency.

Peter Cordes Aug 9 at 16:53

There could be some secondary downside in rare(?) cases, like a later load missing in cache so the release-store unlock's read-for-ownership might be delayed. IDK how common that is. I'm curious if Stephen Toub has tested this, or if we're both working off intuition and getting different guesses. IDK what he's talking about with ARM having a cost for a store-release "fence". If we're talking ARMv8, there's no fence instruction, just stlr instead of str, same instruction for release or seq_cst store.

Peter Cordes Aug 9 at 16:55

(ldar SC loads are what avoid StoreLoad reordering with SC or release stores, vs. ldapr for only acquire, introduced later in ARMv8.something). A full barrier does need a separate fence which is much more expensive and destroys all memory-level parallelism across it. It's certainly possible Stephen Toub has tested this in realistic use-cases and found it's better, but microbenchmarking locks is very hard: it's easy to test max-contention with all threads hammering on it all the time, but that's not realistic to threads doing useful work that also takes time.

dbc Aug 9 at 17:36

It would be interesting to see how, in .NET 9, the performance of the code in this answer compares to the performance of the new Lock type, and Lock.TryEnter().
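For anyone who wants to run that comparison, a minimal sketch of the Lock-based equivalent might look like the following. It assumes .NET 9 or later, where System.Threading.Lock provides TryEnter() and Exit(); the wrapper class LockSection and its TryRun method are hypothetical names.

```csharp
using System;
using System.Threading;

// Hypothetical wrapper; requires .NET 9+ for System.Threading.Lock.
public sealed class LockSection
{
    private readonly Lock m_lock = new();

    // Runs body if the lock can be taken immediately; returns false otherwise.
    public bool TryRun(Action body)
    {
        if (!m_lock.TryEnter())
            return false; // held by another thread; do not wait

        try
        {
            body();
            return true;
        }
        finally
        {
            m_lock.Exit();
        }
    }
}
```

One behavioural difference to keep in mind when benchmarking: Lock can be re-entered by the thread that already holds it, whereas the CompareExchange flag rejects nested attempts from the same thread.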
