Is it possible, on any real hardware in the real world, for the updated value of an atomic integer written by one thread to become visible to another thread earlier via an indirect path (a third thread forwarding the value through a second atomic integer) than via the direct path (the originally written atomic integer)?

This question is based on an example I saw in the Linux docs: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/memory-barriers.txt#n1370

Let's walk through their example (note that they have a version with and without the general_barrier, which they then compare), which I've turned into C++ code and annotated with comments to help clarify my question. Rephrasing my question in the context of the example:

  • Is it possible for the new value of y to be visible to thread3 before the new value of x is visible to thread3?

#include <atomic>
#include <cassert>

#define general_barrier std::atomic_thread_fence(std::memory_order_seq_cst)
#define read_barrier    std::atomic_thread_fence(std::memory_order_acquire)

// Not written explicitly, but as globals these are zero-initialized,
// so both start out holding 0.
std::atomic<int> x, y;

void thread1() {
  x.store(1, std::memory_order_relaxed); // W0, value goes to thread2 and thread3
  // No barrier should be needed here because there's only one operation,
  // so there is nothing to reorder.
}

void thread2() {
  int r2 = x.load(std::memory_order_relaxed); // R0
  general_barrier; // F1, is this full memory barrier really necessary?
  // You can't do the write until you know the final (non speculative) value of r2,
  // so the read MUST have completed before the write.
  // Therefore I actually think no barrier should be needed at all here.
  y.store(r2, std::memory_order_relaxed); // W1, direct forwarding of value of x
}

void thread3() {
  int r3_a = y.load(std::memory_order_relaxed); // R1, value of x via indirect path
  read_barrier; // F2, required to prevent reordering of loads
  int r3_b = x.load(std::memory_order_relaxed); // R2, value of x via direct path

  if( r3_a == 1 ){
    assert(r3_b == 1); // Is it possible for this to go wrong?
  }
}
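
For completeness, a minimal harness to actually exercise this litmus test might look like the sketch below. This is my addition, not part of the kernel example; run_once is a name I made up, and if the reordering is possible at all, a failing run would likely be rare and very hardware-dependent.

#include <thread>

// Hypothetical test driver (my addition): reset the shared state, run the
// three threads once, and wait for them. Repeating this many times is how
// one would try to observe the assert firing, if it can fire at all.
void run_once() {
  x.store(0, std::memory_order_relaxed);
  y.store(0, std::memory_order_relaxed);
  std::thread t1(thread1), t2(thread2), t3(thread3);
  t1.join();
  t2.join();
  t3.join();
}

int main() {
  for (int i = 0; i < 1000000; ++i)
    run_once();
}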

About the version with general_barrier, they say the following (they say CPU instead of thread, though I'm keeping it general, assuming the threads may or may not run on separate cores or CPUs; anything is possible):

Suppose that CPU 2's load from X returns 1, which it then stores to Y, and CPU 3's load from Y returns 1. This indicates that CPU 1's store to X precedes CPU 2's load from X and that CPU 2's store to Y precedes CPU 3's load from Y. In addition, the memory barriers guarantee that CPU 2 executes its load before its store, and CPU 3 loads from Y before it loads from X. The question is then "Can CPU 3's load from X return 0?"

Because CPU 3's load from X in some sense comes after CPU 2's load, it is natural to expect that CPU 3's load from X must therefore return 1. This expectation follows from multicopy atomicity: if a load executing on CPU B follows a load from the same variable executing on CPU A (and CPU A did not originally store the value which it read), then on multicopy-atomic systems, CPU B's load must return either the same value that CPU A's load did or some later value. However, the Linux kernel does not require systems to be multicopy atomic.

The use of a general memory barrier in the example above compensates for any lack of multicopy atomicity. In the example, if CPU 2's load from X returns 1 and CPU 3's load from Y returns 1, then CPU 3's load from X must indeed also return 1.

My comment on this part:

  • The property that I'm looking for doesn't require multicopy atomicity or even other-multicopy atomicity; it's weaker still. The requirement is only that an indirect path from thread1 to thread3, forwarding through thread2, can't be faster than the direct path from thread1 to thread3. I suppose you could call it "forwarding-atomicity" or "middleman-atomicity"?

Then on the version without general_barrier, here's what they have to say:

This substitution allows non-multicopy atomicity to run rampant: in this example, it is perfectly legal for CPU 2's load from X to return 1, CPU 3's load from Y to return 1, and its load from X to return 0.

The key point is that although CPU 2's data dependency orders its load and store, it does not guarantee to order CPU 1's store. Thus, if this example runs on a non-multicopy-atomic system where CPUs 1 and 2 share a store buffer or a level of cache, CPU 2 might have early access to CPU 1's writes. General barriers are therefore required to ensure that all CPUs agree on the combined order of multiple accesses.
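
In the C++ rendering above, this substitution corresponds to simply dropping the fence from thread2, leaving only the data dependency (a sketch of my own; the name thread2_no_barrier is mine and everything else stays the same):

// Variant without the general barrier: only the data dependency from r2 to
// the store of y relates the two operations inside thread2.
void thread2_no_barrier() {
  int r2 = x.load(std::memory_order_relaxed); // R0
  y.store(r2, std::memory_order_relaxed);     // W1, no fence in between
}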

My comment on this part:

  • In the proposed hardware architecture I still think the desired property holds. Assuming thread1 is on CPU 1, thread2 on CPU 2, and thread3 on CPU 3: if thread2 reads x == 1, that proves the value is at least in the store buffer shared between thread1 and thread2 (but not with thread3). Then thread2 writes y = 1, but that value can only become visible to thread3 once the shared store buffer of thread1 and thread2 has been flushed. Since the store buffer already contained the new value of x, the new value of y can't become visible without also making the new value of x visible to thread3 at the same time or earlier.

So it looks like the big names in Linux would answer my question with:

  • "Yes, it is possible for the new value of y to be visible to thread3 before the new value of x is visible to thread3, and you need a full memory barrier in thread2 to prevent this."

However, I was able to reason that even in their proposed counterexample the desired property still holds. So I am wondering whether it always holds, or whether there exists a real hardware architecture that serves as a counterexample and violates the property.

What I am looking for in an answer:

  • If you argue that the property does always hold, it would be nice if it could be shown with the C++ formalism, e.g. using the standard's memory-ordering rules. Otherwise it should be a hardware and cache-coherence-protocol discussion. Explain why no current hardware violates the property, and by what mechanism this is achieved.
  • If you argue that the property can be violated, please explain through a counterexample. Explain what hardware architecture and cache-coherence protocol violates the property, and by what mechanism this happens. Walk through the example to show where it goes wrong.

Credits to @Peter Cordes for pointing me to the Linux docs.

Comments:

  • Pretty sure it is possible for the assert to fail on real POWER hardware, with t1 and t2 on the same physical core, t3 on another. t2 store-forwards from t1 and commits the store to y, all before this core gets exclusive ownership of the cache line holding x. You're correct about the data dependency being sufficient in real hardware; ISO C++ would require consume (which is deprecated and promoted to acquire). A release or acquire fence would also be sufficient in real hardware; in ISO C++ only release would be sufficient. (Or better, making the store a release op.)
  • On x86, the memory model requires stores to commit in program order. On weakly-ordered ISAs, they can commit into their equivalent of line-fill buffers, for example, to be merged into the in-flight cache line's data when it arrives. Or some other equivalent effect, like looking past the head of the store buffer if the cache line isn't exclusively owned yet. This lets younger stores commit first, introducing StoreStore reordering, which is absolutely a thing even in other-multicopy-atomic ISAs. I'm proposing StoreStore reordering from the PoV of a separate physical core, vs. how it looks from inside.
  • @QwertYuiop: Yes, that's the whole point of consume: ordering for one operation, not for all with a fence instruction. But since no compiler ever implemented it, we only have things like Linux using restrictions on source code so that compiler optimization can't break the dependency; see "C++11: the difference between memory_order_relaxed and memory_order_consume". So (&shared_var)[tmp-tmp] is something you'd have to do in asm, since tmp-tmp only has one possible value. Math insns do carry a dependency on ISAs like ARM and POWER.
  • I mentioned IRIW because I'd already written about store-forwarding and actual mechanisms for different cores to disagree about store orders. The data dependency does make this different and requires more argumentation to get to the point where we can argue that reordering can happen. Again, I'm not 100% sure I'm right about all this stuff, but we know how C++ compiles for PowerPC, and we know that ISO C++ rules do guarantee (in your other question) that release is enough. So either that asm works on PowerPC, or (unlikely) we've discovered a defect in ISO C++ and/or its mapping to PPC.
  • Defects have been discovered before, of things the standard guarantees but which HW doesn't provide in weird corner cases. But I still think that's unlikely, and I think lwsync (lightweight sync) ordering later stores after all earlier stores, including those from other logical cores, makes a lot of sense and is plausible in HW. (As well as after loads from this logical core, at least: preshing.com/20120913/acquire-and-release-semantics)
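
Picking up on the comments above that a release store in thread2 would be sufficient under ISO C++, that variant would presumably look like the following (my sketch of that suggestion; the name thread2_release is mine, and this is not code from the question or the kernel docs):

// Variant suggested in the comments (my sketch): make the store to y a
// release operation instead of relaxed-plus-full-fence. If thread3's load of
// y reads this value, the acquire fence in thread3 synchronizes with this
// release store, so thread2's load of x (which read 1) happens-before
// thread3's load of x, and read-read coherence then forces thread3's load of
// x to also return 1 (or a later value).
void thread2_release() {
  int r2 = x.load(std::memory_order_relaxed); // R0
  y.store(r2, std::memory_order_release);     // W1, release store
}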
