
So I've come across Jeff Preshing's wonderful blog posts on what Acquire/Release semantics are and how they can be achieved with CPU barriers.

I've also read that SeqCst is about a single total order that's guaranteed to be consistent with the coherence-ordered-before relation, though for historical reasons it may at times contradict the happens-before relation established by plain Acquire/Release operations.

My question is, how do the old GCC built-ins map into the memory model introduced by C++11 (and later revisions)? In particular, how does __sync_synchronize() map into C++11 or later modern C/C++?

In the GCC manual this call is simply described as a full memory barrier, which I suppose is the combination of all four major kinds of barriers, i.e. LoadLoad, LoadStore, StoreLoad, and StoreStore all at once. But is __sync_synchronize() equivalent to std::atomic_thread_fence(memory_order_seq_cst)? Or, formally speaking, is one of them stronger than the other? (I suppose that's the case here: in general a SeqCst fence should be stronger, since it requires the toolchain/platform to establish a single global ordering somehow, no?) Perhaps it just happens that most CPUs out there provide only instructions that satisfy both at once: a full memory barrier for __sync_synchronize() and total sequential ordering for std::atomic_thread_fence(memory_order_seq_cst), for example x86 mfence and PowerPC hwsync.

Whether __sync_synchronize() and std::atomic_thread_fence(memory_order_seq_cst) are formally equivalent or merely effectively equivalent (i.e. formally different, but with no commercially available CPU bothering to distinguish between the two), technically speaking a memory_order_relaxed load on the same atomic still may not be relied upon to synchronize-with it and create a happens-before relation, no?

I.e. technically speaking, all of the assertions below are allowed to fail, right?

// Experiment 1, using C11 `atomic_thread_fence`: assertion is allowed to fail, right?

#include <assert.h>
#include <stdatomic.h>

// globals
static atomic_bool lock = false;
static atomic_bool critical_section = false;

// thread 1
atomic_store_explicit(&critical_section, true, memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);
atomic_store_explicit(&lock, true, memory_order_relaxed);

// thread 2
if (atomic_load_explicit(&lock, memory_order_relaxed)) {
    // We should really `memory_order_acquire` the `lock`
    // or `atomic_thread_fence(memory_order_acquire)` here,
    // or this assertion may fail, no?
    assert(atomic_load_explicit(&critical_section, memory_order_relaxed));
}

// Experiment 2, using `SeqCst` directly on the atomic store: assertion is allowed to fail, right?

// global
static atomic_bool lock = false;
static atomic_bool critical_section = false;

// thread 1
atomic_store_explicit(&critical_section, true, memory_order_relaxed);
atomic_store_explicit(&lock, true, memory_order_seq_cst);

// thread 2
if (atomic_load_explicit(&lock, memory_order_relaxed)) {
    // Again we should really `memory_order_acquire` the `lock`
    // or `atomic_thread_fence(memory_order_acquire)` here,
    // or this assertion may fail, no?
    assert(atomic_load_explicit(&critical_section, memory_order_relaxed));
}

// Experiment 3, using the GCC built-in: assertion is allowed to fail, right?

// global
static atomic_bool lock = false;
static atomic_bool critical_section = false;

// thread 1
atomic_store_explicit(&critical_section, true, memory_order_relaxed);
__sync_synchronize();
atomic_store_explicit(&lock, true, memory_order_relaxed);

// thread 2
if (atomic_load_explicit(&lock, memory_order_relaxed)) {
    // we should somehow put a `LoadLoad` memory barrier here,
    // or the assert might fail, no?
    assert(atomic_load_explicit(&critical_section, memory_order_relaxed));
}

I've tried these snippets on my RPi 5, but I don't see the assertions fail. Granted, this doesn't formally prove anything, but it also doesn't shed any light on the difference between __sync_synchronize() and std::atomic_thread_fence(memory_order_seq_cst).

  • This post seems to contain two questions that are really independent. The second half just seems to be about the fact that in order to actually achieve any synchronization or deduce anything about ordering, you need appropriate barriers in both threads, which seems to me pretty obvious. LoadLoad reordering in Thread 2 could certainly make the assert fail, and there's no way that any actions whatsoever in Thread 1 could prevent Thread 2 from doing so. Commented Mar 23 at 19:43
  • Here's a test that should work on your RPi 5 (it does on mine): Example of LoadLoad reordering. The two keys are (1) have the variables in separate cache lines; (2) have a test that you can repeat quickly, without re-running the program or spawning new threads every time, and that isn't reliant on any particular timing synchronization between the threads. Commented Mar 23 at 20:02
  • On my machine the number of required iterations is usually in the thousands, which is instantaneous when you're running them at full speed, but might never be seen if you're just taking one shot per run of your program. Another trick that can help (though not needed here) is to manually evict a cache line if you want a load or store to be delayed after one that's later in program order. Commented Mar 23 at 20:04
  • Thanks for the snippet. I was not aware of the cache-line part, and I did spawn a thread each time, which made it quite slow. Thanks for the tips on how to experiment with this kind of concurrency problem. Commented Mar 24 at 13:22

1 Answer


Yes, __sync_synchronize() is at least in practice equivalent to std::atomic_thread_fence(memory_order_seq_cst).

Formally, __sync_synchronize() operates in terms of memory barriers and blocking memory reordering, since it predates the existence of C++11's formal memory model. atomic_thread_fence operates in terms of C++11's memory model; compiling to a full-barrier instruction is an implementation detail.

So, for example, the standard doesn't require thread_fence to do anything in a program without any std::atomic<> objects, because its behaviour is only defined in terms of atomics, whereas __sync_synchronize() (and, in practice, thread_fence as an implementation detail in GCC/Clang) could let you hack something up that synchronizes on plain int variables. That's UB in C++11, and a bad idea even in terms of a known implementation like GCC; see Who's afraid of a big bad optimizing compiler? regarding the obvious and non-obvious badness (like invented loads) that can happen when you just use memory barriers, instead of std::atomic with relaxed for shared variables, to stop the compiler from keeping them in registers.
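
For illustration, here's a sketch of the two patterns side by side, in the same fragment style as the experiments above (the variable names are made up; the plain-int version is exactly the formally-racy hack just described):

#include <stdatomic.h>

// Legacy hack: plain ints plus the barrier built-in. On a known GCC
// target this "works" in practice, but it's a data race (UB) under the
// C11/C++11 model, and nothing stops the compiler from inventing loads
// or keeping `ready` cached in a register across the spin loop.
static int payload;
static int ready = 0;

// thread 1
payload = 42;
__sync_synchronize();
ready = 1;

// thread 2
while (!ready) { }   // without atomics, this load may legally be hoisted
__sync_synchronize();
// ... read payload ...

// C11 way: make the flag atomic. Relaxed accesses plus fences request
// the same barrier placement, but with defined behaviour.
static int payload2;
static atomic_int ready2 = 0;

// thread 1
payload2 = 42;
atomic_thread_fence(memory_order_release);
atomic_store_explicit(&ready2, 1, memory_order_relaxed);

// thread 2
while (!atomic_load_explicit(&ready2, memory_order_relaxed)) { }
atomic_thread_fence(memory_order_acquire);
// ... read payload2 ...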

But my point is: in practice they work the same, yet they come from different memory models. The __sync builtins are defined in terms of barriers against local reordering of accesses to cache-coherent shared memory (i.e. a CPU-architecture view), while C++11's std::atomic machinery is defined in terms of its own formalism, with modification orders and synchronizes-with / happens-before relations, which formally allows some outcomes that aren't plausible on a real CPU using cache-coherent shared memory.


Yes, in your code blocks, the assertion could fail on a CPU where LoadLoad reordering is possible. It's probably not possible with both variables in the same cache line. See C++ atomic variable memory order problem can not reproduce LoadStore reordering example for another case of trying to reproduce memory reordering.
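
Putting the two keys from the comments together, a minimal repro harness for Experiment 1 might look like the following (a sketch, assuming C11 <threads.h>, a 64-byte cache line, and made-up names and iteration counts; the writer thread is simply torn down at process exit):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <threads.h>

// Key (1): keep the two flags in separate cache lines (64 bytes assumed).
static _Alignas(64) atomic_bool lock_flag;
static _Alignas(64) atomic_bool critical_section;
static _Alignas(64) atomic_int round_no;   // lockstep round counter

static int writer(void *arg)
{
    (void)arg;
    for (int r = 1;; ++r) {
        // Wait until the reader has reset the flags for round r.
        while (atomic_load_explicit(&round_no, memory_order_acquire) < r) { }
        atomic_store_explicit(&critical_section, true, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   // or __sync_synchronize()
        atomic_store_explicit(&lock_flag, true, memory_order_relaxed);
    }
}

int main(void)
{
    thrd_t t;
    thrd_create(&t, writer, NULL);
    // Key (2): many fast iterations in one process, no thread respawning.
    for (int r = 1; r <= 100000000; ++r) {
        atomic_store_explicit(&critical_section, false, memory_order_relaxed);
        atomic_store_explicit(&lock_flag, false, memory_order_relaxed);
        atomic_store_explicit(&round_no, r, memory_order_release);
        while (!atomic_load_explicit(&lock_flag, memory_order_relaxed)) { }
        // The loads under test: relaxed on purpose, no acquire anywhere.
        if (!atomic_load_explicit(&critical_section, memory_order_relaxed))
            printf("LoadLoad reordering observed on round %d\n", r);
    }
}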


5 Comments

Yeah, emphasizing that these functions stem from different models/regimes helps. So may I claim that (when manipulating atomic objects) a full memory barrier / __sync_synchronize(), e.g. hwsync on PowerPC and mfence on x86, in practice achieves *more* than what's formally required by std::atomic_thread_fence(memory_order_seq_cst) on a modern CPU with cache-coherent shared memory?
Additionally, I wonder how to express memory_order_seq_cst with the four major kinds of barriers (LoadLoad/LoadStore/StoreLoad/StoreStore). Is that possible, or do we need a more mathematically rigorous language to fully capture the concept? Any text/paper would be hugely appreciated.
@NotAName: AArch64 is an interesting example: seq_cst operations like x.store(1, seq_cst) can't reorder with later y.load(seq_cst), but can reorder with weaker or non-atomic operations. SC requires a StoreLoad barrier between SC stores and SC loads, but putting a full barrier after each SC store is just an implementation detail on ISAs without AArch64's special interaction between stlr and ldar where ldar has to wait for any stlr ops to drain from the store buffer, but otherwise stlr is just a release-store.
@NotAName: In terms of barriers mapping to acquire / release and seq_cst, see preshing.com/20120913/acquire-and-release-semantics. seq_cst is like acquire + release with the additional requirement that no seq_cst operation can reorder with any other seq_cst operation. Acquire is LoadLoad + LoadStore and Release is StoreStore + LoadStore; only seq_cst ever requires a StoreLoad barrier.
That SeqCst often requires a StoreLoad barrier of some sort, because both stores and loads (in particular a SeqCst store followed by a SeqCst load) have to be consistent with the global SeqCst ordering, is to me quite on point. Thanks.
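
To make that last point concrete, the classic store-buffering litmus test is exactly the case where only seq_cst suffices (a sketch in the same fragment style as the question; the names x, y, r1, r2 are made up):

#include <stdatomic.h>

// globals
static atomic_int x = 0;
static atomic_int y = 0;
static int r1, r2;

// thread 1
atomic_store_explicit(&x, 1, memory_order_seq_cst);
r1 = atomic_load_explicit(&y, memory_order_seq_cst);

// thread 2
atomic_store_explicit(&y, 1, memory_order_seq_cst);
r2 = atomic_load_explicit(&x, memory_order_seq_cst);

// After both threads finish, r1 == 0 && r2 == 0 is impossible: all four
// operations belong to the single total order S, and whichever store is
// first in S must be seen by the other thread's load. Downgrade everything
// to release stores and acquire loads (which together cover every barrier
// kind *except* StoreLoad) and each CPU may satisfy its load before its own
// store has drained from the store buffer, so r1 == 0 && r2 == 0 is allowed.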
