
In C++, I have two threads. Each thread first stores to one variable and then loads the other, with the order of the two variables reversed between the threads:

std::atomic<bool> please_wake_me_up{false};
uint32_t cnt{0};

void thread_1() {
    std::atomic_ref atomic_cnt(cnt);

    please_wake_me_up.store(true, std::memory_order_seq_cst);
    atomic_cnt.load(std::memory_order_seq_cst); // <-- Is this line necessary or can it be omitted?
    futex_wait(&cnt, 0); // <-- The performed syscall must read the counter.
                         //     But with which memory ordering?
}

void thread_2() {
    std::atomic_ref atomic_cnt(cnt);

    atomic_cnt.store(1, std::memory_order_seq_cst);
    if (please_wake_me_up.load(std::memory_order_seq_cst)) {
        futex_wake(&cnt);
    }
}

Full code example: Godbolt.

If all four atomic accesses are performed with sequential consistency, it's guaranteed that at least one thread will see the other thread's store when performing its load. This is what I want to achieve.
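
This is the classic store-buffering (Dekker) pattern. As a minimal standalone sketch of the guarantee (the function name and iteration count are my own, not from the question), the forbidden outcome is both threads reading 0:

```cpp
#include <atomic>
#include <thread>

// Store-buffering (Dekker) check: under seq_cst, r1 == 0 && r2 == 0
// is forbidden, i.e. at least one thread must see the other's store.
// Returns true if the forbidden outcome was ever observed.
bool forbidden_outcome_seen(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        std::thread t1([&] { x.store(1, std::memory_order_seq_cst);
                             r1 = y.load(std::memory_order_seq_cst); });
        std::thread t2([&] { y.store(1, std::memory_order_seq_cst);
                             r2 = x.load(std::memory_order_seq_cst); });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) return true;  // forbidden under seq_cst
    }
    return false;
}
```

With weaker orderings (e.g. both accesses relaxed, or store-release/load-acquire), the r1 == r2 == 0 outcome becomes legal, which is exactly the failure mode the question wants to rule out.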

As the futex syscall must internally load the variable it operates on, I'm wondering if I can omit the (duplicated) load right before the syscall.
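
For concreteness, the futex_wait / futex_wake helpers used in the snippet could look like this on Linux (a sketch using the raw syscall; the wrapper names match the question's hypothetical helpers):

```cpp
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>

// Sketch of the helpers assumed in the question. FUTEX_WAIT blocks only
// if *addr still equals `expected` at the time of the syscall; the kernel
// re-reads the futex word itself, which is exactly the load this
// question asks about. If *addr != expected, it returns immediately
// with EAGAIN.
static long futex_wait(uint32_t* addr, uint32_t expected) {
    return syscall(SYS_futex, addr, FUTEX_WAIT, expected,
                   nullptr, nullptr, 0);
}

// Wakes up to one waiter blocked on addr; returns the number woken.
static long futex_wake(uint32_t* addr) {
    return syscall(SYS_futex, addr, FUTEX_WAKE, 1,
                   nullptr, nullptr, 0);
}
```

(For futexes used only within one process, FUTEX_WAIT_PRIVATE / FUTEX_WAKE_PRIVATE would avoid some kernel overhead; the plain ops are used here to match the question.)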

  • Every syscall should lead to a compiler memory barrier, right?
  • Do syscalls in general act like full memory barriers?
  • As the futex syscall is guaranteed to read the counter, is it safe to omit the marked line? Is there any guarantee the load inside the syscall occurs with sequential consistency?
  • If the line is necessary, would a std::atomic_thread_fence(std::memory_order_seq_cst) be better, since I don't need the value, just the fence?

If the answer to the question is architecture-specific, I would be interested in x86_64 and arm64.

2 Answers


Any syscall is a compiler barrier, like any non-inline function call.

They're not necessarily full barriers against runtime reordering, though they may well be in practice, especially since they usually take long enough that the store buffer would probably have time to drain on its own. (Especially with Spectre and MDS mitigations in place, which on x86 run extra microcode to flush state, taking many extra cycles between reaching the syscall entry point and actually dispatching to a kernel function.)

atomic_thread_fence is probably worse: on x86-64 it would be an extra mfence or dummy locked operation, while an atomic load is basically free since the line will normally still be hot in L1d from the xchg used for the seq_cst store.
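
To make the two alternatives being compared concrete, here is a sketch of thread_1's synchronization point in both forms (function names are mine; cnt is shown as a plain std::atomic for self-containment, standing in for the atomic_ref in the question):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<bool> please_wake_me_up{false};
std::atomic<uint32_t> cnt{0};

// Variant A (recommended above): SC store followed by a dummy SC load.
// On x86-64 the load compiles to a plain mov, and the line is typically
// still hot in L1d after the xchg store, so it is nearly free.
uint32_t variant_a() {
    please_wake_me_up.store(true, std::memory_order_seq_cst);
    return cnt.load(std::memory_order_seq_cst);  // value then passed to futex_wait
}

// Variant B: SC store followed by a standalone SC fence. On x86-64 this
// costs an extra mfence or dummy locked operation on top of the store.
void variant_b() {
    please_wake_me_up.store(true, std::memory_order_seq_cst);
    std::atomic_thread_fence(std::memory_order_seq_cst);
}
```

Comparing the generated code for both variants on Godbolt for x86-64 and AArch64 makes the cost difference discussed here directly visible.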

On AArch64 stlr / ldar is still sufficient: the reload can't happen until the store commits to cache, and is itself an acquire load. So yes it will keep all later loads/stores (including of cnt by the futex system call) after please_wake_me_up.store. It should be no worse than a stand-alone full barrier, which would have to drain all previous stores from the store buffer, not just stlr seq_cst / release stores. Earlier cache-miss stores could potentially still be in flight... except that stlr is a release store so all earlier loads and stores need to be completed before it can commit.

If anything in the kernel uses an ldar (instead of ARMv8.3 ldapr just acquire not seq_cst), then you'd still be safe and more work could get into the pipeline while waiting for the please_wake_me_up.store to drain from the store buffer. But there's no guarantee that's safe, unfortunately; the futex man page doesn't say it does a seq_cst load.


5 Comments

Now I think that, at least for x86_64, the extra load is definitely not needed: every load there is a seq_cst load, so the load inside the kernel is guaranteed to be seq_cst. For AArch64, the kernel could do a plain ldr and return early from the syscall, and there seems to be no guarantee it uses ldar internally. So the manual load is necessary to ensure correctness.
@sedor: That's correct (unless Belal Anas's answer is correct that futex wait specifically does a seq_cst load, which is very plausible; I only checked the man page). If you're going to use atomic_thread_fence anyway, you could make the store relaxed. That's a lot less bad on x86-64 than SC store + SC fence = two locked operations; mov + a dummy lock add byte [rsp], 0 is only slightly more expensive than xchg. Either way draining store buffer before syscall. On AArch64, though, just an stlr would be a lot better than store + SC fence if futex_wait is guaranteed to use ldar.
@sedor: Actually I forgot, you're just using a reload of the same variable, not atomic_thread_fence. SC store + SC load should be at least as cheap as relaxed store + SC fence on x86-64 and probably AArch64. Probably not on PowerPC where even SC loads require significant fencing.
I checked the kernel code: the futex-wait code path arrives at futex_wait_setup. From there the value seems to be read in up to two places, but always via futex_read_inatomic. The comments say it's an atomic userspace read, whatever that means. From there it gets complicated, and for arm64 an ldtr seems to be used in the end. But I'm not sure about that, and especially whether I have overlooked a barrier in between.
@sedor: Yeah, no clear documentation that it's an SC load. Still waiting on BelalAnas to edit their answer with support for their claim. ldtr is just an ldr that's allowed to access user pages when in kernel mode with hardening flags set. (It's atomic because get_futex_key checked it was aligned, according to comments on inatomic.) Not looking good for any future-proof guarantee of safety, even if it happens to be safe in practice currently because of some implementation detail. Unless there's explicit documentation somewhere that we haven't noticed.

In your C++ code you can likely remove the explicit load of cnt before the futex_wait call. The futex_wait syscall internally performs a load of the futex word cnt with sequential consistency. This ensures the atomicity of checking the futex word's value and the blocking action, as well as properly sequencing them, which is no different from std::memory_order_seq_cst's guarantees.

Why does this work?

- Futex operations are internally ordered, and the load inside futex_wait provides the necessary memory synchronization.

- This eliminates the need for the explicit initial load, and your code remains correct and optimal.

So:

- do not use the explicit load,

- assume sequential consistency of futex_wait.

1 Comment

The futex_wait syscall internally performs a load of the futex word cnt with sequential consistency. - Is that documented anywhere? Like in the source code with a comment that it's intentional and won't change? Because that's the core of the question.
