AFAIK, part of the reason we need the C++11 memory model (and its later revisions) is that we trade various guarantees for fast single-threaded execution, with only one main criterion: don't break single-threaded execution. How to achieve this is entirely up to CPU and compiler vendors. One of the tricks they are allowed to use is reordering writes: from the point of view of other parts of the system, those writes may not become visible strictly in source-code order.
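To make the write-reordering point concrete, here is a minimal C11 sketch of the classic message-passing pattern. The names publish, try_consume, payload and ready are illustrative only. The release store on ready keeps the earlier write to payload from becoming visible after the flag, and the matching acquire load ensures a reader that sees the flag also sees the payload.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Shared state: a plain payload plus an atomic "ready" flag. */
static int payload;
static atomic_bool ready = false;

/* Writer side: the release store forbids the payload write from
   being reordered after the flag store (as seen by an acquirer). */
void publish(int value) {
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Reader side: if the acquire load observes the flag, the payload
   written before the matching release store is guaranteed visible. */
bool try_consume(int *out) {
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

Called from one thread, try_consume simply fails before publish and succeeds after it; the ordering guarantee only matters when writer and reader run on different cores.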
My question is: in general, when we're dealing with low-level code, e.g. manipulating the interrupt-enable bit in mstatus/sstatus (in RISC-V terms), do we need some sort of barrier around these instructions too?
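For reference, this is roughly how an S-mode RISC-V kernel might toggle interrupts; SIE is bit 1 of sstatus (mask 0x2). The helper names intr_off, intr_on and intr_enabled are hypothetical. The "memory" clobber turns each asm statement into a compiler barrier, so surrounding loads and stores are not moved across it. On non-RISC-V hosts the sketch falls back to a simulated flag plus atomic_signal_fence (a compiler-only barrier) purely so it compiles and can be tested; that branch is a stand-in, not kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SSTATUS_SIE 0x2UL /* sstatus.SIE is bit 1 */

#if defined(__riscv)
/* Clear/set sstatus.SIE. The "memory" clobber stops the compiler from
   moving memory accesses across the CSR write. */
static inline void intr_off(void) { __asm__ volatile("csrci sstatus, 2" ::: "memory"); }
static inline void intr_on(void)  { __asm__ volatile("csrsi sstatus, 2" ::: "memory"); }
static inline bool intr_enabled(void) {
    unsigned long s;
    __asm__ volatile("csrr %0, sstatus" : "=r"(s));
    return (s & SSTATUS_SIE) != 0;
}
#else
/* Hypothetical host stand-in: a flag plus a compiler-only barrier,
   playing the same role as the "memory" clobber above. */
static bool sim_sie = true;
static inline void intr_off(void) { sim_sie = false; atomic_signal_fence(memory_order_seq_cst); }
static inline void intr_on(void)  { atomic_signal_fence(memory_order_seq_cst); sim_sie = true; }
static inline bool intr_enabled(void) { return sim_sie; }
#endif
```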
Specifically, suppose an OS kernel wants to implement a generic spinlock. A classical approach to acquiring the spinlock is to loop on an RMW operation (while(!compare_exchange_weak_explicit(/* snip */)), etc.) to set the lock; before that, the OS should disable interrupts to avoid deadlocks. On success, the compare-and-exchange is itself a store. So, to ensure that interrupts are disabled before the lock is acquired, should we enforce some memory ordering here? One option is memory_order_acq_rel for the lock acquisition instead of the typical memory_order_acquire, since for compare_exchange_weak_explicit, memory_order_acquire on success makes the store part effectively memory_order_relaxed. Another is a StoreStore barrier (in RISC-V terms, something like __asm__ volatile("fence w, w" : : :);). Or does a compiler fence suffice here because RISC-V already imposes some ordering constraints on CSR mstatus/sstatus writes? I'm not sure that's the case; if it is, I'd appreciate a source.
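A sketch of the acquire path described above in C11, using plain memory_order_acquire on success and no extra fence. It rests on an assumption worth checking against the RISC-V privileged spec: an interrupt is taken by the same hart, at an instruction boundary in program order, so for this deadlock what matters is only that intr_off precedes the CAS in program order and that the compiler keeps it there. When other harts observe the lock store has no bearing on this hart's interrupt state. The interrupt helpers and the irq_enabled flag are hypothetical host stand-ins; on RISC-V they would be csrci/csrsi on sstatus with a "memory" clobber.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's interrupt toggles; the
   signal fence is a compiler-only barrier, like a "memory" clobber. */
static bool irq_enabled = true;
static inline void intr_off(void) { irq_enabled = false; atomic_signal_fence(memory_order_seq_cst); }
static inline void intr_on(void)  { atomic_signal_fence(memory_order_seq_cst); irq_enabled = true; }

typedef struct {
    atomic_int locked; /* 0 = free, 1 = held */
} spinlock;

void spin_lock(spinlock *l) {
    intr_off(); /* before the CAS in program order; compiler may not hoist the CAS */
    int expected = 0;
    /* acquire on success: critical-section accesses cannot move above it;
       relaxed on failure: we just retry. */
    while (!atomic_compare_exchange_weak_explicit(&l->locked, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed)) {
        expected = 0; /* a failed CAS stored the observed value here */
    }
}

void spin_unlock(spinlock *l) {
    /* release: critical-section writes are visible before the lock reads as free */
    atomic_store_explicit(&l->locked, 0, memory_order_release);
    intr_on();
}
```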
… "memory" clobber, allowing compile-time reordering with loads/stores. That makes it mostly useless as a memory barrier in high-level source code.

… volatile and inline asm to roll your own instead of using C++11, that means you need a "memory" clobber on the asm statement that enables or disables, but you don't need a separate fence instruction AFAIK.

… lr.w.aq and sc.w), and using exchange (single-instruction amoswap.w.aq).

… intr_on/intr_off to the other, you are dead. The machine architecture can't help you with that.
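The answer appears to contrast a compare-exchange loop (lr.w.aq and sc.w) with exchange (a single amoswap.w.aq). A minimal test-and-set lock built on exchange might look like this. The type and function names are hypothetical, and atomic_int is used on the assumption that a 32-bit swap lets the compiler emit amoswap.w.aq rather than a masked LR/SC sequence; the exact instructions depend on compiler version and -march flags, so check the disassembly. Interrupt disabling is left out to keep the focus on the lock itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int held; /* 32-bit so an amoswap.w can be used; 0 = free */
} xchg_lock;

/* Spin until we swap in 1 and the previous value was 0. */
void xchg_lock_acquire(xchg_lock *l) {
    while (atomic_exchange_explicit(&l->held, 1, memory_order_acquire) != 0)
        ;
}

/* Single attempt: succeeds only if the lock was free. */
bool xchg_lock_try(xchg_lock *l) {
    return atomic_exchange_explicit(&l->held, 1, memory_order_acquire) == 0;
}

void xchg_lock_release(xchg_lock *l) {
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

Unlike compare-exchange, exchange never needs a retry inside the primitive itself, which is why it can map to one AMO instead of an LR/SC loop.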