Consider this outline of a simple multi-threaded application. It has one writer thread and ten reader threads.

    #include <atomic>
    #include <thread>
    
    const int Num_readers{10};
    
    std::atomic<bool> write_lock;
    
    std::atomic<bool> read_lock[Num_readers];
    
    void reader_thread(int reader_number)
    {
        auto &rl{read_lock[reader_number]};
    
        for (;;)
        {
            for (;;)
            {
                rl.store(true, std::memory_order_seq_cst); // #1
    
                if (not write_lock.load(std::memory_order_seq_cst)) // #2
                    break;
    
                // Avoid writer starvation.
                rl.store(false, std::memory_order_seq_cst); // #3
    
                while (write_lock.load(std::memory_order_relaxed)) // #4
                    std::this_thread::yield();
            }
    
            // Read stuff from writer.
        }
    }
    
    void writer_thread()
    {
        for (;;)
        {
            write_lock.store(true, std::memory_order_seq_cst); // #5
    
            bool have_lock;
            do
            {
                have_lock = true;
                for (const auto &rl : read_lock)
                    if (rl.load(std::memory_order_seq_cst)) // #6
                    {
                        have_lock = false;
                        std::this_thread::yield();
                        break;
                    }
            }
            while (not have_lock);
    
            // Write stuff for readers.
            
            write_lock.store(false, std::memory_order_seq_cst); // #7
    
            std::this_thread::yield();
        }
    }

(Assume the threads are static.) The purpose of the atomic variables is mutual exclusion of writing and reading. Could the memory order for any of the atomic loads or stores be made less strict?

  • Are you asking us to guess what this code does? Is writer_thread really supposed to inspect only two of the ten read_lock values? Commented Oct 1 at 0:35
  • You are casting std::atomic<bool> to int, which looks like a bug. Commented Oct 1 at 0:41
  • @DrewDormann: Well spotted. I assume for (int i : read_lock) { ... } was supposed to be a loop over the array, checking whether there are any readers (and spin-waiting until there aren't), because I think that would be a readers/writers version of Peterson's, not using their bool values as 0/1 indexes back into the same array like the code was doing. for (const auto &rl_entry : read_lock) { if (rl_entry) { ... } } or similar. Commented Oct 1 at 0:45
  • @DrewDormann Oops, sorry; fixed. Commented Oct 1 at 0:53
  • You can potentially use a single seq_cst fence before and/or after looping over all the read_lock[] entries, instead of making each load seq_cst. That would be a lot faster on PowerPC (where an SC load has to include a full barrier), and maybe on ARMv7 for the same reason, but slower on x86, where an SC load is just a plain mov instruction but an SC fence is slow. Commented Oct 1 at 15:34
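
A sketch of what that fence-based variant might look like in the writer's scan, reusing the question's globals. This is only an interpretation of the comment's suggestion; whether it is sufficient in the formal C++ model is exactly the kind of question being asked here, and it has not been verified:

    #include <atomic>

    constexpr int Num_readers{10};
    extern std::atomic<bool> write_lock;               // the question's globals
    extern std::atomic<bool> read_lock[Num_readers];

    // Hypothetical fence-based scan for the writer: one barrier instead
    // of a seq_cst load per array entry.
    bool readers_all_clear()
    {
        // Assumes write_lock.store(true, seq_cst) (#5) has just happened;
        // a single full barrier orders that store ahead of the loads below.
        std::atomic_thread_fence(std::memory_order_seq_cst);

        for (const auto &rl : read_lock)
            if (rl.load(std::memory_order_relaxed))
                return false;          // a reader holds (or is taking) its lock

        // Make the relaxed loads synchronize with the readers'
        // (release-or-stronger) stores of false:
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }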

1 Answer


Partial answer:

seq_cst is essential at least for the main pair of operations in each of the reader and the writer: "store true to my own lock, then load the other one to see if it's already taken".

We can see this by reducing to a simpler model where loads and stores can be reordered within a single thread unless blocked by barriers, but become globally visible in a consistent total order. In this model, a StoreLoad barrier is inserted only by seq_cst ordering. And if the main store and load could be reordered with one another, we are dead: each thread's load can complete before its own store is visible, so both sides can see the other's lock as free and enter their critical sections simultaneously.
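
Here is that failure as a minimal sketch, reduced to two symmetric threads with deliberately weakened (release/acquire) ordering; the names are illustrative, not from the question:

    #include <atomic>

    // Hypothetical two-flag reduction of the question's locking race.
    std::atomic<bool> flag_a{false}, flag_b{false};

    void thread_a()
    {
        flag_a.store(true, std::memory_order_release);   // no StoreLoad barrier here...
        if (not flag_b.load(std::memory_order_acquire))  // ...so this load may complete
        {                                                //    before the store is visible
            // "Critical section". With only acquire/release, thread_b can
            // simultaneously see flag_a == false and get here too:
            // mutual exclusion is broken.
        }
    }

    void thread_b()                                      // symmetric
    {
        flag_b.store(true, std::memory_order_release);
        if (not flag_a.load(std::memory_order_acquire))
        {
            // "Critical section".
        }
    }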

Another way to think about it is to ask: what is it fundamentally that determines who has the lock, when a reader and writer compete? It's whoever got their store in first. And the only part of the C++ memory model that provides a total order between two stores to different objects is the sequential consistency total order. (Happens-before and friends are only partial orders, and would allow for the two stores to be unordered with respect to each other, in which case there would be no way to infer who owns the lock.) So the stores used to take the locks have to be seq_cst. And the subsequent loads, which test if the other lock was taken first, need to observe this ordering between the stores, so they have to be seq_cst as well.


The stores that drop the locks after the critical section have to be release at minimum; otherwise the critical-section accesses could be reordered past the unlocking store, becoming visible after the lock already appears free. I still have to think about whether they need to be seq_cst.
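
Concretely, here is the hazard if the unlock at #7 were relaxed (shared_data is a hypothetical payload, not from the question):

    #include <atomic>

    std::atomic<bool> write_lock;   // as in the question
    int shared_data;                // hypothetical payload guarded by the lock

    void unlock_too_weakly()
    {
        shared_data = 42;                                   // critical-section write
        write_lock.store(false, std::memory_order_relaxed); // BAD: the payload write may
        // become visible only after the lock reads as free, so a reader that
        // sees write_lock == false can still read stale shared_data. A
        // release store forbids prior accesses from moving past it.
    }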

I also still need to think about the other store in reader_thread, the one that de-asserts its read lock (#3) when it finds the write lock already held. At first glance I don't see any obvious reason why it needs to be stronger than relaxed: it's already blocked from reordering with the preceding load of write_lock, because seq_cst loads are also acquire. But that's not a proof, of course.
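
To keep track, here is where this partial answer stands against the question's numbered operations (the open items are exactly the ones deferred above):

    // Candidate minimum orderings, per the reasoning above:
    //
    //   #1  rl.store(true)           seq_cst  (takes the read lock)
    //   #2  write_lock.load()        seq_cst  (checks the competing lock)
    //   #3  rl.store(false)          open     (relaxed at first glance; unproven)
    //   #4  write_lock.load()        relaxed  (pure spin-wait; already relaxed)
    //   #5  write_lock.store(true)   seq_cst  (takes the write lock)
    //   #6  read_lock[i].load()      seq_cst  (checks the competing locks)
    //   #7  write_lock.store(false)  release  (at minimum; seq_cst still open)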


5 Comments

To be explicit: a StoreLoad barrier is only guaranteed if both the store and the load use seq_cst. On x86, for example, a weaker store followed by a seq_cst load can reorder; on AArch64 it's seq_cst loads (ldar) that have a special interaction with a previous stlr (release or seq_cst store), while acquire loads (ldapr) don't, so SC stores can reorder with later non-SC loads. On some ISAs (PowerPC), each seq_cst load includes a full fence instruction, so it would be cheaper there to use one barrier before looping over relaxed loads. But that's worse on many other ISAs.
Actually the loads need to be at least acquire: they need to synchronize-with the release store from the other thread. So if you're on an ISA where each op needs a 2-way fence instruction, you could put an atomic_thread_fence(acquire) after a loop over relaxed loads.
I agree with your reasoning, and that's what I was thinking too for the minimum ordering required. Locking without atomic RMWs requires avoiding StoreLoad reordering. This algo is similar to en.wikipedia.org/wiki/Peterson%27s_algorithm but without a turn variable: here readers always defer to writers. (And one of the two threads in Peterson's is instead multiple readers, so we loop.) I always forget exactly what turn is doing in Dekker's and Peterson's, but if it's needed for correctness I'm a little suspicious about this algo which doesn't use it, but I don't see a problem.
Oh, I think turn is avoiding deadlock and/or starvation. But this algo avoids that a different way, by having readers retract their attempt to take the lock if the writer is waiting or locked. (But they only check the writer after having set their flag to true! So for the writer to get the lock, its loop needs to get lucky with timing for all 10 readers which are toggling their flags. This is really bad!)
This livelock could be fixed by having readers skip their store if write_lock.load(relaxed) is true, so they stay read-only while a writer is waiting or in its critical section. But then fairness is heavily tilted towards writers. OTOH with only one writer thread vs. many reader threads, a reader will probably get in there often enough. Generally this lock algorithm looks bad, with a lot of over-caution and not distinguishing between waiting vs. already in the critical section. But as far as correctness, yeah seq_cst for setting/checking flags, release for unlock.
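
A sketch of the reader-side change suggested in the last two comments; this is an interpretation of the suggestion, not tested code:

    #include <atomic>
    #include <thread>

    constexpr int Num_readers{10};
    extern std::atomic<bool> write_lock;               // the question's globals
    extern std::atomic<bool> read_lock[Num_readers];

    // Hypothetical reader: poll write_lock with a cheap relaxed load first,
    // and only assert the read flag once no writer appears to be waiting or
    // active. This addresses the flag-toggling livelock, at the cost of
    // tilting fairness towards the writer.
    void reader_thread(int reader_number)
    {
        auto &rl{read_lock[reader_number]};

        for (;;)
        {
            for (;;)
            {
                // Stay out of the writer's way while it's waiting or writing:
                while (write_lock.load(std::memory_order_relaxed))
                    std::this_thread::yield();

                rl.store(true, std::memory_order_seq_cst);           // #1

                if (not write_lock.load(std::memory_order_seq_cst))  // #2
                    break;                  // no writer raced in; safe to read

                rl.store(false, std::memory_order_seq_cst);          // #3: retract
            }

            // Read stuff from writer.
        }
    }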
