Skip to main content
added 1248 characters in body
Source Link
J_H
  • 8k
  • 1
  • 18
  • 28

To clarify, I'm not contemplating sending NTP synchronized wallclock readings anywhere in this, as they only cause trouble when trying to reason about causality. When I speak of a Network Clock or a Lamport Clock, that is in the generally accepted sense of some server locally incrementing a sequence number and sending that number as its clock. Essentially time ticks forward by one with every message sent by that server.

Suppose that Alice's server A was at sequence number 7, and Bob's server B was at 1003. Then we might see messages like

  • A: (7, launch_missle)
  • B: (1003, missle_launched)
  • A: (8, acknowledged)

We tend to think of wallclock time as global. Network clocks are always w.r.t. the sending server. So we might speak of these three observed network clocks:

  • (A, 7)
  • (B, 1003)
  • (A, 8)

Viewing a number like 7 or 8 in isolation isn't meaningful, because a distributed system doesn't really have any "global" timestamps. We always mention the sender when logging a network clock or when sending it in a message. It is legitimate for Bob to forward Alice's (A, 7) onward to Charlie's server C, to show Charlie the causal chain.


To clarify, I'm not contemplating sending NTP synchronized wallclock readings anywhere in this, as they only cause trouble when trying to reason about causality. When I speak of a Network Clock or a Lamport Clock, that is in the generally accepted sense of some server locally incrementing a sequence number and sending that number as its clock. Essentially time ticks forward by one with every message sent by that server.

Suppose that Alice's server A was at sequence number 7, and Bob's server B was at 1003. Then we might see messages like

  • A: (7, launch_missle)
  • B: (1003, missle_launched)
  • A: (8, acknowledged)

We tend to think of wallclock time as global. Network clocks are always w.r.t. the sending server. So we might speak of these three observed network clocks:

  • (A, 7)
  • (B, 1003)
  • (A, 8)

Viewing a number like 7 or 8 in isolation isn't meaningful, because a distributed system doesn't really have any "global" timestamps. We always mention the sender when logging a network clock or when sending it in a message. It is legitimate for Bob to forward Alice's (A, 7) onward to Charlie's server C, to show Charlie the causal chain.

Source Link
J_H
  • 8k
  • 1
  • 18
  • 28

Your contrived example doesn't appear to be about GAAP finance accounting, and is just trying to get at causality in a distributed system. So I will ignore the XY aspect and move on, inventing some tighter assumptions for the example.

Thank you for the narrative description of the problem setup; that's good. However, we don't quite see what the Invariants and the Correctness Requirements are, so I will flesh those out. Typically it is reasoning about invariants that will let us decide whether a distributed system is correct or is buggy / racy.

A (not shown) actor will consume the Account Balance channel and take actions like sending a statement via Postal Service or via e-mail, for January and later for February. That balance consumer should avoid sending twice in a month, and it must send an accurate account balance figure which goes with the given month. The consumer can have a local state machine which helps it suppress duplicate sends, and helps it retry / recover when it reboots or other exciting events happen, like healing of a network partition.

The chief difficulty with your "Correct" and "Incorrect Order" examples is that those messages omit info which the balance consumer would need in order to properly interpret their meaning. You have stripped timestamp / causality information from them. So let's revisit the design of this Public API and its Messages.

Actor hears a message and in response to that sends an Income message. The income message should mention the clock message which triggered it, perhaps as the tuple:
(end_of_January, 1000).

When the Account Service sends a message, it should mention the clock message that triggered it, as well:
(start_of_February, (end_of_January, 1000))

We're worried about this racy sequence of arrivals heard by the Account Service:

  • Clock: start_of_February (what?!? no income?), followed by
  • Actor: 1000

Depending on the business rules of your Use Case, and CAP requirements, you may wish for the Account Service to be in an "inhibited" state until all Actors have sent in an end_of_January income update. Keep that Clock message buffered, ready to be delivered as soon as we're no longer inhibited.


As a practical matter, many distributed system servers will locally increment a serial number and send that as a message ID. That lets peers conveniently summarize "I have seen all of Alice's message up through serial 7", or perhaps holes such as "I have seen through 7, and also 9, but not yet 8". TCP SACK does this all the time.

To properly interpret and act on a message, an actor like the Balance Consumer may need to see several IDs in that message. Perhaps IDs from Clock, Actor, Clock, and Account Balance. We might call this a Vector Clock, or Lamport Clock. A well-designed Account Balance service might offer the computational service of eliminating some race ambiguity, such as by being inhibited or by retrying, and that can let it get away with sending smaller messages containing fewer upstream IDs.