6.824 2015 Lecture 12: Eventual Consistency
Note: These lecture notes were slightly modified from the ones posted on the 6.824 course website from Spring 2015.
Exam
- Bring papers and lecture notes for exam
Bayou: Eventual consistency
- a set of copies of the data, where applications can use any copy of the data
- local read/write
- even if the network breaks, I can still use the local copy
- disconnected operation
- ad-hoc synchronization
- laptop, phone, tablet can synchronize amongst each other instead of relying on Internet connection
- can work with database servers that have different data and synchronize with each other
- similar to Ficus, but Bayou has more sophisticated conflict resolution
Conflicts
- what to do about the inevitable conflicts that happen when you allow people to write to their local copies and synchronize them later
Meeting room scheduler
Traditional approach (central server):
PDA
|-----------------
|9am 824 staff |----------------|
|-- | Server |
|10am -------------> | DB | |
|-- | 9am | |
|11am | 10am | |
|-- |
|12pm |
|--
Not a good approach because it requires everyone to have connectivity to the server.
Would be nice if you have PDA send appointment to laptop, which can then send it to the server.
PDA
|-----------------
|9am 824 staff |----------------|
|-- | Server |
|10am | DB | |
|-- | 9am | | <-----\
|11am | 10am | | \
|-- | |
|12pm | laptop
|-- \ /
\----------------------------------->/
Update functions
Main idea: Update functions. Instead of the application saying "write this DB record", the application hands a function that behaves differently based on what's in the DB.
Example:
- if free at 10am
- reserve @10am
- else if free at 9am
- reserve @9am
- else
- reserve
Bayou takes this function from the PDA and gives it to the laptop.
Suppose A and B want the same times:
- A wants: either staff meeting at 10 or 11
- B wants: hiring meeting at 10 or 11
If you simply apply these functions to node A's and B's databases, that's not enough:
- X syncs with A: X gets 10am staff meeting
- X syncs with B: X gets 11am hiring meeting
- Y syncs with B: Y gets 10am hiring meeting
- Y syncs with A: Y gets 11am staff meeting
- Bad: now X and Y have differing views
=>
have to execute A
's and B
's update functions in the same order
Numbering updates
Next idea: number update functions, so that you can view them as being a log
- Classic way to order things is to stamp them with numbers and sorting
- initially let the Bayou update ID be
<time T, nodeId>
- possible for time
T
to be the same for two update IDs, but then the node IDs will differ (presumably)
- possible for time
- ordering rules:
a < b
ifa.T < b.T
ora.T == b.T and a.ID < b.ID
If we take the previous example:
<T=10, nodeId=A>, A wants: either staff meeting at 10 or 11
<T=20, nodeId=B>, B wants: hiring meeting at 10 or 11
- When Y syncs with B and then with A, it'll see A's update occurred earlier
- so it undoes B's update, applies A's and then B's again
We need to be able to roll back and re-execute the log.
Are the updates consistent with causality?
- PDA A adds a meeting
- A synchronizes with B
- B deletes A's meeting
If some 3rd node sees these updates, it would be necessary to have the meeting creation timestamp be smaller than the deletion timestamp.
Lamport logical clock
Each node maintains T_max
, the highest timestamp this node has ever seen
from itself or from another node.
When a node creates an event and adds it to the log, it picks timestamp T =
max (T_max + 1, wall clock time)
- new timestamps are always higher than timestamps the node has ever seen
Tentative entries, commit scheme
It's annoying that entries in the calendar are always displayed as tentative because another (earlier) update could come in and replace it.
- maybe because the new update sender was disconnected for a long time
We're looking for a way to all agree that anything above a certain point in the log will never change (it's frozen, no one can modify stuff there)
Bad idea: One possibility is to have all the replicas exchange summary w/ each other about what they've seen:
- X has seen all A's updates through 20, B's through 17, and C's through 72
- these are timestamps (logical clocks)
- we know that X will never create a timestamp less than 72
- similarly, node Y also has a min timestamp that he will generate next
- say 30
- we can take the minimum over all these minimums
min(30, 72) = 30
and commit all operations up to that point - problem is it requires every node to be up and connected to all other nodes
Commit scheme for Bayou
They have one magic node, a primary. Every update that passes through the primary,
the primary stamps it with a commit sequence number (CSN), the actual ordering number
becomes: <csn, T, node ID>
- primary does not wait for earlier updates (with smaller
T
) to arrive first, it just timestamps things as they come - commit preserves causal order
- commit does not preserve wall clock order
If you don't have a CSN: <-, T, nodeID>
, all commited operations are considered
to occur before uncommitted ones.
TODO: not clear what this example was supposed to show
- A's meeting created
- B's meeting created
- B synchronizes with C
- B synchronizes with A
- C synchronizes with primary
- primary applies CSN to A's op, but not B's
- B synchronizes with primary
Vector timestamps
Synchronization
- A has
<-, 10, X>
<-, 20, Y>
<-, 30, X>
<-, 40, X>
- B has
<-, 10, X>
<-, 20, Y>
<-, 30, X>
- A syncs with B
- sends a version vector to B describe which updates it has
from every node
- A:
[X 40, Y 20]
- (remember that the timestamps are always increased by senders)
- B:
[X 30, Y 20]
- If B compares A's VT with his, he notices that he needs updates by X between timestamp 30 and 40
- A:
- sends a version vector to B describe which updates it has
from every node
A new node joins
Now some VTs will have an entry for some new node Z. For instance, in the previous example
- A can send [X 40, Y 20, Z 60] to B
We also need a way to remove nodes.
But B won't know if Z
is newly added or newly deleted?
- Z joins the system
- Z talks to X
- X generates Z's unique node ID
- Z's ID =
<Tz, X's node ID>
, where Tz is the time Z talked to X
- Z's ID =
X sends an update timestamped with
<-, Tz, X>
that says "new server z"Everybody will see this first before seeing Z's updates
- Z's updates have timestamps higher than Tz
note that IDs are unbounded in size
Forgetting nodes:
- Z's ID =
<20, X>
- A syncs -> B
- A has log entry from Z
<-, 25, <20, X>>
- B has no VT entry for Z
Now B needs to figure out from A's updates if Z was added or removed
Case 1: If B's VT entry for X
is less than the timestamp in Z
's ID, then
that means that B
hasn't even seen the creation for Z
, let alone any updates
from Z
=> B
should create the entry for Z
because Z
is new to B
Case 2: If B's VT entry for X
is higher than the timestamp in Z
's ID, (ie.
B has seen updates from X
after it created Z
), then B must've seen Z
's
creation =>
B must have seen a deletion notice
Q: If Z's entry is missing from B
then Z
(probably?) says <-, T, Z> bye, T > Tz
6.824 notes
Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System Terry, Theimer, Petersen, Demers, Spreitzer, Hauser, SOSP 95
Some material from Flexible Update Propagation for Weakly Consistent Replication, SOSP 97
Why this paper?
- Eventual consistency is pretty common
- git, iPhone sync, Dropbox, Amazon Dynamo
- Why do people like eventual consistency?
- fast read/write of local copy (no primary, no paxos)
- disconnected operation
- What goes wrong?
- doesn't look like "single copy" (no primary, no paxos)
- conflicting writes to different copies
- how to reconcile them when discovered?
- Bayou has the most sophisticated reconciliation story
Paper context:
- Early 1990s (like Ficus)
- Dawn of PDAs, laptops, tablets
- H/W clunky but clear potential
- Commercial devices did not have wireless
- Devices might be off or not have network access
- This problem has not gone away!
- iPhone sync, Dropbox sync, Dynamo
Let's build a conference room scheduler
- Only one meeting allowed at a time (one room).
- Each entry has a time and a description.
- We want everyone to end up seeing the same set of entries.
Traditional approach: one server
- Server executes one client request at a time
- Checks for conflicting time, says yes or no
- Updates DB
- Proceeds to next request
- Server implicitly chooses order for concurrent requests
Why aren't we satisfied with central server?
- I want to use scheduler on disconnected iPhone &c
- So need DB replica in each node.
- Modify on any node, as well as read.
- Periodic connectivity to net.
- Periodic direct contact with other calendar users (e.g. bluetooth).
Straw man 1: merge DBs
- Similar to iPhone calendar sync, or file sync.
- May need to compare every DB entry -- lots of time and net b/w.
- Still need a story for conflicting entries, i.e. two meetings at same time.
- User may not be available to decide at time of DB merge.
- So need automatic reconciliation.
Idea for conflicts: update functions
- Application supplies a function, not a new value.
- Read current state of DB, decide best change.
- E.g. "Meet at 9 if room is free at 9, else 10, else 11."
- Rather than just "Meet at 9"
- Function can make reconciliation decision for absent user.
- Sync exchanges functions, not DB content.
Problem: can't just apply update functions to DB replica
- A's fn: staff meeting at 10:00 or 11:00
- B's fn: hiring meeting at 10:00 or 11:00
- X syncs w/ A, then B
- Y syncs w/ B, then A
- Will X put A's meeting at 10:00, and Y put A's at 11:00?
Goal: eventual consistency
- OK for X and Y to disagree initially
- But after enough syncing, all nodes' DBs should be identical
Idea: ordered update log
- Ordered log of updates at each node.
- Syncing == ensure both nodes have same updates in log.
- DB is result of applying update functions in order.
- Same log
=>
same order=>
same DB content.
How can nodes agree on update order?
- Update ID:
<time T, node ID>
- T is creating node's wall-clock time.
- Ordering updates a and b:
a < b
ifa.T < b.T
or (a.T = b.T
anda.ID < b.ID
)
Example:
<10,A>: staff meeting at 10:00 or 11:00
<20,B>: hiring meeting at 10:00 or 11:00
what's the correct eventual outcome?
the result of executing update functions in timestamp order
staff at 10:00, hiring at 11:00
What DB content before sync?
- A: staff at 10:00
- B: hiring at 10:00
- This is what A/B user will see before syncing.
Now A and B sync with each other
- Each sorts new entries into its log, order by time-stamp
- Both now know the full set of updates
- A can just run B's update function
- But B has already run B's operation, too soon!
Roll back and replay
- B needs to to "roll back" DB, re-run both ops in the right order
- Big point: the log holds the truth; the DB is just an optimization
- We will optimize roll-back in a bit
Displayed meeting room calendar entries are "tentative"
- B's user saw hiring at 10, then it changed to hiring at 11
Will update order be consistent with wall-clock time?
- Maybe A went first (in wall-clock time) with
<10,A>
- Node clocks unlikely to be perfectly synchronized
- So B could then generate <9,b>
- B's meeting gets priority, even though A asked first
- Not "externally consistent"
Will update order be consistent with causality?
- What if A adds a meeting,
- then B sees A's meeting,
- then B deletes A's meeting.
- Perhaps
<10,A> add
<9,B> delete
-- B's clock is slow
- Now delete will be ordered before add!
Lamport logical clocks for causal consistency
- Want to timestamp events s.t.
- if node observes E1, then generates E2, then
TS(E2) > TS(E1)
- if node observes E1, then generates E2, then
- So all nodes will order E1, then E2
Tmax
= highest time-stamp seen from any node (including self)T = max(Tmax + 1, wall-clock time)
-- to generate a timestamp- Note properties:
- E1 then E2 on same node
=> TS(E1) < TS(E2)
- BUT
TS(E1) < TS(E2)
does not imply E1 came before E2
- E1 then E2 on same node
Logical clock solves add/delete causality example
- When B sees
<10,A>
,- B will set its Tmax to 10, so
- B will generate
<11,B>
for its delete
Irritating that there could always be a long-delayed update with lower TS
- That can cause the results of my update to change
- User can never be sure if meeting time is final!
- Would be nice if updates were eventually "stable"
=>
no changes in update order up to that point=>
results can never again change -- you know for sure when your meeting is=>
don't have to roll back, re-run committed updates
Bad idea: a fully decentralized "commit" scheme
- Proposal:
<10,A>
is stable if all nodes have seen all updates w/TS <= 10
- Have sync always send in log order -- "prefix property"
- If you have seen updates w/
TS > 10
from every node- Then you'll never again see one
< <10,A>
- So
<10,A>
is stable
- Then you'll never again see one
- Why doesn't Bayou do this?
- Not all nodes are connected to each other
How does Bayou commit updates, so that they are stable?
- One node designated "primary replica".
- It marks each update it receives with a permanent CSN.
- Commit Sequence Number.
- That update is committed.
- So a complete time stamp is
<CSN, local-time, node-id>
- Uncommitted updates (are considered to) come after all committed updates w/ this new timestamping scheme
- CSN notifications are synced between nodes.
- The CSNs define a total order for committed updates.
- All nodes will eventually agree on it.
Will commit order match tentative order?
- Often.
- Syncs send in log order (prefix property)
- Including updates learned from other nodes.
- So if A's update log says
<-,10,X>
<-,20,A>
- A will send both to primary, in that order
- Primary will assign CSNs in that order
- Commit order will, in this case, match tentative order
Will commit order always match tentative order?
- No: primary may see newer updates before older ones.
- A has just:
<-,10,A> W1
- B has just:
<-,20,B> W2
- If
C
sees both, C's order:W1 W2
- B syncs with primary, gets
CSN=5
. - Later A syncs w/ primary, gets
CSN=6
. - When C syncs w/ primary, its order will change to
W2 W1
<5,20,B> W1
<6,10,A> W2
- So: committing may change order.
Committing allows app to tell users which calendar entries are stable.
- A stable meeting room time is final.
Nodes can discard committed updates.
- Instead, keep a copy of the DB as of the highest known CSN
- Roll back to that DB when replaying tentative update log
- Never need to roll back farther
- Prefix property guarantees seen
CSN=x => seen CSN<x
- No changes to update order among committed updates
- Prefix property guarantees seen
How do I sync if I've discarded part of my log?
- Suppose I've discarded all updates with CSNs.
- I keep a copy of the stable DB reflecting just discarded entries.
- When I propagate to node
X
:- If node X's highest CSN is less than mine,
- I can send him my stable DB reflecting just committed updates.
- Node X can use my DB as starting point.
- And X can discard all CSN log entries.
- Then play his tentative updates into that DB.
- If node X's highest CSN is greater than mine,
- X doesn't need my DB.
- If node X's highest CSN is less than mine,
- In practice, Bayou nodes keep the last few committed updates.
- To reduce chance of having to send whole DB during sync.
How to sync?
- A sending to B
- Need a quick way for B to tell A what to send
- Committed updates easy: B sends its CSN to A
- What about tentative updates?
- A has:
<-,10,X>
<-,20,Y>
<-,30,X>
<-,40,X>
- B has:
<-,10,X>
<-,20,Y>
<-,30,X>
- At start of sync, B tells A "X 30, Y 20"
- Sync prefix property means B has all X updates before 30, all Y before 20
- A sends all X's updates after
<-,30,X>
, all Y's updates after<-,20,X>
, &c - This is a version vector -- it summarize log content
- It's the "F" vector in Figure 4
- A's F:
[X:40,Y:20]
- B's F:
[X:30,Y:20]
How could we cope with a new server Z joining the system?
- Could it just start generating writes, e.g.
<-,1,Z>
? - And other nodes just start including Z in VVs?
- If A syncs to B, A has
<-,10,Z>
, but B has no Z in VV- A should pretend B's VV was
[Z:0,...]
- A should pretend B's VV was
What happens when Z retires (leaves the system)?
- We want to stop including Z in VVs!
- How to announce that Z is gone?
- Z sends update
<-,?,Z> "retiring"
- Z sends update
- If you see a retirement update, omit Z from VV
- How to deal with a VV that's missing Z?
- If A has log entries from Z, but B's VV has no Z entry:
- e.g. A has
<-,25,Z>
, B's VV is just[A:20, B:21]
- Maybe Z has retired, B knows, A does not
- Maybe Z is new, A knows, B does not
- e.g. A has
- Need a way to disambiguate: Z missing from VV b/c new, or b/c retired?
Bayou's retirement plan
- Z joins by contacting some server
X
- Z's ID is generated by X as
<Tz,X>
- Tz is X's logical clock as of when Z joined
- Note: unbounded ID size
- X issues
<-,Tz,X> "new server Z"
How does ID=<Tz,X>
scheme help disambiguate new vs forgotten?
- Suppose Z's ID is
<20,X>
- A syncs to B
- A has log entry from
Z <-,25,<20,X>>
- B's VV has no Z entry
- A has log entry from
- One case:
- B's VV:
[X:10, ...]
10 < 20
implies B hasn't yet seen X's "new server Z" update
- B's VV:
- The other case:
- B's VV:
[X:30, ...]
20 < 30
implies B once knew about Z, but then saw a retirement update
- B's VV:
Let's step back.
Is eventual consistency a useful idea?
- Yes: people want fast writes to local copies
- iPhone sync, Dropbox, Dynamo, Riak, Cassandra, &c
Are update conflicts a real problem?
- Yes -- all systems have some more or less awkward solution
Is Bayou's complexity warranted?
- I.e. log of update functions, version vectors, tentative operations
- Only critical if you want peer-to-peer sync
- I.e. both disconnected operation AND ad-hoc connectivity
- Only tolerable if humans are main consumers of data
- Otherwise you can sync through a central server (iPhone, Dropbox)
- Or read locally but send updates through a master (PNUTS, Spanner)
But there's are good ideas for us to learn from Bayou
- Update functions for automatic application-driven conflict resolution
- Ordered update log is the real truth, not the DB
- Logical clock for causal consistency