Bizur - KV Consensus Algorithm consensus
11 March 2017
0 Abstract
- strong consistency
- Do not use Distributed Log(Paxos Zab Raft)
- main difference with Paxos-like consensus
- aimed at key-value data model, instead of general model
1 Introduction
1.2 drawbacks of distributed log
- dependency:
- real dependencies: entries of same content x
- false dependencies: succeeding entries
- failure detect, recovery take a long time
- need log compaction(snapshot or checkpoint)
2 Algorithm
- hash bucket: 1 bucket –> have many key-value
- bucket内顺序执行;
- bucket之间并发
- 每个bucket有主从(实际所有bucket一个leader)
- Replication bucket维度的
2.1 Leader election
procedure STARTELECTION
elect_id←electi_d+1
send PLEASEVOTE(elect id, self ) to all
if received ACKVOTE from majority then
is_leader ← true
procedure PLEASEVOTE(elect id, source)
if elect id > voted_elect_id then
voted_elect_id ← elect_id
leader ← source
reply ACKVOTE to source
else if elect id = voted_elect_id and
source = leader then
reply ACKVOTE to source
else reply NACKVOTE to source
Claim 1: 每个elect_id 最多只有一个 is_leader == true
2.2 replication
- bucket
- index
- version:
- elect_id
- counter
- write操作
- read操作: 需要ENSURERECOVERY
- leader change之后第一个read会recovery
- recovery: 会read, 从大多数中选出最高version的bucket, 然后write
# write
procedure WRITE(bucket) ⊲ By leader
bucket.ver.elect_id ← elect_id
bucket.ver.counter ← bucket.ver.counter + 1
send REPLICAWRITE(bucket, self ) to all
if received ACKWRITE from majority then
return true
else
is_leader ← false
return false
procedure REPLICAWRITE(bucket, source)
if bucket.ver.elect_id < voted_elect_id then
reply NACKWRITE to source
else
voted_elect_id ← bucket.ver.elect_id
leader ← source ⊲ “update” vote
local_buckets[bucket.index] ← bucket
reply ACKWRITE to source
# Read
procedure READ(index) ⊲ By leader
if not ENSURERECOVERY(index, elect_id) then
return ⊥
send REPLICAREAD(index, elect id, self ) to all
if received ACKREAD from majority then
return local_buckets[index]
else
is_leader ← false
return ⊥
procedure REPLICAREAD(index, elect_id, source)
if elect_id < voted_elec_id then
reply NACKREAD to source
else
voted_elect_id ← elect_id
leader ← source
reply ACKREAD(local_buckets[index]) to source
# Recovery
procedure ENSURERECOVERY(index, elect_id)
if elect_id = local_buckets[index].ver.elect_id then
return true
send REPLICAREAD(index, elect_id, self ) to all
if received ACKREAD from majority then
max_ver ← max{bucket.ver | received bucket}
bucket ← some bucket s.t. bucket.ver = max_ver
bucket.ver.elect_id ← elect_id
bucket.ver.counter ← 0
return WRITE(bucket)
else
is leader ← false
return false
Other
- key-value wrapper based on read write operation
- reconfiguration:
- two instances runing old and new
- with special reconfigure state
- old not serve, while new read will copy bucket if necessary
- some optimization
- scalability
QA
- 1 How to send bucket between nodes?
blog comments powered by Disqus