11 March 2017

0 Abstract

  • strong consistency
  • Do not use Distributed Log(Paxos Zab Raft)
  • main difference with Paxos-like consensus
    • aimed at key-value data model, instead of general model

1 Introduction

1.2 drawbacks of distributed log

  • dependency:
    • real dependencies: entries of same content x
    • false dependencies: succeeding entries
  • failure detect, recovery take a long time
  • need log compaction(snapshot or checkpoint)

2 Algorithm

  • hash bucket: 1 bucket –> have many key-value
    • bucket内顺序执行;
    • bucket之间并发
    • 每个bucket有主从(实际所有bucket一个leader)
    • Replication bucket维度的

2.1 Leader election

procedure STARTELECTION
    elect_id←electi_d+1
    send PLEASEVOTE(elect id, self ) to all
    if received ACKVOTE from majority then
        is_leader ← true
procedure PLEASEVOTE(elect id, source)
    if elect id > voted_elect_id then
        voted_elect_id ← elect_id 
        leader ← source
        reply ACKVOTE to source
    else if elect id = voted_elect_id and
            source = leader then
        reply ACKVOTE to source
    else reply NACKVOTE to source

Claim 1: 每个elect_id 最多只有一个 is_leader == true

2.2 replication

  • bucket
    • index
    • version:
      • elect_id
      • counter
  • write操作
  • read操作: 需要ENSURERECOVERY
    • leader change之后第一个read会recovery
  • recovery: 会read, 从大多数中选出最高version的bucket, 然后write
# write
procedure WRITE(bucket) ⊲ By leader 
    bucket.ver.elect_id ← elect_id
    bucket.ver.counter ← bucket.ver.counter + 1
    send REPLICAWRITE(bucket, self ) to all
    if received ACKWRITE from majority then
        return true
    else
        is_leader ← false
        return false
        
procedure REPLICAWRITE(bucket, source)
    if bucket.ver.elect_id < voted_elect_id then
        reply NACKWRITE to source
    else
        voted_elect_id ← bucket.ver.elect_id
        leader ← source             ⊲    “update” vote             
        local_buckets[bucket.index] ← bucket
        reply ACKWRITE to source
# Read
procedure READ(index)           ⊲ By leader 
    if not ENSURERECOVERY(index, elect_id) then
        return ⊥
    send REPLICAREAD(index, elect id, self ) to all 
    if received ACKREAD from majority then
        return local_buckets[index] 
    else
        is_leader ← false
        return ⊥

procedure REPLICAREAD(index, elect_id, source)
    if elect_id < voted_elec_id then 
    reply NACKREAD to source
    else
        voted_elect_id ← elect_id
        leader ← source
        reply ACKREAD(local_buckets[index]) to source
# Recovery
procedure ENSURERECOVERY(index, elect_id)
    if elect_id = local_buckets[index].ver.elect_id then
        return true
    send REPLICAREAD(index, elect_id, self ) to all
    if received ACKREAD from majority then
        max_ver ← max{bucket.ver | received bucket}
        bucket ← some bucket s.t. bucket.ver = max_ver
        bucket.ver.elect_id ← elect_id
        bucket.ver.counter ← 0
        return WRITE(bucket)
    else
        is leader ← false
        return false

Other

  • key-value wrapper based on read write operation
  • reconfiguration:
    • two instances runing old and new
    • with special reconfigure state
      • old not serve, while new read will copy bucket if necessary
  • some optimization
  • scalability

QA

  • 1 How to send bucket between nodes?


blog comments powered by Disqus