MPL-2.0 License

  • Partial membership for gossip clusters
  • [[https://www.semanticscholar.org/paper/HyParView%253A-A-Membership-Protocol-for-Reliable-Leit%C3%A3o-Pereira/a2562ede25e8ed2c7c1d888d72b625a526b3b25a][The Paper]] in production in [[http://partisan.cloud][Partisan]]
  • 10k node cluster requires 5 active and 30 passive peers in views at
    each node
  • Side goal: demonstrate separation of state changes from async or
    mutex behavior to improve testing options
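The side goal above can be illustrated with a minimal sketch. All names here (=View=, =Message=, =AddActive=) are hypothetical and are not taken from this library, and evicting the oldest active peer is a simplification chosen to keep the example deterministic. The state change is a plain method that returns the messages to send rather than sending them, so a test can drive it without goroutines, locks, or a network.

```go
package main

import "fmt"

// Message is a hypothetical outbound protocol message.
type Message struct {
	To   string
	Kind string
}

// View is a hypothetical membership state with no locking of its own.
// Callers own any synchronization; methods only mutate state and report
// the side effects they want performed.
type View struct {
	ActiveMax int
	Active    []string
	Passive   []string
}

// AddActive adds peer to the active view. If the view is full, the oldest
// active peer is demoted to the passive view and a disconnect message is
// returned for the caller to deliver.
func (v *View) AddActive(peer string) []Message {
	for _, p := range v.Active {
		if p == peer {
			return nil // already present, no state change
		}
	}
	var out []Message
	if len(v.Active) >= v.ActiveMax && len(v.Active) > 0 {
		evicted := v.Active[0]
		v.Active = v.Active[1:]
		v.Passive = append(v.Passive, evicted)
		out = append(out, Message{To: evicted, Kind: "disconnect"})
	}
	v.Active = append(v.Active, peer)
	return out
}

func main() {
	v := &View{ActiveMax: 2}
	v.AddActive("a")
	v.AddActive("b")
	// The caller decides how to deliver these: synchronously in a
	// simulator, or over the network behind a mutex in a server.
	for _, m := range v.AddActive("c") {
		fmt.Printf("%s -> %s\n", m.Kind, m.To)
	}
}
```

Because =AddActive= performs no I/O, the same method could back both a single-threaded simulation and a concurrent server that wraps each call in its own lock.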
* Running The Simulation

=brew install gnuplot=

=make simulation=

See =data= for the data input. Plotting output is in =plot=.

** Known bugs

  • The simulation consistently shows a number of asymmetric links. These would be repaired in a stable system that continued to shuffle, and the simulation should be extended to demonstrate that.

  • When running with only a single send attempt, failure recovery causes a stack overflow. The simulator send plugin should use an external stack that does not overflow so easily.
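The external stack suggested in the second bug can be sketched as follows. This is only an illustration: =msg=, =drain=, and the =deliver= callback are hypothetical names, not part of the simulator. Pending messages are kept in a slice on the heap, so the call depth stays constant however long the chain of follow-up messages grows.

```go
package main

import "fmt"

// msg is a hypothetical simulator message.
type msg struct {
	to  int
	hop int
}

// drain processes start and every follow-up message it triggers, using an
// explicit slice as the stack instead of nested calls.
// deliver is a hypothetical callback that handles one message and returns
// the follow-up messages that delivery produced.
func drain(start msg, deliver func(msg) []msg) int {
	stack := []msg{start}
	delivered := 0
	for len(stack) > 0 {
		m := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		delivered++
		stack = append(stack, deliver(m)...)
	}
	return delivered
}

func main() {
	// A chain of one million deliveries: each one triggers exactly one more.
	const depth = 1_000_000
	n := drain(msg{to: 0, hop: 0}, func(m msg) []msg {
		if m.hop+1 >= depth {
			return nil
		}
		return []msg{{to: m.to + 1, hop: m.hop + 1}}
	})
	fmt.Println(n)
}
```

Popping from the end of the slice gives depth-first order, which matches what nested calls would do. Popping from the front instead would process messages breadth-first.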

* Related Topics for Improving the Approach
  • Thicket ([[https://www.gsd.inesc-id.pt/~ler/reports/srds10.pdf][Thicket PDF]], also by Leitão and Rodrigues) describes a multiple spanning tree approach to sending application messages, which reduces wasted messages while remaining resilient to churn. The simulator currently implements very simple epidemic gossip transmission, where the active view size is the degree of fanout (novel messages are forwarded to all active peers). Waste and path length are plotted.

  • In the [[https://www.semanticscholar.org/paper/Gossip-based-peer-sampling-Jelasity-Voulgaris/b571ec0ac7173bcecfe1b3095af2f6a5232526a9][Peer Sample]] paper (which uses only the passive view), entries are tagged with a timestamp of the last direct active contact, and the shuffling algorithm is biased to penalize old entries. This allows a dead node to fall out of the passive views across the whole network more quickly. It may also have a detrimental effect on recovery if the network is partitioned.

  • Partition recovery in general needs more investigation. There may be a risk that partitioned networks would repair their respective active views and become tightly interconnected, so that repairing the partition would lead to two well connected networks joined by just a few nodes.
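The simple epidemic gossip described in the Thicket item above can be sketched as a flood over the active views. This is not the simulator's code: =flood= and its adjacency map are hypothetical, and skipping the sender when forwarding is an assumption of this sketch. Waste is counted as deliveries to a node that already has the message, and path length as the number of hops to a node's first delivery.

```go
package main

import "fmt"

// flood forwards one message from origin across active, a hypothetical
// adjacency list standing in for each node's active view. A node that sees
// the message for the first time forwards it to all of its active peers
// except the sender. It returns the number of redundant deliveries (waste)
// and the longest first-delivery path in hops.
func flood(active map[int][]int, origin int) (waste, maxHops int) {
	type delivery struct{ from, to, hops int }
	seen := map[int]bool{origin: true}
	queue := []delivery{}
	for _, p := range active[origin] {
		queue = append(queue, delivery{origin, p, 1})
	}
	for len(queue) > 0 {
		d := queue[0]
		queue = queue[1:]
		if seen[d.to] {
			waste++ // duplicate: the receiver already had the message
			continue
		}
		seen[d.to] = true
		if d.hops > maxHops {
			maxHops = d.hops
		}
		for _, p := range active[d.to] {
			if p != d.from {
				queue = append(queue, delivery{d.to, p, d.hops + 1})
			}
		}
	}
	return waste, maxHops
}

func main() {
	// A 4-node ring with symmetric links.
	ring := map[int][]int{0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {2, 0}}
	waste, hops := flood(ring, 0)
	fmt.Println("waste:", waste, "max hops:", hops)
}
```

A spanning tree approach such as Thicket aims to drive the waste count toward zero, at the cost of maintaining the trees under churn.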

* Improvements to the Code
  • Isolated nodes in simulation runs have asymmetric active views. Refactor the simulator to support periodic messages like active view keepalive and shuffle from the perspective of each node.

  • Incorporate =failActive= directly into the code. Because it relies on synchronous low priority =Neighbor= requests, it requires a sort of iterator interface (or some other refactoring).

  • Minor, dependent on research results: the view constructor takes the expected number of peers as an argument, and should use that to set the appropriate default configuration once its factors are known beyond the values given in the paper.
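A sketch of what the last item might look like, under loudly stated assumptions: =defaultViewSizes= is a hypothetical helper, and the rule it uses (active view of floor(log10(n)) + 1, passive view six times larger) is chosen here only because it reproduces the 5 active and 30 passive peers quoted for a 10k node cluster. Whether those factors are right for other cluster sizes is exactly the open question.

```go
package main

import "fmt"

// defaultViewSizes is a hypothetical helper, not part of this library.
// Assumption: the active view holds floor(log10(n)) + 1 peers, which is the
// number of decimal digits in n, and the passive view is 6 times larger.
// Both constants are picked to match the 10k node figures only.
func defaultViewSizes(expectedPeers int) (active, passive int) {
	if expectedPeers < 1 {
		expectedPeers = 1
	}
	// Count decimal digits with integer arithmetic to avoid floating
	// point rounding at exact powers of ten.
	active = 1
	for n := expectedPeers; n >= 10; n /= 10 {
		active++
	}
	passive = 6 * active
	return active, passive
}

func main() {
	for _, n := range []int{100, 10_000, 1_000_000} {
		a, p := defaultViewSizes(n)
		fmt.Printf("n=%d active=%d passive=%d\n", n, a, p)
	}
}
```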

* Live Test

There's an example secure gRPC application of this library at [[https://github.com/hashicorp/hyparview-example][hyparview-example]]. It will be used for a live test on hardware.