Immunizing Systems from Distant Failures by Limiting Lamport Exposure
Cristina Băsescu and Bryan Ford
Twentieth ACM Workshop on Hot Topics in Networks (HotNets)
November 10-12, 2021
Failures far away from a user should intuitively be less likely to affect that
user. Today's ecosystem miserably fails this test, however, despite
high-availability best practices. Correlated and cascading failures –
triggered by misconfigurations, bugs, and network partitions –
often invalidate assumptions of failure independence.
We propose that distributed services
need not and should not expose local activities to distant failures or
partitions, no matter how severe. Limix is an exposure-limiting architecture,
guaranteeing that neither the availability nor the performance of
strongly consistent accesses within a local area can be affected by distant
failures. Preliminary results suggest that infrastructures today could use
Limix to limit exposure at a manageable cost.