Understanding Upgrade Failures in Distributed Systems

Andrey Satarin
September 28, 2022

Transcript

  1. Understanding and Detecting Software
    Upgrade Failures in Distributed Systems
    By Yongle Zhang, Junwen Yang, Zhuqi Jin, Utsav Sethi, Shan Lu, Ding Yuan

    Presented by Andrey Satarin, @asatarin

    September, 2022

    https://asatarin.github.io/talks/2022-09-upgrade-failures-in-distributed-systems/


  2. Outline
    • Introduction


    • Findings on Severity and Root Causes


    • Testing and Detecting


    • Conclusions


    • Personal Experience and Commentary

  3. Introduction

  4. Software upgrade failures
    Software upgrade failures — failures that occur only during a software
    upgrade and never under regular execution.


    • Not a failure-inducing configuration change


    • Not a bug present only in the new version of the software


    Defects from two versions of software interacting

  5. Why are upgrade failures important?
    • Large scale — touches the whole system or a large part of it


    • Vulnerable context — upgrade is a disruption in itself


    • Persistent impact — can corrupt persistent data irreversibly


    • Difficult to expose in-house — little focus in testing

  6. What was studied?
    • Symptoms and severity


    • Root causes


    • Triggering conditions


    • Ways to detect upgrade failures

  7. Number of upgrade failures analyzed
    • Cassandra — 44
    • HBase — 13
    • HDFS — 38
    • Kafka — 7
    • MapReduce — 1
    • Mesos — 8
    • Yarn — 8
    • ZooKeeper — 4

    Total: 123 bugs

  8. Findings on Severity and Root Causes

  9. Finding 1
    Upgrade failures have significantly higher priority than regular failures


    A larger share of upgrade bugs is high priority, compared to non-upgrade bugs

  10. Finding 2
    The majority (67%) of upgrade failures are catastrophic
    (i.e., affecting all or a majority of users instead of only a few
    of them). This percentage is much higher than that (24%)
    among all bugs


    • 28% bring down the entire cluster


    • Catastrophic data loss or performance degradation

  11. Finding 3
    Most (70%) upgrade failures have easy-to-observe
    symptoms like node crashes or fatal exceptions

  12. Finding 4
    The majority (63%) of upgrade bugs were not caught before
    code release


    => We need to get better at testing upgrades

  13. Finding 5
    About two thirds of upgrade failures are caused by
    interaction between two software versions that hold
    incompatible data syntax or semantics assumptions


    Out of those:


    • 60% in persistent data and 40% in network messages


    • 2/3 syntax difference and 1/3 semantic difference

  14. Finding 6
    Close to 20% of syntax incompatibilities are about data
    syntax defined by serialization libraries or enum data types.
    Given their clear syntax definition interface, automated
    incompatibility detection is feasible
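To make the enum half of this finding concrete, here is a small illustration (my own sketch, not from the paper; the enum names are hypothetical) of how persisting an enum's numeric index breaks once a value is inserted between versions:

```python
from enum import Enum

# V1 of the software persists the enum's numeric index to disk.
class StateV1(Enum):
    ACTIVE = 0
    DELETED = 1

# V2 inserts a new value, silently shifting the index of DELETED.
class StateV2(Enum):
    ACTIVE = 0
    SUSPENDED = 1  # added in V2
    DELETED = 2

persisted = StateV1.DELETED.value  # V1 writes index 1
decoded = StateV2(persisted)       # V2 reads index 1 back...
print(decoded)                     # StateV2.SUSPENDED (silent misinterpretation)
```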

  15. Finding 10
    All of the upgrade failures require no more than 3 nodes
    to trigger


    [OSDI14] “Simple Testing Can Prevent Most Critical Failures”:


    “Finding 3. Almost all (98%) of the failures are guaranteed to manifest on
    no more than 3 nodes. 84% will manifest on no more than 2 nodes.”


  16. Finding 11
    Close to 90% of the upgrade failures are deterministic, not
    requiring any special timing to trigger

  17. Testing and Detecting

  18. Limitations in state of the art
    (As presented in the paper)


    • Do not solve the problem of workload generation


    • Testing workloads are designed from scratch (BAD!)


    • No mechanism to systematically explore different version combinations,
    configurations, or upgrade scenarios

  19. DUPTester

  20. DUPTester
    • DUPTester — Distributed system UPgrade Tester


    • Simulates 3-node cluster with containers


    • Systematically tests three scenarios:


    • Full-stop upgrade


    • Rolling upgrade


    • Adding new node
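As a rough sketch of how these three scenarios could be driven with containers (my illustration, not DUPTester's actual code; the `sut:v1`/`sut:v2` image names, the `sut-net` network, and the Docker CLI wrapper are all hypothetical):

```python
import subprocess

NODES = ["node1", "node2", "node3"]

def sh(cmd):
    # Thin wrapper over the Docker CLI; assumes docker is installed
    # and a "sut-net" network was created beforehand.
    subprocess.run(cmd, shell=True, check=True)

def start(node, version):
    # "sut:v1" / "sut:v2" are hypothetical system-under-test images.
    sh(f"docker run -d --name {node} --network sut-net sut:{version}")

def stop(node):
    sh(f"docker rm -f {node}")

def full_stop_upgrade():
    # Scenario 1: stop the whole V1 cluster, restart everything on V2.
    for n in NODES: stop(n)
    for n in NODES: start(n, "v2")

def rolling_upgrade():
    # Scenario 2: upgrade one node at a time; V1 and V2 nodes
    # coexist and must interoperate during the loop.
    for n in NODES:
        stop(n)
        start(n, "v2")

def add_new_node():
    # Scenario 3: a fresh V2 node joins a cluster still on V1.
    start("node4", "v2")
```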

  21. Testing workloads
    From section 6.1.2 Testing workload:


    “As discussed in Section 5.3, a main challenge facing all existing
    systems is to come up with workload for upgrade testing”


    DUPTester:


    • Using stress testing is straightforward


    • Using “unit” testing requires some tricks

  22. Using “unit” tests as workload
    Two strategies:


    • Automatically translate “unit” tests into client-side scripts


    • Not guaranteed to translate everything


    • Needs function mapping from developers


    • Execute the tests on V1, then verify the system starts successfully on V2
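The function-mapping trick could look roughly like this (my sketch; the mapping entries and client command syntax are hypothetical). Developers supply a table that rewrites in-process test API calls into client-side commands, and a test that uses an unmapped call is simply skipped, which is why translation is not guaranteed:

```python
# Hypothetical mapping from in-process test API calls to CLI commands.
FUNCTION_MAP = {
    "table.put":   "client put {table} {key} {value}",
    "table.get":   "client get {table} {key}",
    "admin.flush": "client flush {table}",
}

def translate(test_calls):
    """Turn a list of (function, kwargs) test steps into a client script.
    Returns None if any call has no mapping (translation not guaranteed)."""
    script = []
    for func, kwargs in test_calls:
        template = FUNCTION_MAP.get(func)
        if template is None:
            return None  # cannot translate this test, skip it
        script.append(template.format(**kwargs))
    return script

# Example: a unit test becomes a script to run against V1 before the upgrade.
steps = [("table.put", {"table": "t1", "key": "k", "value": "v"}),
         ("table.get", {"table": "t1", "key": "k"})]
print(translate(steps))
```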

  23. DUPChecker

  24. DUPChecker
    Types of syntax incompatibilities:


    • Serialization libraries' definition syntax incompatible across versions


    • Open source alternatives exist


    • Incompatibility of Enum-typed data

  25. DUPChecker
    Serialization libraries:


    • Parses protobuf definitions


    • Compares them across versions to find incompatibilities


    Enums:


    • Data flow analysis to find persisted enums


    • Checks whether the enum index is persisted and whether enum values were
    added or deleted between versions
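A toy version of the cross-version comparison on serialization definitions (my sketch; a real checker would use a proper protobuf parser rather than a regex). The rule being illustrated: a field number must not disappear or change type between versions:

```python
import re

# Matches field lines such as: "optional string name = 3;"
FIELD_RE = re.compile(
    r"(?:optional|required|repeated)\s+(\w+)\s+(\w+)\s*=\s*(\d+)\s*;")

def parse_fields(proto_text):
    # Map field number -> (type, name) for one message definition.
    return {int(num): (ftype, name)
            for ftype, name, num in FIELD_RE.findall(proto_text)}

def find_incompatibilities(v1_text, v2_text):
    v1, v2 = parse_fields(v1_text), parse_fields(v2_text)
    issues = []
    for num, (ftype, name) in v1.items():
        if num not in v2:
            issues.append(f"field {num} ({name}) removed in V2")
        elif v2[num][0] != ftype:
            issues.append(f"field {num} ({name}) changed type: "
                          f"{ftype} -> {v2[num][0]}")
    return issues

v1 = "optional string owner = 1; optional int32 size = 2;"
v2 = "optional string owner = 1; optional string size = 2;"
print(find_incompatibilities(v1, v2))
# ['field 2 (size) changed type: int32 -> string']
```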

  26. Conclusions

  27. Conclusions
    • First in-depth analysis of upgrade failures


    • Upgrade failures have severe consequences


    • DUPTester found 20 new upgrade failures in 4 systems


    • DUPChecker detected 800+ incompatibilities in 7 systems


    • The Apache HBase team requested that DUPChecker become part of their pipeline

  28. Personal Experience and Commentary

  29. Upgrades and correctness
    • Stress tests usually do not include correctness validation


    • Testing correctness implies testing correctness with failure injection


    • Testing system upgrade implies testing rollback

  30. System as a black box
    [Diagram: the system under test as a black box running V1, with invariants
    checked from the outside]


  31. System as a black box
    [Diagram: the same black box during the upgrade, with V1 and V2 running
    side by side under the same invariants]


  32. Testing workload
    From section 6.1.2 Testing workload:


    “As discussed in Section 5.3, a main challenge facing all existing
    systems is to come up with workload for upgrade testing”


    You probably already have workloads to test correctness:


    • Stress tests


    • Correctness tests (probably Jepsen-like) [Jepsen22]
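One way to reuse such workloads, sketched below with hypothetical `client` and `upgrade_cluster` hooks: keep the correctness workload running while the upgrade happens underneath it, then check the same invariants you already check in steady state:

```python
import threading

def run_workload(client, stop_event, history):
    # Jepsen-style loop: perform operations and record what was acknowledged.
    i = 0
    while not stop_event.is_set():
        client.put(f"k{i}", f"v{i}")
        history.append((f"k{i}", f"v{i}"))
        i += 1

def check_invariants(client, history):
    # Black-box invariant: every acknowledged write is still readable,
    # no matter which version of the system served it.
    for key, value in history:
        assert client.get(key) == value, f"lost acknowledged write {key}"

def test_upgrade(client, upgrade_cluster):
    stop, history = threading.Event(), []
    worker = threading.Thread(target=run_workload, args=(client, stop, history))
    worker.start()
    upgrade_cluster()  # V1 -> V2 happens under live load
    stop.set()
    worker.join()
    check_invariants(client, history)
```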

  33. Upgrade and rollback
    • We need to test both upgrade and rollback


    • Both operations ideally tested with failure injection


    • Probability of exposing bugs grows with “mixed-version time”


    • We should maximize “mixed-version time”
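A sketch of what that cycling could look like (hypothetical `cluster` and `node` hooks). Each rolling transition keeps the cluster mixed-version for the whole loop, and injecting failures inside the transition stretches exactly the window where upgrade bugs hide:

```python
import random

def transition(cluster, target_version):
    # Rolling transition: the cluster runs mixed versions until the loop ends.
    for node in cluster.nodes:
        node.restart_with(target_version)
        if random.random() < 0.3:
            # Inject a failure mid-transition so the two versions must
            # interoperate under stress, not just on the happy path.
            random.choice(cluster.nodes).kill()
            cluster.recover()

def upgrade_rollback_cycle(cluster, rounds=10):
    for _ in range(rounds):
        transition(cluster, "v2")   # upgrade
        cluster.verify_invariants()
        transition(cluster, "v1")   # rollback
        cluster.verify_invariants()
```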

  34. Upgrade and rollback
    [Diagram: V1 → V2 → V1 → V2 timeline with alternating Upgrade and
    Rollback transitions]


  35. Upgrade and rollback
    [Diagram: same as the previous slide]

  36. Conclusions (2)
    • There is certainly value in the research and ideas from the paper


    • There are additional ways one can improve upgrade testing by leveraging
    correctness tests


    • System during upgrade == system during normal operation

  37. Thank you for your attention

  38. References
    • Self-reference for this talk (slides, video, etc.)
    https://asatarin.github.io/talks/2022-09-upgrade-failures-in-distributed-systems


    • “Understanding and Detecting Software Upgrade Failures in Distributed
    Systems” paper https://dl.acm.org/doi/10.1145/3477132.3483577


    • Talk at SOSP 2021 https://youtu.be/29-isLcDtL0


    • Reference repository for the paper https://github.com/zlab-purdue/ds-upgrade

  39. References
    • [OSDI14] Simple Testing Can Prevent Most Critical Failures: An Analysis of
    Production Failures in Distributed Data-Intensive Systems

    https://www.usenix.org/conference/osdi14/technical-sessions/presentation/yuan


    • [Jepsen22] https://jepsen.io/

  40. Contacts
    • Follow me on Twitter @asatarin


    • Other public talks https://asatarin.github.io/talks/


    • https://www.linkedin.com/in/asatarin/