A ROADMAP TOWARDS CHAOS ENGINEERING

Chaos Conf
September 26, 2019

Jose Esquivel, Backcountry

A common problem with Chaos Experimentation is knowing where to start. In this talk, Principal Software Engineer Jose Esquivel presents a roadmap for Chaos Experimentation that can be applied to any organization.

Transcript

  1. Agenda
    – The Roadmap that got us to Chaos Engineering.
    – 8 Stability Patterns.
    – 4 ways to achieve observability.
  2. Roadmap
    1 – Observability
    2 – Alerting
    3 – Incident Management
    4 – Test Harness
  3. Before running Chaos Experimentation, make sure you account for:
    – Stability Patterns
    – Observability
  4. Timeout & Retries
    – Consequences: depleting your HTTP or DB pools.
    – Pay close attention to achieving comprehensive retries.
    – Do not overwhelm the server with retries.
  5. Circuit Breaker
    – Be a good fellow client.
    – Open the circuit during failure events.
    – Include this in your monitoring.
    – Degrade gracefully.
  6. Steady State
    – Define what a steady state is for your API.
    – Avoid an ever-increasing amount of data.
  7. Handshaking
    – Allows the server to reject calls.
    – Avoid memory overflow.
    – Pull monitoring using /health endpoints.
  8. Uncoupling via Middleware
    – Avoid waiting for a response (“fire and forget”).
    – The system can process other things while waiting.
    – The server side will never be overwhelmed.
  9. 4 Ways to Achieve Observability
    – The goal is to answer all the questions we have about our system.
  10. Logging
    – Choose what you want to see, then choose the tools. Don’t be afraid of tool proliferation.
    – Logs exist because somebody wrote something. APMs are good, but nothing beats intentionality.
    – Ordered vs. structured logs.

    [INFO ] [AvalaraCaptureOperation] 04d6fa2e-71b7-4881-88b1-71c7afcb2b76.motosport.10.42.7.10 - Discrepancy of -0.42 dollars found | Taxes charged to user: 2.53 | Taxes reported by Remote Tax Service 2.95 | order OrderLookupResult(orderId=m7075551, orderDate=2019-07-26, catalogId=Motosport, [email protected], estimatedOrderTotalTax=2.53, shippingCostTax=0.52, shipmentTaxes=[ShipmentTax(id='m7075551-42616991', facility=SLCW, items=[ItemTax(sku=ABL000E-X001-Y003, quantity=1, tic=P0000000, productGroupId=100001782, totalTax=2.43, price=33.44, originalPrice=33.44, unitaryClientTax=2.08)], shippingCostTaxItem=ShippingCostTaxItem(cost=7.0, tax=0.52), capturedDate=2019-07-26, captured=true)], isPartnerOrder=false, isExemptTaxOrder=false, address=Destination(country=US, address1=3240 Gilliland Rd, address2=3240 Gilliland Rd, city=Springtown, state=TX, zip=76082-5233), provider=Avalara, created=2019-07-26, lastModified=2019-07-26)

    2019-08-21 23:32:33,153 merch-log ERROR c.b.s.m.w.c.c.BaseRestController - "netsuite-price-5879-18" | 10.42.7.10:"Apache-HttpClient/4.5.9" 524 PUT "/erp/variants/prices" {} | { "request-size"=103909b, "duration"=5543ms, "response-size"=8192b } com.backcountry.supplychain.merch.business.common.exceptions.BadRequestServiceException: Bad Request received. Invalid value for field 'sku' on object 'erpVariantPriceUpdateModelList': Invalid reference of sku COL4271-BLA-LT. Invalid value for field 'sku' on object 'erpVariantPriceUpdateModelList': Invalid reference of sku COL4271-BLA-XS
  11. Tracing The Event Log
    – Causality across systems.
    – TraceId and object ids.
    – These are the hardest to code, as context needs to be passed across systems.
  12. Metrics & Reports
    – Aggregate data.
    – Expose values: the good, the bad, and the ugly.
    – Reports can be pushed to the customer.
  13. Alerting
    – Differentiate Warnings from Criticals.
    – Criticals are meant to wake someone up.
    – 6 practices to make great alerting:
      • Stop using emails for Critical alerts.
      • Write runbooks.
      • Delete and tune alerts.
      • Use maintenance periods.
      • Attempt self healing, but be careful.
      • Overcome arbitrary static thresholds.
  14. Summary
    – A maturity model in the form of an outgoing Roadmap that can get us to Chaos Engineering activities.
    – 8 Stability Patterns that can apply to software, hardware, network, etc.
    – 4 ways to achieve observability and be able to answer all the questions we have about our systems.
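
The Timeout & Retries slide warns about depleted pools and about overwhelming the server with retries. The following is a minimal sketch of one way to address both, not the speaker's implementation: every attempt carries a timeout, the number of attempts is capped, and the wait between attempts grows exponentially with random jitter. `TransientError`, `call_with_retries`, and all default values are hypothetical names and numbers chosen for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(operation, timeout=1.0, max_attempts=3,
                      base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Run operation(timeout) with a bounded number of retries.

    The timeout is handed to the operation so a slow call cannot hold a
    pooled connection forever. The attempt cap and the growing delay keep
    the client from flooding a server that is already struggling.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(timeout)
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the failure
            # Exponential backoff, capped at max_delay, with full jitter.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


if __name__ == "__main__":
    attempts = []

    def flaky(timeout):
        # Simulated dependency: fails twice, then succeeds.
        attempts.append(timeout)
        if len(attempts) < 3:
            raise TransientError("simulated timeout")
        return "ok"

    print(call_with_retries(flaky, sleep=lambda seconds: None))
```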
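
The Circuit Breaker slide asks the client to open the circuit during failure events, expose it to monitoring, and degrade gracefully. A rough sketch under stated assumptions: the class below, its thresholds, and the `fallback` parameter are hypothetical, and a production system would more likely rely on an established resilience library. The `state` property is the value one might export as a metric.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        """Return 'closed', 'open', or 'half-open' (suitable for monitoring)."""
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # allow one trial call through
        return "open"

    def call(self, operation, fallback=None):
        if self.state == "open":
            # Fail fast: do not add load to a dependency that is failing.
            if fallback is not None:
                return fallback()  # graceful degradation
            raise CircuitOpenError("dependency is failing; call rejected")
        try:
            result = operation()
        except Exception:  # broad on purpose for this sketch
            self.failures += 1
            if (self.state == "half-open"
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()  # (re)open the circuit
            if fallback is not None:
                return fallback()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```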
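
The Logging and Tracing slides contrast ordered with structured logs and call for a TraceId and object ids to follow causality across systems. As an illustrative sketch only, the snippet below uses Python's standard `logging` module to emit one JSON object per line carrying both ids. The logger name `tax-capture` and the field names `trace_id` and `order_id` are assumptions made for this example, not taken from the talk.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Values passed via `extra=` become attributes on the record.
            "trace_id": getattr(record, "trace_id", None),
            "order_id": getattr(record, "order_id", None),
        }
        return json.dumps(payload, sort_keys=True)


def build_logger(name="tax-capture", stream=None):
    """Create a logger that writes structured lines (stderr by default)."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger()
    # In a real system the trace id would arrive from the calling service.
    trace_id = str(uuid.uuid4())
    log.info("tax discrepancy found",
             extra={"trace_id": trace_id, "order_id": "m7075551"})
```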