『スタディサプリ』における SLI/SLO の継続的改善 / Continuous improvement of SLI/SLO at StudySapuri

Takeshi Kondo

May 16, 2023

Transcript

  1. Who am I
     chaspy / chaspy_
     Takeshi Kondo, Engineering Manager, Site Reliability and Web Application Development at Recruit Co., Ltd.
     https://chaspy.me
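  2. SRE NEXT 2020 & 2022
     • 2020: A case study of trying to introduce SLI/SLO into an organization that did not yet have the words for them
     • 2022: What happened after SLI/SLO were introduced; thoughts on what it takes to advance Site Reliability Engineering across the whole organization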
  3. SRE & Web Application Development
     Timeline, 2018 to 2023:
     • Joined Quipper
     • SRE NEXT
     • Worked hard to introduce SLOs into the organization
     • Also joined a Web development team as Engineering Manager
     • SRE NEXT
     • Became Engineering Manager
     • SLOconf Tokyo ✨
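  4. SRE & Web Application Development
     Timeline, 2018 to 2023:
     • Joined Quipper
     • SRE NEXT
     • Worked hard to introduce SLOs into the organization
     • SRE NEXT
     • Became Engineering Manager
     • Also joined a Web development team as Engineering Manager
     • SLOconf Tokyo ✨
     Today I will speak from a developer's point of view!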
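  5. Outline
     • Self-introduction
     • About 『スタディサプリ 中学講座』 (StudySapuri Junior High School Course)
     • What are SLI/SLO for?
     • Current state and challenges of service operations
     • What we actually did to address the challenges
     • Summary
  6. Outline
     • Self-introduction
     • About 『スタディサプリ 中学講座』 (StudySapuri Junior High School Course)
     • What are SLI/SLO for?
     • Current state and challenges of service operations
     • What we actually did to address the challenges
     • Summary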
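  7. Relaunched in February 2022
     • Everything except the user platform was developed as new microservices over two years
     • One year has passed since the release, and we are still enhancing it continuously
     https://www.recruit.co.jp/newsroom/pressrelease/2022/0131_9881.html
     Key points of the relaunch!
     • Individualized learning support through "This Week's Missions" and a repeated-practice feature
     • Greatly expanded practice volume and range of difficulty
     • New courses keep arriving, including the "Regular Test Preparation Course"
     • Redesigned learning screens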
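  8. Outline
     • Self-introduction
     • About 『スタディサプリ 中学講座』 (StudySapuri Junior High School Course)
     • What are SLI/SLO for?
     • Current state and challenges of service operations
     • What we actually did to address the challenges
     • Summary
  9. Outline
     • Self-introduction
     • About 『スタディサプリ 中学講座』 (StudySapuri Junior High School Course)
     • What are SLI/SLO for?
     • Current state and challenges of service operations
     • What we actually did to address the challenges
     • Summary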
  10. Note: "tara" is the code name of this relaunch project; it recently became public in an interview
      https://brand.studysapuri.jp/career/interview/article/Saori_Suzuki/
      Diagram labels: Reverse Proxy (Nginx); the pre-existing services, including the user platform
      • 8 SLIs/SLOs in total
      • (a) Availability and (b) Latency
      • Based on http metrics
      • Both kinds (a/b) at each of these 4 points
        • ① api-gateway
        • ② api-gateway -> main
        • ③ api-gateway -> content
        • ④ main -> content
      • SLO
        • Availability: 99.9%
        • Latency: 95th percentile < 1000 msec
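A minimal sketch of how the two SLO targets on this slide could be checked for one measurement window. The function names, the nearest-rank percentile definition, and the sample data are assumptions for illustration; the talk does not show how the team actually evaluates its SLOs.

```python
import math

AVAILABILITY_SLO = 0.999   # 99.9%, from the slide
LATENCY_SLO_MS = 1000.0    # 95th percentile must stay below 1000 msec, from the slide


def availability(good: int, total: int) -> float:
    """Fraction of successful requests; an empty window counts as fully available."""
    return 1.0 if total == 0 else good / total


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile (an assumed definition; monitoring tools may differ)."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100.0)
    return ordered[max(rank, 1) - 1]


def meets_slo(good: int, total: int, latencies_ms: list[float]) -> bool:
    """True when both the availability and the latency target hold for the window."""
    return (availability(good, total) >= AVAILABILITY_SLO
            and percentile(latencies_ms, 95) < LATENCY_SLO_MS)


if __name__ == "__main__":
    # Hypothetical window: 999 of 1000 requests succeeded, 5% of them were slow.
    samples = [100.0] * 95 + [2000.0] * 5
    print(meets_slo(999, 1000, samples))
```

With 8 SLIs (two kinds at four points), the same check would simply run once per measurement point.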
  11. Why Envoy?
      • At the time there was no way to obtain metrics between microservices
      • Not a Service Mesh with a Control Plane; we simply ran plain Envoy as a side-car container
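  12. DevSupport: routine operations handled by a daily rotating on-duty engineer
      • Check Slack notifications and investigate causes
        • Sentry Exception, SLO Alert, GCP Pub/Sub Dead Letter
        • Escalate anything that needs manual handling to the responsible team
      • First-line handling of CS (Customer Support) inquiries
      • First-line handling of mentions addressed to everyone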
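  13. The problem we had: No SLO Alert
      • Not a single SLO Alert has fired from the release until now
      • The volume of Sentry Exceptions does not seem to be reflected in the SLIs
      • What is going on?
        • At the very least, as long as we look at Sentry Exceptions one by one, we are not making use of the Error Budget concept
        • Are the SLOs looser than our own expectations?
        • Are the SLIs configured incorrectly?
      • We investigated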
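To make the Error Budget point concrete, here is a small sketch of the arithmetic behind it. The request counts are made up, and the function names are assumptions for illustration; only the 99.9% target comes from the slides.

```python
def error_budget(total: int, slo: float) -> float:
    """Number of requests that may fail in the window without violating the SLO."""
    return total * (1.0 - slo)


def budget_consumed(failed: int, total: int, slo: float) -> float:
    """Fraction of the error budget already spent (1.0 means exhausted)."""
    budget = error_budget(total, slo)
    if budget == 0:
        return 0.0 if failed == 0 else float("inf")
    return failed / budget


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests against a 99.9% availability SLO.
    print(error_budget(1_000_000, 0.999))          # about 1000 failures allowed
    print(budget_consumed(250, 1_000_000, 0.999))  # about a quarter of the budget
```

Seen this way, 250 failed requests in such a window would not call for triaging every exception individually, which is the gap the slide points at.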
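  14. Outline
      • Self-introduction
      • About 『スタディサプリ 中学講座』 (StudySapuri Junior High School Course)
      • What are SLI/SLO for?
      • Current state and challenges of service operations
      • What we actually did to address the challenges
      • Summary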
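  15. Hypothesis: Is something wrong with the Envoy metrics (SLI ②③④)?
      • Yes
      • Some of the Exceptions were failures in DNS name resolution
        • In other words, they never became an http request
        • So naturally they are not counted in envoy.cluster.upstream_rq_2xx
      • The pattern: name resolution fails during communication ②
      • It would be fine if SLI ① captured these, but...?
      Normal communication flow:
      • The tara-api-gateway container resolves the name in http://tara-main <- this is what failed
      • It then communicates over http://tara-main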
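The blind spot described on this slide can be illustrated with a toy tally. This is a sketch under stated assumptions: the outcome labels `2xx`, `5xx`, and `dns_error` and the `simulate` function are hypothetical, and the code models only the idea that a proxy-side counter such as `envoy.cluster.upstream_rq_2xx` sees requests that were actually sent, not attempts that failed before a request existed.

```python
from collections import Counter


def simulate(outcomes: list[str]) -> tuple[float, float]:
    """Return (proxy-side SLI, caller-side SLI) for a list of call outcomes.

    Outcomes are "2xx", "5xx", or "dns_error" (hypothetical labels).
    """
    seen = Counter(outcomes)
    # The proxy only counts requests that reached the http layer.
    proxy_total = seen["2xx"] + seen["5xx"]
    # The caller experiences every attempt, including ones that were never sent.
    caller_total = proxy_total + seen["dns_error"]
    proxy_sli = seen["2xx"] / proxy_total if proxy_total else 1.0
    caller_sli = seen["2xx"] / caller_total if caller_total else 1.0
    return proxy_sli, caller_sli


if __name__ == "__main__":
    # Hypothetical window: 990 successful calls, 10 attempts that failed at DNS resolution.
    print(simulate(["2xx"] * 990 + ["dns_error"] * 10))
```

In this made-up window the proxy-side SLI stays at 100% while callers saw 1% of attempts fail, which is enough to breach a 99.9% target without any alert firing.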
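  16. Hypothesis 2: Is something wrong with the Reverse Proxy metrics (SLI ①)?
      • Yes
      • When a GraphQL request failed partway through, it was still returning 200 at the http level 😱
      • At release time we had decided that partial failures would be returned as 500, but that was not what the code did
      Normal communication flow:
      • A client accesses the junior-high course site on studysapuri.jp over https and reaches the Reverse Proxy
      • The Reverse Proxy proxies to tara-api-gateway ②
      • tara-api-gateway communicates with tara-main ② <- the error occurs here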
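  17. Fix 1: Return http 500 on a GraphQL Error
      • GraphQL itself was never concerned with http
      • The behavior depends on the GraphQL server library
      • Some practices standardize on Response status 200
      • Clients look at the errors field of the Response anyway, so changing the status causes no problem
      A colleague fixed it in no time 🙏 Special Thanks @Quramy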
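The policy on this slide can be sketched as a tiny status-mapping function. This is an illustration only: `http_status_for` is a hypothetical helper, not the team's actual fix and not the API of any GraphQL server library. It relies only on the fact that a GraphQL response body carries failures in a top-level `errors` list, possibly alongside partial `data`.

```python
def http_status_for(graphql_response: dict) -> int:
    """Map a GraphQL response body to an http status.

    Returns 500 when the body has a non-empty `errors` list, including
    partial failures where `data` is also present; otherwise 200.
    """
    errors = graphql_response.get("errors")
    return 500 if errors else 200


if __name__ == "__main__":
    ok = {"data": {"lesson": {"id": "1"}}}
    partial = {
        "data": {"lesson": None},
        "errors": [{"message": "upstream request failed", "path": ["lesson"]}],
    }
    print(http_status_for(ok), http_status_for(partial))
```

Because clients read `errors` from the body regardless of status, a mapping like this mainly changes what http-level metrics, and therefore SLI ①, can observe.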
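  18. Fix 2: Drop Envoy and use Datadog APM metrics
      • The aim was to reduce the difficulty of troubleshooting caused by complexity
        • It is not that the Envoy metrics themselves were at fault
      • There were many operational burdens, and we were getting no benefit beyond collecting metrics
        • A Circuit Breaker was configured, but it almost never tripped
        • Keeping up with Envoy version upgrades (we were not managing to)
        • Controlling startup and shutdown order of side-car containers in a Pod (errors occur unless the app waits for envoy)
        • How to patch when using Rollouts (Resources once got inverted and caused an incident)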
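  19. Side note: Datadog APM has quite a few quirks (1)
      • The resource tag of the http client APM Plugin defaults to the http method
      • To use it as a per-destination SLI, the hostname is needed
      • Handled separately for Node and Ruby
      • Agreed within the organization on a naming convention for the http-client resource tag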
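A naming convention of the kind mentioned here might look like the following sketch. The `METHOD hostname` format, the function name, and the example URL are assumptions for illustration; the slides do not state the convention the organization agreed on, and this code does not call any Datadog tracing library.

```python
from urllib.parse import urlsplit


def resource_name(method: str, url: str) -> str:
    """Build a per-destination resource name such as "GET tara-main".

    The "<METHOD> <hostname>" format is a hypothetical convention.
    """
    host = urlsplit(url).hostname or "unknown-host"
    return f"{method.upper()} {host}"


if __name__ == "__main__":
    # Example URL is illustrative only.
    print(resource_name("get", "http://tara-main/graphql"))
```

Keeping the path out of the name keeps cardinality low, so one SLI per destination service stays manageable.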
  20. Side note: Datadog APM has quite a few quirks (2)
      • http 5xx responses are not counted in trace.http.request.errors
      • Conversely, 4xx responses are counted
      • trace.http.request.hits.by_http_status has to be used instead
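Given per-status hit counts of the kind that `trace.http.request.hits.by_http_status` provides, an availability SLI that treats only 5xx as failures could be computed as in this sketch. The input dictionary, its values, and the function name are assumptions for illustration; no Datadog query syntax or API call is shown.

```python
def availability_from_status_counts(hits_by_status: dict[int, int]) -> float:
    """Availability where only 5xx responses count against the SLI.

    4xx responses are treated as served correctly, since they usually reflect
    client-side mistakes rather than service failures.
    """
    total = sum(hits_by_status.values())
    if total == 0:
        return 1.0
    server_errors = sum(count for status, count in hits_by_status.items()
                        if 500 <= status <= 599)
    return (total - server_errors) / total


if __name__ == "__main__":
    # Hypothetical window of hit counts keyed by http status code.
    print(availability_from_status_counts({200: 9980, 404: 15, 500: 5}))
```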
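  21. Note: "tara" is the code name of this relaunch project; it recently became public in an interview
      https://brand.studysapuri.jp/career/interview/article/Saori_Suzuki/
      Diagram labels: Reverse Proxy (Nginx); the pre-existing services, including the user platform
      • Revised the SLOs
      • (a) Availability and (b) Latency
      • Based on http metrics
      • Both kinds (a/b) at each of these 4 points
        • ① api-gateway
        • ② api-gateway -> main
        • ③ api-gateway -> content
        • ④ main -> content
        • 🆕 ⑤ api-gateway -> requests to the user platform
      • SLO
        • Availability: 99.9%
        • Latency: 95th percentile < 1000 msec
          -> now 100~500 msec, set per service based on its current performance
      Callouts:
      • Abolished the per-microservice SLIs/SLOs, because the benefit of splitting the SLIs did not justify the cost of managing multiple SLIs/SLOs
      • Added SLIs/SLOs for the user platform: the shared SLI for the user platform had relied on envoy metrics, and since envoy was removed, SLIs/SLOs based on Datadog APM metrics were added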
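  22. DevSupport, revised
      • In Sentry, Ignore every Exception that is not caused by application code
      • Documented the basic response policy for when an SLO Alert arrives
      • Items that could not be handled on the day are handled by the team once every two weeks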
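  23. Outline
      • Self-introduction
      • About 『スタディサプリ 中学講座』 (StudySapuri Junior High School Course)
      • What are SLI/SLO for?
      • Current state and challenges of service operations
      • What we actually did to address the challenges
      • Summary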
  24. Thank you!
      chaspy / chaspy_
      Takeshi Kondo, Engineering Manager, Site Reliability and Web Application Development at Recruit Co., Ltd.
      https://chaspy.me