Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
『スタディサプリ』における SLI/SLO の継続的改善 / Continuous impro...
Search
Takeshi Kondo
May 16, 2023
Technology
1
3.5k
『スタディサプリ』における SLI/SLO の継続的改善 / Continuous improvement of SLI/SLO at StudySapuri
https://connpass.com/event/282120/
Takeshi Kondo
May 16, 2023
Tweet
Share
More Decks by Takeshi Kondo
See All by Takeshi Kondo
SRE NEXT CfP チームが語る 聞きたくなるプロポーザルとは / Proposals by the SRE NEXT CfP Team that are sure to be accepted
chaspy
2
1.3k
Slack Platform(Deno) での RAG 実装 - LangChain(js) を使ってみた / rag-implementation-on-slack-platform-deno-experimenting-with-langchain-js
chaspy
0
220
SRE の考えをマネジメントに活かす / applying SRE ideas to management
chaspy
7
7.5k
RAGの簡易評価によるフィードバックサイクル実践 / Feedback cycle practice through simplified assessment of RAGs
chaspy
2
5.4k
定量データと定性評価を用いた技術戦略の組織的実践 / Systematic implementation of technology strategies using quantitative data and qualitative evaluation
chaspy
9
1.9k
エンジニアブランディングチームの KPI / KPI's of engineer branding team
chaspy
2
2.2k
「SLO Review」今やるならこうする / If I had to do the "SLO Review" again
chaspy
3
2k
開発者とともに作る Site Reliability Engineering / SREing with Developers
chaspy
10
8.3k
自己診断能力の獲得を目指して / Toward the acquisition of self-diagnostic skills
chaspy
1
5.1k
Other Decks in Technology
See All in Technology
相互運用可能な学修歴クレデンシャルに向けた標準技術と国際動向
fujie
0
110
Microsoft Learn MCP/Fabric データエージェント/Fabric MCP/Copilot Studio-簡単・便利なAIエージェント作ってみた -"Building Simple and Powerful AI Agents with Microsoft Learn MCP, Fabric Data Agent, Fabric MCP, and Copilot Studio"-
reireireijinjin6
1
190
激動の時代、新卒エンジニアはAIツールにどう向き合うか。 [LayerX Bet AI Day Countdown LT Day1 ツールの選択]
tak848
0
630
MCPと認可まわりの話 / mcp_and_authorization
convto
2
340
大規模イベントを支える ABEMA の アーキテクチャ 変遷 2025
nagapad
5
580
製造業の課題解決に向けた機械学習の活用と、製造業特化LLM開発への挑戦
knt44kw
0
110
オブザーバビリティプラットフォーム開発におけるオブザーバビリティとの向き合い / Hatena Engineer Seminar #34 オブザーバビリティの実現と運用編
arthur1
0
210
SAE J1939シミュレーション環境構築
daikiokazaki
1
200
CSPヘッダー導入で実現するWebサイトの多層防御:今すぐ試せる設定例と運用知見
llamakko
1
280
経験がないことを言い訳にしない、 AI時代の他領域への染み出し方
parayama0625
0
280
LLM開発を支えるエヌビディアの生成AIエコシステム
acceleratedmu3n
0
350
増え続ける脆弱性に立ち向かう: 事前対策と優先度づけによる 持続可能な脆弱性管理 / Confronting the Rise of Vulnerabilities: Sustainable Management Through Proactive Measures and Prioritization
nttcom
1
230
Featured
See All Featured
Writing Fast Ruby
sferik
628
62k
Designing for Performance
lara
610
69k
The Psychology of Web Performance [Beyond Tellerrand 2023]
tammyeverts
48
2.9k
Unsuck your backbone
ammeep
671
58k
Automating Front-end Workflow
addyosmani
1370
200k
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
283
13k
How STYLIGHT went responsive
nonsquared
100
5.7k
How to train your dragon (web standard)
notwaldorf
96
6.1k
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
139
34k
Dealing with People You Can't Stand - Big Design 2015
cassininazir
367
26k
GraphQLの誤解/rethinking-graphql
sonatard
71
11k
YesSQL, Process and Tooling at Scale
rocio
173
14k
Transcript
ʰελσΟαϓϦʱʹ͓͚Δ SLI/SLO ͷܧଓతվળ Takeshi Kondo / @chaspy 2023/05/13 SLOconf Tokyo
2023
Who am I chaspy chaspy_ Engineering Manager Site Reliability and
Web Application Development at Recruit Co., Ltd. Takeshi Kondo https://chaspy.me
SRE NEXT 2020 & 2022 • 2020 • SLI/SLO ͱ͍͏ݴ༿͕ͳ͍ঢ়ଶͰ৫
ಋೖΛࢼΈͨࣄྫ • 2022 • SLI/SLO Λಋೖͨ͠ޙͷ • ৫શମͰ Site Reliability Engineering ΛਐΊΔͨΊʹඞཁͳ͜ͱΛߟ͑ͨ
SRE & Web Application Development 2018 2020 2021 2023 2019
2022 2VJQQFS ೖࣾ 43&/&95 4-0Λ৫ʹಋೖ ͠Α͏ͱؤுΔ &OHJOFFSJOH.BOBHFSͱͯ͠ 8FC։ൃνʔϜʹࢀՃ 43&/&95 &OHJOFFSJOH .BOBHFSʹͳΔ 4-0DPOG 5PLZP✨
SRE & Web Application Development 2018 2020 2021 2023 2019
2022 2VJQQFS ೖࣾ 43&/&95 4-0Λ৫ʹಋೖ ͠Α͏ͱؤுΔ 43&/&95 &OHJOFFSJOH .BOBHFSʹͳΔ &OHJOFFSJOH.BOBHFSͱͯ͠ 8FC։ൃνʔϜʹࢀՃ ࠓ։ൃऀઢͰ͠·͢ʂ 4-0DPOG 5PLZP✨
ࠓ͍͑ͨ͜ͱ ҰܾΊͨ SLI/SLO ܧଓతʹݟ͠·͠ΐ͏
Ұઃఆͯ͠ݟ͞ͳ͔ͬͨΒͲ͏ͳ͔ͬͨͷΛ͠·͢😅
ʰελσΟαϓϦʱʹ͓͚Δ SLI/SLO ͷܧଓతվળ Λ͜Ε͔Β͍ͬͯͧ͘ͱ͍͏ Takeshi Kondo / @chaspy 2023/05/13 SLOconf
Tokyo 2023
Outline • ࣗݾհ • ʰελσΟαϓϦ தֶߨ࠲ʱʹ͍ͭͯ • SLI/SLO ͳΜͷͨΊʹ͋Δͷ͔ •
αʔϏεӡ༻ͷݱঢ়ͱ՝ • ՝ʹ࣮͋ͨͬͯࡍʹऔΓΜͩ͜ͱ • ·ͱΊ
Outline • ࣗݾհ • ʰελσΟαϓϦ தֶߨ࠲ʱʹ͍ͭͯ • SLI/SLO ͳΜͷͨΊʹ͋Δͷ͔ •
αʔϏεӡ༻ͷݱঢ়ͱ՝ • ՝ʹ࣮͋ͨͬͯࡍʹऔΓΜͩ͜ͱ • ·ͱΊ
લఏɿϓϩμΫτհ - ελσΟαϓϦ
None
20222݄ʹϦχϡʔΞϧ • Ϣʔβج൫Ҏ֎ͷ෦Λ৽نϚΠΫϩ αʔϏεͱͯ͠2ʹΓ։ൃ • ϦϦʔε͔Β1ܦաɻݱࡏܧଓత ʹΤϯϋϯε͍ͯ͠·͢ https://www.recruit.co.jp/newsroom/pressrelease/2022/0131_9881.html ϦχϡʔΞϧͷϙΠϯτʂ
ࠓिͷϛογϣϯͱ෮ԋशػೳʹΑΔݸผֶशࢧԉ ԋशྔɾқΛେ෯֦ॆ ʮఆظςετରࡦߨ࠲ʯΛؚΉ৽ߨ࠲͕ଓʑొ ֶशը໘ͷσβΠϯΛҰ৽
උߟ: tara ͱ͍͏ͷ͜ͷϦχϡʔΞϧϓϩδΣΫτͷίʔυωʔϜͰɺ࠷ۙΠϯλϏϡʔͰύϒϦοΫʹͳͬͨ https://brand.studysapuri.jp/career/interview/article/Saori_Suzuki/ ݩʑ͋ͬͨ Ϣʔβج൫Λ ؚΉαʔϏε
Outline • ࣗݾհ • ʰελσΟαϓϦ தֶߨ࠲ʱʹ͍ͭͯ • SLI/SLO ͳΜͷͨΊʹ͋Δͷ͔ •
αʔϏεӡ༻ͷݱঢ়ͱ՝ • ՝ʹ࣮͋ͨͬͯࡍʹऔΓΜͩ͜ͱ • ·ͱΊ
Why SLI/SLO? • ػೳ։ൃorඇػೳ։ൃɺͲͪΒʹ࣌ؒΛ͏ͷ͔Λ Fact-BasedͰܾఆ͢ΔͨΊ • Error Budget ͕͋ Δ͏ͪ1ͭ1ͭͷ
Τϥʔʹରॲ͠ͳ͍ • Burn Out Λආ͚ΒΕΔ • Budget ͕͋Δ͏ͪϦε Ϋ͕ͱΕΔ
Outline • ࣗݾհ • ʰελσΟαϓϦ தֶߨ࠲ʱʹ͍ͭͯ • SLI/SLO ͳΜͷͨΊʹ͋Δͷ͔ •
αʔϏεӡ༻ͷݱঢ়ͱ՝ • ՝ʹ࣮͋ͨͬͯࡍʹऔΓΜͩ͜ͱ • ·ͱΊ
උߟ: tara ͱ͍͏ͷ͜ͷϦχϡʔΞϧϓϩδΣΫτͷίʔυωʔϜͰɺ࠷ۙΠϯλϏϡʔͰύϒϦοΫʹͳͬͨ https://brand.studysapuri.jp/career/interview/article/Saori_Suzuki/ ݩʑ͋ͬͨ Ϣʔβج൫Λ ؚΉαʔϏε 3FWFSTF1SPYZ /HJOY •
SLI/SLO શ෦Ͱ8ͭ • (a)Availability ͱ (b)Latency • http ͷ metrics Λ͏ • ҎԼͷ4Օॴʹ(a/b)2छྨͣͭ • ᶃ api-gateway • ᶄ api-gateway -> main • ᶅ api-gawatey -> content • ᶆ main -> content • SLO • Availability: 99.9% • Latency: 95 percentile < 1000msec ᶃ ᶄ ᶅ ᶆ
Why Envoy? • ࣌ϚΠΫϩαʔϏεؒͷ metrics Λऔಘ͢Δํ๏͕ ͳ͔ͬͨ • Control Plane
ΛؚΜͩ Service Mesh Ͱͳ͘ɺSide- car container ͱͯ͠୯ʹૉͷ Envoy ΛࡌͤΔͷΈ
DevSupport: ସΘΓ൪Ͱఆৗӡ༻ۀΛߦ͏ • Slack ͷ௨Λ֬ೝͯ͠ݪҼௐࠪ • Sentry Exception, SLO Alert,
GCP Pub/Sub Dead Letter • खಈରԠ͕ඞཁͳͷ֤νʔϜʹΤεΧϨʔγϣϯ • CS(Customer Support)͍߹ΘͤͷҰ࣍ड͚ • શମ͚ϝϯγϣϯͷ1࣍ड͚
ى͖͍ͯͨ՝: No SLO Alert • ϦϦʔε͔Βࠓ·ͰҰ SLO Alert ͕໐ͬͨ͜ͱͳ͍ •
Sentry ͷ Exception ྔ͕ SLI ʹө͞Ε͍ͯͳ͍ؾ͕͢Δ • Կ͕ى͖͍ͯΔͷͩΖ͏͔ʁ • গͳ͘ͱ Sentry Exception Λ1݅ͣͭݟ͍ͯΔ࣌Ͱ Error Budget ͱ͍͏֓೦ ར༻Ͱ͖ͯͳ͍ • SLO ͕ࣗͨͪͷظΑΓ؇͗͢Δʁ • SLI ͷઃఆ͕ޡ͍ͬͯΔʁ • ௐࠪͨ͠
Outline • ࣗݾհ • ʰελσΟαϓϦ தֶߨ࠲ʱʹ͍ͭͯ • SLI/SLO ͳΜͷͨΊʹ͋Δͷ͔ •
αʔϏεӡ༻ͷݱঢ়ͱ՝ • ՝ʹ࣮͋ͨͬͯࡍʹऔΓΜͩ͜ͱ • ·ͱΊ
Ծઆ: Envoy ͷ metrics (SLIᶄᶅᶆ) ͕͓͔͍͠ͷͰʁ • Yes • Exception
ͷҰ෦ DNS ໊લղܾͰࣦഊ͍ͯͨ͠ • ͭ·Γɺhttp request ʹࢸ͍ͬͯͳ͍ • envoy.cluster.upstream_rq_2xx ʹܭ্͞Εͳ͍ͷͦΕͦ͏ • ᶄͷ௨৴࣌ɺ໊લղܾʹࣦഊύλʔϯ • ᶃ ͷ SLI Ͱܭଌ͞Ε͍ͯΕྑ͍͕…? ௨ৗͷ௨৴ UBSBBQJHBUFXBZDPOUBJOFS͕IUUQ UBSBNBJOΛ໊લղܾ͢Δ͕͜͜ ࣦഊͨ͠ IUUQUBSBNBJOͰ௨৴͢Δ
Ծઆ2: Reverse Proxy ͷ metrics (SLIᶃ) ͕͓͔͍͠ͷͰʁ • Yes •
GraphQL ϦΫΤετ్͕தͰࣦഊͨ͠߹ɺhttp Ͱ 200 Λฦ͍ͯͨ͠😱 • ϦϦʔε࣌ɺ෦ࣦഊ 500 Ͱฦ͢͜ͱΛܾΊ͕ͨɺͦ͏͞Ε͍ͯͳ͔ͬͨ ௨ৗͷ௨৴ $MJFOU͔ΒIUUQTKVOJPSMFBSOTUVEZTBQVSJKQʹΞΫηε͢Δͱ 3FWFSTF1SPYZʹ౸ୡ 3FWFSTF1SPYZ͔ΒțBSBBQJHBUFXBZQSPYZᶄ UBSBBQJHBUFXBZ͔ΒțBSBNBJO௨৴ᶄ͜͜ͰΤϥʔ͕ൃੜ
ରॲ1ɿGraphQL Error ͷ߹ http 500 Λฦ͢ • ݩʑ GraphQL
http ͷ͜ͱΛؾʹ͍ͯ͠ͳ͍ • ڍಈ GraphQL server library ͷڍಈʹґଘ͢Δ • Response status 200 ʹ౷Ұ͢ΔϓϥΫςΟε͋Δ • Client Error Response ͷ errors ΛݟΔͷͰͳ͍ ಉ྅͕γϡοͱͯ͘͠Ε·ͨ͠🙏 4QFDJBM5IBOLT!2VSBNZ
ରॲ2ɿ Envoy ΛΊͯ Datadog APM metrics Λར༻ • ෳࡶੑʹΑΔτϥϒϧγϡʔτͷ͠͞ΛݮΒͨ͢Ί •
Envoy ͷ metrics ʹ͕͋ͬͨΘ͚Ͱͳ͍ • ӡ༻ͷ՝ଟ͘ metrics औಘҎ֎ͷϝϦοτಘΒΕ͍ͯͳ͔ͬͨ • Curcuit Breaker ೖΕ͍ͯͨͷͷൃಈͨ͠έʔε΄ͱΜͲͳ͍ • Envoy ͷ version up ରԠʢग़དྷ͍ͯͳ͍ʣ • Pod side-car container ͷىಈɾऴྃॱ੍ޚʢenvoy Λͨͳ͍ͱΤϥʔʹͳΔʣ • Rollouts Λ͍ͬͯΔ߹ͷ Patch ํ๏ʢResource ٯసͯ͠োʹͳͬͨ͜ͱʣ
খωλ: Datadog APM ݁ߏบ͕͋Δ(1) • http client ͷ APM Plugin
ͷ resource tag default Ͱ http method Ͱ͋Δ • Ѽઌ͝ͱͷ SLI ͱͯ͠࠾༻͢Δʹ hostname ͕ඞཁ • Node, Ruby ͰͦΕͧΕରԠ • ৫Ͱ http-client ͷ resource tag ͷ໋໊نΛ߹ҙ
খωλ: Datadog APM ݁ߏบ͕͋Δ(2) • trace.http.request.errors Ͱ http 5xx ֘͠ͳ͍
• ٯʹ 4xx ֘͢Δ • trace.http.request.hits.by_http_status Λར༻͢Δඞཁ͕͋Δ
උߟ: tara ͱ͍͏ͷ͜ͷϦχϡʔΞϧϓϩδΣΫτͷίʔυωʔϜͰɺ࠷ۙΠϯλϏϡʔͰύϒϦοΫʹͳͬͨ https://brand.studysapuri.jp/career/interview/article/Saori_Suzuki/ ݩʑ͋ͬͨ Ϣʔβج൫Λ ؚΉαʔϏε 3FWFSTF1SPYZ /HJOY •
SLO Λݟͨ͠ • (a)Availability ͱ (b)Latency • http ͷ metrics Λ͏ • ҎԼͷ4Օॴʹ(a/b)2छྨͣͭ • ᶃ api-gateway • ᶄ api-gateway -> main • ᶅ api-gawatey -> content • ᶆ main -> content • 🆕ᶇ api-gateway -> Ϣʔβج൫ͷ request • SLO • Availability: 99.9% • Latency: 95 percentile < 1000msec • -> αʔϏε͝ͱʹݱঢ়ΛՃຯ͠ɺ 100~500msec ᶃ ᶄ ᶅ ᶆ ᶇ ϚΠΫϩαʔϏε͝ͱͷ 4-*4-0Λഇࢭ 4-*Λ͚ΔϝϦοτ͕ෳ 4-*4-0Λཧ͢Δίετʹ ݟ߹͍ͬͯͳ͍ͨΊ Ϣʔβج൫͚4-*4-0Ճ Ϣʔβج൫͚ͷڞ௨4-*͜Ε·ͰFOWPZNFUSJDT Λར༻͍ͯͨ͠ɻFOWPZΛ֎ͨͨ͠Ί%BUBEPH "1.NFUSJDTΛར༻ͨ͠4-*4-0ΛՃ
DevSupport ݟ͠ • Sentry Exception ͰΞϓϦέʔγϣϯίʔυىҼͷͷҎ ֎શͯ Ignore ͢Δ •
SLO Alert ͕དྷͨ࣌ͷجຊతͳରॲํΛυΩϡϝϯτԽ • ରԠͰ͖ͳ͔ͬͨͷΛ2िؒʹ1ճνʔϜͰରԠ
Outline • ࣗݾհ • ʰελσΟαϓϦ தֶߨ࠲ʱʹ͍ͭͯ • SLI/SLO ͳΜͷͨΊʹ͋Δͷ͔ •
αʔϏεӡ༻ͷݱঢ়ͱ՝ • ՝ʹ࣮͋ͨͬͯࡍʹऔΓΜͩ͜ͱ • ·ͱΊ
Կ͕ى͖͍ͯͨͷ͔ • ϦϦʔε࣌ʹҰઃఆ͞Εͨ SLI/SLO 1ݟ͞Εͯͳ ͔ͬͨ • SLO ͕ԿͷՁൃش͍ͯ͠ͳ͔ͬͨ •
SLI/SLO ྆ํΛݟ͠ɺࠓޙܧଓతʹݟ͢͜ͱʹͨ͠
Ͳ͏͖ͩͬͨ͢ͷ͔ • ։ൃऀࢹ • SLO ͕ຊʹՁΛͨΒ͍ͯ͠Δͷ͔Λఆظతʹݕࠪ͢Δ • ൪ྑ͍͕ɺͨ·ʹશһͰݟΔ࣌ؒΛऔΔ͜ͱॏཁ • 1ަͩͱਂ͘ௐΔΠϯηϯςΟϒ͕ಇ͔ͳ͍
• SRE ࢹ • ։ൃνʔϜ͕ SLI/SLO Λఆظతʹݟ͢ΈΛ࡞Δ • ϫʔΫϩʔυ͝ͱʹ SLI/SLO Λࣗಈੜ͢ΔΈΛ࡞Δ
ࠓ͍͑ͨ͜ͱ ҰܾΊͨ SLI/SLO ܧଓతʹݟ͠·͠ΐ͏
Thank you! chaspy chaspy_ Engineering Manager Site Reliability and Web
Application Development at Recruit Co., Ltd. Takeshi Kondo https://chaspy.me