Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Site Reliability Engineering における 重要領域とパフォーマンス指...
Search
Takeshi Kondo
June 04, 2021
Technology
1
3.1k
Site Reliability Engineering における 重要領域とパフォーマンス指標の提案 / Performance Indicators for SRE
2021/06/04
第8回WebSystemArchitecture研究会(オンライン)
https://wsa.connpass.com/event/207143/
Takeshi Kondo
June 04, 2021
Tweet
Share
More Decks by Takeshi Kondo
See All by Takeshi Kondo
Slack Platform(Deno) での RAG 実装 - LangChain(js) を使ってみた / rag-implementation-on-slack-platform-deno-experimenting-with-langchain-js
chaspy
0
170
SRE の考えをマネジメントに活かす / applying SRE ideas to management
chaspy
7
6.6k
RAGの簡易評価によるフィードバックサイクル実践 / Feedback cycle practice through simplified assessment of RAGs
chaspy
2
5.1k
定量データと定性評価を用いた技術戦略の組織的実践 / Systematic implementation of technology strategies using quantitative data and qualitative evaluation
chaspy
9
1.7k
エンジニアブランディングチームの KPI / KPI's of engineer branding team
chaspy
2
2k
「SLO Review」今やるならこうする / If I had to do the "SLO Review" again
chaspy
3
1.8k
開発者とともに作る Site Reliability Engineering / SREing with Developers
chaspy
10
8k
自己診断能力の獲得を目指して / Toward the acquisition of self-diagnostic skills
chaspy
1
4.8k
『スタディサプリ 中学講座』における E2E Test の運用と計測による改善 / Improved E2E testing through measurement
chaspy
0
4.5k
Other Decks in Technology
See All in Technology
頻繁リリース × 高品質 = 無理ゲー? いや、できます!/20250306 Shoki Hyo
shift_evolve
0
150
LINE API Deep Dive Q1 2025: Unlocking New Possibilities
linedevth
1
160
Go の analysis パッケージで自作するリファクタリングツール
kworkdev
PRO
1
410
17年のQA経験が導いたスクラムマスターへの道 / 17 Years in QA to Scrum Master
toma_sm
0
400
AWS CDK コントリビュート はじめの一歩
yendoooo
1
120
製造業の会計システムをDDDで開発した話
caddi_eng
3
960
ISUCONにPHPで挑み続けてできるようになっ(てき)たこと / phperkaigi2025
blue_goheimochi
0
140
Security response for open source ecosystems
frasertweedale
0
100
Keynote - KCD Brazil - Platform Engineering on K8s (portuguese)
salaboy
0
120
AIエージェントキャッチアップと論文リサーチ
os1ma
6
1.2k
Amazon GuardDuty Malware Protection for Amazon S3を使おう
ryder472
2
100
数百台のオンプレミスのサーバーをEKSに移行した話
yukiteraoka
0
680
Featured
See All Featured
Product Roadmaps are Hard
iamctodd
PRO
52
11k
Java REST API Framework Comparison - PWX 2021
mraible
29
8.5k
Keith and Marios Guide to Fast Websites
keithpitt
411
22k
[RailsConf 2023 Opening Keynote] The Magic of Rails
eileencodes
28
9.4k
Adopting Sorbet at Scale
ufuk
75
9.3k
個人開発の失敗を避けるイケてる考え方 / tips for indie hackers
panda_program
102
18k
Measuring & Analyzing Core Web Vitals
bluesmoon
6
320
How to Think Like a Performance Engineer
csswizardry
22
1.5k
Automating Front-end Workflow
addyosmani
1369
200k
The Art of Programming - Codeland 2020
erikaheidi
53
13k
CSS Pre-Processors: Stylus, Less & Sass
bermonpainter
356
30k
Optimising Largest Contentful Paint
csswizardry
35
3.2k
Transcript
Site Reliability Engineering ʹ͓͚Δ ॏཁྖҬͱύϑΥʔϚϯεࢦඪͷఏҊ Takeshi Kondo / @chaspy 2021/06/04
ୈ8ճWebSystemArchitectureݚڀձʢΦϯϥΠϯʣ
Who am I chaspy chaspy_ Lead Software Engineer Site Reliability
at Quipper Takeshi Kondo
Agenda 1. എܠͱత 2. SRE ͕ؔΘΔྖҬ 3. ఏҊࢦඪͱଌఆํ๏ 4. ଌఆ݁Ռ
5. ·ͱΊͱࠓޙͷల
എܠͱత • SRE ͱ͍͏ Role ͕͘ීٴ͠ɺ࣮ફ͢Δاۀ͕૿͖͑ͯͨ • Ϗδωεɺ৫ͷنੑ࣭ʹΑΓͦͷׂҟͳΔ →SRE ͕ؔΘΔॏཁͳྖҬΛྨ͍ͨ͠
• ϓϩμΫτ։ൃͷΑ͏ʹϏδωεKPIΛઃఆ͠ɺͦΕΛ܁Γฦ ͠վળ͢Δͱ͍͏Ξϓϩʔν SRE ʹ༗ޮͳͣ →ྖҬ͝ͱͷύϑΥʔϚϯεࢦඪΛఆٛɾܭଌ͍ͨ͠
SRE͕ؔΘΔྖҬ • Ϋϥυ্Ͱ Web αʔϏεΛఏڙ͢ΔاۀΛఆ • 100+ Developer • 30M+
Access / Day
ʲࢀߟʳAWS Well-Architected Framework https://aws.amazon.com/architecture/well-architected/
ଌఆࢦඪͱଌఆํ๏ • Reliability • Developer Productivity • Cost • Security
• Platform
ྖҬ͝ͱͷؔੑ 1MBUGPSN 3FMJBCJMJUZ $PTU %FWFMPQFS 1SPEVDUJWJUZ 4FDVSJUZ Empowerment Empowerment Empowerment
Trade-Off Trade-Off Trade-Off
ʲࢀߟʳLean ͱ DevOps ͷՊֶ • ΤϦʔτاۀҎԼ͕ͯ͢༏Ε͍ͯΔ • σϓϩΠͷස • มߋͷϦʔυλΠϜ
• MTTR • มߋࣦഊ https://book.impress.co.jp/books/1118101029
ଌఆࢦඪͱଌఆํ๏ • Reliability • Developer Productivity • Cost • Security
• Platform
ଌఆࢦඪͱଌఆํ๏ ྖҬ ࢦඪ ଌఆํ๏ 3FMJBCJMJUZ .553 োൃੜ࣌ʹɺোใࠂϑϩʔͷ ඞਢ߲ͱͯ͠खಈͰܭଌ %FWFMPQFS1SPEVDUJWJUZ σϓϩΠճ
$*αʔϏεͷNFUSJDT %FWFMPQFS1SPEVDUJWJUZ σϓϩΠ࣌ؒ $*αʔϏεͷNFUSJDT %FWFMPQFS1SPEVDUJWJUZ $*҆ఆੑ $*αʔϏεͷNFUSJDT %FWFMPQFS1SPEVDUJWJUZ มߋࣦഊ ຊ൪ڥσϓϩΠʹରԠ͢Δϒϥϯ νͷ3FWFSUDPNNJUͷ
ଌఆ݁Ռ • MTTR • σϓϩΠճ • σϓϩΠ࣌ؒ • CI ҆ఆੑ
• มߋࣦഊ
MTTR Plot with Trendline
MTTR per half year
Histgram
MTTR ʹର͢Δߟ • ܭଌՄೳੑͷ • ࣗಈͰूܭͰ͖ΔΈ͕ඞཁ • ͦͷͨΊʹ Incident Response
ͷܕԽͱͦΕʹର͢Δ Tool ͕ඞཁ • SeverityʢIncident ͷ Level ఆٛʣ/ োൃੜɾݕɾ෮چͷ࣌ؒΛඞͣه͢ΔϧʔϧͳͲ • σʔλྔͷ • Πϯγσϯτ2Ͱ͔͕ͨͩ50ఔɺेͳσʔλྔ͕ಘΒΕͳ͍ • Πϯγσϯτ͕ଟ͍͜ͱ SRE ͷతͱ૬͢Δ • Β͖ͭͷ • σʔλྔ͕ेͰͳ͍ͱɺҰ෦ͷ࣌ؒোʹҾ͖ͣΒΕͯ͠·͏
MTTR ʹର͢Δߟ • ࢦඪͱͯ͠༗ӹ͔Ͳ͏͔·ͩஅͰ͖ͳ͍ • গͳ͘ͱҎԼͷͰ༗ӹͳͷͰτϥοΩϯάΛଓ͚Δ • Incident Response ͷܕԽ
• ࣌ؒΠϯγσϯτʹର͢Δվળ
ʲࢀߟʳIncident Metrics in SRE • MTTR ࢦඪʹ͖͢Ͱͳ͍ͱओு • ݅ෆͱΒ͖ͭͷେ͖͕͞ཧ༝ •
ͰͲΕΛ࠾༻͖͔͢ݴٴ͕ͳ͍ https://sre.google/resources/practices-and-processes/incident-metrics-in-sre/
ʲิʳDeveloper Productivity ྖҬͷܭଌର • monorepo Λ࠾༻ • master branch ͰෳͷΞϓϦ͕ಉ࣌ʹσϓϩΠ͞ΕΔ
• Database Λڞ༗͢Δ Distributed monolith ͱͳ͍ͬͯΔ • ͜ΕΒجຊతʹि࣍ϦϦʔε͞ΕΔ • ͜ΕҎ֎ͷ microservices ݸผͰϦϦʔε͞ΕΔ͕ɺࠓճ ܭଌର֎
σϓϩΠճ
σϓϩΠճʹର͢Δߟ • جຊతʹ Weekly Release Ͱ͋ΔͨΊɺʹ26ճඞͣσϓϩΠ ͞ΕΔ • Γ HOTFIX
• ԿͷͨΊͷ HOTFIX ͔ʁ • มߋࣦഊʢޙड़ʣͱ߹ΘͤͯΈͳ͍ͱҙຯ͕ബͦ͏ • ࣮ࡍ2020લ Production ͷ Kubernetes manifest มߋͷͨΊͷ HOTFIX ͕ଟ͔ͬͨ • σϓϩΠ͕ݮগͳͷ Microservices Խ͍ͯ͠Δ͔Β • Microservices ΛؚΊͯܭଌ͢Δඞཁ͕͋Δ
σϓϩΠ࣌ؒ
σϓϩΠ࣌ؒʹର͢Δߟ • ະੳʢه༧ఆʣ • ͜ͷ࣌ؒมߋࣦഊ࣌ͷ Revert ͷ࣌ؒͱҰக͢ΔͷͰɺ ͘͢Ε͢Δ΄Ͳ MTTR ݮʹͭͳ͕Δͣ
CI ҆ఆੑ
CI ҆ఆੑʹؔ͢Δߟ • Time Window 7 Days • 30
Days, 90 Days ͳͲෳͷ Time Window Ͱܭଌͨ͠΄͏͕ྑ͍ • ຊ൪͚ͩͰͳ͘ɺ։ൃϒϥϯνಉ༷ʹܭଌ͢Δ͖ • ࢦඪͱͯ͠ෆద • جຊతʹ 100% ʹ͚ۙΕ͍ۙ΄Ͳྑ͍ • SLO ͱͯ͠ଊ͑ͯɺඪΛҧͨ͠Βࠜຊमਖ਼͢ΔΞϓϩʔν͕ྑ͍ • ͜ͷΛؚΉผͷࢦඪΛ༻͍ͨ΄͏͕ྑ͍ • Time To DeliveryʢมߋͷϦʔυλΠϜʣ • MTTR • ͨͩ͠ɺੳՄೳੑॏཁɻCI ͕ෆ҆ఆͳͱ͖ɺͲͷ Job ͕Ͳͷఔෆ҆ఆ͔ΛΔඞཁ͋Δ
มߋࣦഊ
มߋࣦഊ
มߋࣦഊʹؔ͢Δߟ • "มߋࣦഊ"ͷఆٛͷ • ԿΛͬͯ"มߋࣦഊ"ͱ͢Δ͔ͷఆ͕ٛඞཁ • Label ༩ͳͲͷӡ༻ϧʔϧ͕ͳ͍ͱܭଌ͕͍͠ • ܭଌํ๏ͷ
• ຊ൪ϒϥϯνͷ Revert "มߋࣦഊ"Ҏ֎Ͱى͖͍ͯͨ • Argo Rollouts Λ࠾༻͍ͯ͠Δ • ௨ৗ Canary Strategy Λ༗ޮʹ͍ͯ͠ͳ͍ • ॏཁػೳͳͲ Canary ͍ͨ͠ͱ͖͚ͩ༗ޮʹ͠ɺ100% ϦϦʔεͨ͠Β Revert ͍ͯͨ͠ • σʔλྔͷ • MTTR ಉ༷ͷ
·ͱΊͱߟ • SRE ۀΛ5ͭͷྖҬʹྨ͠ɺ͏ͪ2ͭͷྖҬ͔ΒɺʮLean ͱ DevOps ͷՊֶʯΛࢀߟʹɺࢦ ඪʹͳΓ͏Δ͔Λܭଌͨ͠ • ༗ޮͳࢦඪͷ݅
• ेʹσʔλྔ͕͋Δ͜ͱ • MTTR, มߋࣦഊσʔλྔΛಘΔ͜ͱ͕͍͠ • ͜ΕΒ͕සൃ͢Δঢ়ଶ SRE ͷతͱ͢Δ • ͦͷࢦඪΛؚΉଞͷࢦඪ͕ଘࡏ͠ͳ͍͜ͱ • CI ҆ఆੑ MTTR, Time To DeliveryʢมߋͷϦʔυλΠϜʣͰิ͑Δ • σϓϩΠ࣌ؒ MTTR, Time To DeliveryʢมߋͷϦʔυλΠϜʣͰิ͑Δ • σϓϩΠճΛ݈શʹ૿͢ʹมߋࣦഊͷܭଌ͕ඞཁ
·ͱΊͱߟ • MTTR 🚀 • τϥοΩϯάܧଓ • ܧଓతʹऔಘ͢ΔͨΊʹ Incident Response
ͷվળ͕ඞཁ • σϓϩΠճ🚀 • microservices ؚΊͯܭଌ • σϓϩΠ࣌ؒ🤔 • ։ൃϒϥϯνͷܭଌ͕ඞཁ • ظతʹ MTTR / มߋͷϦʔυλΠϜͰิ͏ • CI ҆ఆੑ🤔 • ։ൃϒϥϯνͷܭଌ͕ඞཁ • ظతʹ MTTR / มߋͷϦʔυλΠϜͰิ͏ • มߋࣦഊ🚀 • มߋࣦഊͷఆٛͱӡ༻ϧʔϧࡦఆ͕ඞཁ • มߋͷϦʔυλΠϜ🤔 • Develop branch Ͱͷ First commit ͔Β Production ͷ Code มߋ·ͰΛܭଌͰ͖Δͱྑ͍͕ɺม͕ଟ͘ɺΒ͖͕ͭେ͖͍Մೳੑ͕͋Δ
ࠓޙͷల • ݱঢ়ଌఆ͍ͯ͠ΔͷܧଓɺࣗಈԽΛࢦ͢ • "ࣦഊ"ʹؔ͢Δࢦඪ׆༻ͮ͠Β͍Մೳੑ͕͋Δ͕ɺܧଓͯ͠ܭ • MTTR, มߋࣦഊ • ਓ͕ؒؔΘΔϓϩηεͰܭଌͷͨΊʹఆٛɺϧʔϧɺن͕ඞཁ
• ଞͷྖҬʹؔͯ͠ࢦඪΛఏҊ͢Δ • ܭଌɾՄࢹԽͷσβΠϯύλʔϯͷཧΛ͍ͨ͠ • ୭͕ԿͰܭଌͯ͠ՄࢹԽͯࣗ͠తʹܧଓతվળ͕Ͱ͖ΔੈքΛࢦ͢
Thank you! chaspy chaspy_ Lead Software Engineer Site Reliability at
Quipper Takeshi Kondo