Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Site Reliability Engineering における 重要領域とパフォーマンス指...
Search
Takeshi Kondo
June 04, 2021
Technology
1
3.3k
Site Reliability Engineering における 重要領域とパフォーマンス指標の提案 / Performance Indicators for SRE
2021/06/04
第8回WebSystemArchitecture研究会(オンライン)
https://wsa.connpass.com/event/207143/
Takeshi Kondo
June 04, 2021
Tweet
Share
More Decks by Takeshi Kondo
See All by Takeshi Kondo
SRE NEXT CfP チームが語る 聞きたくなるプロポーザルとは / Proposals by the SRE NEXT CfP Team that are sure to be accepted
chaspy
2
1.4k
Slack Platform(Deno) での RAG 実装 - LangChain(js) を使ってみた / rag-implementation-on-slack-platform-deno-experimenting-with-langchain-js
chaspy
0
220
SRE の考えをマネジメントに活かす / applying SRE ideas to management
chaspy
7
7.6k
RAGの簡易評価によるフィードバックサイクル実践 / Feedback cycle practice through simplified assessment of RAGs
chaspy
2
5.5k
定量データと定性評価を用いた技術戦略の組織的実践 / Systematic implementation of technology strategies using quantitative data and qualitative evaluation
chaspy
9
1.9k
エンジニアブランディングチームの KPI / KPI's of engineer branding team
chaspy
2
2.2k
「SLO Review」今やるならこうする / If I had to do the "SLO Review" again
chaspy
3
2k
開発者とともに作る Site Reliability Engineering / SREing with Developers
chaspy
10
8.4k
自己診断能力の獲得を目指して / Toward the acquisition of self-diagnostic skills
chaspy
1
5.2k
Other Decks in Technology
See All in Technology
フルカイテン株式会社 エンジニア向け採用資料
fullkaiten
0
8.6k
我々は雰囲気で仕事をしている / How can we do vibe coding as well
naospon
2
220
つくって納得、つかって実感! 大規模言語モデルことはじめ
recruitengineers
PRO
24
6.2k
OpenAPIから画面生成に挑戦した話
koinunopochi
0
160
JavaScript 研修
recruitengineers
PRO
3
180
JOAI発表資料 @ 関東kaggler会
joai_committee
1
350
アジャイルテストで高品質のスプリントレビューを
takesection
0
120
JuniorからSeniorまで: DevOpsエンジニアの成長ロードマップ
yuriemori
0
220
認知戦の理解と、市民としての対抗策
hogehuga
0
370
生成AI利用プログラミング:誰でもプログラムが書けると 世の中どうなる?/opencampus202508
okana2ki
0
190
モダンフロントエンド 開発研修
recruitengineers
PRO
3
350
新規案件の立ち上げ専門チームから見たAI駆動開発の始め方
shuyakinjo
0
130
Featured
See All Featured
Keith and Marios Guide to Fast Websites
keithpitt
411
22k
Imperfection Machines: The Place of Print at Facebook
scottboms
268
13k
How to Ace a Technical Interview
jacobian
279
23k
Building Better People: How to give real-time feedback that sticks.
wjessup
367
19k
Refactoring Trust on Your Teams (GOTO; Chicago 2020)
rmw
34
3.1k
Bootstrapping a Software Product
garrettdimon
PRO
307
110k
What’s in a name? Adding method to the madness
productmarketing
PRO
23
3.6k
Statistics for Hackers
jakevdp
799
220k
Unsuck your backbone
ammeep
671
58k
Facilitating Awesome Meetings
lara
55
6.5k
GraphQLとの向き合い方2022年版
quramy
49
14k
It's Worth the Effort
3n
187
28k
Transcript
Site Reliability Engineering ʹ͓͚Δ ॏཁྖҬͱύϑΥʔϚϯεࢦඪͷఏҊ Takeshi Kondo / @chaspy 2021/06/04
ୈ8ճWebSystemArchitectureݚڀձʢΦϯϥΠϯʣ
Who am I chaspy chaspy_ Lead Software Engineer Site Reliability
at Quipper Takeshi Kondo
Agenda 1. എܠͱత 2. SRE ͕ؔΘΔྖҬ 3. ఏҊࢦඪͱଌఆํ๏ 4. ଌఆ݁Ռ
5. ·ͱΊͱࠓޙͷల
എܠͱత • SRE ͱ͍͏ Role ͕͘ීٴ͠ɺ࣮ફ͢Δاۀ͕૿͖͑ͯͨ • Ϗδωεɺ৫ͷنੑ࣭ʹΑΓͦͷׂҟͳΔ →SRE ͕ؔΘΔॏཁͳྖҬΛྨ͍ͨ͠
• ϓϩμΫτ։ൃͷΑ͏ʹϏδωεKPIΛઃఆ͠ɺͦΕΛ܁Γฦ ͠վળ͢Δͱ͍͏Ξϓϩʔν SRE ʹ༗ޮͳͣ →ྖҬ͝ͱͷύϑΥʔϚϯεࢦඪΛఆٛɾܭଌ͍ͨ͠
SRE͕ؔΘΔྖҬ • Ϋϥυ্Ͱ Web αʔϏεΛఏڙ͢ΔاۀΛఆ • 100+ Developer • 30M+
Access / Day
ʲࢀߟʳAWS Well-Architected Framework https://aws.amazon.com/architecture/well-architected/
ଌఆࢦඪͱଌఆํ๏ • Reliability • Developer Productivity • Cost • Security
• Platform
ྖҬ͝ͱͷؔੑ 1MBUGPSN 3FMJBCJMJUZ $PTU %FWFMPQFS 1SPEVDUJWJUZ 4FDVSJUZ Empowerment Empowerment Empowerment
Trade-Off Trade-Off Trade-Off
ʲࢀߟʳLean ͱ DevOps ͷՊֶ • ΤϦʔτاۀҎԼ͕ͯ͢༏Ε͍ͯΔ • σϓϩΠͷස • มߋͷϦʔυλΠϜ
• MTTR • มߋࣦഊ https://book.impress.co.jp/books/1118101029
ଌఆࢦඪͱଌఆํ๏ • Reliability • Developer Productivity • Cost • Security
• Platform
ଌఆࢦඪͱଌఆํ๏ ྖҬ ࢦඪ ଌఆํ๏ 3FMJBCJMJUZ .553 োൃੜ࣌ʹɺোใࠂϑϩʔͷ ඞਢ߲ͱͯ͠खಈͰܭଌ %FWFMPQFS1SPEVDUJWJUZ σϓϩΠճ
$*αʔϏεͷNFUSJDT %FWFMPQFS1SPEVDUJWJUZ σϓϩΠ࣌ؒ $*αʔϏεͷNFUSJDT %FWFMPQFS1SPEVDUJWJUZ $*҆ఆੑ $*αʔϏεͷNFUSJDT %FWFMPQFS1SPEVDUJWJUZ มߋࣦഊ ຊ൪ڥσϓϩΠʹରԠ͢Δϒϥϯ νͷ3FWFSUDPNNJUͷ
ଌఆ݁Ռ • MTTR • σϓϩΠճ • σϓϩΠ࣌ؒ • CI ҆ఆੑ
• มߋࣦഊ
MTTR Plot with Trendline
MTTR per half year
Histgram
MTTR ʹର͢Δߟ • ܭଌՄೳੑͷ • ࣗಈͰूܭͰ͖ΔΈ͕ඞཁ • ͦͷͨΊʹ Incident Response
ͷܕԽͱͦΕʹର͢Δ Tool ͕ඞཁ • SeverityʢIncident ͷ Level ఆٛʣ/ োൃੜɾݕɾ෮چͷ࣌ؒΛඞͣه͢ΔϧʔϧͳͲ • σʔλྔͷ • Πϯγσϯτ2Ͱ͔͕ͨͩ50ఔɺेͳσʔλྔ͕ಘΒΕͳ͍ • Πϯγσϯτ͕ଟ͍͜ͱ SRE ͷతͱ૬͢Δ • Β͖ͭͷ • σʔλྔ͕ेͰͳ͍ͱɺҰ෦ͷ࣌ؒোʹҾ͖ͣΒΕͯ͠·͏
MTTR ʹର͢Δߟ • ࢦඪͱͯ͠༗ӹ͔Ͳ͏͔·ͩஅͰ͖ͳ͍ • গͳ͘ͱҎԼͷͰ༗ӹͳͷͰτϥοΩϯάΛଓ͚Δ • Incident Response ͷܕԽ
• ࣌ؒΠϯγσϯτʹର͢Δվળ
ʲࢀߟʳIncident Metrics in SRE • MTTR ࢦඪʹ͖͢Ͱͳ͍ͱओு • ݅ෆͱΒ͖ͭͷେ͖͕͞ཧ༝ •
ͰͲΕΛ࠾༻͖͔͢ݴٴ͕ͳ͍ https://sre.google/resources/practices-and-processes/incident-metrics-in-sre/
ʲิʳDeveloper Productivity ྖҬͷܭଌର • monorepo Λ࠾༻ • master branch ͰෳͷΞϓϦ͕ಉ࣌ʹσϓϩΠ͞ΕΔ
• Database Λڞ༗͢Δ Distributed monolith ͱͳ͍ͬͯΔ • ͜ΕΒجຊతʹि࣍ϦϦʔε͞ΕΔ • ͜ΕҎ֎ͷ microservices ݸผͰϦϦʔε͞ΕΔ͕ɺࠓճ ܭଌର֎
σϓϩΠճ
σϓϩΠճʹର͢Δߟ • جຊతʹ Weekly Release Ͱ͋ΔͨΊɺʹ26ճඞͣσϓϩΠ ͞ΕΔ • Γ HOTFIX
• ԿͷͨΊͷ HOTFIX ͔ʁ • มߋࣦഊʢޙड़ʣͱ߹ΘͤͯΈͳ͍ͱҙຯ͕ബͦ͏ • ࣮ࡍ2020લ Production ͷ Kubernetes manifest มߋͷͨΊͷ HOTFIX ͕ଟ͔ͬͨ • σϓϩΠ͕ݮগͳͷ Microservices Խ͍ͯ͠Δ͔Β • Microservices ΛؚΊͯܭଌ͢Δඞཁ͕͋Δ
σϓϩΠ࣌ؒ
σϓϩΠ࣌ؒʹର͢Δߟ • ະੳʢه༧ఆʣ • ͜ͷ࣌ؒมߋࣦഊ࣌ͷ Revert ͷ࣌ؒͱҰக͢ΔͷͰɺ ͘͢Ε͢Δ΄Ͳ MTTR ݮʹͭͳ͕Δͣ
CI ҆ఆੑ
CI ҆ఆੑʹؔ͢Δߟ • Time Window 7 Days • 30
Days, 90 Days ͳͲෳͷ Time Window Ͱܭଌͨ͠΄͏͕ྑ͍ • ຊ൪͚ͩͰͳ͘ɺ։ൃϒϥϯνಉ༷ʹܭଌ͢Δ͖ • ࢦඪͱͯ͠ෆద • جຊతʹ 100% ʹ͚ۙΕ͍ۙ΄Ͳྑ͍ • SLO ͱͯ͠ଊ͑ͯɺඪΛҧͨ͠Βࠜຊमਖ਼͢ΔΞϓϩʔν͕ྑ͍ • ͜ͷΛؚΉผͷࢦඪΛ༻͍ͨ΄͏͕ྑ͍ • Time To DeliveryʢมߋͷϦʔυλΠϜʣ • MTTR • ͨͩ͠ɺੳՄೳੑॏཁɻCI ͕ෆ҆ఆͳͱ͖ɺͲͷ Job ͕Ͳͷఔෆ҆ఆ͔ΛΔඞཁ͋Δ
มߋࣦഊ
มߋࣦഊ
มߋࣦഊʹؔ͢Δߟ • "มߋࣦഊ"ͷఆٛͷ • ԿΛͬͯ"มߋࣦഊ"ͱ͢Δ͔ͷఆ͕ٛඞཁ • Label ༩ͳͲͷӡ༻ϧʔϧ͕ͳ͍ͱܭଌ͕͍͠ • ܭଌํ๏ͷ
• ຊ൪ϒϥϯνͷ Revert "มߋࣦഊ"Ҏ֎Ͱى͖͍ͯͨ • Argo Rollouts Λ࠾༻͍ͯ͠Δ • ௨ৗ Canary Strategy Λ༗ޮʹ͍ͯ͠ͳ͍ • ॏཁػೳͳͲ Canary ͍ͨ͠ͱ͖͚ͩ༗ޮʹ͠ɺ100% ϦϦʔεͨ͠Β Revert ͍ͯͨ͠ • σʔλྔͷ • MTTR ಉ༷ͷ
·ͱΊͱߟ • SRE ۀΛ5ͭͷྖҬʹྨ͠ɺ͏ͪ2ͭͷྖҬ͔ΒɺʮLean ͱ DevOps ͷՊֶʯΛࢀߟʹɺࢦ ඪʹͳΓ͏Δ͔Λܭଌͨ͠ • ༗ޮͳࢦඪͷ݅
• ेʹσʔλྔ͕͋Δ͜ͱ • MTTR, มߋࣦഊσʔλྔΛಘΔ͜ͱ͕͍͠ • ͜ΕΒ͕සൃ͢Δঢ়ଶ SRE ͷతͱ͢Δ • ͦͷࢦඪΛؚΉଞͷࢦඪ͕ଘࡏ͠ͳ͍͜ͱ • CI ҆ఆੑ MTTR, Time To DeliveryʢมߋͷϦʔυλΠϜʣͰิ͑Δ • σϓϩΠ࣌ؒ MTTR, Time To DeliveryʢมߋͷϦʔυλΠϜʣͰิ͑Δ • σϓϩΠճΛ݈શʹ૿͢ʹมߋࣦഊͷܭଌ͕ඞཁ
·ͱΊͱߟ • MTTR 🚀 • τϥοΩϯάܧଓ • ܧଓతʹऔಘ͢ΔͨΊʹ Incident Response
ͷվળ͕ඞཁ • σϓϩΠճ🚀 • microservices ؚΊͯܭଌ • σϓϩΠ࣌ؒ🤔 • ։ൃϒϥϯνͷܭଌ͕ඞཁ • ظతʹ MTTR / มߋͷϦʔυλΠϜͰิ͏ • CI ҆ఆੑ🤔 • ։ൃϒϥϯνͷܭଌ͕ඞཁ • ظతʹ MTTR / มߋͷϦʔυλΠϜͰิ͏ • มߋࣦഊ🚀 • มߋࣦഊͷఆٛͱӡ༻ϧʔϧࡦఆ͕ඞཁ • มߋͷϦʔυλΠϜ🤔 • Develop branch Ͱͷ First commit ͔Β Production ͷ Code มߋ·ͰΛܭଌͰ͖Δͱྑ͍͕ɺม͕ଟ͘ɺΒ͖͕ͭେ͖͍Մೳੑ͕͋Δ
ࠓޙͷల • ݱঢ়ଌఆ͍ͯ͠ΔͷܧଓɺࣗಈԽΛࢦ͢ • "ࣦഊ"ʹؔ͢Δࢦඪ׆༻ͮ͠Β͍Մೳੑ͕͋Δ͕ɺܧଓͯ͠ܭ • MTTR, มߋࣦഊ • ਓ͕ؒؔΘΔϓϩηεͰܭଌͷͨΊʹఆٛɺϧʔϧɺن͕ඞཁ
• ଞͷྖҬʹؔͯ͠ࢦඪΛఏҊ͢Δ • ܭଌɾՄࢹԽͷσβΠϯύλʔϯͷཧΛ͍ͨ͠ • ୭͕ԿͰܭଌͯ͠ՄࢹԽͯࣗ͠తʹܧଓతվળ͕Ͱ͖ΔੈքΛࢦ͢
Thank you! chaspy chaspy_ Lead Software Engineer Site Reliability at
Quipper Takeshi Kondo