LINE ShopチームでのSREの取り組み / SRE in LINE Shop team

Slides from the sponsor session at JJUG CCC 2020 Fall, held on November 7, 2020.
https://ccc2020fall.java-users.jp/

LINE Developers

November 07, 2020

Transcript

  1. Service scale
    • Sticker figures *1
      • Sticker sets on sale: 8.55 million (as of March 2020)
      • Stickers sent per day: 433 million on average (as of April 2019)
    • RPS (requests/sec) *2
      • Typical peak: ~80K RPS (as of 2020/10)
      • New Year peak: ~120K RPS (as of 2020/01)
    *1 https://linecorp.com/ja/pr/news/ja/2020/3127
    *2 https://logmi.jp/tech/articles/322924
  2. Armeria feature overview
    • Asynchronous and reactive (like Spring WebFlux)
    • HTTP/2
    • Supports gRPC and Thrift in addition to REST APIs
    • Client side load balancing
      • https://armeria.dev/docs/client-service-discovery
    • Provides the features microservices need
      • Circuit breaker, Service discovery (DNS etc.), Distributed tracing (Zipkin integration), etc.
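
Client-side load balancing means the caller, rather than a proxy in front of the servers, picks which backend instance receives each request. Below is a minimal Python sketch of the simplest strategy, round robin. The class name and endpoint addresses are hypothetical, and this is not Armeria's implementation or API.

```python
import itertools


class RoundRobinSelector:
    """Hands out endpoints in a fixed rotation (illustrative only)."""

    def __init__(self, endpoints):
        endpoints = list(endpoints)
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._cycle = itertools.cycle(endpoints)

    def next_endpoint(self):
        # Each call returns the next endpoint, wrapping around at the end.
        return next(self._cycle)


# Hypothetical backend addresses, e.g. as returned by service discovery.
selector = RoundRobinSelector(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
picks = [selector.next_endpoint() for _ in range(4)]
print(picks)
```

A production client would also need to react to endpoints appearing and disappearing, which is where service discovery and health checking come in.
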
  3. Armeria references
    • Official site: https://armeria.dev
    • GitHub repo: https://github.com/line/armeria
    • LINE DEVELOPER DAY 2019 "Armeria: A microservice framework well suited everywhere"
      • https://linedevday.linecorp.com/jp/2019/sessions/D2-2
      • https://youtu.be/lii7oNzAOx0
      • https://speakerdeck.com/line_devday2019/armeria-a-microservice-framework-well-suited-everywhere
  4. Central Dogma references
    • Official site: https://line.github.io/centraldogma/
    • GitHub repo: https://github.com/line/centraldogma/
    • LINE DEVELOPER DAY 2017 "Central Dogma: LINE's Git-backed highly available service configuration repository"
      • https://www.slideshare.net/linecorp/central-dogma-lines-gitbacked-highlyavailable-service-configuration-repository
      • https://www.youtube.com/watch?v=BmgizIFwMq4
  5. SLI

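  6. SLI
    • SLI (Service Level Indicator)
      • API Availability (request success rate: successful requests / total requests)
      • Latency
      • etc.
    • SLO (Service Level Objective)
      • A reliability target for the service, based on SLIs
      • An SLO of 100% is the wrong target
        • It leaves no room for feature improvements, new features, or maintenance
        • The SLA of the underlying platform may itself be below 100%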
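
The availability SLI (successful requests / total requests) is a plain ratio, and any SLO below 100% leaves an error budget that can be spent on releases and maintenance. Below is a minimal Python sketch. The 99.9% SLO and the request counts are made-up values for illustration, not figures from the talk.

```python
def availability(success_count: int, total_count: int) -> float:
    """API availability SLI: successful requests / total requests."""
    if total_count == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return success_count / total_count


def error_budget_remaining(success_count: int, total_count: int, slo: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo) * total_count
    actual_failures = total_count - success_count
    if allowed_failures == 0:
        # A 100% SLO has no budget: any failure at all overspends it.
        return 0.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


# Hypothetical numbers for illustration only.
total = 1_000_000
success = 999_500
sli = availability(success, total)
budget = error_budget_remaining(success, total, slo=0.999)
print(f"availability={sli:.4%}, error budget remaining={budget:.0%}")
```

The `slo == 1.0` branch shows why a 100% target is impractical: a single failed request exhausts the budget.
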
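  7. SLI
    • LINE Shop uses API availability (success rate) and API latency as its SLIs
      • Based on a case study in "The Site Reliability Workbook" (Japanese edition: サイトリライアビリティワークブック)
    • Visualized with Prometheus + Grafana
      • Used to check the impact on users when a service outage occurs
    • SRE practice requires agreeing on SLOs with stakeholders, but this has not been done yet (future work)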
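  8. The service reliability hierarchy, from "Site Reliability Engineering" (Japanese edition: SRE サイトリライアビリティエンジニアリング), Part III "Practices"
    • 7. Product
    • 6. Development
    • 5. Capacity planning
    • 4. Testing and release procedures
    • 3. Postmortem / root cause analysis
    • 2. Incident response
    • 1. Monitoring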
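  11. Monitoring - Alerting
    • Error (direct impact on users)
      • Latency degradation
      • Increase in error responses/sec
      • etc.
    • Warn (a potential cause of problems, or low service impact)
      • CPU usage
      • JVM GC
      • An application server went down (no service impact if only a few go down)
      • etc.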
  12. Monitoring - Metrics collected
    • Per-API metrics
      • Server/Client latency (50th, 90th, 99th percentile, etc.)
      • Requests/sec
      • Error responses/sec
    • Log volume (Warn, Error)
    • JVM (GC, Heap, etc.)
    • DB client metrics (HikariCP, etc.)
    • Server load (CPU, Memory, Network Traffic, etc.)
    • etc.
  13. Examples of metrics exported by Armeria (partial)
    • Server/Client latency (50th, 90th, 99th percentile, etc.)
    • Requests/sec
    • Error responses/sec
    • Circuit breaker (CLOSED, OPEN, HALF_OPEN, etc.)
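
The CLOSED, OPEN, and HALF_OPEN values are the states of a circuit breaker. Below is a toy Python state machine showing how those states usually relate. The thresholds, method names, and the consecutive-failure rule are assumptions for illustration. This is not Armeria's implementation or API.

```python
import time


class CircuitBreaker:
    """Toy three-state breaker (illustrative assumptions throughout)."""

    def __init__(self, failure_threshold=3, open_seconds=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.open_seconds:
                self.state = "HALF_OPEN"  # start probing the backend again
                return True
            return False  # fail fast while the backend recovers
        return True  # CLOSED and HALF_OPEN both let requests through

    def on_success(self) -> None:
        self.failures = 0
        self.state = "CLOSED"

    def on_failure(self) -> None:
        self.failures += 1
        # A failed probe, or too many consecutive failures, opens the circuit.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()


# Drive the breaker with a fake clock so the example is deterministic.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, open_seconds=5.0, clock=lambda: now[0])
breaker.on_failure()
breaker.on_failure()            # second consecutive failure trips the breaker
states = [breaker.state]
blocked = not breaker.allow_request()
now[0] = 5.0                    # the open period has elapsed
breaker.allow_request()
states.append(breaker.state)
breaker.on_success()            # the probe succeeded
states.append(breaker.state)
print(states, blocked)
```

Exporting the current state as a metric makes it visible on a dashboard when a client has stopped calling a failing dependency.
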
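  14. Examples of metrics exported by Armeria (partial)
    • Request/Response size
      • Monitors for a caller-side problem that inflates request sizes and drives up server load
      • Monitors for a server-side problem that causes invalid (extremely small) responses to be returned
    • Time the Armeria client spends on DNS name resolution
      • Monitors for slow DNS name resolution degrading latency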
  15. Monitoring - Batch job
    Metrics:
      shop_batch_successful_time_seconds{job="foo", period="10min"} 1601313019
      Note: "1601313019" is the UNIX time at which the job last completed successfully.
    Alert rule:
      time() - shop_batch_successful_time_seconds{period="10min"} > 60 * 10 * 3
      Note: the "3" is a buffer so that a transient error does not raise an alert. Tune it to taste.
      Note: in this example, an alert fires if a batch job labeled period="10min" does not complete successfully within 30 minutes of its last successful run.
    https://www.robustperception.io/monitoring-batch-jobs-in-python
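
The alert rule is a freshness check: it compares the current time with the timestamp of the last successful run. Below is a minimal Python sketch of the same comparison, useful for reasoning about the threshold. The function name and parameters are hypothetical. Only the metric value and the `60 * 10 * 3` threshold come from the slide.

```python
def batch_job_is_stale(now: float, last_success: float,
                       period_seconds: int = 60 * 10, buffer: int = 3) -> bool:
    """Mirrors: time() - last_success > period_seconds * buffer."""
    return now - last_success > period_seconds * buffer


last_success = 1601313019  # UNIX time of the last successful run (from the slide)
checks = {
    "29 min later": batch_job_is_stale(last_success + 29 * 60, last_success),
    "30 min later": batch_job_is_stale(last_success + 30 * 60, last_success),
    "31 min later": batch_job_is_stale(last_success + 31 * 60, last_success),
}
print(checks)
```

Because the comparison is strict (`>`), the alert condition first holds just after the 30-minute mark, not at exactly 30 minutes.
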
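  18. Postmortem
    • Items covered in a postmortem
      • Scope of impact
      • Cause of the failure
      • Timeline of events
      • Measures to prevent recurrence
      • Were there problems with detecting the failure? How can detection be improved?
      • Were there problems with handling the failure? How can handling be improved?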
  20. Summary
    • Introduced the LINE Shop service
    • Introduced the LINE Shop service architecture
      • Armeria and Central Dogma
    • Things to consider when building microservices
      • Distributed Tracing
      • Circuit Breaker to prevent Cascading Failure
      • Splitting services with Graceful Degradation in mind
      • Service Discovery
  21. We are hiring
    • LINE Fukuoka Corporation
      • Server-side engineer
        https://linefukuoka.co.jp/ja/career/list/engineer/development_engineer_server-side
    • LINE Corporation
      • Senior server-side engineer / content sales platform
        https://linecorp.com/ja/career/position/665
      • Site Reliability Engineer / content sales platform
        https://linecorp.com/ja/career/position/1535