Upgrade to Pro — share decks privately, control downloads, hide ads and more …

#12 “The Tail at Scale”

#12 “The Tail at Scale”

Communications of the ACM, February 2013, Vol. 56 No. 2, Pages 74-80 10.1145/2408776.2408794
https://dl.acm.org/doi/10.1145/2408776.2408794

cafenero_777

June 14, 2023
Tweet

More Decks by cafenero_777

Other Decks in Technology

Transcript

  1. $ which • The Tail at Scale • Je ff

    rey Dean: a Google fellow in the systems infrastructure group of google inc. • Luiz André Barroso: a Google fellow and technical lead of core computing infrastructure at google inc. • Communications of the ACM, February 2013, Vol. 56 No. 2, Pages 74-80 10.1145/2408776.2408794 • contributed articles
  2. Introduction • Ϩεϙϯε͕଎͍ (100ms< latency)ͱΑΓࣗવͳ΋ͷʹײ͡ΒΕΔ • 1 : 1 ->

    1 : 1000 online system in Data Center • Google glass (wearableͳ΋ͷ)ͩͱಛʹॏཁ • ෳࡶͳେن໛γεςϜ • tail-latencyΛ୹͘͢Δͷ͕ࠔ೉ & ߴϨΠςϯγʔ͕શମΛࢧ഑ • Ξφϩδʔ • fault-tolerant: ৴པੑͷ௿͍ύʔπ͔Β৴པੑͷߴ͍શମΛ࡞Δ • tail-tolerant: Ԡ౴ͷ༧ଌ͠ʹ͍͘ύʔπ͔ΒԠ౴͕༧ଌͰ͖ΔશମΛ࡞Δ • ߴ஗ԆͷࣄྫɾݪҼɾܰݮςΫχοΫΛ঺հ
  3. Why Variability Exists? • ڞ༗ϦιʔεɿCPU core, Memory/NW bandwidth • deamons:

    Ϧιʔε͸࢖Θͳ͍͕ɺ࣌ؒ͸࢖͏(਺msఔ౓) • άϩʔόϧϦιʔεɿNW switch, shared FS • Maintenance activity: ෼ࢄFSͷσʔλ࠶ߏ੒ɺϩάѹॖɺΨϕίϨͰఆظతʹ͕͔͔࣌ؒΔ • ΩϡʔΠϯάɿproxyαʔόɺNW switch • CPU throttling: ௕࣌ؒϒʔετ͢Δͱൃ೤->clockམͱ͢->஗͘ͳΔ • SSD: SSD಺σʔλΨϕίϨ͢ΔͨΊʹ1/100ఔ౓஗͘ͳΔՄೳੑ͕͋Δ • Energy management: লిྗϞʔυ͔Βͷ”෮ؼ”ʹ͕͔͔࣌ؒΔ
  4. Component-Level Variability Ampli fi ed By Scale • େن໛ΦϯϥΠϯαʔϏεͷ଴ͪ࣌ؒ୹ॖํ๏͸ϚγϯฒྻԽ •

    جຊ͸ϦΫΤετͷ෼ׂͱαϒΦϖϨʔγϣϯɻ͔͠͠ɺɺɺ • ྫ1ʢάϥϑʣɿ • 1 in 100 (100୆ʹ1୆͕1s͔͔ͬͨ৔߹ʣ • αʔό͕100୆ͷ৔߹ɺ63ˋ͕1ඵҎ্͔͔Δ • 1 in 10000 • ໿2ׂ͕1ඵҎ্͔͔Δ@αʔό2000୆ • ྫ2 ʢදʣ͸GoogleͷࣅͨΑ͏ͳ෼ࢄαʔϏε࣮ྫ • 1ͭͷϦΫΤετ͕ऴΘΔͷ͸10ms • શͯͷϦΫΤετ͕ऴΘΔͷ͸140ms (latency͸fan-out͞ΕΔ/޿͕Δ܏޲) 100 ஗͍5%͕70ms͔͔Δ = 1 − ( 99 100) 100 ref. όʔεσʔύϥυοΫε
  5. Reducing Component Variability • Ԡ౴ੑͷվળํ๏ • ༏ઌ౓ͷௐ੔ɿOSͷdisk queueΑΓϦΫΤετͷqueueΛ༏ઌ • OSͷ༏ઌ౓͸௿͍

    (shallow queue) • head-of-line blockingͷ࡟ݮ • େ͖ͳϦΫΤετΛ෼ׂͯ͠ࡉ੾ΕʹɻΠϯλʔϦʔϒॲཧͰޮՌ͋Γ • όοΫάϥ΢ϯυͱͷௐ੔ • ϩάѹॖͳͲͷߴෛՙॲཧ͸෼ׂͯ͠ෛՙ͕௿͍࣌ʹߦ͏ʢϦΫΤετʹӨڹΛड͚ͤ͞ͳ͍ʣ • ϦΫΤετΛಉظతʹ෼ࢄ->όʔετͰෛՙ͸ى͖Δ͕ϨΠςϯγ͸վળ • Ωϟογϡ͸ʁ • શtaskʹcache͕ͳ͚Ε͹tail-latencyʹ͸ޮՌ͕ͳ͍ ஗Ԇ࣌ؒݪҼΛ׬શഉআΛ໨ࢦ͢ʁ
  6. Living with Latency Variability • ࣮ࡍ͸ʁ • ܭࢉϦιʔεΛڞ༗͢Δ؀ڥɾࠓͲ͖ͷن໛ɾෳࡶ͞Ͱ͸଴ͪ࣌ؒͷ׬શഉআ͸ແཧʂ • ༗ޮͳखஈ

    • 1. Within Request Short-Term Adaptations • ௕͘ͳΔલʹ਺10msͰϦΫΤετΛௐ੔ • 2. Cross-Request Long-Term Adaptations • ਺10s-਺mͷϦΫΤετ͸ʢ௕ظతʹʣσʔλ഑ஔΛௐ੔ ϦΫΤετΛ޻෉ʂ σʔλ഑ஔͳͲΛ޻෉ʂ
  7. 1. Within Request Short-Term Adaptations (1/2) • σʔλϨϓϦΧͰϦΫΤετΛ෼ࢄɻread only࣌ʹಛʹ༗ޮɻ •

    Hedged requests • ෳ਺ͷϨϓϦΧʹಉ͡ϦΫΤετΛૹΓɺ࠷ॳͷԠ౴ͷ݁ՌΛ࢖͏ • φΠʔϒͳ࣮૷Ͱ͸ߴෛՙ͕ͩɺ஗Ԇ࡟ݮޮՌ͸ൈ܈ • ظ଴஗Ԇͷ95%·Ͱ଴͔ͬͯΒ2౓໨ͷϦΫΤετ (hedged request)ΛૹΔ • ྫɿread 1k keys on BigTable • 10ms hedge: 1800ms -> 74ms @ 99.9%ile • hedged request͸primary requestΑΓ༏ઌ౓Λ௿͘͢Δ޻෉
  8. 1. Within Request Short-Term Adaptations (2/2) • Tied requests •

    ΩϡʔΠϯά஗Ԇͷ཈੍ • ಉ࣌ʹ2ͭϦΫΤετΛ౤͛Δ -> ॲཧ͢Δલʹ(1msఔ౓଴͔ͬͯΒ)αʔόؒͰstatus֬ೝ-> Ωϡʔʹೖ͍ͬͯΕ͹Ωϟϯηϧɺ΋͘͠͸େ෯ʹ༏ઌ౓௿͛ͤ͞Δ • ྫɿGoogleͷ෼ࢄϑΝΠϧγεςϜ • ϦΫΤετΛग़͢લʹidleͳαʔόΛ୳ͯ͠࢖͑͹ྑ͍ʁ • Ή͠Ζେมɺɺ • ෛՙͷ࣌ؒมԽɺෛՙଌఆࠔ೉ɺϗοτεϙοτԽ ௿ෛՙɾߴෛՙͱ΋ʹtailվળ
  9. 2. Cross-Request Long-Term Adaptations • ಉ͡ίετʹͳΔΑ͏ʹσʔλ෼ׂ(partition)͢Δɺ͕ɺɺ • ֎ཚͰ࣌ؒͱͱ΋ʹίετมԽɻ֎Ε஋ʢΑ͘read͞ΕΔ౳ʣΛड͚࣋ͭɻ • 1.

    Micro-partitions • ͱʹ͔͘ࡉ͔͘෼ࢄͤͯ͞ɺۉҰʹͤ͞ΔɻϚγϯ1୆Ͱ20partitions. • 2. Selective replication • ͦΕͰ΋ෛՙ͕ߴ͍partition͚ͩ͸ߋʹ૿΍͢ɻྫ: Asia DC͕མͪͨͱ͖͸USʹΞδΞݴޠϨϓϦΧΛ࡞੒ • 3. Latency-induced probation • ஗Ԇ෼෍Λଌఆ͠ɺϨεϙϯε͕஗͍αʔόΛআڈɾอޢɻߴෛՙαʔόΛ֎͢ͱ௿ෛՙʹͳΔʢʂʣ
  10. Large Information Retrieval Systems • େن໛৘ใݕࡧγεςϜɿ࠷ߴͷ݁ՌΛΏͬ͘Γग़͢ΑΓ΋ྑ͍݁ՌΛૉૣ͘ग़͢΄͏͕ྑ͍ • Good enough schemes:

    • े෼ͳׂ߹ͰϨεϙϯε͕໭͖ͬͯͨΒ׬ྃͱ͢Δ • ϕετͳճ౴͸1000ΫΤϦʹ1ͭɻίʔύεͷϨϓϦΧΛߟ͑Δͱ֬཰͸΋ͬͱ௿͍ • Canary requests: • ࣌ؒͷ͔͔ΔϦΫΤετΛςετ͢Δ • ࿈࠯తʹ௕͍஗Ԇ͕ى͖ͨ৔߹͸ϑϥάΛཱͯͯfan-out͠ͳ͍Α͏ʹɻ
  11. Conclusion • ܭࢉྔ͕ଟ͘ɺΠϯλϥΫςΟϒͳେن໛αʔϏεͰ͸Ԡ౴ੑ͕ॏཁ • ύϑΥʔϚϯεͷ͹Β͖ͭΛݮΒ͚ͩ͢Ͱ͸ෆे෼ • fault-tolerantͱಉ͘͡Կ͔ͷ࢓૊Έ͕ඞཁ • શͯͷ஗ԆݪҼΛͳ͘͢͜ͱ͸ෆՄೳ •

    tail-tolerantͳٕज़Λ։ൃ • ෳࡶγεςϜ΄Ͳtail-tolerantٕज़ͷॏཁ౓͸૿͍ͯ͘͠ • ௥ՃϦιʔε͕ඞཁͳ৔߹΋͋Δ • େ෯ͳϨΠςϯγվળ͕Ͱ͖Δ • ʢϥΠϒϥϦԽ΍ΧϓηϧԽ͢Δ͜ͱͰɺʣΞϓϦέʔγϣϯઃܭ͸ΑΓγϯϓϧʹͳΔ
  12. EoP