#39 “Profiling a warehouse-scale computer”

ISCA ’15 (International Symposium on Computer Architecture)
https://dl.acm.org/doi/10.1145/2749469.2750392

cafenero_777

June 22, 2023

Transcript

  1. Agenda
    • Target paper
    • Overview and why I wanted to read it
    • Paper sections:
      1. Introduction
      2. Background and methodology
      3. Workload diversity
      4. Datacenter tax
      5. Microarchitecture analysis
      6. Instruction cache bottlenecks
      7. Core back-end behavior: dependent accesses
      8. Simultaneous multi-threading
      9. Related work
      10. Conclusion
  2. Target paper
    • Profiling a warehouse-scale computer
    • Svilen Kanev, et al.
    • Harvard University, Universidad de Buenos Aires, Yahoo Labs, Google
    • ISCA ’15 (International Symposium on Computer Architecture)
    • https://dl.acm.org/doi/10.1145/2749469.2750392
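  3. Overview and why I wanted to read it
    • Overview
      • Analyzes the jobs running on a WSC (Warehouse-Scale Computer) / cloud computing platform
      • Finds "common patterns" (the Data Center Tax) that consume about 30% of CPU
      • Discusses the potential for hardware optimization, the impact of CPU caches, and more
    • Why I wanted to read it
      • I am curious how modern datacenters are used (and how that is measured), and where HW offload pays off
      • Related: The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
        • A Japanese translation of the book is also available
      • I heard about it on a podcast and it stuck with me
      • A way of checking the answers (revisiting the paper from the vantage point of 2022)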
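  4. 1. Introduction
    • The back end of SaaS / cloud computing / IoT is the WSC (Warehouse-Scale Computer)
      • Large datacenters are designed for performance and cost efficiency (a small improvement becomes a large cost saving)
    • Quantitative analysis based on three years of profiling tens of thousands of production machines
      • SW: workloads are extremely diverse, yet they share common parts: RPC, protocol buffers (PB), SerDes, compression
        • This is the "Data Center Tax"
      • HW: a study of how the CPU microarchitecture is actually used
        • i-cache/d-cache misses and similar events stall the CPU (15-30%), 5-10% of the CPU goes unused, and the workloads themselves grow about 30% per year
        • SMT helps hide latency and stalls, but it is still not enough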
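  5. 2. Background and methodology (1/2)
    • WSC software environment
      • Distributed, multi-tier architecture
      • Microservice APIs, RPC/PB SerDes
      • Tail latency is the performance metric, and many techniques exist to reduce it
      • Binaries are huge (hundreds of MB)
    • Continuous profiling
      • GWP (Google-Wide Profiling) collects and analyzes data across the whole fleet
      • The analysis mainly covers binaries written in C++
    https://research.google/pubs/pub36575/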
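  6. 2. Background and methodology (2/2)
    • Collection method
      • Collect CPU performance counters
      • Pick 20,000 machines at random and sample every job for one second
    • Analysis of the CPU performance counters
      • Top-Down performance analysis method (an approximate CPI stack can be reconstructed even on out-of-order CPUs)
    • Workloads
      • Focus on 12 workloads: batch, latency, low-level, front/back-end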
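  7. 3. Workload diversity
    • There is no killer application in a WSC
      • Even the hottest application accounts for only about 10%
    • Moreover, the share of CPU cycles taken by the top 50 applications shrinks year after year
      • A widening variety of applications is consuming CPU
    • Even if you focus on one application, it is already optimized
      • It takes 353 leaf functions to account for 80% of CPU
      • No single function is eating the cycles
      • This contrasts with the CloudSuite study, where 3 functions account for 65%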
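The "353 leaf functions for 80% of CPU" figure is a cumulative-coverage measurement over a flat profile. A minimal sketch of that calculation follows, assuming a profile is available as a mapping from function name to sampled cycles; the function name `functions_needed` and both sample profiles are hypothetical and synthetic, not data from the paper.

```python
def functions_needed(cycles_by_function, target=0.80):
    """Count how many of the hottest functions are needed to cover `target` of all cycles."""
    total = sum(cycles_by_function.values())
    if total <= 0:
        raise ValueError("profile contains no cycles")
    covered = 0
    # Walk functions from hottest to coldest, accumulating their cycles.
    ranked = sorted(cycles_by_function.values(), reverse=True)
    for count, cycles in enumerate(ranked, start=1):
        covered += cycles
        if covered / total >= target:
            return count
    return len(ranked)


# Synthetic profiles for illustration only (assumed shapes, not measured data).
# A concentrated profile: a few hot functions dominate.
concentrated = {"f1": 40, "f2": 15, "f3": 10, "f4": 5}
concentrated.update({f"g{i}": 1 for i in range(30)})
# A flat profile: cycles are spread evenly over many leaf functions.
flat = {f"leaf{i}": 1 for i in range(400)}

print(functions_needed(concentrated, target=0.65))
print(functions_needed(flat, target=0.80))
```

A flat profile needs hundreds of functions to reach the target, which is why optimizing any single hot spot yields little in this setting.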
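  8. 4. Datacenter tax
    • Protobuf management: SerDes processing
    • RPC library: load balancing, encryption, failure detection
    • Compression: e.g. Snappy compression handled in HW
    • Data movement: memcpy(), memmove()
    • Memory allocation: must cooperate with the OS
    • Hashing: small, but takes a steady share
    • Kernel/Sched: not small, but hard to optimize
    Notes on the slide:
    • The parts common to all applications (the Data Center Tax) take 22-27% of CPU cycles
    • Some of these might be optimizable in HW?
    • Others look hard to optimize in HW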
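One way to picture how a tax share like this could be computed is to bucket profiled symbols into the categories listed on the slide and sum their cycle shares. The sketch below assumes simple substring rules and a synthetic sample profile; the rule table, the symbol names, and the numbers are all hypothetical and are not the paper's methodology or results.

```python
# Hypothetical substring rules mapping symbol names to the slide's tax categories.
TAX_RULES = {
    "protobuf": ("protobuf",),
    "rpc": ("rpc",),
    "compression": ("compress", "snappy"),
    "data_movement": ("memcpy", "memmove"),
    "allocation": ("malloc", "free", "operator new"),
    "hashing": ("hash",),
    "kernel": ("[kernel]",),
}


def classify(symbol):
    """Return the tax category for a symbol, or None for application code."""
    name = symbol.lower()
    for category, needles in TAX_RULES.items():
        if any(needle in name for needle in needles):
            return category
    return None


def tax_breakdown(samples):
    """Map each tax category to its fraction of all sampled cycles."""
    total = sum(samples.values())
    shares = {}
    for symbol, cycles in samples.items():
        category = classify(symbol)
        if category is not None:
            shares[category] = shares.get(category, 0.0) + cycles / total
    return shares


# Synthetic profile (assumed values for illustration only).
profile = {
    "memcpy": 5,
    "malloc": 4,
    "protobuf::Parse": 3,
    "[kernel] schedule": 8,
    "app::Search": 80,
}
shares = tax_breakdown(profile)
print(shares, "total tax:", sum(shares.values()))
```

Real symbol classification would need far more careful rules than substring matching; the point here is only the aggregation step.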
  9. 5. Microarchitecture analysis
    • Classification
      • By the commit-stage outcome when uops leave the queue
        • Retiring or Bad speculation
      • By uop queue slots that are not emptied in a given cycle
        • Frontend bound: fetch, i-cache, decoding, etc. could not deliver
        • Backend bound: stalled (the instruction could not execute)
    • Measured results
      • Retiring is low, Bad speculation is somewhat high
      • Frontend bound is also high, but Backend bound is very high
    https://jp.xlsoft.com/documents/intel/vtune/2017/Tuning_Applications_Using_a_Top-down_Microarchitecture_Analysis_Method.pdf
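The four Top-Down categories can be derived from a handful of counters. Below is a minimal sketch of the level-1 breakdown as described in Intel's Top-Down method, assuming a 4-wide core; the parameter names are placeholders for the corresponding Intel events (for example `IDQ_UOPS_NOT_DELIVERED.CORE`, `UOPS_ISSUED.ANY`, `UOPS_RETIRED.RETIRE_SLOTS`, `INT_MISC.RECOVERY_CYCLES`), and the input numbers are made up for illustration.

```python
def top_down_level1(cycles, uops_not_delivered, uops_issued,
                    uops_retired_slots, recovery_cycles, width=4):
    """Split pipeline slots into the four Top-Down level-1 categories.

    `width` is the number of issue slots per cycle (assumed 4 here).
    """
    slots = width * cycles
    # Slots where the front end failed to deliver a uop.
    frontend_bound = uops_not_delivered / slots
    # Slots wasted on uops that never retired, plus recovery after mispredicts.
    bad_speculation = (uops_issued - uops_retired_slots
                       + width * recovery_cycles) / slots
    # Slots that did useful work.
    retiring = uops_retired_slots / slots
    # Whatever remains is attributed to back-end stalls.
    backend_bound = 1.0 - (frontend_bound + bad_speculation + retiring)
    return {
        "frontend_bound": frontend_bound,
        "bad_speculation": bad_speculation,
        "retiring": retiring,
        "backend_bound": backend_bound,
    }


# Made-up counter values, not measurements from the paper.
breakdown = top_down_level1(cycles=1000, uops_not_delivered=800,
                            uops_issued=1200, uops_retired_slots=1000,
                            recovery_cycles=25)
for name, share in breakdown.items():
    print(f"{name}: {share:.1%}")
```

Because backend bound is computed as the remainder, the four shares always sum to one, which is what lets the method approximate a CPI stack on an out-of-order core.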
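  10. 6. Instruction cache bottlenecks
    • The cause is L2 cache misses
      • Binaries are large and have no hot spots
      • -> the instruction footprint bloats -> it overflows the cache and misses
      • It is even worse because the L2 cache holds both instructions and data
      • The front end ends up spinning without doing useful work
    • Proposed countermeasures: larger caches, more sophisticated instruction prefetching, partitioning L2 between instructions and data
      • Example: SPARC M7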
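  11. 7. Core back-end behavior: dependent accesses
    • The main cause is being cache-bound
      • 50-60% of back-end cycles are spent waiting on cache loads or on insufficient capacity
    • From the ILP (instruction-level parallelism) point of view
      • About 70% shows weak parallelism
    • WSC code has high dependence (low ILP) and a lot of bursty computation
    • Memory bandwidth utilization is low, because the CPU cannot make use of it
      • The median is as low as 10%, but that is typical for datacenter workloads
      • Memory latency matters more than bandwidth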
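Why dependent accesses leave bandwidth idle follows from Little's law: sustained bandwidth equals bytes in flight divided by latency. The sketch below works through that arithmetic, assuming a 64-byte cache line and a 100 ns memory latency; both values and the function name are illustrative assumptions, not figures from the paper.

```python
def sustained_bandwidth(latency_ns, line_bytes=64, outstanding=1):
    """Bytes per second achievable with `outstanding` cache misses in flight.

    Little's law: throughput = concurrency * line size / latency.
    """
    return outstanding * line_bytes / (latency_ns * 1e-9)


# Assumed latency of 100 ns (illustrative, not a measured value).
chained = sustained_bandwidth(latency_ns=100, outstanding=1)
parallel = sustained_bandwidth(latency_ns=100, outstanding=10)
print(f"dependent chain: {chained / 1e9:.2f} GB/s")
print(f"10 independent misses: {parallel / 1e9:.2f} GB/s")
```

A pointer-chasing chain can only have one miss in flight, so under these assumptions it tops out well below what the memory system can deliver, and cutting latency helps it more than adding bandwidth.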
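  12. 8. Simultaneous multi-threading
    • SMT (e.g. Intel HT)
      • When threads have different bottlenecks, they complement each other and improve efficiency
      • Bottlenecks were found in both the front end and the back end, so a benefit can be expected
    • Turning SMT off would have a large impact on services -> estimate with SMT on, using per-core counters
      • Note that the per-application effect of SMT (the speedup) was not measured
    • Pros and cons of SMT
      • Each SMT thread fetches from the same instruction cache -> the footprint bloats and becomes a bottleneck
      • Long-latency fetch bubbles are absorbed by the other SMT thread's fetches; this effect dominates here
        • Fig. 14, second and third panels from the top
      • IPC does not rise very much (bottom panel of Fig. 14); the theoretical maximum is 4.0
    Note on the slide: assuming that with SMT about 1/3 can use 3 or more units, this is a large improvement over the (measured) per-thread figure.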
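  13. 9. Related work (note: the paper dates from 2015)
    • Connecting weak cores (Atom) with a special interconnect
    • Using special HW (Catapult v1, FPGAs in a 6x8 2D torus)
    • The problem that CloudSuite (a benchmark suite) focuses too heavily on search
      • Search does not make up most of a WSC, and the benchmark correlates poorly with real results
    • There is a lot of research on system-level scalability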