#39 “Profiling a warehouse-scale computer”

Research Paper Introduction #39 “Profiling a warehouse-scale computer”
Cumulative #104 @cafenero_777 2022/08/25

Agenda
• Target paper
• Overview and why I wanted to read it
1. Introduction
2. Background and methodology
3. Workload diversity
4. Datacenter tax
5. Microarchitecture analysis
6. Instruction cache bottlenecks
7. Core back-end behavior: dependent accesses
8. Simultaneous multi-threading
9. Related work
10. Conclusion

Target paper
• Profiling a warehouse-scale computer
  • Svilen Kanev, et al.
  • Harvard University, Universidad de Buenos Aires, Yahoo Labs, Google
  • ISCA ’15 (International Symposium on Computer Architecture)
  • https://dl.acm.org/doi/10.1145/2749469.2750392

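Overview and why I wanted to read it
• Overview
  • Analyzes the jobs running on a WSC (Warehouse-Scale Computer) / cloud computing platform
  • Identifies "common patterns" (the Data Center Tax) that consume roughly 30% of CPU
  • Discusses the potential for hardware optimization and the impact of CPU caches
• Why I wanted to read it
  • I am curious how today's datacenters are used (and how that usage is measured), and where HW offload pays off
  • Related: The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
    • A Japanese translation of the book is also available
  • I heard about it on a podcast and it stuck with me
  • A kind of answer check: re-reading the paper from the vantage point of 2022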

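1. Introduction
• The back end of SaaS / cloud computing / IoT is the WSC (Warehouse-Scale Computer)
  • Designing for performance and cost efficiency in large datacenters (small improvements translate into large cost savings)
• Quantitative analysis from profiling tens of thousands of production machines over 3 years
  • SW: workloads are extremely diverse, yet common components exist: RPC, protocol buffers (PB), serialization/deserialization (SerDes), compression
    • This is the "Data Center Tax"
  • HW: a survey of how the CPU microarchitecture is actually used
    • i-cache/d-cache misses and similar events stall the CPU (15-30%), 5-10% of the CPU goes unused, and the workload itself grows about 30% per year
    • SMT helps hide latency and stalls, but is still not enough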

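2. Background and methodology (1/2)
• WSC software environment
  • Distributed, multi-tier architecture
  • Microservice APIs, RPC/PB SerDes
  • Tail latency is the performance metric, and much engineering goes into reducing it
  • Binaries are huge (hundreds of MB)
• Continuous profiling
  • Collected and analyzed fleet-wide with GWP (Google-Wide Profiling)
  • Mainly binaries written in C++ are analyzed
https://research.google/pubs/pub36575/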

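2. Background and methodology (2/2)
• Collection method
  • Collect CPU performance counters: pick 20,000 machines at random and sample every job for 1 second
• Analysis of CPU performance counters
  • Top-Down performance analysis method (an approximate CPI stack can be reconstructed even on out-of-order CPUs)
• Workloads
  • Focus on 12 workloads: batch, latency, low-level, front/back-end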

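3. Workload diversity
• There is no killer application in a WSC
  • The hottest application accounts for about 10% at most
• Moreover, the share of CPU cycles taken by the top 50 applications shrinks year over year
  • An increasingly varied set of applications is consuming the CPU
• Even within a single application, the code is already optimized
  • It takes 353 leaf functions to account for 80% of CPU
  • No single function is eating the cycles
  • This contrasts with the CloudSuite study, where 3 functions account for 65%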
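
The "353 leaf functions for 80% of CPU" figure is a coverage count over a cycle profile sorted from hottest to coldest. A minimal sketch of that calculation follows; the function name `functions_to_cover` and both sample profiles are hypothetical, made-up illustrations and not data from the paper.

```python
def functions_to_cover(cycles_by_function, target=0.80):
    """Count how many of the hottest functions are needed to reach
    `target` (a fraction) of all sampled cycles."""
    total = sum(cycles_by_function.values())
    if total <= 0:
        raise ValueError("profile contains no cycles")
    covered = 0
    hottest_first = sorted(cycles_by_function.values(), reverse=True)
    for count, cycles in enumerate(hottest_first, start=1):
        covered += cycles
        if covered / total >= target:
            return count
    return len(hottest_first)


# Hypothetical profiles (invented numbers, for illustration only).
# A hotspot-dominated profile: three functions take 85% of cycles.
skewed = {"f1": 50, "f2": 20, "f3": 15, **{f"g{i}": 1 for i in range(15)}}
# A flat profile: 100 functions with equal cost, no hotspot at all.
flat = {f"h{i}": 1 for i in range(100)}

print(functions_to_cover(skewed))  # few functions needed
print(functions_to_cover(flat))    # many functions needed
```

A flat profile needs many more functions to reach the same coverage, which is the sense in which no single function dominates.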

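4. Datacenter tax
• Protobuf management: SerDes processing
• RPC library: load balancing, encryption, failure detection
• Compression: e.g., Snappy; compression in HW
• Data movement: memcpy(), memmove()
• Memory allocation: needs to cooperate with the OS
• Hashing: small, but takes a steady share
• Kernel/Sched: not small, but hard to optimize
Slide callouts:
• The parts common to every application (the Data Center Tax) take 22-27% of CPU cycles
• Might be optimizable in HW?
• Looks hard to optimize in HW.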
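
One way to picture how a tax share like this is obtained is to bucket sampled leaf functions into the categories above and sum their shares. The sketch below does that. The substring patterns in `TAX_PATTERNS`, the symbol names, and the sample counts are all assumptions made up for illustration; they are not the paper's classification rules or its data.

```python
from collections import defaultdict

# Hypothetical substring patterns per tax category (illustrative assumptions).
TAX_PATTERNS = {
    "protobuf": ("proto",),
    "rpc": ("rpc",),
    "compression": ("snappy", "compress"),
    "data_movement": ("memcpy", "memmove"),
    "allocation": ("malloc", "free"),
    "hashing": ("hash",),
    "kernel": ("sched", "sys_"),
}


def classify(symbol):
    """Return the tax category for a symbol name, or None for application code."""
    name = symbol.lower()
    for category, patterns in TAX_PATTERNS.items():
        if any(pattern in name for pattern in patterns):
            return category
    return None


def tax_breakdown(samples):
    """Map each tax category to its fraction of all sampled cycles."""
    total = sum(samples.values())
    shares = defaultdict(float)
    for symbol, count in samples.items():
        category = classify(symbol)
        if category is not None:
            shares[category] += count / total
    return dict(shares)


# Hypothetical sample counts per leaf function (invented numbers).
samples = {
    "memcpy": 5, "proto2::Parse": 4, "RpcDispatch": 3,
    "snappy::RawCompress": 2, "tc_malloc": 4, "HashBytes": 1,
    "schedule": 6, "app::Search": 75,
}
breakdown = tax_breakdown(samples)
print(breakdown)
print(f"total tax: {sum(breakdown.values()):.0%}")
```

Matching on substrings is crude and would misclassify real symbols; a production classifier would need curated symbol lists per library.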

5. Microarchitecture analysis
• Classification
  • By the state at the commit stage when uOps leave the queue
    • Retiring or Bad speculation
  • When a uOp queue slot is not vacated in a given cycle
    • Frontend bound: fetch, i-cache, decoding, etc. cannot keep up
    • Backend bound: stalled (instructions cannot be executed)
• Measurement results
  • Retiring is low, and Bad speculation is somewhat high
  • Frontend bound is high too, but Backend bound is extremely high
https://jp.xlsoft.com/documents/intel/vtune/2017/Tuning_Applications_Using_a_Top-down_Microarchitecture_Analysis_Method.pdf
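
The four top-level categories can be computed from a handful of performance counters. The sketch below follows the level-1 formulas of Intel's published Top-Down method as I understand them, assuming a 4-wide issue pipeline. The parameter names mirror Intel event names (given in the comments), which vary by microarchitecture, and the counter values are invented for illustration.

```python
def top_down_level1(clk, uops_issued, uops_retired_slots,
                    idq_uops_not_delivered, recovery_cycles, width=4):
    """Split pipeline slots into the four Top-Down level-1 categories.

    Assumed counter sources (Intel event names, microarchitecture dependent):
      clk                    -> CPU_CLK_UNHALTED.THREAD
      uops_issued            -> UOPS_ISSUED.ANY
      uops_retired_slots     -> UOPS_RETIRED.RETIRE_SLOTS
      idq_uops_not_delivered -> IDQ_UOPS_NOT_DELIVERED.CORE
      recovery_cycles        -> INT_MISC.RECOVERY_CYCLES
    """
    slots = width * clk  # total issue slots available in the interval
    frontend = idq_uops_not_delivered / slots
    bad_speculation = (uops_issued - uops_retired_slots
                       + width * recovery_cycles) / slots
    retiring = uops_retired_slots / slots
    # Whatever is left over is attributed to back-end stalls.
    backend = 1.0 - (frontend + bad_speculation + retiring)
    return {
        "frontend_bound": frontend,
        "bad_speculation": bad_speculation,
        "retiring": retiring,
        "backend_bound": backend,
    }


# Invented counter values, for illustration only.
breakdown_l1 = top_down_level1(
    clk=1000, uops_issued=1400, uops_retired_slots=1200,
    idq_uops_not_delivered=800, recovery_cycles=25,
)
for name, share in breakdown_l1.items():
    print(f"{name}: {share:.1%}")
```

Because back-end bound is computed as the remainder, the four fractions always sum to 1.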

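6. Instruction cache bottlenecks
• The cause is L2 cache misses
  • Binaries are large and have no hot spot
  • -> the instruction footprint balloons -> it overflows the cache and misses
  • The L2 cache holds both instructions and data, which makes matters worse
  • The front end ends up "swinging at air" (starved for instructions)
• Proposed remedies: larger caches, more sophisticated instruction prefetching, partitioning the L2 between instructions and data
  • Example: SPARC M7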

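7. Core back-end behavior: dependent accesses
• The main cause is being cache-bound
  • 50-60% of back-end cycles are spent waiting on cache loads or are lost to insufficient capacity
• From the ILP (instruction-level parallelism) perspective
  • About 70% shows weak parallelism
• WSC code has high dependency (low ILP) and a lot of bursty computation
• Memory bandwidth utilization is low, because the CPU cannot make use of it
  • The median is as low as 10%, which is typical for datacenter workloads
  • Memory latency matters more than bandwidth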
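
To make "high dependency means low ILP" concrete, the sketch below estimates an idealized ILP as instruction count divided by the longest dependency chain. It assumes unit latency and unlimited issue width, so it is a toy model and not how the paper measures ILP; the function name `ideal_ilp` and the example instruction streams are hypothetical.

```python
def ideal_ilp(deps):
    """Idealized ILP = instruction count / critical-path length.

    deps[i] lists the indices of earlier instructions that instruction i
    depends on. Assumes unit latency and unlimited issue width.
    """
    depth = []  # depth[i] = length of the longest chain ending at i
    for i, producers in enumerate(deps):
        if any(p < 0 or p >= i for p in producers):
            raise ValueError(f"instruction {i} depends on a non-earlier one")
        depth.append(1 + max((depth[p] for p in producers), default=0))
    if not depth:
        return 0.0
    return len(depth) / max(depth)


# Pointer chasing: every load needs the result of the previous one.
pointer_chase = [[]] + [[i - 1] for i in range(1, 8)]
# Eight fully independent loads.
independent = [[] for _ in range(8)]
# Two interleaved chains of four dependent loads each.
two_chains = [[], [], [0], [1], [2], [3], [4], [5]]

print(ideal_ilp(pointer_chase))  # 1.0
print(ideal_ilp(independent))    # 8.0
print(ideal_ilp(two_chains))     # 2.0
```

With a chain of dependent loads, each one waits a full memory latency before the next can start, which is why latency rather than bandwidth ends up as the limit.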

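8. Simultaneous multi-threading
• SMT (e.g., Intel HT)
  • When threads hit different bottlenecks, they complement each other and efficiency improves
  • Bottlenecks exist in both the front end and the back end, so a benefit can be expected
• Turning SMT off would heavily impact services -> estimate with SMT on, looking at per-core counters
  • Note that the per-application SMT effect (speedup) is not measured
• Pros and cons of SMT
  • Each SMT thread fetches through the same instruction cache -> the footprint grows and the cache becomes a bottleneck
  • Long-latency fetch bubbles are absorbed by the other SMT thread's fetches. Here, this effect dominates.
    • Second and third panels from the top of Figure 14
  • IPC does not rise that much (bottom panel of Figure 14); the theoretical maximum is 4.0
Slide callout: assuming that with SMT one third can use 3 or more units, this is a large improvement over the (measured) per-thread figure.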

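9. Related work (note: this is a 2015 paper)
• Connecting wimpy cores (Atom) with a specialized interconnect
• Using specialized HW (Catapult v1, FPGAs in a 6x8 2D torus)
• The problem that CloudSuite (a benchmark suite) focuses too heavily on search
  • A WSC is not mostly search services. Its results also correlate poorly with real measurements.
• There is a lot of research on system-level scalability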

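10. Conclusions
• CPU-profiled a WSC (tens of thousands of machines) over several years
• There is no hotspot application. However, the components shared across applications have room for improvement (= Data Center Tax)
• Summary of characteristics at the CPU microarchitecture level
  • Low IPC, large instruction footprints, and memory latency mattering more than bandwidth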

EoP
