Proposal talk: Energy-Efficient Mobile Web Computing

Energy-Efficient Mobile Web Computing
Yuhao Zhu, UT Austin
Advisor: Vijay Janapa Reddi
Feb. 17th, 2016

Call. Text. The (in)famous “snake game”.

Architects Make Mobile Processors Faster
▸ In-order (2007)
▸ Out-of-order (2010)
▸ Multi-core (2010)
▸ Asymmetric Multi-core (2014)
Performance and power: performance comes at the expense of excessive power.

Responsiveness and Energy-Efficiency: conflicting requirements

Thesis Statement
A mobile computing system for the mobile Web that satisfies user QoS requirements (responsiveness) on a mobile energy budget (energy-efficiency).
Responsiveness and energy-efficiency are conflicting requirements.

Achieving Mobile Web Performance
Mobile Client, Cellular Network, Cloud Web Servers
[MICRO 2015] (Top Picks Honorable Mention)

Isn’t Responsiveness a Network Issue?
Conclusions circa 2010 [HotMobile’11, WWW’12], 100+ citations:
▸ Resource loading is the bottleneck
▸ Client compute doesn’t matter much

Isn’t Responsiveness a Network Issue? A Year 2015 Experiment!
[Figure: load time (s) versus network RTT (ms, logarithmic axis from 10 to over 1000), with the LTE, 3G, Adverse 3G, 2G, and Wi-Fi ranges marked and the circa-2010 region highlighted.]
▸ Samsung Galaxy S4 smartphone.
▸ Hot webpages from Alexa [1].
▸ Time measured using Navigation Timing API [2].
[1] http://www.alexa.com/
[2] https://www.w3.org/TR/navigation-timing-2/

Responsiveness is also a Compute Issue!
Mobile Client (this proposal), Cellular Network

Traditional Approach
Web stack: Frameworks and Libraries; HTML, JavaScript, CSS; Language Runtime; Styling, Security, Local Storage, User Input, Layout, Render
▸ Application: Parallelize browser computation
▸ Application inputs: Ignored!
▸ Architecture: Voltage/frequency scaling on general-purpose processors
  ▹ End of Dennard Scaling!
  ▹ Diminishing return

My Approach
Existing work parallelizes browser computation but loses page-level diversity and user QoS requirements.
My research scope covers all three layers:
▸ Application: GreenWeb, QoS Language Extensions
▸ Runtime: WebRT, Energy-aware Web Runtime
▸ Architecture: WebCore, Web-specific Architecture
Publications: [PLDI 2016] [ISCA 2014] [HPCA 2013] [HPCA 2015] [CAL 2014] (Best of CAL)

WebCore: a Web-Specific Mobile Architecture
[Figure: energy versus execution time, marking General-Purpose Designs (diminishing return), ASIC?, and the Goal (???).]
ASIC? Extremely challenging
▸ Chrome: 17M LoC, 29 languages
  ▹ c.f., H264 codec: 0.13M LoC, 6 languages
▸ Code base is very irregular
  ▹ No fine-grained parallelism

WebCore Philosophy
Claim: Instead of jumping directly to full specialization, we must take it step by step.
Web Software on a General-purpose Processor (GPP)
▸ Customization (GPP to Customized GPP): Tune uarch parameters
▸ Specialization (Customized GPP to WebCore): Accelerate key kernels

WebCore: a Web-Specific Mobile Architecture
[Figure: energy versus execution time, moving from General-Purpose Designs toward the Goal through Customization and then Specialization.]

Customization: Find an Ideal General Purpose Architecture for the Mobile Web
▸ What is a proper general purpose baseline architecture?
  ▹ Out-of-order (Silvermont, A15) or in-order (Saltwell, A7)?
  ▹ Are existing general purpose mobile designs ideal?
▸ Exhaustive design space exploration.

Design Space Exploration (DSE) Setup
▸ Search space of over 3 billion design points
  ▹ Leverage statistical inference models to increase search speed
▸ Use integrated simulators
  ▹ McPAT for power
  ▹ Marss86 for performance (x86 full-system simulator)
▸ Chromium Web browser

Design Space Exploration (DSE) Findings
▸ Out-of-order designs are more flexible

Understand the Difference Using Kernel Knowledge
[Figure: breakdown over the Render, Style, Other, Layout, and DOM kernels, with slices of 10%, 13%, 17%, 25%, and 35%; per-kernel results for an in-order design and an out-of-order design.]
▸ In-order designs show strong kernel variance
▸ An out-of-order design can accommodate kernel variance

Customization: Identifying Major Sources of Energy Inefficiency
[Figure: design points P1 and P2 marked in the design space.]
Parameter              P1     P2     ARM A15
Issue width            1      3      3
# Function units       2      3      8
Load queue size        4      16     16
Store queue size       4      16
BTB size               1024   128    256
ROB size               128    128    40+
L1 I-$ size (KB)       64     128    32
# Physical registers   128    140    ?
L1 D-$ size (KB)       8      64     32
L2-$ size (KB)         256    1024   <4096
Major sources of energy inefficiency:
▸ Instruction supply
▸ Data feeding

Specialization: Fixing the Pending Inefficiencies
▸ Instruction supply
  ▹ Pack more operations in one instruction
▸ Data feeding
  ▹ Move operands closer to operations

Style Resolution Kernel
▸ Choose the Style kernel as the specialization target
[Figure: execution time breakdown (10%, 13%, 17%, 25%, 35%) and energy breakdown (12%, 14%, 16%, 18%, 40%) over the Render, Style, Other, Layout, and DOM kernels.]
    for (each rule in matchedRules) {        // Rule-level Parallelism (RLP)
      for (each property in rule) {          // Property-level Parallelism (PLP)
        switch (property.id) {
          case Font:
            Style[Font] = Handler(property.value, DOMNode);
            break;
          case N: ...
        }
      }
    }
▸ Exploiting the parallelism to increase the arithmetic intensity

Style Resolution Kernel
▸ A running example from www.cnn.com
Style Rules:
Rule   Property 1 (id, value)   Property 2 (id, value)
1      padding, 0               margin, 0
2      padding, 6 px            width, 36 px
Final Style Info (Property 1 to Property 3): padding (0 and 6 px conflict; the high-priority rule wins), width 36 px, margin 0
▸ Order matters in RLP
▸ Order does not matter in PLP
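The resolution loop and the running example can be sketched in Python as follows. This is only an illustration: the function name `resolve_style`, the dictionary layout, and the assumption that Rule 1 is the high-priority rule are not taken from the slides or from Chrome.

```python
# Illustrative sketch of style resolution; not Chrome's actual implementation.
# Assumption: matched rules arrive sorted from highest to lowest priority.

def resolve_style(matched_rules):
    """Return the final {property id: value} map for one DOM node."""
    style = {}
    for rule in matched_rules:                # order matters across rules (RLP)
        for prop_id, value in rule.items():   # order within a rule does not (PLP)
            # A property already set by a higher-priority rule wins the conflict.
            style.setdefault(prop_id, value)
    return style

# Running example; treating Rule 1 as the high-priority rule is an assumption.
rule1 = {"padding": "0", "margin": "0"}
rule2 = {"padding": "6px", "width": "36px"}

final_style = resolve_style([rule1, rule2])
print(final_style)
```

Swapping the two rules changes the resolved `padding`, while reordering the properties inside either rule leaves the result unchanged, which is the asymmetry between RLP and PLP.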

Style Resolution Unit
[Figure: block diagram of the Style Resolution Unit. An Input Scratchpad holds rules (Rule i, Rule j) with start/end pointers and their properties (Prop k, Prop l, Prop m); Conflict Resolution keeps the higher-priority copy when both rules set Prop m; Compute Lanes produce Style k, Style l, and Style m; results are written to an Output Scratchpad.]
▸ Order matters in RLP
▸ Order does not matter in PLP

Evaluation Results
▸ Fully synthesized using Synopsys 28 nm toolchain
▸ Cost of specialization: 0.59 mm² area overhead
  ▹ SoC die area is 122 mm² in Samsung Galaxy S4
[Figure: energy (J) versus load time (s) for the A15-like design, Customization, and Specialization. Annotated improvements: 18.6% and 22.2% at the Customization step, 9.2% and 22.2% at the Specialization step, 29.2% and 47.0% in the final view.]
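To put the area cost in perspective, the overhead can be expressed as a fraction of the SoC die area quoted above. The variable names are illustrative.

```python
# Area overhead of the specialization relative to the Galaxy S4 SoC die.
webcore_overhead_mm2 = 0.59   # area overhead reported in the evaluation
soc_die_area_mm2 = 122.0      # Samsung Galaxy S4 SoC die area

overhead_fraction = webcore_overhead_mm2 / soc_die_area_mm2
print(f"{overhead_fraction:.2%}")  # roughly half a percent of the die
```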

WebCore in SoC
SoC components: CPUs, GPUs, Memory, Specialized Logics, WebCore
▸ One of the cores in the multicore SoC
▸ Becomes “dark” when other applications are executing


Architecture Evolution
▸ In-order (2007)
▸ Out-of-order (2011)
▸ CMP (2011)
▸ ACMP (Big/Little) (Present): Complex!

WebRT: Energy-aware Web Runtime
▸ Why ACMP? It offers a large performance-energy trade-off space for energy optimizations
  ▹ Different microarchitectures (in-order + out-of-order)
  ▹ Different frequency settings
▸ Idea: Provide just enough energy to meet the performance target
▸ Approach: Systematically understand user interactions and bridge the gap between user behavior and system execution.

Interacting With a Mobile Web Application
Interactions: Loading, Touching, Moving
▸ Loading: once per usage session. WebRT component: Proactive Mechanism
▸ Touching, Moving: repetitive in a usage session. WebRT component: History-based Mechanism

Optimizing for Loading
▸ Observation: Web applications have different characteristics that lead to different loading times and energy consumption
▸ Mechanism: Predict the ideal ACMP configuration (<core, frequency>) and schedule application loading accordingly
▸ Effect: Properly provision the hardware resources based on application characteristics

Big/Little Setup
ODroid XU+E development board, which contains the Exynos 5410 SoC used in the Samsung Galaxy S4.
▸ Big core cluster: ARM Cortex A15, out-of-order, 3-issue
  ▹ DVFS: 800 MHz to 1.8 GHz at a 100 MHz granularity
▸ Little core cluster: ARM Cortex A7, in-order, 2-issue
  ▹ DVFS: 350 MHz to 600 MHz at a 50 MHz granularity
▸ Overhead
  ▹ Frequency switch: 100 us
  ▹ Core migration: 20 us
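The size of the <core, frequency> space implied by this setup can be enumerated directly. The sketch below assumes both range endpoints are valid operating points; the tuple layout and names are illustrative and not part of WebRT.

```python
# Enumerate the <core, frequency> configurations implied by the setup above.
# Assumption: both endpoints of each DVFS range are valid operating points.

def acmp_configurations():
    """Return (core, frequency in MHz) pairs, little cluster first."""
    big = [("big", f) for f in range(800, 1800 + 1, 100)]      # Cortex A15
    little = [("little", f) for f in range(350, 600 + 1, 50)]  # Cortex A7
    return little + big

# Switching costs quoted on the slide, in microseconds.
SWITCH_OVERHEAD_US = {"frequency": 100, "migration": 20}

configs = acmp_configurations()
print(len(configs))  # 6 little + 11 big = 17 configurations
```

With only 17 candidate configurations, a scheduler can afford to evaluate a prediction for every one of them at each decision point.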

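Power and Energy Measurements
[Figure: measurement setup with a VRM, a 15 mΩ sense resistor, and the SoC (ARM Cortex A9). The voltage across the sense resistor (Vin+, Vin-) is amplified with a gain of 50 (Vout, GND) and probed by a Data Acquisition (DAQ) unit.]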

Performance-Energy Trade-off
[Figure: energy consumption (J) versus load time (s) for big-core and small-core configurations, shown for www.autoblog.com, www.newegg.com (annotated 30%), and www.adobe.com (annotated 80%).]

Breaking Down the Computations
Web primitives:
▸ HTML (Structure): Tag, Attribute
▸ CSS (Style): Selector, Property
▸ DOM Tree

HTML Tag Analysis
[Figure: number of tags (K) per webpage, with www.163.com and www.google.com highlighted.]
▸ Web applications have different tag counts
Tag Processing Overhead
[Figure: load time (ms) and energy (mJ) for processing the h3, table, and img tags.]
▸ Tags have different processing overheads
Root-cause of Web Application Variance
▸ Web applications have different tag counts
▸ Tags have different processing overheads

Predicting Loading Performance & Energy
Idea: predict load time and energy (responses) based on Web primitives (predictors)
▸ Identify Predictors: predictors (HTML, CSS), responses (time, energy)
▸ Training using the hottest 2,500 webpages
▸ Model Construction & Refinement: start from linear regression, then refine the linear model to mitigate over-fitting and to model non-linearity
▸ Model Validation: validating on another 2,500 webpages
▸ Output: a loading time model and an energy model
[Figure: prediction error (0.00 to 0.20) for performance and energy.]
Median prediction error is less than 5%
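The starting point of the model construction, a linear fit from a Web primitive to a response, can be sketched as below. This is a deliberately reduced illustration with a single predictor (tag count) and made-up data; the refinement steps on the slide (mitigating over-fitting, modelling non-linearity) and the use of many HTML and CSS predictors are not reproduced.

```python
# Minimal least-squares fit with a single predictor (tag count).
# The numbers below are made up for illustration; they are not measurements.

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(model, x):
    """Evaluate a fitted (intercept, slope) model at x."""
    intercept, slope = model
    return intercept + slope * x

tag_counts = [100, 200, 300, 400]      # hypothetical predictor values
load_times_s = [1.0, 1.5, 2.0, 2.5]    # hypothetical responses

model = fit_line(tag_counts, load_times_s)
print(predict(model, 500))
```

The same fit would be repeated with energy as the response to obtain the second model.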

Webpage-aware Scheduler
Scheduler operations are inserted into normal Web application loading:
▸ Network
▸ Parsing (~1%)
▸ Prediction (minimal overhead)
▸ Scheduling overhead (~120 us)
▸ Rest of loading
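The selection step between prediction and scheduling can be sketched as a constrained minimum: choose the lowest-energy configuration whose predicted load time meets the target. The function name, the fallback rule, and the toy predictions are assumptions made for illustration.

```python
# Sketch of the selection step: choose the lowest-energy configuration whose
# predicted load time meets the target. The predictor passed in is a
# hypothetical stand-in for the loading time and energy models.

def pick_configuration(configs, predict, target_s):
    """configs: list of (core, freq_mhz); predict(cfg) returns (time_s, energy_j)."""
    feasible = []
    for cfg in configs:
        time_s, energy_j = predict(cfg)
        if time_s <= target_s:
            feasible.append((energy_j, cfg))
    if not feasible:
        # Nothing meets the target: fall back to the fastest configuration.
        return min(configs, key=lambda cfg: predict(cfg)[0])
    return min(feasible)[1]

# Toy predictions for one page (made-up numbers, for illustration only).
toy = {
    ("little", 600): (4.0, 1.0),
    ("big", 800): (2.5, 2.0),
    ("big", 1800): (1.5, 4.0),
}
choice = pick_configuration(list(toy), toy.get, target_s=3.0)
print(choice)
```

Falling back to the fastest configuration when no candidate meets the target is one possible policy; the slides do not say how that case is handled.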

Evaluation
▸ Highest performance (Perf)
  ▹ Highest frequency on big core
  ▹ Standard to guarantee responsiveness
▸ OS DVFS strategies (OS)
  ▹ Ondemand governor (across big and little cores)
▸ Metrics:
  ▹ Energy Savings
  ▹ QoS Violations
▸ Results:
  ▹ 83.0% energy savings over Perf, 4.1% more QoS violations
  ▹ 8.6% energy savings over OS, 0.1% more QoS violations


Optimizing Post-loading Interactions
Touching and moving interactions generate events (click, touchstart, touchmove, scroll), which enter the Event Queue and are processed by the Event Loop.
▸ Optimize post-loading at an event granularity
▸ Observation: Events have different execution latencies that enable energy optimizations
▸ Mechanism: Event-based scheduling to predict the ACMP configuration that exploits event slack and saves energy
▸ Effect: Properly provision the hardware resources based on event characteristics

Event-Level Characterization
[Figure: event latency (ms, 0 to 150) per event, with example events keyup (large slack), change (small slack), and click (no slack).]
▸ Wide distribution of event latencies. Events exhibit different slacks.
  ▹ How to exploit event slacks?

Event-based Scheduler (EBS)
▸ Goal: For each event, find the most energy-efficient ACMP configuration that meets the latency target
▸ Thread scheduling: a thread-based scheduler targets throughput and fairness
▸ Event-based scheduling: the event-based scheduler works on the Event Queue and targets event latency and event energy

Predicting Event Latency
An event consists of memory operations (T_memory) and CPU operations (N_dependent, executed at frequency f):
$\text{Event Latency} = T_{\text{memory}} + N_{\text{dependent}} / f$
Xie, et al., Compile-Time Dynamic Voltage Scaling Settings: Opportunities and Limits, PLDI’03
[Figure: event latency versus frequency.]
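One way to use this model is to recover its two parameters from latencies observed at two frequencies and then invert it to find the lowest frequency that meets a latency target. The sketch below assumes that fitting procedure; the function names and the sample numbers are hypothetical.

```python
# Sketch of the latency model: latency = t_memory + n_dependent / f.
# Fitting from two measurements is an assumption made for illustration.

def fit_latency_model(f1_hz, latency1_s, f2_hz, latency2_s):
    """Solve for (t_memory, n_dependent) from latencies at two frequencies."""
    n_dependent = (latency1_s - latency2_s) / (1.0 / f1_hz - 1.0 / f2_hz)
    t_memory = latency1_s - n_dependent / f1_hz
    return t_memory, n_dependent

def predict_latency(t_memory, n_dependent, f_hz):
    """Predicted event latency in seconds at frequency f_hz."""
    return t_memory + n_dependent / f_hz

def min_frequency(t_memory, n_dependent, target_s):
    """Lowest frequency meeting the target, or None if memory time alone exceeds it."""
    if target_s <= t_memory:
        return None
    return n_dependent / (target_s - t_memory)

# Made-up measurements: 60 ms at 800 MHz and 35 ms at 1.6 GHz.
t_mem, n_dep = fit_latency_model(800e6, 0.060, 1.6e9, 0.035)
print(t_mem, n_dep)
print(min_frequency(t_mem, n_dep, target_s=0.050))
```

Because the frequency-independent term is fixed, raising the frequency only shrinks the second term, so a target below `t_memory` cannot be met at any frequency.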

Event-based Scheduler
[Figure: Events feed the Model Constructor and the Event-Based Scheduler; the scheduler uses the Model to choose a <core, freq> configuration on the Big/Little Hardware; a QoS Monitor observes the result and can trigger Recalibrate.]
▸ Fine-tune the model when it over- or under-predicts
▸ Recalibrate if it mispredicts too often
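The two feedback rules can be sketched as a small monitor that compares predicted and measured latency. The tolerance, the misprediction limit, and the multiplicative correction are illustrative assumptions, not the policy used in the talk.

```python
# Sketch of the feedback loop; thresholds and the correction rule are
# illustrative assumptions.

class QoSMonitor:
    def __init__(self, tolerance=0.10, max_mispredictions=3):
        self.tolerance = tolerance                    # allowed relative error
        self.max_mispredictions = max_mispredictions  # misses before recalibrating
        self.correction = 1.0                         # multiplies later predictions
        self.mispredictions = 0

    def observe(self, predicted_ms, measured_ms):
        """Return 'ok', 'fine-tune', or 'recalibrate' for one finished event."""
        error = (measured_ms - predicted_ms) / measured_ms
        if abs(error) <= self.tolerance:
            return "ok"
        self.mispredictions += 1
        if self.mispredictions > self.max_mispredictions:
            # Too many misses: reset state and ask for a rebuilt model.
            self.mispredictions = 0
            self.correction = 1.0
            return "recalibrate"
        # Nudge later predictions toward what was measured.
        self.correction *= measured_ms / predicted_ms
        return "fine-tune"

monitor = QoSMonitor()
print(monitor.observe(predicted_ms=100, measured_ms=105))
print(monitor.observe(predicted_ms=50, measured_ms=100))
```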

Evaluation
▸ Baseline Mechanisms
  ▹ Highest performance (Perf): Standard to guarantee responsiveness
  ▹ Minimal energy (Energy): Minimize energy consumption
  ▹ Interactive governor (Interactive): Android default
▸ Metrics
  ▹ Energy Savings
  ▹ QoS Violations
▸ Results: 37.9% to 41.2% energy savings, 0.1% more QoS violations


65 GreenWeb: QoS Web Language Extensions
Steps: Understanding Mobile QoS, Abstracting Mobile QoS, Expressing Abstractions
[Figure: QoS experience vs. performance degradation, with three regions (Imperceptible, Tolerable, Unusable); labeled "Energy Savings"]

66 GreenWeb: QoS Web Language Extensions
Steps: Understanding Mobile QoS, Abstracting Mobile QoS, Expressing Abstractions
[Figure: QoS experience vs. performance degradation (Imperceptible, Tolerable, Unusable)]
▸ QoS Type: performance metric
▸ QoS Target: threshold performance values
Syntax (CSS compatible): element {event: Type, Target}
Semantics: When event is triggered on element, the QoS type and QoS target are Type and Target, respectively.
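To make the three QoS regions concrete, here is a small sketch that maps a measured latency onto Imperceptible, Tolerable, or Unusable using two thresholds. It assumes a latency-style metric where lower is better; the parameter names and the example threshold values are hypothetical and not taken from the slides.

```python
def classify_experience(latency, imperceptible_target, usable_target):
    """Map a latency onto the three QoS regions using two thresholds (lower is better)."""
    if imperceptible_target > usable_target:
        raise ValueError("imperceptible target must not exceed usable target")
    if latency <= imperceptible_target:
        return "imperceptible"
    if latency <= usable_target:
        return "tolerable"
    return "unusable"


# Hypothetical thresholds in milliseconds, for illustration only.
print(classify_experience(80, imperceptible_target=100, usable_target=300))
```

The energy-saving opportunity sits in the first two regions: any slowdown that keeps an event at or below the chosen target does not change the experience category.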

68 Future Work
▸ Automatic GreenWeb Annotation
▹ Empower the developers, but not overburden them!
▸ GreenWeb Composability
▹ Can GreenWeb programs be safely integrated with other code?
▹ How to compose comprehensive QoS abstractions?
▸ Integrating WebRT with GreenWeb
▹ How can WebRT adapt to different QoS constraints?

69 Timeline
Months: FEB, MAR, APR, MAY, JUNE, JULY, AUG
Key Tasks:
▸ Program-level Composability Study (Goal: Improve the composability and flexibility of GreenWeb extensions.)
▸ Automatic Annotation System for GreenWeb (Goal: Explore the feasibility of automatically applying GreenWeb annotations.)
▸ WebRT Adaptivity Study (Goal: Evaluate the sensitivity of WebRT with respect to different QoS constraints.)
▸ Thesis Writing

70 Retrospective: Three Principles Learnt
Layers: Runtime, Application, Architecture
▸ Empowering the Developers
▹ GreenWeb Language Extensions Provide QoS Abstractions
▸ Exposing Hardware Complexities
▹ WebRT Leverages Core Type and Core Frequency
▸ General-purpose vs. Specialization
▹ WebCore combines general-purpose customization with domain specialization

[PLDI 2016] Yuhao Zhu, Vijay Janapa Reddi, “GreenWeb: Language Extensions for Energy-Efficient Mobile Web Computing”
[HPCA 2015] Yuhao Zhu, Matthew Halpern, Vijay Janapa Reddi, “Event-Based Scheduling for Energy-Efficient QoS (eQoS) in Mobile Web Applications”
[HPCA 2013] Yuhao Zhu, Vijay Janapa Reddi, “High-Performance and Energy-Efficient Mobile Web Browsing on Big/Little Systems”
[CAL 2012] Yuhao Zhu, Aditya Srikanth, Jingwen Leng, Vijay Janapa Reddi, “Exploiting Webpage Characteristics for Energy-Efficient Mobile Web Browsing” (Best of CAL)
[ISCA 2014] Yuhao Zhu, Vijay Janapa Reddi, “WebCore: Architectural Support for Mobile Web Browsing”
[IEEE MICRO 2015] Yuhao Zhu, Matthew Halpern, Vijay Janapa Reddi, “The Role of the CPU in Energy-Efficient Mobile Web Browsing”
[HPCA 2016] Matthew Halpern, Yuhao Zhu, Vijay Janapa Reddi, “Mobile CPU’s Rise to Power: Quantifying the Impact of Generational Mobile CPU Design Trends on Performance, Energy, and User Satisfaction”
[MICRO 2015] Yuhao Zhu, Daniel Richins, Matthew Halpern, Vijay Janapa Reddi, “Microarchitectural Implications of Event-driven Server-side Web Applications” (Top Picks Honorable Mention)
Topic labels: GreenWeb, WebRT, WebCore, Motivational Studies, Server Microarch

[DAC 2011] Yuhao Zhu, Yangdong Deng, Yubei Chen, “Hermes: An Integrated CPU/GPU Microarchitecture for IP Routing.”
[DAC 2010] Bo Wang, Yuhao Zhu, Yangdong Deng, “Distributed Time, Conservative Parallel Logic Simulation on GPUs.”
[TODAES 2011] Yuhao Zhu, Bo Wang, Yangdong Deng, “Massively Parallel Logic Simulation with GPUs.”
[ISPASS 2015] Matthew Halpern, Yuhao Zhu, Ramesh Peri, and Vijay Janapa Reddi, “Mosaic: Cross-platform User-interaction Record and Replay for the Fragmented Android Ecosystem.”
[IRPS 2014] Chen Zhou, Xiaofei Wang, Weichao Xu, Yuhao Zhu, Vijay Janapa Reddi, Chris Kim, “Estimation of Instantaneous Frequency Fluctuation in a Fast DVFS Environment Using an Empirical BTI Stress-Relaxation Model.”
Topic labels: GPGPU & IP Routing Architecture, Tools, Reliability

73 Coursework
Name | Instructor | Semester | SUP | Grade
COMPILERS | Keshav Pingali | Fall 2010 | | A
ADV EMBED MICROCONTROL SYS | Mark McDermott | Spring 2011 | | A-
MEMORY MANAGEMENT | Kathryn McKinley | Spring 2011 | Y | A
VLSI I | Jacob Abraham | Fall 2011 | | A-
COMP ARCH: PARALLISM/LOCLTY | Mattan Erez | Fall 2011 | | A
MICROARCHITECTURE | Yale Patt | Spring 2012 | | B
DYNAMIC COMPILATION | Vijay Janapa Reddi | Spring 2012 | | A-
COMP PERF EVAL/BENCHMARKING | Lizy John | Fall 2012 | | B+
PARALLEL COMP ARCHITECTURE | Derek Chiou | Spring 2013 | | B+
HUMAN COMPUT & CROWDSRCING | Matt Lease | Fall 2015 | Y | A-

Thank you!

75 Scheduling Results
Using a performance-oriented strategy as the baseline
[Figure: Energy Savings (%) and QoS Violations (%) for OS (Big), OS (Little), and WS]
▸ 83.0% energy savings over Perf, 4.1% more QoS violations
▸ 8.6% energy savings over OS, 0.1% more QoS violations

82 Evaluation Methodology
▸ Baseline Mechanisms
▹ Highest performance (Perf): standard to guarantee responsiveness
▹ Minimal energy (Energy): minimize energy consumption
▹ Interactive governor (Interactive): Android default
▹ On-demand governor (Ondemand)
▸ Scheduling Scenarios
▹ Scheduling for imperceptibility
▹ Scheduling for tolerability
[Figure: QoS experience vs. performance (Unusable, Tolerable, Imperceptible)]

83-89 Evaluation Results
[Figure: QoS Violations (%), axis 0.0 to 6.0, for emberjs, gwt, jquery, backbone, paperjs, sina, google, ebay; series: EBS, Perf, Interactive, Ondemand (slide 83 only), Energy; annotation: No QoS Violations; bar labels: 9.4, 17.8, 58.1, 6.9]
[Figure: Energy (J), axis 0.0 to 4.0, for the same eight applications; bar labels: 8.2, 7.7]
37.9% - 41.2% energy savings, 0.1% more QoS violations

92-95 GreenWeb: QoS Web Language Extensions
Steps: Understanding Mobile QoS, Abstracting Mobile QoS, Expressing Abstractions
[Figure: QoS experience vs. performance degradation (Imperceptible, Tolerable, Unusable)]
▸ QoS Type: performance metric
▹ Single (frame latency) vs. Continuous (frame throughput)
▸ QoS Target: threshold performance values
▹ Imperceptible target (Ti) vs. Usable target (Tu)
Syntax: Selector {QoS Declaration}
button:QoS {onclick: single}
Semantics: QoS is evaluated by a single frame latency when clicking the button
Use default QoS targets:
button:QoS {onclick: single}
button:QoS {onclick: continuous}
Overwrite default targets:
button:QoS {onclick: continuous, 20, 100}
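As an illustration of the rule shape shown on these slides, the sketch below parses a single GreenWeb-style rule into its selector, event, QoS type, and optional targets. It is not the GreenWeb implementation: the regular expression, the QoSRule type, and the reading of the two numbers as (Ti, Tu) in that order are assumptions, and the slides do not state the units of the targets. Omitted targets are returned as None, standing in for "use default QoS targets".

```python
import re
from typing import NamedTuple, Optional


class QoSRule(NamedTuple):
    selector: str
    event: str
    qos_type: str                          # "single" or "continuous"
    imperceptible_target: Optional[float]  # Ti, None means default (assumed order)
    usable_target: Optional[float]         # Tu, None means default (assumed order)


# Matches rules of the form: selector:QoS {event: type[, Ti, Tu]}
_RULE = re.compile(
    r"^\s*(?P<selector>[^\s{:]+):QoS\s*\{\s*(?P<event>\w+)\s*:\s*(?P<body>[^}]*)\}\s*$"
)


def parse_rule(text):
    """Parse one rule such as 'button:QoS {onclick: continuous, 20, 100}'."""
    match = _RULE.match(text)
    if match is None:
        raise ValueError(f"not a QoS rule: {text!r}")
    parts = [p.strip() for p in match.group("body").split(",")]
    qos_type = parts[0]
    if qos_type not in ("single", "continuous"):
        raise ValueError(f"unknown QoS type: {qos_type!r}")
    if len(parts) == 1:
        ti = tu = None  # fall back to default targets
    elif len(parts) == 3:
        ti, tu = float(parts[1]), float(parts[2])
    else:
        raise ValueError("expected either no targets or exactly two")
    return QoSRule(match.group("selector"), match.group("event"), qos_type, ti, tu)


rule = parse_rule("button:QoS {onclick: continuous, 20, 100}")
```

Keeping the declaration inside a CSS-like block means the annotation can live alongside ordinary style rules, which is the compatibility point the slides make.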

96 Design Space Exploration (DSE) Setup
▸ Webpages selected by Principal Component Analysis (PCA)
▹ PCs calculated from webpage-inherent and µarch-dependent features (~400 in total)
[Figure: webpages plotted on PC1 (axis -5 to 5) vs. PC2 (log scale, 10^-4 to 10^1); axis annotations: "dominated by # webpage elements" and "dominated by IPC"]

97 Design Considerations
▸ How large should the scratchpad memory be? ~1 KB
▸ How many compute lanes should an SRU have? 32 Lanes
[Diagram: rules (Rule i, Rule j) with their properties (Prop k, Prop l, Prop m) mapped to styles (Style k, Style l, Style m), with start/end markers]
[Figure: Total Coverage (%) vs. RLP (0 to 16)]
[Figure: Total CSS Properties (%) vs. PLP (0 to 96)]

98 SRU Integration
[Diagram: pipeline stages IF, ID, EX, MEM, WB; functional units ALU, MUL, FPU, SRU]
Layers: Hardware Layer, API Layer, Runtime Layer
Labels: ISA support; Style_apply(DOMNodeId, matchedRules); Software Failsafe; SRU Access

99 Evaluation Methodology
▸ Fully synthesized using Synopsys 28 nm toolchain
▸ 24 representative webpages (desktop and mobile versions):
www.amazon.com, www.cnn.com, www.msn.com, www.google.com.hk, www.twitter.com, www.espn.go.com, www.bbc.co.uk, www.slashdot.org, www.youtube.com, www.ebay.com, www.sina.com.cn, www.163.com

100 Evaluation Results
[Figure: Energy (J) vs. Load Time (s); design points: A15-like design, Customization, Specialization; annotated percentages: 18.6%, 22.2%, 9.2%, 22.2%, then 29.2% and 47.0%; scaling-up comparison points: I$, D$, I+D$]
▸ Fully synthesized using Synopsys 28 nm toolchain
▸ Cost of specialization: 0.59 mm² area overhead
▸ Better than scaling-up approaches

101-102 Energy-Efficiency Plateaued
[Figure: smartphone models by year: Motorola Droid (2009), Galaxy S (2010), Nexus (2011), Galaxy S3 (2012), Galaxy S4 (2013), Galaxy S5 (2014), Galaxy S6 (2015); benchmarks: Coremark, SPEC CPU 2006]
