[Figure: Web stack diagram. Labels: HTML, JavaScript, CSS; Language; Runtime; Styling, Security, Local Storage, User Input, Layout, Render; Application; Inputs; "and Libraries"; Architecture]
▸ Voltage/frequency scaling on general-purpose processors
▸ End of Dennard Scaling!
▸ Diminishing return
[Figure: the Web stack diagram, annotated with WebCore (Web-specific Architecture), GreenWeb (QoS Language Extensions), and Runtime]
▸ Lost page-level diversity
▸ Lost user QoS requirements
WebCore: a Web-Specific Mobile Architecture
General-Purpose Designs
29 languages
  ▹ cf. H264 codec: 0.13M LoC, 6 languages
▸ Code base is very irregular
  ▹ No fine-grained parallelism
Web
▸ What is a proper general-purpose baseline architecture?
  ▹ Out-of-order (Silvermont, A15) or in-order (Saltwell, A7)?
  ▹ Are existing general-purpose mobile designs ideal?
▸ Exhaustive design space exploration.
specialization target
[Figure: two pie charts over Render, Style, Other, Layout, and DOM. Execution time breakdown: 10%, 13%, 17%, 25%, 35%. Energy breakdown: 12%, 14%, 16%, 18%, 40%.]
specialization target
for (each rule in matchedRules) {        // Rule-level Parallelism (RLP)
  for (each property in rule) {          // Property-level Parallelism (PLP)
    switch (property.id) {
      case Font:
        Style[Font] = Handler(property.value, DOMNode);
        break;
      case N: ...
    }
  }
}
▸ Exploiting the parallelism to increase the arithmetic intensity
Style Resolution Kernel
▸ A running example from www.cnn.com

Style Rules
Rule | Property 1 (id, value) | Property 2 (id, value)
1    | padding, 0             | margin, 0
2    | padding, 6 px          | width, 36 px

Final Style Info (id: value)
padding: 0 or 6 px (both rules set it; the high-priority rule wins)
width: 36 px
margin: 0

▸ Order Matters in RLP
▸ Order Does Not Matter in PLP
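The running example can be replayed in a few lines of Python. This is only a sketch: it assumes that rule 2 is the high-priority rule (the slide text does not say which one is), that rules are applied from low to high priority, and that a plain dictionary write stands in for the per-property handler.

```python
from itertools import permutations

# Rules from the running example, listed from low to high priority.
# ASSUMPTION: rule 2 is the "High priority" rule; the slide does not say which one is.
RULES = [
    [("padding", "0"), ("margin", "0")],        # rule 1
    [("padding", "6 px"), ("width", "36 px")],  # rule 2
]

def resolve(rules):
    """Sequential style resolution: later (higher-priority) rules overwrite earlier ones."""
    style = {}
    for rule in rules:               # rule-level loop (RLP)
        for prop_id, value in rule:  # property-level loop (PLP)
            style[prop_id] = value   # stands in for Handler(property.value, DOMNode)
    return style

final = resolve(RULES)

# PLP: permuting the properties inside each rule never changes the result,
# as long as a rule does not set the same property twice.
plp_results = {
    tuple(sorted(resolve([list(p1), list(p2)]).items()))
    for p1 in permutations(RULES[0])
    for p2 in permutations(RULES[1])
}

# RLP: swapping the two rules changes which padding value wins.
swapped = resolve(RULES[::-1])

print(final)
print(swapped)
print(len(plp_results))
```

Under these assumptions, rule order decides the final `padding` value, while property order within a rule does not affect the outcome.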
Style Resolution Unit
[Figure: SRU datapath. Input Scratchpad holding rule entries (Rule i.id, Rule j.id, ...) with per-rule start/end pointers into their properties (Prop k, Prop m, Prop l, ...); Conflict Resolution, where the higher-priority Prop m is kept when two rules set Prop m; Compute Lanes; Output Scratchpad holding Style k, Style l, Style m.]
▸ Order Matters in RLP
▸ Order Does Not Matter in PLP
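A functional sketch of the four stages named on the slide (input scratchpad, conflict resolution, compute lanes, output scratchpad) is shown below. It is a software model under stated assumptions, not the hardware design: the function name `sru_resolve`, the default of four lanes, and the round-robin distribution of work to lanes are all hypothetical.

```python
def sru_resolve(rules, num_lanes=4):
    """Functional sketch of one SRU pass.

    rules: rules ordered from low to high priority; each rule is a list of
           (prop_id, value) pairs. num_lanes is a hypothetical lane count.
    """
    # Input scratchpad: one flat property array plus a (start, end) range per rule.
    props, ranges = [], []
    for rule in rules:
        start = len(props)
        props.extend(rule)
        ranges.append((start, len(props)))

    # Conflict resolution: per property id, keep the entry from the
    # highest-priority rule (ranges are visited in priority order).
    winners = {}
    for start, end in ranges:
        for prop_id, value in props[start:end]:
            winners[prop_id] = value

    # Compute lanes: surviving properties are independent, so they are
    # dealt out round-robin (ASSUMPTION: the slide does not give a policy).
    work = list(winners.items())
    lanes = [work[i::num_lanes] for i in range(num_lanes)]

    # Output scratchpad: each lane writes its own style entries.
    output = {}
    for lane in lanes:
        for prop_id, value in lane:
            output[prop_id] = value  # a real lane would run the property handler here
    return output, lanes


demo_rules = [[("k", "a"), ("m", "low")], [("m", "high"), ("l", "b")]]
demo_output, demo_lanes = sru_resolve(demo_rules, num_lanes=2)
print(demo_output)
```

Resolving conflicts before dispatch is what lets the lanes run without ordering constraints: after that stage, each property id appears exactly once.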
[Figure: Energy (J) vs. Load Time (s) for the A15-like design, Customization, and Specialization. Annotated percentages: 18.6% and 22.2% for Customization; 9.2% and 22.2% for Specialization; 29.2% and 47.0% overall.]
▸ Fully synthesized using Synopsys 28 nm toolchain
▸ Cost of specialization: 0.59 mm² area overhead
  ▹ SoC die area is 122 mm² in Samsung Galaxy S4
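To put the area cost in proportion, the two figures on the slide can be divided directly. The variable names are illustrative; the numbers are the ones stated above.

```python
SPECIALIZATION_AREA_MM2 = 0.59  # area overhead of specialization, from the slide
SOC_DIE_AREA_MM2 = 122.0        # Samsung Galaxy S4 SoC die area, from the slide

overhead_pct = 100.0 * SPECIALIZATION_AREA_MM2 / SOC_DIE_AREA_MM2
print(f"{overhead_pct:.2f}% of the SoC die")  # prints "0.48% of the SoC die"
```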
WebRT: Energy-aware Web Runtime
energy optimizations
  ▹ Different microarchitectures (in-order + out-of-order)
  ▹ Different frequency settings
▸ Idea: Provide just-enough energy to meet performance target
▸ Approach: Systematically understand user interactions and bridge the gap between user behavior and system execution.
that lead to different loading times and energy consumptions
▸ Mechanism: Predict the ideal ACMP configuration (<core, frequency>) and schedule application loading accordingly
▸ Effect: Properly provision the hardware resources based on application characteristics
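One way to read "ideal configuration" is: among the <core, frequency> pairs predicted to meet the loading-time target, take the one with the lowest predicted energy. The sketch below assumes exactly that policy, and it assumes falling back to the fastest configuration when nothing meets the target. `pick_config`, the predictor callbacks, and the toy numbers are all hypothetical and are not measurements from the work.

```python
def pick_config(configs, predict_time, predict_energy, time_target_s):
    """Return the lowest-energy config predicted to meet the time target.

    ASSUMPTION: if no config meets the target, fall back to the fastest one.
    """
    feasible = [c for c in configs if predict_time(c) <= time_target_s]
    if not feasible:
        return min(configs, key=predict_time)
    return min(feasible, key=predict_energy)


# Toy, made-up predictions for three configurations (seconds, joules).
TOY_TIME = {("little", 600): 3.0, ("big", 800): 2.0, ("big", 1800): 1.0}
TOY_ENERGY = {("little", 600): 1.0, ("big", 800): 1.5, ("big", 1800): 3.0}
toy_configs = list(TOY_TIME)

relaxed = pick_config(toy_configs, TOY_TIME.get, TOY_ENERGY.get, time_target_s=2.5)
tight = pick_config(toy_configs, TOY_TIME.get, TOY_ENERGY.get, time_target_s=0.5)
print(relaxed, tight)
```

With the relaxed target the slowest feasible big-core setting is chosen; with the unreachable target the fallback returns the fastest configuration.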
ODroid XU+E development board, which contains an Exynos 5410 SoC used in Samsung Galaxy S4.
▸ Big core cluster: ARM Cortex A15, OoO with 3 issue
  ▹ DVFS: 800 MHz ~ 1.8 GHz at a 100 MHz granularity
▸ with 2 issue
  ▹ DVFS: 350 MHz ~ 600 MHz at a 50 MHz granularity
Overhead:
▸ Frequency switch: 100 µs
▸ Core migration: 20 µs
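The DVFS ranges above define a small, enumerable configuration space. The sketch below lists it and estimates the cost of moving between two configurations. It assumes that the 350 to 600 MHz range belongs to the little cluster (the slide text is truncated there) and that migration and frequency-switch overheads simply add up; both are assumptions, as are the function names.

```python
FREQ_SWITCH_US = 100    # frequency switch overhead, from the slide
CORE_MIGRATION_US = 20  # core migration overhead, from the slide

def acmp_configs():
    """Enumerate <core, frequency in MHz> pairs from the stated DVFS ranges."""
    # ASSUMPTION: the 350-600 MHz range is the little cluster.
    little = [("little", f) for f in range(350, 601, 50)]
    big = [("big", f) for f in range(800, 1801, 100)]
    return little + big

def switch_overhead_us(old, new):
    """Overhead of moving from one config to another.

    ASSUMPTION: migration and frequency-switch costs are additive.
    """
    cost = 0
    if old[0] != new[0]:
        cost += CORE_MIGRATION_US
    if old[1] != new[1]:
        cost += FREQ_SWITCH_US
    return cost

configs = acmp_configs()
print(len(configs))  # 6 little settings + 11 big settings = 17
```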
hottest 2,500 webpages
Model Construction & Refinement
▸ Linear Regression: Predictors (HTML, CSS), Responses (Time, Energy)
▸ Refine the linear model: Mitigate Over-fitting, Model Non-Linearity
Model Validation
▸ Validating on another 2,500 webpages
Loading Time Model, Energy Model
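The construct-then-validate flow can be illustrated with a deliberately small stand-in: ordinary least squares on a single predictor, fitted on one set of pages and checked on a held-out set. The slides describe a multi-predictor model with refinements for over-fitting and non-linearity, which this sketch does not attempt. The predictor (a hypothetical per-page feature count) and all data points are synthetic and exactly linear by construction.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_abs_pct_error(model, xs, ys):
    """Mean absolute percentage error of the fitted line on (xs, ys)."""
    slope, intercept = model
    errors = [abs((slope * x + intercept - y) / y) for x, y in zip(xs, ys)]
    return 100.0 * sum(errors) / len(errors)

# SYNTHETIC data: load time (s) = 0.5 + 0.002 * feature_count, no noise.
train_x = [100, 200, 300, 400]
train_y = [0.5 + 0.002 * x for x in train_x]
valid_x = [150, 350]
valid_y = [0.5 + 0.002 * x for x in valid_x]

time_model = fit_line(train_x, train_y)
valid_error = mean_abs_pct_error(time_model, valid_x, valid_y)
print(time_model, valid_error)
```

An energy model would follow the same shape with energy as the response; validating on pages not used for fitting is what exposes over-fitting.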
core
  ▹ Standard to guarantee responsiveness
▸ OS DVFS strategies (OS)
  ▹ Ondemand governor (across big and little cores)
▸ Metrics:
  ▹ Energy Savings
  ▹ QoS Violations
▸ 83.0% energy savings over Perf, 4.1% more QoS violations
▸ 8.6% energy savings over OS, 0.1% more QoS violations
Optimizing Post-loading Interactions
optimizations
▸ Mechanism: Event-based scheduling to predict the ACMP configuration that exploits event slacks and saves energy
▸ Effect: Properly provision the hardware resources based on event characteristics
[Figure: events (click, change, keyup) grouped by Large Slack, Small Slack, and No Slack]
▸ Wide distribution of event latencies. Events exhibit different slacks.
  ▹ How to exploit event slacks?
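One simple way to exploit slack is to run an event at the lowest frequency that still meets its latency target. The sketch below assumes the event is CPU-bound, so its latency scales inversely with frequency. That assumption, the function names, and the latency and target values in the example are all hypothetical.

```python
def slack_ms(latency_ms, target_ms):
    """Slack is the time left between an event's latency and its QoS target."""
    return target_ms - latency_ms

def lowest_freq_meeting_target(latency_at_fmax_ms, target_ms, freqs_mhz):
    """Lowest frequency whose scaled latency still meets the target.

    ASSUMPTION: latency scales as fmax / f (CPU-bound event).
    Falls back to the highest frequency when there is no slack.
    """
    fmax = max(freqs_mhz)
    for f in sorted(freqs_mhz):
        if latency_at_fmax_ms * fmax / f <= target_ms:
            return f
    return fmax

big_freqs = list(range(800, 1801, 100))  # big-cluster DVFS steps in MHz

large_slack = lowest_freq_meeting_target(5, 33, big_freqs)
small_slack = lowest_freq_meeting_target(20, 33, big_freqs)
no_slack = lowest_freq_meeting_target(40, 33, big_freqs)
print(large_slack, small_slack, no_slack)
```

Events with large slack can drop to the lowest setting, events with small slack land somewhere in between, and events with no slack stay at the top frequency.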
Mobile
Expressing Abstractions
▸ QoS Type: performance metric
▸ QoS Target: threshold performance values
Syntax (CSS Compatible):
  element {event: Type, Target}
Semantics:
  When event is triggered on element, the QoS type and QoS target are Type and Target, respectively.
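To make the `element {event: Type, Target}` shape concrete, here is a minimal parser sketch for a single annotation. It is not the GreenWeb implementation: the regular expression, the function name `parse_annotation`, the choice to make the target optional, and the sample values are all assumptions for illustration.

```python
import re

# ASSUMPTION: one annotation per string, with an optional ", Target" part.
_ANNOTATION = re.compile(
    r"^\s*(?P<element>[^{]+?)\s*"
    r"\{\s*(?P<event>[\w-]+)\s*:\s*(?P<type>[\w-]+)\s*"
    r"(?:,\s*(?P<target>[^}]+?))?\s*\}\s*$"
)

def parse_annotation(text):
    """Split 'element {event: Type, Target}' into its four parts."""
    match = _ANNOTATION.match(text)
    if match is None:
        raise ValueError(f"not a QoS annotation: {text!r}")
    return match.groupdict()

# Hypothetical sample values; "100ms" is a made-up target format.
with_target = parse_annotation("#menu {onclick: single, 100ms}")
without_target = parse_annotation("button {onclick: single}")
print(with_target)
print(without_target)
```

A runtime could use the parsed element and event to look up the QoS type and target when that event fires on a matching element.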
developers, but not overburden them!
▸ GreenWeb Composability
  ▹ Can GreenWeb programs be safely integrated with other code?
  ▹ How to compose comprehensive QoS abstractions?
▸ Integrating WebRT with GreenWeb
  ▹ How can WebRT adapt to different QoS constraints?
composability and flexibility of GreenWeb extensions.)
▸ Automatic Annotation System for GreenWeb (Goal: Explore the feasibility of automatically applying GreenWeb annotations.)
▸ WebRT Adaptivity Study (Goal: Evaluate the sensitivity of WebRT with respect to different QoS constraints.)
▸ Thesis Writing
Timeline: FEB, MAR, APR, MAY, JUNE, JULY, AUG
the Developers
  ▹ GreenWeb Language Extensions Provide QoS Abstractions
▸ Exposing Hardware Complexities
  ▹ WebRT Leverages Core Type and Core Frequency
▸ General-purpose vs. Specialization
  ▹ WebCore combines general-purpose customization with domain specialization
for Energy-Efficient Mobile Web Computing”
[HPCA 2015] Yuhao Zhu, Matthew Halpern, Vijay Janapa Reddi, “Event-Based Scheduling for Energy-Efficient QoS (eQoS) in Mobile Web Applications”
[HPCA 2013] Yuhao Zhu, Vijay Janapa Reddi, “High-Performance and Energy-Efficient Mobile Web Browsing on Big/Little Systems”
[CAL 2012] Yuhao Zhu, Aditya Srikanth, Jingwen Leng, Vijay Janapa Reddi, “Exploiting Webpage Characteristics for Energy-Efficient Mobile Web Browsing” (Best of CAL)
[ISCA 2014] Yuhao Zhu, Vijay Janapa Reddi, “WebCore: Architectural Support for Mobile Web Browsing”
[IEEE MICRO 2015] Yuhao Zhu, Matthew Halpern, Vijay Janapa Reddi, “The Role of the CPU in Energy-Efficient Mobile Web Browsing”
[HPCA 2016] Matthew Halpern, Yuhao Zhu, Vijay Janapa Reddi, “Mobile CPU’s Rise to Power: Quantifying the Impact of Generational Mobile CPU Design Trends on Performance, Energy, and User Satisfaction”
[MICRO 2015] Yuhao Zhu, Daniel Richins, Matthew Halpern, Vijay Janapa Reddi, “Microarchitectural Implications of Event-driven Server-side Web Applications” (Top Picks Honorable Mention)
Categories: GreenWeb, WebRT, WebCore, Motivational Studies, Server Microarch
Integrated CPU/GPU Microarchitecture for IP Routing.”
[DAC 2010] Bo Wang, Yuhao Zhu, Yangdong Deng, “Distributed Time, Conservative Parallel Logic Simulation on GPUs.”
[TODAES 2011] Yuhao Zhu, Bo Wang, Yangdong Deng, “Massively Parallel Logic Simulation with GPUs.”
[ISPASS 2015] Matthew Halpern, Yuhao Zhu, Ramesh Peri, and Vijay Janapa Reddi, “Mosaic: Cross-platform User-interaction Record and Replay for the Fragmented Android Ecosystem.”
[IRPS 2014] Chen Zhou, Xiaofei Wang, Weichao Xu, Yuhao Zhu, Vijay Janapa Reddi, Chris Kim, “Estimation of Instantaneous Frequency Fluctuation in a Fast DVFS Environment Using an Empirical BTI Stress-Relaxation Model.”
Categories: GPGPU & IP Routing Architecture Tools Reliability
Fall 2010 A ADV EMBED MICROCONTROL SYS Mark McDermott Spring 2011 A- MEMORY MANAGEMENT Kathryn McKinley Spring 2011 Y A VLSI I Jacob Abraham Fall 2011 A- COMP ARCH: PARALLISM/LOCLTY Mattan Erez Fall 2011 A MICROARCHITECTURE Yale Patt Spring 2012 B DYNAMIC COMPILATION Vijay Janapa Reddi Spring 2012 A- COMP PERF EVAL/BENCHMARKING Lizy John Fall 2012 B+ PARALLEL COMP ARCHITECTURE Derek Chiou Spring 2013 B+ HUMAN COMPUT & CROWDSRCING Matt Lease Fall 2015 Y A-
[Figure: QoS Violations (%) for OS (Big), OS (Little), and WS, using a performance-oriented strategy as the baseline]
▸ 83.0% energy savings over Perf, 4.1% more QoS violations
▸ 8.6% energy savings over OS, 0.1% more QoS violations
Mobile
Expressing Abstractions
Selector {QoS Declaration}:
  button:QoS {onclick: single}
Semantics: QoS is evaluated by a single frame latency when clicking the button
▸ QoS Type: performance metric
  ▹ Single (frame latency) vs. Continuous (frame throughput)
▸ QoS Target: threshold performance values
  ▹ Imperceptible target (Ti) vs. Usable target (Tu)
Design Considerations
▸ How large should the scratchpad memory be?
▸ How many compute lanes should an SRU have?
[Figure: Total Coverage (%) plotted against RLP (axis 0 to 16) and PLP (axis 0 to 96), with a "~1 KB" annotation, next to the SRU scratchpad diagram]
[Figure: Energy (J) vs. Load Time (s) for the A15-like design, Customization, and Specialization (29.2% and 47.0% overall), compared with scaling up I$, D$, and I+D$]
▸ Fully synthesized using Synopsys 28 nm toolchain
▸ Cost of specialization: 0.59 mm² area overhead
▸ Better than scaling-up approaches