A typical video-analytics pipeline consists of a Decoder, Stream Muxer, Primary Detector, Object Tracker, and Secondary Classifier, exposing 55, 86, 14, 44, and 86 configuration options respectively (e.g., Compression, Encryption, ...), for 2^285 possible configurations. Complex interactions between options, both within and across components, give rise to a combinatorially large configuration space.
[Figure: latency and energy consumption across different configurations. Latency varies by 5x; energy consumption varies by 4.5x.] The system is expected to be set to a configuration for which performance remains optimal or close to optimal. Reaching the desired performance goal is difficult due to the sheer size of the configuration space and the high cost of measuring each configuration.
A user report illustrates the problem: "When trying to transplant our CUDA source code from TX1 to TX2, it behaved strangely. We noticed that TX2 has twice the GPU computing ability of TX1, so we expected TX2 to be at least 30-40% faster. Unfortunately, most of our code base took twice the time it did on TX1; in other words, TX2 mostly ran at half the speed of TX1. We believe that TX2's CUDA API runs much slower than TX1's in many cases." The user is transferring code from one hardware platform to another, and the code ran 2x slower on the more powerful hardware. An incorrect understanding of performance behavior often leads to misconfiguration.
A misconfiguration arises from unexpected interactions between configuration options in the deployed system stack. The system does not crash but remains operational with degraded performance, e.g., high latency, low throughput, or high energy consumption. [Figure: latency (seconds) and energy (joules) distributions under misconfiguration.]
The developer's goal is to find the root cause of the misconfiguration and fix it. In the TX1-to-TX2 example above, the user expects a 30-40% improvement but instead observes a 2x slowdown on the more powerful hardware.
[Figure: latency (seconds) across configurations.] Here, the developer aims to find the optimal configuration, with or without having experienced a misconfiguration.
Fixes are typically non-intuitive, involving changes to seemingly unrelated options. Consider this forum exchange:

June 3, User: "Any suggestions on how to improve my performance? Thanks!"
June 4, Support: "Please do the following and let us know if it works: 1. Install JetPack 3.0; 2. Set nvpmodel=MAX-N; 3. Run jetson_clocks.sh."
June 4, User: "We have already tried this. We still have high latency. Any other suggestions?"
June 5, Support: "TX2 is Pascal architecture. Please update your CMakeLists: + set(CUDA_STATIC_RUNTIME OFF) ... + -gencode=arch=compute_62,code=sm_62"

The user had several misconfigurations. In software: wrong compilation flags and wrong SDK version. In hardware: wrong power mode and wrong clock/fan settings.
One approach builds black-box performance models from observational data. [Figure: latency as a function of the number of splitters, fitted by cubic interpolation over a finer grid.] The observational data record, per measured configuration, the option settings and the resulting performance:

Config | Bitrate (bits/s) | Enable Padding | ... | Cache Misses | ... | Throughput (fps)
c1 | 1k | 1 | ... | 42m | ... | 7
c2 | 2k | 1 | ... | 32m | ... | 22
... | ... | ... | ... | ... | ... | ...
cn | 5k | 0 | ... | 12m | ... | 25

A regression model is then fit over these data, e.g.,

Throughput = 5.1 × Bitrate + 2.5 × BatchSize + 12.3 × Bitrate × BatchSize,

where the individual terms capture options and the product terms capture interactions between options. This is a representative work; many other works also use regression models (as well as other statistical models) to build performance models. We have a selection bias here ;)
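As an illustration, a performance-influence model of this form can be recovered by ordinary least squares. The data below are synthetic, generated from the regression equation above; the variable ranges and noise level are assumptions made purely for illustration.

```python
import numpy as np

# Synthetic observational data generated from the regression on the slide:
# Throughput = 5.1*Bitrate + 2.5*BatchSize + 12.3*Bitrate*BatchSize (+ noise).
rng = np.random.default_rng(0)
bitrate = rng.uniform(1, 5, 50)   # hypothetical option ranges
batch = rng.uniform(1, 8, 50)
throughput = (5.1 * bitrate + 2.5 * batch
              + 12.3 * bitrate * batch + rng.normal(0, 0.1, 50))

# Design matrix: main effects (options) plus the pairwise interaction term.
X = np.column_stack([bitrate, batch, bitrate * batch])
coef, *_ = np.linalg.lstsq(X, throughput, rcond=None)
print(coef)  # recovers roughly [5.1, 2.5, 12.3]
```

The interaction coefficient dominates the main effects here, which is exactly why option interactions matter so much in such models.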
In these data, increasing Cache Misses appears to increase Throughput. This is counter-intuitive: more cache misses should reduce throughput, not increase it. Purely statistical models built on such data will be unreliable; performance influence models might therefore be unreliable.
[Figure: Throughput (FPS) vs. Cache Misses, grouped by cache policy (LRU, FIFO, LIFO, MRU).] Segregating the data by Cache Policy shows that, within each group, an increase in Cache Misses results in a decrease in Throughput; the apparent positive trend is an artifact of pooling across policies.
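This is an instance of Simpson's paradox. A small synthetic sketch (all numbers are hypothetical, chosen only to mimic the plot) shows how the pooled correlation can flip sign:

```python
import numpy as np

rng = np.random.default_rng(1)
groups = []
# Hypothetical (policy, baseline FPS, typical cache-miss level) triples:
# policies with higher baseline throughput also happen to incur more misses.
for policy, base_fps, miss_level in [("LRU", 25, 180_000), ("FIFO", 18, 120_000),
                                     ("LIFO", 12, 60_000), ("MRU", 6, 20_000)]:
    misses = miss_level + rng.normal(0, 5_000, 40)
    # Within a policy, more misses genuinely lowers throughput.
    fps = base_fps - 2e-4 * (misses - miss_level) + rng.normal(0, 0.3, 40)
    groups.append((misses, fps))

pooled = np.corrcoef(np.concatenate([m for m, _ in groups]),
                     np.concatenate([f for _, f in groups]))[0, 1]
within = [np.corrcoef(m, f)[0, 1] for m, f in groups]
print(pooled)   # positive: the misleading pooled trend
print(within)   # all negative: the true per-policy effect
```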
A causal performance model expresses the relationships between configuration options, system events, and non-functional properties, representing the interacting variables as a causal graph whose edges carry the direction of causality: e.g., Cache Policy influences both Cache Misses and Throughput, and Cache Misses in turn influences Throughput.
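A minimal encoding of such a causal graph, using the three variables above; the adjacency-dict representation is just one convenient choice, not part of the original method.

```python
# Directed causal graph: Cache Policy influences both Cache Misses and
# Throughput, and Cache Misses in turn influences Throughput.
causal_graph = {
    "CachePolicy": ["CacheMisses", "Throughput"],
    "CacheMisses": ["Throughput"],
    "Throughput": [],
}

def parents(node):
    """All direct causes of `node` in the graph."""
    return [p for p, children in causal_graph.items() if node in children]

print(parents("Throughput"))  # ['CachePolicy', 'CacheMisses']
```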
UNICORN proceeds in five stages:
I. Specify performance query. Given a performance fault or issue on a system stack (software: DeepStream; middleware: TF, TensorRT; hardware: Nvidia Xavier; configuration: default) and a performance task (What is the root cause of the fault? How do I fix the misconfiguration? How do I optimize performance? How do I understand performance?), e.g., QoS: Th > 40/s, observed: Th < 30/s ± 5/s.
II. Learn causal performance model from initial performance data.
III. Determine the next configuration via iterative sampling.
IV. Update the causal performance model.
V. Estimate causal queries with the causal inference engine, e.g., the probability of satisfying the QoS if BufferSize is set to 6k: P(Th > 40/s | do(BufferSize = 6k)).
Stage I: given a performance query such as "What are the root causes of my performance fault, and how can I improve performance by 70%?", the query engine extracts the information (e.g., a 70% gain is expected) that the subsequent stages need for the performance task.
Stage II learns a partial ancestral graph (PAG) over the configuration options (Bitrate, Buffer Size, Batch Size, Enable Padding) and system events (Cache Misses, No. of Cycles). A PAG can have three types of edges between any nodes X and Y:
• X → Y: X is a parent of Y.
• X ↔ Y: a confounder exists between X and Y.
• X o→ Y: there is not sufficient data to recover the causal direction; the edge may turn out to be X → Y or X ↔ Y.
Why select top K paths? A real-world causal graph (e.g., for a data analytics pipeline) can be very complex, and it may be intractable to reason over the entire graph directly.
A causal path always begins with a configuration option or a system event and always terminates at a performance objective: for example, Bitrate → Branch Misses → FPS is a valid causal path, whereas a path running from FPS back through Branch Misses to Cache Misses is not.
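Extracting such paths is a simple graph traversal. The sketch below uses node names from the slides, but the edge set is an assumption made for illustration:

```python
# Edges of a hypothetical causal performance model.
graph = {
    "Bitrate": ["BranchMisses"],
    "BufferSize": ["CacheMisses"],
    "BranchMisses": ["FPS"],
    "CacheMisses": ["FPS"],
    "FPS": [],
}
options = {"Bitrate", "BufferSize"}   # configuration options (path starts)
objectives = {"FPS"}                  # performance objectives (path ends)

def causal_paths(node, prefix=()):
    """Yield every path from `node` down to a performance objective."""
    path = prefix + (node,)
    if node in objectives:
        yield path
    for child in graph[node]:
        yield from causal_paths(child, path)

paths = [p for opt in sorted(options) for p in causal_paths(opt)]
print(paths)
```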
There may be too many causal paths, so we need to select the most useful ones. To do so, we compute the average causal effect (ACE) of each pair of neighbors in a path, e.g., for Bitrate → Branch Misses → FPS:

ACE(BranchMisses, Bitrate) = (1/N) Σ_{a,b} [ E(BranchMisses | do(Bitrate = b)) − E(BranchMisses | do(Bitrate = a)) ]

Here E(BranchMisses | do(Bitrate = b)) is the expected value of Branch Misses when we artificially intervene by setting Bitrate to the value b, and the sum averages over all permitted values of Bitrate. If this difference is large, then small changes to Bitrate cause large changes to Branch Misses.
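A simulation sketch of the ACE on a toy structural model. The mechanism (BranchMisses = 3·Bitrate + noise) and the set of permitted values are invented for illustration, and absolute differences are averaged so that opposite-direction pairs do not cancel:

```python
import numpy as np

rng = np.random.default_rng(2)

def e_branch_misses_do(bitrate, n=10_000):
    """E[BranchMisses | do(Bitrate = bitrate)] under the toy mechanism."""
    return float(np.mean(3.0 * bitrate + rng.normal(0.0, 1.0, n)))

def ace(values):
    """Average absolute causal effect over all pairs of permitted values."""
    exp = {v: e_branch_misses_do(v) for v in values}
    diffs = [abs(exp[b] - exp[a]) for a in values for b in values if a != b]
    return float(np.mean(diffs))

print(ace([1, 2, 4, 8]))  # large: Bitrate strongly influences BranchMisses
```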
We then average the ACE of all pairs of adjacent nodes in the path, rank paths from the highest path ACE (PACE) score to the lowest, and use the top K paths for subsequent analysis. For a path Z → X → Y (e.g., Bitrate → Branch Misses → FPS):

PACE(Z, Y) = (1/2) (ACE(Z, X) + ACE(X, Y))
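Given ACE scores for the adjacent pairs (the values below are hypothetical), ranking paths by PACE and keeping the top K is straightforward:

```python
# Hypothetical ACE scores for adjacent node pairs in two candidate paths.
ace_scores = {
    ("Bitrate", "BranchMisses"): 11.5,
    ("BranchMisses", "FPS"): 4.0,
    ("BufferSize", "CacheMisses"): 0.3,
    ("CacheMisses", "FPS"): 0.8,
}

def pace(path):
    """Mean ACE over the adjacent pairs of a causal path."""
    pairs = list(zip(path, path[1:]))
    return sum(ace_scores[p] for p in pairs) / len(pairs)

paths = [("Bitrate", "BranchMisses", "FPS"),
         ("BufferSize", "CacheMisses", "FPS")]
top_k = sorted(paths, key=pace, reverse=True)[:1]   # K = 1
print(top_k)  # the Bitrate path, with PACE = (11.5 + 4.0) / 2 = 7.75
```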
Next, we evaluate counterfactual queries, formulated over the configuration options and performance objectives in a particular path, to resolve the performance task at hand.
Counterfactual queries reason about changes to the misconfigurations. Example: "Given that my current Bitrate is 6000 and I have low throughput, what is the probability of having low throughput if Bitrate is increased to 10000?" We are interested in the scenario where we hypothetically have low throughput, conditioned on the following events: we hypothetically set the new Bitrate to 10000; Bitrate was initially set to 6000; we observed low throughput when Bitrate was set to 6000; and everything else remains the same.
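A counterfactual like this can be evaluated by the standard abduction-action-prediction recipe. The sketch below uses a toy linear structural model; the mechanism (Throughput = 0.004·Bitrate + U), the threshold, and all numbers are assumptions, not the system's real behavior:

```python
LOW_FPS = 30.0  # hypothetical "low throughput" threshold

def p_low_counterfactual(obs_bitrate, obs_fps, new_bitrate):
    # Abduction: the observed outcome pins down the exogenous noise U.
    u = obs_fps - 0.004 * obs_bitrate
    # Action + prediction: replay the same unit with Bitrate forced to new value.
    cf_fps = 0.004 * new_bitrate + u
    # Deterministic given U, so the probability is 0 or 1 in this toy model.
    return 1.0 if cf_fps < LOW_FPS else 0.0

# "Given Bitrate=6000 produced 26 FPS (low), what is the probability of low
# throughput if Bitrate had been 10000 instead?"
print(p_low_counterfactual(6000, 26.0, 10000))  # 0.0: throughput would recover
```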
Among all possible changes, we pick the change with the largest individual causal effect (ICE): we set every configuration option in the path to each of its permitted values and compute the ICE of the resulting change. Because the ICE is inferred from observational data, this step is very cheap, requiring no new measurements.
Stage V: estimate causal queries, e.g., the probability of satisfying the QoS given BufferSize = 20000. We use do-calculus to evaluate the causal queries and to estimate the budget and additional constraints.
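When the confounders of an option are known from the causal graph, an interventional query of this kind can be estimated from observational data via the backdoor adjustment, P(QoS | do(B)) = Σ_z P(QoS | B, Z=z) P(Z=z). A sketch with a single confounder Z and invented counts:

```python
# Observational records (BufferSize, confounder Z, QoS met) — counts invented.
data = ([("6k", "hi", True)] * 30 + [("6k", "hi", False)] * 10 +
        [("6k", "lo", True)] * 5 + [("6k", "lo", False)] * 15 +
        [("2k", "hi", True)] * 4 + [("2k", "hi", False)] * 16 +
        [("2k", "lo", True)] * 2 + [("2k", "lo", False)] * 38)

def p_qos_do(buffer_size):
    """Backdoor-adjusted P(QoS met | do(BufferSize = buffer_size))."""
    n = len(data)
    total = 0.0
    for z in ("hi", "lo"):
        p_z = sum(1 for _, zz, _ in data if zz == z) / n          # P(Z = z)
        group = [met for b, zz, met in data if b == buffer_size and zz == z]
        total += p_z * (sum(group) / len(group))                  # P(QoS | B, z)
    return total

print(p_qos_do("6k"), p_qos_do("2k"))  # 0.5 vs 0.125
```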
[Figure: gain (%) vs. workload size for UNICORN + 20%, UNICORN + 10%, UNICORN (reuse), SMAC + 20%, SMAC + 10%, and SMAC (reuse).] UNICORN finds configurations with higher gain when the workload changes. Takeaway: UNICORN can be effectively reused in new environments for different performance tasks.
In summary: purely statistical performance models can be misled by pooled data (e.g., Throughput vs. Cache Misses across cache policies), whereas UNICORN learns a causal performance model (Cache Policy → Cache Misses → Throughput) and iterates over five stages: specify the performance query, learn a causal performance model from initial performance data, determine the next configuration, update the causal performance model, and estimate causal queries such as P(Th > 40/s | do(BufferSize = 6k)). Causal reasoning enables more reliable performance analyses and more transferable performance models.
Challenges:
• The configuration space is combinatorially large, with 1000s of configuration options.
• Configuration options from each component interact non-trivially with one another.
• Individual component developers have a localized and limited understanding of the performance behavior of these systems.
• Each deployment needs to be configured correctly, and reconfigured every time an environmental change occurs, which is prone to misconfiguration.
Incorrect understanding of the performance behavior often leads to misconfiguration.
Why causal inference? Accuracy across environments. [Figure: number of model terms and MAPE (%) for regression models vs. causal performance models when transferred from a source to a target environment.] Regression models share few common predictors between source and target and exhibit high error when reused; causal performance models share many common predictors and exhibit low error when reused. Causal models can therefore be reliably reused when environmental changes occur.