A typical video-analytics pipeline consists of a Decoder, Stream Muxer, Primary Detector, Object Tracker, and Secondary Classifier, exposing 55, 86, 14, 44, and 86 configuration options respectively (e.g., Compression, Encryption, ...), for 2^285 possible configurations. Complex interactions between options, both within and across components, give rise to a combinatorially large configuration space.
[Figure: latency and energy consumption across different configurations. Latency varies by 5x; energy consumption varies by 4.5x.] The system is expected to be set to a configuration for which performance remains optimal or close to optimal. Reaching the desired performance goal is difficult due to the sheer size of the configuration space and the high cost of measuring each configuration.
A user report illustrates the problem: "When trying to transplant our CUDA source code from TX1 to TX2, it behaved strangely. We noticed that TX2 has twice the GPU computing ability of TX1, so we expected TX2 to be at least 30-40% faster. Unfortunately, most of our code base took twice the time it did on TX1; in other words, TX2 mostly ran at half the speed of TX1. We believe that TX2's CUDA API runs much slower than TX1's in many cases." The user is transferring code from one hardware platform to another, and the code ran 2x slower on the more powerful hardware. An incorrect understanding of performance behavior often leads to misconfiguration.
A misconfiguration arises from unexpected interactions between configuration options in the deployed system stack. The system does not crash but remains operational with degraded performance, e.g., high latency, low throughput, or high energy consumption. [Figure: latency (seconds) and energy (joules) distributions under misconfiguration.]
The developer's goal is to find the root cause of the misconfiguration and fix it. In the TX1-to-TX2 example above, the user expects a 30-40% improvement but instead observes a 2x slowdown on the more powerful hardware.
[Figure: latency (seconds) across configurations.] Here, the developer aims to find the optimal configuration, with or without having experienced a misconfiguration.
Fixes are typically non-intuitive, involving changes to seemingly unrelated options. Consider this forum exchange:

June 3, User: "Any suggestions on how to improve my performance? Thanks!"
June 4, Support: "Please do the following and let us know if it works: 1. Install JetPack 3.0; 2. Set nvpmodel=MAX-N; 3. Run jetson_clocks.sh."
June 4, User: "We have already tried this. We still have high latency. Any other suggestions?"
June 5, Support: "TX2 is Pascal architecture. Please update your CMakeLists: + set(CUDA_STATIC_RUNTIME OFF) ... + -gencode=arch=compute_62,code=sm_62"

The user had several misconfigurations. In software: wrong compilation flags and wrong SDK version. In hardware: wrong power mode and wrong clock/fan settings.
One approach builds black-box performance models from observational data. [Figure: latency as a function of the number of splitters, fitted by cubic interpolation over a finer grid.] The observational data record, per measured configuration, the option settings and the resulting performance:

Config | Bitrate (bits/s) | Enable Padding | ... | Cache Misses | ... | Throughput (fps)
c1 | 1k | 1 | ... | 42m | ... | 7
c2 | 2k | 1 | ... | 32m | ... | 22
... | ... | ... | ... | ... | ... | ...
cn | 5k | 0 | ... | 12m | ... | 25

A regression model is then fit over these data, e.g.,

Throughput = 5.1 × Bitrate + 2.5 × BatchSize + 12.3 × Bitrate × BatchSize,

where the individual terms capture options and the product terms capture interactions between options. This is a representative work; many other works also use regression models (as well as other statistical models) to build performance models. We have a selection bias here ;)
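As an illustration, a performance-influence model of this form can be recovered by ordinary least squares. The data below are synthetic, generated from the regression equation above; the variable ranges and noise level are assumptions made purely for illustration.

```python
import numpy as np

# Synthetic observational data generated from the regression on the slide:
# Throughput = 5.1*Bitrate + 2.5*BatchSize + 12.3*Bitrate*BatchSize (+ noise).
rng = np.random.default_rng(0)
bitrate = rng.uniform(1, 5, 50)   # hypothetical option ranges
batch = rng.uniform(1, 8, 50)
throughput = (5.1 * bitrate + 2.5 * batch
              + 12.3 * bitrate * batch + rng.normal(0, 0.1, 50))

# Design matrix: main effects (options) plus the pairwise interaction term.
X = np.column_stack([bitrate, batch, bitrate * batch])
coef, *_ = np.linalg.lstsq(X, throughput, rcond=None)
print(coef)  # recovers roughly [5.1, 2.5, 12.3]
```

The interaction coefficient dominates the main effects here, which is exactly why option interactions matter so much in such models.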
In these data, increasing Cache Misses appears to increase Throughput. This is counter-intuitive: more cache misses should reduce throughput, not increase it. Purely statistical models built on such data will be unreliable; performance influence models might therefore be unreliable.
[Figure: Throughput (FPS) vs. Cache Misses, grouped by cache policy (LRU, FIFO, LIFO, MRU).] Segregating the data by Cache Policy shows that, within each group, an increase in Cache Misses results in a decrease in Throughput; the apparent positive trend is an artifact of pooling across policies.
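This is an instance of Simpson's paradox. A small synthetic sketch (all numbers are hypothetical, chosen only to mimic the plot) shows how the pooled correlation can flip sign:

```python
import numpy as np

rng = np.random.default_rng(1)
groups = []
# Hypothetical (policy, baseline FPS, typical cache-miss level) triples:
# policies with higher baseline throughput also happen to incur more misses.
for policy, base_fps, miss_level in [("LRU", 25, 180_000), ("FIFO", 18, 120_000),
                                     ("LIFO", 12, 60_000), ("MRU", 6, 20_000)]:
    misses = miss_level + rng.normal(0, 5_000, 40)
    # Within a policy, more misses genuinely lowers throughput.
    fps = base_fps - 2e-4 * (misses - miss_level) + rng.normal(0, 0.3, 40)
    groups.append((misses, fps))

pooled = np.corrcoef(np.concatenate([m for m, _ in groups]),
                     np.concatenate([f for _, f in groups]))[0, 1]
within = [np.corrcoef(m, f)[0, 1] for m, f in groups]
print(pooled)   # positive: the misleading pooled trend
print(within)   # all negative: the true per-policy effect
```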
A causal performance model expresses the relationships between configuration options, system events, and non-functional properties, representing the interacting variables as a causal graph whose edges carry the direction of causality: e.g., Cache Policy influences both Cache Misses and Throughput, and Cache Misses in turn influences Throughput.
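A minimal encoding of such a causal graph, using the three variables above; the adjacency-dict representation is just one convenient choice, not part of the original method.

```python
# Directed causal graph: Cache Policy influences both Cache Misses and
# Throughput, and Cache Misses in turn influences Throughput.
causal_graph = {
    "CachePolicy": ["CacheMisses", "Throughput"],
    "CacheMisses": ["Throughput"],
    "Throughput": [],
}

def parents(node):
    """All direct causes of `node` in the graph."""
    return [p for p, children in causal_graph.items() if node in children]

print(parents("Throughput"))  # ['CachePolicy', 'CacheMisses']
```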
UNICORN proceeds in five stages:
I. Specify performance query. Given a performance fault or issue on a system stack (software: DeepStream; middleware: TF, TensorRT; hardware: Nvidia Xavier; configuration: default) and a performance task (What is the root cause of the fault? How do I fix the misconfiguration? How do I optimize performance? How do I understand performance?), e.g., QoS: Th > 40/s, observed: Th < 30/s ± 5/s.
II. Learn causal performance model from initial performance data.
III. Determine the next configuration via iterative sampling.
IV. Update the causal performance model.
V. Estimate causal queries with the causal inference engine, e.g., the probability of satisfying the QoS if BufferSize is set to 6k: P(Th > 40/s | do(BufferSize = 6k)).
Stage I: given a performance query such as "What are the root causes of my performance fault, and how can I improve performance by 70%?", the query engine extracts the information (e.g., a 70% gain is expected) that the subsequent stages need for the performance task.
Stage II learns a partial ancestral graph (PAG) over the configuration options (Bitrate, Buffer Size, Batch Size, Enable Padding) and system events (Cache Misses, No. of Cycles). A PAG can have three types of edges between any nodes X and Y:
• X → Y: X is a parent of Y.
• X ↔ Y: a confounder exists between X and Y.
• X o→ Y: there is not sufficient data to recover the causal direction; the edge may turn out to be X → Y or X ↔ Y.
Why select top K paths? A real-world causal graph (e.g., for a data analytics pipeline) can be very complex, and it may be intractable to reason over the entire graph directly.
A causal path always begins with a configuration option or a system event and always terminates at a performance objective: for example, Bitrate → Branch Misses → FPS is a valid causal path, whereas a path running from FPS back through Branch Misses to Cache Misses is not.
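Extracting such paths is a simple graph traversal. The sketch below uses node names from the slides, but the edge set is an assumption made for illustration:

```python
# Edges of a hypothetical causal performance model.
graph = {
    "Bitrate": ["BranchMisses"],
    "BufferSize": ["CacheMisses"],
    "BranchMisses": ["FPS"],
    "CacheMisses": ["FPS"],
    "FPS": [],
}
options = {"Bitrate", "BufferSize"}   # configuration options (path starts)
objectives = {"FPS"}                  # performance objectives (path ends)

def causal_paths(node, prefix=()):
    """Yield every path from `node` down to a performance objective."""
    path = prefix + (node,)
    if node in objectives:
        yield path
    for child in graph[node]:
        yield from causal_paths(child, path)

paths = [p for opt in sorted(options) for p in causal_paths(opt)]
print(paths)
```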
There may be too many causal paths, so we need to select the most useful ones. To do so, we compute the average causal effect (ACE) of each pair of neighbors in a path, e.g., for Bitrate → Branch Misses → FPS:

ACE(BranchMisses, Bitrate) = (1/N) Σ_{a,b} [ E(BranchMisses | do(Bitrate = b)) − E(BranchMisses | do(Bitrate = a)) ]

Here E(BranchMisses | do(Bitrate = b)) is the expected value of Branch Misses when we artificially intervene by setting Bitrate to the value b, and the sum averages over all permitted values of Bitrate. If this difference is large, then small changes to Bitrate cause large changes to Branch Misses.
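A simulation sketch of the ACE on a toy structural model. The mechanism (BranchMisses = 3·Bitrate + noise) and the set of permitted values are invented for illustration, and absolute differences are averaged so that opposite-direction pairs do not cancel:

```python
import numpy as np

rng = np.random.default_rng(2)

def e_branch_misses_do(bitrate, n=10_000):
    """E[BranchMisses | do(Bitrate = bitrate)] under the toy mechanism."""
    return float(np.mean(3.0 * bitrate + rng.normal(0.0, 1.0, n)))

def ace(values):
    """Average absolute causal effect over all pairs of permitted values."""
    exp = {v: e_branch_misses_do(v) for v in values}
    diffs = [abs(exp[b] - exp[a]) for a in values for b in values if a != b]
    return float(np.mean(diffs))

print(ace([1, 2, 4, 8]))  # large: Bitrate strongly influences BranchMisses
```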
We then average the ACE of all pairs of adjacent nodes in the path, rank paths from the highest path ACE (PACE) score to the lowest, and use the top K paths for subsequent analysis. For a path Z → X → Y (e.g., Bitrate → Branch Misses → FPS):

PACE(Z, Y) = (1/2) (ACE(Z, X) + ACE(X, Y))
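Given ACE scores for the adjacent pairs (the values below are hypothetical), ranking paths by PACE and keeping the top K is straightforward:

```python
# Hypothetical ACE scores for adjacent node pairs in two candidate paths.
ace_scores = {
    ("Bitrate", "BranchMisses"): 11.5,
    ("BranchMisses", "FPS"): 4.0,
    ("BufferSize", "CacheMisses"): 0.3,
    ("CacheMisses", "FPS"): 0.8,
}

def pace(path):
    """Mean ACE over the adjacent pairs of a causal path."""
    pairs = list(zip(path, path[1:]))
    return sum(ace_scores[p] for p in pairs) / len(pairs)

paths = [("Bitrate", "BranchMisses", "FPS"),
         ("BufferSize", "CacheMisses", "FPS")]
top_k = sorted(paths, key=pace, reverse=True)[:1]   # K = 1
print(top_k)  # the Bitrate path, with PACE = (11.5 + 4.0) / 2 = 7.75
```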
Next, we evaluate counterfactual queries, formulated over the configuration options and performance objectives in a particular path, to resolve the performance task at hand.
Counterfactual queries reason about changes to the misconfigurations. Example: "Given that my current Bitrate is 6000 and I have low throughput, what is the probability of having low throughput if Bitrate is increased to 10000?" We are interested in the scenario where we hypothetically have low throughput, conditioned on the following events: we hypothetically set the new Bitrate to 10000; Bitrate was initially set to 6000; we observed low throughput when Bitrate was set to 6000; and everything else remains the same.
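A counterfactual like this can be evaluated by the standard abduction-action-prediction recipe. The sketch below uses a toy linear structural model; the mechanism (Throughput = 0.004·Bitrate + U), the threshold, and all numbers are assumptions, not the system's real behavior:

```python
LOW_FPS = 30.0  # hypothetical "low throughput" threshold

def p_low_counterfactual(obs_bitrate, obs_fps, new_bitrate):
    # Abduction: the observed outcome pins down the exogenous noise U.
    u = obs_fps - 0.004 * obs_bitrate
    # Action + prediction: replay the same unit with Bitrate forced to new value.
    cf_fps = 0.004 * new_bitrate + u
    # Deterministic given U, so the probability is 0 or 1 in this toy model.
    return 1.0 if cf_fps < LOW_FPS else 0.0

# "Given Bitrate=6000 produced 26 FPS (low), what is the probability of low
# throughput if Bitrate had been 10000 instead?"
print(p_low_counterfactual(6000, 26.0, 10000))  # 0.0: throughput would recover
```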
Among all possible changes, we pick the change with the largest individual causal effect (ICE): we set every configuration option in the path to each of its permitted values and compute the ICE of the resulting change. Because the ICE is inferred from observational data, this step is very cheap, requiring no new measurements.
Stage V: estimate causal queries, e.g., the probability of satisfying the QoS given BufferSize = 20000. We use do-calculus to evaluate the causal queries and to estimate the budget and additional constraints.
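When the confounders of an option are known from the causal graph, an interventional query of this kind can be estimated from observational data via the backdoor adjustment, P(QoS | do(B)) = Σ_z P(QoS | B, Z=z) P(Z=z). A sketch with a single confounder Z and invented counts:

```python
# Observational records (BufferSize, confounder Z, QoS met) — counts invented.
data = ([("6k", "hi", True)] * 30 + [("6k", "hi", False)] * 10 +
        [("6k", "lo", True)] * 5 + [("6k", "lo", False)] * 15 +
        [("2k", "hi", True)] * 4 + [("2k", "hi", False)] * 16 +
        [("2k", "lo", True)] * 2 + [("2k", "lo", False)] * 38)

def p_qos_do(buffer_size):
    """Backdoor-adjusted P(QoS met | do(BufferSize = buffer_size))."""
    n = len(data)
    total = 0.0
    for z in ("hi", "lo"):
        p_z = sum(1 for _, zz, _ in data if zz == z) / n          # P(Z = z)
        group = [met for b, zz, met in data if b == buffer_size and zz == z]
        total += p_z * (sum(group) / len(group))                  # P(QoS | B, z)
    return total

print(p_qos_do("6k"), p_qos_do("2k"))  # 0.5 vs 0.125
```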
[Figure: gain (%) vs. workload size for UNICORN + 20%, UNICORN + 10%, UNICORN (reuse), SMAC + 20%, SMAC + 10%, and SMAC (reuse).] UNICORN finds configurations with higher gain when the workload changes. Takeaway: UNICORN can be effectively reused in new environments for different performance tasks.
In summary: purely statistical performance models can be misled by pooled data (e.g., Throughput vs. Cache Misses across cache policies), whereas UNICORN learns a causal performance model (Cache Policy → Cache Misses → Throughput) and iterates over five stages: specify the performance query, learn a causal performance model from initial performance data, determine the next configuration, update the causal performance model, and estimate causal queries such as P(Th > 40/s | do(BufferSize = 6k)). Causal reasoning enables more reliable performance analyses and more transferable performance models.
Challenges:
• The configuration space is combinatorially large, with 1000s of configuration options.
• Configuration options from each component interact non-trivially with one another.
• Individual component developers have a localized and limited understanding of the performance behavior of these systems.
• Each deployment needs to be configured correctly, and reconfigured every time an environmental change occurs, which is prone to misconfiguration.
Incorrect understanding of the performance behavior often leads to misconfiguration.
Why causal inference? Accuracy across environments. [Figure: number of model terms and MAPE (%) for regression models vs. causal performance models when transferred from a source to a target environment.] Regression models share few common predictors between source and target and exhibit high error when reused; causal performance models share many common predictors and exhibit low error when reused. Causal models can therefore be reliably reused when environmental changes occur.