
Understanding and Explaining the Root Causes of Performance Faults with Causal AI: A Path towards Building Dependable Computer Systems

An invited talk at NASA JPL on August 19th, 2022.

Speaker: Pooyan Jamshidi

In this talk, I will present our recent progress in employing Causal AI (causal structure learning and inference, counterfactual reasoning, and transfer learning) to address several significant challenges in computer systems. After motivating the work, I will show how mainstream machine learning, which relies on correlations that may be spurious, can become unreliable in certain situations. Next, I will present empirical observations that explain the underlying root causes of performance faults in several highly configurable systems, including autonomous systems, robotics, on-device machine learning systems, and data analytics pipelines. I will then present our framework, Unicorn, and discuss how Unicorn fills the gap by employing a **causal reasoning approach**. In particular, I will discuss how Unicorn captures intricate interactions between configuration options across the software-hardware stack and how such interactions can impact performance variations. Finally, I will talk about our two-year journey in a NASA-funded project called RASPBERRY-SI, in which we are developing a causal reasoning approach for synthesizing adaptation plans that reconfigure autonomous systems in response to environmental uncertainties during operation.

For more information regarding the technical work and the people behind the work that I present, please refer to the following websites:
- The Unicorn framework: https://github.com/softsys4ai/unicorn
- The NASA-funded RASPBERRY-SI project: https://nasa-raspberry-si.github.io/raspberry-si/

Bio: Pooyan Jamshidi is an Assistant Professor of Computer Science and Engineering at UofSC. His research involves designing novel artificial intelligence (AI) and machine learning (ML) algorithms and investigating their theoretical guarantees. He is also interested in applying AI/ML algorithms to high-impact applications, including robotics, computer systems, and space exploration. Pooyan has extensive collaborations with Google and NASA and is always open to exploring new collaborations. He received the UofSC Breakthrough Award in 2022. Before his current position, he was a postdoctoral associate at CMU (Pittsburgh, US) and Imperial College (London, UK). He received a Ph.D. (Computer Science) from DCU (Dublin, Ireland) in 2014, and a B.S. (Math & Computer Science) and an M.S. (Systems Engineering) from AUT (Iran) in 2003 and 2006, respectively. For more info about Pooyan's research and his group at UofSC, please refer to: http://pooyanjamshidi.github.io/.

Pooyan Jamshidi

September 07, 2022

Transcript

  1. Understanding and Explaining the Root Causes of Performance Faults with

    Causal AI: A Path towards Building Dependable Computer Systems. Pooyan Jamshidi
  2. Outline: Motivation; Causal AI for Systems; UNICORN Results;

    Causal AI for Autonomy and Robotics; Autonomy Evaluation at JPL
  3. [image-only slide]

  4. Empirical observations confirm that systems are becoming increasingly

    configurable. [Plots: number of configuration parameters over release time for Storage-A, MySQL, Apache, and Hadoop (MapReduce/HDFS).] [Tianyin Xu, et al., "Too Many Knobs…", FSE'15]
  5. Empirical observations confirm that systems are becoming increasingly

    configurable. [Excerpt of the referenced study (UC San Diego, Huazhong Univ. of Science & Technology, and NetApp) with the same parameter-growth plots for Storage-A, MySQL, Apache, and Hadoop.] [Tianyin Xu, et al., "Too Many Knobs…", FSE'15]
  6. Configurations determine the performance behavior. [C code example: Parrot_setenv(...)

    guarded by #ifdef PARROT_HAS_SETENV and #ifdef LINUX, illustrating how compile-time configuration options change the code that runs and, with it, Speed and Energy.]
  7. Outline: Motivation; Causal AI for Systems; Case Study; Results;

    Causal AI for Autonomy and Robotics; Autonomy Evaluation at JPL
  8. SocialSensor [architecture diagram: Internet → Crawling → Orchestrator → Content Analysis →

    Search and Integration; tweets arrive at 5k-20k/min, every 10 min a batch of 100k tweets is pushed, and 10M tweets are fetched/stored as crawled items]
  9. Challenges [same SocialSensor pipeline annotated with scale and timing requirements:

    100X crawled items, 10X tweets, real-time processing]
  10. The default configuration is typically bad and the optimal configuration is noticeably

    better than the median. [Scatter plot: throughput (ops/sec) vs. average write latency (µs), marking the default and optimal configurations.] • Default is bad • 2X-10X faster than worst • Noticeably faster than median
  11. CoBot experiment: DARPA BRASS. [Plot: localization error [m] vs. CPU utilization [%],

    showing the energy constraint, safety constraint, Pareto front, and a sweet spot; configuration options no_of_particles=x, no_of_refinement=y.]
  12. CoBot experiment. [Heatmaps of CPU [%] over the configuration space:

    Source (given), Target (ground truth after 6 months), Prediction with 4 samples, and Prediction with Transfer learning.]
  13. Transfer Learning for Improving Model Predictions in Highly Configurable Software.

    Pooyan Jamshidi, Miguel Velez, Christian Kästner (Carnegie Mellon University), Norbert Siegmund (Bauhaus-University Weimar), Prasad Kawthekar (Stanford University). [Abstract excerpt: instead of learning a black-box performance model from measurements of the real system in a changing environment, learn the model using samples from other sources, such as simulators that approximate the performance of the real system; Fig. 1: transfer learning for performance model learning, from a simulator (source) to the robot (target), to identify the best performing configuration for a robot.] Details: [SEAMS '17]
  14. Looking further: When transfer learning goes wrong. [Boxplots of absolute percentage

    error [%] for models learned from sources s, s1-s6 vs. a non-transfer-learning baseline:] source s/s1/s2/s3/s4/s5/s6, noise-level 0/5/10/15/20/25/30, corr. coeff. 0.98/0.95/0.89/0.75/0.54/0.34/0.19, µ(pe) 15.34/14.14/17.09/18.71/33.06/40.93/46.75. It worked for the closely related sources and didn't for the distant ones. Insight: Predictions become more accurate when the source is more related to the target.
  15. [Six heatmap panels (a)-(f) of CPU usage [%] over number of particles × number of

    refinements in different environments; transfer worked in three of them and didn't in the other three.]
  16. Key question: Can we develop a theory to explain when

    transfer learning works? [Source (given) → transferable knowledge (extract, reuse) → target (learn), for both data and models.] Preliminary concepts from the excerpt: (1) Configuration and environment space: each feature Fi of a configurable system A is either enabled or disabled; the configuration space is the Cartesian product C = Dom(F1) × ⋯ × Dom(Fd) with Dom(Fi) = {0, 1}, and an environment instance is e = [w, h, v] drawn from E = W × H × V (workload, hardware, system version). (2) Performance model: a black-box function f : F × E → R learned from observations yi = f(xi) + εi with εi ~ N(0, σi), giving training data Dtr = {(xi, yi)}, i = 1..n, i.e., a mapping from configurations to a measurable, interval-scaled performance metric. (3) Performance distribution: a stochastic process pd : E → Δ(R) that assigns a probability distribution over performance measures to each environmental condition, constructed by running A on various configurations for a specific environment instance. Q1: How are source and target "related"? Q2: What characteristics are preserved? Q3: What are the actionable insights?
  17. Transfer Learning for Performance Modeling of Configurable Systems: An Exploratory

    Analysis. Pooyan Jamshidi (Carnegie Mellon University), Norbert Siegmund (Bauhaus-University Weimar), Miguel Velez, Christian Kästner, Akshay Patel, Yuvraj Agarwal (Carnegie Mellon University). Abstract: Modern software systems provide many configuration options which significantly influence their non-functional properties. To understand and predict the effect of configuration options, several sampling and learning strategies have been proposed, albeit often with significant cost to cover the highly dimensional configuration space. Recently, transfer learning has been applied to reduce the effort of constructing performance models by transferring knowledge about performance behavior across environments. While this line of research is promising to learn more accurate models at a lower cost, it is unclear why and when transfer learning works for performance modeling. To shed light on when it is beneficial to apply transfer learning, we conducted an empirical study on four popular software systems, varying software configurations and environmental conditions, such as hardware, workload, and software versions, to identify the key knowledge pieces that can be exploited for transfer learning. Our results show that in small environmental changes (e.g., homogeneous workload change), by applying a linear transformation to the performance model, we can understand the performance behavior of the target environment, while for severe environmental changes (e.g., drastic workload change) we can transfer only knowledge that makes sampling more efficient, e.g., by reducing the dimensionality of the configuration space. [Fig. 1: transfer learning takes advantage of transferable knowledge from the source to learn an accurate, reliable, and less costly model for the target environment.] Details: [ASE '17]
  18. Causal AI in Systems and Software: Computer Architecture, Databases,

    Operating Systems, Programming Languages, Big Data, Software Engineering. https://github.com/y-ding/causal-system-papers
  19. Misconfiguration and its Effects • Misconfigurations can elicit unexpected interactions

    between software and hardware • These can result in non-functional faults ◦ affecting non-functional system properties like latency, throughput, energy consumption, etc. The system doesn't crash or exhibit an obvious misbehavior; it remains operational but with degraded performance, e.g., high latency, low throughput, high energy consumption, high heat dissipation, or a combination of several.
  20. Motivating Example: CUDA performance issue on TX2 (forum post): "When we are trying

    to transplant our CUDA source code from TX1 to TX2, it behaved strange. We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation, we think TX2 will 30% - 40% faster than TX1 at least. Unfortunately, most of our code base spent twice the time as TX1, in other words, TX2 only has 1/2 speed as TX1, mostly. We believe that TX2's CUDA API runs much slower than TX1 in many cases." The user is transferring the code from one hardware to another; the target hardware is faster than the source hardware, and the user expects the code to run at least 30-40% faster, yet the code ran 2x slower on the more powerful hardware.
  21. Motivating Example (forum thread, June 3rd-5th): "Any suggestions on how to improve

    my performance? Thanks!" ... "Please do the following and let us know if it works: 1. Install JetPack 3.0 2. Set nvpmodel=MAX-N 3. Run jetson_clock.sh" ... "We have already tried this. We still have high latency. Any other suggestions?" ... "TX2 is Pascal architecture. Please update your CMakeLists: + set(CUDA_STATIC_RUNTIME OFF) ... + -gencode=arch=compute_62,code=sm_62". The user had several misconfigurations. In software: wrong compilation flags, wrong SDK version. In hardware: wrong power mode, wrong clock/fan settings. The discussion took 2 days! How to resolve such issues faster?
  22. Performance Influence Models. [Observational data: configurations c1…cn with option values

    (Bitrate, Enable Padding, …), system events (Cache Misses, …), and measured Throughput (fps); a black-box response surface, e.g., latency (ms) over number of counters × number of splitters.] Regression equation with discovered option interactions: Throughput = 5.1 × Bitrate + 2.5 × BatchSize + 12.3 × Bitrate × BatchSize
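
To make the idea concrete, here is a minimal sketch (not from the talk) of how such a performance influence model could be fit as a regression with interaction terms using scikit-learn; the data and column names (Bitrate, BatchSize, Throughput) are made up for illustration.

```python
# Fit a performance-influence model: a regression over configuration options
# with pairwise interaction terms, learned from observational data.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical observational measurements (one row per configuration).
data = pd.DataFrame({
    "Bitrate":    [1, 2, 5, 10, 20],
    "BatchSize":  [1, 2, 4, 8, 16],
    "Throughput": [7, 22, 25, 31, 40],
})

X = data[["Bitrate", "BatchSize"]].to_numpy()
y = data["Throughput"].to_numpy()

# interaction_only=True adds the Bitrate*BatchSize term, mirroring
# Throughput = b0 + b1*Bitrate + b2*BatchSize + b3*Bitrate*BatchSize.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X)

model = LinearRegression().fit(X_poly, y)
for name, coef in zip(poly.get_feature_names_out(["Bitrate", "BatchSize"]), model.coef_):
    print(f"{name}: {coef:.2f}")
```
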
  23. Performance Influence Models. These methods rely on statistical correlations to extract the

    information required for performance tasks. [Same observational data, black-box model, and regression equation as the previous slide.]
  24. Performance influence models suffer from several shortcomings: • they could produce incorrect

    explanations • they could produce unreliable predictions • they could produce unstable predictions across environments and in the presence of measurement noise
  25. Performance Influence Models Issue: Incorrect Explanation. [Scatter plot of Throughput (FPS)

    vs. Cache Misses (0-200k); the fitted trend suggests that increasing Cache Misses increases Throughput.]
  26. Performance Influence Models Issue: Incorrect Explanation. [Same plot.] Increasing Cache Misses

    appears to increase Throughput. This is counter-intuitive: more Cache Misses should reduce Throughput, not increase it. Any ML/statistical model built on this data will be incorrect.
  27. Performance Influence Models Issue: Incorrect Explanation. [Same data segregated by Cache

    Policy (LRU, FIFO, LIFO, MRU).] Segregating the data by Cache Policy shows that, within each group, an increase in Cache Misses results in a decrease in Throughput.
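
The reversal above is an instance of Simpson's paradox; below is a small synthetic illustration (not the talk's data) of how pooling over Cache Policy can flip the sign of the trend.

```python
# Synthetic demo: pooled correlation between cache misses and throughput is
# positive, but within each cache policy the trend is negative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
rows = []
# Each policy shifts the operating point (baseline misses and throughput).
for policy, base_miss, base_fps in [("LRU", 50e3, 8), ("FIFO", 100e3, 14), ("MRU", 150e3, 20)]:
    misses = base_miss + rng.normal(0, 10e3, 200)
    fps = base_fps - 0.00005 * (misses - base_miss) + rng.normal(0, 0.5, 200)
    rows.append(pd.DataFrame({"policy": policy, "cache_misses": misses, "fps": fps}))
df = pd.concat(rows, ignore_index=True)

print("pooled correlation:", round(df["cache_misses"].corr(df["fps"]), 2))  # positive
for policy, group in df.groupby("policy"):
    print(policy, round(group["cache_misses"].corr(group["fps"]), 2))       # negative per group
```
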
  28. Performance Influence Models Issue: Unstable Predictors. Performance influence models change significantly

    in new environments, resulting in lower accuracy. Model in TX2: Throughput = 5.1 × Bitrate + 2.5 × BatchSize + 12.3 × Bitrate × BatchSize. Model in Xavier: Throughput = 2 × Bitrate + 1.9 × BatchSize + 1.8 × BufferSize + 0.5 × EnablePadding + 5.9 × Bitrate × BufferSize + 6.2 × Bitrate × EnablePadding + 4.1 × Bitrate × BufferSize × EnablePadding
  29. Performance Influence Models Issue: Unstable Predictors. Performance influence models cannot be reliably

    used across environments. [Same TX2 and Xavier models as the previous slide.]
  30. Performance Influence Models Issue: Non-generalizability. Performance influence models do not generalize

    well across deployment environments. [Same TX2 and Xavier models as the previous slides.]
  31. Causal Performance Model: expresses the relationships between interacting variables (configuration

    options, system events, non-functional properties) as a causal graph with an explicit direction of causality, e.g., Cache Policy → Cache Misses → Throughput.
  32. Why Causal Inference? Produces correct explanations. [Pooled and per-policy plots as before.]

    Cache Policy affects Throughput via Cache Misses; the causal performance model recovers the correct interactions (Cache Policy → Cache Misses → Throughput).
  33. Why Causal Inference? Minimal structure change. Causal models

    remain relatively stable across environments. [Partial causal performance models in Jetson TX2 and Jetson Xavier over Bitrate, Buffer Size, Batch Size, Enable Padding, Branch Misses, Cache Misses, Cycles, FPS, and Energy.]
  34. Why Causal Inference? Accurate across environments. [Charts comparing regression-based performance influence

    models and causal performance models: common terms (source → target), total terms (source, target), and MAPE (%) for the error in the source, in the target, and when the source model is reused in the target. Causal performance models share many common predictors across environments; regression models share far fewer.]
  35. Why Causal Inference? Accurate across environments. [Same comparison.] Causal performance models show

    low error when reused in a new environment, whereas regression-based performance influence models show high error; causal models can be reliably reused when environmental changes occur.
  36. Why Causal Inference? Generalizability. [Same comparison.] Causal models are more generalizable

    than performance influence models.
  37. How to use Causal Performance Models? [Cache Policy

    → Cache Misses → Throughput.] How do we generate a causal graph?
  38. How to use Causal Performance Models? How do we

    generate a causal graph? How do we use the causal graph for performance tasks?
  39. UNICORN: Our Causal AI for Systems Method • Build a causal performance model that captures the interactions

    among options in the variability space using observational performance data • Iteratively evaluate and update the causal performance model • Perform downstream performance tasks such as performance debugging and optimization using causal reasoning
  40. UNICORN: Our Causal AI for Systems Method. [Pipeline over a deployed stack (Software: DeepStream;

    Middleware: TF, TensorRT; Hardware: Nvidia Xavier; Configuration: Default):] 1- Specify a performance query (e.g., QoS: Th > 40/s; Observed: Th < 30/s ± 5/s). 2- Learn a causal performance model from the performance data. 3- Translate the performance query to causal queries (What is the root cause of the observed performance fault? How do I fix the misconfiguration? How can I improve throughput without sacrificing accuracy? How do I understand performance behavior?). 4- Estimate the causal queries with the query engine, e.g., the probability of satisfying the QoS if BufferSize is set to 6k: P(Th > 40/s | do(BufferSize = 6k)). 5- While the budget is not exhausted, measure the performance of the configuration(s) that maximizes information gain and update the causal performance model; the resulting model drives performance debugging and performance optimization.
  41. UNICORN: Our Causal AI for Systems Method

    [Same pipeline as the previous slide: specify a performance query, learn the causal performance model, translate the query to causal queries, estimate them with the query engine, and update the model until the budget is exhausted.]
  42. UNICORN: Our Causal AI for Systems Method

    [Same pipeline as the previous slides.]
  43. Learning Causal Performance Model. [Variables: Bitrate, Buffer Size, Batch Size, Enable Padding,

    Branch Misses, Cache Misses, No. of Cycles, FPS, Energy; observational data table c1…cn.] 1- Recovering the skeleton: start from a fully connected graph subject to given constraints (e.g., no connections between configuration options) and apply statistical independence tests. 2- Pruning the causal structure. 3- Orienting causal relations: orientation rules and measures (entropy) plus structural constraints (colliders, v-structures).
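
A much-simplified, PC-style sketch of the skeleton-recovery step: start fully connected and prune edges whose endpoints look (conditionally) independent. Real structure-learning algorithms use proper statistical CI tests, larger conditioning sets, and the structural constraints mentioned above; here a partial-correlation threshold on synthetic data stands in for all of that.

```python
import itertools
import numpy as np
import pandas as pd

def partial_corr(df, x, y, z=()):
    """Correlation of x and y after linearly regressing out the variables in z."""
    rx, ry = df[x].to_numpy(float), df[y].to_numpy(float)
    if z:
        Z = np.column_stack([np.ones(len(df))] + [df[c].to_numpy(float) for c in z])
        rx = rx - Z @ np.linalg.lstsq(Z, rx, rcond=None)[0]
        ry = ry - Z @ np.linalg.lstsq(Z, ry, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

def skeleton(df, thresh=0.1, max_cond=1):
    cols = list(df.columns)
    edges = set(itertools.combinations(cols, 2))             # start fully connected
    for x, y in list(edges):
        others = [c for c in cols if c not in (x, y)]
        for k in range(max_cond + 1):
            for z in itertools.combinations(others, k):
                if abs(partial_corr(df, x, y, z)) < thresh:   # looks independent
                    edges.discard((x, y))                     # prune the edge
                    break
            if (x, y) not in edges:
                break
    return edges

# Synthetic data with ground truth bitrate -> cache_misses -> fps
rng = np.random.default_rng(1)
bitrate = rng.uniform(1, 10, 500)
cache_misses = 3 * bitrate + rng.normal(0, 1, 500)
fps = 30 - 2 * cache_misses + rng.normal(0, 1, 500)
df = pd.DataFrame({"bitrate": bitrate, "cache_misses": cache_misses, "fps": fps})

# Expect the (bitrate, cache_misses) and (cache_misses, fps) edges to survive;
# the (bitrate, fps) edge is pruned once we condition on cache_misses.
print(skeleton(df))
```
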
  44. Performance measurement. Configuration space: ℂ = O1 × O2 × ⋯

    × O19 × O20, with options such as dead code removal, constant folding, loop unrolling, and function inlining; a configuration c1 = 0 × 0 × ⋯ × 0 × 1, c1 ∈ ℂ. A compiler (e.g., SaC, LLVM) compiles the program, the compiled code is deployed as an instrumented binary on hardware, and non-functional, measurable/quantifiable aspects are recorded, e.g., compile time fc(c1) = 11.1 ms, execution time fe(c1) = 110.3 ms, energy fen(c1) = 100 mWh.
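
A tiny sketch of the configuration space as a Cartesian product of binary options, with a stand-in measurement function; the option names mirror the slide, the measured values are placeholders.

```python
import itertools

options = ["dead_code_removal", "constant_folding", "loop_unrolling", "function_inlining"]

# C = Dom(O1) x ... x Dom(On) with Dom(Oi) = {0, 1}
configuration_space = list(itertools.product([0, 1], repeat=len(options)))
print(len(configuration_space), "configurations")   # 2**4 = 16

def measure(config):
    """Stand-in for compiling, deploying, and measuring on real hardware."""
    return {"compile_time_ms": 11.1, "exec_time_ms": 110.3, "energy_mwh": 100.0}

c1 = dict(zip(options, configuration_space[1]))      # e.g., only function_inlining enabled
print(c1, "->", measure(c1))
```
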
  45. Learning Causal Performance Model

    [Same three steps as before: recovering the skeleton, pruning the causal structure, and orienting causal relations.]
  46. Learning Causal Performance Model

    [Same three steps as before: recovering the skeleton, pruning the causal structure, and orienting causal relations.]
  47. Learning Causal Performance Model

    [Same three steps as before: recovering the skeleton, pruning the causal structure, and orienting causal relations.]
  48. Causal Performance Model. [Causal graph relating software options (Bitrate, Buffer Size, Batch Size, Enable

    Padding, Decoder, Muxer), performance events (Branch Misses, Cache Misses, No. of Cycles), and performance objectives (Throughput, Energy); causal interactions and causal paths carry functional relationships f.] Example functional node: BranchMisses = 2 × Bitrate + 8.1 × BufferSize + 4.1 × Bitrate × BufferSize × CacheMisses
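
One way to think of such a model programmatically is as a set of structural equations, one per node, evaluated in dependency order and overridable by do()-style interventions. The sketch below is illustrative: only the BranchMisses equation comes from the slide, the rest of the graph and all coefficients are invented.

```python
# node -> (parents, function of parent values)
scm = {
    "Bitrate":      ([], lambda: 2.0),                       # exogenous options
    "BufferSize":   ([], lambda: 6.0),
    "CacheMisses":  (["Bitrate"], lambda b: 10.0 * b),
    "BranchMisses": (["Bitrate", "BufferSize", "CacheMisses"],
                     lambda b, s, c: 2 * b + 8.1 * s + 4.1 * b * s * c),
    "Throughput":   (["BranchMisses", "CacheMisses"], lambda bm, c: 60 - 0.05 * bm - 0.1 * c),
}

def evaluate(model, interventions=None):
    """Evaluate all nodes; do()-style interventions override a node's equation."""
    interventions = interventions or {}
    values = {}
    def node(n):
        if n not in values:
            if n in interventions:
                values[n] = interventions[n]
            else:
                parents, fn = model[n]
                values[n] = fn(*[node(p) for p in parents])
        return values[n]
    for n in model:
        node(n)
    return values

print(evaluate(scm))                                   # observational
print(evaluate(scm, interventions={"BufferSize": 2}))  # do(BufferSize = 2)
```
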
  49. UNICORN: Our Causal AI for Systems Method

    [Same pipeline as the previous slides.]
  50. Causal Debugging • What is the root cause of my

    fault? • How do I fix my misconfigurations to improve performance? [Loop: start from a misconfiguration and about 25 sample configurations (training data) as observational data; build the causal graph; extract causal paths; rank the paths; ask counterfactual ("what if") queries, e.g., what if configuration option X was set to value 'x'; apply the best query; if the fault is not fixed, update the observational data and repeat.]
  51. Extracting Causal Paths from the Causal Model. Problem: ✕ in

    real-world cases, this causal graph can be very complex ✕ it may be intractable to reason over the entire graph directly. Solution: ✓ extract paths from the causal graph ✓ rank them based on their Average Causal Effect on latency, etc. ✓ reason over the top K paths
  52. Extracting Causal Paths from the Causal Model. [From the causal graph over Load, Swap Mem.,

    GPU Mem., and Latency, extract paths such as Load → GPU Mem. → Latency and Swap Mem. → GPU Mem. → Latency.] A causal path always begins with a configuration option or a system event and always terminates at a performance objective.
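
Path extraction itself is ordinary graph traversal; a short sketch with networkx, using the nodes from the slide.

```python
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([
    ("Load", "GPU Mem."),
    ("Swap Mem.", "GPU Mem."),
    ("GPU Mem.", "Latency"),
])

options_and_events = ["Load", "Swap Mem."]   # paths must start here
objectives = ["Latency"]                     # and end here

paths = [
    p
    for src in options_and_events
    for obj in objectives
    for p in nx.all_simple_paths(g, source=src, target=obj)
]
print(paths)   # [['Load', 'GPU Mem.', 'Latency'], ['Swap Mem.', 'GPU Mem.', 'Latency']]
```
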
  53. Ranking Causal Paths from the Causal Model • There

    may be too many causal paths • We need to select the most useful ones • Compute the Average Causal Effect (ACE) of each pair of neighbors in a path, e.g., for Swap Mem. → GPU Mem. → Latency: ACE(GPU Mem., Swap) = (1/N) Σ over pairs a, b ∈ Z of ( E[GPU Mem. | do(Swap = b)] − E[GPU Mem. | do(Swap = a)] ), where E[GPU Mem. | do(Swap = b)] is the expected value of GPU Mem. when we artificially intervene by setting Swap to the value b, the average is taken over all permitted values of Swap memory, and a large difference means that small changes to Swap Mem. cause large changes to GPU Mem.
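
A compact sketch of the ACE computation, where the interventional expectations come from a toy structural equation rather than a learned causal model; values and coefficients are made up.

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)

def gpu_mem_under_do(swap_gb, n=10_000):
    """E[GPU Mem. | do(Swap = swap_gb)] under a toy structural equation."""
    noise = rng.normal(0, 0.2, n)
    return np.mean(4.0 - 0.5 * swap_gb + noise)   # more swap -> less GPU memory pressure

permitted_swap_values = [1, 2, 4, 8]              # Z: permitted values of Swap (GB)
pairs = list(combinations(permitted_swap_values, 2))
ace = np.mean([gpu_mem_under_do(b) - gpu_mem_under_do(a) for a, b in pairs])
print(f"ACE(GPU Mem., Swap) ≈ {ace:.2f}")
```
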
  54. Ranking Causal Paths from the Causal Model • Average

    the ACE of all pairs of adjacent nodes in the path • Rank paths from the highest path ACE (PACE) score to the lowest • Use the top K paths for subsequent analysis. For a path Z → X → Y (e.g., Swap Mem. → GPU Mem. → Latency): PACE(Z, Y) = 1/2 (ACE(Z, X) + ACE(X, Y)), i.e., the ACE summed over all pairs of adjacent nodes in the causal path and averaged.
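
And a minimal sketch of the PACE ranking itself, assuming the pairwise ACE values have already been estimated (the numbers below are placeholders).

```python
def pace(path, ace):
    """Average ACE over adjacent node pairs of a path."""
    pairs = list(zip(path, path[1:]))
    return sum(ace[p] for p in pairs) / len(pairs)

ace = {  # hypothetical pairwise ACE estimates
    ("Swap Mem.", "GPU Mem."): 0.8, ("GPU Mem.", "Latency"): 1.4,
    ("Load", "GPU Mem."): 0.3,
}
paths = [["Swap Mem.", "GPU Mem.", "Latency"], ["Load", "GPU Mem.", "Latency"]]

ranked = sorted(paths, key=lambda p: pace(p, ace), reverse=True)
top_k = ranked[:1]
print(ranked)   # the Swap Mem. path ranks first
```
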
  55. Diagnosing and Fixing the Faults • What is the root cause of my fault? • How do I

    fix my misconfigurations to improve performance? [Same debugging loop as before: build the causal graph from about 25 sample configurations (training data), extract and rank causal paths, ask counterfactual ("what if") queries such as "what if configuration option X was set to value 'x'?", apply the best query, and update the observational data until the fault is fixed.]
  56. Diagnosing and Fixing the Faults • Counterfactual inference asks

    "what if" questions about changes to the misconfigurations. Example: given that my current swap memory is 2 GB and I observe high latency, what is the probability of having low latency if swap memory were increased to 4 GB? We are interested in the scenario where we hypothetically have low latency, conditioned on the following events: we hypothetically set the new swap memory to 4 GB; swap memory was initially set to 2 GB; we observed high latency when swap was set to 2 GB; and everything else remains the same.
  57. Diagnosing and Fixing the Faults. [Original path over Load, Swap, GPU Mem., and Latency vs. the

    path after the proposed change: set Swap = 4 GB, remove its incoming edges (assume no external influence), and modify the model to reflect the hypothetical scenario; would Latency be low?] Use both models to compute the answer to the counterfactual question.
  58. Diagnosing and Fixing the Faults. [Original path vs. path after the proposed change.]

    Potential = P(Latency would be low | we hypothetically set Swap to 4 GB; Swap was initially 2 GB; the latency observed at Swap = 2 GB was high; everything else U stays the same)
  59. Diagnosing and Fixing the Faults. Potential = P(outcome would be good | the change; the outcome

    without the change was bad; everything else U stays the same): the probability that the outcome is good after a change, conditioned on the past. Control = P(outcome = bad | no change, U): the probability that the outcome was bad before the change. Individual Treatment Effect (ITE) = Potential − Control; if this difference is large, then our change is useful.
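
A toy sketch of the counterfactual computation for a single change, using the abduction-action-prediction recipe on an invented linear structural equation; it uses a latency-difference flavor of the effect rather than the slide's probabilistic Potential/Control formulation.

```python
#   1. abduction: infer the noise U from what was actually observed,
#   2. action: set Swap to the hypothetical value (do(Swap = 4)),
#   3. prediction: recompute Latency with the same U.
def latency(swap_gb, u):
    return 60.0 - 10.0 * swap_gb + u               # toy structural equation

# Observed world: Swap = 2 GB, Latency = 55 ms (high).
observed_swap, observed_latency = 2.0, 55.0
u = observed_latency - (60.0 - 10.0 * observed_swap)   # abduction: U = 15

counterfactual_latency = latency(4.0, u)            # action + prediction: do(Swap = 4)
print(counterfactual_latency)                       # 35 ms: latency would have been lower

# Effect of the change (lower is better for latency):
ite = observed_latency - counterfactual_latency
print(f"ITE(Swap: 2 GB -> 4 GB) = {ite:.1f} ms improvement")
```
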
  60. Diagnosing and Fixing the Faults. [For the top K paths (e.g., Swap

    Mem. → GPU Mem. → Latency):] enumerate all possible changes by setting every configuration option in the path to all of its permitted values, compute ITE(change) for each candidate (inferred from observed data, which is very cheap), and recommend the change with the largest ITE.
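
The enumerate-and-rank step could then look like the following sketch, where ite_estimate is a placeholder for the counterfactual computation and the permitted value ranges are hypothetical.

```python
permitted = {"Swap Mem.": [1, 2, 4, 8], "GPU Mem.": [2, 4]}   # hypothetical ranges
current = {"Swap Mem.": 2, "GPU Mem.": 2}

def ite_estimate(option, value):
    """Placeholder: would be computed from the causal model as on the previous slides."""
    toy_effect = {"Swap Mem.": 5.0, "GPU Mem.": 2.0}
    return toy_effect[option] * (value - current[option])

candidates = [
    (option, value)
    for option, values in permitted.items()
    for value in values
    if value != current[option]
]
best = max(candidates, key=lambda c: ite_estimate(*c))
print("recommended change:", best)   # e.g., ('Swap Mem.', 8)
```
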
  61. Diagnosing and Fixing the Faults. Apply the change with the largest

    ITE and measure performance. Fault fixed? If yes, stop; if no, add the measurement to the observational data, update the causal model, and repeat.
  62. UNICORN: Our Causal AI for Systems Method

    [Same pipeline as the previous slides.]
  63. Active Learning for Updating the Causal Performance Model. [Causal graph over Bitrate,

    Buffer Size, Batch Size, Enable Padding, Branch Misses, Cache Misses, No. of Cycles, FPS, and Energy.] 1- Evaluate candidate interventions on hardware, workload, and kernel options (model averaging; expected change in belief and KL divergence; causal effects on the objectives). 2- Determine and perform the next performance measurement (example record: Bitrate 1k, Buffer Size 20k, Batch Size 10, Enable Padding 1, Branch Misses 24m, Cache Misses 42m, No. of Cycles 73b, FPS 31/s, Energy 42 J). 3- Update the causal model with the new performance data.
  64. Active Learning for Updating the Causal Performance Model

    [Same three steps as the previous slide: evaluate candidate interventions, determine and perform the next performance measurement, and update the causal model.]
  65. Active Learning for Updating the Causal Performance Model

    [Same three steps as the previous slides.]
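
A simplified sketch of the active-learning selection just described: keep several candidate models (model averaging), score each candidate intervention by how much the models disagree about its outcome (a cheap stand-in for the expected change in belief / KL criterion on the slide), measure the most informative configuration, and update. All models and the measure() function are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ensemble of simple throughput models: throughput = w0 + w1 * buffer_size
ensemble = [np.array([20.0, 1.5]), np.array([25.0, 0.8]), np.array([18.0, 2.2])]

candidate_buffer_sizes = [1, 2, 4, 6, 8]

def disagreement(buffer_size):
    preds = [w[0] + w[1] * buffer_size for w in ensemble]
    return np.var(preds)

# Pick the intervention the ensemble is most uncertain about.
next_config = max(candidate_buffer_sizes, key=disagreement)
print("next measurement: buffer_size =", next_config)

def measure(buffer_size):
    """Stand-in for running the real system and recording throughput."""
    return 22.0 + 1.2 * buffer_size + rng.normal(0, 0.5)

observation = (next_config, measure(next_config))
# The new observation would then be added to the performance data and the
# causal model re-learned/updated before the next iteration.
print("observed:", observation)
```
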
  66. There are two fundamental benefits that we get from our

    "Causal AI for Systems" methodology: 1. We learn one central (causal) performance model from the data across different performance tasks: • performance understanding • performance optimization • performance debugging and repair • performance prediction for different environments (e.g., canary → production). 2. The causal model is transferable across environments: • we observed the Sparse Mechanism Shift in systems too! • alternative non-causal models (e.g., regression-based models for performance tasks) are not transferable as they rely on the i.i.d. setting.
  67. The new version of CADET, called UNICORN, was accepted at

    EuroSys 2022. https://github.com/softsys4ai/UNICORN
  68. Outline: Motivation; Causal AI for Systems; UNICORN Results;

    Causal AI for Autonomy and Robotics; Autonomy Evaluation at JPL
  69. Results: Case Study. [The TX1 → TX2 CUDA forum issue from the motivating example:

    the user is transferring the code from one hardware to another; the target hardware is faster than the source hardware; the user expects the code to run at least 30-40% faster; yet the code ran 2x slower on the more powerful hardware.]
  70. Results: Case Study. Nvidia TX1: CPU 4

    cores, 1.3 GHz; GPU 128 cores, 0.9 GHz; memory 4 GB, 25 GB/s. Nvidia TX2 (more powerful): CPU 6 cores, 2 GHz; GPU 256 cores, 1.3 GHz; memory 8 GB, 58 GB/s. The embedded real-time stereo estimation source code runs at 17 FPS on TX1 but only 4 FPS on TX2: 4x slower!
  71. Results: Case Study. [Table of configuration options changed by UNICORN, Decision Tree, and the Forum

    recommendation: all three change CPU Cores, CPU Freq., EMC Freq., GPU Freq., and CUDA_STATIC_RT; the remaining options (Sched. Policy, Sched. Runtime, Sched. Child Proc, Dirty Bg. Ratio, Drop Caches, Swap Memory) are each changed by only one of the approaches.] Throughput on TX2: UNICORN 26 FPS, Decision Tree 20 FPS, Forum 23 FPS. Throughput gain over TX1: 53%, 21%, and 39%, respectively (the user expected a 30-40% gain). Time to resolve: 24 min, 3.5 hrs, and 2 days, respectively. UNICORN finds the root causes accurately, makes no unnecessary changes, achieves better improvements than the forum's recommendation, and is much faster.
  72. Evaluation: Experimental Setup. Hardware systems: Nvidia TX1 (CPU 4 cores, 1.3 GHz;

    GPU 128 cores, 0.9 GHz; memory 4 GB, 25 GB/s), Nvidia TX2 (CPU 6 cores, 2 GHz; GPU 256 cores, 1.3 GHz; memory 8 GB, 58 GB/s), Nvidia Xavier (CPU 8 cores, 2.26 GHz; GPU 512 cores, 1.3 GHz; memory 32 GB, 137 GB/s). Software systems: Xception (image recognition, 50,000 test images), DeepSpeech (voice recognition, 5 sec audio clip), BERT (sentiment analysis, 10,000 IMDb reviews), x264 (video encoder, 11 MB 1080p video). Configuration space: 30 configuration options (10 software, 10 OS/kernel, 10 hardware) and 17 system events.
  73. Evaluation: Data Collection • For each software/hardware combination, create a

    benchmark dataset ◦ exhaustively set each configuration option to all permitted values ◦ for continuous options (e.g., GPU memory), sample 10 equally spaced values between [min, max] • Measure the latency, energy consumption, and heat dissipation ◦ repeat 5x and average. [The datasets contain latency faults, energy faults, and multiple (multi-objective) faults.]
  74. Evaluation: Ground Truth • For each performance fault: ◦ manually

    investigate the root cause ◦ "fix" the misconfigurations • A "fix" implies the configuration no longer has tail performance ◦ against a user-defined benchmark (i.e., 10th percentile) ◦ or some QoS/SLA benchmark • Record the configurations that were changed. [Again covering latency, energy, and multiple faults.]
  75. Results: Research Questions. RQ1: How does UNICORN perform compared to model-based

    diagnostics? RQ2: How does UNICORN perform compared to search-based optimization?
  76. Results: Research Question 1 (single objective). RQ1: How does

    UNICORN perform compared to model-based diagnostics? Takeaways: finds the root causes accurately (more accurate than ML-based methods), better gain, and up to 20x faster.
  77. Results: Research Question 1 (multi-objective). RQ1: How does UNICORN

    perform compared to model-based diagnostics? For multiple faults in latency and energy usage, UNICORN fixes faults with no deterioration of the other performance objectives.
  78. Results: Research Questions. RQ2: How does UNICORN perform compared to

    search-based optimization?
  79. Results: Research Question 2. RQ2: How does UNICORN perform compared

    to search-based optimization? Takeaway: better results, with no deterioration of the other performance objectives.
  80. Results: Research Question 2 (continued). RQ2: How does UNICORN perform

    compared to search-based optimization? Takeaway: considerably faster than search-based optimization.
  81. Summary: Causal AI for Systems. 1. Learning a functional causal

    model for different downstream systems tasks. 2. The learned causal model is transferable across different environments. [UNICORN pipeline diagram as before: specify the performance query, learn the causal performance model, translate it to causal queries, estimate them, and update the model until the budget is exhausted.]
  82. Artificial Intelligence and Systems Laboratory (AISys Lab): Machine

    Learning, Computer Systems, Autonomy, AI/ML Systems. https://pooyanjamshidi.github.io/AISys/ Members: Ying Meng (PhD student), Shuge Lei (PhD student), Kimia Noorbakhsh (undergrad), Shahriar Iqbal (PhD student), Jianhai Su (PhD student), M.A. Javidian (postdoc), Fatemeh Ghofrani (PhD student), Abir Hossen (PhD student), Hamed Damirchi (PhD student), Mahdi Sharifi (PhD student), Lane Stanley (intern), Sonam Kharde (postdoc). Sponsors, thanks!
  83. Outline: Motivation; Causal AI for Systems; UNICORN Results;

    Causal AI for Autonomy and Robotics; Autonomy Evaluation at JPL
  84. RASPBERRY SI: Resource Adaptive Software Purpose-Built for Extraordinary Robotic

    Research Yields - Science Instruments. Autonomous Robotics Research for Ocean Worlds (ARROW). Team: Pooyan Jamshidi (UofSC, PI), David Garlan (CMU, Co-I), Bradley Schmerl (CMU, Co-I), Matt DeMinico (NASA, Co-I), Javier Camara (York, UK, Collaborator), Ellen Czaplinski (NASA JPL, Consultant), Katherine Dzurilla (UArk, Consultant), Jianhai Su (UofSC, Graduate Student), Abir Hossen (UofSC, Graduate Student), Sonam Kharde (UofSC, Postdoc).
  85. AISR: Autonomous Robotics Research for Ocean Worlds (ARROW). Autonomy team (develops quantitative planning,

    transfer and online learning, and Causal AI): Pooyan Jamshidi (UofSC, PI), David Garlan (CMU, Co-I), Bradley Schmerl (CMU, Co-I), Matt DeMinico (NASA, Co-I), Javier Camara (York, Collaborator), Jianhai Su and Abir Hossen (UofSC, Graduate Students), Sonam Kharde (UofSC, Postdoc), Ellen Czaplinski and Katherine Dzurilla (Arkansas, Consultants). Testbed teams (develop and maintain the physical and virtual testbeds on which the autonomy is evaluated): Hari D. Nayar (Team Lead), K. Michael Dalal (Team Lead), Anna E. Boettcher (Robotics System Engineer), Jacek Sawoniewicz (Robotics System Engineer), Christopher Lim (Robotics Software Engineer), Ashish Goel (Research Technologist), Chetan Kulkarni (Prognostics Researcher), Ussama Naal, Lanssie Ma, Anjan Chakrabarty, Thomas Stucky, and Terence Welsh (Software Engineers). Program Manager: Carolyn R. Mercer.
  86. [image-only slide]

  87. Autonomy Module: Evaluation. Design: • MAPE-K loop based design

    (Monitor, Analyze, Plan, Execute over shared Knowledge) • machine-learning-driven quantitative planning and adaptation. Evaluation: • two testbeds with different fidelities and simulation flexibilities, the physical testbed OWLAT (NASA/JPL) and the virtual testbed OceanWATERS (NASA/ARC), each exercising the system under test (NASA lander) with the autonomy module.
  88. Learning in Simulation for Transfer Learning to the Physical Testbed (Sim2Real

    Transfer). [Simulation environment OWLAT-sim → physical testbed OWLAT, exploiting causal invariances.]
  89. Causal AI for Autonomous Robot Testing • Testing cyber-physical systems

    such as robots is complicated; the key reason is that there are additional interactions with the environment and the task that the robot is performing. • Evaluating our Causal AI for Systems methodology with autonomous robots provides the following opportunities: 1. identifying difficult-to-catch bugs in robots; 2. identifying the root cause of an observed fault and repairing the issue automatically during mission time.
  90. Outline: Motivation; Causal AI for Systems; UNICORN Results;

    Causal AI for Autonomy and Robotics; Autonomy Evaluation at JPL
  91. Lessons Learned • Open Science, Open Source, Open Data, and

    Open Collaborations • Diverse Team, Diverse Background, Diverse Expertise • Close Collaborations with the JPL and Ames teams • Evaluation in Real Environment Project Website: https://nasa-raspberry-si.github.io/raspberry-si
  92. Lessons Learned • In the simulation, we can debug/develop/test our

    implementation without worrying about damaging the hardware. • High bandwidth and close interaction between the testbed provider (JPL Team) and the autonomy team (RASPBERRY-SI) • Faster identification of the issues • Resolving the issues a lot faster • Getting help for development
  93. Lessons Learned • Importance of risk reduction phases • Integration

    testing • The interface and capability of the testbeds will evolve, and the autonomy needs to be designed at the same time • Account for the different capabilities of the simulation and the physical testbed • Rigorous testing, remotely and in interaction with the testbed providers • The interaction is beneficial for the autonomy providers as well as the testbed providers
  94. Incremental Integration Testing. Components: A = Model Learning, B = Transfer Learning,

    C = Model Compression, D = Online Learning, E = Quantitative Planning. Component tests are followed by integration tests: Case 1 (Baseline): A + E; Case 2 (Transfer): A + B + E; Case 3 (Compress): A + B + C + E; Case 4 (Online): A + B + C + D + E. Expected performance: Case 1 < Case 2 < Case 3 < Case 4. OWLAT Code: https://github.com/nasa/ow_simulator Physical Autonomy Testbed: https://www1.grc.nasa.gov/wp-content/uploads/2020_ASCE_OWLAT_20191028.pdf
  95. Real-World Experiments using OWLAT • Models learned from simulation •

    Adaptive system (learning + planning) • Sets of tests. [Architecture: the adaptive system uses the machine learning models in the mission environment, with continual learning refining the models; logs and mission reports flow to a local machine and cloud storage.]
  96. Test Coverage • Mission types: landing and scientific explorations (e.g.,

    sampling) • Mission difficulty: rough regions for landing; number of locations where a sample needs to be fetched • Unexpected events: changes in the environment (e.g., uneven terrain and weather); changes to the lander capabilities (e.g., deploying new sensors); faults (power, instruments, etc.)
  97. Infrastructure for Automated Evaluation. [Pipeline: a test generator produces test and mission

    configurations; the test harness drives the testbed (environment and lander simulation) and the autonomy module (adapter interface, learning and planning, plan executive) over a communication layer; monitoring and logging collect logs; log analysis produces the evaluation report.]