A video conversation with performance and capacity management veteran Boris Zibitsker about how I saved a multimillion-dollar computing platform using a one-line performance model (at 21:50 in the video). The problem was caused by "best practices."
the BEZNext Channel
Dr. Neil J. Gunther (@DrQz)
Performance Dynamics
August 2, 2017
© 2018 Performance Dynamics. WTF is “Modeling”, Anyway!? August 2, 2017
English and technology:
• UML software modeling
• model train set
• Kim Kardashian
• financial/accounting models
• Amdahl’s law
• statistical regression
• numerical mesh simulation
• benchmark workload simulation
• support vector machines
• convolutional neural nets
We need to specify clearly and unambiguously which model
mathematical framework used to assess the validity of performance data (an overlooked necessity)
data + model = information
1. Select performance metrics as inputs: λ, R, S, Q, ...
2. Model is a relationship between those metrics: Q = λR
3. Model outputs are calculated metrics
4. Compare calculated metrics with (other) measured metrics
5. Repeat until satisfied
Can then project metric values into circumstances that are not measured or not measurable
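The steps above can be sketched with the relationship Q = λR from step 2 (Little's law). This is a minimal illustration, not the author's tooling: the arrival rate, residence time, and measured queue length below are made-up numbers, and the helper names are hypothetical.

```python
def littles_law_queue(arrival_rate, residence_time):
    """Step 2: the model, Q = lambda * R."""
    return arrival_rate * residence_time


def relative_error(measured, calculated):
    """Step 4: compare a calculated metric with an independently measured one."""
    return (measured - calculated) / measured


# Step 1: measured inputs (hypothetical values, for illustration only).
lam = 50.0         # arrival rate, requests per second
R = 0.2            # mean residence time, seconds
Q_measured = 9.6   # mean number of requests in the system, measured separately

# Step 3: the model output is a calculated metric.
Q_model = littles_law_queue(lam, R)

# Step 4: a large discrepancy would suggest that the data (or the model) is suspect.
err = relative_error(Q_measured, Q_model)
print(f"Q model = {Q_model:.2f}, Q measured = {Q_measured:.2f}, error = {err:.1%}")
```

If the error stays small across several measurement intervals, the data and model are consistent with each other; if not, step 5 applies and either the measurements or the choice of model needs another look.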
[Figure: platform architecture diagram. Labels: tape silos; IBM AIX/SP-2, 50 nodes (appears twice); SP2 (appears twice); FDDI rings; User Tek X-terminals.]
Remote server   Files   Time (s)
Xfs1            8371    18.57
Xfs2            7113    16.72
NFS1            4781    17.01
NFS2             109     9.41

Observation:
109 files is nearly $128 = 2^{7}$; $\log_2(128) = 7$ is close to 9 seconds
4781 is near $4096 = 2^{12}$; $\log_2(4096) = 12$ is close to 17 seconds
8371 is near $8192 = 2^{13}$; $\log_2(8192) = 13$ is close to 18 seconds
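The observation can be checked without rounding to the nearest power of two. The sketch below takes the file counts and times from the table and computes the ratio of time to log2(files) for each server; the variable names are illustrative only. A roughly constant ratio is what a logarithmic relationship would look like.

```python
import math

# (file count, mean time in seconds) per remote server, from the table.
servers = {
    "Xfs1": (8371, 18.57),
    "Xfs2": (7113, 16.72),
    "NFS1": (4781, 17.01),
    "NFS2": (109, 9.41),
}

ratios = {}
for name, (files, seconds) in servers.items():
    log2_files = math.log2(files)
    # If time grows like log2(files), this ratio stays roughly the same.
    ratios[name] = seconds / log2_files
    print(f"{name}: log2({files}) = {log2_files:.2f}, time / log2 = {ratios[name]:.2f}")
```

For these four servers the ratio stays between about 1.3 and 1.45 even though the file counts span nearly two orders of magnitude.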
$\log_{10}(N)$, where $N$ is the number of remote-server files and $k = 4.57$ is the proportionality constant for base-10 logarithms

Table: Log model of mean R times
Remote server   Measured seconds   Log model    %Error
Xfs1            18.57              17.929948     3.446698
Xfs2            16.72              17.606686    -5.303144
NFS1            17.01              16.818079     1.128281
NFS2             9.41               9.312522     1.035894

Model is accurate to within about 5% (largest error: -5.3%, for Xfs2)
But where does logarithmic behavior come from?
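The table can be reproduced in a few lines. This sketch assumes the model has the form R = k log10(N), which is what the definitions of N and k suggest, and takes the file counts from the earlier measurements slide. It uses the rounded k = 4.57, so the model column differs from the table in the third decimal place; the table was presumably computed with an unrounded constant. Function names are hypothetical.

```python
import math

K = 4.57  # proportionality constant for base-10 logarithms, as quoted on the slide

# (file count N, measured mean time in seconds) per remote server.
measurements = {
    "Xfs1": (8371, 18.57),
    "Xfs2": (7113, 16.72),
    "NFS1": (4781, 17.01),
    "NFS2": (109, 9.41),
}


def log_model(files, k=K):
    """Assumed model form: R = k * log10(N)."""
    return k * math.log10(files)


def percent_error(measured, modeled):
    """Signed error relative to the measurement, in percent."""
    return 100.0 * (measured - modeled) / measured


rows = []
for name, (files, measured) in measurements.items():
    modeled = log_model(files)
    rows.append((name, measured, modeled, percent_error(measured, modeled)))

for name, measured, modeled, err in rows:
    print(f"{name:5s} {measured:6.2f} {modeled:9.4f} {err:8.3f}")

worst = max(abs(err) for _, _, _, err in rows)
print(f"largest absolute error: {worst:.2f}%")
```

The sign convention here (measured minus modeled, divided by measured) matches the %Error column: a negative value means the model overestimates the measured time.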
Get a Log, You Need a Tree
[Figure: a tree drawn level by level, with a table of level versus number of nodes: level 0 has 1, level 1 has 10, level 2 has 100. The level is the logarithm of the number.]
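The link between trees and logarithms can be made concrete by counting how many levels a tree needs before one level can hold a given number of items. This is an illustrative sketch only: the fan-out of 10 is read off the 1, 10, 100 progression in the figure, and `levels_needed` is a hypothetical helper, not anything from the system described in the talk.

```python
def levels_needed(n_items, fanout=10):
    """Smallest level d such that fanout**d >= n_items.

    Uses integer arithmetic only, so there is no floating-point rounding.
    """
    if n_items < 1 or fanout < 2:
        raise ValueError("need n_items >= 1 and fanout >= 2")
    level, capacity = 0, 1
    while capacity < n_items:
        capacity *= fanout
        level += 1
    return level


for n in (1, 10, 100, 1000):
    print(f"{n:5d} items -> level {levels_needed(n)}")
```

Each tenfold increase in the number of items adds just one level, so any cost that is proportional to the number of levels traversed grows as the logarithm of the item count.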
replacement would NOT have solved anything
3. Problem caused by “best practices” for system management
4. Performance management was completely overlooked
5. Font server held ∼15000 files but only ∼1000 needed
6. Simple log performance model told the whole story
7. Simple fix with no CapEx cost: prune the tree!
8. 300% performance win in shortened launch times!
9. Log model more about explanation than prediction/forecasting