the type of signal? How do we manage offline modeling? How do we manage online fault detection? How do we reconcile the two so users don’t hate us? How do we solve this in a big-data context?
Humans identify a variety of categories. We devise a set of ad-hoc features. A Bayesian model maps features to categories. Humans test the results. https://www.flickr.com/photos/chrisyarzab/5827332576
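A minimal sketch of the middle two steps, assuming the ad-hoc features are simple window statistics and a Gaussian naive Bayes classifier stands in for the Bayesian model; the feature names, values, and category labels below are illustrative placeholders, not the actual feature set:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Ad-hoc features per signal window: [mean, variance, max jump].
# Both the features and the category labels are placeholders.
features = np.array([
    [0.1, 0.01, 0.2],
    [0.1, 0.02, 0.1],
    [0.9, 0.50, 3.0],
    [1.1, 0.60, 2.5],
])
categories = ["flat", "flat", "spiky", "spiky"]  # human-identified labels

model = GaussianNB().fit(features, categories)   # features -> categories
print(model.predict([[1.0, 0.55, 2.8]]))         # ['spiky']
print(model.predict_proba([[1.0, 0.55, 2.8]]))   # posterior per category
```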
really useful when p-values are low. p ≤ 0.01 : very strong presumption against the null hyp. 0.01 < p ≤ 0.05 : strong presumption against the null hyp. 0.05 < p ≤ 0.1 : low presumption against the null hyp. p > 0.1 : no presumption against the null hyp. from xkcd #882 by Randall Munroe
a certain statistical model: how few points must we see before we are sufficiently confident that the data does not fit the model (a presumption against the null hypothesis)? With only a few, we simply have outliers or insignificant aberrations. http://www.flickr.com/photos/rooreynolds/
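To make the question concrete, a sketch that feeds points in one at a time and applies the presumption thresholds above; a one-sample t-test against a hypothesized mean stands in for the real model fit, and all values here are made up. (Repeatedly peeking at a p-value like this inflates the false-alarm rate, so a production detector needs a sequential-testing correction.)

```python
from scipy import stats

hypothesized_mean = 0.0                  # what "the model" predicts (assumed)
stream = [0.8, 1.1, 0.9, 1.3, 0.7, 1.0]  # incoming measurements (made up)

seen = []
for x in stream:
    seen.append(x)
    if len(seen) < 3:
        continue                         # too few points: just outliers
    p = stats.ttest_1samp(seen, hypothesized_mean).pvalue
    if p <= 0.01:
        print(f"n={len(seen)}: p={p:.4f} -> very strong presumption against fit")
        break
    print(f"n={len(seen)}: p={p:.4f} -> keep watching")
```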
Calculating means is “easy.” There are some pitfalls. What do you mean that my mean is mean? Why can’t math be nice to people? https://www.flickr.com/photos/imagesbywestfall/3606314694
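One concrete pitfall, sketched below: the naive sum-then-divide mean loses precision on long streams of large, nearly equal values, while an incremental (Welford-style) update does not:

```python
import math

def running_mean(values):
    """Incremental (Welford-style) mean: mean += (x - mean) / n."""
    mean = 0.0
    for n, x in enumerate(values, start=1):
        mean += (x - mean) / n           # no huge intermediate sum
    return mean

data = [1e9 + 0.1] * 1_000_000           # large, nearly equal samples
print(sum(data) / len(data))              # naive: typically drifts in the last digits
print(running_mean(data))                 # incremental: stays at 1000000000.1
print(math.fsum(data) / len(data))        # exact reference
```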
needs to adapt. Exponentially decaying averages are quite common in online control systems and are used as a basis for creating control charts. Sliding windows are a bit more expensive.
EWM tracking with smoothing factor α, seeded S_1 = V_1 ❖ S_T = αV_T + (1 − α)S_{T−1} ❖ Low computational overhead ❖ Low memory usage ❖ Hard to repeat offline ❖ SWM tracking the last N values, S_T ❖ S_T = S_{T−1} − V_{T−N}/N + V_T/N ❖ Low computational overhead ❖ High memory usage ❖ Easy to repeat offline
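A sketch of both update rules exactly as written above; alpha and N are whatever the control system calls for:

```python
from collections import deque

class EWM:
    """Exponentially weighted mean: S_T = a*V_T + (1 - a)*S_(T-1), S_1 = V_1."""
    def __init__(self, alpha):
        self.alpha, self.s = alpha, None

    def update(self, v):
        # O(1) time and memory, but replaying it offline requires the
        # exact same seed and every value since.
        self.s = v if self.s is None else self.alpha * v + (1 - self.alpha) * self.s
        return self.s

class SWM:
    """Sliding-window mean: S_T = S_(T-1) - V_(T-N)/N + V_T/N."""
    def __init__(self, n):
        self.n, self.window, self.s = n, deque(), 0.0

    def update(self, v):
        # O(1) time but O(N) memory; trivially repeatable offline from the
        # stored window. During warm-up S is a partial sum divided by N.
        if len(self.window) == self.n:
            self.s -= self.window.popleft() / self.n
        self.window.append(v)
        self.s += v / self.n
        return self.s
```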
online algorithms to match our offline algorithms. This is because human beings get pissed off when they can’t repeat outcomes that woke them up in the middle of the night. EWM: not repeatable offline. SWM: expensive in online applications.
issues. It’s challenging when signals are noisy or arrive at a variable rate. We’re looking into the Tukey test: • compares all possible pairs of means • the test is conservative in light of uneven sample sizes https://www.flickr.com/photos/st3f4n/4272645780
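A sketch of what that comparison looks like in practice, using the Tukey HSD implementation from statsmodels (our choice here, not necessarily the talk’s tooling); the group names and data are made up, with deliberately uneven sample sizes:

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(7)
# Latency-like samples from three sources, deliberately uneven sizes.
values = np.concatenate([
    rng.normal(10.0, 1.0, 30),   # node-a
    rng.normal(10.2, 1.0, 50),   # node-b
    rng.normal(13.0, 1.0, 20),   # node-c: shifted mean
])
groups = ["node-a"] * 30 + ["node-b"] * 50 + ["node-c"] * 20

# Compares every pair of group means; conservative under unequal n.
print(pairwise_tukeyhsd(values, groups, alpha=0.05))
```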
has never been evenly distributed. All this “deviation from the mean” assumes some symmetric distribution. All that work and you tell me *now* that I don’t have a normal distribution? Statistics suck. https://www.flickr.com/photos/imagesbywestfall/3606314694
This is 10 billion to 1 trillion measurements per second. At least a million independent models. We need to cheat. https://www.flickr.com/photos/thost/319978448