do I choose the threshold?
$P(y \mid x) < \text{threshold} \Rightarrow$ anomaly
Find your tolerance for false positives, e.g. 100 false positives is equivalent to stopping one fraud.

How do I calculate $P(y \mid x)$?
Density estimation.
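A minimal sketch of one way to turn a false-positive tolerance into a threshold: if you can afford to review a fixed number of alerts, flag that many of the lowest-density observations. The names `scores` and `max_alerts` and the simulated values are hypothetical, not taken from the talk.

```r
# Hypothetical density scores for 1000 observations (stand-in for P(y | x))
set.seed(1)
scores <- rexp(1000)

# Assumed tolerance: at most 10 flagged observations can be reviewed
max_alerts <- 10

# Threshold = the max_alerts-th smallest score
threshold <- sort(scores)[max_alerts]

# Flag everything at or below the threshold (<= so the boundary point is included)
is_anomaly <- scores <= threshold

sum(is_anomaly)
```

In practice the budget would come from the cost trade-off on the slide (how many false alarms one caught fraud is worth) rather than a fixed count.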
of a small number of Gaussian distributions:

$$P(y) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{N}(\mu_i, \sigma_i^2)$$

• $n$: number of Gaussian distributions to use
• $\mathcal{N}$: Gaussian distribution
• $\mu_i$: mean of the $i$th Gaussian distribution
• $\sigma_i^2$: variance of the $i$th Gaussian distribution

How do I find $\mu_i$ and $\sigma_i^2$? Maximum likelihood optimization.

How do I choose $n$? Fit models for several values of $n$ and choose the one with the best BIC.
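The two questions above can be written out as follows. This is a sketch using the equal-weight mixture from the slide. The symbols $m$ (number of observations), $y_j$ (the $j$th observation) and $k$ (number of free parameters) are notation introduced here, not from the talk.

```latex
% Log-likelihood of m observations y_1, ..., y_m under the mixture
\ell(\mu, \sigma^2) = \sum_{j=1}^{m} \log P(y_j)
  = \sum_{j=1}^{m} \log\!\left( \frac{1}{n} \sum_{i=1}^{n}
      \mathcal{N}(y_j \mid \mu_i, \sigma_i^2) \right)

% Maximum likelihood estimate of the means and variances
(\hat{\mu}, \hat{\sigma}^2) = \arg\max_{\mu, \sigma^2} \ell(\mu, \sigma^2)

% BIC with k free parameters (common convention: lower is better)
\mathrm{BIC} = k \ln m - 2\, \ell(\hat{\mu}, \hat{\sigma}^2)
```

Note that mclust reports BIC with the opposite sign, $2\,\ell - k \ln m$, so in its output the best model is the one with the largest value. Mclust also estimates a mixing proportion per component rather than fixing every weight at $1/n$.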
# Load package and data (the data-loading step is not shown here)
library(mclust)
# Just two dimensions for now; try anywhere from 1 to 6 Gaussians in the mixture
gmm = Mclust(df[c("temperature", "humidity")], G=seq(1,6))
plot(gmm, what="classification")
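To see which mixture the BIC search settled on, the fitted object can be inspected directly. The sketch below uses simulated data as a stand-in for the weather data, so the values are hypothetical; only the `Mclust` call mirrors the slide.

```r
library(mclust)

# Simulated stand-in for the temperature/humidity data (hypothetical values)
set.seed(42)
df <- data.frame(
  temperature = c(rnorm(150, 10, 3), rnorm(150, 25, 3)),
  humidity    = c(rnorm(150, 70, 5), rnorm(150, 40, 5))
)

# Fit mixtures with 1 to 6 components; Mclust keeps the one with the best BIC
gmm <- Mclust(df[c("temperature", "humidity")], G = 1:6)

gmm$G          # number of components selected
gmm$modelName  # covariance structure selected
gmm$bic        # BIC of the selected model (mclust convention: larger is better)

# BIC for every combination of component count and covariance structure tried
plot(gmm, what = "BIC")
```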
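library(dplyr)
# Score each row by its estimated density and list the lowest scores first
# (assumes gmm carries a density component, as fits from densityMclust() do)
df %>% mutate(score = gmm$density) %>% arrange(score) %>% head(3)

City      Temperature  Humidity
New York            5        20
New York          -12        39
New York           29        27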
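If the fitted object does not carry a `density` component, the same scores can be computed explicitly. The sketch below uses mclust's `dens()` on simulated data; the data values and the appended extreme row are hypothetical and only illustrate the mechanics.

```r
library(mclust)

# Simulated stand-in data with one extreme row appended (hypothetical values)
set.seed(42)
df <- data.frame(
  temperature = c(rnorm(299, 15, 4), 60),
  humidity    = c(rnorm(299, 50, 8), 50)
)

gmm <- Mclust(df, G = 1:6)

# Evaluate the fitted mixture density at every training row
score <- dens(data = gmm$data,
              modelName = gmm$modelName,
              parameters = gmm$parameters)

# Attach the scores and show the three lowest-density rows
df$score <- score
head(df[order(df$score), ], 3)
```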
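library(tidyr)   # nest()
library(purrr)   # map()
# Wrap training in a function (the definition of train_gmm is not shown in this excerpt)
# Train one GMM per city
gmms = df %>% nest(-city) %>% mutate(gmm = map(data, train_gmm))
map(gmms$gmm, plot, what="density")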
# Pull the density scores out of each city's model and list the lowest first
gmms %>% mutate(score = map(gmm, "density")) %>% select(-gmm) %>% unnest() %>% arrange(score) %>% head(3)

City    Temperature  Humidity
Paris            -6        51
Paris            -6        51
London           29        40
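The per-city workflow can be sketched end to end as follows. This is an illustration under stated assumptions, not the talk's code: `fit_city_gmm` and `score_gmm` are hypothetical helpers (the original `train_gmm` is not shown), the data are simulated, and the `nest(data = -city)` and `unnest(c(data, score))` forms are the ones current tidyr versions expect.

```r
library(mclust)
library(dplyr)
library(tidyr)
library(purrr)

# Simulated stand-in data: two cities with different climates (hypothetical values)
set.seed(7)
df <- bind_rows(
  tibble(city = "London",   temperature = rnorm(200, 12, 4), humidity = rnorm(200, 75, 8)),
  tibble(city = "New York", temperature = rnorm(200, 15, 9), humidity = rnorm(200, 60, 12))
)

# Hypothetical helper: fit a GMM to one city's numeric columns
fit_city_gmm <- function(data) Mclust(data, G = 1:6, verbose = FALSE)

# Hypothetical helper: evaluate a fitted mixture's density at its training rows
score_gmm <- function(fit) {
  dens(data = fit$data, modelName = fit$modelName, parameters = fit$parameters)
}

scored <- df %>%
  nest(data = -city) %>%                 # one row per city, observations in a list column
  mutate(gmm = map(data, fit_city_gmm),  # one model per city
         score = map(gmm, score_gmm)) %>%
  select(-gmm) %>%
  unnest(c(data, score)) %>%             # back to one row per observation
  arrange(score)

head(scored, 3)
```

Because each city is scored against its own model, a temperature that is ordinary in one city can still rank as an anomaly in another.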
# Same scoring step, with wind speed and pressure included in the models
gmms %>% mutate(score = map(gmm, "density")) %>% select(-gmm) %>% unnest() %>% arrange(score) %>% head(3)

City      Temperature  Humidity  Wind Speed  Pressure  Note
New York           23        49         159      1015  Hurricane winds
London             29        40          16      1012  Insanely hot for London
New York           14        80          30       980  Crazy low pressure
data. Anomalies are then low-probability observations
• GMMs provide a quick-and-easy way to estimate probability densities
• Mclust is an awesome GMM package for R
• Anomalies are highly dependent on context (here, the city) and on the variables included in the detector (here, temperature, humidity, wind speed, pressure)

Semi-supervised learning