Semi-Supervised Anomaly Detection

Semi-Supervised Anomaly Detection: Use cases, theory and hands-on

Dataiku: The Numbers (pictogram slide; the counts 50, 1 and 48 survive, but their labels are not recoverable)

Me: The Numbers (pictogram slide: 1/2 + 1/2 = !)
> df = data_scientists %>% inner_join(r_users) %>% filter(speaker_at == "meetup")
> df$name
[1] "Eric Kramer"

Agenda (30 minutes)
• Introduction: Definition, Use Cases, Formalization
• Building an anomaly detector in R: Theory, R code, Result (iterate 3 times, increasing complexity each time)
• Summary: Improvements, Concerns
• Questions

Code: Everything is on GitHub: https://github.com/erickramer/anomaly_detection_demo

Introduction

What is an anomaly?
Anomalies are "unexpected" observations.
• Global anomalies are unexpected based on the entire dataset
• Local anomalies are unexpected given their context

Use Case: Bank Fraud

Use Case: EEG
Is this an anomaly?

Key Principles
• Anomalies are rare
• Anomalies have little in common with each other

Formalization
An observation is an anomaly if $P(y \mid x) < \rho$, where
• $y$ represents the observation
• $x$ represents the context
• $\rho$ is some arbitrary threshold

Formalization
$P(y \mid x) < \rho$, with $y$ labeled "Observation" and $x$ labeled "Context"

Use Case: Anomalous Weather
$P(y \mid x) < \rho$
Current weather ($y$): • Temperature • Humidity • Pressure
Current location ($x$): • Lat, Long • Altitude • Date • Time

Use Case: Bank Fraud
$P(y \mid x) < \rho$
Financial transaction ($y$): • Origin • Destination • Amount • Medium
Account history ($x$): • Past transactions • Account address • Account balance • Account flux

Use Case: EEG
$P(y \mid x) < \rho$
Observation ($y$): EEG reading
Patient history ($x$): • Diagnoses • Interventions / Surgeries • Current medications

Questions
$P(y \mid x) < \rho$
• How do I choose $\rho$? Find your tolerance for false positives, e.g. 100 false positives is equivalent to stopping one fraud.
• How do I calculate $P(y \mid x)$? Density estimation.

Density Estimation
We're going to use Gaussian Mixture Models.
Alternatives: • Kernel density estimators • Histograms • K-means • Bayesian methods

Our data: Weather in Paris, London and NYC for 2010-2015

Goals: Weather in Paris, London and NYC for 2010-2015
• Can we find days with anomalous weather?
• Can we control for the location of the measurement?
• Can we control for the date?

Gaussian Mixture Models

Theory: Gaussian Mixture Models
The probability density is the sum of a small number of Gaussian distributions (figure: two Gaussian curves added together to form the mixture density)

Theory: Gaussian Mixture Models
The probability density is the sum of a small number of Gaussian distributions:
$$P(y) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{N}(\mu_i, \sigma_i^2)$$
where
• $n$ is the number of Gaussian distributions to use
• $\mathcal{N}$ is a Gaussian distribution
• $\mu_i$ is the mean of the $i$th Gaussian distribution
• $\sigma_i^2$ is the variance of the $i$th Gaussian distribution
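To make the formula concrete, here is a minimal base R sketch that evaluates an equal-weight mixture of univariate Gaussians exactly as written on the slide. The function name `mixture_density` and the example means and standard deviations are hypothetical, chosen only for illustration. Note that mclust, used later in the deck, also estimates a mixing proportion per component rather than fixing every weight at 1/n.

```r
# Density of an equal-weight mixture of n univariate Gaussians,
# following the slide's formula P(y) = (1/n) * sum_i N(y; mu_i, sigma_i^2).
# `sigma` holds standard deviations (the square roots of the variances).
mixture_density <- function(y, mu, sigma) {
  stopifnot(length(mu) == length(sigma))
  # One column of component densities per Gaussian
  comps <- sapply(seq_along(mu), function(i) dnorm(y, mean = mu[i], sd = sigma[i]))
  # Force a length(y) x n matrix (sapply drops dimensions when y has length 1),
  # then average across components
  rowMeans(matrix(comps, nrow = length(y)))
}

# Hypothetical two-component example
mu    <- c(0, 10)
sigma <- c(1, 2)
mixture_density(c(0, 5, 10), mu, sigma)
```

A point between the two modes (here y = 5) gets a lower density than points at either mean, which is the property the anomaly score relies on.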

Questions: Gaussian Mixture Models
$$P(y) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{N}(\mu_i, \sigma_i^2)$$
• How do I find $\mu_i$ and $\sigma_i^2$? Maximum likelihood optimization.
• How do I choose $n$? Fit models for several $n$ and choose the one with the best BIC.
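The BIC-based choice of the number of components can be sketched with mclust as follows. The data frame `x` is a synthetic stand-in (hypothetical values, not the deck's weather data). One detail worth knowing: mclust reports BIC as 2 * log-likelihood minus a complexity penalty, so within mclust a larger BIC is better.

```r
library(mclust)

set.seed(1)
# Synthetic stand-in for the weather data (hypothetical values)
x <- data.frame(
  temperature = c(rnorm(200, 5, 3),  rnorm(200, 22, 4)),
  humidity    = c(rnorm(200, 80, 5), rnorm(200, 50, 8))
)

# BIC for every covariance structure and for 1 to 6 components
bic <- mclustBIC(x, G = 1:6)
summary(bic)       # top models ranked by BIC

# Mclust() runs the same search and keeps the best-scoring model
gmm <- Mclust(x, G = 1:6)
gmm$G              # number of components chosen
gmm$modelName      # covariance structure chosen
```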

Our first model

Fitting a GMM: Using the mclust package
# Load package and data
library(mclust)
load("./data/weather_data.Rdata")
# Just two dimensions for now; try anywhere from 1 to 6 Gaussians in the mixture
gmm = Mclust(df[c("temperature", "humidity")], G=seq(1,6))
plot(gmm, what="classification")

Fitting a GMM Using the mclust package

Getting a density from a GMM Using the mclust package
Mclust(…) => densityMclust(…)
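As a rough illustration of that switch, the sketch below fits `densityMclust()` on synthetic data (hypothetical values standing in for the weather data) and reads densities both for the training rows and for new observations. It assumes the `density` element of the fitted object and the `predict()` method for `densityMclust` objects behave as in recent mclust releases.

```r
library(mclust)

set.seed(42)
# Synthetic stand-in for the weather data (hypothetical values)
x <- data.frame(
  temperature = rnorm(300, mean = 15, sd = 6),
  humidity    = rnorm(300, mean = 65, sd = 10)
)

# Same interface as Mclust(), but the result carries density estimates
dens <- densityMclust(x, G = 1:6)

# Estimated density at each training observation
head(dens$density)

# Estimated density at new observations: a typical day and an extreme one
new_days <- data.frame(temperature = c(15, 45), humidity = c(65, 5))
predict(dens, newdata = new_days)
```

The extreme day should receive a far lower density than the typical one, which is what makes the density usable as an anomaly score.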

What are the most anomalous days? Using the mclust package
library(dplyr)
df %>% mutate(score = gmm$density) %>% arrange(score) %>% head(3)

City      Temperature  Humidity
New York  5            20
New York  -12          39
New York  29           27

Choosing a Threshold Using the mclust package
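One simple way to turn a false positive tolerance into a threshold is to take a low quantile of the density scores on data believed to be normal. The sketch below is an assumption about how this could be done, not necessarily the deck's method: the scores are simulated, and the 1% tolerance (`fp_rate`) is an arbitrary illustrative value.

```r
set.seed(7)
# Hypothetical density scores for historical days believed to be normal;
# in the deck these would come from gmm$density
scores <- dnorm(rnorm(1000))

# Tolerated false positive rate (assumed value, for illustration only)
fp_rate <- 0.01

# Threshold rho: the density below which roughly fp_rate of normal days fall
rho <- quantile(scores, probs = fp_rate, names = FALSE)

# Flag observations whose density is below the threshold
is_anomaly <- scores < rho
mean(is_anomaly)   # fraction of historical days flagged
```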

What about controlling for the location of measurement?

Controlling for City

$P(\text{temperature}, \text{humidity} \mid \text{city})$
• Option 1: Train one GMM for each city
• Option 2: Regress temperature and humidity on city, then build a GMM on the residuals of the model
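The deck goes on to use Option 1. For comparison, Option 2 might look like the following sketch, which uses a multivariate `lm()` to remove the per-city means and then fits a single GMM on the residuals. The data frame `df_demo` and its values are synthetic and hypothetical. Regressing on city alone only removes differences in the mean; differences in variability between cities remain in the residuals.

```r
library(mclust)

set.seed(3)
# Synthetic stand-in data with a different mean per city (hypothetical values)
city <- factor(rep(c("London", "NYC", "Paris"), each = 150))
temp_means <- c(London = 11, NYC = 13, Paris = 12)
hum_means  <- c(London = 75, NYC = 62, Paris = 72)
df_demo <- data.frame(
  city        = city,
  temperature = rnorm(450, mean = temp_means[as.character(city)], sd = 6),
  humidity    = rnorm(450, mean = hum_means[as.character(city)],  sd = 10)
)

# Regress both variables on city in one multivariate linear model
fit <- lm(cbind(temperature, humidity) ~ city, data = df_demo)
res <- residuals(fit)   # 450 x 2 matrix of residuals

# Fit one GMM on the residuals and use its density as the score
gmm_res <- densityMclust(res, G = 1:6)
df_demo$score <- gmm_res$density

# Lowest-density (most anomalous) rows
head(df_demo[order(df_demo$score), ], 3)
```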

Fitting multiple GMMs
library(tidyr)   # nest(), unnest()
library(purrr)   # map()
# Wrap training in a function
train_gmm = function(df){
  densityMclust(df[c("temperature", "humidity")], G=seq(1,6))
}
# Train one GMM per city
gmms = df %>% nest(-city) %>% mutate(gmm = map(data, train_gmm))
map(gmms$gmm, plot, what="density")

Raw Data: Weather in NYC is much more variable (figure panels: London, NYC, Paris)

Multiple GMMs: Weather in NYC is much more variable (figure panels: London, NYC, Paris)

What are the most anomalous days? Controlling for location
gmms %>% mutate(score = map(gmm, "density")) %>% select(-gmm) %>% unnest() %>% arrange(score) %>% head(3)

City    Temperature  Humidity
Paris   -6           51
Paris   -6           51
London  29           40

Increasing the Dimensionality

Fitting multiple GMMs
# Include four columns this time
train_gmm = function(df){
  densityMclust(df[c("temperature", "humidity", "visibility", "wind_speed")], G=seq(1,6))
}
gmms = df %>% nest(-city) %>% mutate(gmm = map(data, train_gmm))
map(gmms$gmm, plot, what="density")

High-dimensional densities Visualizing more than 2 dimensions

High-dimensional densities: Visualizing more than 2 dimensions (figure panels: London, NYC, Paris)

High-dimensional densities Paris

High-dimensional densities London

High-dimensional densities New York

What are the most anomalous days? Controlling for location
gmms %>% mutate(score = map(gmm, "density")) %>% select(-gmm) %>% unnest() %>% arrange(score) %>% head(3)

City      Temperature  Humidity  Wind Speed  Pressure  Note
New York  23           49        159         1015      Hurricane winds
London    29           40        16          1012      Insanely hot for London
New York  14           80        30          980       Crazy low pressure

Summary

Summary: Semi-supervised learning
• Semi-supervised learning attempts to learn the distribution of the data. Anomalies are then low-probability observations
• GMMs provide a quick-and-easy way to estimate probability densities
• mclust is an awesome GMM package for R
• Anomalies are highly dependent on context (e.g. city) and on the variables included in the detector (e.g. temperature, humidity, wind speed, pressure)

Moving to Production: Semi-supervised learning
• Train the GMM on "normal" data, updated weekly
• A REST API scores incoming data based on last week's data
• The threshold is chosen based on the false positive tolerance
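The scoring step behind such an API could be sketched as below. Everything here is an assumption for illustration: the training data are synthetic, `score_observation` is a hypothetical helper, the 1% tolerance is arbitrary, and the REST layer and weekly retraining schedule are left out.

```r
library(mclust)

set.seed(11)
# Hypothetical "normal" data from the most recent training window
normal_days <- data.frame(
  temperature = rnorm(500, mean = 12, sd = 5),
  humidity    = rnorm(500, mean = 70, sd = 8)
)

# Retrain step (would be re-run weekly on fresh data)
model <- densityMclust(normal_days, G = 1:6)

# Threshold from an assumed 1% false positive tolerance
rho <- quantile(model$density, probs = 0.01, names = FALSE)

# Hypothetical scoring helper: density plus anomaly flag for incoming rows
score_observation <- function(model, rho, newdata) {
  dens <- predict(model, newdata = newdata)
  data.frame(newdata, score = dens, anomaly = dens < rho)
}

# One typical and one extreme incoming observation
incoming <- data.frame(temperature = c(13, 40), humidity = c(68, 10))
score_observation(model, rho, incoming)
```

Keeping the model and the threshold as a pair makes it straightforward to swap both together each time the model is retrained.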
