information and resources ▸ integrity of information and resources ▸ availability of information and resources ▸ basic definitions ▸ threat: potential violation of a security goal ▸ security: protection from intentional threats ▸ attack: intentional violation of a security goal SOURCE: MACHINE LEARNING FOR COMPUTER SECURITY
policy: statement of what is and what is not allowed ▸ mechanism: method or tool enforcing a security policy ▸ security is a process, not a product! ▸ strategies for security mechanisms ▸ prevention of attacks, e.g. encryption ▸ detection of attacks, e.g. virus scanner ▸ analysis of attacks, e.g. forensics
popular web services ▸ identities often include real names, addresses, emails, passwords, etc. [Figure: ';--have i been pwned? statistics: 142 pwned websites, 1,444,567,928 pwned accounts, 39,842 pastes, 31,108,929 paste accounts]
security cycle ▸ increasing number of vulnerabilities ▸ large number of novel attacks ▸ high diversity of malicious software ▸ bottleneck: human analyst in the loop ▸ manual discovery of vulnerabilities ▸ manual generation of attack signatures ▸ manual analysis of malicious software
▸ ineffective against novel and unknown attacks ▸ inherent delay to availability of novel signatures ▸ analysis obstructed by polymorphism and obfuscation [Figure: NIMDA WORM. Packet layout with HEADER (IP, TCP) and APPLICATION PAYLOAD (GET /scripts/ ..%c1%9c.. /system32/cmd.exe); second row: TCP ..%c1%9c..]
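The obfuscation problem on this slide can be illustrated with a minimal sketch. It assumes a hypothetical list of literal signatures (`SIGNATURES`) and a hypothetical permissive decoder (`lenient_decode`) that accepts overlong two-byte UTF-8 sequences, as a lax server-side decoder might. The request path is assembled from the fragments shown on the slide and is not the worm's full request.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical literal signatures for directory traversal (assumption).
SIGNATURES = ["..\\..", "../.."]

def matches_signature(text: str) -> bool:
    """Naive signature check: literal substring match."""
    return any(sig in text for sig in SIGNATURES)

def lenient_decode(url: str) -> str:
    """Percent-decode, then accept overlong 2-byte UTF-8 sequences."""
    raw = unquote_to_bytes(url)
    out = []
    i = 0
    while i < len(raw):
        b = raw[i]
        is_lead = 0xC0 <= b <= 0xDF
        if is_lead and i + 1 < len(raw) and 0x80 <= raw[i + 1] <= 0xBF:
            # Combine the 5 payload bits of the lead byte with the
            # 6 payload bits of the continuation byte.
            out.append(chr(((b & 0x1F) << 6) | (raw[i + 1] & 0x3F)))
            i += 2
        else:
            out.append(chr(b))
            i += 1
    return "".join(out)

# Illustrative path built from the slide's fragments (assumption).
request = "/scripts/" + "..%c1%9c.." + "/system32/cmd.exe"

print(matches_signature(request))                  # signature sees only the encoded form
print(matches_signature(lenient_decode(request)))  # decoded form contains a backslash
```

The bytes 0xC1 0x9C decode to code point 0x5C, a backslash, so the literal signature only matches after the same permissive decoding the target applies. A signature written for one encoding misses every other encoding of the same attack.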
▸ combining computer security and machine learning ▸ minimal human intervention in prevention, detection, and analysis ▸ challenge in practice ▸ effectiveness, efficiency, and robustness ▸ transparency and controllability
and expertise) ▸ high rates of undetectable attacks (false negatives) ▸ delayed response (between detection and prevention) ▸ statistically driven (improved detection of new attacks) ▸ substantial investigative efforts (false positives) ▸ alarm fatigue and distrust (reversion to supervised method)
no history of previous attacks (required by supervised learning models) ▸ evolving attacks: attackers constantly change their behaviours, making current models obsolete ▸ limited resources: relying on security analysts to investigate the attacks can be costly and time-consuming
from raw data ▸ outlier detection system: learning a descriptive model using features from an unsupervised learning process ▸ feedback mechanism and continuous learning: incorporating analyst input
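The outlier detection and feedback steps above can be sketched as follows. The slide does not name a specific algorithm, so this sketch swaps in a simple robust score (distance from the median in units of the median absolute deviation). The entity names, the feature values, and the helpers `robust_scores`, `top_outliers`, and `record_feedback` are all hypothetical.

```python
from statistics import median

def robust_scores(values):
    """Score each value by its distance from the median in MAD units."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [0.0 for _ in values]
    return [abs(v - med) / mad for v in values]

def top_outliers(entities, values, k):
    """Return the k highest-scoring (entity, score) pairs."""
    scores = robust_scores(values)
    ranked = sorted(zip(entities, scores), key=lambda p: p[1], reverse=True)
    return ranked[:k]

def record_feedback(labels, entity, is_attack):
    """Store an analyst verdict so later models can learn from it."""
    labels[entity] = is_attack

# Hypothetical feature: failed logins per entity in one time window.
entities = ["alice", "bob", "carol", "dave", "jim"]
failed_logins = [2, 3, 1, 2, 40]

labels = {}
for entity, score in top_outliers(entities, failed_logins, k=1):
    # In practice the verdict comes from an analyst; here it is hard-coded.
    record_feedback(labels, entity, is_attack=True)
```

Only the top-ranked entities are shown to the analyst, which keeps the investigation budget small. The collected labels are what a supervised model could later be trained on.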
networking device and application logs ▸ router, switch, firewall, IDS, IPS, and load balancer devices ▸ web, database, and microservices ▸ frontend and backend applications ▸ delivered in real time from widely distributed systems
▸ volume of raw data: metrics (GB/TB) or number of lines (≥ tens of millions on a daily basis) ▸ specific to behavioural analytics: IP addresses, users, sessions, etc.
normal circumstances, malicious activities are extremely rare (generally ≤ 0.1%) ▸ resulting in extreme class imbalance in supervised learning ▸ increasing the difficulty of detection processes ▸ unknown and/or unreported attacks introduce noise into data ▸ attack vectors can take a wide variety of shapes
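A short worked example of why this class imbalance makes detection hard, assuming a hypothetical data set of 100,000 events of which 100 (0.1%) are malicious, and a trivial detector that labels everything as benign:

```python
def accuracy(y_true, y_pred):
    """Fraction of events labelled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Fraction of malicious events (label 1) that were detected."""
    positives = sum(y_true)
    if positives == 0:
        return 0.0
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return hits / positives

# Hypothetical labels: 1 = malicious, 0 = benign.
y_true = [1] * 100 + [0] * 99_900
# Trivial detector: never raises an alarm.
y_pred = [0] * len(y_true)

print(accuracy(y_true, y_pred))
print(recall(y_true, y_pred))
```

The trivial detector reaches an accuracy of 0.999 while its recall is 0.0, so accuracy alone says little at this base rate. Metrics that focus on the rare class are more informative.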
[Figure: aggregated data for the entity JIM mapped to features: is new user?, last changed password, last IP address, last session length, ..., number of failed logins]
signatures (often comprising a series of attack steps) from raw data ▸ quantitative values can be defined by security analysts ▸ extracting features on a per-entity and per-time-segment basis
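Per-entity, per-time-segment extraction with analyst-defined features could look like the sketch below. The event fields (`user`, `day`, `type`, `ip`), the sample records, and the two features are assumptions chosen for illustration; they are not the lecture's feature set.

```python
from collections import defaultdict

# Hypothetical parsed events (field names are assumptions).
events = [
    {"user": "jim", "day": "2016-05-01", "type": "login_failed", "ip": "10.0.0.5"},
    {"user": "jim", "day": "2016-05-01", "type": "login_failed", "ip": "10.0.0.5"},
    {"user": "jim", "day": "2016-05-01", "type": "login_ok", "ip": "10.0.0.9"},
    {"user": "ann", "day": "2016-05-01", "type": "login_ok", "ip": "10.0.0.7"},
]

# Analyst-defined features: each maps a list of events to one number.
FEATURES = {
    "failed_logins": lambda evs: sum(e["type"] == "login_failed" for e in evs),
    "distinct_ips": lambda evs: len({e["ip"] for e in evs}),
}

def extract(events):
    """Group events by (entity, time segment) and compute every feature."""
    groups = defaultdict(list)
    for e in events:
        groups[(e["user"], e["day"])].append(e)
    return {
        key: {name: fn(evs) for name, fn in FEATURES.items()}
        for key, evs in groups.items()
    }

vectors = extract(events)
```

Keeping the features in a plain dictionary means an analyst can add a new quantitative value without touching the grouping logic.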
absorbing the log stream: identifying the entities and updating the corresponding records ▸ in short temporal windows: 30 minutes, 1 hour, 12 hours, or 24 hours ▸ focus on efficient retrieval for feature computation
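A minimal sketch of this ingestion step, assuming a hypothetical log line format of `<epoch seconds> <entity> <event type>` and a one-hour window. Records are keyed by (entity, window start) so that later lookups are a single dictionary access.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 3600  # one hour, one of the window sizes listed above

def window_start(ts, size=WINDOW_SECONDS):
    """Round a timestamp down to the start of its window."""
    return ts - (ts % size)

def parse_line(line):
    """Parse the assumed '<epoch> <entity> <event>' format."""
    ts, entity, event_type = line.split()
    return int(ts), entity, event_type

def ingest(store, line):
    """Identify the entity and update its record for the current window."""
    ts, entity, event_type = parse_line(line)
    store[(entity, window_start(ts))][event_type] += 1

store = defaultdict(Counter)
for line in [
    "7210 jim login_failed",
    "7250 jim login_failed",
    "10805 jim login_ok",
]:
    ingest(store, line)
```

Each record only holds counters for one entity and one window, so the store stays compact even when the raw stream is large.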
computing behavioural features over an interval of time ▸ retrieving all activity records within the given interval ▸ aggregating smaller time units (minutes, hours, days, weeks) as the feature demands
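The aggregation step can be sketched as a roll-up of small per-window records into one feature value for a longer interval. The bucket layout (a dictionary keyed by entity and window start, holding event counters), the sample counts, and the helper `feature_over_interval` are assumptions; the sketch also assumes `start` is aligned to a window boundary.

```python
def feature_over_interval(buckets, entity, key, start, end, unit):
    """Sum one counter over all windows in the half-open interval [start, end)."""
    total = 0
    t = start
    while t < end:
        total += buckets.get((entity, t), {}).get(key, 0)
        t += unit
    return total

HOUR = 3600

# Hypothetical hourly records: (entity, window start) -> event counters.
buckets = {
    ("jim", 0): {"login_failed": 1},
    ("jim", 3600): {"login_failed": 2},
    ("jim", 7200): {"login_failed": 4},
    ("bob", 3600): {"login_failed": 9},
}

two_hours = feature_over_interval(buckets, "jim", "login_failed", 0, 2 * HOUR, HOUR)
three_hours = feature_over_interval(buckets, "jim", "login_failed", 0, 3 * HOUR, HOUR)
```

Because the small windows are stored once, the same records serve features defined over hours, days, or weeks without re-reading the raw logs.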
konrad rieck, fabian yamaguchi, and alwin maier (institute of system security, tu-braunschweig) https://www.tu-braunschweig.de/sec/teaching/ss16/mlsec ▸ 360º unsupervised anomaly-based intrusion detection by stefano zanero https://www.blackhat.com/presentations/bh-dc-07/Zanero/Presentation/bh-dc-07-Zanero.pdf ▸ mlsec project http://www.mlsecproject.org/ ▸ redefining infosec by combining ai and human intuition https://www.patternex.com/redefining-infosec-by-combining-ai-and-human-intuition-wp