

Processing terabytes of data every day and sleeping at night

This is the story of how we built a highly available data pipeline that processes terabytes of network data every day, making it available to security researchers for assessment and threat hunting. Building this kind of system in the cloud is not that complicated, but if you have to make it near real-time, fault tolerant and available 24/7, well... that's another story. In this talk, we will tell you how we achieved this ambitious goal and how we missed a few good nights of sleep while trying to do that! Spoiler alert: contains AWS, serverless, Elasticsearch, monitoring, alerting & more!

Luciano Mammino

November 30, 2018



Transcript

  1. Processing Terabytes of data every day … and sleeping at

    night @katavic_d - @loige Milan, 30/11/2018 loige.link/terabytes
  2. Agenda • The problem space • Our first MVP &

    Beta period • INCIDENTS! And lessons learned • Monitoring and instrumentation • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  3. AI to detect and hunt for cyber attackers Cognito Platform

    • Detect • Recall @katavic_d - @loige
  4. Cognito Detect: on-premises solution • Analyzing network traffic and

    logs • Uses AI to deliver real-time attack visibility • Behaviour driven • Host centric • Provides threat context and most relevant attack details @katavic_d - @loige
  5. Cognito Recall • Collects network metadata and stores it in

    “the cloud” • Data is processed, enriched and standardised • Data is made searchable @katavic_d - @loige A new Vectra product for Incident Response
  6. Recall requirements • Data isolation • Ingestion speed: ~2GB/min x

    customer (up to ~3TB per day per customer) • Forensic tool: Flexible data exploration @katavic_d - @loige
  7. Agenda • The problem space • Our first MVP &

    Beta period • INCIDENTS! And lessons learned • Monitoring and instrumentation • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  8. Security • Separate VPCs • Strict Security Groups (whitelisting) •

    Red, amber, green subnets • Encryption at rest through AWS services • Client Certificates + TLS • Pentest @katavic_d - @loige
  9. Warning: different timezones! * @katavic_d

    - @loige *yeah, we actually look that cute when we sleep!
  10. Agenda • The problem space • Our first MVP &

    Beta period • INCIDENTS! And lessons learned • Monitoring and instrumentation • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  11. Lambda timeouts incident • AWS Lambda timeout: 5 minutes (now

    15) • We are receiving files every minute (containing 1 minute of network traffic) • During peak hours for the biggest customer, files can be too big to be processed within 5 minutes @katavic_d - @loige
  12. Lessons learned • Predictable data input for predictable performance •

    Data ingestion parallelization (exploiting serverless capabilities) @katavic_d - @loige
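A minimal sketch of the fan-out idea behind this lesson, assuming a Python driver that splits each incoming file into chunks and invokes a worker Lambda per chunk; the worker function name and chunking scheme are hypothetical, not from the deck:

```python
# Fan-out sketch: split one large per-minute file into chunks and hand each
# chunk to its own worker Lambda, so no single invocation hits the timeout.
# "recall-process-chunk" and the chunk count are hypothetical placeholders.
import json

import boto3

lambda_client = boto3.client("lambda")

def fan_out(bucket, key, num_chunks):
    """Asynchronously invoke one worker Lambda per chunk of the input file."""
    for chunk_index in range(num_chunks):
        payload = {"bucket": bucket, "key": key,
                   "chunk": chunk_index, "total_chunks": num_chunks}
        lambda_client.invoke(
            FunctionName="recall-process-chunk",  # hypothetical worker function
            InvocationType="Event",               # async: don't wait for the result
            Payload=json.dumps(payload),
        )
```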
  13. Lambdas IP starvation incident • Spinning up many lambdas consumed

    all the available IPs in a subnet • Failure to get an IP for the new ES machines • Elasticsearch cannot scale up • Solution: separate Elasticsearch and Lambda subnets @katavic_d - @loige
  14. Lessons learned • Every lambda takes an IP from the

    subnet • Edge conditions or bugs might generate spikes in the number of running lambdas and you might run out of IPs in the subnet! • Consider putting lambdas in their dedicated subnet @katavic_d - @loige
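Back-of-the-envelope math behind this lesson (the concurrency spike figure is illustrative, but AWS does reserve 5 addresses per subnet, and at the time each concurrent VPC Lambda needed an ENI with its own IP):

```python
# Illustrative arithmetic: how a burst of VPC Lambdas can starve a subnet.
subnet_size = 256                  # a /24 subnet
aws_reserved = 5                   # AWS reserves 5 addresses per subnet
usable_ips = subnet_size - aws_reserved   # 251 usable addresses

peak_lambda_concurrency = 300      # hypothetical spike during an incident
print(usable_ips - peak_lambda_concurrency)   # < 0: nothing left for new ES nodes
```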
  15. • New Lambda version triggered insertion failures • Elasticsearch rejecting

    inserts and logging errors • Our log reporting agents got stuck (we DDoS’d ourselves!) • Monitoring/Alerting failed Resolution: • Fix mismatching schema • Scaled out centralised logging system Why didn’t we receive the page? @katavic_d - @loige
  16. Alerting on lambda failures Using logs: • Best case: no

    logs • Worst case: no logs (available)! A better approach: • Attach a DLQ to your lambdas • Alert on queue size with CloudWatch! • Visibility on Lambda retries @katavic_d - @loige
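A sketch of the "alert on queue size" suggestion using boto3, assuming the DLQ is an SQS queue; the queue name and SNS topic ARN are placeholders:

```python
# Sketch: page whenever anything lands in the Lambda dead-letter queue.
# Queue name and SNS topic ARN are placeholders, not from the deck.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="ingest-lambda-dlq-not-empty",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "ingest-lambda-dlq"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:oncall-page"],
)
```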
  17. Missing data incident: the return! • Missing data in the

    database • Stack is working fine • Not receiving data in the collector node • Root cause: a new customer firewall rule was blocking our traffic at source! • As soon as the rule was fixed, data was flowing in again @katavic_d - @loige
  18. • When is this an actual problem? • How to

    alert effectively? @katavic_d - @loige How to deal with lack of data...
  19. Lessons learned • Ping & health checks to make sure

    everything is working and data can flow • Tracing to track performance degradations and pipeline issues. • Instrumentation can be very valuable: do it! @katavic_d - @loige
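One possible shape for the health check mentioned above: compare the newest document timestamp in Elasticsearch against the clock and page when data stops flowing. The endpoint, index and field names are invented for the example:

```python
# Data-freshness check sketch: alert if no new documents arrived recently.
# Endpoint, index and field names are hypothetical.
from datetime import datetime, timezone

from elasticsearch import Elasticsearch

es = Elasticsearch(["https://recall-es.internal:9200"])

def newest_event_age_seconds(index="network-metadata-*", ts_field="@timestamp"):
    resp = es.search(
        index=index,
        body={"size": 1, "sort": [{ts_field: "desc"}], "_source": [ts_field]},
    )
    hits = resp["hits"]["hits"]
    if not hits:
        return float("inf")  # no data at all
    newest = datetime.fromisoformat(hits[0]["_source"][ts_field].replace("Z", "+00:00"))
    return (datetime.now(timezone.utc) - newest).total_seconds()

if newest_event_age_seconds() > 5 * 60:
    print("PAGE: no data ingested in the last 5 minutes")  # hook this into your alerting
```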
  20. Agenda • The problem space • Our first MVP &

    Beta period • INCIDENTS! And lessons learned • Monitoring and instrumentation • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  21. Instrumentation & metrics Statsd • Daemon for stats aggregations •

    Very lightweight (UDP based) • Simple to integrate • Visualize data through Grafana @katavic_d - @loige
  22. Timing Used to report time measurements E.g. how long did

    it take to process and insert a batch
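A minimal example of the kind of statsd instrumentation described on the last two slides, using the common Python statsd client; metric names are illustrative:

```python
# Counting and timing with statsd (UDP, fire-and-forget, very cheap).
# Metric names and the insert function are illustrative placeholders.
import statsd

stats = statsd.StatsClient("localhost", 8125, prefix="recall.ingest")

def insert_batch(batch, do_insert):
    with stats.timer("batch_insert_time"):  # reports elapsed milliseconds
        do_insert(batch)                    # e.g. the Elasticsearch bulk insert
    stats.incr("batches_inserted")
    stats.gauge("batch_size", len(batch))
```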
  23. Agenda • The problem space • Our first MVP &

    Beta period • INCIDENTS! And lessons learned • Monitoring and instrumentation • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  24. AWS nuances • Serverless is cheap, but be aware of

    timeouts! • Not every service/feature is available everywhere ◦ SQS FIFO :( ◦ Not all AWS regions have 3 AZs ◦ Not all instance types are available in every availability zone • Limits everywhere! ◦ Soft vs hard limits ◦ Take them into account in your design @katavic_d - @loige
  25. Agenda • The problem space • Our first MVP &

    Beta period • INCIDENTS! And lessons learned • Monitoring and instrumentation • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  26. Process How to deal with incidents • Page • Engineers

    on call • Incident Retrospective • Actions @katavic_d - @loige
  27. Pages • Page is an alarm for people on call

    (PagerDuty) • Rotate ops & devs (share the pain) • Generate pages from different sources (logs, CloudWatch, SNS, Grafana, etc.) • When a page is received, it needs to be acknowledged or it is automatically escalated • If customer facing (e.g. service not available), the customer is notified @katavic_d - @loige
  28. Engineers on call 1. Use operational handbook 2. Might escalate

    to other engineers 3. Find mitigation / remediation 4. Update handbook 5. Prepare for retrospective @katavic_d - @loige
  29. Incidents Retrospective "Regardless of what we discover, we understand and

    truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand." – Norm Kerth, Project Retrospectives: A Handbook for Team Reviews TL;DR: NOT A BLAMING GAME! @katavic_d - @loige
  30. Incidents Retrospective • Summary • Events timeline • Contributing Factors

    • Remediation / Solution • Actions for the future • Transparency @katavic_d - @loige
  31. Agenda • The problem space • Our first MVP &

    Beta period • INCIDENTS! And lessons learned • Monitoring and instrumentation • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  32. Development best practices • Regular Retrospectives (not just for incidents)

    ◦ What’s good ◦ What’s bad ◦ Actions to improve • Kanban Board ◦ All work visible ◦ One card at a time ◦ Work In Progress limit ◦ “Stop Starting, Start Finishing” @katavic_d - @loige
  33. Development best practices • Clear acceptance criteria ◦ Collectively defined

    (3 amigos) ◦ Make sure you know when a card is done • Split the work in small cards ◦ High throughput ◦ More predictability • Bugs take priority over features! @katavic_d - @loige
  34. Development best practices • Pair programming ◦ Share the knowledge/responsibility

    ◦ Improve team dynamics ◦ Enforced by low WIP limit • Quality over deadlines • Don’t estimate without data @katavic_d - @loige
  35. Agenda • The problem space • Our first MVP &

    Beta period • INCIDENTS! And lessons learned • Monitoring and instrumentation • AWS Nuances • Process to deal with incidents • Development best practices • Release process @katavic_d - @loige
  36. Release process • Infrastructure as code (Terraform + Ansible)

    ◦ Deterministic deployments ◦ Infrastructure versioning using git • No “snowflakes”, one code base for all customers • Feature flags: ◦ Special features ◦ Soft releases • Automated tests before release @katavic_d - @loige
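A minimal sketch of the per-customer feature flags idea, keeping one code base for all customers; storing the flags in a JSON file is an assumption made for the example:

```python
# Feature-flag sketch: one code base, per-customer soft releases.
# The flag file layout is an assumption, not the deck's actual mechanism.
import json

def load_flags(path="feature_flags.json"):
    # e.g. {"customer-a": {"new_enrichment": true}, "customer-b": {}}
    with open(path) as f:
        return json.load(f)

def is_enabled(flags, customer, feature):
    return flags.get(customer, {}).get(feature, False)

flags = load_flags()
if is_enabled(flags, "customer-a", "new_enrichment"):
    pass  # run the soft-released code path for this customer
```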
  37. Conclusion @katavic_d - @loige We are still waking up at

    night sometimes, but we are definitely sleeping a lot more and better! Takeaways: • Have healthy and clear processes • Always review and strive for improvement • Monitor/Instrument as much as you can (even monitoring) • Use managed services to reduce the operational overhead (but learn their nuances)
  38. Credits Pictures from Unsplash Huge thanks to: • All the

    Vectra team • Paul Dolan • @gbinside • @augeva • @Podgeypoos79 • @PawrickMannion • @micktwomey • Vedran Jukic for support and reviews!