• Cloud services are composed of stuff which sometimes breaks
• How do I find and fix problems in the cloud service before my customers are affected?
  …all the time?
  …with potentially millions of customers?
  …without spending infinite money on telemetry?
• I want to spend precious development time building the core logic … not telemetry solutions
• Most customers are where we were a few years ago with our own service visibility
• I’ll talk about the lessons we’ve learned managing the platform and the apps running on it
• The choices available today will change again 12 months from now – so assume you will revisit the choices you make now
• Invest in re-usable pieces, not monoliths
• The scale of your service usually determines the option
• Not all options scale to the largest sizes
Telemetry systems typically evolve as follows:
1. Hook up something like SQL Azure or WA Tables to store data
2. Dump more and more stuff in
3. Queries get slower OR you run out of space (or both)
Once you hit this limit, things get interesting and you move to Big Data approaches. These work OK for reporting/data science, but poorly for alerting.
This leads to two systems: a “Batch” pipeline and a “Fast” pipeline.
We will go through this evolution so you can see how to do each one.
• XEvents (SQL)
• DMVs (SQL)
• Custom Tracing (Application-Generated Events)
(Expect to iterate on this – as you run your service, you find things you need and things that are not worth collecting – you tune frequently)
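As a concrete illustration, here is a minimal sketch of polling a SQL DMV for request-level telemetry, assuming Python with pyodbc and a reachable SQL Azure connection string; the query, field names, and output target are illustrative assumptions, not the talk's actual collectors:

```python
# Minimal sketch: poll a SQL DMV for in-flight request telemetry.
# Assumes pyodbc is installed and CONNECTION_STRING points at a SQL Azure database;
# the query and event shape are illustrative, not the talk's actual collectors.
import json
import time

import pyodbc

CONNECTION_STRING = "Driver={ODBC Driver 17 for SQL Server};Server=...;Database=...;Uid=...;Pwd=..."

DMV_QUERY = """
SELECT session_id, status, command, wait_type, total_elapsed_time
FROM sys.dm_exec_requests
WHERE session_id > 50  -- skip system sessions
"""

def poll_once():
    """Run the DMV query once and emit each row as a JSON telemetry event."""
    with pyodbc.connect(CONNECTION_STRING) as conn:
        cursor = conn.cursor()
        for row in cursor.execute(DMV_QUERY):
            event = {
                "source": "dmv.dm_exec_requests",
                "session_id": row.session_id,
                "status": row.status,
                "command": row.command,
                "wait_type": row.wait_type,
                "elapsed_ms": row.total_elapsed_time,
                "collected_at": time.time(),
            }
            print(json.dumps(event))  # in practice: write to your ingestion pipeline

if __name__ == "__main__":
    poll_once()
```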
• Push the collected data into WA Table Storage
• Manually query Table Storage to find data when there is a problem (no cross-partition or cross-table query support)
• Put each kind of data (errors, perf counters) in separate tables
• Hook up to on-premises SCOM and run machines like you do on-premises
• This model works fine for limited scales
• Often this is the “first attempt” for telemetry systems: re-use on-premises capabilities for the first cloud deployments
SCOM Azure Management Pack: http://www.microsoft.com/en-us/download/details.aspx?id=11324
(Diagram: Application → DB / Telemetry DB → SCOM/Cerebrata)
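For illustration, a minimal sketch of the one-table-per-kind approach, assuming the current azure-data-tables Python SDK (which postdates these slides) and a storage connection string; table and property names are illustrative:

```python
# Minimal sketch: one table per kind of telemetry (errors, perf counters, ...),
# written and queried via Azure Table Storage.
# Assumes the azure-data-tables package and a storage connection string;
# table/property names are illustrative, not from the talk.
import datetime
import uuid

from azure.data.tables import TableServiceClient

CONNECTION_STRING = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;"

service = TableServiceClient.from_connection_string(CONNECTION_STRING)
errors = service.create_table_if_not_exists("Errors")

def log_error(role_instance: str, error_code: int, message: str) -> None:
    """Partition by role instance so one instance's errors stay in one partition."""
    errors.create_entity({
        "PartitionKey": role_instance,
        "RowKey": str(uuid.uuid4()),
        "ErrorCode": error_code,
        "Message": message,
        "LoggedAtUtc": datetime.datetime.utcnow().isoformat(),
    })

def errors_for_instance(role_instance: str):
    """Manual lookup when there is a problem: single-partition queries only."""
    return list(errors.query_entities(f"PartitionKey eq '{role_instance}'"))
```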
• A multitude of agents (WAD, SCOM, New Relic) that don’t meet all the requirements
• Ability to transform arbitrary logging data into a common format (the Logstash grok filter capability in the ELK stack)
• Target a diverse set of analytics repositories
• Surface area support, e.g. no worker roles for App Insights
• Guarantees on the ingestion pipeline – how do I meet my time-to-detect and time-to-remediate requirements?
• Separation of channels (cold, warm, hot)
• Walled gardens: focused on ambient information vs. active application logging
• Lack of developer composition points – choice of analytics repository, access to the data, configurable pipeline properties
• Quick, iterative, temporal visualization, the ability to define derivative KPIs, and drill-down into active app logging are missing in our stack
(Diagram: pipeline stages Producers → Collection → Broker → Transformation → Storage → Presentation and Action, fed by Devices/Services/Apps; options range from “No Access” (WAD Agent, SCOM Agent, 3rd Party Agent) through “Configurable/Somewhat Extensible” (App Insights, SCOM) to “Full Access / Fully Extensible” (Build Your Own + BI Tool))
The ELK stack gives you the ability to take arbitrary data, transform it, and provide temporal visualization and search.
Pipeline:
1. Transform arbitrary data easily: distributed Logstash agents with grok filters
2. Load log content: using Logstash or by hitting the REST endpoint on :9200 with JSON
3. Store documents in Elasticsearch: runs on Windows or Linux
4. Quickly get insight: choose a timestamp field and get instant temporal visualization, plus full search and fast interactive exploration using Kibana
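As a sketch of step 2, here is one way to push a JSON log document straight to the :9200 REST endpoint and search it back, assuming Python with the requests package and an Elasticsearch node on localhost; the index and field names are illustrative:

```python
# Minimal sketch: index a JSON log document into Elasticsearch over the :9200
# REST endpoint and run a simple search. Assumes the `requests` package and an
# Elasticsearch node on localhost:9200; index/field names are illustrative.
# (Older Elasticsearch versions use /index/type paths instead of /index/_doc.)
import datetime

import requests

ES = "http://localhost:9200"

def index_event(level: str, message: str) -> None:
    """Index one log event; @timestamp lets Kibana build temporal visualizations."""
    doc = {
        "@timestamp": datetime.datetime.utcnow().isoformat(),
        "level": level,
        "message": message,
    }
    resp = requests.post(f"{ES}/telemetry/_doc", json=doc)
    resp.raise_for_status()

def recent_errors():
    """Full-text search for error-level events."""
    query = {"query": {"match": {"level": "error"}}}
    resp = requests.get(f"{ES}/telemetry/_search", json=query)
    resp.raise_for_status()
    return resp.json()["hits"]["hits"]
```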
• Batch/Big Data systems are great at processing things at scale
• But they are not fast – it often takes minutes to hours to process at scale
• Alerting for errors is all about speed (time-to-detect)
• This leads to a different class of solution for “fast pipe” monitoring
• For every incident, we measure how long it took us to detect it
• We file repair bugs to keep driving that metric lower next time
• You need to be selective about what you pass through the fast pipe
• Perhaps you only look at key errors or pre-aggregate values (a pre-aggregation sketch follows)
• Otherwise you will overwhelm the alerting system
• Storage efficiency is also key – I see lots of denormalized-row solutions
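A minimal sketch of pre-aggregating error counts into one-minute buckets before they enter the fast pipe, in Python; the bucket size, event shape, and emit target are illustrative assumptions:

```python
# Minimal sketch: pre-aggregate raw error events into per-minute counts so only
# small aggregates flow through the fast (alerting) pipe instead of every event.
# Bucket size, event shape, and the emit target are illustrative assumptions.
from collections import Counter

BUCKET_SECONDS = 60

def aggregate(events):
    """events: iterable of (unix_timestamp, error_code) tuples from the raw stream.
    Returns {(bucket_start, error_code): count} – a much smaller payload to ship."""
    counts = Counter()
    for ts, error_code in events:
        bucket_start = int(ts) - int(ts) % BUCKET_SECONDS
        counts[(bucket_start, error_code)] += 1
    return counts

def emit(counts):
    """Ship aggregates to the fast pipe (here: just print them)."""
    for (bucket_start, error_code), n in sorted(counts.items()):
        print(f"{bucket_start} error={error_code} count={n}")

if __name__ == "__main__":
    sample = [(1000.5, 40613), (1010.2, 40613), (1075.9, 40501)]
    emit(aggregate(sample))
```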
• Once you have the data, you can do Machine Learning
• Applications:
  • Auto-tuned alerts
  • Prediction models (for failures based on historical behaviors, etc.)
  • Watching multiple things for errors without defined alerts
• We use ML algorithms to detect new bugs in WA SQL Database (SQL Azure):
  • Watch all errors from all users (every minute or two)
  • See if new kinds of errors start spiking
  • Fire alerts for errors of appropriate severity (a simplified spike-detection sketch follows)
• This is far better than:
  • Firing alerts with static limits (they break as your service grows)
  • Hand-coding each limit (takes a long time)
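A much-simplified sketch of the idea (not the actual WA SQL Database algorithm): flag an error code whenever its current per-minute count sits far above its recent baseline; the constants are assumptions:

```python
# Much-simplified spike detection: flag an error code whose current count is far
# above its recent baseline (mean + k * stddev). This illustrates the idea only;
# it is not the actual algorithm used for WA SQL Database.
import statistics

K_SIGMA = 4        # how far above the baseline counts as a spike (assumed)
MIN_HISTORY = 10   # need some history before judging (assumed)

def spiking_errors(history, current):
    """history: {error_code: [count per past minute, ...]}
    current: {error_code: count this minute}
    Returns error codes whose current count looks anomalous."""
    alerts = []
    for code, count in current.items():
        past = history.get(code, [])
        if len(past) < MIN_HISTORY:
            # A brand-new error code with real volume is itself interesting.
            if not past and count > 0:
                alerts.append((code, count, "new error code"))
            continue
        mean = statistics.mean(past)
        stdev = statistics.pstdev(past) or 1.0  # avoid a zero threshold
        if count > mean + K_SIGMA * stdev:
            alerts.append((code, count, f"baseline {mean:.1f}"))
    return alerts
```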
• Option 1: … it is free
• Then figure out how to pump lots of data through it, do alerts, etc.
• Option 2: Try the Azure ML service (not free, but easier to get started)
• Go author a job and try it out
• If they spike, our systems fire alerts
• False positives get corrected over time with auto-leveling of thresholds (a sketch follows)
• This is a simple example with SQL Azure user errors
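One way to picture auto-leveling (an illustrative sketch, not the production mechanism): let the alert threshold track an exponentially weighted baseline of recent counts, so thresholds rise as normal volume grows and a false positive fades into the baseline:

```python
# Illustrative sketch of auto-leveling thresholds: the threshold tracks an
# exponentially weighted moving average (EWMA) of recent counts, so it rises as
# normal traffic grows and yesterday's false positive becomes today's baseline.
# The constants and update rule are assumptions, not the production mechanism.

ALPHA = 0.1        # how quickly the baseline adapts (assumed)
HEADROOM = 3.0     # alert only when count exceeds 3x the baseline (assumed)

class AutoLevelingThreshold:
    def __init__(self, initial_baseline: float = 1.0):
        self.baseline = initial_baseline

    def observe(self, count: float) -> bool:
        """Return True if this observation should fire an alert,
        then fold the observation into the baseline either way."""
        fire = count > HEADROOM * self.baseline
        self.baseline = (1 - ALPHA) * self.baseline + ALPHA * count
        return fire

# Usage: one instance per error type; feed it the per-minute count.
threshold = AutoLevelingThreshold()
for minute_count in [2, 3, 2, 50, 4, 3]:
    if threshold.observe(minute_count):
        print(f"alert fired for count {minute_count}")
```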
• We also monitor the telemetry agent itself to determine if it is sending data properly
• We can detect when spikes or a lack of data should cause alerts (a heartbeat-style sketch follows)
• We file bugs/incidents automatically, with no human involvement
• This lets us detect and fix issues before our customers notice
• In this example, we deployed a bug fix that reduced telemetry volume and our alerting caught it – we resolved the bug as “expected change”
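A minimal sketch of the lack-of-data case, in Python: track the last time each agent reported and flag any agent that has been silent longer than an assumed grace period (the limit and agent identifiers are illustrative):

```python
# Minimal sketch of detecting "lack of data" from a telemetry agent: remember the
# last time each agent reported, and alert when one has been silent too long.
# The grace period and agent identifiers are assumptions for illustration.
import time

SILENCE_LIMIT_SECONDS = 300  # assumed: alert if an agent is quiet for 5 minutes

last_seen: dict[str, float] = {}

def record_heartbeat(agent_id: str, now: float | None = None) -> None:
    """Call whenever any telemetry arrives from an agent."""
    last_seen[agent_id] = now if now is not None else time.time()

def silent_agents(now: float | None = None) -> list[str]:
    """Agents that have not sent anything within the grace period."""
    now = now if now is not None else time.time()
    return [agent for agent, ts in last_seen.items()
            if now - ts > SILENCE_LIMIT_SECONDS]

# Usage: call record_heartbeat("worker-07") on every event; periodically check
# silent_agents() and file an incident automatically for anything it returns.
```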
(Diagram: reference telemetry architecture – Producers (applications, devices, Internet of Things, infrastructure devices, external data sources, synthetic transactions) feed instrumentation libraries/agents and a connector for devices into a Data Broker with a Data Directory; processing spans Stream Processing, Map-Reduce Processing, Text Indexing, and Machine Learning over a Data Store, Results Cache, and long-term storage; Presentation and Action includes dashboards, log search, distributed tracing, data analytics, App Insights, dial-tone services, multi-dimensional metrics, alerting, monitoring, health state, and remediation.)
Key points: a common schema and data encoding enable democratization, correlation, and reuse of data; monitoring is a form of stream processing and logs a copy of its results to analytics.