Elastic{ON} 2018 - The Path to Intelligent Operation with NetApp OnCommand Insight

Elastic Co

March 01, 2018

Transcript

  1. The Path to Intelligent Operation with NetApp OnCommand Insight

    NetApp, March 1st, 2018
    Keren Dagan, Senior Product Manager
    Francisco Rosa, Senior Software Engineer
    @kerendg, @majorouch
  2. Agenda

    1 Introduction to NetApp OnCommand Insight
    2 Evolving into Elasticsearch
    3 Evolving into Machine Learning
    4 Lessons learned
  3. NetApp Inc. Next Gen Data Center - Digital Transformation

    • 25 years
    • $5.5 Billion Revenue
    • 10,000+ employees
    • 110+ offices
    Fastest growing All-Flash Array vendor
    Recognized as a Leader in multiple Gartner Magic Quadrants
    • General-Purpose Disk Arrays
    • Solid-State Arrays
    • Integrated Systems
    We help our customers to:
    • Drive value with data
    • Gain Insight, Access, and Control
    NetApp Cloud Central And
  4. NetApp OnCommand Insight: Hybrid Cloud Infrastructure Analytics

    Manage growth and complexity. Troubleshoot issues. Identify and monitor cost.
    Assets Under Management
    • Capacity: 11EB+
    • Switch Ports: 5M+
    • Systems (VMs and Servers): 10M+
    • Storage Systems: 35K+
    Top Companies Rely on OnCommand Insight
    • The top 10 Fortune 500 companies
    • 6 of the top 10 US Retailers
    • 8 of the top 10 Banks
    • 5 of the top 10 Insurance companies
    • 7 of the top 10 Tech and Service Providers
  5. OnCommand® Insight Hybrid Infrastructure Management

    Consistent insights across multi-vendor, hybrid infrastructure: Private Cloud, Fibre Channel Switches, On-Premise, Public Cloud
    Intelligent Operations
    • Discover and monitor resources, their relationships and dependencies
    • Proactive alerting and fast troubleshooting with advanced analytics
    Business Insights
    • Resource optimization
    • Cost alignment and showback
    • Forecast performance and capacity planning
    • Enables business workflows such as billing, cost, change management and automation
    Ecosystem Integration
    • Open APIs provide access to discovered and monitored data
    Data: Inventory (Resources), Performance (KPI), Topology and Relationships
  6. Resources in OnCommand Insight

    Resource Landing Page
    • Customized view for each resource type
    • A 360-degree view of the resource including metrics, topology and business context
    • Expert view with charts, and advanced analytics section
    • Quick navigation to related resources’ landing pages
    [Screenshot callouts: Resource Type and Name, Summary Information, Topology, Business Context Data, Expert View, Analytics Section, Related Resources]
  7. The Applications Infrastructure Stack Topologies – OnPrem, Private and Public Cloud!

    [Diagram: three application stacks]
    • App running OnPrem
    • App running on AWS (or Azure)
    • App running on OpenStack
    [Components shown: Switch, Storage, volume, Hypervisor, VM, KVM, AWS Instance (EC2), EBS Volume, S3 Buckets]
  8. Ralph Waldo Emerson

    PM: “I need grouping of timeseries by virtual machine attributes”
  9. Indices evolve… the big split

    [Architecture diagram: OCI Engine, MySQL, Elasticsearch (Single Node), OCI Elasticsearch plugin]
    • Index join search
    • Custom aggregations: weighted average
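
The slide names a weighted average as one of the custom aggregations but does not show how it is computed. As a rough illustration only, the sketch below computes a per-group weighted average in plain Python. It is not the OCI plugin's code; the function name, the sample layout `(group, value, weight)`, and the choice of weighting latency by IOPS are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def weighted_average_by_group(
    samples: Iterable[Tuple[str, float, float]]
) -> Dict[str, float]:
    """Return sum(value * weight) / sum(weight) for each group.

    Each sample is a hypothetical (group, value, weight) tuple, for
    example (vm_name, latency_ms, iops).
    """
    numerator = defaultdict(float)
    denominator = defaultdict(float)
    for group, value, weight in samples:
        numerator[group] += value * weight
        denominator[group] += weight
    # Groups whose total weight is zero have no defined average; skip them.
    return {g: numerator[g] / denominator[g] for g in numerator if denominator[g] > 0}


# Illustrative data: latency (ms) weighted by IOPS for two VMs.
samples = [
    ("vm-a", 2.0, 100.0),
    ("vm-a", 10.0, 300.0),
    ("vm-b", 5.0, 50.0),
]
result = weighted_average_by_group(samples)
print(result)  # {'vm-a': 8.0, 'vm-b': 5.0}
```

A plain mean of the two `vm-a` samples would be 6.0; weighting by IOPS gives 8.0 because the busier interval counts for more.
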
  10. PM: “I would like to push my own objects into the system”
  11. And we embrace the dynamic world!

    [Architecture diagram: metricbeat on a Kubernetes master and three Kubernetes nodes; Logstash with OCI output plugin; OCI Engine; Elasticsearch with OCI Elasticsearch plugin, Single Node and Nodes (X)]
  12. Anomaly Detection

    • Systems tend to converge over time to a rhythmic pattern of operation, a predictable cyclical pattern.
    • This pattern is not simple, and the cycles can span hours, days, weeks and months.
    • A static threshold works when the user knows what is “bad”; otherwise it creates noise.
    • ML is good at detecting when the pattern has changed.
    • Prelert (now Elastic ML) and Elastic.
    • OnCommand Insight implementation of Elastic ML for Anomaly Detection
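
To make the contrast between a static threshold and pattern-based detection concrete, here is a toy sketch in Python. It is not how Elastic ML models data: it simply compares each new point with the mean and spread of earlier points at the same phase of an assumed fixed cycle. The function names, the `period` and `k` parameters, and the sample data are all hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def static_threshold_alerts(series: List[float], threshold: float) -> List[int]:
    """Indices of points above a fixed threshold."""
    return [i for i, v in enumerate(series) if v > threshold]


def seasonal_baseline_alerts(
    history: List[float],
    recent: List[float],
    period: int,
    k: float = 3.0,
    min_spread: float = 1e-9,
) -> List[int]:
    """Indices in `recent` that deviate from the per-phase baseline.

    Assumes `recent` directly follows `history` and that the cycle
    length `period` is known in advance (a simplification).
    """
    alerts = []
    for i, value in enumerate(recent):
        phase = (len(history) + i) % period
        same_phase = history[phase::period]  # earlier points at this phase
        baseline = mean(same_phase)
        spread = max(pstdev(same_phase), min_spread)
        if abs(value - baseline) > k * spread:
            alerts.append(i)
    return alerts


# Six cycles of a 4-step rhythm (low, mid, peak, mid) with small jitter.
history = [base + (day % 3 - 1) for day in range(6) for base in (10, 50, 90, 50)]
# The next cycle misses its usual peak: the pattern breaks at index 2.
recent = [10, 50, 50, 50]

print(static_threshold_alerts(history, 80))          # every normal peak fires
print(static_threshold_alerts(recent, 80))           # the break is missed
print(seasonal_baseline_alerts(history, recent, 4))  # the break is flagged
```

With a threshold of 80, the static rule fires on each of the six normal peaks and stays silent when the peak goes missing, while the per-phase baseline flags only the missing peak.
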
  13. OnCommand Insight Anomaly Detection Engine!

    • OCI Data sources
        • Discovering infrastructure resources
        • Collecting key performance metrics (Latency, IOPS, Utilization)
    • OnCommand Insight Server
        • Computes the service path and relationships
        • Realizes all the Application resources
        • Packages and sends the data to the Elastic ML (job)
    • Anomaly Detection Engine – Elastic ML
        • Learns and models normal behavior and detects anomalies
    • OnCommand Insight UI
        • Presents the Application anomaly score, with anomalous resources
    [Diagram labels: Data Sources, Anomaly Detection Engine – Elastic ML]
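
The slide does not show what the packaged data or the ML job looks like. The sketch below is one guess at the "package and send" step, written in Python without any network calls. The job layout follows the documented Elastic ML anomaly detection job format (`analysis_config`, `detectors`, `partition_field_name`, `influencers`, `data_description`), but the field names `resource_id`, `application`, and `timestamp`, the one-detector-per-metric design, and the 15 minute bucket span are assumptions, not details from the talk.

```python
import json
from typing import Dict, Iterable

# Metric names taken from the slide; the lowercase field names are assumed.
METRICS = ("latency", "iops", "utilization")


def build_job_config(bucket_span: str = "15m") -> Dict:
    """Build a hypothetical anomaly detection job body.

    One `mean` detector per metric, partitioned by resource so that each
    resource gets its own baseline.
    """
    detectors = [
        {
            "function": "mean",
            "field_name": metric,
            "partition_field_name": "resource_id",  # assumed field name
        }
        for metric in METRICS
    ]
    return {
        "description": "Application infrastructure anomaly detection (sketch)",
        "analysis_config": {
            "bucket_span": bucket_span,
            "detectors": detectors,
            "influencers": ["application", "resource_id"],  # assumed fields
        },
        "data_description": {"time_field": "timestamp", "time_format": "epoch_ms"},
    }


def package_samples(samples: Iterable[Dict]) -> str:
    """Serialize samples as newline-delimited JSON, one document per line."""
    return "".join(json.dumps(sample, sort_keys=True) + "\n" for sample in samples)


job = build_job_config()
payload = package_samples(
    [
        {"timestamp": 1519862400000, "application": "billing",
         "resource_id": "vol-1", "latency": 1.8, "iops": 420.0, "utilization": 0.37},
    ]
)
print(json.dumps(job, indent=2))
print(payload, end="")
```

How the server actually delivers the data to the job (for example, through a datafeed or by posting documents) is not stated on the slide, so the sketch stops at building the payload.
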
  14. Anomaly Detection Results

    Application Landing Page
    § Forensic view – Application infrastructure resource stack
    § Overall Anomaly Score and Time
    § Highlight anomalous resources – 1, 2, or 3 blue bars to indicate the significance of the Anomaly
    § Call out resources for further investigation
    § Application Anomaly Score chart
    [Screenshot callouts: Application Anomaly Score and the Time of Anomaly, Add to Expert View, # of resources, Forensic View, Anomalous resource, Application Anomaly Score at this time]
  15. Anomaly Types - Pattern Anomaly

    Detecting a Change in Behavior
    [Chart labels: Pattern - Rhythm, Break in the pattern]
  16. Anomaly Types – Point and Change Point Anomaly

    Detecting a Change in Behavior
    [Chart labels: Point Anomaly, Change Point, Crash]
  17. Summary – The Path to Intelligent Operation

    OnCommand Insight’s customers appreciate:
    • The scale of our solution
    • The data richness and the data quality
    • Understanding the path, the relationships, and enriching the data with business context
    • The powerful search and flexible visualization of the data with the topologies
    • Analytics and Machine learning for proactive alerting
    OCI technology helps our customers become aware of issues before they become a problem, preventing an approaching outage in their environment.
    Another leap forward in the path to Intelligent Operation
  18. Lessons Learned - Elastic

    • Embedding works for us!
    • Elasticsearch works really well with timeseries and heavy analysis
    • Elasticsearch can do a lot, but it can be further extended with plugins
    • Plugins are good but… documentation is scarce and the code is tied to the Elasticsearch version
    • Elasticsearch is better with a small number of large indices than with a large number of small indices
    • Rollover API can be your friend
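
As a small illustration of the Rollover API point, the sketch below builds the request for Elasticsearch's `POST /<alias>/_rollover` endpoint with its documented `max_age`, `max_docs`, and `max_size` conditions, and mimics the documented naming rule where a trailing number in the index name is incremented and zero-padded to six digits. The alias and index names and the condition values are made up for the example, and nothing is sent over the network.

```python
import re
from typing import Dict, Optional, Tuple


def build_rollover_request(
    alias: str,
    max_age: Optional[str] = None,
    max_docs: Optional[int] = None,
    max_size: Optional[str] = None,
) -> Tuple[str, str, Dict]:
    """Return (method, path, body) for a rollover call on `alias`.

    Only the conditions that were provided are included in the body.
    """
    candidates = {"max_age": max_age, "max_docs": max_docs, "max_size": max_size}
    conditions = {name: value for name, value in candidates.items() if value is not None}
    return "POST", f"/{alias}/_rollover", {"conditions": conditions}


def next_index_name(current: str) -> str:
    """Mimic the default naming: increment the numeric suffix, pad to 6 digits."""
    match = re.fullmatch(r"(.*-)(\d+)", current)
    if match is None:
        raise ValueError("index name must end with '-' followed by a number")
    prefix, number = match.groups()
    return f"{prefix}{int(number) + 1:06d}"


# Hypothetical alias and limits; tune these to the data volume at hand.
request = build_rollover_request("oci-metrics", max_age="7d", max_size="50gb")
print(request)
print(next_index_name("oci-metrics-000001"))  # oci-metrics-000002
```

Rolling over on size or age rather than creating one index per resource or per day is one way to keep the number of indices small, in line with the lesson above.
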
  19. Lessons Learned – Anomaly Detection

    Applying Domain Expertise! The math might be right, but this is not always enough.
    Excluding Anomalies Below the Thresholds
    • A change in very small numbers (0.005 – 0.5) is mathematically significant.
    • Yet whether it becomes an interesting anomaly is very case specific!
    Dormant Resources
    • Resources that do very little work – mostly inactive
    • A sudden, even subtle change in the performance can generate an anomaly
    • In most cases this is not a critical resource to alert on
    These resources are not excluded from the learning, only from the results
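
The last line of the slide describes filtering at the results stage rather than at the input stage. A minimal sketch of that idea in Python follows; the record shape, the per-metric floor values, and the way dormant resources are identified are all assumptions for illustration, not the OCI implementation.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass
class AnomalyRecord:
    """Hypothetical shape of one anomaly result."""
    resource_id: str
    metric: str
    actual: float  # observed metric value at the time of the anomaly
    score: float   # anomaly score reported by the detection engine


def filter_results(
    records: Iterable[AnomalyRecord],
    min_actual: Dict[str, float],
    dormant: Set[str],
) -> List[AnomalyRecord]:
    """Drop anomalies that domain knowledge says are not worth surfacing.

    The model still sees every data point; only the reported results
    are filtered.
    """
    kept = []
    for record in records:
        if record.resource_id in dormant:
            continue  # mostly inactive resource: learn from it, do not alert
        if record.actual < min_actual.get(record.metric, 0.0):
            continue  # value too small to matter, even if statistically unusual
        kept.append(record)
    return kept


records = [
    AnomalyRecord("vol-1", "latency", 0.4, 92.0),   # tiny absolute value
    AnomalyRecord("vol-2", "latency", 25.0, 88.0),  # kept
    AnomalyRecord("vm-idle", "iops", 900.0, 75.0),  # dormant resource
]
kept = filter_results(records, min_actual={"latency": 1.0}, dormant={"vm-idle"})
print([r.resource_id for r in kept])  # ['vol-2']
```

Keeping the filter outside the model means the thresholds can be tuned per metric without retraining anything.
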