Pro:
    · Powerful query language
    · No predefined schema
  ‒ Con:
    · ~14 days retention
    · High load ⇒ ingest backs up (logs up to 30 minutes late)
    · $$$
• Splunk contract up for renewal Oct 2016
• Let’s use Elasticsearch, that’s what the cool kids are doing
0.2)
• Learned the basics of keeping a cluster alive
  ‒ Cluster state!
  ‒ Mappings!
  ‒ Routing!
  ‒ Hot/warm!
  ‒ Index management!
• Forgot most of it just in time to do the same thing all over again at Lyft
experience
  ‒ Stable APIs are great (except when they’re not)
• Still time-based/manually time-sharded indices
• Still Logstash*/Kibana
  ‒ And their warts, er, quirks
to Elasticsearch 5, but
• Amazon was dragging their feet on upgrades
  ‒ They got better towards the end
• Amazon makes parts of the recommended index lifecycle difficult
  ‒ Shrink in particular
• Not Amazon’s fault: some parts of the lifecycle are counterproductive
  ‒ Shrinking turns out to be bad for query performance
• Definitely Amazon’s fault: EBS
  ‒ Newer instance types are EBS-only, and EBS performance/reliability is sub-optimal for Elasticsearch at scale
  ‒ Instance storage is limited and bound to instance type
  ‒ Ingest timeouts? Retention shrinking? Kibana slow? Scale up!
• 100k epm → 1.5M epm
  ‒ Amazon’s biggest cluster
• Then we hit Amazon’s cluster node limit
  ‒ 20 nodes at the time, eventually 40
• Then
getting the hiccups
• Cluster’s red, we’re not sure why*
  ‒ * more on this later
• Can infer through CloudWatch that one node is sick
  ‒ High CPU, JVM memory pressure (GC death spiral)
• Not unusual, relatively simple to fix:
  ‒ Just restart Elasticsearch
  ‒ If that doesn’t work:
    ‒ Add a replacement node
    ‒ Disable routing to sick node
    ‒ Wait for shards to evacuate
    ‒ Decommission sick node
But on AWS...
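The replace-a-sick-node routine above, as it would run on a self-managed cluster, can be sketched roughly as follows. This is a minimal illustration, not the speaker's tooling: it only builds the request body for `PUT _cluster/settings` and inspects already-parsed `GET _cat/shards?format=json` output. The helper names, the sample IPs and node names, and the handling of the `node` column for relocating shards are assumptions.

```python
def exclude_node_body(node_ip):
    """Body for PUT _cluster/settings that tells the cluster to move
    shards off the node with this IP ("disable routing to sick node")."""
    return {"transient": {"cluster.routing.allocation.exclude._ip": node_ip}}


def shards_left_on(cat_shards, node_name):
    """Count shards still hosted on node_name.

    cat_shards is the parsed JSON from GET _cat/shards?format=json.
    Assumption: a relocating shard reports its source node first,
    as "source -> ip id target"; unassigned shards have no node.
    """
    count = 0
    for row in cat_shards:
        node = row.get("node") or ""
        if node == node_name or node.startswith(node_name + " -> "):
            count += 1
    return count


def safe_to_decommission(cat_shards, node_name):
    """True once every shard has evacuated the sick node."""
    return shards_left_on(cat_shards, node_name) == 0


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    sample = [
        {"index": "logs-1", "shard": "0", "state": "STARTED", "node": "node-a"},
        {"index": "logs-1", "shard": "1", "state": "RELOCATING",
         "node": "node-sick -> 10.0.0.7 abc123 node-b"},
        {"index": "logs-2", "shard": "0", "state": "UNASSIGNED", "node": None},
    ]
    print(exclude_node_body("10.0.0.12"))
    print(safe_to_decommission(sample, "node-sick"))
```

Polling `safe_to_decommission` until it returns true corresponds to the "wait for shards to evacuate" step; only then is the node removed.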
hours) (during business hours)
  ‒ First-line support: “I see that your cluster is red”
  ‒ “Please give us the output of these API endpoints …”
• 2. Escalate to ES team engineers
  ‒ “We see that one of your nodes needs to be shot”
  ‒ “We see JVM memory pressure is high, please try to reduce it”
  ‒ “Can you maybe stop logging so much?”
  ‒ Wait some more
• 3. Expedite, option 1: call the TAM
  ‒ Eventually started going directly through TAM to engineers, who knew the routine
• 4. Expedite, option 2: roll the cluster
  ‒ Trivial change to IAM role ⇒ get an entirely new cluster (blue/green deploy)
  ‒ Would often get stuck “between” deploys, old nodes sticking around
  ‒ Still requires manual intervention by AWS support
You have opened a new Support case
• Push-button solution
• Great for many use cases
What it isn’t: a fully functional Elasticsearch cluster
• The whole thing is behind a gateway
  ‒ Round-robin load balancer
  ‒ 60s timeout (on everything)
• Most APIs are obfuscated
• Configuration change ⇒ whole new cluster
Cold? Ingest? Tribe?
  ‒ How many instances?
  ‒ i2? r3? c4?
  ‒ How many nodes per instance?
• Index lifecycle management
  ‒ Rollover
  ‒ Alias management
  ‒ Bootstrap? Move? Shrink?
• Find the land mines (a.k.a. character-building opportunities)
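For the rollover and alias-management items above, a rough sketch of the request bodies involved may help. It assumes the Elasticsearch 5 Rollover API (`POST /<write-alias>/_rollover` with `max_age` / `max_docs` conditions) and an index whose name ends in a zero-padded counter. The alias name, the thresholds and the helper names are hypothetical.

```python
def bootstrap_index_body(write_alias):
    """Body for PUT /<first-index> (e.g. logs-000001): creates the first
    index with the write alias already attached."""
    return {"aliases": {write_alias: {}}}


def rollover_body(max_age="1d", max_docs=None):
    """Body for POST /<write-alias>/_rollover. The alias moves to a new
    index as soon as any listed condition is met."""
    conditions = {"max_age": max_age}
    if max_docs is not None:
        conditions["max_docs"] = max_docs
    return {"conditions": conditions}


def next_index_name(current):
    """Mimic how rollover names the new index: the trailing counter is
    incremented and zero-padded to six digits."""
    prefix, _, counter = current.rpartition("-")
    return f"{prefix}-{int(counter) + 1:06d}"


if __name__ == "__main__":
    print(bootstrap_index_body("logs-write"))
    print(rollover_body(max_age="1d", max_docs=500_000_000))
    print(next_index_name("logs-000001"))
```

Writers only ever see the alias, so the index behind it can change without touching the ingest pipeline.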
big enough company starts to look a lot like Logging-as-a-Service (but you can yell at your customers)
Who’s logging?
• All engineers
  ‒ Owned services
  ‒ Upstream services
• Security
  ‒ Enriched audit logs
• Data teams
Some logs are more important than others
• Info vs. warn/error/critical
• 200 vs. 500
a bit of a pitfall
• Same index, multiple types
• Namespacing is a must
• Mapping conflicts cause missing logs
  ‒ Mitigated (mostly) by namespaces
• Perfect world:
  ‒ Stable event IDs
  ‒ One doc_type
  ‒ Better-behaved logs
• “Log everything” ≠ “log anything”
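To make the namespacing point concrete: if two services both emit a `status` field, one as an integer and one as a string, they fight over a single mapping and one side's documents get rejected. Nesting each service's fields under its own key keeps them apart. The sketch below is one possible layout, not necessarily the one used in the talk; the function names and the reserved top-level keys are assumptions.

```python
RESERVED = {"@timestamp", "service"}


def namespace_event(service, fields, timestamp):
    """Wrap a service's free-form fields under a key named after the
    service, so its field names cannot collide with other services'."""
    if service in RESERVED:
        raise ValueError(f"service name {service!r} clashes with a reserved key")
    return {"@timestamp": timestamp, "service": service, service: dict(fields)}


def field_paths(doc, prefix=""):
    """List dotted field paths, i.e. the names the mapping would contain."""
    paths = []
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            paths.extend(field_paths(value, path + "."))
        else:
            paths.append(path)
    return sorted(paths)


if __name__ == "__main__":
    a = namespace_event("api", {"status": 500}, "2017-03-01T00:00:00Z")
    b = namespace_event("billing", {"status": "declined"}, "2017-03-01T00:00:01Z")
    print(field_paths(a))  # status now lives at api.status
    print(field_paths(b))  # and here at billing.status
```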
Would reliably kill a large enough cluster
  ‒ Hacked periodic manual updates as a workaround
• “View surrounding documents”
  ‒ Also used to murder the cluster (by blasting a search to every single index)
• Lots of mappings?
  ‒ Refreshing mappings in Kibana console can break in several ways
It Builds Character
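The "search every single index" problem is avoidable in principle when indices are time-sharded: a query around one event only needs the indices that cover its time window. The sketch below illustrates that idea for a hypothetical custom tool. It is not how Kibana works or was fixed, and it assumes daily indices named with Logstash's default `logstash-YYYY.MM.DD` pattern.

```python
from datetime import datetime, timedelta


def indices_for_window(center, window, prefix="logstash-"):
    """Return the daily index names covering [center - window, center + window]."""
    start = (center - window).date()
    end = (center + window).date()
    days = (end - start).days
    return [f"{prefix}{start + timedelta(days=i):%Y.%m.%d}" for i in range(days + 1)]


if __name__ == "__main__":
    event_time = datetime(2017, 3, 1, 0, 5)
    # Ten minutes either side of an event just after midnight touches two indices.
    names = indices_for_window(event_time, timedelta(minutes=10))
    print(",".join(names))  # comma-separated list usable as the index part of a search URL
```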
this endpoint
  ‒ The overhead acted as a load multiplier and reliably brought us down
• Allocation settings
  ‒ “enable”: “none” (the “page me at 3am” button)
• Routing settings
  ‒ Easy to mess these up and end up with eternally unassigned shards and a red cluster
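The danger with `"enable": "none"` is forgetting to turn allocation back on. One defensive pattern, sketched here under assumptions, is to wrap maintenance in a context manager that always restores `cluster.routing.allocation.enable` to `"all"`, even if the work inside fails. `put_settings` is a hypothetical callable that sends a body to `PUT _cluster/settings`; nothing here is taken from the speaker's tooling.

```python
from contextlib import contextmanager

ALLOCATION_KEY = "cluster.routing.allocation.enable"
VALID_MODES = {"all", "primaries", "new_primaries", "none"}


def allocation_body(mode):
    """Body for PUT _cluster/settings that sets shard allocation mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown allocation mode: {mode!r}")
    return {"transient": {ALLOCATION_KEY: mode}}


@contextmanager
def allocation_disabled(put_settings):
    """Disable shard allocation for the duration of the block, then
    re-enable it no matter how the block exits."""
    put_settings(allocation_body("none"))
    try:
        yield
    finally:
        put_settings(allocation_body("all"))


if __name__ == "__main__":
    sent = []  # stand-in for a real HTTP call, for illustration only
    with allocation_disabled(sent.append):
        pass  # e.g. restart a node here
    print(sent)
```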
spirals
• Use G1GC
  ‒ Seriously, turn it on right now
  ‒ Lots of FUD online about data corruption
  ‒ No more GC spirals (at all) (ever)
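Turning on G1 amounts to swapping the collector flags in Elasticsearch's `jvm.options`. The sketch below does that to the file's text; it assumes the stock Elasticsearch 5 file, which enables CMS through the three flags listed in `CMS_FLAGS`. The function name is hypothetical, and whether G1 suits a given JVM version and heap size is for the operator to verify.

```python
# CMS-related flags assumed to be present in a stock ES 5 jvm.options file.
CMS_FLAGS = (
    "-XX:+UseConcMarkSweepGC",
    "-XX:CMSInitiatingOccupancyFraction",
    "-XX:+UseCMSInitiatingOccupancyOnly",
)
G1_FLAG = "-XX:+UseG1GC"


def switch_to_g1(jvm_options_text):
    """Return jvm.options text with the CMS flags removed and G1 enabled.
    All other lines (heap size, comments, ...) are kept unchanged."""
    kept = []
    for line in jvm_options_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(CMS_FLAGS):
            continue  # drop CMS configuration
        if stripped == G1_FLAG:
            continue  # avoid duplicates; the flag is appended once below
        kept.append(line)
    kept.append(G1_FLAG)
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    original = (
        "-Xms4g\n"
        "-Xmx4g\n"
        "-XX:+UseConcMarkSweepGC\n"
        "-XX:CMSInitiatingOccupancyFraction=75\n"
        "-XX:+UseCMSInitiatingOccupancyOnly\n"
    )
    print(switch_to_g1(original))
```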
Cluster died at 11:45pm sharp every Saturday
• Mystified us for weeks
• Looked at random instance metrics
• “Hmm, why is it stuck in iowait for 2 hours?”
Engineering and support are improving
Elasticsearch is great, but
• Never intended to be a TSDB
• Need to add your own tools
Know what you’re getting into
• Know your scale
• Know your data
• No Wrong Way to get logs into ES
  ‒ (but some are better than others)
In Conclusion