Where We Are Now
• 30 billion log messages a day
• Between 7 and 90 days of retention based on log type
• Between 170 and 180 billion logs on disk
• Daily Kafka throughput of 44 to 48 Tb/day
Scaling And Performance Issues
Elasticsearch 1.3.1 / Logstash 1.5.x / Kibana 3.x
• … search for disk and memory resources
• Facet searches are expensive in both heap and CPU
Scaling And Performance Issues
Elasticsearch 1.3.1 / Logstash 1.5.x / Kibana 3.x
• … sometimes contain binary data
• Cluster crash recovery took between 5 and 6 hours
The New Cluster
• Double host count for log sources
• Ingest double the number of log types from existing hosts
• Stop dropping haproxy logs
• Handle 2+ years of growth, including spikes of up to 2x traffic
Log Archiving
Elasticsearch 1.5.4 / Logstash 1.5.x / Kibana 3.x
• Logstash has a file output plugin
• Dump the logs to disk by tier and type, then compress with pigz -9
• AES encrypt and upload to S3
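The compress-then-encrypt step can be sketched with Ruby's standard library. This is a minimal sketch, not the pipeline from the talk: the slide names only "pigz -9" and "AES", so the cipher mode (AES-256-CBC), the random IV prepended to the output, and the helper names `archive_blob` and `restore_blob` are all assumptions. Key management and the S3 upload are left out.

```ruby
require 'zlib'
require 'openssl'

# Gzip at the highest compression level (the same gzip format pigz -9 writes),
# then encrypt. AES-256-CBC and the "IV first" layout are assumptions made
# for this sketch; the slides only say "AES encrypt".
def archive_blob(plaintext, key)
  compressed = Zlib.gzip(plaintext, level: Zlib::BEST_COMPRESSION)
  cipher = OpenSSL::Cipher.new('aes-256-cbc')
  cipher.encrypt
  cipher.key = key                 # must be exactly 32 bytes
  iv = cipher.random_iv            # fresh IV per archive
  iv + cipher.update(compressed) + cipher.final
end

# Reverse of archive_blob: split off the 16-byte IV, decrypt, gunzip.
def restore_blob(blob, key)
  decipher = OpenSSL::Cipher.new('aes-256-cbc')
  decipher.decrypt
  decipher.key = key
  decipher.iv = blob.byteslice(0, 16)
  compressed = decipher.update(blob.byteslice(16..-1)) + decipher.final
  Zlib.gunzip(compressed)
end
```

Compressing before encrypting matters: encrypted bytes look random and no longer compress, so the order on the slide (compress, then encrypt) is the one that saves storage.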
The New, New Cluster Design Goals
• Limit cluster node counts for performance and stability
• Move to Elasticsearch 2.x and Kibana 4.x
• Handle 2+ years of growth
A Word About Cluster Sizes
Elasticsearch 2.2.1 / Logstash 2.x / Kibana 4.x
• … node
• Every node must acknowledge every change in cluster state
• ZenDisco timeouts are set for a reason
A Word About Tribe Nodes
Elasticsearch 2.2.1 / Logstash 2.x / Kibana 4.x
• … they connect to
• Queries issued to tribe nodes go to all clusters
• Search results are merged and returned to the tribe node
• For large clusters, upgrades are no longer possible
• Stability is questionable for the tribe node
E_TOO_MUCH_OPEX
Elasticsearch 2.2.1 / Logstash 2.x / Kibana 4.x
• … on disk
• Logs are between 673 and 2600 bytes each
• Not all logs are created equal
• Some can be dropped early
Consistent Hashing
Elasticsearch 2.2.1 / Logstash 2.x / Kibana 4.x
• … inspected closely
• We need them for trending and health checking
• We need a consistent portion of the logs
Consistent Hashing
Elasticsearch 2.2.1 / Logstash 2.x / Kibana 4.x
• … balancer
• Pass the UUIDs along with the request through the stack
• Hash the UUID with murmur3
• code => "require 'murmurhash3'; event['request_id_sampling'] = (MurmurHash3::V32.str_hash(event['[request_id]']).to_f/4294967295)"
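The idea behind the filter can be sketched outside Logstash: hash the request UUID with 32-bit murmur3, scale the hash into [0, 1], and keep the event when the value falls below a sampling rate. The sketch below includes a pure-Ruby MurmurHash3 (x86, 32-bit) so it runs without the murmurhash3 gem. The helper names `sampling_value` and `keep?`, the threshold comparison, and the choice to divide by 2^32 - 1 (the largest 32-bit value) are assumptions for illustration, not code from the talk.

```ruby
MASK32 = 0xFFFFFFFF

def rotl32(x, r)
  ((x << r) | (x >> (32 - r))) & MASK32
end

# MurmurHash3 x86 32-bit over the bytes of str.
def murmur3_32(str, seed = 0)
  c1 = 0xcc9e2d51
  c2 = 0x1b873593
  bytes = str.bytes
  h = seed & MASK32
  full = bytes.length / 4
  full.times do |i|
    # Read each 4-byte block as a little-endian integer.
    k = bytes[4 * i] | (bytes[4 * i + 1] << 8) |
        (bytes[4 * i + 2] << 16) | (bytes[4 * i + 3] << 24)
    k = rotl32((k * c1) & MASK32, 15)
    k = (k * c2) & MASK32
    h = rotl32(h ^ k, 13)
    h = (h * 5 + 0xe6546b64) & MASK32
  end
  tail = bytes[(full * 4)..-1]
  unless tail.empty?
    k = 0
    tail.each_with_index { |b, i| k |= b << (8 * i) }
    k = rotl32((k * c1) & MASK32, 15)
    h ^= (k * c2) & MASK32
  end
  # Finalization mix.
  h ^= bytes.length
  h ^= h >> 16
  h = (h * 0x85ebca6b) & MASK32
  h ^= h >> 13
  h = (h * 0xc2b2ae35) & MASK32
  h ^ (h >> 16)
end

# Map a request ID to a stable value in [0, 1].
def sampling_value(request_id)
  murmur3_32(request_id).to_f / MASK32
end

# Hypothetical helper: keep roughly `rate` of all requests.
def keep?(request_id, rate)
  sampling_value(request_id) < rate
end
```

Because the value depends only on the request ID, every tier that logs the same request makes the same keep or drop decision, so a sampled request keeps all of its log lines across the stack.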
Leading Wildcard Searches
• This means reading all fields of all documents
• With hundreds of billions of logs in indexes, this can take… time
• indices.query.query_string.allowLeadingWildcard : false
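Why a leading wildcard is so much more expensive than a trailing one can be illustrated with a sorted list of terms. This is a deliberately simplified model, not how Lucene stores its term dictionary; the function names and the "terms examined" counter are hypothetical and exist only to show the difference in work.

```ruby
# Prefix query ("term0999*"): binary search to the first candidate,
# then read forward until the prefix stops matching.
def prefix_matches(sorted_terms, prefix)
  examined = 0
  start = sorted_terms.bsearch_index do |t|
    examined += 1
    t >= prefix
  end
  matches = []
  return [matches, examined] if start.nil?
  sorted_terms[start..-1].each do |t|
    examined += 1
    break unless t.start_with?(prefix)
    matches << t
  end
  [matches, examined]
end

# Leading wildcard ("*999"): sort order does not help, so every term is read.
def suffix_matches(sorted_terms, suffix)
  matches = sorted_terms.select { |t| t.end_with?(suffix) }
  [matches, sorted_terms.length]
end

terms = (0...10_000).map { |i| format('term%05d', i) }
_, prefix_cost = prefix_matches(terms, 'term0999')
_, suffix_cost = suffix_matches(terms, '999')
puts "prefix query examined #{prefix_cost} terms, leading wildcard examined #{suffix_cost}"
```

The prefix lookup touches a few dozen terms while the leading wildcard touches all of them, which is the cost the setting above switches off.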
Where We Are Now
• … hours
• Now, a full restart of three clusters takes less than 45 minutes
• More reliable, faster and more resilient to failure
• > 380,000 logs/s during stall recovery
Testing Is Important Cardinality Aggregations and You
• … uses HyperLogLog
• 5.3.x has a (since fixed) bug where search result buckets are allocated before circuit breakers are evaluated
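For readers unfamiliar with HyperLogLog, the following is a minimal sketch of the plain algorithm: a fixed array of registers estimates the number of distinct values without storing them. It is not Elasticsearch's implementation (Elasticsearch uses the HyperLogLog++ variant), and the class name, the default precision of 12, and the use of SHA-1 as the hash are assumptions chosen to keep the example self-contained.

```ruby
require 'digest'

class HyperLogLogSketch
  def initialize(precision = 12)
    @p = precision
    @m = 1 << precision              # number of registers (4096 by default)
    @registers = Array.new(@m, 0)
  end

  def add(value)
    # 64-bit hash from the first 8 bytes of SHA-1 (arbitrary choice for this sketch).
    h = Digest::SHA1.digest(value.to_s).unpack1('Q>')
    rest_bits = 64 - @p
    index = h >> rest_bits                      # top p bits pick the register
    rest = h & ((1 << rest_bits) - 1)
    rank = rest_bits - rest.bit_length + 1      # leading zeros in the rest, plus one
    @registers[index] = rank if rank > @registers[index]
  end

  def estimate
    alpha = 0.7213 / (1 + 1.079 / @m)
    raw = alpha * @m * @m / @registers.sum { |r| 2.0**(-r) }
    zeros = @registers.count(0)
    # Small-range correction: fall back to linear counting.
    return @m * Math.log(@m.to_f / zeros) if raw <= 2.5 * @m && zeros > 0
    raw
  end
end

hll = HyperLogLogSketch.new
50_000.times { |i| hll.add("user-#{i}") }
puts "estimated distinct values: #{hll.estimate.round}"
```

Memory stays at a fixed number of registers no matter how many values are added, and the answer is an estimate rather than an exact count. That trade-off is why results from cardinality aggregations are worth testing against known data.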
Where We Go Next
• … testing
• Support > 400,000 logs/s in 2018
• Explore containerized deployment
• Reduce use of custom tooling, move to _rollover and curator