The Tail at Scale - HasGeek Meetup

The Tail at Scale Jeffrey Dean & Luiz André Barroso
https://research.google/pubs/pub40801/ Piyush Goel @pigol1 [email protected]

What are Percentiles? • A percentile (or a centile) is
a measure used in statistics indicating the value below which a given percentage of observations in a group of observations falls. For example, the 20th percentile is the value (or score) below which 20% of the observations may be found. Equivalently, 80% of the observations are found above the 20th percentile. -- Wikipedia • Percentile Latency - The latency number below which a certain percentage of requests fall. A 95%tile of 100ms means 95% of the request are being served under 100ms.

What is Tail Latency? • Last 0.X% of the request
latency distribution graph. • In general we can take slowest 1% response times or the 99%ile response times as the tail latency of that request. ◦ May vary depending on the scale of the system ▪ InMobi - 8B Ad Requests/day (99.5 %tile) ▪ Capillary - 500M Requests/day (99% tile) • Responsive/Interactive Systems ◦ Better User Experience, Fluidity -- Higher Engagement ◦ 100ms -- Large Information Retrieval Systems ◦ Reads out number the writes!

Why should we care about Tail Latency? • Tail latency
becomes important as the scale of our system increases. ◦ 10K requests/day → 100M requests/day ◦ Slowest 1% queries - 100 requests → 1M requests • Just as fault-tolerant computing aims to create a reliable whole out of less-reliable parts, large online services need to create a predictably responsive whole out of less-predictable parts; we refer to such systems as “latency tail-tolerant,” or simply “tail-tolerant.” • Many Tail Tolerant Methods can leverage infra provided for Fault Tolerance - Better Utilisation

Why does such Variability Exists? • Software Variability ◦ Shared
Resources ▪ Co-hosted applications • Containers can provide isolation. • Within app-contention still exists. ◦ Daemons ▪ Background or batch jobs running in the background. ◦ Global Resource Sharing ▪ Network Switches, Shared File Systems, Shared Disks. ◦ Maintenance activities (Log compaction, data shuffling, etc). ◦ Queueing (Network buffers, OS Kernels, intermediate hops) • Other Aspects - Hardware Variability, Garbage Collection, Energy Management

Variability is amplified by Scale! • At large scale, components
or parallel operations increase. (Fan-outs) ◦ Micro-Services ◦ Data Partitions • Further increase in the overall latency of the request. ◦ Overall Latency ≥ Latency of Slowest Component • Server with 1 ms avg. but 1 sec 99%ile latency ◦ 1 Server: 1% of requests take ≥ 1 sec ◦ 100 Servers: 63% of requests take ≥ 1 sec (Fan-out)

The effect of large fan-out on Latency Distributions.

At Google Scale - 10ms 99% percentile for any single
request, the 99% percentile for all requests to ﬁnish is 140ms, and the 95% percentile is 70ms. ◦ meaning that waiting for the slowest 5% of the requests to complete is responsible for half of the total 99%-percentile latency. Techniques that concentrate on these slow outliers can yield dramatic reductions in overall service performance.

Can we control Variability?

Reducing Component Variability - Some Practices • Prioritize Interactive Requests
or Real-time requests over background requests ◦ Differentiating service classes and prefer higher-level queuing (Google File System) ◦ AWS uses something similar, albeit for Load Shedding [Ref#2] • Reducing head-of-line blocking ◦ Break long-running requests into a sequence of smaller requests to allow interleaving of the execution of other short-running requests. (Search Queries) ◦ Example - Pagination requests when scanning large lists. • Managing background activities and synchronized disruption ◦Throttling, Service Operation Breakdowns. ◦Example - Anti-virus, security scans, log compression can be run during off-business hours. • Caching doesn’t impact variability much unless whole data is residing in the cache.

Variability is Inevitable!

Living with Variability! • ~Impossible to eliminate all latency variability
-- especially with the Cloud Era! • Develop tail-tolerant techniques that mask or work around temporary latency pathologies, instead of trying to eliminate them altogether. • Classes of Tail Tolerant Techniques ◦ Within Request // Immediate Response Adaptations ▪ Focus on reducing variations within a single request path. ▪ Time Scale - 10ms ◦ Cross Request Long Term Adaptations ▪ Focus on holistic measures to reduce the tail at the system level. ▪ Time Scale - 10 seconds and above.

Within Request Immediate Response Adaptations

Within Request Immediate Response Adaptations • Cope with a slow
subsystem in context of a higher level request • Time Scale == Right Now ( User is waiting ) • Multiple Replicas for additional throughput capacity. ◦ Availability in the presence of failures. ◦ This approach is particularly effective when most requests operate on largely read-only, loosely consistent datasets. ◦ Spelling Correction Service, Contacts Lookup - Write Once Read in Millions • Replication (request & data) can be used to reduce variability in a single higher-level request ◦ Hedged Requests ◦ Tied Requests

• Issue Same Request to multiple replicas & use the
first quickest response. ◦ Send a request to a most appropriate Replica (Appropriate Definition is Open) ◦ In case, no response within a threshold, issue to another replica. ◦ Cancel the outstanding requests after receiving the first response. • Can amplify the traffic significantly if not implemented prudently. ◦ One such approach is to defer sending a secondary request until the first request has been outstanding for more than the 95th-percentile expected latency for this class of requests. This approach limits the additional load to approximately 5% while substantially shortening the latency tail. ◦ Mitigates the effect of external interferences. Not the same request slowness. Hedged Requests

• Google Benchmark ◦ Read 1,000 keys stored in a
BigTable table ◦ Distributed across 100 different servers. ◦ Sending a hedging request after a 10ms delay reduces the 99.9th-percentile latency for retrieving all 1,000 values from 1,800ms to 74ms while sending just 2% more requests. • Vulnerabilities ◦ Multiple Servers might execute the same requests - redundant computation. ◦ 95% tile techniques reduces the impact. ◦ Further reduction requires aggressive cancellation of requests. Hedged Requests

• Twitter Finagle ◦ https://twitter.github.io/finagle/docs/com/twitter/finagle/context/BackupRequest$.html ◦ Server as a Function
- https://dl.acm.org/doi/abs/10.1145/2626401.2626413 • Envoy Proxy ◦ https://www.envoyproxy.io/docs/envoy/v1.12.0/intro/arch_overview/http/http_routing.html#arc h-overview-http-routing-hedging • Linkerd ◦ Open Feature Request - https://github.com/linkerd/linkerd/issues/1069 • Spring Reactor ◦ https://spring.io/blog/2019/01/23/spring-tips-hedging-client-requests-with-the-reactive-webcli ent-and-a-service-registry Hedged Requests - Usable Examples

Queueing Delays • Queueing delay add variability before a request
begins execution. • Once a request is actually scheduled and begins execution, the variability of its completion time goes down substantially. • Mitzenmacher* - Allowing a client to choose between two servers based on queue lengths at enqueue time exponentially improves load-balancing performance over a uniform random scheme. * Mitzenmacher, M. The power of two choices in randomized load balancing. IEEE Transactions on Parallel and Distributed Computing 12, 10 (Oct. 2001), 1094–1104.

Tied Requests • Instead of choosing one, enqueue in multiple
servers. • Send the identity of the other servers as a part of the request -- Tieing! • Send a cancellation request to the other servers once a server picks it off the queue. • Corner Case: ◦ What if both servers pick up the request while the cancellation messages are in transit? Network Delay? ◦ Typical under low traffic scenario when server queues are empty. ◦ Client can introduce a small delay of 2X the average network message delay (1ms or less in modern data-center networks) between the first request and the second request.

• Tied Request Performance on BigTable - Uncached Read Requests
from Disk. • Case 1 - Tests in Isolation. No external interference. • Case 2 - Concurrent Sorting Job running along with the benchmark test. • In both scenarios, the overhead of tied requests in disk utilization is less than 1%, indicating the cancellation strategy is effective at eliminating redundant reads. • Tied requests allow the workloads to be consolidated into a single cluster, resulting in dramatic computing cost reductions.

Cross Request Long Term Adaptations

Cross Request Long Term Adaptations • Reducing latency variability caused
by coarse-grain phenomenon ◦ Load imbalance. (Unbalanced data partitions) ▪ Centered on data distribution/placement techniques to optimize the reads. ◦ Service-time variations ▪ Detecting and mitigating effects of service performance variations. • Simple Partitioning ◦ Partitions have equal cost. Static assignment of a partition to a single machine? ◦ Sub-Optimal ▪ Performance of the underlying machines is neither uniform nor constant over time (Thermal Throttling and Shared workload Interference) ▪ Outliers/Hot Items can cause data-induced load imbalance • Particular item becomes popular and the load for its partition increases.

Micro-Partitioning • # Partitions > # Machines • Partition large
datasets into multiple pieces (10-100 per machine) [BigTable - Ref#3] • Dynamic assignment and Load balancing of these partitions to particular machines.

Micro-Partitioning* • Fine grain load balancing - Move the partitions
from one machine to another. • Average ~20 partitions per machine, the system can shed load in 5% increments and in 1/20th the time it would take if the system had a one-to-one mapping of partitions to machines • Using such partitions also lead to an improved failure recovery rate. ◦ More nodes to takeover the work of a failed node. * Stoica I., Morris, R., Karger, D., Kaashoek, F., and Balakrishnan, H. Chord: A scalable peer-to-peer lookup service for Internet applications. In Proceedings of SIGCOMM (San Diego, Aug. 27–31). ACM Press, New York, 2001, 149–160. • Does this sound familiar? (Hint : Think Amazon!) • DynamoDB Paper also talks about a similar concept of multiple virtual nodes mapped to physical nodes. [Ref#4]

Micro-Partitioning - Practical Examples • Cassandra - Partition key ◦
https://www.datastax.com/blog/2016/02/most-important-thing-know-cassandra-data-modeling-primary-key • Hbase - Row Keys ◦ https://medium.com/@ajaygupta.hbti/design-principles-for-hbase-key-and-rowkey-3016a77fc52d • Riak - Partition Keys ◦ https://docs.riak.com/riak/kv/latest/learn/concepts/clusters/index.html

Selective Replication • An enhancement of Micro-partitioning. • Detection or
Prediction of items that are likely to cause load imbalance. • Create additional replicas of these items/micro-partitions. • Distribute the load among more replicas rather than moving the partitions across nodes. Practical Examples • HDFS ◦ Replication Factor ◦ Rebalancer (Human Triggered) • Cassandra ◦ Auto-rebalancing on topology changes. ◦ https://docs.datastax.com/en/opscenter/6.5/opsc/online_help/opscRebalanceCluster.html

Latency Induced Probation Techniques • Servers can sometimes become slow
dues to: ◦ May be Data issues ◦ Most likely Interference issues (discussed earlier) • Intermediate Servers - Observe the Latency distribution of the fleet. • Put a slow server on Probation in case of slowness. ◦ Pass shadow requests to collect statistics. ◦ Put the node into rotation if it stabilizes. • Reducing Server Capacity during load can improve overall latency!

Large Information Retrieval Systems • Latency is a key quality
metric. • Retrieving good results quickly is better than returning best results slowly. • Good Enough Results? ◦ Sufficient amount of corpus has been searched - return the results without waiting for the rest of the queries. ◦ Tradeoff between Completeness vs. Responsiveness • Canary Requests ◦ High Fan out systems. ◦ Requests may hit untested code paths - crash or degrade multiple servers simultaneously. ◦ Forward to limited leaf servers and send to the fleet only if successful.

Canary Requests • Istio ◦ https://istio.io/latest/blog/2017/0.1-canary/ • AWS API Gateway
◦ https://docs.aws.amazon.com/apigateway/latest/developerguide/create-canary-deployment.html • Netflix Zuul ◦ https://cloud.spring.io/spring-cloud-netflix/multi/multi__router_and_filter_zuul.html

Mutations/Writes in Data Intensive Services • For Data Latency variability
for mutation is relatively easier. ◦ The scale of latency-critical modifications is generally small. ◦ Updates can often be performed off the critical path, after responding to the user. ◦ Many services can be structured to tolerate inconsistent update models for (inherently more latency-tolerant) mutations. • Services that require consistent updates, commonly used techniques are quorum-based algorithms (Paxos*) ◦ These algorithms touch only three to five replicas, they are inherently tail-tolerant. *Lamport, L. The part-time parliament. ACM Transactions on Computer Systems 16, 2 (May 1998), 133–169.

Hardware Trends • Some Bad: ◦ More aggressive power optimizations
to save energy can lead to an increase in variability due to added latency in switching from/to power saving mode. • Variability technique friendly trends: ◦ Lower latency data center networks can make things like Tied Request Cancellations work better.

Conclusions • Living with Variability ◦ Fault-tolerant techniques needed as
guaranteeing fault-free operation isn’t feasible beyond a point. ◦ Not possible to eliminate all source of variabilities. • Tail Tolerant Techniques ◦ Will become more important with the hardware trends -- software level handling. ◦ Require some additional resources but with modest overheads. ◦ Leverage the capacity deployed for redundancy and fault tolerance. ◦ Can drive higher resource utilization overall with better responsiveness. ◦ Common Patterns - easy to bake into baseline libraries.

References & Further Reading 1. https://cseweb.ucsd.edu/~gmporter/classes/sp18/cse124/post/schedule/p74-dean.pdf 2. https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/ 3. Chang
F., Dean J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R.E. BigTable: A distributed storage system for structured data. In Proceedings of the Seventh Symposium on Operating Systems Design and Implementation (Seattle, Nov.). USENIX Association, Berkeley CA, 2006, 205–218. 4. https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf 5. Managing Tail Latency in Datacenter-Scale File Systems Under Production Constraints Image Source - https://unsplash.com/photos/npxXWgQ33ZQ by @glenncarstenspeters

Questions?

Co-ordinates • Twitter - @pigol1 • Mail - [email protected] •
LinkedIn - https://www.linkedin.com/in/piyushgoel1/

The Tail at Scale - HasGeek Meetup

The Tail at Scale - HasGeek Meetup

More Decks by pigol

Other Decks in Technology

Featured

Transcript