Project “Instance Lottery” @ JAWS PANKRATION 2024

Slides for JAWS PANKRATION 2024 (Aug. 24-25, 2024)
Recording: https://youtu.be/NeT5xj1BXFw
My blog post: https://zenn.dev/p0n/articles/32d8893d3fdb5e
Event page: https://jawspankration2024.jaws-ug.jp/ja/timetable/TT-54/

Hiroshi Hayakawa (p0n)

August 24, 2024

Transcript

  1. Hiroshi HAYAKAWA

    • AWS Community Builders (Security & Identity)
    • AWS Ambassadors
    • Japan AWS Top Engineers (Services)
    • Japan AWS All Certifications Engineers
    • GameDay enthusiast: 🥇x2 🥈x1 🥉x3
    • Favorites: GuardDuty, Step Functions
  2. 1.1 Motivations

    • Trigger = an X post
      ◦ Implementing proactive instance replacement in response to a noticeable difference in network latency to an RDS endpoint
    • First impression:
      ◦ Distance-induced latency will be around 500 microseconds and is unlikely to affect performance for most applications.
      ◦ Other factors in the AWS infrastructure may be causing the latency.
    • Distances between two DCs < 100 km? (c.f. https://docs.aws.amazon.com/whitepapers/latest/aws-fault-isolation-boundaries/availability-zones.html)
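The 500-microsecond figure on this slide can be sanity-checked with a quick calculation. The sketch below assumes signals travel through optical fiber at roughly 200,000 km/s (about two thirds of the speed of light in vacuum) and that the path is a straight 100 km; both are illustrative assumptions, not values from the slides, and the function name is hypothetical.

```python
# Assumption: propagation speed in optical fiber is about 200,000 km/s.
FIBER_SPEED_KM_PER_S = 200_000.0


def propagation_delay_us(distance_km: float,
                         speed_km_per_s: float = FIBER_SPEED_KM_PER_S) -> float:
    """Return the one-way propagation delay in microseconds."""
    if speed_km_per_s <= 0:
        raise ValueError("speed must be positive")
    return distance_km * 1_000_000 / speed_km_per_s


if __name__ == "__main__":
    one_way = propagation_delay_us(100)
    print(f"one-way: {one_way:.0f} us, round trip: {2 * one_way:.0f} us")
```

Under these assumptions, 100 km corresponds to about 500 microseconds one way, so real distances between data centers in a region should cost less than that.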
  3. 1.2 Goals and approach

    • Goals
      ◦ Diving deep into instance "hit or miss"
    • Approach
      ◦ Launch as many EC2 instances as possible
      ◦ Measure latency and look for instances whose latency differs significantly from the others
  4. 2.1 Measurement method

    • Selected method:
      ◦ Develop a simple tool, “DBPing” (Psycopg 3), that connects to and disconnects from RDS and records the TAT
      ◦ Collect tcpdump logs and calculate TAT and RTT (in the three-way handshake sequence) at the TCP level
    • [Diagram: TCP sequence labeled Connect (SYN), SYN+ACK, Connected (ACK), Disconnect (FIN+ACK), Disconnected (ACK), with the RTT and TAT intervals marked]
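The two measurements described above can be sketched roughly as follows. This is not the author's DBPing tool, only a minimal illustration of the idea: `measure_tat` times a connect-then-close cycle at the application level, and `handshake_metrics` derives TCP-level values from packet timestamps. The function names, the dictionary keys, and the choice of interval boundaries (RTT from SYN to SYN+ACK, TAT from SYN to the final ACK) are assumptions made for this example. `psycopg.connect` is the Psycopg 3 entry point and is imported lazily so the timing code can be exercised without a database.

```python
import time
from typing import Any, Callable, Dict, List


def measure_tat(connect: Callable[[], Any], trials: int,
                interval_s: float = 1.0) -> List[float]:
    """Time `trials` connect-then-close cycles; return each TAT in seconds."""
    results: List[float] = []
    for i in range(trials):
        start = time.perf_counter()
        conn = connect()   # open the connection
        conn.close()       # and close it immediately
        results.append(time.perf_counter() - start)
        if i < trials - 1:
            time.sleep(interval_s)  # pause between trials
    return results


def psycopg_connector(conninfo: str) -> Callable[[], Any]:
    """Build a connect callable backed by Psycopg 3 (third-party package)."""
    import psycopg  # imported lazily; only needed for a real database
    return lambda: psycopg.connect(conninfo)


def handshake_metrics(ts: Dict[str, float]) -> Dict[str, float]:
    """Derive TCP-level RTT and TAT from capture timestamps in seconds.

    Assumed keys (hypothetical): 'syn', 'syn_ack', 'final_ack'.
    """
    return {
        "rtt": ts["syn_ack"] - ts["syn"],    # SYN -> SYN+ACK
        "tat": ts["final_ack"] - ts["syn"],  # SYN -> last ACK of the teardown
    }
```

With a real database, `measure_tat(psycopg_connector("host=... dbname=..."), trials=60)` would run one cycle per second; the connection string here is a placeholder.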
  5. 2.2 Validation of the selected method

    • Objectives:
      ◦ Check the reliability of the measurement method
      ◦ Check whether the metrics are suitable for identifying the “hit-or-miss” of instances
      ◦ Determine the configuration of the large-scale measurements (# of trials/instance, instance types, …)
    • Configuration:
      ◦ EC2 instance type: m7g.large & t4g.large (2 vCPUs)
      ◦ RDS instance type: db.m7g.large, PostgreSQL (15.8-R1)
      ◦ Place EC2 instances and RDS in the same AZ (usw2-az2)
      ◦ Run DBPing every second for 3 hours for each instance
  6. 2.3 Validation results #1

    • RTT is stable enough, and outliers will have little impact on app performance.
    • The first measurement of each instance should be excluded.
    • [Plot annotations: narrow box range; small deviation; largest value appears at the 1st trial; spikes are sudden and not consistent]
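Excluding the first measurement and checking how narrow the box range is could look like the sketch below. It is an illustration under assumptions, not the analysis used for the slides: the function name is hypothetical, samples are taken to be RTTs in microseconds, and the box range is approximated by the interquartile range from `statistics.quantiles`.

```python
import statistics
from typing import Dict, Sequence


def summarize_rtt(samples_us: Sequence[float]) -> Dict[str, float]:
    """Drop the first sample, then report median, IQR and max (microseconds)."""
    kept = list(samples_us[1:])  # the 1st trial is treated as a warm-up outlier
    if len(kept) < 2:
        raise ValueError("need at least three samples")
    q1, median, q3 = statistics.quantiles(kept, n=4)
    return {"median": median, "iqr": q3 - q1, "max": max(kept)}


if __name__ == "__main__":
    # Made-up numbers: a large first value followed by stable readings.
    print(summarize_rtt([900, 200, 210, 205, 215, 220]))
```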
  7. 2.4 Validation results #2

    • TAT or RTT as KPI
      ◦ Little correlation is seen between RTT and TAT -> TAT is not suitable for deciding whether an instance needs replacing because of network latency
    • Instance type difference:
      ◦ c7g.large shows 10% faster TAT than m4g.large.
      ◦ RTT shows no difference, as expected. -> Burstable instances are applicable.
    • [Plot: little correlation between RTT and app-level TAT; series: c7g.large, m4g.large]
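One common way to quantify "little correlation" between paired RTT and TAT samples is the Pearson correlation coefficient. The slides do not say which statistic was used, so the sketch below is an assumption for illustration only, with a hypothetical function name and made-up sample values.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sample lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    rtt_us = [200.0, 210.0, 205.0, 215.0, 220.0]   # made-up RTT samples
    tat_ms = [12.1, 11.8, 12.4, 11.9, 12.2]        # made-up TAT samples
    print(f"r = {pearson(rtt_us, tat_ms):.2f}")
```

A coefficient near zero would support the slide's reading that application-level TAT is dominated by something other than network RTT.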
  8. 3.1 Lottery Time

    • Configuration:
      ◦ EC2
        - Auto Scaling group across four AZs in Oregon with Spot Instances
        - Instance type: any (2+ vCPUs, 1 GB+ memory) except for A1
        - Tested 12,239 instances
      ◦ RDS
        - Instance type: db.m7g.large, PostgreSQL (15.8-R1)
      ◦ DBPing
        - Run every second, 60 times for each instance (1st measurement to be excluded)
        - Placement: us-west-2a
  9. 3.2 Trends by AZs

    • Cross-AZ communication tends to have larger latency and spikes.
    • Any instance may experience latency spikes regardless of AZ.
  10. 3.3 Trends by AZs #2

    • RTT trends:

                             Same-AZ               Cross-AZ
        Communication cost   < 376 microseconds    < 0.96 ms
        Spike size           < 567 microseconds    < 3.2 ms
        Spikes / Instance    P99 2.0, MAX 4.0

    • No consistent latency that would affect most applications was observed.
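The "Spikes / Instance" row summarizes how many spikes each instance saw and then takes a percentile across instances. The slides do not define a spike, so the sketch below uses a hypothetical rule (a sample above three times the instance's own median) and a nearest-rank percentile; both choices, the function names, and the sample data are assumptions for illustration.

```python
import math
import statistics
from typing import Sequence


def count_spikes(samples_us: Sequence[float], factor: float = 3.0) -> int:
    """Count samples exceeding `factor` times the median (assumed spike rule)."""
    baseline = statistics.median(samples_us)
    return sum(1 for s in samples_us if s > factor * baseline)


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values or not 0 < pct <= 100:
        raise ValueError("need values and 0 < pct <= 100")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Made-up RTT samples (microseconds) for three instances.
    per_instance = [
        [200, 210, 205, 900, 215],
        [190, 195, 200, 198, 202],
        [210, 800, 205, 950, 208],
    ]
    spikes = [count_spikes(s) for s in per_instance]
    print(spikes, percentile_nearest_rank(spikes, 99))
```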
  11. 3.4 Trends by Instance types

    • Regardless of instance type…
      ◦ The occurrence rate of RTT spikes appears almost the same.
      ◦ Spikes occur approximately once during 60 measurements.
  12. 3.5 Trends by Instance types #2

    • t4g.micro instances seem to have larger differences in spike size among instances of the same type.
    • Further investigation is required before drawing a conclusion, to rule out the possibility that the sample does not capture the characteristics of the population well.
  13. 4. Takeaways

    • From the measurement results,
      ◦ RTT is small enough and stable for most applications.
      ◦ There is little justification for proactively replacing instances.
    • Be aware of spikes when designing latency-sensitive systems.
    • [Tips] Config is costly…

                             Same-AZ               Cross-AZ
        Communication cost   < 376 microseconds    < 0.96 ms
        Spike size           < 567 microseconds    < 3.2 ms
        Spikes / Instance    P99 2.0, MAX 4.0