Project “Instance Lottery” @ JAWS PANKRATION 2024

Project “Instance Lottery” (Instance Gatcha)
Hiroshi Hayakawa

Hiroshi HAYAKAWA
• AWS Community Builders (Security & Identity)
• AWS Ambassadors
• Japan AWS Top Engineers (Services)
• Japan AWS All Certifications Engineers
• GameDay enthusiast: 🥇x2 🥈x1 🥉x3
• Favorites: GuardDuty, Step Functions
(Photo from AWS Blog)

Agenda
1. Introduction
2. Measurement methods
3. Lottery Results
4. Takeaways

1. Introduction

1.1 Motivations
• Trigger = an X post
  ◦ Implementing proactive instance replacement when an instance shows a noticeable difference in network latency to an RDS endpoint
• First impression:
  ◦ Distance-induced latency will be around 500 microseconds and is unlikely to impact performance for most applications.
  ◦ Other factors in the AWS infrastructure may be causing latency.
cf. https://docs.aws.amazon.com/whitepapers/latest/aws-fault-isolation-boundaries/availability-zones.html
Distances between two DCs < 100 km?
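As a rough sanity check on the "around 500 microseconds" figure, the sketch below estimates the one-way propagation delay over optical fiber. The refractive index of 1.47 and the 100 km distance are assumptions for illustration, not values from the talk, and real paths add switching and routing overhead on top of pure propagation.

```python
# Rough estimate of distance-induced latency over optical fiber.
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum, km/s
FIBER_REFRACTIVE_INDEX = 1.47      # assumed typical value for silica fiber


def one_way_delay_us(distance_km: float) -> float:
    """Propagation delay in microseconds over `distance_km` of fiber."""
    speed_in_fiber = SPEED_OF_LIGHT_KM_S / FIBER_REFRACTIVE_INDEX
    return distance_km / speed_in_fiber * 1_000_000


# 100 km is the assumed upper bound on the distance between two DCs.
delay = one_way_delay_us(100)
print(f"one-way: {delay:.0f} us, round trip: {2 * delay:.0f} us")
```

Under these assumptions the one-way delay for 100 km comes out slightly under 500 microseconds, which is consistent with the first impression above.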

1.2 Goals and approach
• Goals
  ◦ Diving deep into instance "hit or miss"
• Approach
  ◦ Launch as many EC2 instances as possible
  ◦ Measure latency and look for instances whose latency differs significantly from the others

2. Measurement Methods

2.1 Measurement method
• Selected method:
  ◦ Develop a simple tool “DBPing” (Psycopg 3) that connects to and disconnects from RDS and records the turnaround time (TAT)
  ◦ Collect tcpdump logs and calculate TAT and round-trip time (RTT, taken from the three-way handshake sequence) at the TCP level
[Figure: TCP packet sequence from Connect (SYN), SYN+ACK, Connected (ACK) through Disconnect (FIN+ACK) and Disconnected (ACK), with the RTT and TAT intervals marked]
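The DBPing tool itself is not shown in the deck. The following is only a minimal sketch of the connect-and-disconnect timing idea using Psycopg 3, not the author's implementation. The function name `measure_tat` and the injectable `connect` parameter are hypothetical, added so the timing logic can be exercised without a database.

```python
import time
from typing import Callable, Optional


def measure_tat(conninfo: str, connect: Optional[Callable] = None) -> float:
    """Return the seconds taken to open and close one database connection."""
    if connect is None:
        # Psycopg 3 is imported lazily so that a stub can be injected instead.
        import psycopg
        connect = psycopg.connect
    start = time.perf_counter()
    conn = connect(conninfo)  # TCP handshake plus PostgreSQL startup/auth
    conn.close()              # disconnect from the server
    return time.perf_counter() - start
```

A call might look like `measure_tat("host=<rds-endpoint> dbname=postgres user=<user>")`, where the connection string is a placeholder. Note that this application-level TAT includes PostgreSQL authentication, so it is not the same quantity as the TCP-level RTT derived from tcpdump.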

2.2 Validation of the selected method
• Objectives:
  ◦ Check the reliability of the measurement method
  ◦ Check whether the metrics are suitable to identify the “hit-or-miss” of instances
  ◦ Determine the configuration of large-scale measurements (# of trials/instance, instance types, …)
• Configuration:
  ◦ EC2 instance type: m7g.large & t4g.large (2 vCPUs)
  ◦ RDS instance type: db.m7g.large, PostgreSQL (15.8-R1)
  ◦ Place EC2 instances and RDS in the same AZ (usw2-az2)
  ◦ Run DBPing every second for 3 hours for each instance

2.3 Validation results #1
• RTT is stable enough, and outliers will have little impact on app performance.
• The first measurement of each instance should be excluded.
[Figure: box plot of the measurements. Annotations: narrow box range (small deviation); largest value appears at the 1st trial; outliers are sudden and not consistent]
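To illustrate how the first trial can be dropped before judging stability, here is a small sketch that computes box-plot statistics over the remaining samples. The function name `summarize_rtt` and the sample values are made up for illustration and are not data from the talk.

```python
import statistics


def summarize_rtt(samples_us):
    """Drop the first trial, then return box-plot statistics in microseconds."""
    steady = samples_us[1:]  # exclude the first measurement of the instance
    q1, median, q3 = statistics.quantiles(steady, n=4)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": q3 - q1,  # width of the box; a narrow box means small deviation
        "max": max(steady),
    }


# Illustrative values only: a large first trial followed by stable samples.
print(summarize_rtt([900, 100, 110, 120, 130, 140]))
```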

2.4 Validation results #2
• TAT or RTT as KPI
  ◦ Little correlation seen between RTT and TAT
    -> TAT is not suitable for deciding whether an instance needs replacement due to network latency
• Instance type difference:
  ◦ c7g.large shows 10% faster TAT than m4g.large.
  ◦ RTT shows no difference, as expected.
    -> Burstable instances are applicable.
[Figure: RTT vs. app-level TAT, showing little correlation; series: c7g.large, m4g.large]
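One common way to quantify "little correlation" between paired RTT and TAT samples is the Pearson correlation coefficient. The deck does not state which measure was used, so the sketch below is an assumption about how such a check could be done, with hypothetical names throughout.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sample lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

A value near 0 for per-trial (RTT, TAT) pairs would support the reading that app-level TAT is dominated by factors other than network round-trip time.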

3. Lottery Results

3.1 Lottery Time
• Configuration:
  ◦ EC2
    - Auto Scaling group across four AZs in Oregon with Spot Instances
    - Instance type: any (2+ vCPUs, 1 GB+ memory) except for A1
    - Tested 12,239 instances
  ◦ RDS
    - Instance type: db.m7g.large, PostgreSQL (15.8-R1)
  ◦ DBPing
    - Run every second, 60 times for each instance (1st measurement to be excluded)
  ◦ Placement: us-west-2a

3.2 Trends by AZs
• Cross-AZ communication tends to have larger latency and spikes.
• Any instance may experience spikes in latency regardless of AZs.

3.3 Trends by AZs #2
• RTT trends:
  ◦ Communication cost: Same-AZ < 376 microseconds; Cross-AZ < 0.96 ms
  ◦ Spike size: Same-AZ < 567 microseconds; Cross-AZ < 3.2 ms
  ◦ Spikes / Instance: P99 2.0, MAX 4.0
• No consistent latency that would affect most applications was observed.
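The deck does not define how a spike was detected. As a sketch only, the code below assumes a spike is any sample above a fixed multiple of the instance's median RTT, then summarizes spikes per instance with a nearest-rank percentile. The threshold factor of 2.0 and both function names are hypothetical.

```python
import math
import statistics


def count_spikes(rtts_us, factor=2.0):
    """Count samples above `factor` x the instance's median RTT.

    The spike rule is an assumption; the first measurement is excluded.
    """
    steady = rtts_us[1:]
    threshold = factor * statistics.median(steady)
    return sum(1 for rtt in steady if rtt > threshold)


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (0 < pct <= 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

Applying `count_spikes` to each instance's 60 measurements and then taking `percentile_nearest_rank(counts, 99)` and `max(counts)` would yield figures comparable in form to the "Spikes / Instance" row, though the actual values depend on the spike definition used.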

3.4 Trends by Instance types
• Regardless of instance type…
  ◦ The occurrence rate of RTT spikes appears almost the same.
  ◦ Spikes occur approximately once per 60 measurements.

3.5 Trends by Instance types #2
• t4g.micro instances seem to have larger differences in spike size among instances of the same type.
• Further investigation is required before drawing a conclusion, to rule out the possibility that the sample does not capture the characteristics of the population well.

4. Summary

4. Takeaways
• From the measurement results,
  ◦ RTT is small enough and stable for most applications.
  ◦ There is little justification for proactively replacing instances.
• Be aware of spikes when designing latency-sensitive systems
• [Tips] Config is costly…
• Communication cost: Same-AZ < 376 microseconds; Cross-AZ < 0.96 ms
• Spike size: Same-AZ < 567 microseconds; Cross-AZ < 3.2 ms
• Spikes / Instance: P99 2.0, MAX 4.0

Thank you!

Hiroshi Hayakawa (p0n)
