Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
AWS SQS queues & Kubernetes Autoscaling Pitfall...
Search
Eric Khun
October 26, 2020
Programming
480
2
Share
AWS SQS queues & Kubernetes Autoscaling Pitfalls Stories
Talk at the Cloud Native Computing Foundation meetup @dcard.tw
Eric Khun
October 26, 2020
More Decks by Eric Khun
See All by Eric Khun
From PHP to Golang: Migrating a real-time data replication service
erickhun
1
170
Other Decks in Programming
See All in Programming
ルールルルルルRubyの中身の予備知識 ── RubyKaigiの前に予習しなイカ?
ydah
1
160
ドメインイベントでビジネスロジックを解きほぐす #phpcon_odawara
kajitack
2
600
煩雑なSkills管理をSoC(関心の分離)により解決する――関心を分離し、プロンプトを部品として育てるためのOSSを作った話 / Solving Complex Skills Management Through SoC (Separation of Concerns)
nrslib
4
870
RSAが破られる前に知っておきたい 耐量子計算機暗号(PQC)入門 / Intro to PQC: Preparing for the Post-RSA Era
mackey0225
3
130
YJITとZJITにはイカなる違いがあるのか?
nakiym
0
200
Rethinking API Platform Filters
vinceamstoutz
0
11k
Linux Kernelの1文字のミスで 権限昇格ができた話
rqda
0
2.3k
forteeの改修から振り返るPHPerKaigi 2026
muno92
PRO
3
270
レガシーPHP転生 〜父がドメインエキスパートだったのでDDD+Claude Codeでチート開発します〜
panda_program
0
720
おれのAgentic Coding 2026/03
tsukasagr
1
140
瑠璃の宝石に学ぶ技術の声の聴き方 / 【劇場版】アニメから得た学びを発表会2026 #エンジニアニメ
mazrean
0
240
Radical Imagining - LIFT 2025-2027 Policy Agenda
lift1998
0
280
Featured
See All Featured
How to Align SEO within the Product Triangle To Get Buy-In & Support - #RIMC
aleyda
1
1.5k
Pawsitive SEO: Lessons from My Dog (and Many Mistakes) on Thriving as a Consultant in the Age of AI
davidcarrasco
0
110
So, you think you're a good person
axbom
PRO
2
2k
Facilitating Awesome Meetings
lara
57
6.8k
Exploring anti-patterns in Rails
aemeredith
3
320
Organizational Design Perspectives: An Ontology of Organizational Design Elements
kimpetersen
PRO
1
670
How to optimise 3,500 product descriptions for ecommerce in one day using ChatGPT
katarinadahlin
PRO
1
3.5k
How Software Deployment tools have changed in the past 20 years
geshan
0
33k
JAMstack: Web Apps at Ludicrous Speed - All Things Open 2022
reverentgeek
1
420
Are puppies a ranking factor?
jonoalderson
1
3.3k
The Illustrated Guide to Node.js - THAT Conference 2024
reverentgeek
1
330
The Director’s Chair: Orchestrating AI for Truly Effective Learning
tmiket
1
150
Transcript
AWS SQS queues & Kubernetes Autoscaling Pitfalls Stories Cloud Native
Foundation meetup @dcard.tw @eric_khun
Make it work, Make it right, Make it fast kent
beck (agile manifesto - extreme programming)
Make it work, Make it right, Make it fast kent
beck (agile manifesto - extreme programming)
Make it work, Make it right, Make it fast kent
beck (agile manifesto - extreme programming)
Buffer
None
Buffer • 80 employees , 12 time zones, all remote
Quick intro
None
Main pipelines flow
it can look like ... golang Talk @Maicoin :
None
None
How do we send posts to social medias?
A bit of history... 2010 -> 2012: Joel (founder/ceo) 1
cronjob on a Linode server $20/mo 512 mb of RAM 2012 -> 2017 : Sunil (ex-CTO) Crons running on AWS ElasticBeanstalk / supervisord 2017 -> now: Kubernetes / CronJob controller
AWS Elastic Beanstalk: Kubernetes:
At what scale? ~ 3 million SQS messages per hour
Different patterns for many queues
Are our workers (consumers of the SQS queues ) efficients?
Are our workers efficients?
Are our workers efficients?
Empty messages? > Workers tries to pull messages from SQS,
but receive “nothing” to process
Number of empty messages per queue
Sum of empty messages on all queues
None
1,000,000 API calls to AWS costs 0.40$ We have 7,2B
calls/month for “empty messages” It costs ~$25k/year > Me:
None
AWS SQS Doc
None
Or in the AWS console
Results?
empty messages
AWS
None
$120 > $50 saved daily > $2000 / month >
$25,000 / year (it’s USD, not TWD)
Paid for querying “nothing”
(for the past 8 years )
Benefits - Saving money - Less CPU usage (less empty
requests) - Less throttling (misleading) - Less containers > Better resources allocation: memory/cpu request
Why did that happen?
Default options
None
Never questioning what’s working decently or the way it’s been
always done
What could have helped? Infra as code (explicit options /
standardization) SLI/SLOs (keep re-evaluating what’s important) AWS architecture reviews (taging/recommendations from aws solutions architects)
Make it work, Make it right, Make it fast
Make it work, Make it right, Make it fast
Do you remember?
None
None
None
Need to analytics on Twitter/FB/IG/LKD… on millions on posts faster
workers consuming time
None
What’s the problem?
Resources allocated and not doing anything most of the time
Developer trying to put find compromises on the number of workers
How to solve it?
Autoscaling! (with Keda.sh) Supported by IBM / Redhat / Microsoft
None
Results
None
But notice anything?
Before autoscaling
After autoscaling
After autoscaling
What’s happening?
Downscaling
Why?
delete pod lifecycle
what went wrong - Workers didn’t manage SIGTERM sent by
k8s - Kept processing messages - Messages were halfway processed and killed - Messages were sent back to the the queue again - Less workers because of downscaling
solution - When receiving SIGTERM stop processing new messages -
Set a graceful period long enough to process the current message if (SIGTERM) { // finish current processing and stop receiving new messages }
None
None
And it can also help with sqs empty messages
Make it work, Make it right, Make it fast
Make it work, Make it right, Make it fast
Thanks!
Questions? monitory.io taiwangoldcard.com travelhustlers.co ✈