From resilience to ultra-resilience of data for modern applications
A deep dive into how distributed PostgreSQL is architected to meet the demands of modern cloud-native applications, along with real-life examples of customers using YugabyteDB to power a range of business-critical applications.
Run your business-critical applications using PostgreSQL-compatible & Cassandra-inspired APIs, with flexible geo-distribution and cost efficiency, without compromising on performance.
PostgreSQL is becoming the default database API
◦ Powerful RDBMS capabilities: matches Oracle features
◦ Robust and mature: hardened over 30 years
◦ Fully open source: permissive license, large community
◦ Cloud providers adopting: managed services on all clouds
"Most popular database" of 2022; "DBMS of the year" in multiple years (2017, 2018, 2020)
Not all "PostgreSQL compatibility" is created equal. There are four increasingly strict levels:
• Wire compatibility: compatible with PG client drivers
• Syntax compatibility: also parses PG syntax properly (but execution may be different)
• Feature compatibility: also supports equivalent features (but with different syntax & runtime)
• Runtime compatibility: appears and behaves just like PG to applications
Each level includes everything the levels before it provide.
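To make the first level concrete: wire compatibility means an unmodified PostgreSQL client driver can connect and issue queries. The minimal sketch below uses the stock psycopg2 driver against a YugabyteDB YSQL endpoint; the local host, default port 5433, default "yugabyte" credentials, and the accounts table are illustrative assumptions.

```python
# Minimal sketch: wire-level compatibility means a stock PostgreSQL driver
# (psycopg2 here) can talk to a YugabyteDB YSQL endpoint unchanged.
# Host, port, credentials, and the demo table are illustrative assumptions.
import psycopg2

conn = psycopg2.connect(
    host="127.0.0.1",
    port=5433,           # YugabyteDB's default YSQL port; PostgreSQL itself defaults to 5432
    dbname="yugabyte",
    user="yugabyte",
    password="yugabyte",
)

with conn, conn.cursor() as cur:
    # Plain PostgreSQL statements issued through the standard driver.
    cur.execute("CREATE TABLE IF NOT EXISTS accounts (id BIGINT PRIMARY KEY, balance NUMERIC)")
    cur.execute("INSERT INTO accounts VALUES (1, 100) ON CONFLICT (id) DO NOTHING")
    cur.execute("SELECT id, balance FROM accounts WHERE id = %s", (1,))
    print(cur.fetchone())

conn.close()
```

The higher compatibility levels determine whether statements like these not only parse but also execute with the same semantics an application would see on vanilla PostgreSQL.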
Resilience to ultra-resilience: what changed?
• Cloud native = more failures: interruptions are common
• Bigger scale = more failures: more apps as everything is digital, plus more headless services
• Viral success = more failures: unexpected successes can overwhelm systems
Outages are not uncommon
[Chart: outages per quarter in Asia Pacific]
"The share of outages costing companies more than $1 million has increased from 11% to 15% since 2019."
Source: https://foundershield.com/blog/real-world-statistics-on-managing-cloud-outage-risks/
Different failure modes require different elements of resilience
• Infrastructure failures → In-region resilience
• Region and data center outages → Multi-region BCDR
• User, app or operator errors → Data protection
• Upgrades / patching downtime → Zero-downtime operations
• Intermittent or partial failures → Grey failures
• Massive or unexpected spikes → Peak and freak events
Paramount+: Closer to Their End Users
With anticipated expansion through globalization and the release of new services and content, Paramount+ needed a database platform that could perform and scale to support peak demand and provide the best user experience.
• Multi-region / multi-cloud deployment
  ◦ High availability and resilience
  ◦ Performance at peak scale
• Compliance with local laws
  ◦ Conform to GDPR regulations
  ◦ Conform to local security laws
Super Bowl LVIII, 2024
• Use case
  ◦ Media live-streaming platform
  ◦ User registrations and entitlement lookups
• Peak
  ◦ CBS Sports' presentation of Super Bowl LVIII was the most-watched telecast in history, with 123.4 million viewers across platforms
• Challenges
  ◦ Massively scaling user entitlement lookups
  ◦ Resilience
  ◦ Low latency for users around the world
Top 5 Global Retailer: Surviving a Cloud Outage
• Use case
  ◦ Product catalog for a global top-5 retailer
  ◦ Over 1.6 billion products
• Freak events
  ◦ A snowstorm in Texas took out a cloud region
• Key challenges
  ◦ High availability: keeping the product catalog up during peak holiday season in spite of the cloud outage
  ◦ Sustaining high throughput of 250K+ TPS
• Deployment across US, EU & APJ
• Single YB cluster providing strong consistency across multiple regions
• Scalable and highly available operational data tier
• Business continuity: able to withstand a region failure with RPO=0
• Geo-partitioning, data locality & compliance (sketched below)
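The geo-partitioning pattern keeps each user's rows pinned to a local region while the cluster stays one strongly consistent database. The sketch below shows one common way to do this with YugabyteDB tablespaces and PostgreSQL list partitioning; the cloud/region/zone values, table and tablespace names, and connection settings are illustrative assumptions, not details of the deployment described above.

```python
# Hedged sketch of row-level geo-partitioning for data locality and compliance.
# Assumes a multi-region YugabyteDB cluster; the cloud/region/zone values,
# tablespace and table names, and connection settings are illustrative only.
import psycopg2

DDL = """
-- Pin the replicas of each tablespace to a single geography.
CREATE TABLESPACE eu_ts WITH (replica_placement='{"num_replicas": 3,
  "placement_blocks": [{"cloud":"aws","region":"eu-west-1","zone":"eu-west-1a","min_num_replicas":3}]}');

CREATE TABLESPACE us_ts WITH (replica_placement='{"num_replicas": 3,
  "placement_blocks": [{"cloud":"aws","region":"us-east-1","zone":"us-east-1a","min_num_replicas":3}]}');

-- Partition rows by a region column; each partition lives only in its local
-- tablespace, so EU user data stays in the EU while everything remains one
-- strongly consistent, queryable database.
CREATE TABLE users (
  id    BIGINT NOT NULL,
  geo   TEXT   NOT NULL,
  email TEXT,
  PRIMARY KEY (id, geo)
) PARTITION BY LIST (geo);

CREATE TABLE users_eu PARTITION OF users FOR VALUES IN ('eu') TABLESPACE eu_ts;
CREATE TABLE users_us PARTITION OF users FOR VALUES IN ('us') TABLESPACE us_ts;
"""

conn = psycopg2.connect(host="127.0.0.1", port=5433, dbname="yugabyte",
                        user="yugabyte", password="yugabyte")
conn.autocommit = True  # run the tablespace/partition DDL outside an explicit transaction
with conn.cursor() as cur:
    cur.execute(DDL)
conn.close()
```

Reads and writes that target a single geo partition can then be served by nodes in that region, which is also what keeps latency low for local users.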
Failures take many shapes
• Infrastructure failures
• Region and data center outages
• User, app or operator errors
• Downtime from upgrades / patching
• Intermittent or partial failures
• Massive or unexpected spikes
Traditional resilience addresses only the first two of these failure types.
What is multi-region resilience?
✓ Resilience to region / DC outages to ensure business continuity
✓ E.g., power grid failures, natural disasters
✓ Nations are increasingly mandating multi-region resilience through regulatory compliance
What can go wrong…
• Region or data center failure: low probability, but we see it happen regularly
• Failures that last a while
• Complex process to "heal" once the region / DC is back online
What you want…
• Ability to trade off between steady-state performance (latency) and potential data loss (RPO)
• Very quick recovery (low RTO)
• Ability to run DR drills (planned switchover)