Design Patterns for Multi-Region Mission Critical Apps

Join us to learn about the strategies for building ultra-resilient distributed database architectures that seamlessly handle mission-critical business demands

Cloud Service Providers have the largest infrastructure footprint in the world, spanning countries and continents. This makes it possible to build multi-Region applications that serve user requests with low latency from nearly anywhere, tolerate all sorts of possible outages including major Region-level incidents, and comply with data regulatory requirements. Join us to learn about how YugabyteDB helps with design patterns for building mission critical applications on global cloud IaaS with a focus on low latency and high availability. Explore common deployment patterns used by innovative digital native companies and global Fortune 500 corporations, and learn about trade-offs with each pattern.

Join us as we explore the essential strategies for building ultra-resilient distributed database architectures that seamlessly handle mission-critical business demands. In this session, we’ll discuss how large-scale companies in #media, #retail, #financialservices, and beyond leverage YugabyteDB on public cloud IaaS to support globally distributed applications. Whether you're aiming for high availability, scalable data platforms, or cross-region resilience, this talk will give you actionable insights on architecting for scale and reliability in the cloud.

AMEY BANARSE

March 19, 2025

Transcript

  1. © 2024 – All Rights Reserved

    Run your business-critical applications using PostgreSQL-compatible and Cassandra-inspired APIs while enjoying seamless scalability, built-in resilience, flexible geo-distribution, and cost efficiency, without compromising on performance.
  2. YugabyteDB: Building on Top of Postgres Innovations

    1. Distributed Postgres: Fully PostgreSQL compatible API for workload portability. Leverage resilience, dynamic scalability, and multi-site distribution in the DB to make your app cloud native.
    2. Enterprise Grade by Default: Bring capabilities of leading commercial RDBMS to Postgres in a cloud-native architecture, e.g. DR & replication, perf & observability, security, etc.
    3. Architected as a Cloud DBMS: Postgres architected as a flexible managed service in any public or private cloud. We are reimagining Postgres as a native cloud service, not just running it in the cloud.
  3. PostgreSQL has become the default database API

    ◦ Powerful RDBMS capabilities: matches Oracle features
    ◦ Robust and mature: hardened over 30 years
    ◦ Fully open source: permissive license, large community
    ◦ Cloud providers adopting: managed services on all clouds
    “Most popular database” of 2022*
    “DBMS of the year” over multiple years** (2017, 2018, 2020)
    * https://www.eversql.com/most-popular-databases
    ** https://db-engines.com/en/blog_post/85
    © 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  4. PostgreSQL Compatibility and Cloud Native Architecture are Critical

    How much Postgres compatibility?
    • Wire: Can use PostgreSQL client drivers and psql shell
    • Syntax: Parses PG syntax, but execution is different
    • Feature: Supports some advanced PG features, but they will work differently
    • Runtime: Exactly like Postgres. Port over all existing apps, PG developers instantly at home.

    How cloud native (distributed) is the architecture?
    • Low: Cannot deliver high data durability, availability, scale, best in class DR
    • Medium: Delivers data durability and some vertical scale. Weak HA, horizontal scale, DR
    • High: High data durability, availability, scalability, DR, multi-region
  5. PostgreSQL Compatibility and Cloud Native Architecture are Critical

    How much Postgres compatibility?
    • Wire: Can use PostgreSQL client drivers and psql shell
    • Syntax: Parses PG syntax, but execution is different
    • Feature: Supports some advanced PG features, but they will work differently
    • Runtime: Exactly like Postgres. Port over all existing apps, PG developers instantly at home.

    How cloud native (distributed) is the architecture?
    • Low: Cannot deliver high data durability, availability, scale, best in class DR
    • Medium: Delivers data durability and some vertical scale. Weak HA, horizontal scale, DR
    • High: High data durability, availability, scalability, DR, multi-region

    PG Innovation Threshold: Can benefit from Postgres innovation (like pgvector for genAI, QoS for multi-tenancy, etc.)
  6. PostgreSQL Compatibility and Cloud Native Architecture are Critical

    How much Postgres compatibility?
    • Wire: Can use PostgreSQL client drivers and psql shell
    • Syntax: Parses PG syntax, but execution is different
    • Feature: Supports some advanced PG features, but they will work differently
    • Runtime: Exactly like Postgres. Port over all existing apps, PG developers instantly at home.

    How cloud native (distributed) is the architecture?
    • Low: Cannot deliver high data durability, availability, scale, best in class DR
    • Medium: Delivers data durability and some vertical scale. Weak HA, horizontal scale, DR
    • High: High data durability, availability, scalability, DR, multi-region

    PG Innovation Threshold: Can benefit from Postgres innovation (like pgvector for genAI, QoS for multi-tenancy, etc.)
    Cloud DBMS Innovation Threshold: Can innovate on distributed, cloud native architecture (like zero downtime, global apps, fast auto-scaling, connection scaling, etc.)
  7. PostgreSQL Compatibility and Cloud Native Architecture are Critical

    How much Postgres compatibility?
    • Wire: Can use PostgreSQL client drivers and psql shell
    • Syntax: Parses PG syntax, but execution is different
    • Feature: Supports some advanced PG features, but they will work differently
    • Runtime: Exactly like Postgres. Port over all existing apps, PG developers instantly at home.

    How cloud native (distributed) is the architecture?
    • Low: Cannot deliver high data durability, availability, scale, best in class DR
    • Medium: Delivers data durability and some vertical scale. Weak HA, horizontal scale, DR
    • High: High data durability, availability, scalability, DR, multi-region

    PG Innovation Threshold: Can benefit from Postgres innovation (like pgvector for genAI, QoS for multi-tenancy, etc.)
    Cloud DBMS Innovation Threshold: Can innovate on distributed, cloud native architecture (like zero downtime, global apps, fast auto-scaling, connection scaling, etc.)
    Above both thresholds: Can innovate on both dimensions
  8. PostgreSQL Compatibility and Cloud Native Architecture are Critical

    How much Postgres compatibility?
    • Wire: Can use PostgreSQL client drivers and psql shell
    • Syntax: Parses PG syntax, but execution is different
    • Feature: Supports some advanced PG features, but they will work differently
    • Runtime: Exactly like Postgres. Port over all existing apps, PG developers instantly at home.

    How cloud native (distributed) is the architecture?
    • Low: Cannot deliver high data durability, availability, scale, best in class DR
    • Medium: Delivers data durability and some vertical scale. Weak HA, horizontal scale, DR
    • High: High data durability, availability, scalability, DR, multi-region

    Chart markers: PG Innovation Threshold, Cloud DBMS Innovation Threshold
  9. Resilience

    The ability of a system to readily respond to or recover from change, disruption, or a crisis
  10. Resilience was always critical: so what changed?

    • Cloud Native = More Failures: Commodity servers fail, network interruptions are common
    • Bigger Scale = More Failures: More apps as everything is digital and more headless services
    • Viral Success = More Failures: Unexpected successes can overwhelm systems
  11. Modern applications demand ultra-resilience

    • Customers expect always-on apps
    • Nations run on digital infrastructure
    • Brand reputation requires uptime
  12. Modern applications need resilience built into Postgres, not layered on top of it.
  13. How do you architect for zero downtime?

    • Assume nodes and zones will fail often
    • Users should have zero impact
    • RPO=0, RTO~3s with sync replication
    • Replication lag typically <500 ms with async
    Diagram: Sync replication across regions within a cluster (Region 1, Region 2, Region 3)
    Diagram: Async replication between two clusters in different regions
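    The trade-off between the two replication modes on this slide can be made concrete with a little arithmetic. The following is a minimal sketch, not YugabyteDB code: the `ReplicationMode` type, the function name, and the write rate of 1,000 writes per second are hypothetical, and 500 ms is used as an upper bound taken from the slide's "typically <500 ms" figure.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationMode:
    name: str
    synchronous: bool
    lag_ms: float  # assumed replication lag; ignored when synchronous


def max_unreplicated_writes(mode: ReplicationMode, writes_per_sec: float) -> int:
    """Rough upper bound on acknowledged writes missing from the replica at failover."""
    if mode.synchronous:
        # A synchronous commit is acknowledged only once it has been replicated,
        # so no acknowledged write can be lost (RPO = 0).
        return 0
    # With asynchronous replication, writes made during the lag window may be lost.
    return math.ceil(writes_per_sec * mode.lag_ms / 1000.0)


sync_mode = ReplicationMode("sync across regions", True, 0.0)
async_mode = ReplicationMode("async between clusters", False, 500.0)

print(max_unreplicated_writes(sync_mode, 1000))   # 0
print(max_unreplicated_writes(async_mode, 1000))  # 500
```

    Under these assumptions, an unplanned failover of the asynchronous pair could lose up to about half a second of writes, which is the price paid for avoiding cross-region latency on every commit.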
  14. From resilience to ultra-resilience… for no downtime, no limits

    • In-Region resilience
    • Multi-Region BCDR
    • Zero-downtime operations
    • Data protection
    • Peak and freak events
    • Grey failures
  15. Let’s dive into real-world examples of ultra-resilient architectures
  16. Business Objective: Get Paramount+ Closer to Our End Users!

    With the anticipated expansion through globalization and release of new services and content, Paramount+ needs a database platform that can perform and scale to support our peak demands to provide the best user experience.
    • Multi-Region/Cloud Deployment
      ◦ High Availability and Resilience
      ◦ Performance at Immense Scale
    • Compliance with local laws
      ◦ Conform to GDPR Regulations
      ◦ Conform to Local Security Laws
  17. YugabyteDB & Paramount+: Global Authentication and User Profiles

    Use Case: Powers user login, authorizes content viewing, and manages profile information (watchlist, account details, preferences, etc.). YugabyteDB is the system of record for global authentication and for all user profiles used to view content.

    Past Challenges: Paramount+ original single-region architecture with MySQL
    • Slow read performance due to MySQL being limited to a single region
    • No horizontal scalability due to limited primary and follower architecture (a single 64-core node handled all writes)
    • Costly downtimes due to no region-level fault tolerance in the single-region architecture
    • Potential data loss due to high replication lag and potential primary node failure
  18. New multi-region design that powered Super Bowl 2024

    • Multi-Region: Stretch Sync deployment
      ◦ Verified RPO 0 and RTO 10 secs on failures
      ◦ Global DB: 3 Regions (east, central, west) on public cloud IaaS, 5 AZs, replication RF5
    • Performance and scalability
      ◦ Read latencies < 30ms, transactional multi-region write latencies 100ms
      ◦ Scaled clusters seamlessly for peak events (AFC playoffs, Top Gun: Maverick launch, etc.)
    • PostgreSQL runtime compatibility
    • Live Migration from MySQL to YugabyteDB
    • Ecosystem integrations and extensibility
      ◦ Compliance with local laws for data residency
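    Why RF5 over three regions can ride out a full region outage follows from majority-quorum arithmetic. The sketch below is illustrative only: the slide gives RF5, 3 regions, and 5 AZs but not the per-region replica counts, so the 2/1/2 split is an assumption, and the function names are hypothetical.

```python
from typing import Dict


def majority(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority quorum."""
    return replication_factor // 2 + 1


def survives_loss_of(placement: Dict[str, int]) -> Dict[str, bool]:
    """For each region, report whether a quorum remains if that whole region fails."""
    rf = sum(placement.values())
    quorum = majority(rf)
    return {region: (rf - replicas) >= quorum for region, replicas in placement.items()}


# Assumed placements: replicas per region, summing to RF5.
balanced = {"east": 2, "central": 1, "west": 2}
skewed = {"east": 3, "central": 1, "west": 1}

print(survives_loss_of(balanced))  # every region can fail without losing quorum
print(survives_loss_of(skewed))    # losing "east" leaves 2 of 5 replicas: no quorum
```

    The takeaway is that the replica split matters as much as the replication factor: no single region may hold a majority of the replicas.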
  19. YugabyteDB & Paramount+: Real World Success Story

    Consistent global growth on YugabyteDB
    • Launched AFCs, Grammys, Top Gun: Maverick
    • Expecting 36x growth across some events
    • Helped expand business to the EU region for Paramount+ International
    https://www.paramountpressexpress.com/cbs-sports/shows/nfl-on-cbs/releases/?view=109115-nfl-on-cbs-scores-the-most-watched-nfl-divisional-playoff-game-ever-with-more-than-50-million-viewers
  20. Peak event: Super Bowl 2024

    • Use case
      ◦ Media live-streaming platform
      ◦ User registrations and entitlement lookup
    • Peak Scale
      ◦ CBS Sports presentation of Super Bowl LVIII was the most-watched telecast in history, with 123.4 million viewers across platforms
    • Challenges
      ◦ Massively scaling user entitlements lookup
      ◦ Resilience
      ◦ Low latency for users around the world
  21. YugabyteDB & Large Financial Investment Firm: Success Story

    Use Case: This financial app aggregates and stores retail customers' portfolio data, organizing it by customer/account and transmitting it to fintechs and third-party aggregators such as Intuit.

    Past Challenges
    • Federal regulations mandated increased data retention from 4 days to 2 years, but the projected cost of meeting this on the current solution was high
    • Anticipated the service becoming more critical, resulting in more queries and a need for more efficiency and flexibility
    • Had to move from on-prem to AWS to support resilience and higher scale at lower cost and with higher agility

    Key Requirements
    • Support 150 TB (up from 2 TB)
    • Support 150K ops/sec (up from 1K ops/sec)
    • 600K bulk ops/sec
    • Predictable reads at scale with P99 10ms
    • Resilience and availability: effective business continuity for cloud outage scenarios at very high throughput
  22. Large Financial Investment Firm: Technical Results Achieved

    • Scalability
      ◦ Production cluster has 15 nodes across 3 AZs, RF3, with data scale up to 150 TB
      ◦ Demonstrated throughput of 700K ops/sec
      ◦ Data retention of 2 years achieved through the TTL feature
    • Performance
      ◦ The production universe continues to ingest several TBs of financial data per week
      ◦ P99 read latency < 3ms
      ◦ P99 write latency 5ms
    • Resilience: YugabyteDB is deployed in each AWS region across multiple zones with RF3 to ensure business continuity even on cloud outages (node or zone failures)
    • Disaster Recovery
      ◦ YugabyteDB clusters in two AWS regions (us-east-1 and us-east-2)
      ◦ Bidirectional, asynchronous data replication between them
      ◦ A planned failover (where the data replication was drained before failing over) was completed and verified to have RPO 0 and RTO 10 sec
      ◦ The deployment continues to meet the high throughput and low latency needs
  23. xCluster Asynchronous Replication: eliminate cross-region latency

    • Master Cluster 1 in Region 1 (Availability Zones 1, 2, 3): Consistent Across Zones, No Cross-Region Latency for Both Writes & Reads
    • Master Cluster 2 in Region 2 (Availability Zones 1, 2, 3): Consistent Across Zones, No Cross-Region Latency for Both Writes & Reads
    • Async Replication between the two clusters
  24. Large Credit Bureau - Consumer Credit Portal

    A credit reporting agency that collects and analyzes information about consumers and businesses.

    Use Case: Tier 0 business system which stores and manages credit information for all U.S. individuals, providing read access to customers and consumers. This app enables users to:
    • Stay on top of changes to personal data
    • Maintain a healthy credit score
    • Protect against identity theft
    • Monitor credit reports and receive alerts on changes

    Why YugabyteDB? The incumbent Oracle Exadata in an on-prem datacenter environment limited scale and business agility.
    • Move from traditional data centers to the cloud (multi-region)
    • Migration from Oracle to a cloud native DB for resilience & scale
    • Replace Oracle GoldenGate replication
    • Consolidated solution for APIs and batch processing, integrated with the Spark data lake environment
  25. Global Credit Bureau can survive regional cloud outage

    Multi-Region YugabyteDB Cluster (US-West, US-Central, US-East)
    • 50 nodes across 3 AWS Regions: US-East, US-West, and US-Central
    • Implement Preferred Leaders for low latency access in US-Central
    • Cluster- and topology-aware drivers quickly identify newly added nodes
    • Supports reactive microservices & event-driven architecture patterns
    • Transactional Change Data Capture (CDC) to the downstream data lakehouse
  26. Global Retailer survives regional cloud outage

    Multi-Region YugabyteDB Cluster (US-West, US-Central, US-East)

    YugabyteDB design
    • Service will remain resilient and available through the failure of any entire single region
    • Applications automatically redirected to other live regions
    • No data loss (RPO 0) and RTO 10 secs
    • Sustaining high throughput of 250K TPS & geo-distributed for low latency read access

    Use Case: Tier 0 business system which stores and manages credit information for all U.S. individuals, providing read access to customers and consumers
  27. Large Credit Bureau - Learnings from the Field

    • Migration from Oracle Exadata: They transitioned from an on-prem, single-region legacy Exadata deployment to a multi-region YugabyteDB cluster on AWS, aiming for a cloud-native, resilient, and scalable architecture.
    • Topology: For this use case, the YB deployment will be 50 nodes across 3 regions in AWS.
    • AWS Aurora competitive: After 18 months of testing AWS Aurora without success, the credit bureau got YugabyteDB OSS running in under 2 months.
    • Critical for win, YugabyteDB's Change Data Capture (CDC): YB’s ability to fail over between regions without disrupting the application or CDC to their downstream data lake was a key differentiator.
    • Potential expansion with new use cases: Aligning with the CIO led to a secondary opportunity in Fraud Management, where Yugabyte is now positioned for a new greenfield application in conjunction with the data lakehouse.
    • From OSS to Enterprise: They initially deployed YugabyteDB using our OSS offering, fully validated their app, and later engaged with Yugabyte for enterprise support for this Tier 0 app.
  28. Lessons Learned

    ✓ Prepared for unexpected bursts
    ✓ Built for expected peaks
    ✓ Surviving DDoS attacks
    ✓ Flexible expansion, anywhere
    ✓ Multitenancy
    ✓ No performance compromise
  29. What can go wrong… What you want…

    What can go wrong…
    • Entire Region or data center failure: low probability, but we see it happen regularly
    • Failures that last a while
    • Complex process to “heal” once the Region/DC is back online

    What you want…
    • Ability to trade off between steady-state performance (latency) and potential data loss (RPO)
    • Very quick recovery (low RTO)
    • Ability to run DR drills, planned switchover and chaos testing
  30. Latency Optimized Geo Partitioning - Design Pattern Details

    • Replication factor = 3 or 5
    • Create one table partition per geo
    • Partition application and DB (one partition per geo)
    • 2 regions per geography
    • Num regions = 2 * num geos

    Notes
    • Need to partition application and DB, one partition per geography
    • 2 regions per geo for low read and write latency
    • 1 region per geo for low read latency (but higher write latency)

    Diagram: Original table (parent table) split into a table partition in geo #1 and a table partition in geo #2, each served by its own APP
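    The region-count rule on this slide is simple enough to write down as a helper. This is only a sketch of the arithmetic stated in the notes; the function name and the `low_write_latency` flag are hypothetical.

```python
def regions_required(num_geos: int, low_write_latency: bool = True) -> int:
    """Regions needed for the geo-partitioning pattern described on the slide."""
    if num_geos < 1:
        raise ValueError("num_geos must be at least 1")
    # 2 regions per geo keeps both reads and writes local to the geography;
    # 1 region per geo keeps reads local but makes writes slower.
    regions_per_geo = 2 if low_write_latency else 1
    return regions_per_geo * num_geos


print(regions_required(3))                           # 6
print(regions_required(3, low_write_latency=False))  # 3
```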
  31. Goal: Ensure local placement of data

    Geo-distributed YB cluster: the table is transparently partitioned into per-region tables based on the value of column 'region'.
      SELECT * FROM users WHERE region='US'
      SELECT * FROM users WHERE region='EU'
      SELECT * FROM users WHERE region='India'
    • Low latency due to local access
    • Satisfies data residency requirements like GDPR
    Each partition table and all of its replicas are pinned to one region
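    The routing idea behind this slide can be sketched in a few lines: each row lands in exactly one per-region partition based on its `region` value, and a query that filters on `region` only has to touch that partition. This is an in-memory illustration, not YugabyteDB's implementation; the class, the partition names (`users_us`, `users_eu`, `users_india`), and the sample rows are all hypothetical.

```python
from typing import Dict, List

# Assumed mapping from the value of the 'region' column to a partition name.
PARTITION_FOR_REGION: Dict[str, str] = {
    "US": "users_us",
    "EU": "users_eu",
    "India": "users_india",
}


class GeoPartitionedTable:
    """Toy list-partitioned table keyed on the 'region' column."""

    def __init__(self) -> None:
        self.partitions: Dict[str, List[dict]] = {
            name: [] for name in PARTITION_FOR_REGION.values()
        }

    def insert(self, row: dict) -> str:
        # An unknown region raises KeyError, like a row with no matching partition.
        partition = PARTITION_FOR_REGION[row["region"]]
        self.partitions[partition].append(row)
        return partition

    def select_by_region(self, region: str) -> List[dict]:
        # Only the partition pinned to that region is scanned.
        return list(self.partitions[PARTITION_FOR_REGION[region]])


users = GeoPartitionedTable()
users.insert({"id": 1, "name": "alice", "region": "US"})
users.insert({"id": 2, "name": "bruno", "region": "EU"})
users.insert({"id": 3, "name": "chitra", "region": "India"})

print(users.select_by_region("EU"))  # [{'id': 2, 'name': 'bruno', 'region': 'EU'}]
```

    Because EU rows never leave the EU partition, pinning that partition's replicas to EU regions is what yields both the local-latency and the data-residency properties listed above.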
  32. Customers around the world trust YugabyteDB to run their business
  33. RAG (Retrieval Augmented Generation): What? Why? Where do vector indexes fit in?
  34. What is our Vector Indexing story?

    Native Vector Search powered by pgvector
    • In-database vector similarity search
    • Query vector and relational data
    • PostgreSQL compatibility
    • Fully ACID compliant
    • Fully open-source
    • Support for multiple distance functions, indexing methods

    Scalable & Extensible
    • Built natively for YugabyteDB, vector index automatically sharded across nodes for horizontal scale and high throughput
    • Pluggable indexing design enables integration with new algorithms and libraries as the vector search ecosystem evolves
    • Integration with other Postgres extensions to enhance value

    Resilience & Global Distribution
    • Inherits all of YugabyteDB’s fault tolerance, high availability, and geo-distribution for low-latency vector search across regions
  35. What can you do with pgvector in YugabyteDB?

    Create tables with vector columns
      CREATE TABLE images( id INTEGER NOT NULL, title TEXT, image_vector vector(1536));
    Load data (pgvector does not support generating embeddings)
      \COPY public.ybarticles FROM 'File.csv'
    Create vector indexes with different distance functions (HNSW index)
      CREATE INDEX ON images USING ybhnsw (image_vector vector_l2_ops);
    Search for vectors closest to an input (vector search)
      SELECT id, title FROM images ORDER BY image_vector <-> '[-0.01,-0.02,0.01...]'::vector LIMIT 10;
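    The choice of distance function changes which rows count as "closest". In pgvector, the `<->` operator orders by L2 (Euclidean) distance and `<=>` by cosine distance, and the index operator class (for example `vector_l2_ops`) has to match the operator used in the query for the index to help. The brute-force sketch below shows the difference on toy data; it is plain Python rather than anything from pgvector or YugabyteDB, and the two-dimensional vectors and the `nearest` helper are made up for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance, the metric behind pgvector's <-> operator."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity, the metric behind pgvector's <=> operator."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query: Vector, rows: List[Tuple[int, Vector]],
            distance: Callable[[Vector, Vector], float], k: int) -> List[int]:
    """Exact (unindexed) top-k search: sort every row by distance to the query."""
    ranked = sorted(rows, key=lambda row: distance(row[1], query))
    return [row_id for row_id, _ in ranked[:k]]


# Hypothetical (id, embedding) rows; real embeddings would have e.g. 1536 dimensions.
images = [(1, [1.0, 0.0]), (2, [0.0, 1.0]), (3, [3.0, 0.0])]
query = [1.0, 0.1]

print(nearest(query, images, l2_distance, 2))      # ids ranked by Euclidean distance
print(nearest(query, images, cosine_distance, 2))  # ids ranked by angle only
```

    With these vectors, L2 ranks ids 1 and 2 closest, while cosine distance picks ids 1 and 3 because it ignores magnitude. An HNSW index such as the one on the slide returns approximate results for the same kind of ordering without scanning every row.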
  36. What is under the hood?

    1. Postgres Compatibility
    2. Distributed/Parallel Execution for Performance
    3. Colocated vector index with Main table for better performance
    4. Extensibility
    5. Horizontal Scalability by adding nodes
  37. YugabyteDB is on a Path to Become the Default Database in the Enterprise
  38. FerretDB

    • Mongo proxy.
    • V2 relies on DocumentDB extension for PostgreSQL.
  39. YugabyteDB can power these GenAI use cases seamlessly

    1. Retrieval-Augmented Generation (RAG)
    • Contextual Data Retrieval: YugabyteDB stores structured metadata and embeddings used in similarity searches, complementing vector databases or extensions (such as pgvector) for effective retrieval.
    • Conversational AI: YugabyteDB manages persistent state and transactional consistency for conversational history in chatbot or virtual assistant applications, providing contextually accurate responses.

    2. Vector Search Integration (with YugabyteDB PostgreSQL & pgvector)
    • Similarity Searches: YugabyteDB YSQL supports the pgvector extension, enabling direct storage, indexing, and querying of high-dimensional embeddings, simplifying the architecture for AI-driven search applications.
    • Recommendation Engines: Leverage vector similarity search directly within YugabyteDB to quickly provide personalized recommendations in real time based on user behavior embeddings.

    3. Personalization and Recommendation Systems
    • User Profiling and Analytics: YugabyteDB efficiently stores and manages structured customer profile data, enabling GenAI models to deliver hyper-personalized content.
    • Dynamic Recommendations: YugabyteDB integrates with AWS SageMaker or AWS Bedrock-based LLMs to dynamically generate personalized recommendations based on user actions and transactional history.
  40. YugabyteDB can power these GenAI use cases seamlessly

    4. Content Generation & Management
    • Real-time Content Management: YugabyteDB provides rapid retrieval and updating of structured data, allowing GenAI models (like AWS Bedrock, Claude, or Nova Pro models) to dynamically generate personalized content, such as product descriptions, summaries, or marketing materials.
    • Automated Document Generation: YugabyteDB manages structured input data and template parameters, facilitating real-time document creation powered by GenAI workflows.

    5. Enterprise Knowledge Bases and Semantic Search
    • Structured Knowledge Repositories: Store structured data used for semantic retrieval and inference, enabling GenAI-powered knowledge base queries.
    • Real-time Question Answering: YugabyteDB integrates smoothly with AWS Bedrock-hosted models to dynamically answer queries based on structured data stored in relational formats.

    6. Fraud Detection and Risk Analysis
    • Real-time Fraud Analytics: YugabyteDB maintains transactional records and risk indicators, enabling GenAI-powered anomaly detection systems to rapidly flag and investigate suspicious activity.
    • Predictive Risk Modeling: YugabyteDB provides consistent, transactional data to continuously train and retrain GenAI models, improving accuracy in risk assessment and fraud detection use cases.

    7. AI-Powered Customer Support
    • Contextual Support Automation: YugabyteDB stores customer interactions, order history, and product details, enabling intelligent support bots powered by GenAI (like Amazon Bedrock models) to provide precise, timely, and relevant customer support.
    • Ticket Routing and Prioritization: AI models query YugabyteDB’s structured database to evaluate and prioritize customer requests based on historical context.
  41. YugabyteDB is on a Path to Become the Default Database in the Enterprise

    • 2016 to 2023, Resilience and Scale: Scalable YSQL and YCQL, sync & async replication, geo-residency. Great for workloads that need resilience (HA), scale, or geo-distribution.
    • 2024, Enhanced PG Compatibility: Like Postgres + built-in resilience, on-demand scaling. Great for mid-size workloads that may need unpredictable scale in the future.
    • 2025, Serverless: Serverless offering for small workloads that go to 0. Great for small scale standalone cloud native workloads.
    • 2026 to 2027, Multitenancy: Workload consolidation on a single cluster. Great for low scale workloads that require QoS.
    CONFIDENTIAL: DO NOT DISTRIBUTE