Architectures for PostgreSQL High Availability & Disaster Recovery
This presentation was given by Alexey Shiskin (Cloud Architect, Postgres Professional) at the "High Availability Architectures for PostgreSQL" webinar.
Postgres Professional is a key contributor to the PostgreSQL community. At Postgres Professional we develop the Postgres Pro database, a private PostgreSQL fork. The company also specializes in 24x7 technical support and other professional services (database migration, audit, and performance tuning) for PostgreSQL.
The webinar agenda:
- When to consider an HA/DR architecture?
- HA/DR for Postgres in a nutshell (how to make it work)
- Introduction to commonly used HA/DR architectures
- Pros and cons of various HA/DR architectures for PostgreSQL
- HA/DR field experience
RTO (Recovery Time Objective) defines how long an application can be unavailable for business users:
- 99.99% availability - 52.56 minutes of downtime per year (~0.9 hrs)
- 99.9% availability - 525.6 minutes of downtime per year (~9 hrs)
- 99% availability - 5,256 minutes of downtime per year (~90 hrs)
RPO (Recovery Point Objective) defines how much business data can be lost.
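As a quick sanity check of the figures above, the yearly downtime budget for a given availability target is simple arithmetic over a 525,600-minute year; a minimal Python sketch:

```python
# Convert an availability target (e.g. 99.99%) into a yearly downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def downtime_minutes_per_year(availability_percent: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for target in (99.99, 99.9, 99.0):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% -> {minutes:.2f} min/year (~{minutes / 60:.1f} hrs)")
```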
Manual switchover/failover (without an HA-cluster) can be sufficient, following the principle that the fewer parts in the technology stack the better, when:
- monitoring and alerting systems work well and inform the Operations team about database issues
- switchover/failover scenarios are well documented, and both day and night shifts of the Operations team have adequate expertise to cope with database availability issues
- RTO is not very strict (up to 5 minutes for switchover/failover tasks)
- the number of databases is relatively small (up to 50)
An HA-cluster (not to be confused with a PostgreSQL database cluster, which is a collection of databases managed by a single instance of a running database server) is worth deploying when:
- RTO is strict (within a minute for switchover/failover tasks)
- the number of databases is big (100+)
Patroni and Stolon are similar in functionality and architecture:
- depend on a DCS (Distributed Configuration Store)
- require Postgres streaming replication
- suitable for physical servers and virtual machines (VMs)
- open-source and free of charge
Patroni uses an external TCP proxy to connect to the master or standby(s), while Stolon has a built-in TCP proxy for this purpose.
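With Patroni, the external TCP proxy (HAProxy is a common choice) typically decides where to route connections by polling Patroni's REST API, which reports each node's role. A minimal sketch of such a check, assuming Patroni's default REST API port 8008 and a hypothetical node address:

```python
# Query a Patroni node's REST API (default port 8008) to learn its current role.
# The node address is a placeholder; "/" returns node status as JSON,
# "/primary" answers 200 only on the current primary.
import json
import urllib.error
import urllib.request

PATRONI_URL = "http://db-node-1:8008"  # hypothetical node address

def node_role(base_url: str) -> str:
    """Role reported by the node's status endpoint, e.g. 'master'/'primary' or 'replica'."""
    with urllib.request.urlopen(f"{base_url}/", timeout=2) as resp:
        return json.load(resp).get("role", "unknown")

def is_primary(base_url: str) -> bool:
    """True if the node answers 200 on /primary (what a proxy health check relies on)."""
    try:
        with urllib.request.urlopen(f"{base_url}/primary", timeout=2) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False

if __name__ == "__main__":
    print(node_role(PATRONI_URL), is_primary(PATRONI_URL))
```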
Corosync/Pacemaker and Veritas are likewise similar in functionality and architecture:
- use resource agents (disk volume, file system, IP address, Postgres)
- use a Virtual IP address (VIP) to connect to the master or standby(s)
- can be used with streaming replication or a shared-disk configuration
- mostly used with physical servers
- can be applied to build geo-clusters
Corosync/Pacemaker is open-source and free of charge; Veritas is proprietary and requires a license (it is the only HA-cluster that integrates with disk replication).
Postgres Pro Multimaster stands apart from the other HA-clusters:
- uses logical replication
- all nodes can process read-write requests (all nodes are masters)
- delivers the minimal possible switchover/failover time (single-digit seconds)
- open-source, but requires a license to run in a production environment
A DR site is needed to survive a disaster in which the main site is lost. If the required RPO is strictly zero, synchronous replication has to be used between the main and DR sites; otherwise asynchronous replication is enough. Postgres streaming replication is the most popular solution among our customers, and it is included both in PostgreSQL and in the Postgres Pro database. Streaming replication is integrated with the HA-clusters (Patroni, Corosync/Pacemaker, Stolon) and uses a master/standby(s) configuration, where the master serves read/write requests while the standby(s) can only be used for read-only requests.
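On the master, the pg_stat_replication view shows whether each standby is attached and whether it replicates synchronously or asynchronously, which is what ultimately determines the achievable RPO. A minimal sketch using psycopg2 (connection parameters are placeholders):

```python
# Inspect streaming replication state on the master (connection details are placeholders).
import psycopg2

conn = psycopg2.connect(host="master.example.com", dbname="postgres", user="postgres")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT application_name,
               state,
               sync_state,  -- 'sync', 'quorum', 'potential' or 'async'
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
        FROM pg_stat_replication
    """)
    for name, state, sync_state, lag in cur.fetchall():
        print(f"{name}: state={state}, sync_state={sync_state}, replay lag={lag} bytes")
conn.close()
```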
Streaming replication is physical replication by design. Logical replication is more flexible:
- replication of only some database objects instead of the whole database
- replication between two databases of different major versions
- bi-directional replication between two databases
Postgres Pro Multimaster uses logical replication to set up an HA-cluster where all nodes can handle read/write requests.
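Built-in PostgreSQL logical replication, on which this flexibility rests, is configured with a publication on the source database and a subscription on the target. A minimal sketch of replicating a single table (host names, database names, users and the table are illustrative; the publisher must run with wal_level = logical and the table must already exist on the subscriber):

```python
# Set up built-in logical replication for one table (all names are placeholders).
import psycopg2

# Publisher side: publish only the 'orders' table.
pub = psycopg2.connect(host="source.example.com", dbname="appdb", user="postgres")
pub.autocommit = True
with pub.cursor() as cur:
    cur.execute("CREATE PUBLICATION orders_pub FOR TABLE orders")
pub.close()

# Subscriber side: CREATE SUBSCRIPTION cannot run inside a transaction block,
# hence autocommit; the initial table contents are copied automatically.
sub = psycopg2.connect(host="target.example.com", dbname="appdb", user="postgres")
sub.autocommit = True
with sub.cursor() as cur:
    cur.execute("""
        CREATE SUBSCRIPTION orders_sub
        CONNECTION 'host=source.example.com dbname=appdb user=replicator'
        PUBLICATION orders_pub
    """)
sub.close()
```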
Disk/LUN replication is another DR option (proprietary, requires a license). The disk arrays have to support this replication on both the main and DR sites (and have to be of the same type). It delivers maximum performance for write-intensive load profiles. The Veritas HA-cluster is integrated with disk/LUN replication.
The most popular backup tool among our customers is 'pg_probackup' (https://github.com/postgrespro/pg_probackup):
- supports both full and incremental backup/restore
- supports point-in-time recovery (PITR)
- provides a backup catalog
- supports backup compression
- backup validation without actual data restore
- parallelism of backup/restore tasks
- and many more
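A typical flow is to initialize a backup catalog, register the instance, and then alternate full and incremental backups. The sketch below drives pg_probackup from Python; the paths, instance name and choice of DELTA mode are illustrative, database connection settings are omitted, and the exact flags should be checked against the pg_probackup documentation:

```python
# Sketch of a pg_probackup workflow (paths and names are placeholders).
import subprocess

BACKUP_DIR = "/backups"                    # backup catalog location
DATA_DIR = "/var/lib/postgresql/13/main"   # PostgreSQL data directory
INSTANCE = "main"

def run(*args: str) -> None:
    subprocess.run(["pg_probackup", *args], check=True)

# One-time setup: initialize the catalog and register the instance.
run("init", "-B", BACKUP_DIR)
run("add-instance", "-B", BACKUP_DIR, "-D", DATA_DIR, "--instance", INSTANCE)

# Full backup, then an incremental (DELTA) backup, 4 parallel jobs each;
# --stream fetches the required WAL over the replication protocol.
run("backup", "-B", BACKUP_DIR, "--instance", INSTANCE, "-b", "FULL", "-j", "4", "--stream")
run("backup", "-B", BACKUP_DIR, "--instance", INSTANCE, "-b", "DELTA", "-j", "4", "--stream")

# Show the backup catalog.
run("show", "-B", BACKUP_DIR)
```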
Backup/restore can also be based on disk-array snapshots, which allows backup and restore of the database to be done very fast (seconds to single-digit minutes) regardless of its size. The tools built into PostgreSQL are more limited:
- 'pg_basebackup': full backups only (no incremental backups), no parallelism of backup/restore tasks
- 'pg_dump' and 'pg_dumpall': logical backup (no PITR)
An HA-cluster can be deployed as:
- a local HA-cluster (all nodes within one site)
- a stretched HA-cluster (spanning two or three sites, up to 30 km from each other)
[Diagrams: a 3-node HA-cluster (clients connecting to a master and two standbys) and a sync 2+1 HA-cluster (clients connecting to a master and a synchronous standby, plus a referee node)]
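In the sync 2+1 topology the master waits for the synchronous standby to confirm each commit. In plain PostgreSQL this behavior is governed by synchronous_standby_names; the sketch below uses the quorum form with two placeholder standby names and assumes the parameter is being set by hand (Patroni and similar HA-clusters normally manage it for you):

```python
# Require that at least one of two named standbys confirms every commit
# (quorum-based synchronous replication). Standby names are placeholders.
import psycopg2

conn = psycopg2.connect(host="master.example.com", dbname="postgres", user="postgres")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
with conn.cursor() as cur:
    cur.execute("ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (standby1, standby2)'")
    cur.execute("SELECT pg_reload_conf()")  # apply the change without a restart
conn.close()
```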
Postgres Pro Enterprise adds several features relevant to HA:
- Relaxed synchronous replication restrictions, which allow the master server to continue running while some of the standbys are temporarily unavailable: https://postgrespro.com/docs/enterprise/13/runtime-config-replication#GUC-SYNCHRONOUS-STANDBY-GAP
- Automatic database block repair via streaming replication from a standby in case of data corruption: https://postgrespro.com/docs/enterprise/13/warm-standby#REPAIR-PAGE-FROM-STANDBY
- WAL restoration from in-memory WAL buffers: https://postgrespro.com/docs/enterprise/13/wal-restoration
- Support for database minor version upgrades without a database instance restart: https://postgrespro.com/docs/enterprise/13/release-proee-13-2-1
- Compressed file system (CFS), which offers database compression at the database block level: https://postgrespro.com/docs/enterprise/13/cfs