Downtimeless PostgreSQL server replacement

Maciej Pasternacki <[email protected]>
Warsaw PostgreSQL User Group, 2019-02-07
@mpasternacki, 3ofcoins

Maciej Pasternacki
• Freelance Web infrastructure / DevOps engineer since 2010
• Automation, Configuration Management, Infrastructure As Code
• PostgreSQL user & admin

Codility […] helps tech recruiters and hiring managers assess their
candidates’ skills by testing their code online. — codility.com

No such thing as a “5-minute downtime”
Scheduled downtime cannot happen during a candidate’s test session.
A new test cannot be started even 2 hours before planned maintenance.
We still need to do regular system patching, configuration changes, …

Most servers can be replaced without downtime
[Diagram: Load balancer, App, App, App, Worker]
What about the database?

Database Setup
[Diagram: client, pgbouncer, database host]

PgBouncer
Connection pooling PostgreSQL proxy
• Reduces overhead of new client connection
• Must-have for webapp workload
• Can pause client connections
• Can do live configuration reload
https://pgbouncer.github.io/

PgBouncer: The PAUSE command
• Tries to disconnect from all servers, first waiting for all current queries (transactions / sessions) to complete
• Returns after servers have been safely disconnected
• New client connections will wait until RESUME is called
https://pgbouncer.github.io/usage.html
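
A minimal sketch of how PAUSE and RESUME might be wrapped so that RESUME always runs. It assumes `admin_conn` is a DB-API connection to pgbouncer's admin console (the special `pgbouncer` database) opened in autocommit mode, for example with psycopg2; the helper name `paused` is hypothetical and not part of the talk.

```python
from contextlib import contextmanager


@contextmanager
def paused(admin_conn):
    """Keep pgbouncer paused while the with-block runs, then always RESUME.

    `admin_conn` is assumed to be a DB-API connection to pgbouncer's admin
    console in autocommit mode; nothing here depends on a specific driver.
    """
    cur = admin_conn.cursor()
    try:
        # Blocks until current queries/transactions complete and the
        # server connections have been released.
        cur.execute("PAUSE")
        yield
    finally:
        # Runs on success and on failure, so waiting clients are released.
        cur.execute("RESUME")
```

Used as `with paused(conn): ...`, the body of the block runs while new client queries wait. This sketch sets no timeout on PAUSE itself; a real script would need one, since PAUSE waits for open transactions.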

The General Idea
[Diagram, built up over several slides: client, pgbouncer, master db; a second pgbouncer in front of the standby replica; replication from master to replica. Labels added in order: PAUSE, Promote (standby replica becomes new master), reconfigure, RESUME.]

The General Idea
1. PAUSE pgbouncer
2. Promote standby replica to new master
3. Reconfigure pgbouncer for new master
4. RESUME pgbouncer
5. Reconfigure clients one by one to use new server’s pgbouncer

What Could Possibly Go Wrong?
• PAUSE hangs because some client holds a transaction open
• Replica is lagging and gets promoted too early
• Replica is not even a replica
• New server is broken or misconfigured
• Failure leaves pgbouncer paused
A PAUSE that lasts too long is as bad as “real” downtime!

The Revised Idea
1. Stop all non-essential db clients
2. With 5s timeout, catching errors, try:
   2.1 PAUSE clients (2s timeout)
   2.2 Wait for replica to match master WAL position
   2.3 Promote replica, wait until writable
   2.4 Reconfigure pgbouncer to use replica
3. Automatically RESUME clients on success or failure (worst case: they continue on the old server)
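
Steps 2 and 3 above can be sketched as a small driver function. This is an illustration under stated assumptions, not the script used in the talk: the five step callables (`pause`, `wait_for_replica`, `promote`, `reconfigure`, `resume`) are hypothetical, each of the first four is assumed to accept a timeout in seconds and to raise if it cannot finish in time, and step 1 (stopping non-essential clients) is left out.

```python
import time


class SwitchoverFailed(Exception):
    """Raised when the overall time budget is used up."""


def switchover(pause, wait_for_replica, promote, reconfigure, resume,
               total_timeout=5.0, pause_timeout=2.0, clock=time.monotonic):
    """Run the switchover steps within a time budget; always call resume()."""
    deadline = clock() + total_timeout

    def remaining():
        # Seconds left in the overall budget; abort once it is spent.
        left = deadline - clock()
        if left <= 0:
            raise SwitchoverFailed("overall time budget exhausted")
        return left

    try:
        pause(min(pause_timeout, remaining()))   # 2.1
        wait_for_replica(remaining())            # 2.2
        promote(remaining())                     # 2.3
        reconfigure(remaining())                 # 2.4
        return True
    except Exception as exc:
        # Any error or timeout aborts the attempt instead of propagating.
        print(f"switchover aborted: {exc!r}")
        return False
    finally:
        # Step 3: clients are resumed on success and on failure alike.
        resume()
```

The budget is enforced only between steps, so each callable has to honour the timeout it is given (for example through a connection or statement timeout). The return value tells the operator whether cleanup is needed.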

What Could Possibly Go Wrong?
The procedure is safe: a failure won’t affect the user. But it will affect us!
A failure during the procedure means we have to clean up.
Refusing to start is cheaper than failing.

Early sanity checks
Before doing anything:
• Show list of database processes to review
• Open and test SSH to replica
• Open and test admin connection to pgbouncer and both PostgreSQLs
• Test replica’s pgbouncer
• Check that replica is connected to master and doesn’t lag
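
A rough sketch of how such checks could be collected and evaluated before anything is touched. The function names are hypothetical. The LSN strings are assumed to come from `pg_current_wal_lsn()` on the master and `pg_last_wal_replay_lsn()` on the replica (PostgreSQL 10 or later), and each check is assumed to be a callable that returns a truthy value or raises.

```python
def parse_lsn(text):
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def replica_lag_bytes(master_lsn, replica_lsn):
    """How many bytes of WAL the replica still has to replay."""
    return parse_lsn(master_lsn) - parse_lsn(replica_lsn)


def preflight(checks):
    """Run every (name, check) pair and return a list of failure messages.

    All checks are run even if an early one fails, so the operator sees
    the full picture before deciding whether to start.
    """
    failures = []
    for name, check in checks:
        try:
            if not check():
                failures.append(f"{name}: check returned false")
        except Exception as exc:
            failures.append(f"{name}: {exc!r}")
    return failures
```

An empty list from `preflight` means the procedure may start; anything else is a reason to refuse. A lag check could be passed in as, say, `("replica lag", lambda: replica_lag_bytes(m, r) == 0)`, where `m` and `r` are LSN strings fetched beforehand.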

The Golden Rules
• Fail safe. It will break. Plan for that, don’t let that affect the user.
• Look before you leap. The earlier you fail, the easier it is to clean up. Don’t even start if you see you won’t finish.
