

Optimizing Postgres for high write workloads - Citus Con

Slides from my talk at Citus Con - An Event for Postgres 2023 on Optimizing Postgres for high write workloads.

Talk Abstract below:

As you ramp up the write load on your Postgres database, you will often benefit from tuning checkpoint- and WAL-related configs for better performance. In the last decade, I’ve personally recommended adjusting parameters like max_wal_size, checkpoint_timeout and checkpoint_completion_target to dozens of customers. However, what they’ve found more useful is understanding the concepts, so that they can reason on their own about how to tune these for their workload and also understand how other parameters (e.g. full_page_writes) influence checkpoints and WAL generation.

In this talk, you’ll learn what checkpoints are, when they are triggered and how they impact database performance. We’ll also go over the most important configs which impact checkpoints and WAL generation, recommend how to set them, and show how to monitor and adapt them as your workload changes.

Samay Sharma

April 12, 2023


Transcript

  1. Samay Sharma Manager, PostgreSQL Open Source team @ Microsoft Optimizing

    Postgres for write-heavy workloads ft. checkpointer & WAL configs
  2. About Me • Manager of the Open Source PG team at

    Microsoft • Working with PostgreSQL for > 10 years • Previously solutions & customer engineer at Microsoft and Citus • Want to make Postgres performance and tuning easier to understand and do
  3. Optimize Postgres, but before that… • Ensure low latency between app

    and database • Deal with connections effectively • Make sure you have the right hardware [Diagram: PostgreSQL client and server]
  4. Writing performant applications for PG • Manage Indexes carefully •

    Parallelize your loading job with concurrent connections • Use multi-value INSERTs or COPY to speed up throughput • Consider partitioning to reduce index sizes • Use unlogged tables for unimportant data
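    For illustration, a minimal SQL sketch of multi-value INSERTs, COPY and unlogged tables; the table events_staging and the CSV path are hypothetical:

      -- Unlogged tables skip WAL entirely: fast to load, but lost on a crash.
      CREATE UNLOGGED TABLE events_staging (id bigint, payload text);

      -- Multi-value INSERT: one statement and one round trip for many rows.
      INSERT INTO events_staging (id, payload)
      VALUES (1, 'a'), (2, 'b'), (3, 'c');

      -- COPY is usually the fastest way to bulk-load (server-side path;
      -- use \copy in psql for a client-side file).
      COPY events_staging (id, payload) FROM '/tmp/events.csv' WITH (FORMAT csv);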
  5. Journey of a Postgres write When a write comes in,

    it is first written to WAL for durability reasons. Then the change is made in shared buffers (not to the data file yet!). At some point, those pages are written to disk by the background writer or checkpointer. [Diagram: INSERT INTO … → WAL (disk) and shared buffers → data files (disk), flushed by the BG writer and checkpointer]
  6. Getting rid of WAL - Checkpoints • After a crash,

    recovery requires you to replay WAL. • To make recovery faster and avoid maintaining TBs of WAL, Postgres has checkpoints • A checkpoint guarantees all dirty buffers up to a point are written to disk • Recovery can start from the last checkpoint • WAL before that point can be recycled
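    To see where recovery would start from, a small sketch using the built-in pg_control_checkpoint() function:

      -- LSN and time of the last completed checkpoint, and the redo point
      -- from which crash recovery would begin replaying WAL.
      SELECT checkpoint_lsn, redo_lsn, checkpoint_time
      FROM pg_control_checkpoint();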
  7. Autovacuum • PostgreSQL’s way of removing unneeded row versions • Checks if

    tables are significantly modified and, if so, runs VACUUM and ANALYZE • Defaults are generally too conservative for large workloads [Diagram: heap page before and after VACUUM, with deleted and superseded row versions removed]
  8. Common VACUUM problems and solutions • VACUUM isn’t triggered often enough

    for large tables • Reduce autovacuum_vacuum_scale_factor, autovacuum_vacuum_insert_scale_factor • VACUUM running slow • Adjust / disable cost limits • Increase maintenance_work_mem / autovacuum_work_mem • Adjust max_parallel_maintenance_workers for parallel index vacuuming • Dead rows not reducing • Long running queries • Unused replication slots • Uncommitted prepared transactions
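    A hedged sketch of per-table autovacuum tuning; big_table and the values are illustrative, not recommendations:

      -- Trigger autovacuum after ~1% of the table changes instead of the
      -- default 20% scale factor (insert scale factor requires PG 13+).
      ALTER TABLE big_table SET (autovacuum_vacuum_scale_factor = 0.01);
      ALTER TABLE big_table SET (autovacuum_vacuum_insert_scale_factor = 0.01);

      -- postgresql.conf: give (auto)vacuum more memory and parallel workers
      -- maintenance_work_mem = 1GB
      -- autovacuum_work_mem = 1GB
      -- max_parallel_maintenance_workers = 4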
  9. Healthy Checkpointer operation • Checkpoints are frequent enough to •

    Ensure a reasonable recovery time • Limit disk utilization from WAL accumulation • Checkpoints are spaced out enough to • Ensure that the I/O impact on the user workload is low • Less WAL is generated and pages are not flushed repeatedly
  10. How checkpoints are triggered • Manual actions • Running the CHECKPOINT command

    • Commands which run CHECKPOINT for you, e.g. pg_start_backup, pg_ctl stop, etc. • When a set amount of time passes since the last checkpoint (checkpoint_timeout) • When you hit the maximum limit of how much WAL you can accumulate (max_wal_size)
  11. Best practice for tuning checkpoint triggering • If checkpoints are

    triggered too often, you will get this message in your log:
      LOG:  checkpoints are occurring too frequently (9 seconds apart)
      HINT:  Consider increasing the configuration parameter "max_wal_size"
    • Checkpoints should ideally be triggered by checkpoint_timeout • max_wal_size should be set such that it is rarely reached
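    To check how your checkpoints are being triggered, a small sketch (these columns apply to PostgreSQL 16 and earlier; PostgreSQL 17 moves the counters to pg_stat_checkpointer):

      -- checkpoints_timed should dominate; many checkpoints_req means
      -- max_wal_size is being hit before checkpoint_timeout expires.
      SELECT checkpoints_timed, checkpoints_req
      FROM pg_stat_bgwriter;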
  12. Setting checkpoint_timeout • checkpoint_timeout depends on your RTO • Hard to

    predict a specific recovery time • 5 min (default) is generally too small – 15-30 min are common • Low values may lead to write amplification due to full_page_writes (more later)
  13. Setting max_wal_size • Measure WAL generated over a checkpoint_timeout period •

    Can use pg_stat_bgwriter or the output of log_checkpoints to determine this • Set to 2-3x the WAL you expect to generate • (Small workloads) Bound by the size of the disk holding WAL – it is a soft limit
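    One way to measure the WAL volume, as a sketch (the sample LSN is illustrative): record the current LSN, wait one checkpoint_timeout period, then diff:

      -- Step 1: record the current WAL position.
      SELECT pg_current_wal_lsn();          -- e.g. returns 0/5000000

      -- Step 2: after one checkpoint_timeout period, compute the WAL generated.
      SELECT pg_size_pretty(
               pg_wal_lsn_diff(pg_current_wal_lsn(), '0/5000000'::pg_lsn));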
  14. Tuning checkpointing rate • Checkpointing can lead to huge I/O

    spikes if done all at once • Spread checkpoints out by setting checkpoint_completion_target • 0.9 is generally a good value (PG will aim to complete checkpointing in 90% of the time between checkpoints)
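    Pulling slides 11-14 together, an illustrative postgresql.conf starting point (values are examples to adapt, not universal recommendations):

      checkpoint_timeout = 15min           # default 5min; set based on your RTO
      max_wal_size = 16GB                  # 2-3x the WAL generated per checkpoint_timeout
      checkpoint_completion_target = 0.9   # spread writes over 90% of the interval
      log_checkpoints = on                 # log timing and WAL volume per checkpoint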
  15. Impact of full_page_writes • Setting full_page_writes = on prevents corruption

    when the OS crashes • Writes the content of the entire page to WAL during the first modification after a checkpoint • Greatly magnifies WAL volume with frequent checkpoints [Diagram: an 8kB Postgres page stored as two 4kB OS pages, illustrating the torn-page risk]
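    On PostgreSQL 14 and later, pg_stat_wal shows how much of your WAL is full-page images; a small sketch:

      -- wal_fpi counts full-page images; compare it to wal_records to see
      -- how much checkpoint frequency is amplifying WAL volume.
      SELECT wal_records, wal_fpi, pg_size_pretty(wal_bytes) AS wal_size
      FROM pg_stat_wal;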
  16. Other WAL tuning parameters • wal_compression • Decreases WAL

    volume by compressing it • Useful to enable to reduce WAL I/O • wal_buffers • Amount of shared memory used for WAL • Beneficial to increase (up to 128MB) if you have bulk writes or many concurrent writes • min_wal_size • Recycle WAL files until this size is reached instead of removing them • Helpful to avoid new WAL file creation during spikes. Set based on log_checkpoints output
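    An illustrative sketch of these settings in postgresql.conf (values are examples to measure against, not recommendations):

      wal_compression = on      # PG 15+ also accepts lz4 or zstd
      wal_buffers = 64MB        # default -1 = 1/32 of shared_buffers, capped at 16MB
      min_wal_size = 4GB        # recycle this much WAL instead of removing it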
  17. Other WAL tuning parameters – Contd. • backend_flush_after • Attempts

    to force the OS to issue writes once this amount of data is written by a backend • If you see I/O spikes and have a lot of concurrent writes, this is worth considering. See the caveats. • wal_segment_size • Sets the size of individual WAL files • When you have high WAL volume, the number of WAL files per directory can become an issue; increasing this helps. • A bunch of other parameters in the docs – not as commonly used.
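    Note that wal_segment_size is fixed at cluster creation time; a hedged sketch (the data directory path is hypothetical):

      # 64MB WAL segments instead of the default 16MB
      initdb --wal-segsize=64 -D /path/to/data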
  18. My reads can do writes? • Any pages needed for

    reads go into shared_buffers • When there isn’t space to put a buffer into shared_buffers, a page needs to be evicted • If that page is "dirty", then it must be written to disk synchronously
      Bitmap Heap Scan on t0 (cost=38054.04..2162680.07 rows=2867546 width=36) (actual time=1310.569..2181906.171 rows=2859373 loops=1)
        Recheck Cond: ((col1 = 1) AND (col2 IS NULL))
        Rows Removed by Index Recheck: 3617136
        Heap Blocks: exact=38631 lossy=859715
        Buffers: shared hit=113729 read=792973 written=24959
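    The plan above comes from running EXPLAIN with the BUFFERS option; a sketch of the kind of command that produces it (the query predicate is taken from the plan):

      -- "written=..." in the Buffers line shows pages this read query had to
      -- flush to disk itself because it evicted dirty buffers.
      EXPLAIN (ANALYZE, BUFFERS)
      SELECT * FROM t0 WHERE col1 = 1 AND col2 IS NULL;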
  19. In comes bgwriter • Bgwriter issues writes of “dirty” shared

    buffers • When clean shared buffers are insufficient, bgwriter writes dirty buffers and marks them as clean • Increases total I/O • Reduces likelihood of user queries having to write buffers themselves
  20. Bgwriter tuning parameters • bgwriter_delay : Time interval at which

    bgwriter wakes up to write dirty buffers • bgwriter_lru_multiplier : Chooses how many buffers should be written per cycle. • Number of buffers is average recent need * bgwriter_lru_multiplier. • Default is 2 (giving some cushion for spikes). • Values < 1 indicate writes will be done by server processes. • bgwriter_lru_maxpages : Max limit on how many pages bgwriter can write in each round.
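    An illustrative postgresql.conf sketch for a more aggressive bgwriter (values are examples, not recommendations):

      bgwriter_delay = 100ms          # default 200ms; wake up more often
      bgwriter_lru_multiplier = 4.0   # default 2.0; write more buffers per round
      bgwriter_lru_maxpages = 1000    # default 100; raise the per-round cap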
  21. Concepts for bgwriter tuning • Tradeoff between increased total I/O vs

    I/O done by user queries • Goal: You see writes by backends and want lower latency → Solution: Make bgwriter more aggressive → How: Decrease bgwriter_delay and/or increase bgwriter_lru_multiplier • Goal: Less total I/O → Solution: Make bgwriter less aggressive → How: Increase bgwriter_delay and decrease bgwriter_lru_multiplier
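    To see who is doing the writes before and after tuning, a sketch against pg_stat_bgwriter (columns as of PostgreSQL 16 and earlier; PostgreSQL 17 moves backend writes to pg_stat_io):

      -- High buffers_backend relative to buffers_clean means user queries are
      -- writing dirty pages themselves; maxwritten_clean > 0 means bgwriter is
      -- hitting bgwriter_lru_maxpages.
      SELECT buffers_checkpoint, buffers_clean, maxwritten_clean, buffers_backend
      FROM pg_stat_bgwriter;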
  22. There’s more • shared_buffers tuning – larger for write-heavy

    workloads • OS parameters – data=writeback (filesystem mount option), vm.dirty_background_bytes, etc. • Group commit settings • And others…
  23. Tuning write workloads is tricky • Defaults are not good for

    large workloads • EXPLAIN doesn’t tell you as much • Requires tuning background processes rather than individual statements • Problems are more long-term
  24. Optimize Postgres for write workloads • Setup: Make sure you are

    set up well and follow application best practices • Monitor: Understand the bottleneck by monitoring your database & logs • Tune: Tune Postgres processes like autovacuum, checkpointer and background writer to improve performance • Adapt: Keep adapting based on changing workload demands
  25. © Copyright Microsoft Corporation. All rights reserved. danke schön dank

    u merci धन्यवाद teşekkürler thank you grazie gracias tack @samay_sharma [email protected] aka.ms/open-source-discord, #cituscon channel Samay Sharma