Revisiting XTM: A Practical Case Study Highlighting Its Needs - PGConf.dev 2025

Yuya Watari
PGConf.dev 2025

© 2025 NTT CORPORATION

About me

Yuya Watari
- From Tokyo, Japan

Work
- NTT Open Source Software Center / NTT Software Innovation Center
- Developing distributed databases

Interests
- Distributed transactions
- Planning
- Optimization
- Photography

Agenda

1. Introduction
2. Making PostgreSQL Distributed – FDW-based Sharding
3. Making Transaction Manager Pluggable – XTM
4. Case Study – User-specific Transaction Method
5. Discussion
6. Conclusion

Keywords – Transactional extensibility and XTM

PostgreSQL is well known for its high extensibility
- Hooks in the planner and executor, access methods, etc.

However, PostgreSQL lacks transactional extensibility
- No hooks or pluggable APIs for replacing the transaction manager
- Custom transaction methods cannot be implemented as extensions; forking the PostgreSQL core is required

eXtensible Transaction Manager (XTM) is a promising solution
- XTM allows the replacement of core transactional APIs
  - Snapshot mechanism, tuple visibility determination, etc.
- XTM was proposed in 2015 by PostgresPro but has not yet been implemented in the PostgreSQL core

New usage of XTM for user-specific requirements

The discussion in 2015 mainly focused on implementing state-of-the-art methods
- Global transaction managers (GTMs), clock-based algorithms, etc.

Beyond such purposes, XTM plays an important role in complementing these existing methods to meet user-specific requirements
- Requirements for distributed transactions vary from user to user
  - Various patterns of consistency
  - Performance oriented or consistency oriented?
- Existing methods do not always satisfy these requirements
- Using XTM to complement existing methods has not been discussed so far

This talk introduces a case study of complementing existing state-of-the-art methods to meet user-specific requirements

Case study

The case study models a banking system
- Account information is distributed across multiple nodes
- Two queries: transfer and balance inquiry

User-specific requirements
- Should support local transactions on shards to provide high performance
- Should ensure session consistency even for local transactions

XTM is utilized to complement existing methods, which fail to meet these requirements

[Figure: a coordinator holding a partitioned table accesses, via FDW, the tables on Shard A (Account A) and Shard B (Account B); queries shown: (1) Transfer, (2) Balance inquiry]

Making PostgreSQL Distributed – FDW-based Sharding

FDW-based sharding

Combines FDW and table partitioning
- Foreign Data Wrapper (FDW)
  - Accesses remote data via SQL
  - Enables the use of external data sources as if they were local tables
- Table partitioning
  - Splits a table into multiple partitions

[Figure: a coordinator holding a partitioned table accesses, via FDW, the tables on Shard A (Account A) and Shard B (Account B); queries shown: (1) Transfer, (2) Balance inquiry]

Two user-specific requirements

In addition to standard global transaction support, such as atomic commit and atomic visibility, users require the following:

1. Allowing local transactions on shards
   - Transactions enclosed within a single shard should bypass the coordinator and be performed locally on the shard
2. Ensuring session consistency
   - If a user observes that a (global) transaction has finished, subsequent transactions should reflect the result
   - This must also apply to local transactions

Methods for standard global transaction support

State-of-the-art methods have been proposed to ensure atomic visibility
- Global transaction managers (GTMs), e.g., Postgres-XC/XL
  - Assign global transaction IDs to uniquely identify global transactions
- Clock-based algorithms, e.g., Clock-SI
  - Use loosely synchronized clocks to coordinate global transactions
- Pessimistic coordination
  - Assigns local snapshots atomically when a global transaction begins

The slide groups these methods under two labels: centralized coordination and decentralized coordination

System design

Ensure atomic commit by 2PC
- Using postgres_fdw_plus

Ensure atomic visibility by an existing state-of-the-art method
- To support local transactions on shards, we assume decentralized coordination

However, this design does not ensure session consistency, which the user in the case study requires

postgres_fdw_plus – FDW supporting 2PC

postgres_fdw is provided as a contrib module
- Enables the use of PostgreSQL servers as external data sources
- Does not support distributed transactions

postgres_fdw_plus is a fork of postgres_fdw with support for atomic commit
- Developed by NTT DATA
- Basic usage is the same as the original postgres_fdw
- Atomic commit
  - Ensures that a transaction is either fully committed or fully aborted on all participating nodes
  - Essential in distributed transactions
  - postgres_fdw_plus achieves atomic commit by the two-phase commit (2PC) protocol

The case study adopts postgres_fdw_plus

Atomic commit

Naïve transaction methods can lead to an inconsistent state
- Naïve one-phase commit

[Figure: message sequence among Client, Coordinator, Shard 1, and Shard 2. The client sends Commit and the coordinator sends Commit to each shard. In the first build both shards reply OK and the client receives OK. In the second build Shard 1 replies OK, Shard 2 replies NO, and the client receives NO.]

If some commits in shards fail
- The global transaction is committed in Shard 1 but fails in Shard 2
- The result is an inconsistent state

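Two-phase commit (2PC) protocol

A solution to ensure atomic commit
- Involves a Prepare phase and a Commit phase

[Figure: message sequence among Client, Coordinator, Shard 1, and Shard 2. Prepare phase: after the client's Commit, the coordinator sends Prepare to each shard and receives OK. Commit phase: the coordinator sends Commit Prepared to each shard and receives OK, then returns OK to the client.]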
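
The difference between the naïve one-phase commit and 2PC can be sketched in a few lines of standalone C. This is only an illustrative model, not code from PostgreSQL or postgres_fdw_plus: the Shard struct, the can_commit flag (a knob that simulates a failing shard), and all function names are assumptions made for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { TX_ACTIVE, TX_PREPARED, TX_COMMITTED, TX_ABORTED } TxState;

typedef struct {
    TxState state;
    bool    can_commit;   /* hypothetical knob: false simulates a failing shard */
} Shard;

/* Naive one-phase commit: every shard decides on its own. */
void one_phase_commit(Shard *shards, size_t n)
{
    for (size_t i = 0; i < n; i++)
        shards[i].state = shards[i].can_commit ? TX_COMMITTED : TX_ABORTED;
}

/* Two-phase commit: returns true if the global transaction committed. */
bool two_phase_commit(Shard *shards, size_t n)
{
    bool all_prepared = true;

    /* Prepare phase: stop at the first shard that cannot prepare. */
    for (size_t i = 0; i < n; i++) {
        if (!shards[i].can_commit) {
            all_prepared = false;
            break;
        }
        shards[i].state = TX_PREPARED;
    }

    /* Commit phase: apply the single global decision to every shard. */
    for (size_t i = 0; i < n; i++)
        shards[i].state = all_prepared ? TX_COMMITTED : TX_ABORTED;

    return all_prepared;
}

/* True if all shards ended in the same state. */
bool all_same_state(const Shard *shards, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (shards[i].state != shards[0].state)
            return false;
    return true;
}
```

With one healthy and one failing shard, one_phase_commit() leaves the shards in different states, while two_phase_commit() aborts both. The sketch assumes that a prepared shard can always complete the commit later, which is the guarantee the Prepare phase is meant to provide.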

Flow of committing a transaction – Original

CommitTransaction() is called when committing a transaction

    static void
    CommitTransaction(void)
    {
        …
        CallXactCallbacks(XACT_EVENT_PRE_COMMIT);
        …
        /*
         * We need to mark our XIDs as committed in
         * pg_xact. This is where we durably commit.
         */
        latestXid = RecordTransactionCommit();
        …
        CallXactCallbacks(XACT_EVENT_COMMIT);
        …
    }

Note: The above code has been modified for simplicity

Callbacks run before and after the main operations
- The pre-commit callback is allowed to raise an error, which leads to AbortTransaction()
- The post-commit callback is NOT allowed to raise an error

Implementation of postgres_fdw_plus

Utilizes these callbacks to implement 2PC

    static void
    CommitTransaction(void)
    {
        …
        CallXactCallbacks(XACT_EVENT_PRE_COMMIT);
        …
        /*
         * We need to mark our XIDs as committed in
         * pg_xact. This is where we durably commit.
         */
        latestXid = RecordTransactionCommit();
        …
        CallXactCallbacks(XACT_EVENT_COMMIT);
        …
    }

Note: The above code has been modified for simplicity

On success
- In the pre-commit callback, postgres_fdw_plus issues PREPARE TRANSACTION
- In the post-commit callback, postgres_fdw_plus issues COMMIT PREPARED

On failure
- If PREPARE TRANSACTION fails in some nodes, control flow moves to AbortTransaction() and postgres_fdw_plus issues ROLLBACK PREPARED
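
The callback-driven flow described above can be modeled without PostgreSQL headers. The following is a rough standalone sketch, not the postgres_fdw_plus source: every type and function name here is hypothetical, and a false return value stands in for an error raised inside the pre-commit callback (real PostgreSQL callbacks return void and report errors differently). Issuing a plain ROLLBACK on connections that never prepared is also an assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { EV_PRE_COMMIT, EV_COMMIT, EV_ABORT } XactEventSketch;

typedef enum {
    STMT_NONE, STMT_PREPARE, STMT_COMMIT_PREPARED,
    STMT_ROLLBACK_PREPARED, STMT_ROLLBACK
} RemoteStmt;

typedef struct {
    bool       prepare_ok;  /* hypothetical knob: does PREPARE TRANSACTION succeed? */
    bool       prepared;    /* set once PREPARE TRANSACTION has succeeded */
    RemoteStmt last;        /* last statement "sent" on this connection */
} RemoteConn;

typedef struct {
    RemoteConn *conns;
    size_t      n;
} ConnSet;

/* Models the FDW's transaction callback; false models a raised error. */
bool fdw_xact_callback(XactEventSketch ev, ConnSet *set)
{
    for (size_t i = 0; i < set->n; i++) {
        RemoteConn *c = &set->conns[i];

        switch (ev) {
        case EV_PRE_COMMIT:
            c->last = STMT_PREPARE;
            if (!c->prepare_ok)
                return false;
            c->prepared = true;
            break;
        case EV_COMMIT:
            c->last = STMT_COMMIT_PREPARED;
            break;
        case EV_ABORT:
            /* Only connections that actually prepared need ROLLBACK PREPARED. */
            c->last = c->prepared ? STMT_ROLLBACK_PREPARED : STMT_ROLLBACK;
            break;
        }
    }
    return true;
}

/* Models CommitTransaction(): pre-commit callback, local commit, post-commit callback. */
bool commit_transaction(ConnSet *set)
{
    if (!fdw_xact_callback(EV_PRE_COMMIT, set)) {
        fdw_xact_callback(EV_ABORT, set);   /* the AbortTransaction() path */
        return false;
    }
    /* The local durable commit would happen here. */
    fdw_xact_callback(EV_COMMIT, set);
    return true;
}
```

When every PREPARE succeeds, each connection ends with COMMIT PREPARED. When one fails, the already prepared connections end with ROLLBACK PREPARED.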

Optimization of 2PC protocol

Current postgres_fdw_plus returns the ack after the Commit phase of 2PC

Returning the ack at the end of the Prepare phase is a typical optimization of 2PC
- To achieve high performance, the case study adopts this optimization

[Figure: 2PC message sequence (Prepare phase, then Commit phase) among Client, Coordinator, Shard 1, and Shard 2]

Session consistency is NOT guaranteed

Consider the following scenario

1. Initially, both accounts A and B have $10
2. User transfers $1 from A to B
   - This is a global transaction
   - Account A will have $9, and B will have $11
3. User inquires balance of B
   - This is a local transaction on Shard B
   - The user must observe the updated result $11, i.e., the transfer must not be lost
   - We call this session consistency

Why is session consistency not guaranteed?

User may read inconsistent state after Prepare phase

[Figure: timeline on Shard A and Shard B. (1) Global TX1 (Transfer) writes account A and account B, runs PREPARE on both shards, returns Ack, and later runs COMMIT PREPARED on both shards. (2) Local TX2 (Balance inquiry) on Shard B takes its snapshot and reads account B.]

The slide builds up the following sequence:
1. The user receives the Ack: "Transfer was finished!"
2. The user starts TX2 to confirm the transfer
3. TX2 observes $10 instead of $11, so the transfer appears lost

Achieving session consistency by XTM

This case study uses XTM to implement a custom transaction method that ensures session consistency even for local transactions
- User-specific method for a user-specific requirement
- Customizes snapshots and tuple visibility determination

XTM is effective for complementing existing methods to meet this kind of user-specific requirement, which is not always needed in general

Making Transaction Manager Pluggable – XTM

eXtensible Transaction Manager (XTM)

eXtensible Transaction Manager (XTM) significantly improves transactional extensibility in PostgreSQL
- Allows the core transactional APIs to be replaced
- First proposed in 2015 by PostgresPro
- Previously discussed, but not yet merged into the PostgreSQL core

XTM Interface

The proposed interface of XTM [1]
- Each pointer encapsulates an original function in the transaction manager

    typedef struct
    {
        XidStatus (*GetTransactionStatus)(TransactionId xid, XLogRecPtr *lsn);
        void (*SetTransactionStatus)(TransactionId xid, int nsubxids, TransactionId *subxids, XidStatus status, XLogRecPtr lsn);
        Snapshot (*GetSnapshot)(Snapshot snapshot);
        TransactionId (*GetNewTransactionId)(bool isSubXact);
        TransactionId (*GetOldestXmin)(Relation rel, bool ignoreVacuum);
        bool (*IsInProgress)(TransactionId xid);
        TransactionId (*GetGlobalTransactionId)(void);
        bool (*IsInSnapshot)(TransactionId xid, Snapshot snapshot);
        bool (*DetectGlobalDeadLock)(PGPROC* proc);
    } TransactionManager;

[1] https://wiki.postgresql.org/wiki/DTM

Question: What do these functions originally do?
- We need to know the behavior of the transaction manager

Deep dive into transaction manager

PostgreSQL adopts MultiVersion Concurrency Control (MVCC) to achieve isolation between transactions with high concurrency
- The transaction manager assigns a transaction ID (XID) to each transaction
- The transaction manager obtains a snapshot when a transaction starts
  - This is the behavior when the isolation level is REPEATABLE READ or higher
  - For READ COMMITTED, the transaction manager obtains a snapshot for each command
  - A snapshot stores information about which transactions were committed, aborted, or in progress when the snapshot was taken
- The executor returns tuples that are visible according to the current snapshot
  - This is called tuple visibility determination

Snapshot

Holds information on which transactions are visible

    XID   Status
    100   Committed
    101   In-progress
    102   Aborted
    103   In-progress
    104   Committed
    105   In-progress

    xmin = 101, xmax = 105 (the slide marks each XID as Visible or Invisible)

- All transactions whose XID is less than xmin are visible
- All transactions whose XID is greater than or equal to xmax are invisible
- Transactions between xmin and xmax have different visibility
  - Stored in an array in the snapshot data structure
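
The three rules above can be written as a small standalone check. This is a simplified sketch, not the PostgreSQL implementation: SnapshotSketch and xid_in_snapshot() are names invented here, XIDs are compared as plain integers (wraparound is ignored), and the function only answers whether an XID still counts as running for the snapshot. Whether a finished transaction's changes are actually visible also depends on whether it committed or aborted (XID 102 in the example is finished but aborted).

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

typedef struct {
    TransactionId        xmin;  /* every XID < xmin has finished */
    TransactionId        xmax;  /* every XID >= xmax is treated as running */
    const TransactionId *xip;   /* XIDs in [xmin, xmax) that were in progress */
    uint32_t             xcnt;  /* number of entries in xip[] */
} SnapshotSketch;

/* Returns true if xid counts as still running for this snapshot. */
bool xid_in_snapshot(TransactionId xid, const SnapshotSketch *snap)
{
    if (xid >= snap->xmax)
        return true;            /* started after the snapshot was taken */
    if (xid < snap->xmin)
        return false;           /* finished before the snapshot was taken */

    /* Between xmin and xmax: consult the in-progress array. */
    for (uint32_t i = 0; i < snap->xcnt; i++)
        if (snap->xip[i] == xid)
            return true;
    return false;
}
```

For the slide's example (xmin = 101, xmax = 105, in-progress array {101, 103}), XIDs 101, 103, and 105 count as running, while 100, 102, and 104 do not.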

Execution – Writing tables

Each tuple stores the XIDs that wrote and deleted it
- xmin: XID that wrote this tuple
- xmax: XID that deleted this tuple

    xmin  xmax  ID  Balance
    100         1   10

After this record is updated by XID 103, a new tuple is added to the table

    xmin  xmax  ID  Balance
    100   103   1   10
    103         1   11

Execution – Scanning

During table scanning, the executor checks the visibility of tuples according to the snapshot

Snapshot

    XID   Status
    100   Committed
    101   In-progress
    102   Aborted
    103   In-progress
    104   Committed
    105   In-progress

    xmin = 101, xmax = 105

Table

    xmin  xmax  ID  Balance
    100   103   1   10
    103         1   11

- First tuple: XID 100 is visible, but XID 103 is invisible, so this tuple is visible (shown in the result)
- Second tuple: XID 103 is invisible and nobody has deleted the tuple, so this tuple is invisible (not shown in the result)

Query result

    ID  Balance
    1   10
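
The scan example above can be reproduced with a small standalone model. This is a sketch under simplifying assumptions, not the PostgreSQL heap visibility code: all type and function names are invented here, the commit log is a hard-coded table for XIDs 100 to 105 taken from the slide, XIDs outside that range are arbitrarily treated as in progress, and details such as hint bits, wraparound, and the current transaction's own changes are ignored.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define INVALID_XID 0u   /* stands for "no deleting transaction" */

typedef enum { XS_IN_PROGRESS, XS_COMMITTED, XS_ABORTED } XidStatusSketch;

typedef struct {
    TransactionId        xmin, xmax;
    const TransactionId *xip;
    uint32_t             xcnt;
} SnapshotSketch;

typedef struct {
    TransactionId t_xmin;   /* XID that wrote the tuple */
    TransactionId t_xmax;   /* XID that deleted it, or INVALID_XID */
    int           id;
    int           balance;
} TupleSketch;

/* Hard-coded commit log for the slide example (XIDs 100..105). */
XidStatusSketch xid_status(TransactionId xid)
{
    static const XidStatusSketch clog_sketch[] = {
        XS_COMMITTED,    /* 100 */
        XS_IN_PROGRESS,  /* 101 */
        XS_ABORTED,      /* 102 */
        XS_IN_PROGRESS,  /* 103 */
        XS_COMMITTED,    /* 104 */
        XS_IN_PROGRESS   /* 105 */
    };

    if (xid < 100 || xid > 105)
        return XS_IN_PROGRESS;   /* arbitrary choice for this sketch */
    return clog_sketch[xid - 100];
}

/* A transaction's effects are visible if it finished before the snapshot and committed. */
bool xid_visible(TransactionId xid, const SnapshotSketch *snap)
{
    if (xid >= snap->xmax)
        return false;
    if (xid >= snap->xmin)
        for (uint32_t i = 0; i < snap->xcnt; i++)
            if (snap->xip[i] == xid)
                return false;
    return xid_status(xid) == XS_COMMITTED;
}

/* A tuple is visible if its writer is visible and its deleter (if any) is not. */
bool tuple_visible(const TupleSketch *t, const SnapshotSketch *snap)
{
    if (!xid_visible(t->t_xmin, snap))
        return false;
    if (t->t_xmax == INVALID_XID)
        return true;
    return !xid_visible(t->t_xmax, snap);
}
```

With the slide's snapshot, the tuple (xmin 100, xmax 103, balance 10) is visible and the tuple (xmin 103, balance 11) is not, which matches the query result above.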

Implementation – Snapshot

The snapshot data structure holds this information

    typedef struct SnapshotData
    {
        …
        /* all XID < xmin are visible to me */
        TransactionId xmin;
        /* all XID >= xmax are invisible to me */
        TransactionId xmax;
        …
        TransactionId *xip;
        uint32 xcnt; /* # of xact ids in xip[] */
        …
    } SnapshotData;

- xip holds the in-progress transactions

Implementation – Execution

Example flow of executing the following query
- Point query on a table that has an index

    SELECT * FROM accounts WHERE id = 1

Simplified call flow

    PortalStart()
      GetTransactionSnapshot()
        GetSnapshotData()
    PortalRun()
      ExecutorRun()
        ExecIndexScan()
          IndexNext()
            HeapTupleSatisfiesVisibility()
              XidInMVCCSnapshot()

- PortalStart() and its callees: get a snapshot when the transaction starts
- PortalRun() and its callees: scan the table during execution while checking tuple visibility

XTM provides encapsulation of these functions

For example, XTM allows the snapshot architecture to be replaced

    typedef struct
    {
        XidStatus (*GetTransactionStatus)(TransactionId xid, XLogRecPtr *lsn);
        void (*SetTransactionStatus)(TransactionId xid, int nsubxids, TransactionId *subxids, XidStatus status, XLogRecPtr lsn);
        Snapshot (*GetSnapshot)(Snapshot snapshot);
        TransactionId (*GetNewTransactionId)(bool isSubXact);
        TransactionId (*GetOldestXmin)(Relation rel, bool ignoreVacuum);
        bool (*IsInProgress)(TransactionId xid);
        TransactionId (*GetGlobalTransactionId)(void);
        bool (*IsInSnapshot)(TransactionId xid, Snapshot snapshot);
        bool (*DetectGlobalDeadLock)(PGPROC* proc);
    } TransactionManager;

[1] https://wiki.postgresql.org/wiki/DTM

Simplified call flow

    PortalStart()
      GetTransactionSnapshot()
        GetSnapshotData()
    PortalRun()
      ExecutorRun()
        ExecIndexScan()
          IndexNext()
            HeapTupleSatisfiesVisibility()
              XidInMVCCSnapshot()
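
The idea of encapsulating transaction manager functions behind pointers can be illustrated with a table of function pointers that core code calls through and that an extension may replace. The sketch below is a standalone toy, not the XTM patch: it keeps only one entry (IsInSnapshot), and TransactionManagerSketch, install_transaction_manager(), core_xid_in_snapshot(), and the counting manager are all hypothetical names introduced for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

typedef struct {
    TransactionId        xmin, xmax;
    const TransactionId *xip;
    uint32_t             xcnt;
} SnapshotSketch;

/* Reduced to a single entry; the proposed interface has several more. */
typedef struct {
    bool (*IsInSnapshot)(TransactionId xid, const SnapshotSketch *snap);
} TransactionManagerSketch;

/* Built-in behavior: xmin/xmax bounds plus the in-progress array. */
bool default_is_in_snapshot(TransactionId xid, const SnapshotSketch *snap)
{
    if (xid >= snap->xmax)
        return true;
    if (xid < snap->xmin)
        return false;
    for (uint32_t i = 0; i < snap->xcnt; i++)
        if (snap->xip[i] == xid)
            return true;
    return false;
}

static const TransactionManagerSketch default_tm = { default_is_in_snapshot };
static const TransactionManagerSketch *current_tm = &default_tm;

/* Hypothetical registration function an extension would call at load time. */
void install_transaction_manager(const TransactionManagerSketch *tm)
{
    current_tm = tm;
}

/* Core code never calls the default directly; it goes through the table. */
bool core_xid_in_snapshot(TransactionId xid, const SnapshotSketch *snap)
{
    return current_tm->IsInSnapshot(xid, snap);
}

/* Example replacement: counts calls, then delegates to the default. */
unsigned custom_calls = 0;

bool counting_is_in_snapshot(TransactionId xid, const SnapshotSketch *snap)
{
    custom_calls++;
    return default_is_in_snapshot(xid, snap);
}

const TransactionManagerSketch counting_tm = { counting_is_in_snapshot };
```

After install_transaction_manager(&counting_tm), every call to core_xid_in_snapshot() is routed to the replacement without any change to the calling code, which is the property that lets a custom snapshot or visibility rule live in an extension.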

Case Study – User-specific Transaction Method

Looking back at the problem

[Figure: timeline on Shard A and Shard B. (1) Global TX1 (Transfer) writes account A and account B, runs PREPARE on both shards, returns Ack, and runs COMMIT PREPARED on both shards. (2) Local TX2 (Balance inquiry) on Shard B takes its snapshot and reads account B.]

The problem is that the read of account B is done before COMMIT PREPARED

Solution

Wait for COMMIT PREPARED to read the updated results

[Figure: the same timeline, with TX2's read of account B delayed until after COMMIT PREPARED on Shard B. Callout: "Can read updated result!"]

Problem with this solution

TX1 is invisible in the snapshot
- The snapshot architecture needs to be customized

[Figure: the same timeline. Callout at TX2's snapshot: "TX1 is invisible!"]

Snapshot in PostgreSQL

Holds information on which transactions are visible

    XID   Status
    100   Committed
    101   In-progress
    102   Aborted
    103   In-progress (Prepared)
    104   Committed
    105   In-progress

    xmin = 101, xmax = 105

Prepared transactions are invisible in current PostgreSQL

Solution

- Prepared transactions are regarded as completed at the time they are prepared, even if they have not finished
- When accessing tuples modified by prepared transactions, wait for those transactions to finish and then determine visibility

Snapshot

    XID   Status
    100   Committed
    101   In-progress
    102   Aborted
    103   In-progress (Prepared): regarded as complete even if not finished
    104   Committed
    105   In-progress

    xmin = 101, xmax = 105

Table

    xmin  xmax  ID  Balance
    100   103   1   10
    103         1   11

Wait for prepared transactions to finish and determine the visibility of tuples

Implementation

The method in this case study modifies the following functions:
- Snapshot architecture
  - SnapshotData struct and GetSnapshotData()
- Tuple visibility determination
  - HeapTupleSatisfiesVisibility() and its callees

Example flow of executing the following query
- Point query on a table that has an index

    SELECT * FROM accounts WHERE id = 1

Simplified call flow

    PortalStart()
      GetTransactionSnapshot()
        GetSnapshotData()
    PortalRun()
      ExecutorRun()
        ExecIndexScan()
          IndexNext()
            HeapTupleSatisfiesVisibility()
              XidInMVCCSnapshot()

Implementation – Snapshot

Add a field to SnapshotData to hold transactions that were prepared when the snapshot was taken

Before

    typedef struct SnapshotData
    {
        …
        /* all XID < xmin are visible to me */
        TransactionId xmin;
        /* all XID >= xmax are invisible to me */
        TransactionId xmax;
        …
        TransactionId *xip;
        uint32 xcnt; /* # of xact ids in xip[] */
        …
    } SnapshotData;

- xip holds the in-progress transactions

After

    typedef struct SnapshotData
    {
        …
        /* all XID < xmin are visible to me */
        TransactionId xmin;
        /* all XID >= xmax are invisible to me */
        TransactionId xmax;
        …
        TransactionId *xip;
        uint32 xcnt; /* # of xact ids in xip[] */
        …
        TransactionId *x2pc;
        uint32 x2pc_cnt; /* # of xact ids in x2pc[] */
        …
    } SnapshotData;

- xip holds the in-progress but not prepared transactions
- x2pc holds the prepared transactions
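
How a snapshot might be filled once prepared transactions are tracked separately can be sketched as follows. This is a guess at the shape of the logic, not the talk's actual change to GetSnapshotData(): the fixed-size arrays, the RunningXact input list, the take_snapshot() name, and the choice to compute xmin over both lists are all assumptions made for this standalone example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define MAX_XACTS 16

/* One entry per transaction that has not finished yet. */
typedef struct {
    TransactionId xid;
    bool          prepared;   /* true once PREPARE TRANSACTION has completed */
} RunningXact;

typedef struct {
    TransactionId xmin;
    TransactionId xmax;
    TransactionId xip[MAX_XACTS];    /* in progress, not prepared */
    uint32_t      xcnt;
    TransactionId x2pc[MAX_XACTS];   /* prepared when the snapshot was taken */
    uint32_t      x2pc_cnt;
} SnapshotSketch;

/*
 * Fills *snap from the list of unfinished transactions.
 * Returns false if the list does not fit into the sketch's fixed arrays.
 */
bool take_snapshot(const RunningXact *running, size_t n,
                   TransactionId xmax, SnapshotSketch *snap)
{
    if (n > MAX_XACTS)
        return false;

    snap->xmin = xmax;
    snap->xmax = xmax;
    snap->xcnt = 0;
    snap->x2pc_cnt = 0;

    for (size_t i = 0; i < n; i++) {
        TransactionId xid = running[i].xid;

        if (xid >= xmax)
            continue;               /* already covered by the xmax rule */
        if (xid < snap->xmin)
            snap->xmin = xid;       /* assumption: xmin spans both lists */

        if (running[i].prepared)
            snap->x2pc[snap->x2pc_cnt++] = xid;
        else
            snap->xip[snap->xcnt++] = xid;
    }
    return true;
}
```

For the running example (101 in progress, 103 prepared, 105 in progress, xmax = 105), the result is xmin = 101, xip = {101}, and x2pc = {103}.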

Implementation – Visibility determination

Tuple visibility determination is done by HeapTupleSatisfiesVisibility()
- The MVCC case is the important one in this case study
- The function answers: is the given tuple (htup) visible according to the given snapshot?

    bool
    HeapTupleSatisfiesVisibility(HeapTuple htup, Snapshot snapshot, Buffer buffer)
    {
        switch (snapshot->snapshot_type)
        {
            case SNAPSHOT_MVCC:
                return HeapTupleSatisfiesMVCC(htup, snapshot, buffer);
            …
        }
        …
    }

Implementation – Visibility determination

Wait for prepared transactions in the given snapshot

The following are excerpts; "…" marks code omitted on the slide

    bool
    XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
    {
        /* Wait for prepared transactions */
        if (pg_lfind32(xid, snapshot->x2pc, snapshot->x2pc_cnt))
            XactLockTableWait(xid, NULL, NULL, XLTW_None);
        …
    }

    static bool
    HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot, Buffer buffer)
    {
        …
        if (!HeapTupleHeaderXminCommitted(tuple))
        {
            …
            else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
                return false;
            else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
                SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED, HeapTupleHeaderGetRawXmin(tuple));
            …
        }
        …
    }

The slide walks through the code in three steps:
1. if (!HeapTupleHeaderXminCommitted(tuple)): entered if xmin might not be committed
2. XidInMVCCSnapshot(): checks xmin according to the snapshot; if xmin is prepared in the snapshot, wait for it to finish
3. TransactionIdDidCommit(): after waiting, determine whether xmin is committed or aborted
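
The three steps above can be modeled in standalone C. This is a simplified sketch rather than the talk's implementation: the blocking XactLockTableWait() call is replaced by a stand-in that immediately resolves the prepared transaction to a configurable outcome, the status table covers only XIDs 100 to 105, and all names are hypothetical. The sketch also assumes that, after waiting, the XID is treated as not being in the snapshot so that its commit status decides visibility; the slide's excerpt does not show that part.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef enum { XS_IN_PROGRESS, XS_PREPARED, XS_COMMITTED, XS_ABORTED } XidStatusSketch;

#define BASE_XID 100u
#define NUM_XIDS 6u

typedef struct {
    XidStatusSketch status[NUM_XIDS];   /* status of XIDs 100..105 */
    XidStatusSketch prepared_outcome;   /* hypothetical: how prepared XIDs end */
    int             waits;              /* number of simulated waits */
} XactTable;

typedef struct {
    TransactionId        xmin, xmax;
    const TransactionId *xip;   uint32_t xcnt;      /* in progress, not prepared */
    const TransactionId *x2pc;  uint32_t x2pc_cnt;  /* prepared */
} SnapshotSketch;

static bool contains(const TransactionId *xids, uint32_t n, TransactionId xid)
{
    for (uint32_t i = 0; i < n; i++)
        if (xids[i] == xid)
            return true;
    return false;
}

/* Returns the status slot for xid, or NULL if xid is outside the table. */
static XidStatusSketch *slot(XactTable *tab, TransactionId xid)
{
    if (xid < BASE_XID || xid >= BASE_XID + NUM_XIDS)
        return NULL;
    return &tab->status[xid - BASE_XID];
}

/* Stand-in for a blocking wait: the prepared transaction finishes at once. */
void wait_for_xact(XactTable *tab, TransactionId xid)
{
    XidStatusSketch *s = slot(tab, xid);

    tab->waits++;
    if (s != NULL && *s == XS_PREPARED)
        *s = tab->prepared_outcome;
}

bool xid_in_snapshot(TransactionId xid, const SnapshotSketch *snap, XactTable *tab)
{
    if (contains(snap->x2pc, snap->x2pc_cnt, xid)) {
        wait_for_xact(tab, xid);
        return false;   /* assumption: commit status decides after the wait */
    }
    if (xid >= snap->xmax)
        return true;
    if (xid < snap->xmin)
        return false;
    return contains(snap->xip, snap->xcnt, xid);
}

/* Is a tuple written by xid visible, as far as its writer is concerned? */
bool writer_visible(TransactionId xid, const SnapshotSketch *snap, XactTable *tab)
{
    XidStatusSketch *s;

    if (xid_in_snapshot(xid, snap, tab))
        return false;
    s = slot(tab, xid);
    return s != NULL && *s == XS_COMMITTED;
}
```

With XID 103 listed in x2pc and resolving to committed, writer_visible(103, ...) returns true after one simulated wait, so the tuple with balance 11 becomes visible. If the prepared transaction resolves to aborted instead, the tuple stays invisible. A real wait can block for as long as the prepared transaction remains unresolved.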

The method in this case study

After waiting for prepared transactions, users can see updated results

[Figure: timeline on Shard A and Shard B. (1) Global TX1 (Transfer) writes account A and account B, runs PREPARE on both shards, returns Ack, and runs COMMIT PREPARED on both shards. (2) Local TX2 (Balance inquiry) on Shard B takes its snapshot and reads account B. Callout: "TX1 is visible because it is prepared!"]

Simple experiment

Conducted a simple experiment with the SmallBank benchmark
- Schema
  - Tables for users and their accounts
- Six types of queries
  - Balance inquiry
  - Amalgamate
  - Etc.

Schema

    CREATE TABLE accounts (
        custid bigint NOT NULL,
        name varchar(64) NOT NULL,
        CONSTRAINT pk_accounts PRIMARY KEY (custid)
    );
    CREATE INDEX idx_accounts_name ON accounts (name);

    CREATE TABLE savings (
        custid bigint NOT NULL,
        bal float NOT NULL,
        CONSTRAINT pk_savings PRIMARY KEY (custid),
        FOREIGN KEY (custid) REFERENCES accounts (custid)
    );

    CREATE TABLE checking (
        custid bigint NOT NULL,
        bal float NOT NULL,
        CONSTRAINT pk_checking PRIMARY KEY (custid),
        FOREIGN KEY (custid) REFERENCES accounts (custid)
    );

Example queries

    SELECT bal FROM savings WHERE custid = ?
    UPDATE savings SET bal = bal - ? WHERE custid = ?

Simple experiment – Configuration

Cluster configuration with one coordinator and four shards

[Figure: a client connects to a coordinator holding a partitioned table and to Shards 1 to 4, each holding a table. Global transactions involve multiple shards; local transactions are enclosed within a single shard.]

Result – Distributed transaction rate is 1%

The method achieves almost the same high throughput and low latency as vanilla PostgreSQL while ensuring session consistency

[Figure: two line charts comparing Vanilla PostgreSQL and the method in the case study over 0 to 500 clients: throughput (tps) and 95th percentile latency (ms)]

Result – Distributed transaction rate is 10%

The method performs as well as in the 1% case

[Figure: two line charts comparing Vanilla PostgreSQL and the method in the case study over 0 to 500 clients: throughput (tps) and 95th percentile latency (ms)]

Result – Distributed transaction rate is 50%

Even when half of the transactions are distributed transactions, there is no performance penalty with the method
- Highly effective

[Figure: two line charts comparing Vanilla PostgreSQL and the method in the case study over 0 to 500 clients: throughput (tps) and 95th percentile latency (ms)]

Correctness of this method

This method is generally not suitable for snapshot isolation
- Transactions that were in progress when the snapshot was taken are visible
- It behaves differently from snapshot isolation in PostgreSQL

However, it takes advantage of the following user-specific assumption
- READ COMMITTED is enough as an isolation level
  - READ COMMITTED allows reading the latest data for each command
  - In PostgreSQL, if two transactions write the same tuple, the successor waits for the predecessor and re-reads the updated tuple ("READ COMMITTED Update Checking")
  - The method in this case study is similar to this (but has a potential risk of infinite wait)

If users understand these restrictions and assumptions, they can enjoy significant benefits that are not available with normal transaction methods
- This highlights the strong need for introducing custom transaction methods

Pros and cons

Pros: significant benefits that are not available with normal transaction methods
- Very simple implementation
- Ensures session consistency even when users bypass the coordinator
- High throughput and low latency

Cons
- Different behavior from standard snapshot isolation
- Potential risk of infinite wait
- Requires modifications to the PostgreSQL core

XTM is needed!
- The method is preferably implemented as an extension for users who understand it deeply and use it carefully

XTM is effective for this case study

The modification in this case study can be achieved by XTM

    typedef struct
    {
        XidStatus (*GetTransactionStatus)(TransactionId xid, XLogRecPtr *lsn);
        void (*SetTransactionStatus)(TransactionId xid, int nsubxids, TransactionId *subxids, XidStatus status, XLogRecPtr lsn);
        Snapshot (*GetSnapshot)(Snapshot snapshot);
        TransactionId (*GetNewTransactionId)(bool isSubXact);
        TransactionId (*GetOldestXmin)(Relation rel, bool ignoreVacuum);
        bool (*IsInProgress)(TransactionId xid);
        TransactionId (*GetGlobalTransactionId)(void);
        bool (*IsInSnapshot)(TransactionId xid, Snapshot snapshot);
        bool (*DetectGlobalDeadLock)(PGPROC* proc);
    } TransactionManager;

[1] https://wiki.postgresql.org/wiki/DTM

Simplified call flow

    PortalStart()
      GetTransactionSnapshot()
        GetSnapshotData()
    PortalRun()
      ExecutorRun()
        ExecIndexScan()
          IndexNext()
            HeapTupleSatisfiesVisibility()
              XidInMVCCSnapshot()

Effective!

User-specific requirements are diversifying

The situation surrounding PostgreSQL has changed greatly since 2015
- 2015, when XTM was first proposed
  - Sharded clusters were underdeveloped in PostgreSQL
  - Distributed configurations were limited to a few users
- 2025 (today)
  - PostgreSQL is widely utilized with many distributed configurations, such as built-in sharding or other extensions
  - Cloud providers also offer a wide range of managed distributed databases
  - Many users are taking full advantage of these technologies

No single transaction manager can meet all these increasingly diverse requirements

XTM is rising significantly in importance today

Path towards XTM implementation

Does XTM provide enough APIs for transactional extensibility?
- If XTM is merged into the PostgreSQL core, we need to maintain it forever
- To provide APIs for XTM, we must confirm that the APIs are sufficient and can satisfy a wide range of needs for transactional extensibility
- Finishing the design of XTM too early has potential problems from this point of view

We need to implement various distributed transaction methods, including lightweight or user-specific ones, through XTM and discuss which extensibility is missing in the current proposal in order to improve the XTM design
- A variety of user-specific examples are needed for discussion in the community

I hope today's talk will be the beginning of such a discussion

Missing pieces in XTM

Other extensibility may be required to implement various kinds of transaction methods
- Modifications to the transaction manager affect a wide range of components in the PostgreSQL system
  - Replication, crash recovery, WAL logging, etc.
- The case study in this talk may indeed have some problems (e.g., with WAL application)

Rather than simply making the transaction manager pluggable, we must explore which modules of PostgreSQL will be affected by custom transaction methods and should be pluggable

Conclusion

- As distributed configurations continue to spread, the requirements for transaction methods are diversifying, and existing methods fail to meet them
- The current PostgreSQL lacks transactional extensibility, which prevents users from implementing user-specific transaction methods as extensions
- XTM is a promising solution, and this talk has presented a case study that highlights the potential of XTM to meet diverse user-specific needs
- The case study addressed a banking system and showed that XTM is effective for implementing a user-specific transaction method while maintaining high performance
- To merge XTM into the PostgreSQL core, we need to discuss what other extensibility is required to implement various transaction methods
- XTM is growing in importance and is worthy of continued discussion

References

- postgres_fdw_plus
  - https://github.com/pgfdwplus/postgres_fdw_plus
- eXtensible Transaction Manager (XTM)
  - https://wiki.postgresql.org/wiki/DTM
- Postgres-XC
  - https://wiki.postgresql.org/wiki/Postgres-XC
- Clock-SI
  - DU, Jiaqing; ELNIKETY, Sameh; ZWAENEPOEL, Willy. Clock-SI: Snapshot isolation for partitioned data stores using loosely synchronized clocks. In: 2013 IEEE 32nd International Symposium on Reliable Distributed Systems. IEEE, 2013. pp. 173–184.
- Pessimistic coordination
  - BINNIG, Carsten, et al. Distributed snapshot isolation: Global transactions pay globally, local transactions pay locally. The VLDB Journal, 2014, 23: pp. 987–1011.

Pessimistic coordination and our method

[Figure: timeline among Coordinator, Shard 1, and Shard 2 for global transactions GTX 1 and GTX 2, with the steps Global-begin, Local-begin, Do work…, Global-commit, and 2PC (omitted)]

In pessimistic coordination, global-begin and global-commit are serialized and guaranteed to operate atomically. This provides consistent snapshots across multiple shards

The method in the case study customizes the snapshots taken in local-begin and the tuple visibility determinations on shards

Reference: BINNIG, Carsten, et al. Distributed snapshot isolation: Global transactions pay globally, local transactions pay locally. The VLDB Journal, 2014, 23: pp. 987–1011.
