
Revisiting XTM: A Practical Case Study Highlighting Its Needs - PGConf.dev 2025


Yuya Watari

May 14, 2025


Transcript

  1. © 2025 NTT CORPORATION
     About me – Yuya Watari
     - From Tokyo, Japan
     - Work: NTT Open Source Software Center / NTT Software Innovation Center; developing distributed databases
     - Interests: distributed transactions, planning, optimization, photography
  2. Agenda
     1. Introduction
     2. Making PostgreSQL Distributed – FDW-based Sharding
     3. Making the Transaction Manager Pluggable – XTM
     4. Case Study – A User-specific Transaction Method
     5. Discussion
     6. Conclusion
  3. Keywords – Transactional extensibility and XTM
     PostgreSQL is well known for its high extensibility
     - Hooks in the planner and executor, access methods, etc.
     However, PostgreSQL lacks transactional extensibility
     - No hooks or pluggable APIs for replacing the transaction manager
     - Custom transaction methods cannot be implemented as extensions; forking the PostgreSQL core is required
     The eXtensible Transaction Manager (XTM) is a promising solution
     - XTM allows the replacement of core transactional APIs
       • Snapshot mechanism, tuple visibility determination, etc.
     - XTM was proposed in 2015 by PostgresPro but has not yet been implemented
  4. New usage of XTM driven by user-specific requirements
     The discussion in 2015 mainly focused on implementing state-of-the-art methods
     - Global transaction managers (GTMs), clock-based algorithms, etc.
     Beyond such purposes, XTM plays an important role in complementing these existing methods to meet user-specific requirements
     - Requirements for distributed transactions vary from user to user
       • Various patterns of consistency
       • Performance-oriented or consistency-oriented?
     - Existing methods do not always satisfy these requirements
     - Using XTM to complement existing methods has not been discussed so far
     This talk introduces a case study of complementing existing state-of-the-art methods to meet user-specific requirements.
  5. Case study
     The case study models a banking system
     - Account information is distributed across multiple nodes
     - Two queries: transfer and balance inquiry
     User-specific requirements
     - Should support local transactions on shards to provide high performance
     - Should ensure session consistency even for local transactions
     XTM is utilized to complement existing methods, which fail to meet these requirements.
     [Diagram: coordinator holding a partitioned table, connected via FDW to Shard A (account A) and Shard B (account B); queries: (1) transfer, (2) balance inquiry]
  6. FDW-based sharding
     Combines FDW and table partitioning
     - Foreign Data Wrapper (FDW)
       • Accesses remote data via SQL
       • Enables the use of external data sources as if they were local tables
     - Table partitioning
       • Splits a table into multiple partitions
     [Diagram: coordinator with partitioned table; FDW to Shard A and Shard B]
  7. Two user-specific requirements
     In addition to standard global transaction support, such as atomic commit and atomic visibility, users require the following:
     1. Allowing local transactions on shards
        - Transactions enclosed within a single shard should bypass the coordinator and be performed locally on the shard
     2. Ensuring session consistency
        - If a user observes that a (global) transaction has finished, subsequent transactions should reflect its result
        - Must also apply to local transactions
     [Diagram: coordinator with partitioned table; FDW to Shard A and Shard B]
  8. Methods for standard global transaction support
     State-of-the-art methods have been proposed to ensure atomic visibility:
     - Global transaction managers (GTMs), e.g., Postgres-XC/XL
       • Assign global transaction IDs to uniquely identify global transactions
     - Clock-based algorithms, e.g., Clock-SI
       • Use loosely synchronized clocks to coordinate global transactions
     - Pessimistic coordination
       • Assigns local snapshots atomically when a global transaction begins
     These range from centralized coordination (e.g., GTMs) to decentralized coordination (e.g., clock-based algorithms).
  9. System design
     Ensure atomic commit by 2PC
     - Using postgres_fdw_plus
     Ensure atomic visibility by an existing state-of-the-art method
     - To support local transactions on shards, we assume decentralized coordination
     However, this design does not ensure session consistency, which the user in the case study requires.
     [Diagram: coordinator with partitioned table; FDW to Shard A and Shard B]
  10. postgres_fdw_plus – an FDW supporting 2PC
     postgres_fdw is provided as a contrib module
     - Enables the use of PostgreSQL servers as external data sources
     - Does not support distributed transactions
     postgres_fdw_plus is a fork of postgres_fdw with support for atomic commit
     - Developed by NTT DATA
     - Basic usage is the same as the original postgres_fdw
     - Atomic commit
       • Ensures that a transaction is either fully committed or fully aborted on all participating nodes
       • Essential in distributed transactions
       • postgres_fdw_plus achieves atomic commit via the two-phase commit (2PC) protocol
     The case study adopts postgres_fdw_plus.
  11. Atomic commit
     Naïve transaction methods can lead to an inconsistent state
     - Naïve one-phase commit
     [Sequence diagram: client sends Commit to the coordinator; the coordinator sends Commit to Shard 1 and Shard 2; both reply OK; the coordinator replies OK to the client]
  12. Atomic commit
     Naïve transaction methods can lead to an inconsistent state
     - Naïve one-phase commit
     [Sequence diagram: Shard 1 replies OK to Commit, but Shard 2 replies NO]
     If some commits on shards fail, the global transaction is committed on Shard 1 but fails on Shard 2 – an inconsistent state.
  13. Two-phase commit (2PC) protocol
     A solution to ensure atomic commit
     - Involves a Prepare phase and a Commit phase
     [Sequence diagram: Prepare phase – the coordinator sends Prepare to both shards and each replies Prepared; Commit phase – the coordinator sends Commit to both shards and each replies OK; the coordinator then acks the client]
  14. Flow of committing a transaction – original
     CommitTransaction() is called when committing a transaction. There are callbacks before and after the main operations: the pre-commit callback is allowed to return an error, leading to AbortTransaction(); the post-commit callback is NOT allowed to return an error.

     static void
     CommitTransaction(void)
     {
         ...
         CallXactCallbacks(XACT_EVENT_PRE_COMMIT);
         ...
         /*
          * We need to mark our XIDs as committed in pg_xact.
          * This is where we durably commit.
          */
         latestXid = RecordTransactionCommit();
         ...
         CallXactCallbacks(XACT_EVENT_COMMIT);
         ...
     }

     Note: the above code has been modified for simplicity.
  15. Implementation of postgres_fdw_plus
     postgres_fdw_plus utilizes these callbacks to implement 2PC (code as on the previous slide, modified for simplicity). On success:
     - In the pre-commit callback (XACT_EVENT_PRE_COMMIT), postgres_fdw_plus issues PREPARE TRANSACTION
     - In the post-commit callback (XACT_EVENT_COMMIT), postgres_fdw_plus issues COMMIT PREPARED
  16. Implementation of postgres_fdw_plus
     (Same code as above.) If PREPARE TRANSACTION fails on some nodes, control flow moves to AbortTransaction(), and postgres_fdw_plus issues ROLLBACK PREPARED.
  17. Optimization of the 2PC protocol
     The current postgres_fdw_plus returns the ack to the client after the Commit phase of 2PC.
     [Sequence diagram: Prepare phase (Prepare/Prepared), then Commit phase (Commit/OK), then OK to the client]
  18. Optimization of the 2PC protocol
     Returning the ack at the end of the Prepare phase is a typical optimization of 2PC.
     - To achieve high performance, the case study adopts this optimization.
     [Sequence diagram: OK is returned to the client immediately after the Prepare phase; COMMIT PREPARED proceeds afterwards]
  19. Session consistency is NOT guaranteed
     Consider the following scenario:
     1. Initially, both accounts A and B have $10
     2. The user transfers $1 from A to B
        • This is a global transaction
        • Account A will have $9, and B will have $11
     3. The user inquires the balance of B
        • This is a local transaction on Shard B
        • The user must observe the updated result $11, i.e., the transfer must not be lost
        • We call this session consistency
     [Diagram: coordinator with partitioned table; FDW to Shard A and Shard B]
  20. Why is session consistency not guaranteed?
     The user may read an inconsistent state after the Prepare phase.
     [Timeline: Global TX1 (transfer) writes accounts A and B, PREPAREs on both shards, acks the client, then issues COMMIT PREPARED; meanwhile Local TX2 (balance inquiry) on Shard B takes a snapshot and reads account B, expecting $11]
  21. Why is session consistency not guaranteed?
     The user may read an inconsistent state after the Prepare phase.
     [Timeline as above; on receiving the ack, the user concludes: "Transfer was finished!"]
  22. Why is session consistency not guaranteed?
     The user may read an inconsistent state after the Prepare phase.
     [Timeline as above; the user then starts TX2 to confirm the transfer]
  23. Why is session consistency not guaranteed?
     The user may read an inconsistent state after the Prepare phase.
     [Timeline as above; TX2 observes $10 – the transfer appears lost!]
  24. Achieving session consistency with XTM
     This case study uses XTM to implement a custom transaction method that ensures session consistency even for local transactions.
     - A user-specific method for a user-specific requirement
     - Customizes snapshots and tuple visibility determination
     XTM is effective for complementing existing methods to meet this kind of user-specific requirement, which is not always needed in general.
  25. eXtensible Transaction Manager (XTM)
     The eXtensible Transaction Manager (XTM) significantly improves transactional extensibility in PostgreSQL.
     - Allows the core transactional APIs to be replaced
     - First proposed in 2015 by PostgresPro
     - Previously discussed, but not yet merged into the PostgreSQL core
  26. XTM interface
     The proposed interface of XTM [1]. Each pointer encapsulates an original function of the transaction manager.

     typedef struct
     {
         XidStatus (*GetTransactionStatus)(TransactionId xid, XLogRecPtr *lsn);
         void (*SetTransactionStatus)(TransactionId xid, int nsubxids,
                                      TransactionId *subxids, XidStatus status,
                                      XLogRecPtr lsn);
         Snapshot (*GetSnapshot)(Snapshot snapshot);
         TransactionId (*GetNewTransactionId)(bool isSubXact);
         TransactionId (*GetOldestXmin)(Relation rel, bool ignoreVacuum);
         bool (*IsInProgress)(TransactionId xid);
         TransactionId (*GetGlobalTransactionId)(void);
         bool (*IsInSnapshot)(TransactionId xid, Snapshot snapshot);
         bool (*DetectGlobalDeadLock)(PGPROC *proc);
     } TransactionManager;

     [1] https://wiki.postgresql.org/wiki/DTM
  27. XTM interface
     (Same TransactionManager struct as on the previous slide.)
     Question: what do these functions originally do?
     - We need to know the behavior of the transaction manager.
  28. Deep dive into the transaction manager
     PostgreSQL adopts MultiVersion Concurrency Control (MVCC) to achieve isolation between transactions with high concurrency.
     - The transaction manager assigns a transaction ID (XID) to each transaction
     - The transaction manager obtains a snapshot when a transaction starts
       • This is the behavior when the isolation level is REPEATABLE READ or higher
       • For READ COMMITTED, the transaction manager obtains a snapshot for each command
       • A snapshot stores information about which transactions were committed, aborted, or in progress when the snapshot was taken
     - The executor returns tuples that are visible according to the current snapshot
       • Called tuple visibility determination
  29. Snapshot
     Holds information on which transactions are visible (xmin = 101, xmax = 105):

     XID  Status       Visibility
     100  Committed    Visible
     101  In-progress  Invisible
     102  Aborted      Invisible
     103  In-progress  Invisible
     104  Committed    Visible
     105  In-progress  Invisible
  30. Snapshot
     All transactions whose XID is less than xmin are visible. (Same XID table as above; xmin = 101, xmax = 105.)
  31. Snapshot
     All transactions whose XID is greater than or equal to xmax are invisible. (Same XID table; xmin = 101, xmax = 105.)
  32. Snapshot
     Transactions between xmin and xmax have different visibility – they are stored in an array in the snapshot data structure. (Same XID table; xmin = 101, xmax = 105.)
  33. Execution – writing tables
     Each tuple stores the XIDs that wrote and deleted it (xmin is the XID that wrote this tuple; xmax is the XID that deleted it):

     xmin  xmax  ID  Balance
     100         1   10
  34. Execution – writing tables
     After this record is updated by XID 103, a new tuple version is added to the table:

     xmin  xmax  ID  Balance
     100   103   1   10
     103         1   11
  35. Execution – scanning
     During a table scan, the executor checks the visibility of tuples according to the snapshot (xmin = 101, xmax = 105; XID 100 committed, XID 103 in progress).
     For the tuple (xmin = 100, xmax = 103): XID 100 is visible, but XID 103 is invisible, so this tuple is visible (shown in the result).
  36. Execution – scanning
     For the tuple (xmin = 103): XID 103 is invisible and no visible transaction deleted it, so this tuple is invisible (not shown in the result).
  37. Execution – scanning
     Query result:

     ID  Balance
     1   10
  38. Implementation – snapshot
     The snapshot data structure holds this information:

     typedef struct SnapshotData
     {
         ...
         /* all XID < xmin are visible to me */
         TransactionId xmin;
         /* all XID >= xmax are invisible to me */
         TransactionId xmax;
         ...
         TransactionId *xip;   /* in-progress transactions */
         uint32 xcnt;          /* # of xact ids in xip[] */
         ...
     } SnapshotData;
  39. Implementation – execution
     Example flow of executing the following query (a point query on an indexed column):

     SELECT * FROM accounts WHERE id = 1

     PortalStart() -> GetTransactionSnapshot() -> GetSnapshotData()
     PortalRun() -> ExecutorRun() -> ExecIndexScan() -> IndexNext()
       -> HeapTupleSatisfiesVisibility() -> XidInMVCCSnapshot()
  40. Implementation – execution
     (Same call flow as above.) The snapshot is obtained when the transaction starts: PortalStart() -> GetTransactionSnapshot() -> GetSnapshotData().
  41. Implementation – execution
     (Same call flow.) The table is scanned during execution while checking tuple visibility: ExecIndexScan() -> IndexNext() -> HeapTupleSatisfiesVisibility() -> XidInMVCCSnapshot().
  42. XTM provides encapsulation of these functions
     For example, XTM allows the snapshot architecture to be replaced: the TransactionManager struct [1] (see the XTM interface above) encapsulates GetSnapshot(), which sits on the GetTransactionSnapshot() -> GetSnapshotData() path, as well as IsInSnapshot() on the visibility path.
     [1] https://wiki.postgresql.org/wiki/DTM
  43. Looking back at the problem
     [Timeline: Global TX1 (transfer) acks the client after PREPARE; Local TX2 (balance inquiry) takes a snapshot and reads account B before COMMIT PREPARED]
     The problem is that the read of account B happens before COMMIT PREPARED.
  44. Solution
     Wait for COMMIT PREPARED before reading, so the updated result can be read.
     [Timeline: TX2's read of account B is delayed until after COMMIT PREPARED – it can read the updated result]
  45. Problem with this solution
     TX1 is invisible in the snapshot, so the snapshot architecture needs to be customized.
     [Timeline as before; even after waiting, TX1 is invisible to TX2's snapshot]
  46. Snapshot in PostgreSQL
     Holds information on which transactions are visible. Prepared transactions are invisible in current PostgreSQL – e.g., XID 103, in progress (prepared), is invisible between xmin = 101 and xmax = 105.
  47. Solution
     Prepared transactions are regarded as completed at the time they are prepared, even if they have not actually finished.
     When accessing tuples modified by prepared transactions, wait for those transactions to finish and then determine the tuples' visibility.
     (Snapshot: XID 103, in progress (prepared), is regarded as complete; for the tuple versions written or deleted by XID 103, wait for it to finish before determining visibility.)
  48. Implementation
     For the example point query above (SELECT * FROM accounts WHERE id = 1), the method in this case study modifies the following functions:
     - Snapshot architecture
       • The SnapshotData struct and GetSnapshotData()
     - Tuple visibility determination
       • HeapTupleSatisfiesVisibility() and its callees
  49. Implementation – snapshot
     Add a field to SnapshotData to hold the transactions that were prepared when the snapshot was taken. Before (as in current PostgreSQL): SnapshotData holds xmin, xmax, and xip[]/xcnt for in-progress transactions.
  50. Implementation – snapshot
     After: xip[]/xcnt now hold in-progress but not prepared transactions, and new fields hold the prepared ones:

     typedef struct SnapshotData
     {
         ...
         /* all XID < xmin are visible to me */
         TransactionId xmin;
         /* all XID >= xmax are invisible to me */
         TransactionId xmax;
         ...
         TransactionId *xip;    /* in-progress (but not prepared) transactions */
         uint32 xcnt;           /* # of xact ids in xip[] */
         ...
         TransactionId *x2pc;   /* prepared transactions */
         uint32 x2pc_cnt;       /* # of xact ids in x2pc[] */
         ...
     } SnapshotData;
  51. Implementation – visibility determination
     Tuple visibility determination is done by HeapTupleSatisfiesVisibility() – is the given tuple (htup) visible according to the given snapshot? MVCC snapshots are what matter in this case study:

     bool
     HeapTupleSatisfiesVisibility(HeapTuple htup, Snapshot snapshot, Buffer buffer)
     {
         switch (snapshot->snapshot_type)
         {
             case SNAPSHOT_MVCC:
                 return HeapTupleSatisfiesMVCC(htup, snapshot, buffer);
             ...
         }
         ...
     }
  52. Implementation – visibility determination
     Wait for prepared transactions in the given snapshot (elided branches are shown as '...'):

     bool
     XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
     {
         ...
         /* Wait for prepared transactions */
         if (pg_lfind32(xid, snapshot->x2pc, snapshot->x2pc_cnt))
             XactLockTableWait(xid, NULL, NULL, XLTW_None);
         ...
     }

     static bool
     HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot, Buffer buffer)
     {
         ...
         if (!HeapTupleHeaderXminCommitted(tuple))
         {
             ...
             else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
                 return false;
             else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
                 SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
                             HeapTupleHeaderGetRawXmin(tuple));
             ...
         }
         ...
     }
  53. Implementation – visibility determination
     (Same code as on the previous slide.) The !HeapTupleHeaderXminCommitted(tuple) branch handles the case where xmin might not be committed.
  54. Implementation – visibility determination
     (Same code.) XidInMVCCSnapshot() checks xmin against the snapshot; if xmin is prepared in the snapshot, it waits for the transaction to finish.
  55. Implementation – visibility determination
     (Same code.) After waiting, TransactionIdDidCommit() determines whether xmin was committed or aborted.
  56. The method in this case study
     After waiting for prepared transactions, users can see the updated results.
     [Timeline: TX2's read of account B waits for TX1's COMMIT PREPARED; TX1 is visible because it was prepared]
  57. Simple experiment
     Conducted a simple experiment with the SmallBank benchmark.
     - Schema: tables for users and their accounts
     - Six types of queries: balance inquiry, amalgamate, etc.

     CREATE TABLE accounts (
         custid bigint NOT NULL,
         name varchar(64) NOT NULL,
         CONSTRAINT pk_accounts PRIMARY KEY (custid)
     );
     CREATE INDEX idx_accounts_name ON accounts (name);

     CREATE TABLE savings (
         custid bigint NOT NULL,
         bal float NOT NULL,
         CONSTRAINT pk_savings PRIMARY KEY (custid),
         FOREIGN KEY (custid) REFERENCES accounts (custid)
     );

     CREATE TABLE checking (
         custid bigint NOT NULL,
         bal float NOT NULL,
         CONSTRAINT pk_checking PRIMARY KEY (custid),
         FOREIGN KEY (custid) REFERENCES accounts (custid)
     );

     Example queries:
     SELECT bal FROM savings WHERE custid = ?
     UPDATE savings SET bal = bal - ? WHERE custid = ?
  58. Simple experiment – configuration
     A cluster configuration with one coordinator and four shards. Global transactions involve multiple shards; local transactions are enclosed within a single shard.
     [Diagram: client -> coordinator (partitioned table) -> Shards 1–4]
  59. Result – distributed transaction rate of 1%
     The method achieves almost the same high throughput and low latency as vanilla PostgreSQL while ensuring session consistency.
     [Charts: throughput (tps, up to ~70,000) and 95th-percentile latency (ms, up to ~30) vs. number of clients (0–500); vanilla PostgreSQL vs. the method in the case study]
  60. Result – distributed transaction rate of 10%
     The method obtains as good a result as in the 1% case.
     [Charts: throughput (up to ~60,000 tps) and 95th-percentile latency (up to ~35 ms) vs. number of clients; vanilla PostgreSQL vs. the method]
  61. Result – distributed transaction rate of 50%
     Even when half of the transactions are distributed, there is no performance penalty with the method – it is very effective.
     [Charts: throughput (up to ~45,000 tps) and 95th-percentile latency (up to ~80 ms) vs. number of clients; vanilla PostgreSQL vs. the method]
  62. Correctness of this method
     This method is generally not suitable for snapshot isolation.
     - Transactions that were in progress when the snapshot was taken can become visible
     - It behaves differently from snapshot isolation in PostgreSQL
     However, it takes advantage of the following user-specific assumptions:
     - READ COMMITTED is enough as an isolation level
       • READ COMMITTED allows reading the latest data for each command
       • In PostgreSQL, if two transactions write the same tuple, the successor waits for the predecessor and re-reads the updated tuple ("READ COMMITTED update checking")
       • The method in this case study is similar to this (but carries a potential risk of infinite waits)
     If users understand these restrictions and assumptions, they can enjoy significant benefits that are not available with normal transaction methods.
     - This highlights the strong need for introducing custom transaction methods.
  63. Pros and cons
     Pros
     - Very simple implementation
     - Ensures session consistency even when users bypass the coordinator
     - High throughput and low latency
     Cons
     - Different behavior from standard snapshot isolation
     - Potential risk of infinite waits
     - Requires modifications to the PostgreSQL core
     The method offers significant benefits that are not available with normal transaction methods, and is preferably implemented as an extension, for users who understand it deeply and use it carefully – XTM is needed!
  64. XTM is effective for this case study
     The modification in this case study can be achieved through XTM: the TransactionManager struct [1] encapsulates GetSnapshot() and the visibility-related functions on the GetSnapshotData() / HeapTupleSatisfiesVisibility() / XidInMVCCSnapshot() path shown earlier.
     [1] https://wiki.postgresql.org/wiki/DTM
  65. User-specific requirements are diversifying
     The situation surrounding PostgreSQL has changed greatly since 2015.
     - 2015, when XTM was first proposed:
       • Sharded clusters were underdeveloped in PostgreSQL
       • Distributed configurations were limited to a few users
     - 2025 (today):
       • PostgreSQL is widely deployed in many distributed configurations, such as built-in sharding or other extensions
       • Cloud providers also offer a wide range of managed distributed databases
       • Many users are taking full advantage of these technologies
     No single transaction manager can meet all these increasingly diverse requirements – XTM is significantly rising in importance today.
  66. Path towards an XTM implementation
     Does XTM provide enough APIs for transactional extensibility?
     - If XTM is merged into the PostgreSQL core, we need to maintain it forever
     - To provide APIs for XTM, we must confirm that they can satisfy a wide range of needs for transactional extensibility and are sufficient
     - Finalizing the design of XTM too early has potential problems from this point of view
     We need to implement various distributed transaction methods, including lightweight or user-specific ones, through XTM, and discuss what extensibility is missing from the current proposal in order to improve XTM's design.
     - A variety of user-specific examples are needed for discussion in the community
     I hope today's talk will be the beginning of such a discussion.
  67. Missing pieces in XTM
     Other extensibility may be required to implement various kinds of transaction methods.
     - Modifications to the transaction manager affect a wide range of components in the PostgreSQL system
       • Replication, crash recovery, WAL logging, etc.
     - The case study in this talk may indeed have some problems (e.g., with WAL replay)
     Rather than simply making the transaction manager pluggable, we must explore which modules of PostgreSQL are affected by custom transaction methods and should be made pluggable.
  68. Conclusion
     - As distributed configurations continue to spread, the requirements for transaction methods are diversifying, and existing methods fail to meet them
     - Current PostgreSQL lacks transactional extensibility, preventing users from implementing user-specific transaction methods as extensions
     - XTM is a promising solution, and this talk has presented a case study that highlights the potential of XTM to meet diverse user-specific needs
     - The case study addressed a banking system and showed that XTM is effective for implementing a user-specific transaction method while maintaining high performance
     - To merge XTM into the PostgreSQL core, we need to discuss what other extensibility is required to implement various transaction methods
     - XTM is growing in importance and is worthy of continued discussion
  69. References
     - postgres_fdw_plus: https://github.com/pgfdwplus/postgres_fdw_plus
     - eXtensible Transaction Manager (XTM): https://wiki.postgresql.org/wiki/DTM
     - Postgres-XC: https://wiki.postgresql.org/wiki/Postgres-XC
     - Clock-SI: Du, Jiaqing; Elnikety, Sameh; Zwaenepoel, Willy. Clock-SI: Snapshot isolation for partitioned data stores using loosely synchronized clocks. In: 2013 IEEE 32nd International Symposium on Reliable Distributed Systems. IEEE, 2013, pp. 173–184.
     - Pessimistic coordination: Binnig, Carsten, et al. Distributed snapshot isolation: Global transactions pay globally, local transactions pay locally. The VLDB Journal, 2014, 23, pp. 987–1011.
  70. Pessimistic coordination and our method
     In pessimistic coordination, global-begin and global-commit are serialized and guaranteed to operate atomically. This provides consistent snapshots across multiple shards.
     The method in the case study customizes the snapshots taken at local-begin and the tuple visibility determinations on the shards.
     [Sequence diagram: the coordinator serializes Global-begin/Global-commit for GTX 1 and GTX 2 across Shards 1 and 2; local-begin and work happen on the shards; 2PC omitted]
     Reference: Binnig, Carsten, et al. Distributed snapshot isolation: Global transactions pay globally, local transactions pay locally. The VLDB Journal, 2014, 23, pp. 987–1011.