From Tokyo, Japan Work For NTT Open Source Software Center/NTT Software Innovation Center Developing distributed databases Interests Planning Optimization Distributed transaction Photography
and Its Performance Problem 2. A Deep Dive into the Planner – Why Does Planning Have 𝑶𝑶 𝒏𝒏𝟐𝟐 ? 3. Hacking the Planner – How to Make It 𝑶𝑶 𝒏𝒏 ? 4. Experiments 5. Tips 6. Open Issues 7. Conclusion
feature in big-data era Important to realize high scalability and manageability User demand for partitioning is increasing year by year Use case example: management of log data Splitting log data per date Partitioning Partitioned Table logs Table logs_20220101 Table logs_20220102 Table logs_20221231
and before Table inheritance and CHECK constraint PostgreSQL 10 Declarative partitioning (including syntax support) Partition pruning CREATE TABLE t1(id INT, name TEXT) PARTITION BY RANGE (id); CREATE TABLE t1_1 PARTITION OF t1 FOR VALUES FROM (0) TO (100); CREATE TABLE t1_2 PARTITION OF t1 FOR VALUES FROM (100) TO (200); SELECT * FROM t1 WHERE id < 100 Unrelated so eliminated from query plan Partitioned Table Table Table t1 t1_1 t1_2 0 <= id < 100 100 <= id < 200
Partition-wise join/aggregation Run-time partition pruning Allows us to prune partitions based on values that will not be determined until execution time Expands scope of partition pruning and speeds up various workloads HASH partitioning Distributes accesses to child partitions Partitioned Table Table a1 Table a2 id = 1 id = 2 Partitioned Table Table b1 Table b2 id = 1 id = 2 ⋈ Table a1 Table b1 id = 1 id = 1 ⋈ Table a2 Table b2 id = 2 id = 2 ⋈ Append
Indexes for partitioned tables Faster partition pruning Partition pruning previously involved linear search over partition’s metadata Faster algorithms, such as binary search or hashing function, have been utilized for partition pruning since PostgreSQL 11 PostgreSQL 12 Faster COPY for partitioned tables ⋮ What’s next? Today’s topic – Planning for highly partitioned tables proposed for PostgreSQL 18
partitioned tables is challenging In example of log data management, we need to handle about 1000 partitions for 3 years of data Number of child partitions affects partitioning performance Partitioned Table logs Table logs_20220101 Table logs_20220102 Table logs_20220103 Table logs_20220104 Table logs_20241231
highly partitioned tables is slow in current implementation Problem happens when queries have join operations on partitioned tables with many non-pruned partitions Partitioned Table Table Table Table Partitioned Table Table Table Table ⋈ Most of child partitions are relevant Planning is slow!
𝑶𝑶 𝒏𝒏𝟐𝟐 time complexity when planning join operations 𝑛𝑛 is number of child partitions Proposed patch reduces time complexity to 𝑶𝑶 𝒏𝒏 Improves architecture of equivalences Under discussion on pgsql-hackers 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0 200 400 600 800 1000 Planning Time (seconds) # of partitions (n) PostgreSQL 17devel Patched Experiment #1: Planning time when joining three partitioned tables
will deep dive into planner to see why it has 𝑶𝑶 𝒏𝒏𝟐𝟐 time complexity Let’s see how next SQL is planned Table t1 has two child partitions, namely t1_1 and t1_2 Same for t2 and t3 Query simply joins these three tables SELECT * FROM t1, t2, t3 WHERE t1.id = t2.id AND t1.id = t3.id Partitioned Table Table Table t1 t1_1 t1_2 Partitioned Table Table Table t2 t2_1 t2_2 Partitioned Table Table Table t3 t3_1 t3_2
searches for join orders Query has three possible join orders Here, we ignore swapping of inner and outer tables Question How does planner know that t2.id is equal to t3.id even though clause does not appear in given query? SELECT * FROM t1, t2, t3 WHERE t1.id = t2.id AND t1.id = t3.id; t1 t2 t3 ⋈ ⋈ t1 t3 t2 ⋈ ⋈ t2 t3 t1 ⋈ ⋈ t1.id = t2.id t1.id = t3.id t1.id = t2.id t1.id = t2.id t1.id = t3.id t2.id = t3.id
finding join clauses that do not explicitly appear in original query If clause (t1.id = t2.id or t1.id = t3.id in this example) is found, we split it into EquivalenceMembers, and EquivalenceClass keeps them in list (ec_members) EquivalenceMember stands for variable and has Relids related to member List will have variables that are known to be equal to each other We can find join clauses that do not explicitly appear in given query like t2.id = t3.id EquivalenceClass EM t1.id (Relids: {1}) EM t2.id (Relids: {2}) EM t3.id (Relids: {3}) t1.id = t2.id t1.id = t3.id ec_members EM: EquivalenceMember
EquivalenceMembers to handle partitioned tables EquivalenceClass EM t1.id (Relids: {1}) EM t2.id (Relids: {2}) EM t3.id (Relids: {3}) EM t1_1.id (Relids: {4}) EM t2_1.id (Relids: {5}) EM t3_1.id (Relids: {6}) EM t1_2.id (Relids: {7}) EM t2_2.id (Relids: {8}) EM t3_2.id (Relids: {9}) Parent members Child members
𝒏𝒏𝟐𝟐 time complexity? Lookups of EquivalenceMembers when building paths are time-consuming There are 𝑛𝑛 paths to be built (𝑛𝑛 is number of child partitions) Building each path involves EquivalenceMember lookups with 𝑂𝑂(𝑛𝑛) time complexity Total time complexity becomes 𝑂𝑂 𝑛𝑛2 Partitioned Table 𝑛𝑛 paths 𝑶𝑶 𝒏𝒏𝟐𝟐 time complexity! Path Path Path Path Path Table Table Table Table Table Involves EquivalenceMember lookups with 𝑶𝑶(𝒏𝒏) complexity
about planning starts with planner() function To build paths for partitioned tables, planner calls set_append_rel_pathlist() We will look into source code of set_append_rel_pathlist() function set_append_rel_pathlist() planner() standard_planner() grouping_planner() query_planner() subquery_planner()
partitioned table static void set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, Index rti, RangeTblEntry *rte) { … foreach(l, root->append_rel_list) { AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(l); … /* * Compute the child's access paths. */ set_rel_pathlist(root, childrel, childRTindex, childRTE); … } … } Calls this function to build paths for each partition Function takes child relation as its arguments There are 𝒏𝒏 child partitions, so this is called 𝒏𝒏 times
can be illustrated as below Call tree We need to handle equivalences of join operations when building paths set_append_rel_pathlist() set_rel_pathlist() generate_join_implied_equalities() set_rel_pathlist() set_rel_pathlist() Calls 𝑛𝑛 times create_index_path()
that we can deduce from equivalence static List * generate_join_implied_equalities_normal(…, EquivalenceClass *ec, Relids join_relids, …) { … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } … } {1, 6} = {t1, t3_1} Contains relations involved in this join operation (In this example, we try to join t1 and t3_1)
that we can deduce from equivalence static List * generate_join_implied_equalities_normal(…, EquivalenceClass *ec, Relids join_relids, …) { … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } … } Checks if current EquivalenceMember is related to given join_relids {1, 6} = {t1, t3_1}
Search for ec_members is linear search with 𝑶𝑶 𝒏𝒏 time complexity Slow for highly partitioned cases because EquivalenceClasses have many child members EquivalenceClass EM t1.id (Relids: {1}) EM t2.id (Relids: {2}) EM t3.id (Relids: {3}) EM t1_1.id (Relids: {4}) EM t2_1.id (Relids: {5}) EM t3_1.id (Relids: {6}) EM t1_2.id (Relids: {7}) EM t2_2.id (Relids: {8}) EM t3_2.id (Relids: {9}) Relids {1, 6} bms_is_subset()? Match!
𝒏𝒏𝟐𝟐 Very slow! Call tree set_append_rel_pathlist() set_rel_pathlist() generate_join_implied_equalities() set_rel_pathlist() set_rel_pathlist() Calls 𝑛𝑛 times Equivalence Member Equivalence Member Equivalence Member Equivalence Member 𝑂𝑂 𝑛𝑛 members Linear search Equivalence Member create_index_path()
that linear search over so many child members in EquivalenceClass is time-consuming We propose introducing child members when needed, without need to do time-consuming search EquivalenceClasses no longer have child members in proposed patch Since there are a few parent members, search for EquivalenceMembers is fast Patch does not change plans Patch is under discussion, so there may be (large) changes in future
works What we want to do Find EquivalenceMembers whose Relids is subset of given one Given Relids: How to do this in proposed patch 1. Get parent representation of given Relids Parent representation: 2. If EquivalenceMember matches parent representation, its child members may also match. So, we introduce and iterate over them In this example, EquivalenceMember for t3 matches parent representation, so we introduce its child members (t3_1 and t3_2) {1, 6} = {t1, t3_1} {1, 3} = {t1, t3}
static List * generate_join_implied_equalities_normal(…, EquivalenceClass *ec, Relids join_relids, …) { Relids top_parent_join_relids = find_relids_top_parents(root, join_relids); … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … if (bms_is_subset(cur_em->em_relids, top_parent_join_relids)) ec->ec_members = list_concat(ec->ec_members, get_child_members(ec, cur_em, join_relids)); if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } } {1, 6} = {t1, t3_1} EM t1.id (Relids: {1}) EM t2.id (Relids: {2}) EM t3.id (Relids: {3}) ec_members {1, 3} = {t1, t3} If member matches parent representation, its child members may also match. We iterate over child members. (Note: actual code is more complicated so as not to break ec_members) {1, 3} = {t1, t3} EM t3_1.id (Relids: {6})
get child members efficiently Parent EquivalenceMember keeps its child members’ Relids as Bitmapset Intersecting this index with given Relids gives us EquivalenceMembers that we actually want Bitmapset operations are fast, so we can introduce child members efficiently EM t3.id (Relids: {3}) (Child members’ Relids: {6, 9}) Relids: {1, 6} join_relids EM t3_1.id (Relids: {6}) Intersect
to almost 𝑶𝑶 𝒏𝒏 Very fast! Call tree – With proposed patch Equivalence Member Equivalence Member Only a few parent members Almost 𝑂𝑂(1) set_append_rel_pathlist() set_rel_pathlist() generate_join_implied_equalities() set_rel_pathlist() set_rel_pathlist() Calls 𝑛𝑛 times create_index_path()
patch Pros Drastically reduces time complexity of planning Cons May have some negative impacts for smaller sizes due to additional operations We will investigate this at end of this talk
to test proposed patch Three partitioned tables with 𝑛𝑛 partitions (t1, t2, and t3) Example when 𝑛𝑛 was 2 Query joining these three tables Partition-wise join was off in this experiment Varied 𝑛𝑛 and measured planning time SELECT * FROM t1, t2, t3 WHERE t1.id = t2.id AND t1.id = t3.id CREATE TABLE t1(id INT PRIMARY KEY, name TEXT) PARTITION BY RANGE (id); CREATE TABLE t1_1 PARTITION OF t1 FOR VALUES FROM (0) TO (100); CREATE TABLE t1_2 PARTITION OF t1 FOR VALUES FROM (100) TO (200); -- Same for t2 and t3
without partition-wise join Number of partitions was set to 1024 Result Proposed patch worked effectively both with and without partition-wise join 0 0.1 0.2 0.3 0.4 0.5 0.6 PostgreSQL 17devel Patched PostgreSQL 17devel Patched Partition-wise join is off Partition-wise join is on Planning Time (seconds) 1024 partitions case Experiment #2 6.7 times speedup 2.6 times speedup better
varying number of tables being joined Schema was same as in experiments #1 and #2 Query Number of partitions was set to 1024 Partition-wise join was off SELECT * FROM t1, t2, …, tm WHERE t1.id = t2.id … AND t1.id = tm.id Varied number of tables being joined
Benchmark Speedup of patch was demonstrated using Join Order Benchmark Join Order Benchmark: famous benchmark for testing OLAP workloads Experimental results for 1024 partitions case 0 2 4 6 8 10 12 14 16 18 20 Speedup of planning time Query better 1 10.7 times speedup on average Up to 17.5 times speedup
will benefit from proposed patch Tables have many partitions Partition pruning does not work effectively Queries contain joins involving many tables Queries are complicated (having sorts, index accesses, etc.)
of proposed patch Let’s confirm bottleneck was actually reduced by profiling PostgreSQL perf Famous profiler available on Linux Usable for finding bottlenecks Profilers are powerful when developing PostgreSQL Here, we profile 1024 partitions case in experiment #1 In experiment, proposed patch obtained 6.7 times speedup
profile backend process of PostgreSQL 1. Run psql command to establish connection 2. Confirm process ID of backend process 3. Run perf to profile $ ps x PID TTY STAT TIME COMMAND … 99 ? Ss 0:00 postgres: ubuntu postgres [local] idle $ psql psql (17devel) Type "help" for help. postgres=# $ sudo perf record -ag -p 99
that original implementation has 𝑶𝑶 𝒏𝒏𝟐𝟐 time complexity by printf debug Inserts logging codes in functions to see how they are called In PostgreSQL hacking, elog and ereport are helpful for printf debug Modifying PostgreSQL source code as follows tells us value of ‘count’ LOG, NOTICE, and DEBUG are available as log level For more details, see PostgreSQL official document https://www.postgresql.org/docs/16/error-message-reporting.html elog(LOG, "%d", count);
Ran experiment #1 for 𝒏𝒏 = 𝟏𝟏𝟏𝟏𝟏𝟏𝟏𝟏 partitions case joining three tables We found following from this output EquivalenceClass had 3075 = 3 𝑛𝑛 + 1 members Parent and its children for each table Comparison in linear search was done 18,929,700 times This is approximately 18𝑛𝑛2 Original implementation actually has 𝑶𝑶 𝒏𝒏𝟐𝟐 time complexity $ grep "printfdebug" logfile | sed "s/.*LOG: //g" | sort | uniq -c 18929700 [printfdebug] Checking an EquivalenceMember (ec_members has 3075 members)
Ran same with proposed patch applied Changes from unpatched result Number of EquivalenceMembers: 3075 = 3 𝑛𝑛 + 1 3 Number of comparisons: 18,929,700 ≈ 18𝑛𝑛2 24,612 ≈ 24𝑛𝑛 We confirmed proposed patch greatly reduced time complexity $ grep "printfdebug" logfile | sed "s/.*LOG: //g" | sort | uniq -c 24612 [printfdebug] Checking an EquivalenceMember (ec_members has 3 members)
problems When we suffer some performance problem, what should we do to solve it? Clarifying bottleneck is first step Profilers are helpful to do it I profiled PostgreSQL many times to address this planning performance problem and found bms_is_subset() operation was heavy in query planner Finding fundamental cause is then required Bottleneck that profilers show is not always fundamental cause We need to investigate why bottleneck is actually caused In today’s talk, problem was not that bms_is_subset() was heavy but that linear search over so many EquivalenceMembers was time-consuming Trial and errors are particularly important to address performance issues I did experiments more than 100 times to write first PoC patch
fundamental cause, we will then improve PostgreSQL source code to address performance problem and propose patches When proposing patches, following should be with patches Queries or steps to reproduce the performance problem Important to make sure that other developers can easily investigate problem on their own machines Reproduction queries or steps should be as simple as possible Detailed experimental results Support effectiveness of patches As results are more detailed, patches will be more attractive When are patches effective? How large is performance improvement?
slightly negative impacts on smaller sizes Experiment shows regression is on microseconds order 18 microseconds for 2 partitions case This is much smaller than variance of planning times Avoiding regression is key to committing this patch We need to conduct as many experiments as possible to examine effects of proposed patch better
assertion are also discussed Memory consumption Proposed patch is known to increase memory consumption This seems to be due to Bitmapset-based indexes for eliminating non-relevant child EquivalenceMembers Further investigation is needed to clarify cause Assertion It is very important to avoid adding bugs, and assertions are helpful However, overly excessive assertions slow down regression test time Speeding up debug builds is also necessary
and important feature of PostgreSQL PostgreSQL has improved partitioning performance since its introduction Planning of highly partitioned tables gets slow in current PostgreSQL implementation Handling equivalences of join operations causes this problem Proposed patch improves architecture of equivalences and reduces time complexity from 𝑂𝑂 𝑛𝑛2 to 𝑂𝑂 𝑛𝑛 Experimental results show proposed patch obtains 10x or more speedup on Join Order Benchmark
are on following pages https://commitfest.postgresql.org/48/3701/ https://www.postgresql.org/message-id/flat/CAJ2pMkZNCgoUKSE+_5LthD+Kb [email protected] Any comments are welcome!
thank everyone who has contributed to this patch Especially, David Rowley He has worked with me extensively and deeply at the source code level since I first submitted the patch. He has made numerous contributions and given a lot of kind advice