in many situations such as Business Intelligence (BI) • Commonalities among queries • Queries have a lot of common operations like same join Our goal • Improve the query performance by sharing commonalities among queries • Do not execute the same operation more than once • Store the intermediate results and reuse them in other queries
new insights and findings • The market size of DWH is increasing year by year • PostgreSQL is required to have the ability to be a backend for BI tools 0 10 20 30 40 2025 2018 Market size (USD) Year 30 Billion 13 Billion https://www.gminsights.com/ industry-analysis/data- warehousing-market
• A series of similar queries lead to a lot of common operations in their query plans Simple example • Assume a user operates the following two queries • Query 1: 𝐴 ⋈ 𝐵 ⋈ 𝐶 • Query 2: 𝐴 ⋈ 𝐵 ⋈ 𝐷 • 𝐴 ⋈ 𝐵 appears in both of the queries and this join operation is a commonality ⋈ ⋈ 𝐶 𝐴 𝐵 Query 1: 𝐴 ⋈ 𝐵 ⋈ 𝐶 ⋈ ⋈ 𝐷 𝐴 𝐵 Query 2: 𝐴 ⋈ 𝐵 ⋈ 𝐷 Commonality
not execute the same operation more than once • Stores the intermediate results • Reuses them in other queries Previous example • Query 1: 𝐴 ⋈ 𝐵 ⋈ 𝐶 • Query 2: 𝐴 ⋈ 𝐵 ⋈ 𝐷 • We do not need to execute 𝐴 ⋈ 𝐵 twice • Store the result of 𝐴 ⋈ 𝐵 and reuse it in the other query ⋈ ⋈ 𝐶 𝐴 𝐵 Query 1: 𝐴 ⋈ 𝐵 ⋈ 𝐶 ⋈ ⋈ 𝐷 𝐴 𝐵 Query 2: 𝐴 ⋈ 𝐵 ⋈ 𝐷 Commonality
common table expression (CTE), materialized views These methods require users to rewrite queries by hand We propose methods that automatically share commonalities SELECT * … Rewrite SELECT * … Rewrite SELECT * … Rewrite Rewriting many queries manually is unrealistic!
in many situations such as Business Intelligence (BI) • Commonalities among queries • Queries have a lot of common operations like same join Our goal • Improve the query performance by sharing commonalities among queries • Do not execute the same operation more than once • Store the intermediate results and reuse them in other queries
to issue several queries repeatedly when refining them • Solution: Caching result Single query • Commonalities sometimes appear in a single query • Solution: Shared execution DB Commonalities Query Query DB Commonalities Query
one of the use cases • According to a BI survey, users tend to repeatedly issue similar queries in the refining process [1] • Many users do not submit a single complete query • The users need to refine the queries, so they issue several incomplete queries repeatedly • Example of the refining process • Adding missing WHERE clauses • Commonalities exist in a series of these queries DB Query Query Refine the query and issue it again [1] Vogelsgesang, Adrian, et al. "Get real: How benchmarks fail to represent the real world." Proceedings of the Workshop on Testing Database Systems. ACM, 2018.
a single query • TPC-DS query 88 select * from (select count(*) h8_30_to_9 from store_sales, household_demographics , time_dim, store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 8 and time_dim.t_minute >= 30 and ((household_demographics.hd_dep_count = 3 and household_demographics.hd_vehicle_count<=3+2) or (household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or (household_demographics.hd_dep_count = 1 and household_demographics.hd_vehicle_count<=1+2)) and store.s_store_name = 'ese') s1, (select count(*) h9_to_9_30 from store_sales, household_demographics , time_dim, store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 9 and time_dim.t_minute < 30 and ((household_demographics.hd_dep_count = 3 and household_demographics.hd_vehicle_count<=3+2) or (household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or (household_demographics.hd_dep_count = 1 and household_demographics.hd_vehicle_count<=1+2)) and store.s_store_name = 'ese') s2, (select count(*) h9_30_to_10 from store_sales, household_demographics , time_dim, store
a single query • TPC-DS query 88 select * from (select count(*) h8_30_to_9 from store_sales, household_demographics , time_dim, store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 8 and time_dim.t_minute >= 30 and ((household_demographics.hd_dep_count = 3 and household_demographics.hd_vehicle_count<=3+2) or (household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or (household_demographics.hd_dep_count = 1 and household_demographics.hd_vehicle_count<=1+2)) and store.s_store_name = 'ese') s1, (select count(*) h9_to_9_30 from store_sales, household_demographics , time_dim, store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 9 and time_dim.t_minute < 30 and ((household_demographics.hd_dep_count = 3 and household_demographics.hd_vehicle_count<=3+2) or (household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or (household_demographics.hd_dep_count = 1 and household_demographics.hd_vehicle_count<=1+2)) and store.s_store_name = 'ese') s2, (select count(*) h9_30_to_10 from store_sales, household_demographics , time_dim, store Similar join operation appears 8 times in a single query There is no need to execute this join operation 8 times This query lists the number of sold item in the specific time periods
• One for multiple queries • One for a single query These two methods assume read-only workloads • Current methods cannot handle write queries • Handling write queries is one of the major future works of our technology
to issue several queries repeatedly when refining them • Solution: Caching result Single query • Commonalities sometimes appear in a single query • Solution: Shared execution DB Commonalities Query Query DB Commonalities Query Caching intermediate results works for multiple queries
share commonalities among queries Our caching method materializes intermediates results in memory and reuses them in subsequent queries DB Commonalities Memory Cache Reuse
Record the number of occurrences of query plan trees 2. Cache results • If a plan tree appears more than some threshold number of times, the caching method materializes the execution result of it into memory 3. Reuse cached results • If the arrived query (partially) matches previous ones and their results are cached, the caching method reuses them instead of executing the query again
twice, so we cache the result of it • Table 𝐴 and 𝐵 will not be cached because they are child nodes of 𝐴 ⋈ 𝐵 Node 𝐴 𝐵 𝐶 𝐴 ⋈ 𝐵 𝐴 ⋈ 𝐵 ⋈ 𝐶 # of occurrences 2 2 1 2 1 ⋈ ⋈ 𝐷 𝐴 𝐵 Query 2: 𝐴 ⋈ 𝐵 ⋈ 𝐷 Memory Node 𝐷 𝐴 ⋈ 𝐵 ⋈ 𝐷 # of occurrences 1 1 The result of 𝑨 ⋈ 𝑩 Cache
• The method keeps more frequently referenced caches in memory When trying to cache some results, the caching method releases data whose frequency is less
queries that partially match past ones, while result caching can work only when the same queries arrive Our method is transparent, so users do not need to explicitly manage the cache, while materialize views require users to maintain them
int, y int) • CREATE TABLE B(x int, y int) • CREATE TABLE C(x int, y int) • CREATE TABLE D(x int, y int) • CREATE TABLE E(x int, y int) 2. Execute the next queries successively • SELECT * FROM A JOIN B ON A.x = B.x JOIN C ON A.x = C.x LIMIT 1 • SELECT * FROM A JOIN B ON A.x = B.x JOIN D ON A.x = D.x LIMIT 1 • SELECT * FROM A JOIN B ON A.x = B.x JOIN E ON A.x = E.x LIMIT 1
𝐵 was cached during the execution of the second query The execution time was reduced because the cached result was reused • Three queries are different from each other but the caching method worked • Since the caching method is transparent, there is nothing to do to use it
Contains real data and queries from workbooks in Tableau Public • Represents queries that BI dashboard (Tableau) generates • We used HashTag benchmark in Public BI benchmark • HashTag benchmark analyzes tweets • We sequentially executed four queries in this benchmark and measured their execution time • Four queries are different from each other
SELECT (NOT("t0"."_Tableau_join_flag" IS NULL)) AS "io:1 time:nk", (NOT("t1"."_Tableau_join_flag" IS NULL)) AS "io:2 times:nk", (NOT("t2"."_Tableau_join_flag" IS NULL)) AS "io:20+ times:nk", (NOT("t3"."_Tableau_join_flag" IS NULL)) AS "io:3 times:nk", (NOT("t4"."_Tableau_join_flag" IS NULL)) AS "io:4 times:nk", (NOT("t5"."_Tableau_join_flag" IS NULL)) AS "io:5 times:nk", COUNT(DISTINCT "HashTags_1"."twitter#user#screen_name") AS "usr:Calculation_6330207195516273:ok" FROM "HashTags_1" LEFT JOIN ( SELECT "HashTags_1"."twitter#user#screen_name" AS "twitter#user#screen_name", MIN(1) AS "_Tableau_join_flag" FROM "HashTags_1" Aggregation Join operation
execution time by 11% 0 5 10 15 20 25 30 With caching Without caching Execution time (s) better +3% ±0% -28% -28% Totally -11% Query 1 Query 2 Query 3 Query 4 Query 1 took almost the same time in both methods
execution time by 11% 0 5 10 15 20 25 30 With caching Without caching Execution time (s) better +3% ±0% -28% -28% Totally -11% Query 1 Query 2 Query 3 Query 4 Since the same operation appeared twice, the caching method cached its result. Caching took some costs, so the execution time increased.
execution time by 11% 0 5 10 15 20 25 30 With caching Without caching Execution time (s) better +3% ±0% -28% -28% Totally -11% Query 1 Query 2 Query 3 Query 4 In queries 3 and 4, the caching method could reuse the cached results, so we obtained a 28% improvement
execution time by 11% 0 5 10 15 20 25 30 With caching Without caching Execution time (s) better +3% ±0% -28% -28% Totally -11% Query 1 Query 2 Query 3 Query 4 Totally, the execution time was reduced by 11%
the four queries, so we could obtain performance improvements In this benchmark, a user repeatedly issue similar queries while changing aggregated columns or conditional clauses The caching method is effective for this kind of workload DB Similar aggregations with different conditions Query
to issue several queries repeatedly when refining them • Solution: Caching result Single query • Commonalities sometimes appear in a single query • Solution: Shared execution DB Commonalities Query Query DB Commonalities Query We propose a shared execution method to speed up a single query
a single query • TPC-DS query 88 select * from (select count(*) h8_30_to_9 from store_sales, household_demographics , time_dim, store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 8 and time_dim.t_minute >= 30 and ((household_demographics.hd_dep_count = 3 and household_demographics.hd_vehicle_count<=3+2) or (household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or (household_demographics.hd_dep_count = 1 and household_demographics.hd_vehicle_count<=1+2)) and store.s_store_name = 'ese') s1, (select count(*) h9_to_9_30 from store_sales, household_demographics , time_dim, store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 9 and time_dim.t_minute < 30 and ((household_demographics.hd_dep_count = 3 and household_demographics.hd_vehicle_count<=3+2) or (household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or (household_demographics.hd_dep_count = 1 and household_demographics.hd_vehicle_count<=1+2)) and store.s_store_name = 'ese') s2, (select count(*) h9_30_to_10 from store_sales, household_demographics , time_dim, store
use a simple example in this presentation • Assume the following four tables exist in the database • Each table has integer columns x and y x y 10 20 20 30 𝐴 x y 10 20 20 30 𝐵 x y 10 20 20 30 𝐶 x y 10 20 20 30 𝐷
𝐶 𝐴 𝐵 ⋈ ⋈ 𝐷 𝐴 𝐵 UNION ALL (y = 10) (y = 15) Corresponding plan tree SELECT * FROM A JOIN B ON A.x = b.x JOIN C ON A.x = C.x WHERE A.y = 10 UNION ALL SELECT * FROM A JOIN B ON A.x = b.x JOIN D ON A.x = D.x WHERE A.y = 15
between A and B is a commonality There are a lot of ways to optimize this query, and one solution is Common Table Expression (CTE) • (Not-inlined) CTE materializes the execution result and reuses it in all occurrences Example of introduction of CTE WITH cte as ( SELECT * FROM A JOIN B ON A.x = B.x WHERE A.y = 10 OR A.y = 15 ) SELECT * FROM cte JOIN C ON cte.x = C.x WHERE cte.y = 10 UNION ALL SELECT * FROM cte JOIN D ON cte.x = D.x WHERE cte.y = 15
really reduce the execution time? Answer • It depends on the workload • Materializing takes some cost • If the cost is high, the performance will not increase • We have to determine whether the optimization is appropriate or not • We prevent bad optimizations by cost evaluation
the following steps 1. Find commonalities from the given query 2. Cost evaluation • To prevent bad optimization 3. Introduce CTEs which stand for the commonalities 4. Execute
the following steps 1. Find commonalities from the given query 2. Cost evaluation • To prevent bad optimization 3. Introduce CTEs which stand for the commonalities 4. Execute
commonalities from the query plan of the given query ⋈ ⋈ 𝐶 𝐴 𝐵 ⋈ ⋈ 𝐷 𝐴 𝐵 UNION ALL (y = 10) (y = 15) Commonality These structures are regarded as a commonality even if their conditional clauses differ
each occurrence differ The caching method merges them into ORed conditions • Original occurrences filter the CTE with their original conditions SELECT * FROM A JOIN B ON A.x = b.x JOIN C ON A.x = C.x WHERE A.y = 10 UNION ALL SELECT * FROM A JOIN B ON A.x = b.x JOIN D ON A.x = D.x WHERE A.y = 15 WITH cte as ( SELECT * FROM A JOIN B ON A.x = B.x WHERE A.y = 10 OR A.y = 15 ) SELECT * FROM cte JOIN C ON cte.x = C.x WHERE cte.y = 10 UNION ALL SELECT * FROM cte JOIN D ON cte.x = D.x WHERE cte.y = 15
the following steps 1. Find commonalities from the given query 2. Cost evaluation • To prevent bad optimization 3. Introduce CTEs which stand for the commonalities 4. Execute
improve performance • Materializing the result is a little time-consuming task • If reusing does not reduce the execution cost drastically, the total performance might decrease We perform a cost evaluation to overcome this problem
modify the query) Step 2 – Cost Evaluation If the cost with shared execution (𝐶𝑠ℎ𝑎𝑟𝑒) is less than that without shared execution (𝐶𝑛𝑜𝑛𝑠ℎ𝑎𝑟𝑒), it is reasonable to perform shared execution 𝐶𝑛𝑜𝑛𝑠ℎ𝑎𝑟𝑒 𝐶𝑠ℎ𝑎𝑟𝑒 𝐶𝑛𝑜𝑛𝑠ℎ𝑎𝑟𝑒 𝐶𝑠ℎ𝑎𝑟𝑒
cost of executing a commonality only once 2. The cost of storing the result into memory 3. The cost of reusing the result 𝐶𝑠ℎ𝑎𝑟𝑒 = 𝑐𝑜𝑠𝑡 𝑒+ + 𝑚𝑎𝑡𝑒𝑟𝑖𝑎𝑙𝑖𝑧𝑒 𝑒+ + 𝑁 × 𝑟𝑒𝑢𝑠𝑒 𝑒+ 1 2 3 Estimated by the planner Proportional to the number of result rows N is the number of occurrences of the commonalities
the following steps 1. Find commonalities from the given query 2. Cost evaluation • To prevent bad optimization 3. Introduce CTEs which stand for the commonalities 4. Execute
the following steps 1. Find commonalities from the given query 2. Cost evaluation • To prevent bad optimization 3. Introduce CTEs which stand for the commonalities 4. Execute
= 10 OR y = 15) CTE Execute commonalities only once 1 Memory Intermediate results Materialize the results into memory 2 Reuse the materialized results 3 ⋈ ⋈ 𝐶 𝐴 𝐵 ⋈ ⋈ 𝐷 𝐴 𝐵 UNION ALL (y = 10) (y = 15)
• Commonalities are not known in advance • We need to predict future queries • Cost evaluation based on the predication is a future work Shared execution (for a single query) • Commonalities are known in advance • We can easily perform cost evaluation DB Commonalities Query 1 Query 2 DB Commonalities Query The commonality is not known when query 1 arrives We can detect all commonalities from the given query
following four tables (the same ones in the example of the caching method) • CREATE TABLE A(x int, y int) • CREATE TABLE B(x int, y int) • CREATE TABLE C(x int, y int) • CREATE TABLE D(x int, y int) 2. Execute the following query ⋈ ⋈ 𝐶 𝐴 𝐵 ⋈ ⋈ 𝐷 𝐴 𝐵 UNION ALL (y = 10) (y = 15) SELECT * FROM A JOIN B ON A.x = b.x JOIN C ON A.x = C.x WHERE A.y = 10 UNION ALL SELECT * FROM A JOIN B ON A.x = b.x JOIN D ON A.x = D.x WHERE A.y = 15
This query has same join operations select * from (select count(*) h8_30_to_9 from store_sales, household_demographics , time_dim, store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 8 and time_dim.t_minute >= 30 and ((household_demographics.hd_dep_count = 3 and household_demographics.hd_vehicle_count<=3+2) or (household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or (household_demographics.hd_dep_count = 1 and household_demographics.hd_vehicle_count<=1+2)) and store.s_store_name = 'ese') s1, (select count(*) h9_to_9_30 from store_sales, household_demographics , time_dim, store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 9 and time_dim.t_minute < 30 and ((household_demographics.hd_dep_count = 3 and household_demographics.hd_vehicle_count<=3+2) or (household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or (household_demographics.hd_dep_count = 1 and household_demographics.hd_vehicle_count<=1+2)) and store.s_store_name = 'ese') s2, (select count(*) h9_30_to_10 from store_sales, household_demographics , time_dim, store
shared execution is 7.8 times faster than that without shared execution Sharing commonalities in a single query greatly improved query performance 0 5 10 15 20 With shared execution Without shared execution Execution time (s) better
caching method • Caching method (for multiple queries) lacks cost evaluation • Prediction of future queries and cost evaluation based on it are major future works • Handling write queries • Current methods will collapse when tables change • We have to implement invalidation of the cache whose corresponding data is updated • Managing join order • Breaking the optimal join order is very effective to utilize the caching method • Cache replacement algorithms
to share commonalities Experiments by Public BI Benchmark and TPC-DS revealed that our methods obtained high performance improvements Sharing commonalities among queries is very effective at reducing the query execution time Handling write queries and managing join order are future works Thank you!
queries ⋈ ⋈ 𝐶 𝐴 𝐵 ⋈ ⋈ 𝐷 𝐴 𝐵 Our methods work for this plan If the queries are planned as the following join order, 𝐴 ⋈ 𝐵 becomes a commonality However, the commonality does not appear in the following query plan ⋈ ⋈ 𝐵 𝐴 𝐶 ⋈ ⋈ 𝐵 𝐴 𝐷 We cannot apply our methods to 𝑨 ⋈ 𝑩
experiment to test the effect of join order We issued one TPC-DS query repeatedly while varying its “date” column ⋈ ⋈ 𝐴 𝐶 𝐵 date =1999/1/1 Query 1’ ⋈ ⋈ 𝐴 𝐶 𝐵 date =2000/1/1 Query 1 DB Since table A and B do not change, 𝐴 ⋈ 𝐵 is a commonality. However, it does not appear in the optimal plan, so we cannot cache its result
order by using pg_hint_plan extension ⋈ ⋈ 𝐴 𝐶 𝐵 date ⋈ ⋈ 𝐶 𝐴 𝐵 date Optimal plan (We cannot cache 𝐀 ⋈B) Plan whose join order is manually changed (We can cache 𝐀 ⋈ 𝑩)
caching With caching Without caching With caching Execution time (s) ⋈ ⋈ 𝐴 𝐶 𝐵 date ⋈ ⋈ 𝐶 𝐴 𝐵 date Optimal plan Modified plan The performance did not improve because we could not cache 𝑨 ⋈ 𝑩
caching With caching Without caching With caching Execution time (s) ⋈ ⋈ 𝐴 𝐶 𝐵 date ⋈ ⋈ 𝐶 𝐴 𝐵 date Optimal plan Modified plan Execution time increased because the modified join order is not optimal
caching With caching Without caching With caching Execution time (s) ⋈ ⋈ 𝐴 𝐶 𝐵 date ⋈ ⋈ 𝐶 𝐴 𝐵 date Optimal plan Modified plan We obtained a 47% improvement because we cached 𝑨 ⋈ 𝑩
caching With caching Without caching With caching Execution time (s) ⋈ ⋈ 𝐴 𝐶 𝐵 date ⋈ ⋈ 𝐶 𝐴 𝐵 date Optimal plan Modified plan • It is effective to break the optimal join order in order to utilize our caching method • Managing join order automatically is future work