Upgrade to Pro — share decks privately, control downloads, hide ads and more …

HTAP By Accident: Getting More From PostgreSQL ...

HTAP By Accident: Getting More From PostgreSQL Using Hardware Acceleration

Big Data. Data Science. AI. It's all big business.
Once upon a time we succeeded in these fields by selectively storing, processing and learning from just the right data. This, of course, requires you to know what "the right data" is. We know there are valuable insights in data, so why not store the lot? It's the 21st century equivalent of "there's gold in them thar hills!"

So having spent years stashing away terabytes of your data in PostgreSQL, you want to start learning from that data. Queries. More queries. More complex queries. Lots of real-time queries. Lots of concurrent users. It might be tempting at this point to give up on PostgreSQL and stash your data into a different solution, more suited to purpose. Don't. PostgreSQL can perform very well in HTAP environments and performs even better with a little help.

In this presentation we dive into the current state of the art with regards to PostgreSQL in HTAP environments and expose how hardware acceleration can help squeeze as much knowledge as possible out of your data.

Paul Finlayson Adams

June 26, 2019
Tweet

More Decks by Paul Finlayson Adams

Other Decks in Technology

Transcript

  1. Confidential & Proprietary ©Swarm64 AS, 2019 2 There's Gold In

    Them Thar Hills "The stone age did not end for the lack of stone, and the oil age will end long before the world runs out of oil." Organizations store lots of transactional data. They want to derive value from that data. Fast.
  2. Confidential & Proprietary ©Swarm64 AS, 2019 3 From OLTP To

    HTAP Data engineering today: driven by the right tool for the right job Elasticsearch for search / relevance Hadoop for Map-Reduce CouchDB and MongoDB for document stores Tableau for BI analytics and visualization Why?
  3. Confidential & Proprietary ©Swarm64 AS, 2019 4 The Vs Of

    Big Data Value Veracity Velocity Volume Variety
  4. Confidential & Proprietary ©Swarm64 AS, 2019 6 DBAs Volume &

    Velocity Care about Install & configure databases Monitor performance Do capacity planning They
  5. Confidential & Proprietary ©Swarm64 AS, 2019 7 Analysts Veracity &

    Value Solve business problems by learning from data Convert data into information Help businesses make better decision using information Care about They
  6. Confidential & Proprietary ©Swarm64 AS, 2019 8 HTAP By Accident

    P A U L A D A M S VP Engineering S E B A S T I A N D R E S S L E R Team Lead Solution Engineering
  7. Confidential & Proprietary ©Swarm64 AS, 2019 9 The DBAs Benchmark

    Standardized & comparable: TPC-H* DWH benchmark 22 Queries 8 Tables (2 fact, 6 dimension) Various scale factors (up to PB) Run single or multiple streams (*) TPC-DS is newer but there are fewer (semi-)official data points to compare to
  8. Confidential & Proprietary ©Swarm64 AS, 2019 10 Setup Hardware Dual

    Intel Xeon Gold 6140 384GB RAM 8x960 GB SSD Software CentOS 7.6 PostgreSQL 11.3 Swarm64 DA 2.0 TPC-H 1 TB worth of data Biggest tables: 6bn & 1.5bn rows Configurations Single Node 1 Coordinator + 2 Data Nodes 2 Coordinators + 2 Data Nodes Same as above + Swarm64 DA
  9. Confidential & Proprietary ©Swarm64 AS, 2019 12 Postgres & Analytics

    Postgres One of the world's most trusted and powerful databases Maturity built on decades of community-driven development Well-respected for OLTP workloads Very capable in HTAP and OLAP Well, there is a "but"...
  10. Confidential & Proprietary ©Swarm64 AS, 2019 14 Queries By Example:

    TPC-H Q6 SELECT SUM(l_extendedprice * l_discount) AS revenue FROM lineitem WHERE l_shipdate >= DATE '1993-01-01' AND l_shipdate < DATE '1993-01-01' + INTERVAL '1' YEAR AND l_discount BETWEEN 0.05 - 0.01 AND 0.05 + 0.01 AND l_quantity < 24; Scanning > Parallelism helps, yet limited > Indices may cause scatter-gather Statistics help > They narrow selectivity Typical runtime: 10min
  11. Confidential & Proprietary ©Swarm64 AS, 2019 15 Queries By Example:

    TPC-H Q12 SELECT l_shipmode, SUM(CASE WHEN o_orderpriority = '1-URGENT' OR o_orderpriority = '2-HIGH' THEN 1 ELSE 0 END) AS high_line_count, SUM(CASE WHEN o_orderpriority <> '1-URGENT' AND o_orderpriority <> '2-HIGH' THEN 1 ELSE 0 END) AS low_line_count FROM orders, lineitem WHERE o_orderkey = l_orderkey AND l_shipmode IN ('TRUCK', 'AIR') AND l_commitdate < l_receiptdate AND l_shipdate < l_commitdate AND l_receiptdate >= DATE '1996-01-01' AND l_receiptdate < DATE '1996-01-01' + INTERVAL '1' YEAR GROUP BY l_shipmode ORDER BY l_shipmode; Typical runtime: >10min Expensive finalization > Early data reduction is key JOIN > May scatter-gather Filtering > Data reduction point #1 > Parallelism helps GROUP BY > Data reduction point #2
  12. Confidential & Proprietary ©Swarm64 AS, 2019 19 Analytics On Real

    World Datasets Question How much more generous are passengers being picked up at The Dead Rabbit than those being dropped off? NYC Taxi Billions of rows Pickup location, drop-off location, tip, fare, ...
  13. Confidential & Proprietary ©Swarm64 AS, 2019 20 The Analyst's Query

    SELECT (outbound.tip - inbound.tip) / inbound.tip * 100 AS generosity_increase FROM ( SELECT AVG(tip_amount / fare_amount) AS tip FROM trip_dropoff_locations JOIN trips ON trip_dropoff_locations.id = trips.id WHERE (trip_dropoff_locations.longitude BETWEEN -74.0114803745 AND -74.0105174615) AND (trip_dropoff_locations.latitude BETWEEN 40.7030212228 AND 40.7032184606) AND tip_amount != 0 AND fare_amount != 0) inbound CROSS JOIN ( SELECT AVG(tip_amount / fare_amount) AS tip FROM trip_pickup_locations JOIN trips ON trip_pickup_locations.id = trips.id WHERE (trip_pickup_locations.longitude BETWEEN -74.0114803745 AND -74.0105174615) AND (trip_pickup_locations.latitude BETWEEN 40.7030212228 AND 40.7032184606) AND tip_amount != 0 AND fare_amount != 0) outbound; Range-based scan > Index helps to reduce data volume JOIN > May scatter-gather Serial repetition > Similar query, executes in sequence Filtering > Reduces data volume
  14. Confidential & Proprietary ©Swarm64 AS, 2019 21 Analytics On Real

    World Datasets SELECT (outbound.tip - inbound.tip) / inbound.tip * 100 AS generosity_increase FROM ( SELECT AVG(tip_amount / fare_amount) AS tip FROM trip_dropoff_locations JOIN trips ON trip_dropoff_locations.id = trips.id WHERE (trip_dropoff_locations.longitude BETWEEN -74.0114803745 AND -74.0105174615) AND (trip_dropoff_locations.latitude BETWEEN 40.7030212228 AND 40.7032184606) AND tip_amount != 0 AND fare_amount != 0) inbound CROSS JOIN ( SELECT AVG(tip_amount / fare_amount) AS tip FROM trip_pickup_locations JOIN trips ON trip_pickup_locations.id = trips.id WHERE (trip_pickup_locations.longitude BETWEEN -74.0114803745 AND -74.0105174615) AND (trip_pickup_locations.latitude BETWEEN 40.7030212228 AND 40.7032184606) AND tip_amount != 0 AND fare_amount != 0) outbound; 1min 56s
  15. Confidential & Proprietary ©Swarm64 AS, 2019 23 Scale-Out: When &

    Why? Separation of concerns I/O intensive vs. CPU intensive Bottlenecks on your single node I/O CPU RAM Why is that? Data grows More concurrent users More demanding queries
  16. Confidential & Proprietary ©Swarm64 AS, 2019 24 Scale-Out: Typical Approach

    Coordinator Table metadata Node to connect to & query on Does the compute intesive part Data Nodes Table data Nodes where coordinators connect to Performs scanning and other I/O operations (e.g. filtering)
  17. Confidential & Proprietary ©Swarm64 AS, 2019 26 Scale-Out With PG

    Tools Native scale-out Use the Postgres Foreign Data Wrapper extension (postgres_fdw) Coordinator tables are postgres_fdw tables Connect to data nodes where the data is actually located Use partitions for parallelism Pitfall postgres_fdw not parallelized out-of-the-box, needs a patch
  18. Confidential & Proprietary ©Swarm64 AS, 2019 28 Analytics On Real

    World Datasets Same query, better result? Split computation & I/O Data gathering on data nodes Final computation is done on the coordinator 1min 46s
  19. Confidential & Proprietary ©Swarm64 AS, 2019 30 Tuning Postgres Data

    For Analytics Add Indices They help you on point-lookups, range queries, full text search, ... Upside Access data faster Downside They cost extra storage & CPU They can cause non-optimal I/O patterns Decide for fast reads over fast writes Add Partitions Mostly reduce data on range queries by selecting the right partition Upside Lower data volume & better maintenance (partitions can be plugged out) Downside Parallelism might be limited Changing the partition scheme might be hard
  20. Confidential & Proprietary ©Swarm64 AS, 2019 32 How To Increase

    Parallelism? Postgres limits parallelism... to prevent resource over-allocation to ensure transactional safety, even on a highly loaded system Patched postgres_fdw Parallelize scans on remote tables for higher throughput Workload management Determine and assign resources prior to query execution Monitor system state to acknowledge change Query rewriting Transform query plans to be executed more efficiently
  21. Confidential & Proprietary ©Swarm64 AS, 2019 Optimized Columns ROW- /

    COLUMN-HYBRID BLOCKS UP TO 3 RANGE-INDICES I/O transfer from storage device WHERE ws_order_number BETWEEN 150 AND 15000 AND ws_sold_date_sk BETWEEN 2450820 AND 2452000 WHERE ws_order_number BETWEEN 150 AND 15000 AND ws_sold_date_sk BETWEEN 2450820 AND 2452000 M ulti-part index
  22. Confidential & Proprietary ©Swarm64 AS, 2019 Decompress Pick Rows Pick

    Columns Result FROM SELECT Parallel Plan Optimized Columns WHERE Executed on the HW Accelerator WHERE 34 + Hardware Acceleration
  23. Confidential & Proprietary ©Swarm64 AS, 2019 37 Analytics On Real

    World Datasets Same query, best result? Split computation & I/O Data gathering on data nodes Final computation is done on the coordinator Plus Higher scan & filter parallelism Higher throughput due to compression 21s
  24. Confidential & Proprietary ©Swarm64 AS, 2019 38 Conclusions Postgres is

    great for analytics! Single node performs well Postgres can scale-out natively Hardware and software optimizations allow for greater parallelism and higher throughput Postgres can be a true analytics engine
  25. Confidential & Proprietary ©Swarm64 AS, 2019 39 Got Questions? Come

    and find us in the exhibitor area P A U L A D A M S VP Engineering [email protected] @theRealPAdams S E B A S T I A N D R E S S L E R Team Lead Solution Engineering [email protected] @theDressler