Real-Time CDC Data Processing! Building a Modern Transactional Data Lake
Agenda
1. Drawbacks of a data lake built on an append-only distributed file system
2. Building a data lake that supports CDC-based UPSERT
(1) Using a view table
(2) Using open table formats - Apache Iceberg, Hudi, Delta Lake
3. Modern Transactional Data Lake Architecture
Building a Data Lake
• Data sources: CRM, IoT, web, messages, CDC event streams
• Pipeline: stream storage, stream delivery, data lake on a DFS*
• Consumers: data warehouse, data mart, AI/ML, data analytics
* DFS: Distributed File System
Building a Data Lake
• Data sources: CRM, IoT, web, messages, CDC event streams
• Pipeline: Amazon Kinesis Data Streams -> Amazon Kinesis Data Firehose -> Amazon S3 (data lake) -> Amazon Athena -> Amazon QuickSight
How Should Update/Delete in CDC Data Be Handled?
• Pipeline: RDBMS (CDC) -> AWS DMS -> Amazon Kinesis Data Streams -> Amazon Kinesis Data Firehose -> Amazon S3 (data lake) -> Amazon Athena
• Objects in the data lake:
  datalake/
    year=2023/month=05/day=03/hour=01/
      obj1.parquet
      obj2.parquet
      …
    year=2023/month=05/day=03/hour=02/
      updated-obj1.parquet
      …
• Change records:
  Operation  Changed Data
  I          pk1, c1, c2, t1
  U          pk1, c1, c2, t2
  D          pk0, c1, c2, t3
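The partition layout on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `partition_prefix`, the `datalake` root, and the two timestamps are assumptions chosen to match the paths shown on the slide, not part of any AWS API.

```python
from datetime import datetime, timezone

def partition_prefix(event_time: datetime, root: str = "datalake") -> str:
    """Build a Hive-style hourly partition prefix like the ones on the slide."""
    t = event_time.astimezone(timezone.utc)
    return f"{root}/year={t:%Y}/month={t:%m}/day={t:%d}/hour={t:%H}/"

# Hypothetical arrival times for the insert and the later update of one row.
insert_time = datetime(2023, 5, 3, 1, 15, tzinfo=timezone.utc)
update_time = datetime(2023, 5, 3, 2, 40, tzinfo=timezone.utc)

# On append-only storage the update lands as a new object in a later
# partition; the object that holds the original row is left untouched.
print(partition_prefix(insert_time))  # datalake/year=2023/month=05/day=03/hour=01/
print(partition_prefix(update_time))  # datalake/year=2023/month=05/day=03/hour=02/
```

Because both objects stay in place, a plain scan of the table sees the old and the new version of the row, which is the problem the rest of the deck addresses.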
View Table-Based UPSERT: Merge-On-Read
• Diagram: changes from the RDBMS are stored as Inserted Data and Updated/Deleted Data, and a View Table combines the two when the data is read
• Change records shown in the diagram:
  Operation  Changed Data
  I          pk0, c1, c2, t0
  I          pk1, c1, c2, t1
  U          pk1, c1, c2, t2
  D          pk0, c1, c2, t3
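As a minimal sketch of what such a merge-on-read view computes, the Python below keeps the latest record per primary key and hides keys whose latest operation is a delete. The `Change` type, the integer timestamps standing in for t0..t3, and the function name `merge_on_read` are illustrative assumptions; a real view would be expressed in SQL over the two tables.

```python
from typing import NamedTuple

class Change(NamedTuple):
    op: str   # "I" (insert), "U" (update) or "D" (delete)
    pk: str
    c1: str
    c2: str
    ts: int   # stand-in for the slide's t0..t3 ordering

def merge_on_read(inserted, changed):
    """Return the current rows: latest record per pk, with deleted keys removed."""
    latest = {}
    for rec in sorted([*inserted, *changed], key=lambda r: r.ts):
        latest[rec.pk] = rec          # a later record overrides an earlier one
    return {pk: rec for pk, rec in latest.items() if rec.op != "D"}

inserted = [Change("I", "pk0", "c1", "c2", 0), Change("I", "pk1", "c1", "c2", 1)]
changed = [Change("U", "pk1", "c1", "c2", 2), Change("D", "pk0", "c1", "c2", 3)]

view = merge_on_read(inserted, changed)
print(sorted(view))        # ['pk1']
print(view["pk1"].op)      # U
```

Nothing is rewritten on storage here; the merge cost is paid on every read, which is why the amount of accumulated change data matters.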
Amazon Redshift Streaming Ingestion
• Data sources: Amazon Kinesis Data Streams, Amazon MSK
• Amazon Redshift / Redshift Serverless: Real-time Materialized View (Streaming Table, Auto Refresh) and Permanent Tables
• Visualization: Amazon QuickSight
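Running the Change-Data Merge Job Periodically
• Amazon S3 holds Inserted Data (t1), Updated/Deleted Data (t1), Inserted Data (t2), Updated/Deleted Data (t2)
• Merge & Compaction combines the inserted data with the updated/deleted data (records labeled a, b, c, d, e, f in the diagram)
• Chart axes: time (t1, t2) and Data Size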
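A periodic merge and compaction job can be pictured as folding each batch of changes into the current base data, so that later reads no longer have to replay old changes. The sketch below is a toy model under stated assumptions: rows are held in a dict keyed by primary key, each change is an `(op, pk, value)` tuple already ordered by time, and the names `compact`, `base` and `batches` are hypothetical.

```python
def compact(base_rows: dict, delta: list) -> dict:
    """Apply time-ordered (op, pk, value) changes to a copy of base_rows."""
    merged = dict(base_rows)
    for op, pk, value in delta:
        if op == "D":
            merged.pop(pk, None)   # drop deleted rows from the new base
        else:
            merged[pk] = value     # inserts and updates keep the newest value
    return merged

# Hypothetical change batches collected at t1 and t2.
batches = {
    "t1": [("I", "a", 1), ("I", "b", 2), ("I", "c", 3), ("U", "a", 10)],
    "t2": [("I", "d", 4), ("D", "b", None), ("U", "c", 30)],
}

base = {}
for label, delta in batches.items():
    base = compact(base, delta)    # one compaction run per period

print(base)  # {'a': 10, 'c': 30, 'd': 4}
```

After each run the change batch can be discarded, which keeps the amount of data a reader must merge bounded by one period's worth of changes.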
Limitations of Materialized Views
• Amazon Redshift: Real-time Materialized View with Auto Refresh; diagram labels: Streaming Table, Permanent Table, org_tbl, delta_tbl
• Chart: Data Size over time (t1, t2, ..., tN); Data Volume keeps accumulating (Unlimited Data Volume)
Apache Iceberg
• Snapshots: state of the table at some time
• Diagram: each Write & Commit along the timeline (t0, t1) produces a snapshot of the data (s0, s1)
• Metadata labels: Partition, File Location, Schema, Format, Stats
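The snapshot idea can be illustrated with a toy model in which every commit records the complete set of data files that make up the table at that moment. This is not the Iceberg API or its metadata layout; the `Snapshot` and `Table` classes, the `commit` and `scan` methods, and the file names are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: frozenset   # files that make up the table at this point in time

@dataclass
class Table:
    snapshots: list = field(default_factory=list)

    def commit(self, added=(), removed=()):
        """Write new files, then publish a new snapshot; old snapshots stay intact."""
        current = self.snapshots[-1].data_files if self.snapshots else frozenset()
        files = (current - frozenset(removed)) | frozenset(added)
        snap = Snapshot(len(self.snapshots), files)
        self.snapshots.append(snap)
        return snap

    def scan(self, snapshot_id=None):
        """List the files of the latest snapshot, or of an earlier one."""
        snap = self.snapshots[-1] if snapshot_id is None else self.snapshots[snapshot_id]
        return sorted(snap.data_files)

table = Table()
table.commit(added=["f1.parquet", "f2.parquet"])              # s0 at t0
table.commit(added=["f3.parquet"], removed=["f1.parquet"])    # s1 at t1: f1 rewritten as f3

print(table.scan())    # ['f2.parquet', 'f3.parquet']
print(table.scan(0))   # ['f1.parquet', 'f2.parquet']
```

Readers always see one consistent snapshot, and an update or delete becomes "write replacement files, then commit a snapshot that swaps them in" rather than a modification of existing objects.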
Typical Data Pipeline & Data Lake
• Data Source: Amazon RDS
• Data Pipeline: AWS DMS, Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose
• Data Lake: Amazon S3, Amazon Athena
• User Profile: sign-up: Insert; change: Update; account deletion: Delete
• Payments: history management: Append Only
Data Lake Supporting CDC-Based UPSERT
• Pipeline: Amazon RDS -> AWS DMS -> Amazon Kinesis Data Streams -> Amazon Kinesis Data Firehose -> Amazon S3 -> Amazon Athena
• Storage in S3: User Profile: iceberg (table formats: iceberg, hudi, delta lake); Payments: parquet, orc, avro
• Athena support by table format:
  Athena   Hudi   Iceberg   Delta Lake
  Insert   X      O         X
  Delete   X      O         X
  Select   O      O         O
Data Lake Supporting CDC-Based UPSERT
• Pipeline: Amazon RDS -> AWS DMS -> Amazon Kinesis Data Streams -> stream processing -> Amazon S3 -> Amazon Athena
• Stream processing options: AWS Glue, Amazon EMR, Flink / Spark (diagram labels: Open Source, Serverless, Fully Managed)
• Storage in S3: User Profile: iceberg (table formats: iceberg, hudi, delta lake); Payments: parquet, orc, avro
• Athena support by table format:
  Athena   Hudi   Iceberg   Delta Lake
  Insert   X      O         X
  Delete   X      O         X
  Select   O      O         O
Data Lake Supporting CDC-Based UPSERT
• Pipeline: Amazon RDS -> AWS DMS -> Amazon Kinesis Data Streams -> AWS Glue Streaming -> Amazon S3 -> Amazon Athena
• CDC records are delivered as JSON
• Change records:
  Operation  Changed Data
  I          pk1, c1, c2, t1
  U          pk1, c1, c2, t2
  D          pk0, c1, c2, t3
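Before a streaming job can apply an UPSERT, each JSON CDC message has to be reduced to an operation code plus the row it affects. The sketch below assumes a message with a `data` object for the row and a `metadata` object carrying `record-type`, `operation` and `timestamp`, modeled on the envelope AWS DMS writes to streaming targets; the exact field names, the `pk` column and the sample message are assumptions to verify against real records.

```python
import json

# Assumed mapping from the message's operation name to the slide's I/U/D codes.
OP_CODES = {"insert": "I", "update": "U", "delete": "D"}

def to_change(raw: str, pk_field: str = "pk"):
    """Turn one JSON CDC message into (op, pk, row, timestamp), or None to skip it."""
    msg = json.loads(raw)
    meta = msg.get("metadata", {})
    op = OP_CODES.get(meta.get("operation"))
    if meta.get("record-type") != "data" or op is None:
        return None   # control records and unknown operations are ignored
    row = msg["data"]
    return op, row[pk_field], row, meta.get("timestamp")

# Hypothetical sample message.
sample = json.dumps({
    "data": {"pk": "pk1", "c1": "v1", "c2": "v2"},
    "metadata": {"record-type": "data", "operation": "update",
                 "timestamp": "2023-05-03T01:15:00Z"},
})

print(to_change(sample))
# ('U', 'pk1', {'pk': 'pk1', 'c1': 'v1', 'c2': 'v2'}, '2023-05-03T01:15:00Z')
```

The resulting tuples are what a downstream merge step needs: the operation decides between upsert and delete, and the timestamp resolves the order of changes to the same key.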
“Table Format” = Layout of Files in Table
Using a Transactional Data Lake Like an RDBMS
• Open table formats on Amazon S3 (table1, table2, table3) allow Update/Delete In-Place, as in an RDBMS
Transactional Data Lake: Batch Processing
• Pipeline: Amazon RDS -> AWS DMS -> Amazon Kinesis Data Streams -> Amazon Kinesis Data Firehose -> Amazon S3 (Raw Zone) -> AWS Glue ETL -> Amazon S3 (Curated Zone: Apache Iceberg, Hudi, Delta Lake) -> Amazon Athena
Transactional Data Lake: Batch + Real-Time Processing (Lambda Architecture)
• Source: Amazon RDS -> AWS DMS -> Amazon Kinesis Data Streams
• Batch Layer: Amazon Kinesis Data Firehose -> Amazon S3 (Raw Zone) -> AWS Glue ETL -> Amazon S3 (Curated Zone: Apache Iceberg, Hudi, Delta Lake) -> Amazon Athena
• Speed Layer: Amazon Redshift / Redshift Serverless (Real-Time Materialized View, Streaming Table, Permanent Tables)
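In a Lambda architecture, a query is commonly answered by combining the batch layer's result with whatever the speed layer has seen since the last batch run. The sketch below illustrates that general pattern only; the slide does not say how the two layers are combined in this design, and the names `serve`, `batch_view`, `batch_watermark` and `speed_changes` are hypothetical.

```python
def serve(batch_view: dict, batch_watermark: int, speed_changes: list) -> dict:
    """Overlay speed-layer changes newer than the batch watermark on the batch view."""
    result = dict(batch_view)
    for op, pk, value, ts in sorted(speed_changes, key=lambda c: c[3]):
        if ts <= batch_watermark:
            continue               # already reflected in the batch view
        if op == "D":
            result.pop(pk, None)
        else:
            result[pk] = value
    return result

# Batch view built from data up to ts=10; the speed layer holds recent changes.
batch_view = {"pk0": "old", "pk1": "v1"}
speed_changes = [
    ("U", "pk1", "v0", 9),     # older than the watermark, skipped
    ("U", "pk1", "v2", 11),
    ("D", "pk0", None, 12),
    ("I", "pk2", "new", 13),
]

print(serve(batch_view, 10, speed_changes))  # {'pk1': 'v2', 'pk2': 'new'}
```

The watermark keeps changes from being applied twice once the next batch run has absorbed them.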
Building a Transactional Data Lake On-Premises
• Corporate data center components: Generic database, Kafka, Kafka Connect, HDFS, Hive / Presto, Flink / Spark Streaming
• Challenges: Long Time-to-build, High Cost in TCO, Deep Expertise Required, Security
Resources
• Transactional Data Lake using Apache Iceberg with AWS Glue Streaming and DMS
  § https://github.com/aws-samples/transactional-datalake-using-apache-iceberg-on-aws-glue
• Building Serverless Business Intelligent System from Scratch
  § https://serverless-bi-system-from-scratch.workshop.aws/
• Data Pipeline using AWS DMS and Kinesis
  § https://catalog.us-east-1.prod.workshops.aws/workshops/4da54890-23fc-4b9a-80cd-3a0ca3279b3f/en-US
• Amazon Redshift Streaming Ingestion Patterns
  § https://github.com/aws-samples/redshift-streaming-ingestion-patterns