Building a Transactional Data Lake with Amazon Data Firehose and Apache Iceberg
김성민, Sr. Solutions Architect, AWS
© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Amazon Confidential and Trademark.

Agenda
• How to build a data lake that supports CDC-based UPSERT
  § Using open table formats: Apache Iceberg, Hudi, Delta Lake
• Transactional Data Lake Architecture
• Demo

Building a Data Lake
[Diagram] Sources (CRM, IoT, Web, Messages, CDC event streams) → Stream Storage → Stream Delivery → Data Lake (DFS*) → Data Warehouse, Data Mart → Data analysis, AI/ML
* DFS: Distributed File System

Building a Data Lake
[Diagram] Sources (CRM, IoT, Web, Messages, CDC event streams) → Amazon Kinesis Data Streams → Amazon Data Firehose → Amazon S3 (Data Lake) → Amazon Athena → Amazon QuickSight

How do we handle Update/Delete in CDC data?
[Diagram] RDBMS → AWS DMS (CDC) → Amazon Kinesis Data Streams → Amazon Data Firehose → Amazon S3 (Data Lake) → Amazon Athena

datalake/
  year=2023/month=05/day=03/hour=01/
    obj1.parquet
    obj2.parquet
    …
  year=2023/month=05/day=03/hour=02/
    updated-obj1.parquet
    …

Operation, Changed Data
I, pk1, c1, c2, t1
U, pk1, c1, c2, t2
D, pk0, c1, c2, t3

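A minimal sketch of why this layout makes updates awkward, assuming objects are placed by arrival time into hourly Hive-style prefixes like the ones on the slide. The function name `partition_prefix` and the timestamps are hypothetical, chosen only for illustration.

```python
from datetime import datetime


def partition_prefix(ts: datetime, root: str = "datalake") -> str:
    """Build an hourly Hive-style prefix such as datalake/year=2023/month=05/day=03/hour=01/."""
    return f"{root}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"


# Hypothetical arrival times for the insert (t1) and the update (t2) of pk1.
insert_arrival = datetime(2023, 5, 3, 1, 15)
update_arrival = datetime(2023, 5, 3, 2, 5)

print(partition_prefix(insert_arrival))  # datalake/year=2023/month=05/day=03/hour=01/
print(partition_prefix(update_arrival))  # datalake/year=2023/month=05/day=03/hour=02/
```

Because the update for pk1 lands under a different prefix than the original insert, a plain scan of the data lake returns both versions of the row unless something reconciles them.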
UPSERT with a view table: Merge-On-Read
[Diagram] RDBMS → Inserted Data + Updated/Deleted Data → View Table

Operation, Changed Data (records shown across the diagram panels)
I, pk0, c1, c2, t0
I, pk1, c1, c2, t1
U, pk1, c1, c2, t2
D, pk0, c1, c2, t3

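The merge-on-read idea can be sketched in a few lines, assuming the view keeps only the latest change per primary key and hides keys whose latest change is a delete. The `Change` type and the integer timestamps are illustrative assumptions, not part of the slides.

```python
from typing import Iterable, NamedTuple


class Change(NamedTuple):
    op: str   # "I" (insert), "U" (update), or "D" (delete)
    pk: str   # primary key
    c1: str
    c2: str
    ts: int   # change timestamp; larger means newer


def merge_on_read(changes: Iterable[Change]) -> dict:
    """Return the current rows keyed by primary key, resolved at read time."""
    latest = {}
    # Replay changes oldest to newest so the newest change per key wins.
    for change in sorted(changes, key=lambda c: c.ts):
        latest[change.pk] = change
    # A key whose newest change is a delete does not appear in the view.
    return {pk: ch for pk, ch in latest.items() if ch.op != "D"}


changes = [
    Change("I", "pk0", "c1", "c2", 0),
    Change("I", "pk1", "c1", "c2", 1),
    Change("U", "pk1", "c1", "c2", 2),
    Change("D", "pk0", "c1", "c2", 3),
]
view = merge_on_read(changes)
print(sorted(view))  # ['pk1']
```

Every read pays the cost of replaying the change records, which is the trade-off that merge-on-read accepts in exchange for cheap, append-only writes.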
Cases that are hard to solve with a view table
year=2022/month=01/day=01/hour=00/
  p1.parquet
  p2.parquet
year=2022/month=02/day=01/hour=00/
  ...
year=2022/month=12/day=01/hour=00/
  ...
year=2023/month=01/day=02/hour=00/
  p1.parquet
  p2.parquet
year=2023/month=01/day=02/hour=01/
  p1.parquet
  p2.parquet
[Diagram labels] S3 Glacier Deep Archive, S3 Standard, Update/Delete, View (Merge-On-Read)

How can UPSERT operations be handled on S3?
[Diagram] RDBMS: Index, Files, binlog. Along the time axis, a write changes Field1 from (v1, t1) to (v2, t2), and reads see the current value.
[Diagram] Amazon S3: Table, data files, commit log, Merge-On-Read.

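A toy model of the "data files plus commit log" idea, assuming data files are never modified once written and each commit records which files make up the table at that point in time. This is only an illustration; it is not the actual metadata layout of Apache Iceberg or any other table format, and `ToyTable` is a hypothetical name.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    files: dict = field(default_factory=dict)    # file name -> {pk: value}, immutable once written
    commits: list = field(default_factory=list)  # [(commit_time, live file names)], ascending by time

    def commit(self, ts, add=None, remove=()):
        """Write new files and record a commit listing the files that are now live."""
        live = [f for f in (self.commits[-1][1] if self.commits else ()) if f not in remove]
        for name, rows in (add or {}).items():
            self.files[name] = rows
            live.append(name)
        self.commits.append((ts, tuple(live)))

    def read(self, as_of=None):
        """Read the table using the newest commit at or before `as_of` (latest if None)."""
        snapshot = ()
        for ts, live in self.commits:
            if as_of is None or ts <= as_of:
                snapshot = live
        rows = {}
        for name in snapshot:
            rows.update(self.files[name])
        return rows


table = ToyTable()
table.commit(1, add={"f1.parquet": {"pk1": "v1"}})
# The update writes a new file and retires the old one; nothing is edited in place.
table.commit(2, add={"f2.parquet": {"pk1": "v2"}}, remove=("f1.parquet",))
print(table.read())         # {'pk1': 'v2'}
print(table.read(as_of=1))  # {'pk1': 'v1'}
```

The update is visible only after its commit is recorded, and older commits remain readable, which is how a log over immutable files can give transactional behavior on object storage.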
"Table Format" = Layout of Files in Table
Open Table Formats: using Amazon S3 like an RDBMS
[Diagram] Amazon S3 holding table1, table2, table3 with in-place Update/Delete, transactional like an RDBMS

Typical Data Pipeline & Data Lake
[Diagram] Data Source (Amazon RDS) → Data Pipeline (AWS DMS → Amazon Kinesis Data Streams → Amazon Data Firehose) → Data Lake (Amazon S3 → Amazon Athena)
User Profile
• Sign-up: Insert
• Change: Update
• Account deletion: Delete
Payments
• History tracking: Append Only

Data Lake with CDC-based UPSERT support
[Diagram] Amazon RDS → AWS DMS → Amazon Kinesis Data Streams → Amazon Data Firehose → Amazon S3 → Amazon Athena
Data in S3:
• User Profile: Open Table Formats (iceberg, hudi, delta lake)
• Payments: parquet, orc, avro

Athena support (O = supported, X = not supported)
         Iceberg   Hudi   Delta Lake
Select   O         O      O
Insert   O         X      X
Delete   O         X      X

Data Lake with CDC-based UPSERT support
[Diagram] Amazon RDS → AWS DMS → Amazon Kinesis Data Streams → AWS Glue / Amazon EMR (Flink / Spark) → Amazon S3 → Amazon Athena
[Diagram labels] Serverless, Fully Managed, Open Source
Data in S3:
• User Profile: Open Table Formats (iceberg, hudi, delta lake)
• Payments: parquet, orc, avro

Athena support (O = supported, X = not supported)
         Iceberg   Hudi   Delta Lake
Select   O         O      O
Insert   O         X      X
Delete   O         X      X

Data Lake with CDC-based UPSERT support
[Diagram] Amazon RDS → AWS DMS → Amazon Kinesis Data Streams → AWS Glue / Amazon EMR (Flink / Spark) or Amazon Data Firehose → Amazon S3 → Amazon Athena
[Diagram labels] Serverless, Fully Managed, Open Source
Data in S3:
• User Profile: iceberg
• Payments: parquet, orc, avro (open table formats: iceberg, hudi, delta lake)

Athena support (O = supported)
         Iceberg
Select   O
Insert   O
Delete   O

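One way to apply CDC changes to an Iceberg table from Athena is a single `MERGE INTO` statement, which Athena supports for Iceberg tables in addition to the Select/Insert/Delete operations listed on the slide. The sketch below only builds the SQL text. The table names, the staging table, and its `op` column holding I/U/D are hypothetical, and the staging table is assumed to hold at most one (the latest) change per key.

```python
def build_merge_sql(target: str, staging: str, key: str, columns: list) -> str:
    """Build an Athena MERGE INTO statement that applies staged I/U/D changes to an Iceberg table.

    Assumes `staging` has the same columns as `target` plus an `op` column ('I', 'U', 'D'),
    with at most one row per key.
    """
    set_clause = ", ".join(f"{c} = s.{c}" for c in columns)
    all_columns = [key, *columns]
    insert_cols = ", ".join(all_columns)
    insert_vals = ", ".join(f"s.{c}" for c in all_columns)
    return (
        f"MERGE INTO {target} t\n"
        f"USING {staging} s\n"
        f"ON t.{key} = s.{key}\n"
        f"WHEN MATCHED AND s.op = 'D' THEN DELETE\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED AND s.op <> 'D' THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


# Hypothetical table names, for illustration only.
print(build_merge_sql("user_profile", "user_profile_cdc", "pk", ["c1", "c2"]))
```

The identifiers are interpolated directly into the SQL text, so this helper should only be fed trusted, fixed names rather than user input.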
Data Lake with CDC-based UPSERT support
[Diagram] Amazon RDS → AWS DMS (CDC, {JSON}) → Amazon Kinesis Data Streams → AWS Glue Streaming / Amazon EMR → Amazon S3 → Amazon Athena

Operation, Changed Data
I, pk1, c1, c2, t1
U, pk1, c1, c2, t2
D, pk0, c1, c2, t3

Data Lake with CDC-based UPSERT support
[Diagram] Amazon RDS → AWS DMS (CDC, {JSON}) → Amazon Kinesis Data Streams → Amazon Data Firehose → Amazon S3 → Amazon Athena

Operation, Changed Data
I, pk1, c1, c2, t1
U, pk1, c1, c2, t2
D, pk0, c1, c2, t3

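A hedged sketch of a Firehose data-transformation Lambda that turns DMS CDC JSON into Iceberg insert/update/delete operations. It assumes the DMS message carries the row under `data` and the change type under `metadata.operation`, and that Firehose reads routing information from a `metadata.otfMetadata` object on each returned record. Verify both shapes against the current AWS documentation before relying on them. The database and table names are hypothetical.

```python
import base64
import json

# Assumed mapping from DMS operation names to Firehose Iceberg operations.
OPERATIONS = {"load": "insert", "insert": "insert", "update": "update", "delete": "delete"}

# Hypothetical destination; replace with the real Glue database and Iceberg table.
DATABASE = "cdc_iceberg_db"
TABLE = "user_profile"


def lambda_handler(event, context):
    """Map each DMS change record to an Iceberg operation; drop anything else."""
    output = []
    for record in event["records"]:
        message = json.loads(base64.b64decode(record["data"]))
        operation = OPERATIONS.get(message.get("metadata", {}).get("operation"))
        if operation is None:
            # Control messages and unknown operations are not row changes.
            output.append({"recordId": record["recordId"], "result": "Dropped", "data": record["data"]})
            continue
        row = json.dumps(message["data"]).encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(row).decode("utf-8"),
            "metadata": {
                "otfMetadata": {
                    "destinationDatabaseName": DATABASE,
                    "destinationTableName": TABLE,
                    "operation": operation,
                }
            },
        })
    return {"records": output}
```

Update and delete operations also need the stream to know which columns identify a row, which is configured on the Firehose stream rather than in this function.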
Transactional Data Lake using Apache Iceberg
[Diagram] Two pipelines are shown, each with the same components: Amazon RDS → AWS DMS → {JSON} → Amazon Kinesis Data Streams → Amazon Data Firehose → Amazon S3 → Amazon Athena

"Table Format" = Layout of Files in Table
Open Table Formats: using Amazon S3 like an RDBMS
[Diagram] Amazon S3 holding table1, table2, table3 with in-place Update/Delete, transactional like an RDBMS

Transactional Data Lake with Open Table Formats
[Diagram] Amazon RDS → AWS DMS (CDC) → Amazon Kinesis Data Streams → AWS Glue Streaming / Amazon EMR → Amazon S3 → Amazon Athena

Transactional Data Lake with Apache Iceberg
[Diagram] Amazon RDS → AWS DMS (CDC) → Amazon Kinesis Data Streams → Amazon Data Firehose → Amazon S3 → Amazon Athena