
Building Transactional Data Lake using Amazon Data Firehose and Apache Iceberg

YouTube video: https://www.youtube.com/watch?v=uyuZjKeoAS4

Agenda

• How to build a data lake that supports CDC-based UPSERT
• How to use Open Table Formats: Apache Iceberg, Hudi, Delta Lake
• Transactional Data Lake Architecture
• Demo

Sungmin Kim

February 19, 2025

Transcript

  1. © 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved. Amazon Confidential and Trademark.
    Building a Transactional Data Lake Using Amazon Data Firehose and Apache Iceberg
    Sungmin Kim, Sr. Solutions Architect, AWS
  2. Agenda
    • How to build a data lake that supports CDC-based UPSERT
      § How to use Open Table Formats: Apache Iceberg, Hudi, Delta Lake
    • Transactional Data Lake Architecture
    • Demo
  3. Diagram: data sources (CRM, IoT, WEB, Messages, CDC*, Event Streams); data analytics system (RDBMS); Data Insights.
    * CDC: Change Data Capture
  4. Scalability limits of an RDBMS
    Diagram: Scale-Up of a single RDBMS (Primary); Scale-Out as a Primary-Replica Cluster with RDBMS (Primary) and RDBMS (Replica); Scale-Out with Query Engine (1), (2), (3) over a Storage interface and a Distributed File System.
  5. Building a Data Lake
    Diagram: data sources (CRM, IoT, WEB, Messages, CDC, Event Streams); Stream Storage; Stream Delivery; Data Lake (DFS*); Data Warehouse; Data Mart; AI/ML; data analytics.
    * DFS: Distributed File System
  6. Building a Data Lake
    Diagram: data sources (CRM, IoT, WEB, Messages, CDC, Event Streams); Amazon Kinesis Data Streams; Amazon Data Firehose; Amazon S3 (Data Lake); Amazon Athena; Amazon QuickSight.
  7. RDBMS vs. S3 (≈ Distributed Object Storage)
    RDBMS: MUTABLE Records; Files per tables (table1, table2, table3) on a File System; Update/Delete In-Place; Insert/Update/Delete; Transactional (O).
    Amazon S3: IMMUTABLE Objects; Distributed (multiple File Systems); CAN NOT Update/Delete In-Place; Insert (Append)-Only; interface (HTTPS, SDK APIs); Transactional (X).
  8. How should Update/Delete in CDC data be handled?
    Diagram: RDBMS; AWS DMS (CDC); Amazon Kinesis Data Streams; Amazon Data Firehose; Amazon S3 (Data Lake); Amazon Athena.
    Data Lake layout:
    datalake/
      year=2023/month=05/day=03/hour=01/
        obj1.parquet
        obj2.parquet
        …
      year=2023/month=05/day=03/hour=02/
        updated-obj1.parquet
        …
    Changed Data (Operation, then record):
      I, pk1, c1, c2, t1
      U, pk1, c1, c2, t2
      D, pk0, c1, c2, t3
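The problem slide 8 raises can be illustrated with a minimal sketch: when CDC records are only appended into hourly partitions, an update lands in a different partition than the insert it supersedes, and both versions stay in the lake. The `partition_prefix` and `land` helpers and the sample timestamps are hypothetical, chosen only to mirror the layout on the slide.

```python
from collections import defaultdict
from datetime import datetime


def partition_prefix(ts: datetime) -> str:
    """Build an hourly partition prefix in the layout shown on the slide."""
    return f"datalake/year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"


def land(records):
    """Append each CDC record to the partition of its arrival hour (no in-place update)."""
    lake = defaultdict(list)
    for op, pk, ts in records:
        lake[partition_prefix(ts)].append((op, pk))
    return dict(lake)


# Hypothetical arrival times for the three change records on the slide.
records = [
    ("I", "pk1", datetime(2023, 5, 3, 1, 10)),
    ("U", "pk1", datetime(2023, 5, 3, 2, 5)),
    ("D", "pk0", datetime(2023, 5, 3, 2, 30)),
]

lake = land(records)
for prefix, rows in lake.items():
    print(prefix, rows)
```

Both the insert and the update for `pk1` remain readable, so a plain scan of the lake would return stale and deleted rows unless something reconciles them at read or write time.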
  9. UPSERT handling based on a view table: Merge-On-Read
    Diagram: RDBMS; Inserted Data; Updated/Deleted Data; View Table.
    Changed Data (Operation, then record):
      I, pk0, c1, c2, t0
      I, pk1, c1, c2, t1
      U, pk1, c1, c2, t2
      D, pk0, c1, c2, t3
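The view-table approach on slide 9 can be sketched as follows: keep every change record, and at read time pick the newest record per primary key while hiding keys whose newest record is a delete. This is an illustrative sketch, not the speaker's implementation; the `Change` type and `merge_on_read` function are hypothetical names, and integer timestamps stand in for t0..t3.

```python
from typing import NamedTuple


class Change(NamedTuple):
    op: str   # "I" = insert, "U" = update, "D" = delete
    pk: str   # primary key
    c1: str
    c2: str
    ts: int   # change time; larger means newer


def merge_on_read(changes):
    """Return the current rows: the newest change per key, with deleted keys dropped."""
    latest = {}
    for ch in changes:
        seen = latest.get(ch.pk)
        if seen is None or ch.ts >= seen.ts:
            latest[ch.pk] = ch
    return {pk: ch for pk, ch in latest.items() if ch.op != "D"}


changes = [
    Change("I", "pk0", "a", "b", 0),
    Change("I", "pk1", "a", "b", 1),
    Change("U", "pk1", "a2", "b2", 2),
    Change("D", "pk0", "a", "b", 3),
]

view = merge_on_read(changes)
print(view)
```

Only `pk1` survives, with its updated values. The cost is that every read has to scan and reconcile the full change history.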
  10. Cases that are hard to solve with a view table
    Diagram: S3 Glacier Deep Archive; S3 Standard; Update/Delete; View; Merge-On-Read.
    year=2022/month=01/day=01/hour=00/
      p1.parquet
      p2.parquet
    year=2022/month=02/day=01/hour=00/
      ...
    year=2022/month=12/day=01/hour=00/
      ...
    year=2023/month=01/day=02/hour=00/
      p1.parquet
      p2.parquet
    year=2023/month=01/day=02/hour=01/
      p1.parquet
      p2.parquet
  11. How can UPSERT operations be handled on S3?
    Diagram (RDBMS): Index; Files; binlog; Read; Write; Field1 (v1, t1); Field1 (v2, t2); time axis t1, t2.
    Diagram (Amazon S3): Table data files; commit log; Merge-On-Read.
  12. Using Amazon S3 like an RDBMS
    Diagram (RDBMS): Index; Files; binlog; Read; Write; Field1 (v1, t1); Field1 (v2, t2); time axis t1, t2.
    Diagram (Amazon S3): Table data files; commit log; Merge-On-Read.
    my_table/
      date=2023-01-01/
        file-1.parquet ......
        file-2.parquet ......
      commit_log/
        00000.json
        00001.json
        ......
    commit log entries:
      Insert file-1.parquet
      Insert file-2.parquet
      Delete file-1.parquet
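The commit-log idea on slide 12 can be sketched in a few lines: data files are never modified, and the table's current contents are whatever file set results from replaying the log. The JSON entry shape (`action`, `file`) and the `live_files` / `as_of` helpers are assumptions made for illustration; no real table format stores its log exactly this way.

```python
import json


def live_files(commit_log):
    """Replay commit entries in order and return the data files that are still live."""
    files = set()
    for raw in commit_log:
        entry = json.loads(raw)
        if entry["action"] == "insert":
            files.add(entry["file"])
        elif entry["action"] == "delete":
            files.discard(entry["file"])
    return files


def as_of(commit_log, version):
    """Return the live files after the first `version` commits (a simple form of time travel)."""
    return live_files(commit_log[:version])


# The three log entries shown on the slide, in a hypothetical JSON encoding.
commit_log = [
    '{"action": "insert", "file": "date=2023-01-01/file-1.parquet"}',
    '{"action": "insert", "file": "date=2023-01-01/file-2.parquet"}',
    '{"action": "delete", "file": "date=2023-01-01/file-1.parquet"}',
]

print(live_files(commit_log))
print(as_of(commit_log, 2))
```

Because readers resolve files through the log rather than by listing the directory, a logical delete takes effect as soon as its entry is committed, even though `file-1.parquet` may still physically exist in S3.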
  13. “Table Format” = Layout of Files in Table
    Diagram (Amazon S3): Table data files (date=2023-01-01); commit log (commit_log); Merge-On-Read.
  14. Using Amazon S3 like an RDBMS
    “Table Format” = Layout of Files in Table
    Diagram: Amazon S3 with OPEN TABLE FORMATS; RDBMS (table1, table2, table3); Update/Delete In-Place; Transactional.
  15. Apache Iceberg: METADATA FILES TO TRACK DATA
    • schema, partitions, snapshots
    • list of files and mappings to snapshots
    • tracks data files and statistics
    © iceberg.apache.org
  16. Apache Iceberg: METADATA FILES TO TRACK DATA
    my_table/
    ├── metadata/
    │   ├── 00000.metadata.json
    │   ├── 00001.metadata.json
    │   ├── 00002.metadata.json
    │   .......
    │   ├── a39f-e190-b871-ac8e5b-m0.avro
    │   ├── a39f-e190-b871-ac8e5b-m1.avro
    │   ├── a39f-e190-b871-ac8e5b-m2.avro
    │   .......
    │   ├── snap-1954-1-2e934.avro
    │   ├── snap-4381-1-255b.avro
    │   ├── snap-4866-1-8bf57.avro
    └── data/
        ├── date=2023-01-01
        │   └── file-1.parquet
        └── date=2023-01-02
            └── file-2.parquet
    © iceberg.apache.org
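How the metadata files on slide 16 lead a reader to data files can be sketched with a toy in-memory model. This is not Iceberg's file format or any Iceberg API; the class names are hypothetical stand-ins for the table metadata file (`NNNNN.metadata.json`), the per-snapshot manifest list (`snap-*.avro`), and the manifest files (`*-m0.avro`).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Manifest:
    """Toy stand-in for a manifest file: tracks a group of data files."""
    data_files: List[str]


@dataclass
class Snapshot:
    """Toy stand-in for a snapshot and its manifest list."""
    snapshot_id: int
    manifests: List[Manifest]


@dataclass
class TableMetadata:
    """Toy stand-in for the table metadata file: knows all snapshots and the current one."""
    snapshots: List[Snapshot] = field(default_factory=list)
    current_snapshot_id: Optional[int] = None

    def commit(self, manifests):
        """Add a new snapshot and move the current pointer to it."""
        new_id = len(self.snapshots) + 1
        self.snapshots.append(Snapshot(new_id, list(manifests)))
        self.current_snapshot_id = new_id
        return new_id

    def files(self, snapshot_id=None):
        """Resolve a snapshot (default: current) to the data files it references."""
        sid = self.current_snapshot_id if snapshot_id is None else snapshot_id
        snap = next(s for s in self.snapshots if s.snapshot_id == sid)
        return [f for m in snap.manifests for f in m.data_files]


table = TableMetadata()
m0 = Manifest(["data/date=2023-01-01/file-1.parquet"])
m1 = Manifest(["data/date=2023-01-02/file-2.parquet"])
first = table.commit([m0])        # snapshot 1 sees file-1 only
second = table.commit([m0, m1])   # snapshot 2 reuses m0 and adds m1

print(table.files())
print(table.files(first))
```

Reading an older snapshot id is what makes time travel possible, and reusing `m0` in the second snapshot shows why a commit only needs to write the metadata that changed.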
  17. Open Table Formats: Iceberg, Hudi, Delta Lake
    | Feature | Apache Iceberg | Hudi | Delta Lake |
    | ACID | Yes | Yes | Yes |
    | Partition Evolution | Yes | No | No |
    | Schema Evolution | Yes | Partial | Limited |
    | Time Travel | Yes | Yes | Yes |
    | Merge | Yes | Yes | Yes |
    | Compaction | API based | Manual | Automated |
    | Data Format | Parquet, Avro, ORC, CSV | Parquet, ORC | Parquet |
    | Current Pointer | Metastore, File system with version File | Timeline commit | Transaction log |
    | Conflict Resolution | Optimistic | Optimistic | Optimistic |
    | Programming Language | Java & Python | Scala, Java & Python | Java & Python |
  18. Modern Transactional Data Lake
  19. Typical Data Pipeline & Data Lake
    Data Source: Amazon RDS (User Profile, Payments)
    Data Pipeline: AWS DMS; Amazon Kinesis Data Streams; Amazon Data Firehose
    Data Lake: Amazon S3; Amazon Athena
    • Sign-up: Insert
    • Change: Update
    • Account closure: Delete
    • History tracking: Append Only
  20. A Data Lake that supports CDC-based UPSERT
    Diagram: Amazon RDS (User Profile, Payments); AWS DMS; Amazon Kinesis Data Streams; Amazon Data Firehose; Amazon S3; Amazon Athena.
    S3 file formats: parquet, orc, avro
    Open Table Formats: iceberg, hudi, delta lake
    | Athena | Iceberg | Hudi | Delta Lake |
    | Select | O | O | O |
    | Insert | O | X | X |
    | Delete | O | X | X |
    (O = supported, X = not supported)
  21. A Data Lake that supports CDC-based UPSERT
    Diagram: Amazon RDS (User Profile, Payments); AWS DMS; Amazon Kinesis Data Streams; AWS Glue; Amazon EMR; Flink / Spark; Amazon S3; Amazon Athena.
    Labels: Serverless; Fully Managed; Open Source.
    S3 file formats: parquet, orc, avro
    Open Table Formats: iceberg, hudi, delta lake
    | Athena | Iceberg | Hudi | Delta Lake |
    | Select | O | O | O |
    | Insert | O | X | X |
    | Delete | O | X | X |
    (O = supported, X = not supported)
  22. A Data Lake that supports CDC-based UPSERT
    Diagram: Amazon RDS (User Profile, Payments); AWS DMS; Amazon Kinesis Data Streams; Amazon Data Firehose; AWS Glue; Amazon EMR; Flink / Spark; Amazon S3 (iceberg); Amazon Athena.
    Labels: Serverless; Fully Managed; Open Source.
    S3 file formats: parquet, orc, avro
    Open Table Formats: iceberg, hudi, delta lake
    | Athena | Iceberg |
    | Select | O |
    | Insert | O |
    | Delete | O |
    (O = supported)
  23. A Data Lake that supports CDC-based UPSERT
    Diagram: Amazon RDS; AWS DMS (CDC, { JSON }); Amazon Kinesis Data Streams; AWS Glue Streaming; Amazon EMR; Amazon S3; Amazon Athena.
    Changed Data (Operation, then record):
      I, pk1, c1, c2, t1
      U, pk1, c1, c2, t2
      D, pk0, c1, c2, t3
  24. A Data Lake that supports CDC-based UPSERT
    Diagram: Amazon RDS; AWS DMS (CDC, { JSON }); Amazon Kinesis Data Streams; Amazon Data Firehose; Amazon S3; Amazon Athena.
    Changed Data (Operation, then record):
      I, pk1, c1, c2, t1
      U, pk1, c1, c2, t2
      D, pk0, c1, c2, t3
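A rough sketch of what a transformation step between the CDC stream and the table writer might do with each JSON change record on slide 24: read the operation and source table from the record, and emit the row together with routing information. Everything here is an assumption for illustration: the input envelope (`data`, `metadata.operation`, `metadata.table-name`) is modeled loosely on a DMS-style record, and the output `routing` keys are hypothetical, not the field names Amazon Data Firehose expects. Check the Firehose and DMS documentation for the actual formats.

```python
import json

VALID_OPS = {"insert", "update", "delete"}


def to_change(raw: str, database: str) -> dict:
    """Split one CDC JSON record into the row payload and (hypothetical) routing info."""
    record = json.loads(raw)
    meta = record["metadata"]          # assumed envelope key
    op = meta["operation"]             # assumed values: insert / update / delete
    if op not in VALID_OPS:
        raise ValueError(f"unsupported operation: {op!r}")
    return {
        "row": record["data"],
        "routing": {
            "database": database,
            "table": meta["table-name"],
            "operation": op,
        },
    }


# Hypothetical sample record for the update "U, pk1, c1, c2, t2".
raw = json.dumps({
    "data": {"pk": "pk1", "c1": "v1", "c2": "v2"},
    "metadata": {"operation": "update", "table-name": "user_profile"},
})

change = to_change(raw, database="demo_db")
print(change)
```

Rejecting unknown operations up front keeps malformed records from being silently written as inserts.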
  25. Transactional Data Lake using Apache Iceberg
    Diagram (shown twice): Amazon RDS; AWS DMS; Amazon Kinesis Data Streams ({JSON}); Amazon Data Firehose; Amazon S3; Amazon Athena.
  26. Demo
  27. Reference Architecture
    https://github.com/aws-samples/transactional-datalake-using-amazon-datafirehose-iceberg
  28. Summary
  29. Using Amazon S3 like an RDBMS
    “Table Format” = Layout of Files in Table
    Diagram: Amazon S3 with OPEN TABLE FORMATS; RDBMS (table1, table2, table3); Update/Delete In-Place; Transactional.
  30. Transactional Data Lake with Open Table Formats
    Diagram: Amazon RDS; AWS DMS (CDC); Amazon Kinesis Data Streams; AWS Glue Streaming; Amazon EMR; Amazon S3; Amazon Athena.
  31. Transactional Data Lake with Apache Iceberg
    Diagram: Amazon RDS; AWS DMS (CDC); Amazon Kinesis Data Streams; Amazon Data Firehose; Amazon S3; Amazon Athena.
  32. Resources
    • Transactional Data Lake using Amazon Data Firehose
    • Transactional Data Lake Samples
  33. Thank you