From Spec to Implementation: Iceberg REST Catalog with Hive Metastore

okumin

May 20, 2026

Transcript

  1. From Spec to Implementation: Iceberg REST Catalog with Hive Metastore

    @okumin
    Apache Iceberg Meetup Japan #5
    Databricks Office, Tokyo
    May 20, 2026
  2. About Me

    - Shohei Okumiya (@okumin)
    - Apache Hive PMC Member
    - Working at Treasure Data -> Treasure AI
  3. ZooKage - Full-Featured Lakehouse, Locally

    $ git clone --branch v0.5.0 https://github.com/zookage/zookage.git
    $ cd zookage
    $ ./bin/up
  4. Agenda

    - What is Apache Hive?
    - Recap: Apache Iceberg Table Format
    - Overview: Iceberg Catalogs
    - Comparing Basic Operations: Hive Catalog vs. REST Catalog
      - How to read
      - How to write
    - Advanced REST Catalog Features
      - Authentication and Authorization
      - Credential Vending
      - Metrics Reporting
      - Server-Side Planning
  5. “Distributed Data Warehouse at Massive Scale”

    - Hive Metastore: Metadata Repository
    - HiveServer2: SQL Gateway
    - Hive on Tez, Hive LLAP: Distributed Execution Engine

    Diagram labels: RDBMS, HDFS, Object Storage, HiveServer2, Hive Metastore, Hadoop, Hive on Tez, Kubernetes, Hive LLAP, and external engines (Trino, Spark, Flink)
  6. What Matters for Today’s Talk

    - Hive Metastore: Metadata Repository
    - HiveServer2: SQL Gateway
    - Hive on Tez, Hive LLAP: Distributed Execution Engine

    Diagram labels: RDBMS, HDFS, Object Storage, HiveServer2, Hive Metastore, Hadoop, Hive on Tez, Kubernetes, Hive LLAP, and external engines (Trino, Spark, Flink)
  7. Iceberg-related Data

    - Iceberg Catalog: stores the current metadata pointer.
    - Metadata File: stores the schema and the available snapshots.
    - Manifest List: stores the list of Manifest Files included in a snapshot.
    - Manifest File: stores the list of Data Files.
    - Data File: stores the valid records, as Parquet or another file format.

    Source: Spec - Apache Iceberg™ https://iceberg.apache.org/spec/
  8. What Matters for Today’s Talk

    When operating an Iceberg table, we need to use a Catalog to obtain the metadata file location,

    Diagram labels: Catalog; Amazon S3 holding Metadata File (JSON), Manifest List (Avro), Manifest File (Avro), Data File (Parquet)
  9. What Matters for Today’s Talk

    3 types of metadata-related files in distributed storage,
  10. What Matters for Today’s Talk

    and files storing the actual records as Parquet, ORC, and so on.
  11. Iceberg Catalog Implementations

    Catalog’s roles and responsibilities
    - It is able to resolve the current metadata location by a table name
    - It is able to update the mapping safely

    Client, access layer, and backend of each catalog:
    - Spark + Hadoop Catalog: Hadoop FileSystem API -> HDFS, S3
    - Spark + JDBC Catalog: JDBC Driver -> PostgreSQL, MySQL
    - Spark + Hive Catalog: Hive Thrift Client -> Hive Metastore
    - Spark + REST Catalog: Iceberg REST Client -> Iceberg REST API
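
The two responsibilities above can be sketched as a tiny in-memory catalog. This is only an illustration: the class name `InMemoryCatalog` and its methods are hypothetical and do not correspond to any Iceberg or Hive API. The point it shows is that a "safe" update is a compare-and-swap on the pointer, so a writer that worked from a stale pointer is rejected.

```python
from threading import Lock
from typing import Dict, Optional


class InMemoryCatalog:
    """Toy catalog (hypothetical): table name -> current metadata file location."""

    def __init__(self) -> None:
        self._pointers: Dict[str, str] = {}
        self._lock = Lock()

    def current_metadata_location(self, table: str) -> str:
        # Role 1: resolve the current metadata location by table name.
        with self._lock:
            return self._pointers[table]

    def swap(self, table: str, expected: Optional[str], new: str) -> bool:
        # Role 2: update the mapping only if it still holds the expected value.
        with self._lock:
            if self._pointers.get(table) != expected:
                return False  # a concurrent writer won; the caller retries or aborts
            self._pointers[table] = new
            return True
```

A caller would read the pointer, write a new metadata file, and then call `swap` with the pointer it originally read; a `False` result means the table changed in the meantime.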
  12. Iceberg REST Catalog API Implementations

    Managed Services
    - Databricks Unity Catalog
    - Cloudera Iceberg REST Catalog
    - Snowflake Open Catalog
    - AWS Glue
    - Amazon S3 Tables
    - Google Cloud's Lakehouse
    - Microsoft Fabric OneLake
    - Dremio Open Catalog

    OSS
    - Unity Catalog
    - Apache Polaris
    - Apache Gravitino
    - Lakekeeper
    - Project Nessie
  13. Hive Metastore is a new one

    OSS
    - Unity Catalog
    - Apache Polaris
    - Apache Gravitino
    - Lakekeeper
    - Project Nessie
    - Apache Hive - Hive Metastore

    Managed Services
    - Databricks Unity Catalog
    - Cloudera Iceberg REST Catalog
    - Snowflake Open Catalog
    - AWS Glue
    - Amazon S3 Tables
    - Google Cloud's Lakehouse
    - Microsoft Fabric OneLake
    - Dremio Open Catalog
  14. Iceberg REST Catalog API backed by Hive Metastore

    - Hive Metastore translates each Iceberg REST API request into Hive Catalog method calls
    - Any semantics (e.g., what characters can be used as a table name) are identical to Hive Catalog
    - We can deploy the REST API in HMS or add a standalone server

    Diagram: Embedded Mode and Standalone Mode (>= 4.3). Labels: Trino (REST Catalog), REST API, Hive Metastore (Iceberg REST API, Hive Catalog), API Server (Iceberg REST API), Thrift RPC, Hive Metastore (Thrift API), RDBMS
  15. Hive Catalog vs REST Catalog: Read Path

    What if we run a simple SELECT query using Trino?

    -- Schema
    CREATE TABLE test (name VARCHAR);
    -- Query
    SELECT * FROM test;
  16. Read Path with Hive Catalog (1)

    Diagram labels: Trino; Hive Metastore (Thrift); PostgreSQL holding the Metadata Pointer; Amazon S3 holding Metadata File (JSON), Manifest List (Avro), Manifest File (Avro), Data File (Parquet)

    (1) GetTableRequest (Thrift)
    (2) SELECT Current Pointer
    (3) s3://path/to/000-metadata.json
    (4) Hive Table
  17. Read Path with Hive Catalog (2)

    (5) GET s3://path/to/000-metadata.json
    (6) Schema, List of snapshots
    (7) GET s3://path/to/snap-abc.avro
    (8) List of Manifest Files
    (9) GET s3://path/to/m0.avro
    (10) List of Data Files
    (11) GET s3://path/to/000.parquet
    (12) Records
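
Steps (5) to (10) amount to walking a chain of files. The sketch below mimics that walk over a dictionary standing in for Amazon S3. The metadata keys (`current-snapshot-id`, `snapshots`, `snapshot-id`, `manifest-list`) follow the Iceberg table spec, while the contents of the two Avro files are reduced to hypothetical `manifests` and `data_files` lists purely for illustration.

```python
from typing import Dict, List

# Hypothetical stand-in for Amazon S3: object path -> already-decoded content.
OBJECT_STORE: Dict[str, dict] = {
    "s3://path/to/000-metadata.json": {
        "current-snapshot-id": 1,
        "snapshots": [
            {"snapshot-id": 1, "manifest-list": "s3://path/to/snap-abc.avro"}
        ],
    },
    # Simplified: real manifest lists and manifests are Avro files with more fields.
    "s3://path/to/snap-abc.avro": {"manifests": ["s3://path/to/m0.avro"]},
    "s3://path/to/m0.avro": {"data_files": ["s3://path/to/000.parquet"]},
}


def plan_data_files(metadata_location: str) -> List[str]:
    """Walk metadata file -> manifest list -> manifest files -> data file paths."""
    metadata = OBJECT_STORE[metadata_location]  # steps (5)-(6)
    current_id = metadata["current-snapshot-id"]
    snapshot = next(
        s for s in metadata["snapshots"] if s["snapshot-id"] == current_id
    )
    manifest_list = OBJECT_STORE[snapshot["manifest-list"]]  # steps (7)-(8)
    data_files: List[str] = []
    for manifest_path in manifest_list["manifests"]:
        manifest = OBJECT_STORE[manifest_path]  # steps (9)-(10)
        data_files.extend(manifest["data_files"])
    return data_files
```

The returned paths are what the engine would then read in steps (11) and (12).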
  18. Read Path with REST Catalog (1)

    Diagram labels: Trino; Hive Metastore (REST); PostgreSQL holding the Metadata Pointer; Amazon S3 holding Metadata File (JSON), Manifest List (Avro), Manifest File (Avro), Data File (Parquet)

    (1) Load Table (Iceberg REST)
    (2) SELECT Current Pointer
    (3) s3://path/to/000-metadata.json
  19. Read Path with REST Catalog (2)

    (4) GET s3://path/to/000-metadata.json
    (5) Schema, List of snapshots
    (6) Iceberg Table
  20. Read Path with REST Catalog (3)

    (7) GET s3://path/to/snap-abc.avro
    (8) List of Manifest Files
    (9) GET s3://path/to/m0.avro
    (10) List of Data Files
    (11) GET s3://path/to/000.parquet
    (12) Records
  21. Wrap-up: Read Path

    - With the REST Catalog, Hive Metastore reads the Metadata File on its own
    - This gives Hive Metastore a few more optimization opportunities, e.g., caching Metadata Files (HIVE-29035)

    Diagram: Hive Catalog (Hive Metastore over Thrift) and REST Catalog (Hive Metastore over REST) side by side, each with Trino, the Metadata Pointer in PostgreSQL, and Metadata File (JSON), Manifest List (Avro), Manifest File (Avro), Data File (Parquet) in Amazon S3
  22. Hive Catalog vs REST Catalog: Write Path

    What if we add a single row?

    Note: An Iceberg client makes a few Load Table requests to learn the current table state. These slides omit that part.

    -- For simplicity
    SET SESSION <catalog>.merge_manifests_on_write = false;
    -- Query
    INSERT INTO test (name) VALUES ('Alice');
  23. Write Path with Hive Catalog (1)

    Diagram labels: Trino; Hive Metastore (Thrift); PostgreSQL holding the Metadata Pointer; Amazon S3 holding Metadata File (JSON), Manifest List (Avro), Manifest File (Avro), Data File (Parquet)

    (1) PUT s3://path/to/new.parquet
    (2) 200 OK
    (3) PUT s3://path/to/new.avro
    (4) 200 OK
    (5) PUT s3://path/to/snap-xyz.avro
    (6) 200 OK
    (7) PUT s3://path/to/new-metadata.json
    (8) 200 OK
  24. Write Path with Hive Catalog (2)

    (9) LockRequest (Thrift)
    (10) Acquire Lock
    (11) Lock Acquired
    (12) LockResponse
  25. Write Path with Hive Catalog (3)

    (13) GetTableRequest (Thrift)
    (14) SELECT Current Pointer
    (15) Current Pointer
    (16) Hive Table

    Retry or abort if the metadata location changes during lock acquisition
  26. Write Path with Hive Catalog (4)

    (17) AlterTableRequest (Thrift)
    (18) Update Pointer
  27. Write Path with REST Catalog (1)

    Diagram labels: Trino; Hive Metastore (REST); PostgreSQL holding the Metadata Pointer; Amazon S3 holding Metadata File (JSON), Manifest List (Avro), Manifest File (Avro), Data File (Parquet)

    (1) PUT s3://path/to/new.parquet
    (2) 200 OK
    (3) PUT s3://path/to/new.avro
    (4) 200 OK
    (5) PUT s3://path/to/snap-xyz.avro
    (6) 200 OK
  28. Write Path with REST Catalog (2)

    (7) Update Table w/ diff
    (8) PUT s3://path/to/new-metadata.json
    (9) 200 OK
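
To make "Update Table w/ diff" in step (7) more concrete, here is a sketch of what such a request body might look like for a simple append. The field names (`requirements`, `updates`, `assert-ref-snapshot-id`, `add-snapshot`, `set-snapshot-ref`) reflect my reading of the Iceberg REST OpenAPI spec and should be checked against the spec version in use. The function name and its parameters are hypothetical.

```python
import json
from typing import Optional


def build_append_commit(parent_snapshot_id: Optional[int], new_snapshot_id: int,
                        manifest_list: str, timestamp_ms: int,
                        sequence_number: int) -> dict:
    """Build a request body that appends one snapshot to the `main` branch."""
    snapshot = {
        "snapshot-id": new_snapshot_id,
        "sequence-number": sequence_number,
        "timestamp-ms": timestamp_ms,
        "manifest-list": manifest_list,
        "summary": {"operation": "append"},
    }
    if parent_snapshot_id is not None:
        snapshot["parent-snapshot-id"] = parent_snapshot_id
    return {
        # Precondition: `main` must still point at the snapshot the client started from.
        "requirements": [
            {"type": "assert-ref-snapshot-id", "ref": "main",
             "snapshot-id": parent_snapshot_id}
        ],
        # The diff: metadata changes to apply, not a complete metadata file.
        "updates": [
            {"action": "add-snapshot", "snapshot": snapshot},
            {"action": "set-snapshot-ref", "ref-name": "main", "type": "branch",
             "snapshot-id": new_snapshot_id},
        ],
    }


body = build_append_commit(1, 2, "s3://path/to/snap-xyz.avro", 0, 2)
payload = json.dumps(body)  # what the client would send as the request body
```

Because the client only sends the diff and a precondition, the server is the one that writes the new metadata file in step (8).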
  29. Write Path with REST Catalog (3)

    (10) Acquire Lock
    (11) Lock acquired
  30. Write Path with REST Catalog (4)

    (12) SELECT Current Pointer
    (13) Current Pointer

    Retry or abort if the metadata location changes during lock acquisition
  31. Write Path with REST Catalog (5)

    (14) Update Pointer
  32. Wrap-up: Write Path

    Hive Metastore can abstract more operations, such as a table-level lock

    Diagram: Hive Catalog (Hive Metastore over Thrift) and REST Catalog (Hive Metastore over REST) side by side, each with Trino, the Metadata Pointer in PostgreSQL, and Metadata File (JSON), Manifest List (Avro), Manifest File (Avro), Data File (Parquet) in Amazon S3
  33. Authentication and Authorization

    - How to resolve user names and apply endpoint-level authorization?
      - OAuth 2, supported by HIVE-29020
      - AWS SigV4
      - Bearer Token
    - Table-level access control
      - Apache Ranger
      - AWS Lake Formation

    Diagram labels: Catalog Server, Policy Store, Identity Provider
    - Identity Provider: "She is Alice"
    - Policy Store: "Alice can update Table X and Y"
  34. OAuth 2.0 + Authorization Plugin

    • Configure Iceberg REST Catalog API endpoints as protected resources
    • Resolve the username from the OAuth 2.0 token claims
    • Retrieve the user's privileges from Apache Ranger and enforce table-level permissions

    Diagram labels: Authorization Server (e.g., Keycloak), Hive Metastore, Trino, Ranger Policy Store, Ranger Plugin (Sync)

    (1) Request Access Token
    (2) Access Token
    (3) Request w/ Access Token
    (4) Token Introspection
    (5) Claim Set w/ ID source
    (6) Check Privileges
    (7) Response

    (4) and (5) are omitted when validating the Access Token as a JWT
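
Steps (5) and (6) can be pictured as two small functions: one that picks the user name out of a claim set, and one that checks a policy store. Everything here is a simplified assumption. The in-memory `POLICIES` table only stands in for Apache Ranger, the claim set is assumed to be already validated (by introspection or JWT verification), and the function names are hypothetical. The `active` and `sub` fields are standard OAuth 2.0 introspection and JWT claims.

```python
from typing import Dict, Mapping, Set, Tuple

# Hypothetical policy store: (user, table) -> allowed actions.
POLICIES: Dict[Tuple[str, str], Set[str]] = {
    ("alice", "db.x"): {"select", "update"},
    ("alice", "db.y"): {"select", "update"},
}


def resolve_username(claims: Mapping[str, object], claim_name: str = "sub") -> str:
    """Step (5): pick the user name out of an already-validated claim set."""
    if not claims.get("active", True):
        raise PermissionError("token is not active")
    username = claims.get(claim_name)
    if not isinstance(username, str) or not username:
        raise PermissionError(f"claim {claim_name!r} is missing")
    return username


def check_privilege(username: str, table: str, action: str) -> None:
    """Step (6): enforce table-level permissions before serving the request."""
    if action not in POLICIES.get((username, table), set()):
        raise PermissionError(f"{username} may not {action} {table}")
```

Making the claim name a parameter reflects that identity providers differ in which claim carries the user name; which claims a given deployment uses is a configuration question this sketch does not answer.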
  35. Credential Vending

    - Iceberg REST Catalog API returns storage credentials
    - HIVE-29228 (work in progress)

    Diagram labels: Iceberg Client, Catalog Server, Storage
    - Catalog Server to Iceberg Client: Table Metadata with Credentials
    - Iceberg Client to Storage: Read, Write, Delete Files
  36. Without Credential Vending

    Diagram labels: Hive Metastore w/ 🔑; Trino w/ 🔑; Amazon S3 holding Metadata File (JSON), Manifest List (Avro), Manifest File (Avro), Data File (Parquet)

    - Hive Metastore has read and write access to the entire /user/hive/warehouse
    - Trino also has read and write access to the entire /user/hive/warehouse
  37. With Credential Vending

    Diagram labels: Hive Metastore; Trino; AWS STS; Amazon S3 holding Metadata File (JSON), Manifest List (Avro), Manifest File (Avro), Data File (Parquet)

    Trino uses credentials vended by Hive Metastore

    (1) LoadTable
    (2) Assume Role
    (3) Temporary Credentials 🔑
    (4) Iceberg Table w/ Temporary Credentials 🔑
    (5) Access S3 w/ Temporary Credentials 🔑

    Credentials scoped to the requested table and permitted operations (a subset of GET, PUT, and DELETE)
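
One way to picture "credentials scoped to the requested table and permitted operations" is as an IAM policy document restricted to the table's S3 prefix. The sketch below only builds such a document. It is not how HIVE-29228 is implemented, and the function name and the example bucket are hypothetical. The IAM action names and the ARN format are the standard S3 ones.

```python
import json
from typing import Iterable
from urllib.parse import urlparse

# Map the operations named on the slide to S3 IAM actions.
S3_ACTIONS = {"GET": "s3:GetObject", "PUT": "s3:PutObject", "DELETE": "s3:DeleteObject"}


def scoped_session_policy(table_location: str, operations: Iterable[str]) -> str:
    """Return an IAM policy document limited to one table's prefix and operations."""
    parsed = urlparse(table_location)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 location: {table_location}")
    prefix = parsed.path.strip("/")
    object_pattern = f"{prefix}/*" if prefix else "*"
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": sorted(S3_ACTIONS[op] for op in set(operations)),
            "Resource": [f"arn:aws:s3:::{parsed.netloc}/{object_pattern}"],
        }],
    }
    return json.dumps(policy)
```

A catalog server could pass a document like this as the session policy when assuming a role in step (2), so that the temporary credentials returned in step (3) cannot reach other tables. A read-only request would pass only `["GET"]`.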
  38. Metrics Reporting

    - Iceberg clients can report useful metrics via the REST API
      - Scan Report: The table name, scan conditions, the number of scanned files, etc.
      - Commit Report: The table name, the number of created or deleted files and records
    - The REST Catalog enables centralized server-side management of metrics

    Diagram labels: Iceberg Client 1, Iceberg Client 2, Iceberg Client 3, Catalog Server, ???
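
As a rough illustration of centralized handling, the sketch below sums counters from scan and commit reports per table. The report shape used here (`type`, `table`, `counters`) is a deliberately simplified, hypothetical stand-in and not the payload defined by the REST spec. The class name is hypothetical as well.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, Mapping


class MetricsAggregator:
    """Toy server-side sink (hypothetical) that sums reported counters per table."""

    def __init__(self) -> None:
        self.totals: DefaultDict[str, Dict[str, int]] = defaultdict(
            lambda: defaultdict(int)
        )

    def report(self, report: Mapping[str, object]) -> None:
        table = str(report["table"])
        kind = str(report["type"])  # "scan" or "commit" in this sketch
        for name, value in dict(report["counters"]).items():
            # Prefix each counter with the report kind so scans and commits stay apart.
            self.totals[table][f"{kind}.{name}"] += int(value)
```

Since every client reports to the same server, a sink like this sees the activity of all engines touching a table, which no single client can observe on its own.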
  39. Metrics Reporting

    - HIVE-29593 (>= 4.3): Hive Metastore administrators can implement and deploy metrics-handling plugins

    Diagram labels: Trino, Spark, Flink, Hive Metastore, Kafka, Iceberg, Datadog
  40. Server-Side Planning

    - Resolve the list of Data Files on the REST Catalog API
    - The spec is available
    - The Java client implementation was shipped with Apache Iceberg 1.11.0
    -> We can start implementing and testing it

    Diagram labels: Iceberg Client, Catalog Server
    - Iceberg Client to Catalog Server: Snapshot ID, Projection, Predicate, etc.
    - Catalog Server to Iceberg Client: List of Data Files
  41. Recap: Load Table + Client-Side Planning

    Diagram labels: Trino; Hive Metastore; PostgreSQL holding the Metadata Pointer; Amazon S3 holding Metadata File (JSON), Manifest List (Avro), Manifest File (Avro), Data File (Parquet)
  42. With Server-Side Planning

    Diagram labels: Trino; Hive Metastore; PostgreSQL holding the Metadata Pointer; Amazon S3 holding Metadata File (JSON), Manifest List (Avro), Manifest File (Avro), Data File (Parquet)

    (1) Submit Scan Planning w/ Scan Conditions
    (2) Scan Metadata
    (3) Locations of Data Files
    (4) Read Data Files
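
The core of step (2) is that the server, rather than the client, decides which data files can match the scan conditions. The sketch below shows one such decision for an equality predicate on the `name` column, using per-file lower and upper bounds. The `DataFile` type, the `plan_scan` function, and the `001.parquet` path are hypothetical, and a real planner would read these bounds from manifest files.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataFile:
    path: str
    min_name: str  # lower bound of the `name` column in this file
    max_name: str  # upper bound of the `name` column in this file


def plan_scan(files: List[DataFile], name_equals: str) -> List[str]:
    """Return only the files whose bounds could contain `name = name_equals`."""
    return [f.path for f in files if f.min_name <= name_equals <= f.max_name]


files = [
    DataFile("s3://path/to/000.parquet", "Alice", "Bob"),
    DataFile("s3://path/to/001.parquet", "Carol", "Dave"),
]
matching = plan_scan(files, "Alice")  # only 000.parquet can contain 'Alice'
```

The client then receives just the matching locations in step (3) and never has to fetch the manifest list or manifest files itself.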
  43. Key Takeaways

    - Apache Iceberg REST Catalog is gaining strong momentum
    - Hive Metastore is actively adding support for the Iceberg REST API
    - REST Catalog makes it easier to introduce advanced features

    Special Thanks
    - Treasure AI colleagues for their review
      - Keisuke Suzuki, Masafumi Koba