
Apache Hive 4 on Treasure Data

Treasure Data Tech Talk Apr 24 2024
Learn how we upgraded to Apache Hive 4 and the new features in Apache Hive 4.

Ryu Kobayashi

April 24, 2024

Transcript

  1. About me

    • Ryu Kobayashi
    • Query Engine Team
    • Background
      ◦ Hadoop, Cassandra, Machine Learning
      ◦ Past publications
  2. Our Hadoop and Hive history

    • CDH3
    • CDH4
    • HDP2
    • Apache Hadoop, Hive and Tez

    © 2024 Treasure Data, Inc. Confidential
  3. Our Hadoop and Hive history

    • CDH3
    • CDH4
    • HDP2
    • Apache Hadoop, Hive and Tez
    • We stopped using the distribution
  4. Why did we stop using the distribution?

    • Unable to upgrade Hadoop, Hive and Tez individually
    • Our own patches and bug fixes
      ◦ Because a wide variety of customer queries run, we hit corner-case bugs and other issues
        ▪ We create patches to fix them.
      ◦ Difficult to manage the above in source control
      ◦ Patches must be reapplied on every upgrade unless they are merged upstream
  5. Our current Hive version

    • Stable: 2.3.2 base
    • Experimental: 4.0.0-beta1 base
      ◦ Previous versions:
        ▪ 4.0.0-alpha1 base
        ▪ 4.0.0-alpha2 base
    • Skipped 3.x.x
  6. What we do for each upgrade

    • We have something like a Hive plugin, and we need to update it.
      ◦ We update it to the latest API.
      ◦ We do not use HDFS to store data; we have our own data store called Plazma.
        ▪ We need to follow the metastore API whenever it changes.
  7. What we do for each upgrade

    • Upgrade of Hivemall
      ◦ The Hivemall podling retired on 2022-09-01
      ◦ Hivemall supports only up to Hive 2
        ▪ We use Hivemall's UDFs in CDP and elsewhere
        ▪ So we need to upgrade Hivemall.
      ◦ Currently, we use our own Hivemall for Hive 4
  8. How do we test?

    • system test
      ◦ general purpose test
    • elephant-testing
      ◦ Automatically test queries that have had issues in the past (about 100 tests)
    • hive query simulator
      ◦ Simulate and test previously executed queries
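The replay idea behind a query simulator can be sketched roughly as follows. This is a minimal Python sketch, not Treasure Data's actual tooling: `replay`, `run_stable` and `run_candidate` are hypothetical names, the two runners stand in for the stable and experimental Hive versions, and comparing results as unordered rows is an assumption.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

Row = Tuple
Runner = Callable[[str], List[Row]]  # hypothetical: takes a query, returns rows


def replay(queries: Iterable[str], run_stable: Runner, run_candidate: Runner) -> List[str]:
    """Return the queries whose results differ between the two engines."""
    mismatches = []
    for query in queries:
        try:
            expected = Counter(run_stable(query))
            actual = Counter(run_candidate(query))
        except Exception:
            # A failure on either side is reported as a mismatch.
            mismatches.append(query)
            continue
        # Counter comparison ignores row order but keeps duplicate counts.
        if expected != actual:
            mismatches.append(query)
    return mismatches
```

In such a setup, each runner would submit the query to one Hive version and return the fetched rows as tuples; queries that come back in the mismatch list are the ones to investigate before an upgrade.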
  9. Overview of Major Changes

    • Iceberg Integration
      ◦ Advanced Snapshot management
      ◦ Branches & Tags support
      ◦ DML (insert/update/delete/merge)
      ◦ COW & MOR modes
      ◦ Vectorised Reads & Writes
      ◦ Table migration command
      ◦ LOAD DATA statements support
      ◦ Partition-level operations support
      ◦ Improved statistics (column stats support)
  10. Overview of Major Changes

    • Hive ACID
      ◦ Use sequences for TXN_ID generation (performance)
      ◦ Read-only transactions optimization
      ◦ Zero-wait readers
      ◦ Optimistic and Pessimistic concurrency control
      ◦ Lockless reads
  11. Overview of Major Changes

    • Hive Metastore
      ◦ API optimization (performance)
      ◦ Dynamic leader election
      ◦ External data sources support
      ◦ HMS support for Thrift over HTTP
      ◦ JWT authentication for Thrift over HTTP
      ◦ HMS metadata summary
      ◦ Use Zookeeper for service discovery
  12. Overview of Major Changes

    • HiveServer2
      ◦ Support SAML 2.0/JWT authentication mode
      ◦ Support both Kerberos and LDAP auth methods in parallel
      ◦ Graceful shutdown
      ◦ Easy access to the operation log through web UI
  13. Overview of Major Changes

    • Hive Replication
      ◦ Optimised Bootstrap Solution
      ◦ Support for Snapshot Based Replication for External Table
      ◦ Better Replication Tracking Metrics
      ◦ Support for Checkpointing during Replication
  14. Overview of Major Changes

    • Security
      ◦ Authorizations in alter table/view, UDFs, and Views created from Apache Spark
      ◦ Authorizations on tables created based on storage handlers
      ◦ Critical CVE fixes of transitive dependencies
  15. Overview of Major Changes

    • Compiler
      ◦ Materialized view support for Iceberg tables
      ◦ Improvements to refresh materialized views
      ◦ Date/Timestamp fixes and improvements
      ◦ Anti join support
      ◦ Split update support
      ◦ Branch pruning
      ◦ Column histogram statistics support
      ◦ Calcite upgrade to 1.25
      ◦ HPL/SQL improvements
      ◦ Scheduled query support
  16. Overview of Major Changes

    • Misc
      ◦ Support for ESRI Geospatial UDFs
      ◦ Added support for Apache Ozone
      ◦ Supports Hadoop 3.3.6
      ◦ Supports Tez 0.10.3
      ◦ Works with Aarch64 (ARM)
  17. Summary

    • Iceberg Integration
      ◦ Many operations are now available in Iceberg (e.g. DML, Vectorized and more)
    • Hive Docker Support
    • Compiler Improvements
      ◦ Anti-join support
      ◦ Scheduled query support
      ◦ Date/Timestamp fixes and improvements
        ▪ hive.datetime.formatter (default: DATETIME)
          • DATETIME: java.time.format.DateTimeFormatter
          • SIMPLE: java.text.SimpleDateFormat
      ◦ Improved CBO rules leading to better query plans
  18. Summary

    • Hive CLI deprecated
    • Hive on MapReduce deprecated and Hive on Spark removed
    • Supports Hadoop 3.3.6
    • Supports Tez 0.10.3
    • New UDFs
  19. New UDFs

    • quote
      ◦ Enclose the string in quotes
  20. New UDFs

    • quote
      ◦ HIVE-685 was created on 24/Jul/09
  21. New UDFs

    • deserialize
      ◦ Returns the plain-text string of the given message, which was compressed in compressionFormat and Base64-encoded. Currently supports only 'gzip', for Gzip-compressed and Base64-encoded strings.
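The decoding that `deserialize` is described as performing (Base64-decode, then gunzip) can be illustrated outside Hive. The following is a minimal Python sketch of that round trip, not Hive's implementation: both function names are hypothetical, and treating the payload as UTF-8 text is an assumption.

```python
import base64
import gzip


def deserialize_gzip_base64(message: str) -> str:
    """Base64-decode the message, gunzip the bytes, and return the text."""
    compressed = base64.b64decode(message)
    # Assumption: the original text was encoded as UTF-8.
    return gzip.decompress(compressed).decode("utf-8")


def serialize_gzip_base64(text: str) -> str:
    """Inverse helper, used here only to build sample input."""
    return base64.b64encode(gzip.compress(text.encode("utf-8"))).decode("ascii")


sample = serialize_gzip_base64("hello hive")
print(deserialize_gzip_base64(sample))
```

A string produced this way is the kind of value one would expect to pass to the UDF together with the 'gzip' format argument.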
  22. New UDFs

    • cosh and tanh
      ◦ Return the hyperbolic cosine of x and the hyperbolic tangent of x, respectively
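For reference, the standard mathematical definitions of the two functions are shown below. This is textbook math rather than anything specific to Hive.

```latex
\cosh x = \frac{e^{x} + e^{-x}}{2},
\qquad
\tanh x = \frac{\sinh x}{\cosh x} = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
```

In particular, $\cosh 0 = 1$ and $\tanh 0 = 0$, and $\tanh x$ always lies strictly between $-1$ and $1$.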
  23. New UDFs

    • array related
      ◦ array_slice
      ◦ array_min
      ◦ array_max
      ◦ array_distinct
      ◦ array_join
      ◦ array_except
      ◦ array_intersect
      ◦ array_union
      ◦ array_remove
  24. New UDFs

    • typeof
      ◦ Returns the type of the supplied argument
  25. New UDFs

    • ESRI Geospatial UDFs (ST_xxxx UDFs)
      ◦ Load ESRI-based geospatial data and operate on it with Hive
      ◦ NOTE: ESRI data is required for use
      ◦ Related blog
        ▪ The blog above has an example that gets the number of earthquakes in California
        ▪ Sample data:
          • earthquakes.csv
          • california-counties.json
  26. New UDFs

    • ESRI Geospatial UDFs (ST_xxxx UDFs)
      ◦ Some things to note
        ▪ california-counties.json uses its own format
        ▪ We need to specify a dedicated InputFormat
          • org.apache.hadoop.hive.ql.io.esriJson.EnclosedEsriJsonInputFormat
          • This data also includes binary
  27. ESRI Geospatial UDFs

    • e.g.

      CREATE TABLE counties (
        Area string,
        Perimeter string,
        State string,
        County string,
        Name string,
        BoundaryShape binary)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.udf.esri.serde.EsriJsonSerDe'
      STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.esriJson.EnclosedEsriJsonInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
  28. Hive Docker Support

    • This makes it easy to check things quickly
    • DockerHub
    • Quick start

      export HIVE_VERSION=4.0.0

      ◦ Launch the HiveServer2 with an embedded Metastore

        docker run -d -p 10000:10000 -p 10002:10002 --env SERVICE_NAME=hiveserver2 --name hive4 apache/hive:${HIVE_VERSION}

      ◦ Launch Standalone Metastore

        docker run -d -p 9083:9083 --env SERVICE_NAME=metastore --name metastore-standalone apache/hive:${HIVE_VERSION}

      ◦ Accessing Beeline

        docker exec -it hive4 beeline -u 'jdbc:hive2://localhost:10000/'

      ◦ Accessing HiveServer2 Web UI: open http://localhost:10002/ in a browser
  29. The power to use every bit of privacy-compliant data to serve with relevance.