Apache Hive 4 on Treasure Data

Ryu Kobayashi

About me
• Ryu Kobayashi
• Query Engine Team
• Background
  ◦ Hadoop, Cassandra, Machine Learning
  ◦ Past publications

© 2024 Treasure Data, Inc. Confidential—Internal Use Only

Our Hadoop and Hive history


• CDH3
• CDH4
• HDP2
• Apache Hadoop, Hive and Tez

We stopped using the distribution.

Why did we stop using the distribution?
• We could not upgrade Hadoop, Hive and Tez individually
• We maintain our own patches and bug fixes
  ◦ Because a wide variety of customer queries run on our platform, we hit corner-case bugs and other issues
    ▪ We create patches to fix them
  ◦ These patches are difficult to manage in source control
  ◦ Patches must be reapplied on every upgrade unless they are merged upstream

Our current Hive versions
• Stable: 2.3.2 base
• Experimental: 4.0.0-beta1 base
  ◦ Previous versions:
    ▪ 4.0.0-alpha1 base
    ▪ 4.0.0-alpha2 base
• Skipped 3.x.x

What we do for each upgrade
• We have something like a Hive plugin, and we need to update it
  ◦ We update it to the latest API
  ◦ We do not use HDFS to store data; we have our own data store called Plazma
    ▪ We need to follow the metastore API whenever it changes

What we do for each upgrade
• Upgrade of Hivemall
  ◦ The Hivemall podling retired on 2022-09-01
  ◦ Hivemall supports only up to Hive 2
    ▪ We use Hivemall's UDFs in CDP and elsewhere
    ▪ So we need to upgrade Hivemall ourselves
  ◦ Currently, we use our own version of Hivemall for Hive 4

How do we test?
• System test
  ◦ General-purpose test
• elephant-testing
  ◦ Automatically tests queries that have had issues in the past (about 100 tests)
• Hive query simulator
  ◦ Simulates and tests previously executed queries
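
The query simulator idea, replaying a previously executed query on both versions and comparing the output, can be sketched roughly as follows. This is only an illustration and not the actual Treasure Data tool: the `results_match` function and the sample result files are hypothetical, and it assumes each engine's result has been exported with one row per line. Rows are sorted before comparison because a query without ORDER BY may return rows in any order.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when two result files contain the same rows,
# ignoring row order.
results_match() {
  old_sorted=$(mktemp)
  new_sorted=$(mktemp)
  LC_ALL=C sort "$1" > "$old_sorted"
  LC_ALL=C sort "$2" > "$new_sorted"
  cmp -s "$old_sorted" "$new_sorted"
  status=$?
  rm -f "$old_sorted" "$new_sorted"
  return $status
}

# Stand-ins for the output of the same query on the stable and experimental versions.
workdir=$(mktemp -d)
printf 'a\t1\nb\t2\n' > "$workdir/hive2.tsv"
printf 'b\t2\na\t1\n' > "$workdir/hive4.tsv"

if results_match "$workdir/hive2.tsv" "$workdir/hive4.tsv"; then
  verdict="match"
else
  verdict="mismatch"
fi
echo "$verdict"
rm -rf "$workdir"
```

A real comparison would likely also need to tolerate intentional behavior changes between versions, which this sketch does not attempt.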

Apache Hive 4 Overview

Why did it take 4 months?

Apache Hive 4 Overview

Overview of Major Changes
• Iceberg Integration
  ◦ Advanced Snapshot management
  ◦ Branches & Tags support
  ◦ DML (insert/update/delete/merge)
  ◦ COW & MOR modes
  ◦ Vectorised Reads & Writes
  ◦ Table migration command
  ◦ LOAD DATA statements support
  ◦ Partition-level operations support
  ◦ Improved statistics (column stats support)

• Hive ACID
  ◦ Use sequences for TXN_ID generation (performance)
  ◦ Read-only transactions optimization
  ◦ Zero-wait readers
  ◦ Optimistic and Pessimistic concurrency control
  ◦ Lockless reads

• Hive Metastore
  ◦ API optimization (performance)
  ◦ Dynamic leader election
  ◦ External data sources support
  ◦ HMS support for Thrift over HTTP
  ◦ JWT authentication for Thrift over HTTP
  ◦ HMS metadata summary
  ◦ Use ZooKeeper for service discovery

• HiveServer2
  ◦ Support SAML 2.0/JWT authentication mode
  ◦ Support both Kerberos and LDAP auth methods in parallel
  ◦ Graceful shutdown
  ◦ Easy access to the operation log through web UI

• Hive Replication
  ◦ Optimised Bootstrap Solution
  ◦ Support for Snapshot Based Replication for External Table
  ◦ Better Replication Tracking Metrics
  ◦ Support for Checkpointing during Replication

• Security
  ◦ Authorizations in alter table/view, UDFs, and Views created from Apache Spark
  ◦ Authorizations on tables created based on storage handlers
  ◦ Critical CVE fixes of transitive dependencies

• Compiler
  ◦ Materialized view support for Iceberg tables
  ◦ Improvements to refresh materialized views
  ◦ Date/Timestamp fixes and improvements
  ◦ Anti join support
  ◦ Split update support
  ◦ Branch pruning
  ◦ Column histogram statistics support
  ◦ Calcite upgrade to 1.25
  ◦ HPL/SQL improvements
  ◦ Scheduled query support

• Miscellaneous
  ◦ Support for ESRI Geospatial UDFs
  ◦ Added support for Apache Ozone
  ◦ Supports Hadoop 3.3.6
  ◦ Supports Tez 0.10.3
  ◦ Works with AArch64 (ARM)

Summary
• Iceberg Integration
  ◦ Many operations are now available for Iceberg tables (e.g. DML, vectorized reads and writes, and more)
• Hive Docker Support
• Compiler Improvements
  ◦ Anti-join support
  ◦ Scheduled query support
  ◦ Date/Timestamp fixes and improvements
    ▪ hive.datetime.formatter (default: DATETIME)
      • DATETIME: java.time.format.DateTimeFormatter
      • SIMPLE: java.text.SimpleDateFormat
  ◦ Improved CBO rules leading to better query plans

Summary
• Hive CLI deprecated
• Hive on MapReduce deprecated and Hive on Spark removed
• Supports Hadoop 3.3.6
• Supports Tez 0.10.3
• New UDFs

quote
◦ HIVE-685 was created on 24/Jul/09

deserialize
◦ Returns the plain-text string of a given message that was compressed in compressionFormat and Base64-encoded
◦ Currently supports only 'gzip', for Gzip-compressed and Base64-encoded strings
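
To make the input format concrete: the description above implies a message that is Gzip-compressed first and then Base64-encoded. The sketch below builds such a payload with standard command-line tools and then reverses the two steps. It does not call Hive at all, the variable names `payload` and `decoded` are only illustrative, and it assumes a `base64` command that accepts `-d` for decoding (as GNU coreutils does).

```shell
#!/bin/sh
# Build a payload: Gzip-compress the text, then Base64-encode the bytes.
payload=$(printf 'test' | gzip -c | base64 | tr -d '\n')

# Reverse the steps: Base64-decode, then decompress.
decoded=$(printf '%s' "$payload" | base64 -d | gunzip -c)
echo "$decoded"
```

A string built this way is the kind of value the UDF is described as accepting, together with 'gzip' as the compression format.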

cosh and tanh
◦ Return the hyperbolic cosine of x and the hyperbolic tangent of x, respectively
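
For reference, the standard definitions are cosh(x) = (e^x + e^-x) / 2 and tanh(x) = (e^x - e^-x) / (e^x + e^-x). The sketch below evaluates them with awk so that values returned by the Hive UDFs can be sanity-checked by hand. The function names `cosh_of` and `tanh_of` are made up for this example and are not part of Hive.

```shell
#!/bin/sh
# Hypothetical helpers that evaluate the textbook definitions with awk.
cosh_of() {
  awk -v x="$1" 'BEGIN { printf "%.4f\n", (exp(x) + exp(-x)) / 2 }'
}
tanh_of() {
  awk -v x="$1" 'BEGIN { printf "%.4f\n", (exp(x) - exp(-x)) / (exp(x) + exp(-x)) }'
}

cosh_of 1   # prints 1.5431
tanh_of 1   # prints 0.7616
```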

Array-related
◦ array_slice
◦ array_min
◦ array_max
◦ array_distinct
◦ array_join
◦ array_except
◦ array_intersect
◦ array_union
◦ array_remove

typeof
◦ Returns the type of the supplied argument

ESRI Geospatial UDFs (ST_xxxx UDFs)
◦ Load ESRI-based geospatial data and operate on it with Hive
◦ NOTE: ESRI data is required for use
◦ Related blog
  ▪ The blog has an example that counts the number of earthquakes in California
  ▪ Sample data:
    • earthquakes.csv
    • california-counties.json

ESRI Geospatial UDFs (ST_xxxx UDFs)
◦ Some things to note
  ▪ california-counties.json uses its own format
  ▪ We need to specify a dedicated InputFormat for it
    • org.apache.hadoop.hive.ql.io.esriJson.EnclosedEsriJsonInputFormat
    • The data also includes binary values

ESRI Geospatial UDFs
• e.g.
    CREATE TABLE counties (
      Area string,
      Perimeter string,
      State string,
      County string,
      Name string,
      BoundaryShape binary)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.udf.esri.serde.EsriJsonSerDe'
    STORED AS
      INPUTFORMAT 'org.apache.hadoop.hive.ql.io.esriJson.EnclosedEsriJsonInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

Hive Docker Support
• This makes it easy to try Hive quickly
• DockerHub
• Quick start
  ◦ Launch HiveServer2 with an embedded Metastore
      export HIVE_VERSION=4.0.0
      docker run -d -p 10000:10000 -p 10002:10002 --env SERVICE_NAME=hiveserver2 --name hive4 apache/hive:${HIVE_VERSION}
  ◦ Launch a standalone Metastore
      docker run -d -p 9083:9083 --env SERVICE_NAME=metastore --name metastore-standalone apache/hive:${HIVE_VERSION}
  ◦ Accessing Beeline
      docker exec -it hive4 beeline -u 'jdbc:hive2://localhost:10000/'
  ◦ Accessing the HiveServer2 Web UI
      Open http://localhost:10002/ in a browser
