using the distribution?
• Unable to upgrade Hadoop, Hive, and Tez individually
• Our own patches and bug fixes
  ◦ Because a wide variety of customer queries run, corner-case bugs and other issues come up
    ▪ We create patches to fix them.
  ◦ These patches are difficult to keep under source control
  ◦ Patches must be reapplied on every upgrade unless they are merged upstream
upgrade
• We have something like a Hive plugin, and it needs to be updated.
  ◦ We update it to the latest API.
  ◦ We do not store data in HDFS; we have our own data store called Plazma.
    ▪ We need to update our use of the metastore API whenever that API changes.
upgrade
• Upgrade of Hivemall
  ◦ The Hivemall podling retired on 2022-09-01
  ◦ Hivemall supports only up to Hive 2
    ▪ We use Hivemall UDFs in CDP and elsewhere
    ▪ So we need to upgrade Hivemall.
  ◦ Currently, we use our own version of Hivemall for Hive 4
• system test
  ◦ general purpose test
• elephant-testing
  ◦ Automatically test queries that have had issues in the past (about 100 tests)
• hive query simulator
  ◦ Simulate and test previously executed queries
• Hive Metastore
  ◦ API optimization (performance)
  ◦ Dynamic leader election
  ◦ External data sources support
  ◦ HMS support for Thrift over HTTP
  ◦ JWT authentication for Thrift over HTTP
  ◦ HMS metadata summary
  ◦ Use ZooKeeper for service discovery
• HiveServer2
  ◦ Support SAML 2.0/JWT authentication mode
  ◦ Support both Kerberos and LDAP auth methods in parallel
  ◦ Graceful shutdown
  ◦ Easy access to the operation log through the web UI
• Hive Replication
  ◦ Optimised Bootstrap Solution
  ◦ Support for Snapshot Based Replication for External Table
  ◦ Better Replication Tracking Metrics
  ◦ Support for Checkpointing during Replication
• Security
  ◦ Authorizations in alter table/view, UDFs, and Views created from Apache Spark
  ◦ Authorizations on tables created based on storage handlers
  ◦ Critical CVE fixes of transitive dependencies
• Compiler
  ◦ Materialized view support for Iceberg tables
  ◦ Improvements to refresh materialized views
  ◦ Date/Timestamp fixes and improvements
  ◦ Anti join support
  ◦ Split update support
  ◦ Branch pruning
  ◦ Column histogram statistics support
  ◦ Calcite upgrade to 1.25
  ◦ HPL/SQL improvements
  ◦ Scheduled query support
• Misc
  ◦ Support for ESRI Geospatial UDFs
  ◦ Added support for Apache Ozone
  ◦ Supports Hadoop 3.3.6
  ◦ Supports Tez 0.10.3
  ◦ Works with AArch64 (ARM)
Integration
  ◦ Many operations are now available with Iceberg (e.g. DML, vectorization, and more)
• Hive Docker Support
• Compiler Improvements
  ◦ Anti-join support
  ◦ Scheduled query support
  ◦ Date/Timestamp fixes and improvements
    ▪ hive.datetime.formatter (default: DATETIME)
      • DATETIME: java.time.format.DateTimeFormatter
      • SIMPLE: java.text.SimpleDateFormat
  ◦ Improved CBO rules leading to better query plans
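The two hive.datetime.formatter values name different Java classes, and those classes do not read every pattern letter the same way. The sketch below is plain JDK code, not Hive code, and it assumes nothing about how Hive calls these classes internally. The class name FormatterComparison and both method names are made up for illustration. It formats one date with each class to show that a common pattern such as yyyy-MM-dd agrees, while the letter u means "year" to DateTimeFormatter and "day number of week" to SimpleDateFormat.

```java
import java.text.SimpleDateFormat;
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical demo class; not part of Hive.
class FormatterComparison {

    // Formats a date with java.time.format.DateTimeFormatter (the class named by DATETIME).
    static String withDateTimeFormatter(LocalDate date, String pattern) {
        return DateTimeFormatter.ofPattern(pattern, Locale.ROOT).format(date);
    }

    // Formats the same date with java.text.SimpleDateFormat (the class named by SIMPLE).
    // The formatter is pinned to UTC so the result does not depend on the machine's time zone.
    static String withSimpleDateFormat(LocalDate date, String pattern) {
        SimpleDateFormat legacy = new SimpleDateFormat(pattern, Locale.ROOT);
        legacy.setTimeZone(TimeZone.getTimeZone("UTC"));
        Instant midnightUtc = date.atStartOfDay(ZoneOffset.UTC).toInstant();
        return legacy.format(Date.from(midnightUtc));
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2024, 1, 1); // a Monday

        // Both classes agree on this common pattern.
        System.out.println(withDateTimeFormatter(date, "yyyy-MM-dd")); // 2024-01-01
        System.out.println(withSimpleDateFormat(date, "yyyy-MM-dd"));  // 2024-01-01

        // The letter 'u' is interpreted differently.
        System.out.println(withDateTimeFormatter(date, "u")); // 2024 (year)
        System.out.println(withSimpleDateFormat(date, "u"));  // 1 (Monday)
    }
}
```

Queries that rely on less common pattern letters may therefore return different results after the default changes, which is one reason to test format strings before switching the setting.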
deserialize
  ◦ Returns the plain-text string of a given message that was compressed in compressionFormat and then base64 encoded. Currently, only 'gzip' is supported, for gzip-compressed and base64-encoded strings.
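To make the description above concrete, here is a minimal Java sketch of the transformation it describes: base64-decode the message, then gunzip the bytes back to text. This is not Hive's implementation. The class name GzipBase64Sketch and its methods are hypothetical, and the UTF-8 charset is an assumption. The encode helper exists only to produce a sample input for the round trip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch class; not part of Hive.
class GzipBase64Sketch {

    // Gzip-compresses the text and base64-encodes the result (builds a sample message).
    static String encode(String plain) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(plain.getBytes(StandardCharsets.UTF_8)); // UTF-8 is an assumption
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(buffer.toByteArray());
    }

    // Reverses encode(): base64-decodes the message, then gunzips it to plain text.
    static String decode(String message) {
        byte[] compressed = Base64.getDecoder().decode(message);
        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int read;
            while ((read = gzip.read(chunk)) != -1) {
                plain.write(chunk, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(plain.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = encode("hello hive");
        System.out.println(encoded);         // base64 text of the gzip bytes
        System.out.println(decode(encoded)); // hello hive
    }
}
```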
ESRI Geospatial UDFs (ST_xxxx UDFs)
  ◦ Load ESRI-based geospatial data and operate on it with Hive
  ◦ NOTE: ESRI data is required for use
  ◦ Related blog
    ▪ The blog has an example that gets the number of earthquakes in California
    ▪ Sample data:
      • earthquakes.csv
      • california-counties.json
ESRI Geospatial UDFs (ST_xxxx UDFs)
  ◦ Some things to note
    ▪ california-counties.json uses its own format
    ▪ We need to specify a dedicated InputFormat
      • org.apache.hadoop.hive.ql.io.esriJson.EnclosedEsriJsonInputFormat
      • This data also includes binary values
• e.g.
    CREATE TABLE counties (
      Area string,
      Perimeter string,
      State string,
      County string,
      Name string,
      BoundaryShape binary)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.udf.esri.serde.EsriJsonSerDe'
    STORED AS
      INPUTFORMAT 'org.apache.hadoop.hive.ql.io.esriJson.EnclosedEsriJsonInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';