Apache Pinot is a real-time distributed OLAP datastore that powers a variety of analytics use cases, most of which require executing high-throughput queries with low latency. To ensure data completeness, result correctness, and system performance, Pinot also needs to run background operational tasks such as data compaction, GDPR data purging, and reindexing after schema evolution. These operations can be computationally intensive and can easily degrade query performance if they run on the same components that serve queries.
Pinot addresses this with Minion, a Pinot-native component built on Apache Helix's task framework, which executes these computationally intensive operational tasks and thereby keeps them off the query path. Minion is designed to be extensible and pluggable: beyond the tasks above, it is also used to build common data ingestion and backfilling pipelines, saving operators from writing customized, ad-hoc ones.
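To make the pluggability concrete, below is a minimal sketch of what a custom Minion task can look like. It uses simplified stand-in interfaces rather than Pinot's actual SPI (the real extension point lives in the pinot-minion module, e.g. PinotTaskExecutor and its factory, whose exact package names and signatures vary by version); the names TaskConfig, TaskExecutor, and PurgeTaskExecutor here are illustrative only.

```java
import java.util.Map;

// Stand-in for a Minion task config: a task type plus string key/value settings.
class TaskConfig {
  final String taskType;
  final Map<String, String> configs;

  TaskConfig(String taskType, Map<String, String> configs) {
    this.taskType = taskType;
    this.configs = configs;
  }
}

// Stand-in for the executor interface a Minion task plugin implements.
interface TaskExecutor {
  Object executeTask(TaskConfig config) throws Exception;
  void cancel();
}

// A hypothetical purge task: rewrites a segment, dropping rows for deleted users.
class PurgeTaskExecutor implements TaskExecutor {
  private volatile boolean cancelled = false;

  @Override
  public Object executeTask(TaskConfig config) throws Exception {
    String segmentName = config.configs.get("segmentName");
    // Typical shape of such a task:
    // 1. Download the segment from deep store.
    // 2. Rewrite it, filtering out records flagged for deletion (e.g. GDPR requests).
    // 3. Upload the new segment and atomically replace the old one.
    if (cancelled) {
      throw new InterruptedException("Purge of " + segmentName + " cancelled");
    }
    return segmentName + "_purged";
  }

  @Override
  public void cancel() {
    cancelled = true;
  }
}
```

Because the heavy lifting (download, rewrite, upload) happens entirely on Minion workers, the servers and brokers that execute queries never see this load.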
In this talk, we will dive deep into the Minion component and demonstrate how we leverage it for several typical operational tasks. We will also discuss the challenges of operating Minion at scale and how we significantly reduced operational overhead by improving observability and introducing auto-scaling mechanisms.
To summarize, on one hand, Minion carries most of the operational burden in Pinot, helping real-time analytics run smoothly; on the other hand, it gives operators the flexibility to perform complex operations that were previously hard (or even impossible) to carry out, enabling a more delightful analytics product experience.