popularity and complexity mean that while many organizations eagerly adopt it, they often find themselves at a loss when issues arise. We saw an opportunity to leverage the wealth of data already existing within clusters to help users identify and resolve complex problems. Our journey wasn’t straightforward. Kubernetes is a living, breathing entity with numerous moving parts. Pods spawn and terminate like heartbeats, nodes scale up and down like breaths, and events trigger like nerve impulses. This constant flux generates an enormous amount of data, which, if properly harnessed, can provide invaluable insights. At Komodor, we’re in a unique position. Data from hundreds of Kubernetes clusters flows through our platform continuously. This led us to initiate the “Reliability Insights” project, aiming to analyze this raw data and transform it into actionable intelligence for our users. Through extensive research and experimentation, we developed methods to clean and process this data, uncovering some truly remarkable findings. We identified a dozen types of reliability-related insights, each offering a unique perspective on cluster health and performance. In this talk, we’ll share our journey of discovery. We’ll explore how we combined various data points to reveal hidden issues, discuss the challenges we faced in making sense of Kubernetes’ vast data landscape, and demonstrate how these insights can revolutionize the way teams approach cluster reliability. Join us as we delve into the art and science of extracting meaningful patterns from the seemingly chaotic world of Kubernetes, and learn how this approach can transform your ability to maintain and optimize your infrastructure.