A specter is haunting generative AI: the specter of evaluation. In a rush of excitement at the new capabilities provided by open and proprietary foundation models, it seems everyone from homebrew hackers to engineering teams at NASDAQ companies has shipped products and features based on those capabilities. But how do we know whether those products and features are good? And how do we know whether our changes make them better? I will share some case studies and experiences illustrating just how hard this problem is, from the engineering, product, and business perspectives, and a bit about what is to be done.