quickly, which makes it difficult to decide what to use; along with that, guidance and documentation are hard to find.

What slows down GenAI adoption?
• Development: Applications often require multiple cutting-edge products and frameworks, which demands specialized expertise and new tools to stitch the components together.
• Context: The large language model doesn't know about your data.
• Evaluation: It is hard to figure out which model to use and how to optimize it for the use case.
• Operationalization: Concerns around privacy, security, and grounding. Developers lack the experience and tools to evaluate, improve, and validate solutions for their proofs of concept, and to scale and operate them in production.
ML models vs. LLMs:
• Target audiences: ML models → ML Engineers and Data Scientists; LLMs → ML Engineers and App Developers
• Assets to share: ML models → model, data, environments, features; LLMs → LLM, agents, plugins, prompts, chains, APIs
• Metrics/evaluations: ML models → accuracy; LLMs → quality (accuracy, similarity), harm (bias, toxicity), correctness (groundedness), cost (tokens per request), latency (response time, RPS)
• ML models are typically built from scratch, while LLMs are pre-built or fine-tuned and served as an API (MaaS)
LLM Lifecycle
• Deployment: Deploy the LLM flow to a scalable container for making predictions. Additionally, enable blue/green deployment with traffic-routing control so that A/B testing can be done for the LLM flow.
• Prompt Engineering: Prompt engineering or tuning with instructions describing the tasks the LLM will perform, along with security measures.
• CI, CE, and CD: Continuous Integration, Continuous Evaluation, and Continuous Deployment of the LLM flows to maintain code quality with engineering best practices, compare LLM performance, and promote flows to higher environments.
• Foundational LLM: Selection of the right foundation model, such as Azure OpenAI models, Llama 2, Falcon, or any model from Hugging Face; if necessary, a fine-tuned model.
• Data & Services: Enrich LLM models with domain-specific grounding data (RAG pattern; a minimal sketch follows this list) or enable in-context learning with use-case-specific examples.
• Monitor: Monitor performance metrics for the LLM flow, detect data drift, and communicate the model's performance to stakeholders.
• Online Evaluation: Online evaluation of the LLM is critical for understanding performance, potential risks, and more; the LLM's answers are evaluated by one or more evaluation mechanisms.
• Experiment & Evaluate: Execute the flow (prompt + additional data or services) end-to-end with sample input data. Evaluate the LLM's responses on large datasets against ground truth (if any), or check whether the answer is relevant in context.
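The RAG pattern mentioned under Data & Services can be as simple as retrieving a few relevant snippets and injecting them into the prompt before calling the model. A minimal sketch, assuming the openai Python package's AzureOpenAI client; the deployment name, environment variables, and the retrieved_docs input are placeholders and not part of any prompt flow API.

import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def answer_with_grounding(question: str, retrieved_docs: list[str]) -> str:
    # Inject domain-specific grounding data into the prompt (RAG pattern).
    context = "\n\n".join(retrieved_docs)
    messages = [
        {"role": "system",
         "content": "Answer only from the provided context. If the context is "
                    "insufficient, say you don't know.\n\nContext:\n" + context},
        {"role": "user", "content": question},
    ]
    response = client.chat.completions.create(
        model="gpt-4o",  # Azure OpenAI deployment name; placeholder
        messages=messages,
        temperature=0,
    )
    return response.choices[0].message.content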
complex process. Customers want:
• Private data access and controls
• Prompt engineering
• CI/CD
• Iterative experimentation
• Versioning and reproducibility
• Deployment and optimization
• Safe and Responsible AI

The iterative workflow has three stages:
• Design and development: develop a flow based on a prompt to extend the capability; debug, run, and evaluate the flow with small data; modify the flow (prompts, tools, etc.) and repeat until satisfied (a toy sketch of this inner loop follows the list).
• Evaluation and refinement: evaluate the flow against a large dataset with different metrics (quality, relevance, safety, etc.); if not satisfied, return to modifying the flow.
• Optimization and production: optimize the flow, deploy and monitor it, and get end-user feedback.
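A toy illustration of the inner "develop, evaluate with small data, modify" loop, independent of any particular framework; run_flow and score_answer are placeholders for your own flow invocation and metric, not library APIs.

from statistics import mean

def run_flow(question: str) -> str:
    # Placeholder: invoke your LLM flow here.
    return "stub answer mentioning prompt flow for: " + question

def score_answer(answer: str, truth: str) -> float:
    # Placeholder metric: 1.0 if the expected phrase appears in the answer.
    return 1.0 if truth.lower() in answer.lower() else 0.0

small_dataset = [
    {"question": "What is prompt flow?", "truth": "prompt flow"},
    {"question": "Which pattern grounds an LLM on private data?", "truth": "RAG"},
]

scores = [score_answer(run_flow(row["question"]), row["truth"])
          for row in small_dataset]
print("mean score on small data:", mean(scores))
# If the score is unsatisfactory, modify the prompts/tools and re-run;
# once satisfied, evaluate against a larger dataset with richer metrics.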
various language models, APIs, and data sources to ground LLMs on your data.

Streamline prompt engineering projects
• One platform to design, construct, tune, evaluate, test, and deploy LLM workflows
• Evaluate the quality of workflows with a rich set of pre-built metrics and a safety system
• Easy prompt tuning, comparison of prompt variants, and version control
• Use any framework, such as LangChain or Semantic Kernel, to build initial flows
• Add your own reusable tools (see the sketch after this list)
• Manage your flows as files on disk
• Track run history
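For "add your own reusable tools", prompt flow lets you expose a plain Python function as a tool. A minimal sketch, assuming the promptflow package's tool decorator (the import path may differ between versions, e.g. promptflow.core); the keyword-extraction logic is only a stand-in for real tool logic.

from promptflow import tool

@tool
def extract_keywords(text: str, top_k: int = 5) -> list:
    # Toy tool: return the longest distinct words as "keywords".
    words = {w.strip(".,!?()").lower() for w in text.split()}
    return sorted(words, key=len, reverse=True)[:top_k]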
• Integration with pre-built LLMs such as Azure OpenAI Service
• Built-in safety system with Azure AI Content Safety
• Effectively manage credentials or secrets for APIs
• Create your own connections in Python tools (see the sketch after this list)
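To keep API credentials out of code and flow files, a Python tool can take a prompt flow connection as an input. A sketch assuming promptflow's CustomConnection; the endpoint/api_key key names and the call_internal_api function are placeholders, not a documented contract.

import requests
from promptflow import tool
from promptflow.connections import CustomConnection

@tool
def call_internal_api(connection: CustomConnection, query: str) -> dict:
    # Secrets live in the connection (managed by prompt flow), not in the code.
    headers = {"Authorization": f"Bearer {connection.secrets['api_key']}"}
    resp = requests.get(
        connection.configs["endpoint"],  # placeholder config key
        params={"q": query},
        headers=headers,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()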
data
• Use pre-built evaluation flows
• Compare multiple variants or runs to pick the best flow (see the sketch after this list)
• Ensure accuracy by scaling the size of data used in evaluation
• Build your own custom evaluation flows
[Diagram: flow variants (Tune Variant 0 / 1 / 2) → Bulk Test → Evaluation]
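A sketch of a bulk test plus evaluation run with the promptflow Python SDK; the flow paths, dataset, column mappings, and variant/node names are placeholders, and the exact import path and client methods may vary by SDK version.

from promptflow.client import PFClient  # older releases: from promptflow import PFClient

pf = PFClient()

# Bulk test: run one flow variant over a dataset (paths are placeholders).
base_run = pf.run(
    flow="./chat_flow",
    data="./test_data.jsonl",
    column_mapping={"question": "${data.question}"},
    variant="${llm_node.variant_0}",
)

# Evaluation run: score the base run's outputs with an evaluation flow.
eval_run = pf.run(
    flow="./eval_groundedness_flow",
    data="./test_data.jsonl",
    run=base_run,
    column_mapping={
        "question": "${data.question}",
        "answer": "${run.outputs.answer}",
    },
)

# Compare metrics across variants/runs to pick the best flow.
print(pf.get_metrics(eval_run))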
• Groundedness: evaluates how well the model's generated answers align with information from the input source (an illustrative grader sketch follows this list).
• Relevance: evaluates the extent to which the model's generated responses are pertinent and directly related to the given questions.
• Coherence: evaluates how well the language model's output flows smoothly, reads naturally, and resembles human-like language.
• Fluency: evaluates the language proficiency of a generative AI's predicted answer: how well the generated text adheres to grammatical rules and syntactic structures and uses vocabulary appropriately, resulting in linguistically correct and natural-sounding responses.
• Similarity: evaluates the similarity between a ground-truth sentence (or document) and the prediction generated by the AI model.
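These pre-built metrics are GPT-assisted: a judge model grades each answer. The snippet below is only an illustrative grader in the spirit of the groundedness metric, not the built-in implementation; it assumes the openai package's AzureOpenAI client and a placeholder judge deployment.

import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

JUDGE_PROMPT = (
    "Rate from 1 (not grounded) to 5 (fully grounded) how well the ANSWER is "
    "supported by the CONTEXT. Reply with the number only.\n\n"
    "CONTEXT:\n{context}\n\nANSWER:\n{answer}"
)

def groundedness_score(context: str, answer: str) -> int:
    # Ask the judge model for a 1-5 rating and parse it as an integer.
    response = client.chat.completions.create(
        model="gpt-4o",  # judge deployment name; placeholder
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(context=context, answer=answer)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())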