API Management in the AI Era

Slide deck for the presentation on API management in the AI era at Global Azure Bootcamp Melbourne, 2025 edition

Nilesh Gule

May 09, 2025

Transcript

  1. WHO AM I?
    • NAME: Nilesh Gule
    • WORK: Avanade
    • SOCIALS: @nileshgule
    • PASSIONS: Photography, Cricket
    Code with Passion, Strive for Excellence
  2. Challenges in managing GenAI APIs
    • Track token usage across multiple applications
    • Ensure a single app doesn’t consume the whole tokens-per-minute (TPM) quota
    • Secure API keys across multiple applications
    • Distribute load across multiple endpoints
    • Ensure committed capacity in PTUs is exhausted before falling back to a pay-as-you-go (PAYG) instance
  3. Provisioned Throughput Units (PTU)
    • Let you specify the amount of throughput required in a model deployment
    • Granted to a subscription as quota
    • Quota is specific to a region and defines the maximum number of PTUs that can be assigned to deployments in that subscription and region
    • PTU provides:
      • Predictable performance
      • Allocated processing capacity
      • Cost savings
    Understanding costs associated with provisioned throughput units (PTU)
  4. Token Metrics Emitting
    • Sends token usage metrics to Application Insights
    • Provides an overview of the utilization of Azure OpenAI models across multiple applications or API consumers
    GenAI Gateway Capabilities in Azure API Management
  5. Token Rate Limiting
    • Manage and enforce limits per API consumer based on their token usage
    GenAI Gateway Capabilities in Azure API Management
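The idea of a per-consumer token limit can be sketched as a fixed-window counter. This is only an illustration under stated assumptions: the class name, the 60-second window and the injectable `clock` are hypothetical, and the actual policy in Azure API Management may count and reset tokens differently.

```python
import time


class TokenRateLimiter:
    """Fixed-window tokens-per-minute limiter, tracked per consumer (sketch)."""

    WINDOW_SECONDS = 60

    def __init__(self, tokens_per_minute, clock=time.monotonic):
        self.tokens_per_minute = tokens_per_minute
        self.clock = clock  # injectable so the window can be tested
        self._windows = {}  # consumer_id -> (window_start, tokens_used)

    def try_consume(self, consumer_id, tokens):
        """Return True and record usage if the consumer is within its limit."""
        now = self.clock()
        start, used = self._windows.get(consumer_id, (now, 0))
        if now - start >= self.WINDOW_SECONDS:
            start, used = now, 0  # window expired: start a fresh one
        if used + tokens > self.tokens_per_minute:
            self._windows[consumer_id] = (start, used)
            return False  # a gateway would typically answer with HTTP 429
        self._windows[consumer_id] = (start, used + tokens)
        return True
```

Because usage is keyed by consumer, one application exhausting its own window does not affect the others, which is the isolation the challenges slide asks for.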
  6. Load Balancer and Circuit Breaker
    • Helps to spread load across multiple Azure OpenAI endpoints
    • Round-robin, weighted, or priority-based load distribution strategies
    GenAI Gateway Capabilities in Azure API Management
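A priority-based strategy combined with a circuit breaker is what allows a PTU deployment to be used first, with a PAYG endpoint as the fallback. The sketch below illustrates that combination only. The class names, the failure threshold and the trip duration are hypothetical values, not defaults of Azure API Management.

```python
import time


class Backend:
    def __init__(self, name, priority):
        self.name = name
        self.priority = priority  # lower number is tried first
        self.failures = 0
        self.open_until = 0.0  # circuit is open (backend skipped) until this time


class PriorityPool:
    """Pick the highest-priority backend whose circuit is closed (sketch)."""

    def __init__(self, backends, failure_threshold=3, trip_seconds=30.0,
                 clock=time.monotonic):
        self.backends = sorted(backends, key=lambda b: b.priority)
        self.failure_threshold = failure_threshold
        self.trip_seconds = trip_seconds
        self.clock = clock

    def pick(self):
        now = self.clock()
        for backend in self.backends:
            if now >= backend.open_until:
                return backend
        return None  # every circuit is open

    def record_failure(self, backend):
        backend.failures += 1
        if backend.failures >= self.failure_threshold:
            # Trip the circuit: skip this backend for trip_seconds.
            backend.open_until = self.clock() + self.trip_seconds
            backend.failures = 0

    def record_success(self, backend):
        backend.failures = 0
```

With a PTU backend at priority 1 and a PAYG backend at priority 2, traffic stays on the PTU endpoint until repeated failures (for example throttling responses) trip its circuit, then returns to it once the trip period has elapsed.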
  7. Semantic Caching
    • Optimize token usage by leveraging semantic caching
    • Stores completions for prompts with similar meanings
    GenAI Gateway Capabilities in Azure API Management
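"Prompts with similar meanings" is usually decided by comparing embedding vectors. The sketch below shows that lookup with cosine similarity. It assumes the caller supplies an `embed` function that maps a prompt to a vector; that function, the class name and the 0.9 threshold are all hypothetical, and a real gateway would delegate storage and search to an external cache rather than scan a list.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a stored completion when a new prompt is close enough (sketch)."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed  # caller-supplied: prompt -> list of floats
        self.threshold = threshold
        self._entries = []  # list of (embedding, completion)

    def store(self, prompt, completion):
        self._entries.append((self.embed(prompt), completion))

    def lookup(self, prompt):
        query = self.embed(prompt)
        best_score, best_completion = 0.0, None
        for vector, completion in self._entries:
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_completion = score, completion
        # Only a sufficiently similar prompt counts as a cache hit.
        return best_completion if best_score >= self.threshold else None
```

A cache hit skips the model call entirely, which is where the token savings come from; the threshold trades hit rate against the risk of returning an answer to a different question.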
  8. Summary
    • Track token usage across multiple applications
      • Emit Token Metrics policy
    • Ensure a single app doesn’t consume the whole TPM quota
      • Token Limit policy
    • Secure API keys across multiple applications
      • Subscription keys
    • Distribute load across multiple endpoints
      • Backend pool load balancing and circuit breaker
  9. Resources
    • Azure OpenAI Gateway topologies
    • Azure OpenAI Token Limit Policy
    • LLM Token Limit Policy
    • Azure OpenAI Emit Token Metric Policy
    • LLM Emit Token Metric Policy
    • Houssem Dellai YouTube videos
    • GenAI Labs
    • Designing and implementing GenAI gateway solution
  10. Nilesh Gule
    ARCHITECT | MICROSOFT MVP
    “Code with Passion and Strive for Excellence”
    • nileshgule
    • @nileshgule
    • Nilesh Gule
    • NileshGule
    • www.handsonarchitect.com
    • https://www.youtube.com/@nilesh-gule
  11. Source Code & slide deck
    • Nilesh Gule fork - GenAI Labs: https://github.com/NileshGule/AI-Gateway
    • GenAI Labs: https://aka.ms/apim/genai/labs
    • https://speakerdeck.com/nileshgule/
    • https://www.slideshare.net/nileshgule/
  12. Q&A