API Management in the AI Era

MELBOURNE EDITION

2025 SPONSORS

API MANAGEMENT IN THE AI ERA

WHO AM I?
NAME: Nilesh Gule
WORK: Avanade
SOCIALS: @nileshgule
PASSIONS: Photography, Cricket
Code with Passion, Strive for Excellence

API Management - GenAI Gateway
Azure-Samples/AI-Gateway: APIM

Challenges in managing GenAI APIs
• Track token usage across multiple applications
• Ensure a single app doesn’t consume the whole TPM (tokens per minute) quota
• Secure API keys across multiple applications
• Distribute load across multiple endpoints
• Ensure committed capacity in PTUs is exhausted before falling back to a PAYG (pay-as-you-go) instance

Provisioned Throughput Units (PTU)
• Lets you specify the amount of throughput required in a model deployment
• Granted to a subscription as quota
• Quota is specific to a region and defines the maximum number of PTUs that can be assigned to deployments in that subscription and region
• PTUs provide
  • Predictable performance
  • Allocated processing capacity
  • Cost savings
Understanding costs associated with provisioned throughput units (PTU)

Token Metrics Emitting
• Sends token usage metrics to Application Insights
• Provides an overview of the utilization of Azure OpenAI models across multiple applications or API consumers
GenAI Gateway Capabilities in Azure API Management
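To make the idea of per-consumer token metrics concrete, here is a minimal Python sketch. It is not how Azure API Management or Application Insights implement the feature; the `TokenUsage` and `TokenMetrics` names are hypothetical, and the in-memory dictionary stands in for a real metrics sink.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TokenUsage:
    """Running token counts for one API consumer (hypothetical type)."""
    prompt_tokens: int = 0
    completion_tokens: int = 0

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


class TokenMetrics:
    """In-memory stand-in for a metrics sink, keyed by consumer name."""

    def __init__(self) -> None:
        self._usage = defaultdict(TokenUsage)

    def record(self, consumer: str, prompt_tokens: int, completion_tokens: int) -> None:
        # Accumulate the counts reported for a single model response.
        usage = self._usage[consumer]
        usage.prompt_tokens += prompt_tokens
        usage.completion_tokens += completion_tokens

    def total_for(self, consumer: str) -> int:
        # Unknown consumers report zero without creating an entry.
        return self._usage.get(consumer, TokenUsage()).total_tokens


metrics = TokenMetrics()
metrics.record("app-a", prompt_tokens=100, completion_tokens=50)
metrics.record("app-a", prompt_tokens=10, completion_tokens=5)
metrics.record("app-b", prompt_tokens=20, completion_tokens=20)
print(metrics.total_for("app-a"), metrics.total_for("app-b"))
```

Keying the counters by consumer is what lets one shared model deployment be broken down by application afterwards.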

Token Rate Limiting
• Manage and enforce limits per API consumer based on token usage
GenAI Gateway Capabilities in Azure API Management
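The per-consumer limit can be pictured as a tokens-per-minute budget that resets each window. The sketch below assumes a simple fixed 60-second window; the `TokenRateLimiter` class and its `clock` parameter are hypothetical and do not reflect how the gateway counts tokens internally.

```python
import time


class TokenRateLimiter:
    """Fixed-window tokens-per-minute limiter, tracked per consumer."""

    WINDOW_SECONDS = 60.0

    def __init__(self, tokens_per_minute: int, clock=time.monotonic) -> None:
        self.tokens_per_minute = tokens_per_minute
        self._clock = clock  # injectable so the window can be tested
        self._windows = {}   # consumer -> (window_start, tokens_used)

    def allow(self, consumer: str, tokens: int) -> bool:
        now = self._clock()
        start, used = self._windows.get(consumer, (now, 0))
        if now - start >= self.WINDOW_SECONDS:
            # The previous window expired, so the budget resets.
            start, used = now, 0
        if used + tokens > self.tokens_per_minute:
            # Over budget: reject without consuming anything.
            self._windows[consumer] = (start, used)
            return False
        self._windows[consumer] = (start, used + tokens)
        return True


fake_now = [0.0]
limiter = TokenRateLimiter(tokens_per_minute=100, clock=lambda: fake_now[0])
first = limiter.allow("app-a", 60)      # within budget
second = limiter.allow("app-a", 50)     # would exceed 100 in this window
other = limiter.allow("app-b", 50)      # separate consumer, separate budget
fake_now[0] = 61.0
after_reset = limiter.allow("app-a", 50)
```

Because each consumer has its own window, one application exhausting its budget leaves the others unaffected, which is the point of the slide's "single app" concern.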

Load Balancer and Circuit Breaker
• Helps spread load across multiple Azure OpenAI endpoints
• Round-robin, weighted, or priority-based load distribution strategy
GenAI Gateway Capabilities in Azure API Management
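As an illustration of priority-based distribution combined with a circuit breaker, the following Python sketch prefers a PTU backend and falls back to a PAYG backend once the preferred one trips. The `Backend` and `PriorityBalancer` names, the failure threshold, and the cooldown are all assumptions made for the example; the real backend pool is configured in API Management, not written by hand like this.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    priority: int            # lower number = preferred
    failures: int = 0        # consecutive failures seen so far
    open_until: float = 0.0  # circuit stays open (skipped) until this time


class PriorityBalancer:
    """Picks the highest-priority backend whose circuit is closed."""

    def __init__(self, backends, failure_threshold=3, cooldown_seconds=30.0):
        self.backends = sorted(backends, key=lambda b: b.priority)
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds

    def pick(self, now: float):
        for backend in self.backends:
            if now >= backend.open_until:
                return backend
        return None  # every circuit is open

    def report_failure(self, backend: Backend, now: float) -> None:
        backend.failures += 1
        if backend.failures >= self.failure_threshold:
            # Trip the breaker: skip this backend for the cooldown period.
            backend.open_until = now + self.cooldown_seconds
            backend.failures = 0

    def report_success(self, backend: Backend) -> None:
        backend.failures = 0


ptu = Backend("ptu", priority=1)
payg = Backend("payg", priority=2)
balancer = PriorityBalancer([payg, ptu], failure_threshold=2, cooldown_seconds=10.0)

picked_first = balancer.pick(now=0.0).name
balancer.report_failure(ptu, now=0.0)
balancer.report_failure(ptu, now=1.0)            # second failure trips the breaker
picked_during_outage = balancer.pick(now=1.0).name
picked_after_cooldown = balancer.pick(now=11.0).name
```

Giving the PTU backend the better priority is one way to use committed capacity first and send traffic to PAYG only while the PTU circuit is open.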

Semantic Caching
• Optimize token usage by leveraging semantic caching
• Stores completions for prompts with similar meanings
GenAI Gateway Capabilities in Azure API Management
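A rough Python sketch of the lookup behind semantic caching follows. It assumes prompts have already been turned into embedding vectors by some model (not shown), and uses cosine similarity with a hypothetical `threshold`; the `SemanticCache` class is illustrative only and is not the gateway's implementation.

```python
import math


def cosine_similarity(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Returns a stored completion when a new prompt is similar enough."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold  # assumed similarity cut-off
        self._entries = []          # list of (embedding, completion)

    def store(self, embedding, completion: str) -> None:
        self._entries.append((list(embedding), completion))

    def lookup(self, embedding):
        best_score, best_completion = 0.0, None
        for cached_embedding, completion in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_completion = score, completion
        # Only a sufficiently close match counts as a cache hit.
        return best_completion if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.9)
cache.store([1.0, 0.0], "Paris")       # toy 2-D embedding of a cached prompt
hit = cache.lookup([0.99, 0.05])       # nearly the same direction
miss = cache.lookup([0.0, 1.0])        # unrelated prompt
```

A cache hit skips the model call entirely, which is where the token savings come from; the threshold trades hit rate against the risk of returning an answer to a different question.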

Summary
• Track token usage across multiple applications
  • Emit Token Metrics policy
• Ensure a single app doesn’t consume the whole TPM quota
  • Token Limit policy
• Secure API keys across multiple applications
  • Subscription keys
• Distribute load across multiple endpoints
  • Backend pool load balancing and circuit breaker

Resources
• Azure OpenAI Gateway topologies
• Azure OpenAI Token Limit Policy
• LLM Token Limit Policy
• Azure OpenAI Emit Token Metric Policy
• LLM Emit Token Metric Policy
• Houssem Dellai YouTube videos
• GenAI Labs
• Designing and implementing GenAI gateway solution

Nilesh Gule
ARCHITECT | MICROSOFT MVP
“Code with Passion and Strive for Excellence”
nileshgule
@nileshgule
Nilesh Gule
NileshGule
www.handsonarchitect.com
https://www.youtube.com/@nilesh-gule

Source Code & slide deck
Nilesh Gule fork - GenAI Labs: https://github.com/NileshGule/AI-Gateway
GenAI Labs: https://aka.ms/apim/genai/labs
https://speakerdeck.com/nileshgule/
https://www.slideshare.net/nileshgule/

Q&A