API Management in the AI Era

Nilesh Gule @nileshgule API Management in AI Era

$whoami
{
  "name": "Nilesh Gule",
  "role": "Senior Cloud Solutions Architect at Avanade",
  "website": "https://www.HandsOnArchitect.com",
  "github": "https://GitHub.com/NileshGule",
  "twitter": "@nileshgule",
  "linkedin": "https://www.linkedin.com/in/nileshgule",
  "YouTube": "https://www.YouTube.com/@nilesh-gule",
  "likes": "Technical Evangelism, Cricket"
}

Mistral - Prompt and Response Tokens

DeepSeek - Prompt and Response Tokens

API Management - GenAI Gateway Azure-Samples/AI-Gateway: APIM

Challenges in using GenAI APIs
• Track token usage across multiple applications
• Ensure a single app doesn't consume the whole TPM quota
• Distribute load across multiple endpoints
• Ensure committed capacity in PTUs is exhausted before falling back to a PAYG instance

Monitor utilization of models
• Sends token usage metrics to Application Insights
• Provides an overview of model utilization across multiple applications or API consumers
GenAI Gateway Capabilities in Azure API Management
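
The monitoring idea on this slide can be sketched in a few lines of Python. This is only an illustration of aggregating token counts by a dimension such as consumer or model; it is not how Azure API Management or Application Insights implement it. The `TokenUsage` record, the `aggregate_by` helper, and the consumer and deployment names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    """Token counts for one model call (hypothetical record shape)."""
    consumer: str           # assumed API consumer / application id
    model: str              # assumed model deployment name
    prompt_tokens: int
    completion_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


def aggregate_by(records, dimension):
    """Sum total tokens per value of a dimension ("consumer" or "model")."""
    totals = defaultdict(int)
    for record in records:
        totals[getattr(record, dimension)] += record.total_tokens
    return dict(totals)


# Made-up sample data, for illustration only.
records = [
    TokenUsage("app-a", "mistral", prompt_tokens=120, completion_tokens=80),
    TokenUsage("app-b", "mistral", prompt_tokens=50, completion_tokens=25),
    TokenUsage("app-a", "deepseek", prompt_tokens=30, completion_tokens=10),
]

per_consumer = aggregate_by(records, "consumer")
per_model = aggregate_by(records, "model")
print(per_consumer)
print(per_model)
```

Grouping the same records by different dimensions is what lets one metric stream answer both "which app is using the most tokens" and "which model is busiest".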

Token Metrics Emitting – Token Metric Policy

Token Metrics Emitting

Enforce limits per consumer
• Manage and enforce limits per API consumer based on the usage of API tokens
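
A per-consumer token limit can be sketched as a tokens-per-minute budget keyed by consumer. The code below is a simplified fixed-window model under stated assumptions, not the algorithm the Token Limit Policy uses. The `TokenRateLimiter` class, its `try_consume` method, and the consumer keys are hypothetical names. A gateway would typically reject an over-budget request with HTTP 429 instead of returning `False`.

```python
import time


class TokenRateLimiter:
    """Fixed one-minute token budget per consumer key (illustrative only)."""

    def __init__(self, tokens_per_minute, clock=time.monotonic):
        self.tokens_per_minute = tokens_per_minute
        self.clock = clock      # injectable so the example can fake time
        self._windows = {}      # consumer key -> (window start, tokens used)

    def try_consume(self, consumer_key, tokens):
        """Record usage and return True if the request fits the budget."""
        now = self.clock()
        start, used = self._windows.get(consumer_key, (now, 0))
        if now - start >= 60:
            # The previous window has expired, so start a fresh one.
            start, used = now, 0
        if used + tokens > self.tokens_per_minute:
            self._windows[consumer_key] = (start, used)
            return False
        self._windows[consumer_key] = (start, used + tokens)
        return True


# Usage with a fake clock so the result does not depend on real time.
fake_now = [0.0]
limiter = TokenRateLimiter(tokens_per_minute=1000, clock=lambda: fake_now[0])

first = limiter.try_consume("app-a", 700)        # fits the budget
second = limiter.try_consume("app-a", 400)       # 700 + 400 exceeds 1000
other = limiter.try_consume("app-b", 400)        # separate consumer, own budget
fake_now[0] = 61.0                               # move past the window
after_reset = limiter.try_consume("app-a", 400)  # new window, fits again
print(first, second, other, after_reset)
```

Keying the budget by consumer is what stops one application from using up the TPM quota shared by all of them.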

Token Rate Limiting – Token Limit Policy

Token Rate Limiting

Provisioned Throughput Units (PTU)
• Let you specify the amount of throughput required in a model deployment
• Granted to a subscription as quota
• Quota is specific to a region and defines the maximum number of PTUs that can be assigned to deployments in that subscription and region
• PTU provides
  • Predictable performance
  • Allocated processing capacity
  • Cost savings
Understanding costs associated with provisioned throughput units (PTU)

HA - Load Balanced Pool and Circuit Breaker
• Helps to spread load across multiple Azure OpenAI endpoints
• Round-robin, weighted, or priority-based load distribution strategy
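
Priority-based distribution with a circuit breaker can be modeled roughly as follows. This is a minimal sketch assuming a failure-count threshold and a fixed trip duration; it does not reproduce the backend pool implementation in Azure API Management. The `Backend` class, `pick_backend` function, endpoint names, and default values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    priority: int                 # lower value is tried first
    failure_threshold: int = 3    # assumed number of failures that trips the circuit
    trip_seconds: float = 30.0    # assumed time the circuit stays open
    failures: int = 0
    open_until: float = 0.0       # backend is skipped until this time

    def available(self, now):
        return now >= self.open_until

    def record_failure(self, now):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Trip the circuit and reset the counter for the next cycle.
            self.open_until = now + self.trip_seconds
            self.failures = 0

    def record_success(self):
        self.failures = 0


def pick_backend(pool, now):
    """Return the available backend with the best priority, or None."""
    candidates = [b for b in pool if b.available(now)]
    return min(candidates, key=lambda b: b.priority, default=None)


# PTU endpoint is preferred; the PAYG endpoint is the fallback.
ptu = Backend("ptu-endpoint", priority=1)
payg = Backend("payg-endpoint", priority=2)
pool = [ptu, payg]

before = pick_backend(pool, now=0.0).name
for _ in range(3):
    ptu.record_failure(now=0.0)               # e.g. three throttled responses
during = pick_backend(pool, now=10.0).name    # circuit open, falls back
after = pick_backend(pool, now=31.0).name     # circuit closed again
print(before, during, after)
```

Giving the PTU endpoint the best priority models the goal of using committed capacity first and falling back to PAYG only while the preferred backend is unavailable.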

Load Balanced Pool and Circuit Breaker

AI Gateway capabilities of Azure API Management
Security & safety
• Keyless managed identities
• AI Apps & Agents Authorizations (New)
• Content Safety (GA)
• Credential Manager
Resiliency
• Weighted load balancing
• Priority routing to provisioned capacity models
• Backend pools with circuit breaker
• Session-aware load balancing (GA)
Scalability
• Token rate limits and token quotas
• Semantic Caching (GA)
• Model load balancing
• Multi-regional deployments
Traffic mediation & control
• Azure AI Foundry & Azure OpenAI
• OpenAI-compatible models (GA)
• Responses API (GA)
• WebSockets for Realtime APIs
• MCP server pass-through (Soon)
• Expose APIs as built-in MCP server (Preview)
Developer velocity
• Wizard policy configuration experience
• Self-service with the Developer Portal
• API Center Copilot Studio connector (Preview)
• Policy Toolkit
Observability
• Token counting per consumer
• Prompts and completions logging (GA)
• Built-in reporting dashboard (GA)
Governance
• Policy engine with custom expressions
• API Center MCP server registry (Preview)
• Federated API Management

Summary
• Track token usage across multiple applications
  • Emit Token Metric Policy
• Ensure a single app doesn't consume the whole TPM quota
  • Token Limit Policy
• Distribute load across multiple endpoints
  • Backend pool load balancing and circuit breaker

Resources
• Azure OpenAI Gateway topologies
• Azure OpenAI Token Limit Policy
• LLM Token Limit Policy
• Azure OpenAI Emit Token Metric Policy
• LLM Emit Token Metric Policy
• Houssem Dellai YouTube videos
• GenAI Labs
• Designing and implementing GenAI gateway solution

Nilesh Gule
ARCHITECT | MICROSOFT MVP
"Code with Passion and Strive for Excellence"
nileshgule
@nileshgule
Nilesh Gule
NileshGule
www.handsonarchitect.com
https://www.youtube.com/@nilesh-gule

Source Code & slide deck
• Nilesh Gule fork - GenAI Labs: https://github.com/NileshGule/AI-Gateway
• GenAI Labs: https://aka.ms/apim/genai/labs
• https://speakerdeck.com/nileshgule/
• https://www.slideshare.net/nileshgule/

Q&A