multiple applications • Ensure single app doesn’t consume whole TPM quota • Secure API keys across multiple applications • Distribute load across multiple endpoints • Ensure committed capacity in PTUs is exhausted before falling back to PAYG instance
of throughput required in a model deployment. • Granted to subscription as quota • Quota is specific to region and defines the maximum number of PTUs that can be assigned to deployments in the subscription and region • PTU provides • Predictable performance • Allocated processing capacity • Cost savings Understanding costs associated with provisioned throughput units (PTU)
Insights • Provides overview of utilization of Azure OpenAI models across multiple applications or API consumers GenAI Gateway Capabilities in Azure API Management
across multiple Azure OpenAI endpoints • Round-robin, weighted or priority based load distribution strategy GenAI Gateway Capabilities in Azure API Management