

Building a Unified API Gateway for Secure and Scalable Cross-Cloud AI Service

# Cross-Cloud AI Gateway — Unified API Gateway Solution

## Presentation Time

12:00 - 12:30

## Topic Introduction

This is a multi-functional API Gateway system focused on integrating cloud AI services and handling enterprise internal authentication and authorization.

### Core Architecture Directions

- Authentication and Authorization
- Multi-cloud AI Backend Integration (Azure OpenAI, Anthropic, etc.)
- Traffic Control and Resource Management
- Monitoring and Metrics Aggregation

Together, these directions form the core architecture of the project: security and authentication, traffic control, multi-backend AI service invocation, and automated continuous integration and deployment, making it well suited to enterprise-level applications.

https://cloudsummit.ithome.com.tw/2025/session-page/3684


Bo-Yi Wu

July 02, 2025


Transcript

  1. Building a Unified API Gateway for Secure and Scalable Cross-Cloud AI Service (Bo-Yi Wu @ Mediatek, https://blog.wu-boy.com/, 2025/07/03, 12:00 - 12:30)
  2. Core Features

     - API Gateway & Routing Management
     - Authentication & Authorization
     - Quota & Billing Management
     - Multi AI/ML Service Integration
     - Monitoring & Logging
  3. Multi-cloud AI Backend Routing

```go
// Azure OpenAI Routing
r.Any("/openai/*path", azureOpenAIHandler)
// AWS Bedrock Routing
r.Any("/bedrock/*path", bedrockHandler)
// Google Gemini Routing
r.Any("/gemini/*path", geminiHandler)
// Azure Cognitive Routing
r.Any("/cognitive/*path", cognitiveHandler)
// CSES LLM Routing
r.Any("/cses/*path", csesHandler)
```
  4. Endpoint Configuration / Dispatcher Configuration

```go
type endpointconfig struct {
	weight int // Weight value
	url    string
	labels []string
	// ... other configurations
}

type dispatcher struct {
	online   []*endpoint // Online endpoint list
	fallback *endpoint   // Fallback endpoint
	weight   int         // Total weight
}
```
  5. Weight Distribution Algorithm

```go
func (d *dispatcher) pick(req *http.Request) *endpoint {
	// 1. Check for a specific destination
	if dst := req.Header.Get("X-DST"); dst != "" {
		for _, e := range d.online {
			if strings.Contains(e.config.url, dst) {
				return e
			}
		}
	}
	// 2. Weighted random selection
	c := 0
	w := rand.Intn(d.weight) // Generate random number
	for _, e := range d.online {
		// 3. Special handling for PTU mode
		if e.config.ptu {
			if len(ptuVIPs) == 0 || ptuVIPs[req.Header.Get("X-User-Id")] {
				return e
			}
			if e.isOnline() {
				return e
			}
		}
		// 4. Cumulative weight comparison
		if c += e.config.weight; w < c {
			if !e.isOnline() {
				continue // Skip offline endpoints
			}
			return e
		}
	}
	return d.fallback // Return fallback endpoint
}
```
  6. Endpoint State Management / Health Check Mechanism

```go
type endpoint struct {
	success uint64 // Successful request count
	total   uint64 // Total request count
	disable int32  // Disable flag (atomic operation)
}

func (e *endpoint) isOnline() bool {
	return atomic.LoadInt32(&e.disable) == 0
}

func (e *endpoint) offline() bool {
	return atomic.CompareAndSwapInt32(&e.disable, 0, 1)
}
```
  7. Intelligent Health Assessment

```go
func (e *endpoint) recordStatus(resp *http.Response, ttfb time.Duration, logger *slog.Logger) {
	// 1. Determine whether the request was successful
	if e.config.judgeFn == nil || (resp != nil && e.config.judgeFn(resp, ttfb)) {
		atomic.AddUint64(&e.success, 1)
		if atomic.AddUint64(&e.total, 1) == 61 {
			e.reset() // Reset counters
		}
	} else {
		atomic.AddUint64(&e.total, 1)
	}

	// 2. Calculate success rates for all endpoints
	endpoints := e.othersFn()
	rates := make([]float64, 0, len(endpoints))
	for _, oe := range endpoints {
		ptotal := atomic.LoadUint64(&oe.total)
		psuccess := atomic.LoadUint64(&oe.success)
		rates = append(rates, float64(psuccess)/float64(ptotal))
	}

	// 3. Statistical analysis: derive the health threshold
	// (mean/variance accumulation elided on the original slide)
	var mean, stdv float64
	for _, r := range rates {
		mean += r
	}
	mean /= float64(len(rates))
	for _, r := range rates {
		stdv += (r - mean) * (r - mean)
	}
	stdv = math.Sqrt(stdv / float64(len(rates)))
	threshold := mean - 1.9*stdv // 1.9 standard deviations
	if threshold > 0.9 {
		threshold = 0.9
	}
	// threshold then drives the automatic failover check on the next slide
}
```
  8. Automatic Failover

```go
// Automatically take the endpoint offline when its success rate
// falls below the threshold
if (!e.config.ptu && rates[j] < threshold) && e.config.testReqFn != nil && e.offline() {
	logger.Info("endpoint_off", "req", e.config.labels)
	// Background automatic recovery check
	go func(threshold float64, testReqFn func() *http.Request) {
		for i := 0; ; i++ {
			// Execute a test request
			res := httptest.NewRecorder()
			e.proxy.ServeHTTP(res, testReqFn())
			// Check whether the success rate has recovered
			// (psuccess/ptotal are re-read from the endpoint's atomic counters,
			// elided on the original slide)
			rate := float64(psuccess) / float64(ptotal)
			if rate > threshold {
				e.online() // Bring the endpoint back online
				logger.Info("endpoint_on", "req", e.config.labels)
				break
			}
			time.Sleep(1 * time.Second)
		}
	}(threshold, e.config.testReqFn)
}
```
  9. Weight Configuration / Judge Function Configuration

```go
// Judge function: decide whether a response counts as healthy
cfg.judgeFn = func(resp *http.Response, ttfb time.Duration) bool {
	return resp.StatusCode != 429 && // Not rate limited
		resp.StatusCode < 500 && // Not a server error
		ttfb < deadline // TTFB within deadline
}

// Weight configuration
"gpt-4o": NewDispatcher(tr, logger, true, spendFn, []endpointconfig{
	aoai(600000, "openaijapaneast", "gpt-4o"), // High weight
	aoai(60000, "openaieastus2", "gpt-4o"),    // Standard weight
	aoai(60000, "aideopenaiwestus", "gpt-4o"), // Standard weight
}),
```
  10. Features Summary

      1. Weighted Random Distribution: probabilistic distribution based on endpoint weights
      2. Real-time Health Monitoring: intelligent assessment based on success rates and response times
      3. Statistical Threshold: dynamic health-threshold calculation using standard deviation
      4. Automatic Recovery: background continuous testing of offline endpoints with automatic recovery
      5. Special Routing Support: PTU mode and specific destination routing
      6. Graceful Degradation: fallback endpoint used when all endpoints are unavailable
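The statistical threshold in point 3 can be isolated into a small helper. This is a sketch of the slide's mean-minus-1.9-standard-deviations rule with a 0.9 cap; the function name and empty-input handling are mine, not the project's:

```go
package main

import (
	"fmt"
	"math"
)

// healthThreshold derives a dynamic health threshold from per-endpoint
// success rates: the mean rate minus 1.9 population standard deviations,
// capped at 0.9 (the multiplier and cap are the values shown on the slides).
func healthThreshold(rates []float64) float64 {
	if len(rates) == 0 {
		return 0
	}
	var mean float64
	for _, r := range rates {
		mean += r
	}
	mean /= float64(len(rates))

	var variance float64
	for _, r := range rates {
		variance += (r - mean) * (r - mean)
	}
	stdv := math.Sqrt(variance / float64(len(rates)))

	threshold := mean - 1.9*stdv
	if threshold > 0.9 {
		threshold = 0.9
	}
	return threshold
}

func main() {
	// One weak endpoint pulls the threshold well below the healthy rates,
	// so the healthy endpoints stay online while the weak one trips it.
	fmt.Printf("%.3f\n", healthThreshold([]float64{0.99, 0.98, 0.97, 0.60}))
	// With uniform rates the standard deviation is 0 and the 0.9 cap applies.
	fmt.Printf("%.3f\n", healthThreshold([]float64{1, 1, 1}))
}
```

The cap matters: with tightly clustered rates the raw formula would demand near-perfect success, so clamping at 0.9 keeps the offline trigger reasonable.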
  11. Permission Control

      - Verify JWT Token
      - Check Model Permission
      - Check User Tokens
      - Cross-check Multiple Domains (Secure Region)
  12. Core Interface Design

      - Quota Consumption: Spend
      - Alert Mechanism: Alert
      - Rate Limiting: Limit
      - Quota Configuration: SetCap / GetCap
      - Rate Limit Configuration: SetLimit / GetLimit
  13. QuotaClient Interface

```go
type QuotaClient interface {
	Spend(ctx context.Context, user string, amount int) (remain int, err error)
	Alert(ctx context.Context, user string) (cap, amount, alert int, err error)
	Limit(ctx context.Context, user string) (ok bool, req, cap, gap int64, err error)
	GetLimit(ctx context.Context, user string) (result []int, err error)
	SetLimit(ctx context.Context, user string, req, gap int64) (err error)
	SetCap(ctx context.Context, user string, amount int, ts int64) (err error)
	GetCap(ctx context.Context, user string) (usage, cap int64, err error)
}
```

      Redis Key Design Strategy:

      - {user}.cap - Quota limit
      - {user}.YYYYMMDD - Daily usage
      - {user}.alert.YYYYMMDD - Daily alert count
      - {user}.req - Rate limit request count
      - {user}.gap - Rate limit time window
      - {user}.set - Rate limit request records
  14. Quota Consumption Mechanism

```go
func (c *RedisQuotaClient) Spend(ctx context.Context, user string, amount int) (remain int, err error) {
	if user == "" {
		return 1, nil // Return remaining 1 for an empty username
	}
	// Redis key design
	k1 := fmt.Sprintf("{%s}.cap", user)                               // Quota limit key
	k2 := fmt.Sprintf("{%s}.%s", user, time.Now().Format("20060102")) // Daily usage key
	ret := [2]int64{0, 0}
	// Execute atomic operations using a Redis transaction
	resps, err := c.client.DoMulti(
		ctx,
		c.client.B().Multi().Build(),                                   // Begin transaction
		c.client.B().Incrby().Key(k1).Increment(0).Build(),             // Read quota limit
		c.client.B().Incrby().Key(k2).Increment(int64(amount)).Build(), // Increment usage
		c.client.B().Expire().Key(k2).Seconds(86400*7).Build(),         // Set 7-day expiry
		c.client.B().Exec().Build(),                                    // Execute transaction
	)[4].ToArray() // Get the EXEC result
	// Parse results
	for i, resp := range resps[:len(ret)] {
		if ret[i], err = resp.ToInt64(); err != nil {
			return 0, err
		}
	}
	// Default quota logic
	if ret[0] == 0 && c.capping {
		ret[0] = DEFAULT_CAPPING // Use the default quota limit
	}
	// Calculate remaining quota
	if ret[0] > 0 {
		return int(ret[0] - ret[1]), nil // Quota limit minus used amount
	}
	return 1, nil // Always return 1 for unlimited quota
}
```
  15. Design Points (Spend)

      - Atomicity: uses Redis MULTI/EXEC to ensure atomic operations
      - Key Naming: the {user} hash tag in {user}.cap and {user}.YYYYMMDD keeps the same user's keys in the same Redis Cluster slot
      - Expiration Strategy: daily usage keys expire after 7 days to prevent unlimited accumulation
      - Default Quota: uses a default value when the quota is 0 and capping is enabled
  16. Alert Mechanism

```go
func (c *RedisQuotaClient) Alert(ctx context.Context, user string) (cap, amount, alert int, err error) {
	date := time.Now().Format("20060102")
	k1 := fmt.Sprintf("{%s}.cap", user)            // Quota limit
	k2 := fmt.Sprintf("{%s}.%s", user, date)       // Daily usage
	k3 := fmt.Sprintf("{%s}.alert.%s", user, date) // Daily alert count
	ret := [3]int64{0, 0, 0}
	resps, err := c.client.DoMulti(
		ctx,
		c.client.B().Multi().Build(),
		c.client.B().Incrby().Key(k1).Increment(0).Build(),     // Read quota
		c.client.B().Incrby().Key(k2).Increment(0).Build(),     // Read usage
		c.client.B().Incrby().Key(k3).Increment(1).Build(),     // Increment alert count
		c.client.B().Expire().Key(k3).Seconds(86400/2).Build(), // 12-hour expiry
		c.client.B().Exec().Build(),
	)[5].ToArray()
	// Parse and return results
	for i, resp := range resps[:len(ret)] {
		if ret[i], err = resp.ToInt64(); err != nil {
			return 0, 0, 0, err
		}
	}
	if ret[0] == 0 && c.capping {
		ret[0] = DEFAULT_CAPPING
	}
	cap, amount, alert = int(ret[0]), int(ret[1]), int(ret[2])
	return
}
```
  17. Alert Mechanism Features

      - Frequency Control: the alert counter expires in 12 hours to avoid excessive alerts
      - Triple Information: returns quota limit, usage amount, and alert count
  18. Lua Script Implementing the Sliding Window Algorithm

```go
// KEYS[1]: request limit count, KEYS[2]: time window, KEYS[3]: request record set
var RateScript = rueidis.NewLuaScript(`
local capgap = redis.call('MGET', KEYS[1], KEYS[2])
local cnt, now, cap, gap = 0, tonumber(ARGV[1]), tonumber(capgap[1]), tonumber(capgap[2])
if cap == nil or gap == nil then
  return {true, 0, 0, 0} -- No limit set, allow through
end
-- Remove expired request records
redis.call('ZREMRANGEBYSCORE', KEYS[3], 0, now - gap*1000)
-- Count requests in the current time window
cnt = redis.call('ZCARD', KEYS[3])
if cnt < cap then
  -- Under the limit, record the new request
  redis.call('ZADD', KEYS[3], now, now)
  redis.call('EXPIRE', KEYS[3], gap)
  return {true, cnt + 1, cap, gap}
end
-- Over the limit
return {false, cnt, cap, gap}
`)
```
  19. Sliding Window Features

      - Precise Control: uses a Sorted Set to record the exact timestamp of each request
      - Auto Cleanup: ZREMRANGEBYSCORE automatically removes expired requests
      - Flexible Window: the gap parameter controls the time window size
  20. Set Quota Limit

```go
func (c *RedisQuotaClient) SetCap(ctx context.Context, user string, amount int, ts int64) (err error) {
	key := fmt.Sprintf("{%s}.cap", user)
	for _, resp := range c.client.DoMulti(
		ctx,
		c.client.B().Set().Key(key).Value(strconv.Itoa(amount)).Build(), // Set quota
		c.client.B().Expireat().Key(key).Timestamp(ts).Build(),          // Set expiry time
	) {
		if err := resp.Error(); err != nil {
			return err
		}
	}
	return nil
}
```
  21. Get Quota Status

```go
func (c *RedisQuotaClient) GetCap(ctx context.Context, user string) (usage, cap int64, err error) {
	k1 := fmt.Sprintf("{%s}.cap", user)
	k2 := fmt.Sprintf("{%s}.%s", user, time.Now().Format("20060102"))
	ret := [2]int64{0, 0}
	resps, err := c.client.DoMulti(
		ctx,
		c.client.B().Multi().Build(),
		c.client.B().Incrby().Key(k1).Increment(0).Build(), // Read quota limit
		c.client.B().Incrby().Key(k2).Increment(0).Build(), // Read daily usage
		c.client.B().Expire().Key(k2).Seconds(86400*7).Build(),
		c.client.B().Exec().Build(),
	)[4].ToArray()
	// Process results...
	return ret[1], ret[0], nil // Return (usage, quota limit)
}
```
  22. Key Design Principles (Performance Optimization)

      - Batch Operations: uses DoMulti to reduce network round trips
      - Atomicity: all critical operations execute within Redis transactions
      - Expiration Strategy: automatically cleans up expired data to prevent memory leaks
  23. Key Design Principles (Fault Tolerance Design)

      - Default Values: uses environment-variable defaults when configuration doesn't exist
      - Empty User Handling: returns safe default values for empty usernames
      - Error Handling: complete error handling for every Redis operation
  24. Service Integration

      - Azure OpenAI Service (Multi-Region)
      - AWS Bedrock (Claude Series) Integration
      - Google Gemini Integration
      - Azure Cognitive Services Integration
      - Mediatek Internal LLM Services
  25. Cost Calculation and Quota Management

```go
return spendFn{
	Pre: func(user, domain string, req *http.Request, body string) (estimate int, remain int, err error) {
		// Token calculation
		tokens := len(encoder.Encode(body, nil, nil))
		// Special token calculation for models like GPT-4 Vision
		if strings.Contains(model, "gpt-4o") || strings.Contains(model, "vision") {
			var sb streambody
			if err := json.Unmarshal([]byte(body), &sb); err == nil {
				tokens = sb.CountToken(encoder, tokens)
			}
		}
		estimate = tokens * promptrate
		remain, err = qc.Spend(context.Background(), user, estimate)
		return
	},
	Post: func(user string, resp *http.Response, body string, preestimate int) (estimate int) {
		// Calculate the actual cost based on the response
		if resp.Header.Get("content-type") == "text/event-stream" {
			estimate = strings.Count(body, "\ndata: {") * completionrate
		} else {
			var resp AoaiResp
			if err := json.Unmarshal([]byte(body), &resp); err == nil {
				estimate = completionrate*resp.Usage.CompletionTokens + promptrate*resp.Usage.PromptTokens - preestimate
			}
		}
		return
	},
}
```
  26. Metrics

      - ttfbHistogram: measures Time-To-First-Byte (TTFB) for API responses
      - latencyHistogram: tracks complete request duration
      - successRateGauge: monitors the success rate of API calls
      - endpointStatusGauge: tracks endpoint availability (1 = online, 0 = offline)
  27. Structured Log Format

      - Structured JSON logging with key-value pairs
      - Includes HTTP headers, status codes, and performance metrics
      - Tracks API versions and destinations
      - Records estimated costs and priority levels