x100 traffic spikes
- Famous singer’s concert: 5M social media followers, 5K fans on 1 LINE OpenChat (“Hot chat”)
- 200K API requests / 1 min on 1 hot chat
Send message API requests
push 3. Fetch events
1 event x 5,000 = 5K fetch events
Events on 1 chat: Send Message, Message Reaction, Message Mark As Read, Chat Status Update, Member Status Update, Note & Post Update
server Storages LINE client LINE client LINE client Publish server
1. Send message
2. Store on storages
3. Publish event
4. Server push
5. Fetch events
shard MySQL Redis OpenChat server MySQL › Response timeout on storage’s 1 shard Chat related query / 1 min Slow query / 1 min Chat related request / 1 min Response timeout / 1 min +200 ~ 300% +200 ~ 300%
{
    boolean consume = bucket.get(chatId).tryConsume();
    if (!consume) {
        hotChatStorage.set(chatId, N seconds);
    }
}
Fetch events Chat A Chat B Hot chat threshold
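The detection snippet above can be fleshed out as a minimal sketch, assuming `bucket` is a per-chat token bucket and `hotChatStorage` is a store with a time-to-live. The slide does not say which implementation is used, so `TokenBucket`, `HotChatDetector`, and the constructor parameters below are hypothetical names, and the in-memory map only stands in for whatever storage actually holds the hot chat flag.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical token bucket: holds up to `capacity` tokens, refilled continuously. */
final class TokenBucket {
    private final long capacity;
    private final double refillPerNano;
    private double tokens;
    private long lastNanos;

    TokenBucket(long capacity, double refillPerSecond, long nowNanos) {
        this.capacity = capacity;
        this.refillPerNano = refillPerSecond / 1_000_000_000.0;
        this.tokens = capacity;
        this.lastNanos = nowNanos;
    }

    /** Takes one token if available; returns false when the bucket is empty. */
    synchronized boolean tryConsume(long nowNanos) {
        double refilled = tokens + (nowNanos - lastNanos) * refillPerNano;
        tokens = Math.min(capacity, refilled);
        lastNanos = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}

/** Assumed stand-in for bucket + hotChatStorage: marks a chat hot for N seconds. */
final class HotChatDetector {
    private final Map<String, TokenBucket> buckets = new ConcurrentHashMap<>();
    private final Map<String, Long> hotUntilNanos = new ConcurrentHashMap<>();
    private final long threshold;          // bucket capacity, the "hot chat threshold"
    private final double refillPerSecond;  // sustained request rate that is still fine
    private final long hotNanos;           // the "N seconds" from the slide

    HotChatDetector(long threshold, double refillPerSecond, long hotSeconds) {
        this.threshold = threshold;
        this.refillPerSecond = refillPerSecond;
        this.hotNanos = hotSeconds * 1_000_000_000L;
    }

    /** Records one request for the chat; returns true if the chat is currently hot. */
    boolean record(String chatId, long nowNanos) {
        TokenBucket bucket = buckets.computeIfAbsent(
                chatId, id -> new TokenBucket(threshold, refillPerSecond, nowNanos));
        boolean consume = bucket.tryConsume(nowNanos);
        if (!consume) {
            // Bucket exhausted: flag the chat as hot for the next N seconds.
            hotUntilNanos.put(chatId, nowNanos + hotNanos);
        }
        return isHot(chatId, nowNanos);
    }

    boolean isHot(String chatId, long nowNanos) {
        Long until = hotUntilNanos.get(chatId);
        return until != null && nowNanos < until;
    }
}
```

A caller would pass `System.nanoTime()` as `nowNanos`; taking the clock as a parameter keeps the sketch easy to test.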
chat A 5K members Shard 1 New event New event New event New event … Server push on hot chat A 5K members New event New event New event 1 second 1 second Shard 1 5K fetch events Throttling X%
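One way to read "Throttling X%" is that the server skips X% of pushes for a chat once it has been flagged hot. The sketch below makes that assumption and drops pushes deterministically with a running credit counter, so roughly (100 - X)% of pushes still go out. `PushThrottle`, `markHot`, and `shouldPush` are hypothetical names, and it is an assumption here that clients whose push is skipped catch up on a later fetch.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical push throttle: skips throttlePercent% of pushes on hot chats only. */
final class PushThrottle {
    private final int throttlePercent; // the "X" in "Throttling X%"
    private final Set<String> hotChats = ConcurrentHashMap.newKeySet();
    private final ConcurrentHashMap<String, Integer> credit = new ConcurrentHashMap<>();

    PushThrottle(int throttlePercent) {
        if (throttlePercent < 0 || throttlePercent > 100) {
            throw new IllegalArgumentException("throttlePercent must be in [0, 100]");
        }
        this.throttlePercent = throttlePercent;
    }

    void markHot(String chatId) {
        hotChats.add(chatId);
    }

    void unmarkHot(String chatId) {
        hotChats.remove(chatId);
        credit.remove(chatId);
    }

    /** Returns true if this push should be sent. Normal chats are never throttled. */
    boolean shouldPush(String chatId) {
        if (!hotChats.contains(chatId)) {
            return true;
        }
        boolean[] allowed = {false};
        // Each call earns (100 - X) credit; a push is sent whenever 100 credit is reached.
        credit.compute(chatId, (id, c) -> {
            int next = (c == null ? 0 : c) + (100 - throttlePercent);
            if (next >= 100) {
                allowed[0] = true;
                next -= 100;
            }
            return next;
        });
        return allowed[0];
    }
}
```

A random drop with probability X/100 would also work; the counter is used here only because it makes the ratio exact and reproducible.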
Insert chat member query on MySQL
- “INSERT … SELECT” query is a “bulk insert”: AUTO-INC table-level lock
- AUTO_INCREMENT handling in InnoDB https://dev.mysql.com/doc/refman/5.6/en/innodb-auto-increment-handling.html
MySQL QPS AUTO-INC table lock CPU usage JOINED … … … AUTO-INC table-level lock Insert chat member query on MySQL CPU usage 100%
lock Chat member Chat id 100 State = LEAVED State = JOINED State = JOINED Query cache Member count: 2 State = JOINED State = JOINED . . . State = JOINED Member count: 3 Member count: 4 Member count: 5 3 joins 3 updates
uses table-level lock MySQL query cache table lock . . . Chat member Chat id 100 State = LEAVED State = JOINED State = JOINED Query cache Member count: 2 State = JOINED State = JOINED State = JOINED Member count: 3 Member count: 4 Member count: 5 3 joins 3 updates . . .
count Chat id 100 2 Add chat member count table › Time complexity of get chat member count query: O(N) -> O(log N) Chat member Chat id 100 State = LEAVED State = JOINED State = JOINED Query cache Member count: 2
LINE client LINE client LINE client Publish server › Limit the upper bound of join spikes without dependency, delay by local cache Limit X join / Y second on single chat Join throttling (can be delayed) Local Cache
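The "Limit X join / Y second on single chat" rule can be sketched with a fixed-window counter kept in each server's local memory, which is one plausible reading of "without dependency". `JoinThrottle` and `tryJoin` are hypothetical names, and the fixed-window algorithm is an assumption: the slide does not say how the limit is counted.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical local join limiter: at most maxJoins joins per window per chat. */
final class JoinThrottle {
    /** Counter for one chat within the current window. */
    private static final class Window {
        final long startMillis;
        int count;

        Window(long startMillis) {
            this.startMillis = startMillis;
        }
    }

    private final ConcurrentHashMap<String, Window> windows = new ConcurrentHashMap<>();
    private final int maxJoins;      // the "X" in "X join / Y second"
    private final long windowMillis; // the "Y second" window length

    JoinThrottle(int maxJoins, long windowSeconds) {
        this.maxJoins = maxJoins;
        this.windowMillis = windowSeconds * 1000L;
    }

    /** Returns true if the join may proceed now; false means it should be retried later. */
    boolean tryJoin(String chatId, long nowMillis) {
        boolean[] allowed = {false};
        // compute() runs atomically per key, so the counter needs no extra locking.
        windows.compute(chatId, (id, w) -> {
            Window cur = (w == null || nowMillis - w.startMillis >= windowMillis)
                    ? new Window(nowMillis)
                    : w;
            if (cur.count < maxJoins) {
                cur.count++;
                allowed[0] = true;
            }
            return cur;
        });
        return allowed[0];
    }
}
```

Because the counter is local to each server, the effective bound across the fleet is X multiplied by the number of servers, so X would need to be chosen with that in mind. A production version would also evict idle chat entries, which this sketch omits.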
What we’ve learned
load on storages with various patterns
- It’s hard to predict hot chat patterns and bottlenecks
Detection & Handling
- Monitor requests by API, country, app type, chat
- Prepare throttling beforehand and improve bottlenecks later
- Local cache and dynamic configuration are an effective way to handle hot chat
Isolation
- Sharding is not enough; each storage’s shards need a circuit breaker and bulkhead
- Hot chat is only 0.1%; choose a solution that fits well for that 0.1% hot chat
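The isolation point (a circuit breaker and bulkhead on each storage shard) can be sketched as one guard object per shard. This is a minimal illustration under assumptions, not the implementation described in the talk: `ShardGuard`, `ShardRejectedException`, and the thresholds are hypothetical, the bulkhead is a plain `Semaphore`, and the breaker opens after a run of consecutive failures.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

/** Thrown when the guard refuses a call instead of sending it to the shard. */
final class ShardRejectedException extends RuntimeException {
    ShardRejectedException(String message) {
        super(message);
    }
}

/** Hypothetical per-shard guard: bulkhead (concurrency cap) plus circuit breaker. */
final class ShardGuard {
    private final Semaphore permits;    // bulkhead: max in-flight calls to this shard
    private final int failureThreshold; // consecutive failures before the circuit opens
    private final long openMillis;      // how long the circuit stays open
    private int consecutiveFailures = 0;
    private long openUntilMillis = 0;

    ShardGuard(int maxConcurrent, int failureThreshold, long openMillis) {
        this.permits = new Semaphore(maxConcurrent);
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    <T> T call(Callable<T> query, long nowMillis) throws Exception {
        synchronized (this) {
            if (nowMillis < openUntilMillis) {
                throw new ShardRejectedException("circuit open");
            }
        }
        if (!permits.tryAcquire()) {
            throw new ShardRejectedException("bulkhead full");
        }
        try {
            T result = query.call();
            synchronized (this) {
                consecutiveFailures = 0;
            }
            return result;
        } catch (Exception e) {
            synchronized (this) {
                consecutiveFailures++;
                if (consecutiveFailures >= failureThreshold) {
                    // Fail fast for a while so one bad shard cannot tie up every caller.
                    openUntilMillis = nowMillis + openMillis;
                    consecutiveFailures = 0;
                }
            }
            throw e;
        } finally {
            permits.release();
        }
    }
}
```

With one `ShardGuard` per shard, a timeout storm on the shard that hosts a hot chat is rejected quickly at that shard while requests to the other shards keep their own permits.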
Hot Chat
- Fetch spikes on hot chat
- Join spikes on hot chat
Detection & Handling
- Hot chat detection & throttling
- Improve MySQL bottlenecks & apply join throttling
Isolation
- Circuit breaker, bulkhead
- Choose a solution that fits well for the 0.1% hot chat
chat throttling with other methods that reduce server-side load
- Hot key problem
- More traffic, more features, more hot chat patterns
- Hot chat storage dynamic isolation