Scaling Time-Series Data to Infinity: A Kubernetes-Powered Solution with Envoy

This material is from the session at KubeDay Japan 2024.

LY Corporation Tech

August 27, 2024

Transcript

  1. Scaling Time-Series Data to Infinity: A Kubernetes-Powered Solution with Envoy
     Hiroki Sakamoto, Senior Software Engineer, LY Corporation
  2. Observability is getting expensive: as data increases, several issues arise
     • Cost
     • Scalability
     • Capacity
  3. [Architecture diagram] Prometheus / Metrics Agent / OTel Collector / user clients -> Ingestion API -> Time-Series DB -> Query API -> Prometheus / Grafana / user clients
  4. [Architecture diagram] The same pipeline, with the internal product names IMON and Flash.
  5. What are metrics? A metric consists of metadata and samples.
     Metadata: cpu_usage{pod="app-0", environment="prod", node="node-x"}
     Samples: (T: 1697530930, V: 80), (T: 1697563930, V: 92), (T: 1697566930, V: 76), (T: 1697569930, V: 64), (T: 1697572930, V: 51)
  6. [Read path, step 1] The client sends PromQL to the Query API, which retrieves the target metric IDs from the Metadata Database.
  7. [Read path, step 2] The Query API retrieves samples from the Sample Database using those metric IDs and the requested time range.
  8. [Read path, step 3] The Query API evaluates the PromQL against the samples and returns the results to the client.
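A minimal sketch of this three-step read path, assuming hypothetical MetadataDB and SampleDB interfaces (the real APIs are not shown in the talk); the PromQL evaluation step is left out.

```go
package querypath

import (
	"context"
	"fmt"
)

// MetadataDB resolves a PromQL selector to metric IDs (hypothetical interface).
type MetadataDB interface {
	ResolveIDs(ctx context.Context, promql string) ([]uint64, error)
}

// SampleDB returns samples for the given IDs and time range (hypothetical interface).
type SampleDB interface {
	FetchSamples(ctx context.Context, ids []uint64, fromSec, toSec int64) (map[uint64][]float64, error)
}

// Query mirrors the three steps on the slides:
// 1) resolve IDs from metadata, 2) fetch samples, 3) evaluate and return.
func Query(ctx context.Context, meta MetadataDB, samples SampleDB, promql string, fromSec, toSec int64) (map[uint64][]float64, error) {
	ids, err := meta.ResolveIDs(ctx, promql)
	if err != nil {
		return nil, fmt.Errorf("resolve IDs: %w", err)
	}
	series, err := samples.FetchSamples(ctx, ids, fromSec, toSec)
	if err != nil {
		return nil, fmt.Errorf("fetch samples: %w", err)
	}
	// Step 3 (PromQL evaluation over the fetched series) is omitted here.
	return series, nil
}
```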
  9. Current storage layout:
     • In-Memory Layer (data within 1 day): metadata custom-built, samples custom-built
     • Persistent Layer (data after 1 day): metadata in Elasticsearch, samples in Cassandra
  10. Scale:
     • Number of metrics: 1 billion
     • Sample data size (with replication): 1 PB
     • Ingested sample size per day: 2.7 TB
     • Ingested samples per day: 1.8 trillion
  11. Cassandra was the bottleneck for us
     • Cost: expensive due to 1 PB of samples
     • Scalability: scaling out a single node takes 6 hours, and repair never completes
     • Capacity: not allowed to obtain more nodes
  12. Why not use Object Storage?
     • Cost-effective
     • Storage concerns are NOT an issue
     • Sufficient capacity and scalability
     • Real-world examples exist (Cortex, Mimir, Thanos)
  13. New storage layout:
     • In-Memory Layer (data within 1 day): custom-built
     • Persistent Layer 1 (data from 1 day to 2 weeks): Cassandra
     • Persistent Layer 2 (data older than 2 weeks): S3-compatible Object Storage (New!)
  14. Requirements for the S3-compatible Object Storage layer:
     • Input: metric IDs (e.g. 1, 9, 200, 320) and a time range (e.g. 2024-08-04 12:00 - 17:00)
     • Output: the matching samples, e.g. (T: 1697530930, V: 80), (T: 1697563930, V: 92), (T: 1697566930, V: 76), (T: 1697569930, V: 64), (T: 1697572930, V: 51)
  15. Data sharding is important
     • With 1 billion metrics, samples inevitably have to be merged according to some rule
     • For concurrency: efficient write processing and efficient read processing
  16. Data sharding strategy (see the sketch below):
     • 1 bucket = 1-week time window
     • 1 directory = 4-hour time window + tenant + shard factor (metric ID % numShards)
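A minimal sketch of deriving a bucket and directory from that strategy. The exact path layout and naming are not given in the talk, so everything here (weekly truncation, formats, numShards) is an illustrative assumption.

```go
package sharding

import (
	"fmt"
	"time"
)

const numShards = 32 // matches the 32 Shard Aggregators mentioned later

// ObjectKey derives the bucket and directory for one sample file, following
// the strategy on the slide: a weekly bucket, and a directory per
// 4-hour window, tenant, and shard (metric ID % numShards).
func ObjectKey(tenant string, metricID uint64, ts time.Time) (bucket, dir string) {
	// Weekly bucket: here simply a 7-day window counted from the Unix epoch.
	week := ts.Unix() / (7 * 24 * 3600)
	bucket = fmt.Sprintf("metrics-week-%d", week)

	// 4-hour window inside the week, plus tenant and shard factor.
	window := ts.Truncate(4 * time.Hour)
	shard := metricID % numShards
	dir = fmt.Sprintf("%s/%s/shard-%d", tenant, window.UTC().Format("2006-01-02T15"), shard)
	return bucket, dir
}
```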
  17. Layout of one 4-hour data file inside a 1-week bucket (shard-1_from-timestamp_to-timestamp):
     0x001 | samples of ID:1
     0x014 | samples of ID:10
     0x032 | samples of ID:20
     0x036 | samples of ID:32
     All samples in the file belong to the same shard.
  18. The same 4-hour file together with its index, which maps each metric ID to the byte offset of its samples:
     ID = 1  -> 0x001
     ID = 10 -> 0x014
     ID = 20 -> 0x032
     ID = 32 -> 0x036
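A minimal sketch of that index as an in-memory lookup from metric ID to byte location. The on-disk index format is not described in the talk; the Length field is an added assumption so the result can drive a byte-range read later.

```go
package index

import "fmt"

// Entry records where one metric's samples live inside a 4-hour sample file.
type Entry struct {
	Offset int64 // byte offset of the first sample, e.g. 0x014 for ID 10
	Length int64 // number of bytes to read (assumed field, not on the slide)
}

// Index maps metric IDs to their byte locations, as on the slide.
type Index map[uint64]Entry

// Locate returns the byte range to request for one metric ID.
func (idx Index) Locate(id uint64) (Entry, error) {
	e, ok := idx[id]
	if !ok {
		return Entry{}, fmt.Errorf("metric ID %d not in index", id)
	}
	return e, nil
}
```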
  19. [Write path to Cassandra, step 1] Batch Nodes 1-16 retrieve 4 hours of data from the In-Memory DB (Data Nodes 1-150).
  20. [Write path to Cassandra, step 2] The Batch Nodes compress the samples and save them to Cassandra, one row per metric: ID=1: compressed samples for 4h, ID=2: compressed samples for 4h, ID=3: compressed samples for 4h, ...
  21. [Write path to Object Storage] The Batch Nodes retrieve 4 hours of data from the In-Memory DB as before, but how should they write it to the S3-compatible Object Storage?
  22. [Write path to Object Storage] The Batch Nodes send the data to Shard Aggregators 1-32, which compress and aggregate it before uploading to the S3-compatible Object Storage.
  23. New process: Shard Aggregator
     • Aggregates samples according to the sharding strategy
     • Can scale out when the number of shards increases
     • Persists samples as soon as it receives them, for resiliency (WAL)
  24. We started using Kubernetes for the new services
     • Infrastructure abstraction
     • Self-healing
     • Unified observability
     • Unified deployment flow
  25. [Aggregation, step 1] Each Batch Node sets the shard factor in a gRPC header.
  26. [Aggregation, step 2] Requests are routed to the corresponding Shard Aggregator Pod using that header.
  27. [Aggregation, step 3] Each Shard Aggregator persists the incoming samples in a local LevelDB (an LSM-Tree key-value store).
  28. [Aggregation, step 4] Each Shard Aggregator exports its aggregated samples to Object Storage. A sketch of the shard-header step from slides 25-26 follows.
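A minimal sketch of the client side of steps 1-2: attaching the shard factor as gRPC metadata so an L7 proxy can route on it. The header name "x-shard" is an assumption; the talk does not name the header, and the proxy-side routing configuration is not shown.

```go
package aggregatorclient

import (
	"context"
	"strconv"

	"google.golang.org/grpc/metadata"
)

const numShards = 32

// WithShardHeader attaches the shard factor (metric ID % numShards) as a
// gRPC metadata header so the proxy can route the request to a fixed
// Shard Aggregator Pod. "x-shard" is an illustrative header name.
func WithShardHeader(ctx context.Context, metricID uint64) context.Context {
	shard := metricID % numShards
	return metadata.AppendToOutgoingContext(ctx, "x-shard", strconv.FormatUint(shard, 10))
}
```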
  29. Choose the right key-value store: LSM-Tree (LevelDB, RocksDB) vs. B+Tree (etcd.io/bbolt). Write performance and read performance vary by use case.
  30. For the write-heavy Shard Aggregator, the LSM-Tree side (LevelDB) was the better fit.
  31. Optimizations on the LSM-Tree: since the data is read only once, when uploading, we
     • Disabled compaction
     • Bypassed the page cache as much as possible (fadvise); a sketch follows
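A minimal sketch of the fadvise idea: dropping a file's pages from the Linux page cache once it has been read and uploaded. This uses posix_fadvise via golang.org/x/sys/unix and illustrates the technique only; it is not the talk's actual LevelDB integration.

```go
package pagecache

import (
	"os"

	"golang.org/x/sys/unix"
)

// DropFromPageCache tells the kernel we will not need this file's pages
// again, so they can be evicted from the page cache. Useful for data that
// is uploaded once and never re-read locally. Linux-specific.
func DropFromPageCache(f *os.File) error {
	// Flush dirty pages first so FADV_DONTNEED can actually evict them.
	if err := f.Sync(); err != nil {
		return err
	}
	// offset=0, length=0 means "the whole file".
	return unix.Fadvise(int(f.Fd()), 0, 0, unix.FADV_DONTNEED)
}
```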
  32. Optimizations on the LSM-Tree: fsync once per multiple requests for better performance. Writes first land in kernel space (the page cache), so even if a Pod is killed, the dirty page cache remains in the kernel and is still flushed to disk. A sketch follows.
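A minimal sketch of fsyncing once per batch of requests instead of once per request; the type and the batching policy (count-based) are illustrative assumptions, not the talk's implementation.

```go
package wal

import (
	"os"
	"sync"
)

// BatchedWAL appends records from many requests and fsyncs only once per
// batch, trading a small durability window for much better throughput.
type BatchedWAL struct {
	mu      sync.Mutex
	f       *os.File
	pending int
	every   int // fsync after this many appends
}

func Open(path string, every int) (*BatchedWAL, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return nil, err
	}
	return &BatchedWAL{f: f, every: every}, nil
}

// Append writes one record; the data first lands in the kernel page cache,
// so it survives a crash of this process (though not of the whole node)
// even before the next fsync.
func (w *BatchedWAL) Append(record []byte) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	if _, err := w.f.Write(record); err != nil {
		return err
	}
	w.pending++
	if w.pending >= w.every {
		w.pending = 0
		return w.f.Sync()
	}
	return nil
}
```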
  33. Write performance with 32 Shard Aggregator Pods:
     • Aggregating and writing 450 GB every 4 hours takes 40 minutes
     • Each Pod consumes only 3 GB of memory
     • No outage so far
  34. New process: Storage Gateway
     • Communicates directly with Object Storage
     • Returns samples stored in Object Storage
     • Caches data to reduce RPS to Object Storage and return results faster
  35. [Read path, step 1] The Query API requests samples; the Storage Gateway downloads the index and identifies the byte locations in the sample file.
  36. [Read path, step 2] The Storage Gateway downloads the samples with a byte-range request and returns them. A byte-range sketch follows.
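A minimal sketch of downloading only the needed bytes with an HTTP Range request against an S3-compatible endpoint. The plain-HTTP client and URL handling are illustrative assumptions; a real Storage Gateway would likely go through an S3 SDK.

```go
package fetch

import (
	"fmt"
	"io"
	"net/http"
)

// DownloadRange fetches bytes [offset, offset+length) of an object using an
// HTTP Range header, so only the needed sample bytes are transferred
// instead of the whole file.
func DownloadRange(url string, offset, length int64) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", offset, offset+length-1))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("expected 206 Partial Content, got %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}
```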
  37. The same key-value store comparison, revisited for the read path: LSM-Tree (LevelDB, RocksDB) vs. B+Tree (etcd.io/bbolt). For the read-heavy Storage Gateway, the B+Tree side fits better (next slide).
  38. Distributed cache with bbolt & Envoy
     • etcd-io/bbolt: on-disk B+Tree key-value store, better read performance, page cache works well
     • Envoy: L7 load balancer that routes requests to fixed Pods, supports active health checks, and supports Maglev hashing, which is optimized for even distribution
  39. [Distributed read, step 1] The Query API splits a query into multiple small ones along the 4-hour shard boundaries (see the sketch below).
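A minimal sketch of splitting a query's time range into 4-hour windows, so each sub-query maps to one shard directory and can be routed to, and cached by, a fixed Storage Gateway Pod. The types are illustrative.

```go
package split

import "time"

// Window is one sub-query covering a single 4-hour shard window.
type Window struct {
	From, To time.Time
}

// SplitBy4h splits [from, to) into sub-queries aligned to 4-hour
// boundaries (UTC), mirroring the 4-hour directory layout.
func SplitBy4h(from, to time.Time) []Window {
	var out []Window
	cur := from
	for cur.Before(to) {
		next := cur.Truncate(4 * time.Hour).Add(4 * time.Hour)
		if next.After(to) {
			next = to
		}
		out = append(out, Window{From: cur, To: next})
		cur = next
	}
	return out
}
```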
  40. [Distributed read, step 2] Envoy routes each shard request to a fixed Storage Gateway Pod using Maglev hashing.
  41. [Distributed read, step 3] Each Storage Gateway downloads the index and samples for its shard.
  42. [Distributed read, step 4] Each Storage Gateway caches the downloaded indices and samples in bbolt. A read-through cache sketch follows.
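A minimal sketch of a read-through cache backed by bbolt (go.etcd.io/bbolt), assuming a single bucket and opaque byte values; the real Storage Gateway's cache layout and eviction are not described in the talk.

```go
package cache

import (
	bolt "go.etcd.io/bbolt"
)

var bucketName = []byte("objects")

// Cache is a read-through cache backed by bbolt (on-disk B+Tree), so hot
// indices and sample ranges are served from local disk and the page cache
// instead of Object Storage.
type Cache struct {
	db *bolt.DB
}

func Open(path string) (*Cache, error) {
	db, err := bolt.Open(path, 0o600, nil)
	if err != nil {
		return nil, err
	}
	if err := db.Update(func(tx *bolt.Tx) error {
		_, err := tx.CreateBucketIfNotExists(bucketName)
		return err
	}); err != nil {
		return nil, err
	}
	return &Cache{db: db}, nil
}

// GetOrFetch returns the cached value for key, or calls fetch (e.g. an
// Object Storage download) and stores the result on a miss.
func (c *Cache) GetOrFetch(key []byte, fetch func() ([]byte, error)) ([]byte, error) {
	var cached []byte
	if err := c.db.View(func(tx *bolt.Tx) error {
		if v := tx.Bucket(bucketName).Get(key); v != nil {
			cached = append([]byte(nil), v...) // copy: v is only valid inside the tx
		}
		return nil
	}); err != nil {
		return nil, err
	}
	if cached != nil {
		return cached, nil
	}
	value, err := fetch()
	if err != nil {
		return nil, err
	}
	err = c.db.Update(func(tx *bolt.Tx) error {
		return tx.Bucket(bucketName).Put(key, value)
	})
	return value, err
}
```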
  43. [Distributed read, step 5] Each Storage Gateway returns its result, and the Query API merges all results.
  44. Pinpointing the bottleneck with traces (Grafana Tempo) and profiles (Pyroscope): in the pipeline Download Index -> Decode Index -> Identify byte location -> Download Sample -> Return, tracing and profiling revealed which steps consumed too much time.
  45. The index was too big to download and decode. (Cry icons created by Vectors Market - Flaticon: https://www.flaticon.com/free-icons/cry)
  46. Fix: reduce the index size that has to be downloaded and decoded in the Download Index -> Decode Index -> Identify byte location -> Download Sample -> Return pipeline.
  47. Read performance with 64 Storage Gateway Pods:
     • Comparable to Cassandra: 2 ms at p99 for 4 hours of data, 6-9 s at p99 for 1 month of data
     • 1.9 TB of data cached
  48. Bring Your Own Buckets! The Shard Aggregator and Storage Gateway can target not only the default storage but also User A's or User B's own storage.
  49. Petabyte scale is NOT an issue anymore. Thanks to everyone in the community:
     • Distributed write: LevelDB, Nginx
     • Distributed read: bbolt, Envoy
     • Observability
  50. What can we do for the community?
     • 2021: introduced Loki in our organization
     • 2022: contributed to Loki
     • 2023-2024: success of this project, leveraging knowledge from Loki
     • Future: contribute to the community; we are always seeking opportunities to contribute