Cloud Native Scalability for Internal Developer Platforms

This slide deck is from KubeCon + CloudNativeCon Japan 2025.

Description:

Platform Engineering enables developers to focus on business value-aligned tasks by providing internal developer platforms (IDPs) that automate non-essential tasks. Kubernetes is widely used as a foundation for IDPs thanks to its scalability and flexibility.

However, Kubernetes was designed as a general workload orchestrator, not a platform component. As a result, IDP builders must integrate additional Cloud Native technologies and customizations, which can create scalability bottlenecks. At LY Corporation, his team has developed a Kubernetes-based, multi-tenant IDP running over 140K pods, and they faced such scalability challenges.

In this session, he will discuss scalability bottlenecks faced in the IDP, including observability pipelines and access control. He will also explore scaling strategies for IDPs and how they address real-world scalability issues. By the end of this session, you will gain deeper insights into scalability challenges from a platform builder’s perspective.

hhiroshell

June 17, 2025
Transcript

  1. 2 About Me

    • Working for LY Corporation - An internet company that offers various services, including communication, internet portals, media, and commerce, primarily in Japan. • Contributing to CNCF TAG App Delivery • Author of books on Kubernetes • DIY keyboard enthusiast Hiroshi Hayakawa | @hhiroshell
  2. 3 Agenda 1. Background 2. Scalability Journey in Our Internal

    Developer Platform 3. Beyond the Journey’s End… (More and More Scalability!) 4. Conclusion
  3. 4 Agenda 1. Background 2. Scalability Journey in Our Internal

    Developer Platform 3. Beyond the Journey’s End… (More and More Scalability!) 4. Conclusion
  4. 5 Our IDP: Internal PaaS for Web Applications

    • With just a simple command, an app starts, and an endpoint is exposed • Hosts applications from various projects (multi-tenancy)

    $ lypctl create app hello-world -n mytenant-sandbox --image=example-registry/sample/helloworld-go:latest --port=8080
    $ lypctl get app hello-world -n mytenant-sandbox
    NAME          ENDPOINT                                           READY   REASON   AGE
    hello-world   https://hello-world.mytenant.app.dev.yahoo.co.jp   True             12s
    $ curl https://hello-world.mytenant.app.dev.yahoo.co.jp
    Hello World!
  5. 6 Our IDP: Architecture Overview

    [Diagram: a developer deploys an application to the control plane cluster; the Cluster Scheduler selects a workload cluster and runs the application there. The control plane cluster and the workload clusters together form one logical cluster, which platform engineers build and manage. End-user requests are routed to the correct cluster through name resolution.]
  6. 7 The Scaling History of the IDP

    [Chart: number of applications over 5 years of development, through the Introduction, Growth, and Maturity phases: LA (exclusively available to nominated users), migration from the former platform, then GA (widely available in the organization). Today: 690 tenants, 29,000 applications, 112,000 pods. * Including dev environments]
  7. 8 The Scaling History of the IDP

    [Same chart as the previous slide, annotated with the four questions this talk addresses:] How can we scale Kubernetes clusters? How can we achieve operational scalability? How can we run metrics pipelines stably? How can we ensure controllers handle a huge amount of reconciliations?
  8. 9 Agenda 1. Background 2. Scalability Journey in Our Internal

    Developer Platform 3. Beyond the Journey’s End… (More and More Scalability!) 4. Conclusion
  9. 10 The Scaling History of the IDP

    [Same scaling-history chart, highlighting the first question:] How can we scale Kubernetes clusters?
  10. 11 Single Huge Cluster vs. Multiple Clusters

    [Diagram: on one side, a single huge cluster packing many application pods and system component pods together; on the other, a Cluster Scheduler distributing developers' applications across multiple smaller clusters, each with its own system component pods.]
  11. 12 Single Huge Cluster vs. Multiple Clusters

    [Comparison table contrasting Single Huge Cluster and Multiple Clusters across six criteria: Resource Efficiency, Management Cost, Less Extra Care for Cluster Add-ons, Native Kubernetes Experience, Workload Isolation, and Pod-to-Pod Networking.]
  12. 13 Single Huge Cluster vs. Multiple Clusters

    [Same comparison table as the previous slide.]
  13. 14 Single Huge Cluster vs. Multiple Clusters

    [Same comparison table, with a callout:] But we already have enough experience in managing multiple clusters.
  14. 15 Single Huge Cluster vs. Multiple Clusters

    [Same comparison table as the previous slide.]
  15. 16 Single Huge Cluster vs. Multiple Clusters

    [Same comparison table, with a callout:] We can expect predictable and safe scaling.
  16. 17 Single Huge Cluster vs. Multiple Clusters

    [Same comparison table, with notes on the Multiple Clusters column: Native Kubernetes Experience (operate Kubernetes resources through the control plane cluster's resources); Pod-to-Pod Networking (additional solutions are needed for networking across workload clusters).]
  17. 18 Single Huge Cluster vs. Multiple Clusters

    [Same comparison table and notes as the previous slide, with a callout:] We don’t have these requirements for the PaaS.
  18. 19 Distribution of Resource Consumption per Application

    • Consumption polarizes into a few massive applications and countless tiny ones. [Chart: the largest applications consume 2,500 cores / 2.5 TB of memory, 640 cores / 100 GB, and 360 cores / 128 GB, followed by a long tail of tiny applications.]
  19. 20 Distribution of Resource Consumption per Application

    • Consumption polarizes into a few massive applications and countless tiny ones. [Same chart as the previous slide.]
  20. 21 Scheduling Strategies to Avoid Noisy Neighbors

    • Isolate massive applications into dedicated clusters. [Diagram: the Cluster Scheduler places most tenants on workload clusters shared by multiple tenants, and routes massive applications to workload clusters dedicated to specific tenants.] (A placement-policy sketch follows below.)
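
The cluster scheduler's placement policy is internal to the platform, but the silo-vs-pool decision the slide describes can be sketched. The following Go snippet is a minimal illustration: the type names, thresholds, and first-fit selection are all assumptions, not the platform's actual code.

```go
// Hypothetical silo-vs-pool placement policy. All names and numbers
// here are illustrative assumptions, not the platform's actual logic.
package scheduler

// AppRequest summarizes an application's total resource request.
type AppRequest struct {
	Tenant   string
	CPUCores int
	MemGiB   int
}

// Cluster is a workload cluster; DedicatedTenant is empty for shared
// (pool) clusters and set for dedicated (silo) clusters.
type Cluster struct {
	Name            string
	DedicatedTenant string
}

// Assumed policy: applications above these thresholds are isolated in
// a cluster dedicated to their tenant (silo); the rest share clusters
// with other tenants (pool).
const (
	siloCPUThreshold = 500 // cores
	siloMemThreshold = 512 // GiB
)

// Place returns the first suitable cluster, or nil if none fits.
// A real scheduler would score candidates and might provision one.
func Place(app AppRequest, clusters []Cluster) *Cluster {
	wantSilo := app.CPUCores >= siloCPUThreshold || app.MemGiB >= siloMemThreshold
	for i := range clusters {
		c := &clusters[i]
		if wantSilo && c.DedicatedTenant == app.Tenant {
			return c
		}
		if !wantSilo && c.DedicatedTenant == "" {
			return c
		}
	}
	return nil
}
```
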
  21. 22 Pool, Silo, and Tenant Context

    • Pool: resources shared across multiple tenants; tenants must be isolated at an upper layer.
    • Silo: resources dedicated to a single tenant; the underlying resource provides a physical boundary from other tenants, so the tenant is essentially identified with that resource.
    • Tenant Context: information that identifies the tenant when a workload runs or is operated on, represented as tokens or other elements.
    https://www.oreilly.com/library/view/building-multi-tenant-saas/9781098140632/
  22. 23 Lessons Learned

    ✓ Single Huge Cluster and Multiple Clusters each have their pros and cons. - The decision should be based on the requirements and skill set of the platform team. - Both are well-proven approaches. ✓ The isolation strategy should align with the resource characteristics of the hosted applications. - In a multiple-clusters model, a mixed silo-and-pool strategy can prevent the noisy neighbor problem.
  23. 24 The Scaling History of the IDP

    [Same scaling-history chart, highlighting the second question:] How can we achieve operational scalability?
  24. 25 Initial Onboarding Flow

    [Diagram: a developer requests a new namespace through the Ticketing System. A platform engineer runs a custom script that creates the namespace definition in Git; CD applies the namespace to the control plane cluster (etcd); user roles are registered in the Authorization System.]
  25. 26 Initial Onboarding Flow

    [Same onboarding-flow diagram, advancing one step of the build.]
  26. 27 Initial Onboarding Flow

    [Same onboarding-flow diagram, advancing one step of the build.]
  27. 28 Initial Onboarding Flow

    [Same onboarding-flow diagram, advancing one step of the build.]
  28. 29 Initial Onboarding Flow

    [Same diagram, completed: with the roles registered, the developer is authorized and deploys applications.]
  29. 30 Self-Service Onboarding Experience 1/2

    [Diagram: the custom script is replaced by a custom controller in the control plane cluster, which creates the namespace and registers roles automatically once the namespace manifest is applied.]
  30. 31 Self-Service Onboarding Experience 2/2

    [Diagram: the Ticketing System is replaced by an Onboarding GUI, removing the platform engineer from the flow. A developer requests a namespace via the GUI; the manifest lands in Git and CD applies it; the custom controller creates the namespace and registers roles; the developer is authorized and deploys applications.] (A controller sketch follows below.)
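
The slides do not show the controller's code, but its reconcile shape can be sketched with controller-runtime. This is a minimal sketch assuming the request name doubles as the namespace to create; the label key and the commented-out authorization call are invented for illustration.

```go
// Minimal sketch of a self-service onboarding reconciler built with
// controller-runtime. The real controller is internal; the label key
// and the authorization step are assumptions.
package onboarding

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type Reconciler struct {
	client.Client
}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Ensure the tenant namespace exists (idempotent create).
	ns := &corev1.Namespace{ObjectMeta: metav1.ObjectMeta{
		Name:   req.Name,
		Labels: map[string]string{"platform.example.com/tenant": req.Name},
	}}
	if err := r.Create(ctx, ns); err != nil && !apierrors.IsAlreadyExists(err) {
		return ctrl.Result{}, err
	}

	// Register roles in the external authorization system. The real
	// integration is not public; a hypothetical client would go here:
	// if err := authz.RegisterTenantRoles(ctx, req.Name); err != nil {
	//     return ctrl.Result{}, err
	// }

	return ctrl.Result{}, nil
}
```
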
  31. 32 Lessons Learned

    ✓ Understand workflows by dividing them into parts, and gradually replace each part with an automated system. Don't replace everything at once. ✓ Implement the necessary automation at the appropriate time. ※ MVP - not “bike to car” (https://www.linkedin.com/pulse/mvp-bike-car-fred-voorhorst/)
  32. 33 The Scaling History of the IDP

    [Same scaling-history chart, highlighting the third question:] How can we run metrics pipelines stably?
  33. 34 Pipelines for Container Resource Metrics - Before

    • Lack of tenant context in metrics causes backend overflow. [Diagram: a Metrics Agent (DaemonSet) on each Kubernetes node scrapes the kubelet and forwards container metrics through an MQ to per-tenant platform metrics backends (default, Tenant A, Tenant B). Without tenant context, everything lands in the default backend.]
  34. 35 Pipelines for Container Resource Metrics - Before

    • Lack of tenant context in metrics causes backend overflow. [Same diagram, with the agent's lament:] I can’t identify tenants from kubelet metrics…
  35. 36 Pipelines for Container Resource Metrics - After

    • Tenant contexts allow the backend to leverage its multi-tenant capabilities. [Same pipeline diagram, now with the agent attaching tenant context so metrics reach each tenant's own backend:] The new plugin empowered me to do that. (An enrichment sketch follows below.)
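
The slide credits "the new plugin" without naming it, so what follows is only a sketch of the kind of enrichment such a plugin performs, assuming the tenant can be derived from the pod's namespace name (for example, a `mytenant-sandbox` namespace belongs to tenant `mytenant`). The naming convention and types are assumptions.

```go
// Sketch of tenant-context enrichment for kubelet-scraped container
// metrics. The <tenant>-<env> namespace convention is an assumption.
package enrich

import "strings"

// Metric is a simplified container metric; Labels includes
// "namespace" and "pod" as scraped from the kubelet.
type Metric struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// AddTenantContext derives a tenant label so the MQ and the backends
// can route the metric to the correct per-tenant store.
func AddTenantContext(m *Metric) {
	tenant := "default"
	if ns := m.Labels["namespace"]; ns != "" {
		if i := strings.Index(ns, "-"); i > 0 {
			tenant = ns[:i]
		}
	}
	m.Labels["tenant"] = tenant
}
```
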
  36. 37 Pipelines for Application Specific Metrics - Before

    • The inability to dynamically identify tenants hinders the scaling of agents. [Diagram: Metrics Agents deployed as a Deployment scrape application containers across Kubernetes nodes and forward metrics through an MQ to per-tenant platform metrics backends.]
  37. 38 Pipelines for Application Specific Metrics - Before

    • The inability to dynamically identify tenants hinders the scaling of agents. [Same diagram, with a tenant's complaint:] Metrics in my tenant are overwhelming …
  38. 39 Pipelines for Application Specific Metrics - After

    • Dynamic tenant identification allows the agents to scale. [Diagram: the agents now run as a DaemonSet, one per node, and discover local scrape targets and their tenants dynamically.] (A discovery sketch follows below.)
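
How the DaemonSet agents identify tenants dynamically is not spelled out in the slides. One plausible mechanism, sketched below with client-go, is to list the pods on the agent's own node and read the owning tenant from a namespace label; the label key, port, and URL format are assumptions.

```go
// Sketch of per-node scrape-target discovery with dynamic tenant
// identification. Label key and port are illustrative assumptions.
package discovery

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Target is an application metrics endpoint tagged with its tenant.
type Target struct {
	URL    string
	Tenant string
}

// TargetsOnNode lists pods scheduled on this agent's node and tags
// each with the tenant recorded on its namespace. (A real agent would
// use informers and cache namespace lookups instead of per-pod Gets.)
func TargetsOnNode(ctx context.Context, cs kubernetes.Interface, node string) ([]Target, error) {
	pods, err := cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + node,
	})
	if err != nil {
		return nil, err
	}
	var targets []Target
	for _, p := range pods.Items {
		ns, err := cs.CoreV1().Namespaces().Get(ctx, p.Namespace, metav1.GetOptions{})
		if err != nil {
			continue
		}
		targets = append(targets, Target{
			URL:    fmt.Sprintf("http://%s:8080/metrics", p.Status.PodIP),
			Tenant: ns.Labels["platform.example.com/tenant"], // assumed label
		})
	}
	return targets, nil
}
```
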
  39. 40 Lessons Learned ✓ Dynamically handling tenant contexts increases scalability.

    - Allows for leveraging multi-tenant capabilities on integrated components. - Enables seamless scaling of system components. ✓ You should consider how to dynamically identify and handle tenant contexts with the components you are using.
  40. 41 The Scaling History of the IDP

    [Same scaling-history chart, highlighting the fourth question:] How can we ensure controllers handle a huge amount of reconciliations?
  41. 42 Scalability of Custom Controllers

    • To avoid reconciliation conflicts, a custom controller needs to run as a single instance*. [Diagram: a single custom controller on one Kubernetes node reconciling a massive amount of resources in etcd.] *) or run in an active-standby configuration
  42. 43 Scalability of Custom Controllers

    • Scale up the single controller. ✓ Deploy it on a dedicated node ✓ Allocate ample CPU and memory ✓ Tune the controller parameters (a tuning sketch follows below) [Diagram: the single controller reconciling a massive amount of resources in etcd.]
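
For controllers built with controller-runtime, two of the knobs behind "tune the controller parameters" are the client-side rate limits and per-controller concurrency. A minimal sketch with illustrative values; the watched resource and the reconciler are placeholders, not the platform's actual controller.

```go
// Sketch of scale-up tuning for a single controller instance using
// controller-runtime. Numbers are illustrative, not the platform's
// actual settings.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func main() {
	cfg := ctrl.GetConfigOrDie()
	cfg.QPS = 100   // default 5; client-side throttling is an early bottleneck
	cfg.Burst = 200 // default 10

	mgr, err := ctrl.NewManager(cfg, ctrl.Options{})
	if err != nil {
		panic(err)
	}

	// Placeholder reconciler; the real reconcile logic goes here.
	noop := reconcile.Func(func(_ context.Context, _ reconcile.Request) (reconcile.Result, error) {
		return reconcile.Result{}, nil
	})

	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.ConfigMap{}). // illustrative resource
		WithOptions(controller.Options{
			MaxConcurrentReconciles: 16, // parallel workers within the one instance
		}).
		Complete(noop); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```
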
  43. 44 Lessons Learned?

    ✓ Scaling out custom controllers is challenging. ✓ Is a brute-force scale-up really the only method available?
  44. 45 Agenda 1. Background 2. Scalability Journey in Our Internal

    Developer Platform 3. Beyond the Journey’s End… (More and More Scalability!) 4. Conclusion
  45. 46 Beyond the journey’s end…

    [Same architecture-overview diagram as before: developers deploy applications, the Cluster Scheduler selects workload clusters, platform engineers build and manage the logical cluster, and end-user requests are routed through name resolution.] Is this scalable enough?
  46. 47 3 Scalability Bottlenecks in the Control Plane Cluster

    • We are now planning to solve each problem! [Diagram: three bottlenecks around the control plane cluster: the data volume limit of etcd, kube-apiserver's memory consumption on list requests, and custom controllers that cannot scale out.]
  47. 48 Streaming List Responses

    • Reduces kube-apiserver’s memory consumption significantly for list requests by returning responses as a stream - Introduced as a default-on beta in Kubernetes v1.33 [Diagram: a client's list request answered as a stream from kube-apiserver backed by etcd.] *) https://kubernetes.io/blog/2025/05/09/kubernetes-v1-33-streaming-list-responses/
  48. 49 Distributed Resources with the Aggregation Layer

    • Distribute the resources that developers reference, currently placed in the control plane cluster, across the workload clusters • Make them accessible through the aggregation layer [Diagram: a developer queries the control plane cluster, and the aggregation layer fans requests out to etcd in each workload cluster.] (A registration sketch follows below.)
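
The aggregation layer itself is standard Kubernetes: an APIService object tells kube-apiserver to proxy an API group to an extension apiserver, which in this design would fan requests out to the workload clusters. A sketch of the registration follows; the group name, namespace, and service are hypothetical, and only the APIService mechanism is standard.

```go
// Sketch of registering an aggregated API. Only APIService itself is
// standard Kubernetes; the group, service, and namespace are
// hypothetical stand-ins for the platform's own extension apiserver.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/rest"
	apiregv1 "k8s.io/kube-aggregator/pkg/apis/apiregistration/v1"
	aggregator "k8s.io/kube-aggregator/pkg/client/clientset_generated/clientset"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := aggregator.NewForConfigOrDie(cfg)

	svc := &apiregv1.APIService{
		ObjectMeta: metav1.ObjectMeta{Name: "v1alpha1.apps.platform.example.com"},
		Spec: apiregv1.APIServiceSpec{
			Group:   "apps.platform.example.com", // hypothetical group
			Version: "v1alpha1",
			Service: &apiregv1.ServiceReference{
				Namespace: "platform-system",
				Name:      "app-aggregator", // extension apiserver fanning out to workload clusters
			},
			GroupPriorityMinimum:  1000,
			VersionPriority:       15,
			InsecureSkipTLSVerify: true, // sketch only; provide a CABundle in real use
		},
	}
	if _, err := cs.ApiregistrationV1().APIServices().Create(
		context.Background(), svc, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```
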
  49. 50 Scale Out Custom Controllers with Sharding

    • Allows controllers to scale out by distributing the resources to be reconciled among multiple controller instances and coordinating them. • Utilizes an OSS project called Kubernetes Controller Sharding. [Diagram: several custom controllers, each reconciling its assigned share of the resources in etcd.] (A sharding sketch follows below.) *) https://github.com/timebertt/kubernetes-controller-sharding
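
Kubernetes Controller Sharding assigns objects to shards via labels, so each controller instance only watches its own slice. The sketch below shows the instance-side filtering idea using a controller-runtime cache selector; the label key and environment variable are illustrative, not the project's actual interface.

```go
// Sketch of the instance-side half of label-based sharding: restrict
// this controller's cache to objects assigned to its shard. The label
// key is illustrative; see the kubernetes-controller-sharding project
// for the real assignment protocol.
package main

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	shard := os.Getenv("SHARD_NAME") // e.g. this instance's pod name

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			ByObject: map[client.Object]cache.ByObject{
				&corev1.ConfigMap{}: { // illustrative resource
					Label: labels.SelectorFromSet(labels.Set{
						"sharding.example.com/shard": shard, // assumed label key
					}),
				},
			},
		},
	})
	if err != nil {
		panic(err)
	}

	// Controllers built on this manager now reconcile only the
	// objects the sharder has assigned to this shard.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```
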
  50. 51 Agenda 1. Background 2. Scalability Journey in Our Internal

    Developer Platform 3. Beyond the Journey’s End… (More and More Scalability!) 4. Conclusion
  51. 52 Lessons & Key Takeaways • Sorry! I've introduced too

    many individual takeaways. Please check each "Lessons Learned" page. - P23, P32, P40, P44 • “The journey is the reward. Not the destination.”
  52. 53 Cloud Native Platform Engineering

    • Cloud Native technologies serve as crucial building blocks in creating IDPs - A variety of middleware and a robust OSS ecosystem centered around Kubernetes [Image: logos of CNCF graduated projects]