Cloud Native Scalability for Internal Developer Platforms

This slide deck is from KubeCon + CloudNativeCon Japan 2025.

Description:

Platform Engineering enables developers to focus on business value-aligned tasks by providing internal developer platforms (IDPs) that automate non-essential tasks. Kubernetes is widely used as a foundation for IDPs thanks to its scalability and flexibility.

However, Kubernetes was designed as a general workload orchestrator, not a platform component. As a result, IDP builders must integrate additional Cloud Native technologies and customizations, which can create scalability bottlenecks. At LY Corporation, his team has developed a Kubernetes-based, multi-tenant IDP running over 140K pods, and they faced such scalability challenges.

In this session, he will discuss scalability bottlenecks faced in the IDP, including observability pipelines and access control. He will also explore scaling strategies for IDPs and how they address real-world scalability issues. By the end of this session, you will gain deeper insights into scalability challenges from a platform builder’s perspective.

hhiroshell

June 17, 2025
Transcript

  1. 2 About Me

    • Working for LY Corporation - An internet company that offers various services, including communication, internet portals, media, and commerce, primarily in Japan. • Contributing to CNCF TAG App Delivery • Author of books on Kubernetes • DIY keyboard enthusiast Hiroshi Hayakawa | @hhiroshell
  2. 3 Agenda 1. Background 2. Scalability Journey in Our Internal

    Developer Platform 3. Beyond the Journey’s End… (More and More Scalability!) 4. Conclusion
  3. 4 Agenda 1. Background 2. Scalability Journey in Our Internal

    Developer Platform 3. Beyond the Journey’s End… (More and More Scalability!) 4. Conclusion
  4. 5 Our IDP: Internal PaaS for Web Applications

    • With just a simple command, an app starts, and an endpoint is exposed • Hosts applications from various projects (multi-tenancy)

    $ lypctl create app hello-world -n mytenant-sandbox --image=example-registry/sample/helloworld-go:latest --port=8080
    $ lypctl get app hello-world -n mytenant-sandbox
    NAME          ENDPOINT                                           READY   REASON   AGE
    hello-world   https://hello-world.mytenant.app.dev.yahoo.co.jp   True             12s
    $ curl https://hello-world.mytenant.app.dev.yahoo.co.jp
    Hello World!
  5. 6 Our IDP: Architecture Overview

    [Diagram: a developer deploys an application to the control plane cluster; the Cluster Scheduler selects a workload cluster and runs the application there. The control plane cluster and the workload clusters together form one logical cluster, which platform engineers build and manage. End-user requests are routed to the correct cluster through name resolution.]
  6. 7 The Scaling History of the IDP

    [Chart: number of applications over 5 years of development, through the Introduction, Growth, and Maturity phases: LA (exclusively available to nominated users), migration from the former platform, then GA (widely available in the organization). Today: 690 tenants, 29,000 applications, 112,000 pods. * Including dev environments]
  7. 8 The Scaling History of the IDP

    [Same chart as the previous slide, annotated with the four questions this talk addresses:] How can we scale Kubernetes clusters? How can we achieve operational scalability? How can we run metrics pipelines stably? How can we ensure controllers handle a huge amount of reconciliations?
  8. 9 Agenda 1. Background 2. Scalability Journey in Our Internal

    Developer Platform 3. Beyond the Journey’s End… (More and More Scalability!) 4. Conclusion
  9. 10 The Scaling History of the IDP

    [Same scaling-history chart, highlighting the first question:] How can we scale Kubernetes clusters?
  10. 11 Single Huge Cluster vs. Multiple Clusters

    [Diagram: on one side, a single huge cluster packing many application pods and system component pods together; on the other, a Cluster Scheduler distributing developers' applications across multiple smaller clusters, each with its own system component pods.]
  11. 12 Single Huge Cluster vs. Multiple Clusters

    [Comparison table contrasting Single Huge Cluster and Multiple Clusters across six criteria: Resource Efficiency, Management Cost, Less Extra Care for Cluster Add-ons, Native Kubernetes Experience, Workload Isolation, and Pod-to-Pod Networking.]
  12. 13 Single Huge Cluster vs. Multiple Clusters

    [Same comparison table as the previous slide.]
  13. 14 Single Huge Cluster vs. Multiple Clusters

    [Same comparison table, with a callout:] But we already have enough experience in managing multiple clusters.
  14. 15 Single Huge Cluster vs. Multiple Clusters

    [Same comparison table as the previous slide.]
  15. 16 Single Huge Cluster vs. Multiple Clusters

    [Same comparison table, with a callout:] We can expect predictable and safe scaling.
  16. 17 Single Huge Cluster vs. Multiple Clusters

    [Same comparison table, with notes on the Multiple Clusters column: Native Kubernetes Experience (operate Kubernetes resources through the control plane cluster's resources); Pod-to-Pod Networking (additional solutions are needed for networking across workload clusters).]
  17. 18 Single Huge Cluster vs. Multiple Clusters

    [Same comparison table and notes as the previous slide, with a callout:] We don’t have these requirements for the PaaS.
  18. 19 Distribution of Resource Consumption per Application

    • Consumption polarizes into a few massive applications and countless tiny ones. [Chart: the largest applications consume 2,500 cores / 2.5 TB of memory, 640 cores / 100 GB, and 360 cores / 128 GB, followed by a long tail of tiny applications.]
  19. 20 Distribution of Resource Consumption per Application

    • Consumption polarizes into a few massive applications and countless tiny ones. [Same chart as the previous slide.]
  20. 21 Scheduling Strategies to Avoid Noisy Neighbors

    • Isolate massive applications into dedicated clusters. [Diagram: the Cluster Scheduler places most tenants on workload clusters shared by multiple tenants, and routes massive applications to workload clusters dedicated to specific tenants.] (A placement-policy sketch follows below.)
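
The cluster scheduler's placement policy is internal to the platform, but the silo-vs-pool decision the slide describes can be sketched. The following Go snippet is a minimal illustration: the type names, thresholds, and first-fit selection are all assumptions, not the platform's actual code.

```go
// Hypothetical silo-vs-pool placement policy. All names and numbers
// here are illustrative assumptions, not the platform's actual logic.
package scheduler

// AppRequest summarizes an application's total resource request.
type AppRequest struct {
	Tenant   string
	CPUCores int
	MemGiB   int
}

// Cluster is a workload cluster; DedicatedTenant is empty for shared
// (pool) clusters and set for dedicated (silo) clusters.
type Cluster struct {
	Name            string
	DedicatedTenant string
}

// Assumed policy: applications above these thresholds are isolated in
// a cluster dedicated to their tenant (silo); the rest share clusters
// with other tenants (pool).
const (
	siloCPUThreshold = 500 // cores
	siloMemThreshold = 512 // GiB
)

// Place returns the first suitable cluster, or nil if none fits.
// A real scheduler would score candidates and might provision one.
func Place(app AppRequest, clusters []Cluster) *Cluster {
	wantSilo := app.CPUCores >= siloCPUThreshold || app.MemGiB >= siloMemThreshold
	for i := range clusters {
		c := &clusters[i]
		if wantSilo && c.DedicatedTenant == app.Tenant {
			return c
		}
		if !wantSilo && c.DedicatedTenant == "" {
			return c
		}
	}
	return nil
}
```
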
  21. 22 Pool, Silo, and Tenant Context

    • Pool: resources shared across multiple tenants; tenants must be isolated at an upper layer.
    • Silo: resources dedicated to a single tenant; the underlying resource provides a physical boundary from other tenants, so the tenant is essentially identified with that resource.
    • Tenant Context: information that identifies the tenant when a workload runs or is operated on, represented as tokens or other elements.
    https://www.oreilly.com/library/view/building-multi-tenant-saas/9781098140632/
  22. 23 Lessons Learned

    ✓ Single Huge Cluster and Multiple Clusters each have their pros and cons. - The decision should be based on the requirements and skill set of the platform team. - Both are well-proven approaches. ✓ The isolation strategy should align with the resource characteristics of the hosted applications. - In a multiple-clusters model, a mixed silo-and-pool strategy can prevent the noisy neighbor problem.
  23. 24 The Scaling History of the IDP

    [Same scaling-history chart, highlighting the second question:] How can we achieve operational scalability?
  24. 25 Initial Onboarding Flow

    [Diagram: a developer requests a new namespace through the Ticketing System. A platform engineer runs a custom script that creates the namespace definition in Git; CD applies the namespace to the control plane cluster (etcd); user roles are registered in the Authorization System.]
  25. 26 Initial Onboarding Flow

    [Same onboarding-flow diagram, advancing one step of the build.]
  26. 27 Initial Onboarding Flow

    [Same onboarding-flow diagram, advancing one step of the build.]
  27. 28 Initial Onboarding Flow

    [Same onboarding-flow diagram, advancing one step of the build.]
  28. 29 Initial Onboarding Flow

    [Same diagram, completed: with the roles registered, the developer is authorized and deploys applications.]
  29. 30 Self-Service Onboarding Experience 1/2

    [Diagram: the custom script is replaced by a custom controller in the control plane cluster, which creates the namespace and registers roles automatically once the namespace manifest is applied.]
  30. 31 Self-Service Onboarding Experience 2/2

    [Diagram: the Ticketing System is replaced by an Onboarding GUI, removing the platform engineer from the flow. A developer requests a namespace via the GUI; the manifest lands in Git and CD applies it; the custom controller creates the namespace and registers roles; the developer is authorized and deploys applications.] (A controller sketch follows below.)
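
The slides do not show the controller's code, but its reconcile shape can be sketched with controller-runtime. This is a minimal sketch assuming the request name doubles as the namespace to create; the label key and the commented-out authorization call are invented for illustration.

```go
// Minimal sketch of a self-service onboarding reconciler built with
// controller-runtime. The real controller is internal; the label key
// and the authorization step are assumptions.
package onboarding

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type Reconciler struct {
	client.Client
}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Ensure the tenant namespace exists (idempotent create).
	ns := &corev1.Namespace{ObjectMeta: metav1.ObjectMeta{
		Name:   req.Name,
		Labels: map[string]string{"platform.example.com/tenant": req.Name},
	}}
	if err := r.Create(ctx, ns); err != nil && !apierrors.IsAlreadyExists(err) {
		return ctrl.Result{}, err
	}

	// Register roles in the external authorization system. The real
	// integration is not public; a hypothetical client would go here:
	// if err := authz.RegisterTenantRoles(ctx, req.Name); err != nil {
	//     return ctrl.Result{}, err
	// }

	return ctrl.Result{}, nil
}
```
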
  31. 32 Lessons Learned

    ✓ Understand workflows by dividing them into parts, and gradually replace each part with an automated system. Don't replace everything at once. ✓ Implement the necessary automation at the appropriate time. ※ MVP - not “bike to car” (https://www.linkedin.com/pulse/mvp-bike-car-fred-voorhorst/)
  32. 33 The Scaling History of the IDP

    [Same scaling-history chart, highlighting the third question:] How can we run metrics pipelines stably?
  33. 34 Pipelines for Container Resource Metrics - Before

    • Lack of tenant context in metrics causes backend overflow. [Diagram: a Metrics Agent (DaemonSet) on each Kubernetes node scrapes the kubelet and forwards container metrics through an MQ to per-tenant platform metrics backends (default, Tenant A, Tenant B). Without tenant context, everything lands in the default backend.]
  34. 35 Pipelines for Container Resource Metrics - Before

    • Lack of tenant context in metrics causes backend overflow. [Same diagram, with the agent's lament:] I can’t identify tenants from kubelet metrics…
  35. 36 Pipelines for Container Resource Metrics - After

    • Tenant contexts allow the backend to leverage its multi-tenant capabilities. [Same pipeline diagram, now with the agent attaching tenant context so metrics reach each tenant's own backend:] The new plugin empowered me to do that. (An enrichment sketch follows below.)
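
The slide credits "the new plugin" without naming it, so what follows is only a sketch of the kind of enrichment such a plugin performs, assuming the tenant can be derived from the pod's namespace name (for example, a `mytenant-sandbox` namespace belongs to tenant `mytenant`). The naming convention and types are assumptions.

```go
// Sketch of tenant-context enrichment for kubelet-scraped container
// metrics. The <tenant>-<env> namespace convention is an assumption.
package enrich

import "strings"

// Metric is a simplified container metric; Labels includes
// "namespace" and "pod" as scraped from the kubelet.
type Metric struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// AddTenantContext derives a tenant label so the MQ and the backends
// can route the metric to the correct per-tenant store.
func AddTenantContext(m *Metric) {
	tenant := "default"
	if ns := m.Labels["namespace"]; ns != "" {
		if i := strings.Index(ns, "-"); i > 0 {
			tenant = ns[:i]
		}
	}
	m.Labels["tenant"] = tenant
}
```
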
  36. 37 Pipelines for Application Specific Metrics - Before

    • The inability to dynamically identify tenants hinders the scaling of agents. [Diagram: Metrics Agents deployed as a Deployment scrape application containers across Kubernetes nodes and forward metrics through an MQ to per-tenant platform metrics backends.]
  37. 38 Pipelines for Application Specific Metrics - Before

    • The inability to dynamically identify tenants hinders the scaling of agents. [Same diagram, with a tenant's complaint:] Metrics in my tenant are overwhelming …
  38. 39 Pipelines for Application Specific Metrics - After

    • Dynamic tenant identification allows the agents to scale. [Diagram: the agents now run as a DaemonSet, one per node, and discover local scrape targets and their tenants dynamically.] (A discovery sketch follows below.)
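
How the DaemonSet agents identify tenants dynamically is not spelled out in the slides. One plausible mechanism, sketched below with client-go, is to list the pods on the agent's own node and read the owning tenant from a namespace label; the label key, port, and URL format are assumptions.

```go
// Sketch of per-node scrape-target discovery with dynamic tenant
// identification. Label key and port are illustrative assumptions.
package discovery

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Target is an application metrics endpoint tagged with its tenant.
type Target struct {
	URL    string
	Tenant string
}

// TargetsOnNode lists pods scheduled on this agent's node and tags
// each with the tenant recorded on its namespace. (A real agent would
// use informers and cache namespace lookups instead of per-pod Gets.)
func TargetsOnNode(ctx context.Context, cs kubernetes.Interface, node string) ([]Target, error) {
	pods, err := cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + node,
	})
	if err != nil {
		return nil, err
	}
	var targets []Target
	for _, p := range pods.Items {
		ns, err := cs.CoreV1().Namespaces().Get(ctx, p.Namespace, metav1.GetOptions{})
		if err != nil {
			continue
		}
		targets = append(targets, Target{
			URL:    fmt.Sprintf("http://%s:8080/metrics", p.Status.PodIP),
			Tenant: ns.Labels["platform.example.com/tenant"], // assumed label
		})
	}
	return targets, nil
}
```
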
  39. 40 Lessons Learned ✓ Dynamically handling tenant contexts increases scalability.

    - Allows for leveraging multi-tenant capabilities on integrated components. - Enables seamless scaling of system components. ✓ You should consider how to dynamically identify and handle tenant contexts with the components you are using.
  40. 41 The Scaling History of the IDP

    [Same scaling-history chart, highlighting the fourth question:] How can we ensure controllers handle a huge amount of reconciliations?
  41. 42 Scalability of Custom Controllers

    • To avoid reconciliation conflicts, a custom controller needs to run as a single instance*. [Diagram: a single custom controller on one Kubernetes node reconciling a massive amount of resources in etcd.] *) or run in an active-standby configuration
  42. 43 Scalability of Custom Controllers

    • Scale up the single controller. ✓ Deploy it on a dedicated node ✓ Allocate ample CPU and memory ✓ Tune the controller parameters (a tuning sketch follows below) [Diagram: the single controller reconciling a massive amount of resources in etcd.]
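
For controllers built with controller-runtime, two of the knobs behind "tune the controller parameters" are the client-side rate limits and per-controller concurrency. A minimal sketch with illustrative values; the watched resource and the reconciler are placeholders, not the platform's actual controller.

```go
// Sketch of scale-up tuning for a single controller instance using
// controller-runtime. Numbers are illustrative, not the platform's
// actual settings.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func main() {
	cfg := ctrl.GetConfigOrDie()
	cfg.QPS = 100   // default 5; client-side throttling is an early bottleneck
	cfg.Burst = 200 // default 10

	mgr, err := ctrl.NewManager(cfg, ctrl.Options{})
	if err != nil {
		panic(err)
	}

	// Placeholder reconciler; the real reconcile logic goes here.
	noop := reconcile.Func(func(_ context.Context, _ reconcile.Request) (reconcile.Result, error) {
		return reconcile.Result{}, nil
	})

	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.ConfigMap{}). // illustrative resource
		WithOptions(controller.Options{
			MaxConcurrentReconciles: 16, // parallel workers within the one instance
		}).
		Complete(noop); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```
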
  43. 44 Lessons Learned?

    ✓ Scaling out custom controllers is challenging. ✓ Is a brute-force scale-up really the only method available?
  44. 45 Agenda 1. Background 2. Scalability Journey in Our Internal

    Developer Platform 3. Beyond the Journey’s End… (More and More Scalability!) 4. Conclusion
  45. 46 Beyond the journey’s end…

    [Same architecture-overview diagram as before: developers deploy applications, the Cluster Scheduler selects workload clusters, platform engineers build and manage the logical cluster, and end-user requests are routed through name resolution.] Is this scalable enough?
  46. 47 3 Scalability Bottlenecks in the Control Plane Cluster

    • We are now planning to solve each problem! [Diagram: three bottlenecks around the control plane cluster: the data volume limit of etcd, kube-apiserver's memory consumption on list requests, and custom controllers that cannot scale out.]
  47. 48 Streaming List Responses

    • Reduces kube-apiserver’s memory consumption significantly for list requests by returning responses as a stream - Introduced as a default-on beta in Kubernetes v1.33 [Diagram: a client's list request answered as a stream from kube-apiserver backed by etcd.] *) https://kubernetes.io/blog/2025/05/09/kubernetes-v1-33-streaming-list-responses/
  48. 49 Distributed Resources with the Aggregation Layer

    • Distribute the resources that developers reference, currently placed in the control plane cluster, across the workload clusters • Make them accessible through the aggregation layer [Diagram: a developer queries the control plane cluster, and the aggregation layer fans requests out to etcd in each workload cluster.] (A registration sketch follows below.)
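
The aggregation layer itself is standard Kubernetes: an APIService object tells kube-apiserver to proxy an API group to an extension apiserver, which in this design would fan requests out to the workload clusters. A sketch of the registration follows; the group name, namespace, and service are hypothetical, and only the APIService mechanism is standard.

```go
// Sketch of registering an aggregated API. Only APIService itself is
// standard Kubernetes; the group, service, and namespace are
// hypothetical stand-ins for the platform's own extension apiserver.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/rest"
	apiregv1 "k8s.io/kube-aggregator/pkg/apis/apiregistration/v1"
	aggregator "k8s.io/kube-aggregator/pkg/client/clientset_generated/clientset"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := aggregator.NewForConfigOrDie(cfg)

	svc := &apiregv1.APIService{
		ObjectMeta: metav1.ObjectMeta{Name: "v1alpha1.apps.platform.example.com"},
		Spec: apiregv1.APIServiceSpec{
			Group:   "apps.platform.example.com", // hypothetical group
			Version: "v1alpha1",
			Service: &apiregv1.ServiceReference{
				Namespace: "platform-system",
				Name:      "app-aggregator", // extension apiserver fanning out to workload clusters
			},
			GroupPriorityMinimum:  1000,
			VersionPriority:       15,
			InsecureSkipTLSVerify: true, // sketch only; provide a CABundle in real use
		},
	}
	if _, err := cs.ApiregistrationV1().APIServices().Create(
		context.Background(), svc, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```
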
  49. 50 Scale Out Custom Controllers with Sharding

    • Allows controllers to scale out by distributing the resources to be reconciled among multiple controller instances and coordinating them. • Utilizes an OSS project called Kubernetes Controller Sharding. [Diagram: several custom controllers, each reconciling its assigned share of the resources in etcd.] (A sharding sketch follows below.) *) https://github.com/timebertt/kubernetes-controller-sharding
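
Kubernetes Controller Sharding assigns objects to shards via labels, so each controller instance only watches its own slice. The sketch below shows the instance-side filtering idea using a controller-runtime cache selector; the label key and environment variable are illustrative, not the project's actual interface.

```go
// Sketch of the instance-side half of label-based sharding: restrict
// this controller's cache to objects assigned to its shard. The label
// key is illustrative; see the kubernetes-controller-sharding project
// for the real assignment protocol.
package main

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	shard := os.Getenv("SHARD_NAME") // e.g. this instance's pod name

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			ByObject: map[client.Object]cache.ByObject{
				&corev1.ConfigMap{}: { // illustrative resource
					Label: labels.SelectorFromSet(labels.Set{
						"sharding.example.com/shard": shard, // assumed label key
					}),
				},
			},
		},
	})
	if err != nil {
		panic(err)
	}

	// Controllers built on this manager now reconcile only the
	// objects the sharder has assigned to this shard.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```
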
  50. 51 Agenda 1. Background 2. Scalability Journey in Our Internal

    Developer Platform 3. Beyond the Journey’s End… (More and More Scalability!) 4. Conclusion
  51. 52 Lessons & Key Takeaways • Sorry! I've introduced too

    many individual takeaways. Please check each "Lessons Learned" page. - P23, P32, P40, P44 • “The journey is the reward. Not the destination.”
  52. 53 Cloud Native Platform Engineering

    • Cloud Native technologies serve as crucial building blocks in creating IDPs - A variety of middleware and a robust OSS ecosystem centered around Kubernetes [Image: logos of CNCF graduated projects]