Use Cases
…and durability tied to node and storage
• Application downtime if node or storage fails
• Data loss if disk fails
• Application placement not flexible
• Need more resource planning and reservations
Specialized use cases only
• Distributed filesystems and datastores
• Fault tolerant to node failures by maintaining data replicas
• Ex: Cassandra, MongoDB, GlusterFS, Ceph, etc.
• Caching
• Persistence for avoiding cold restarts
Current Challenges
…constraints
• Manually schedule individual pods, or build a custom operator/scheduler
Hostpath volumes have many downsides (sketched below)
• Unmanaged volume lifecycle
• Possible path collisions from multiple pods
• Too many privileges: can specify any path on the system
• Hostpath may be disabled by administrators
• Spec not portable across clusters/environments
• Must build a custom volume manager to allocate and reserve disks
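For illustration, a minimal sketch (Go, using the k8s.io/api types) of a pod that mounts a hostPath volume; the pod name, image, and path are hypothetical. Because the node path is baked into the pod spec, nothing manages the directory's lifecycle, prevents another pod from pointing at the same path, or keeps the spec portable across clusters:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Hypothetical pod using a hostPath volume: the author can name any
	// path on the host, and the spec only works where that path exists.
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "hostpath-example"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:         "app",
				Image:        "nginx",
				VolumeMounts: []corev1.VolumeMount{{Name: "data", MountPath: "/data"}},
			}},
			Volumes: []corev1.Volume{{
				Name: "data",
				VolumeSource: corev1.VolumeSource{
					// Unmanaged node-local directory; another pod can claim
					// the same path and collide with this one.
					HostPath: &corev1.HostPathVolumeSource{Path: "/mnt/disks/ssd1"},
				},
			}},
		},
	}
	fmt.Println(pod.Name, "mounts host path", pod.Spec.Volumes[0].HostPath.Path)
}
```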
Impact to Users
…workloads
• Need to build custom infrastructure to manage disks and schedule applications
• Many users choose not to run stateful workloads in Kubernetes
• Cannot use Kubernetes to manage all workloads
Reduced application developer velocity
• More time building supporting infrastructure
• Less time building the application
Cannot leverage Kubernetes features
• Horizontal scaling
• Advanced scheduling
• Portability between clusters, environments, and storage providers
Project Goals
Enables portability across clusters, environments, and storage providers
• Consistent K8s volume management
• Managed lifecycle
• Managed access isolation between pods
• Reduces user privileges
Make scheduler aware of local disk placement constraints
• Lowers barrier to entry: reduces need for custom operators and schedulers
• Consistent application management across environments and storage providers
• Integration with Kubernetes features
• StatefulSets, horizontal scaling, pod affinity and anti-affinity, node taints and tolerations, etc.
1.7 and 1.8 Alpha Recap
• Detailed design: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/local-storage-pv.md
Volume plugin: “local” volume (example sketched below)
• Can only be used as a Persistent Volume
• Volume must already be formatted with a filesystem
• Volume can be a mount point or shared directory
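Below is a sketch of what such a local Persistent Volume looks like when built with the k8s.io/api Go types; the name, capacity, path, and storage class are hypothetical. The node affinity shown here uses the later spec field; in the 1.7/1.8 alpha the same information was carried in a PV annotation:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Hypothetical pre-created local PV. The path must already be formatted
	// and mounted on the node; the plugin does not prepare the disk itself.
	pv := &corev1.PersistentVolume{
		ObjectMeta: metav1.ObjectMeta{Name: "example-local-pv"},
		Spec: corev1.PersistentVolumeSpec{
			Capacity: corev1.ResourceList{
				corev1.ResourceStorage: resource.MustParse("100Gi"),
			},
			AccessModes:                   []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			PersistentVolumeReclaimPolicy: corev1.PersistentVolumeReclaimRetain,
			StorageClassName:              "local-storage",
			PersistentVolumeSource: corev1.PersistentVolumeSource{
				Local: &corev1.LocalVolumeSource{Path: "/mnt/disks/ssd1"},
			},
			// Pins the PV to the node that owns the disk (annotation-based in
			// the alpha releases, a spec field in later releases).
			NodeAffinity: &corev1.VolumeNodeAffinity{
				Required: &corev1.NodeSelector{
					NodeSelectorTerms: []corev1.NodeSelectorTerm{{
						MatchExpressions: []corev1.NodeSelectorRequirement{{
							Key:      "kubernetes.io/hostname",
							Operator: corev1.NodeSelectorOpIn,
							Values:   []string{"node-1"},
						}},
					}},
				},
			},
		},
	}
	fmt.Printf("%s -> %s\n", pv.Name, pv.Spec.Local.Path)
}
```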
1.7 and 1.8 Alpha Recap
PV node affinity (matching sketched below)
• Like a node selector
• Scheduler filters nodes by evaluating the PV’s node affinity against the node’s labels
• Local PVs are reduced to a single node
• Assumes the PV is already bound
External static provisioner for local volumes
• NOT a dynamic provisioner
• Manages the local disk lifecycle for pre-created volumes
• Runs as a DaemonSet on every node
• Discovers local volumes mounted under configurable directories
• Automatically creates, cleans up, and destroys local Persistent Volumes
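A simplified sketch of the filtering step described above: check a PV's required node affinity against a node's labels. This is illustrative Go, not the scheduler's actual helper, and it only handles the In operator:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// nodeMatches is a simplified version of the check the scheduler performs:
// at least one NodeSelectorTerm must be satisfied by the node's labels
// (terms are ORed, requirements within a term are ANDed).
func nodeMatches(sel *corev1.NodeSelector, nodeLabels map[string]string) bool {
	for _, term := range sel.NodeSelectorTerms {
		matched := true
		for _, req := range term.MatchExpressions {
			if req.Operator != corev1.NodeSelectorOpIn {
				matched = false
				break
			}
			ok := false
			for _, v := range req.Values {
				if nodeLabels[req.Key] == v {
					ok = true
					break
				}
			}
			if !ok {
				matched = false
				break
			}
		}
		if matched {
			return true
		}
	}
	return false
}

func main() {
	// A local PV's affinity typically names a single hostname (hypothetical).
	affinity := &corev1.NodeSelector{
		NodeSelectorTerms: []corev1.NodeSelectorTerm{{
			MatchExpressions: []corev1.NodeSelectorRequirement{{
				Key:      "kubernetes.io/hostname",
				Operator: corev1.NodeSelectorOpIn,
				Values:   []string{"node-1"},
			}},
		}},
	}
	fmt.Println(nodeMatches(affinity, map[string]string{"kubernetes.io/hostname": "node-1"})) // true
	fmt.Println(nodeMatches(affinity, map[string]string{"kubernetes.io/hostname": "node-2"})) // false
}
```

Because the affinity names exactly one hostname, only that node passes the filter, which is why local PVs are reduced to a single node.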
1.7 and 1.8 Alpha Recap
• Directory-based and mount-based local volumes
• Back-to-back pod mounts, simultaneous pod mounts
• Local volume lifecycle with static provisioner
• Negative cases with invalid path, invalid nodes
Alpha Limitations
Volume binding doesn’t consider pod scheduling requirements (e.g., CPU, pod affinity, anti-affinity, taints)
Volume binding evaluates one Persistent Volume Claim at a time (see the sketch below)
• Cannot specify multiple local volumes in a single pod spec
• Problem with zonal storage too
External provisioner cannot correctly detect volume capacity for new volumes mounted after the provisioner has started
• Needs mount propagation
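To make the one-claim-at-a-time limitation concrete, a sketch (hypothetical names) of a pod referencing two claims; because each claim is bound independently and before the pod is scheduled, the two claims can end up bound to local PVs on different nodes, leaving the pod unschedulable:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Hypothetical pod with two PVC-backed volumes. Nothing in the binding
	// process considers that both PVs must land on the same node.
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "two-local-volumes"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "app",
				Image: "nginx",
				VolumeMounts: []corev1.VolumeMount{
					{Name: "data", MountPath: "/data"},
					{Name: "logs", MountPath: "/logs"},
				},
			}},
			Volumes: []corev1.Volume{
				{Name: "data", VolumeSource: corev1.VolumeSource{
					PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{ClaimName: "data-claim"},
				}},
				{Name: "logs", VolumeSource: corev1.VolumeSource{
					PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{ClaimName: "logs-claim"},
				}},
			},
		},
	}
	fmt.Println(pod.Name, "references", len(pod.Spec.Volumes), "claims")
}
```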
1.9 Planned Features
Local raw block device support (block PV sketched below)
• Design: https://github.com/kubernetes/community/pull/1140
• Plugin to implement the Block interface
• Static provisioner to discover block devices and create block PVs
• Static provisioner to clean up block devices with configurable cleanup methods
Local provisioner security hardening (@verult)
• Design: https://github.com/kubernetes/community/pull/1105
• Current design gives each provisioner pod PV create/delete permissions for all PVs, including those on other nodes
• New design splits the local provisioner into a single master with PV permissions and DaemonSet workers that only report and operate on their local volumes
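As a sketch of where this is headed, a local PV exposed in block mode, written with the current k8s.io/api Go types (the name and device path are hypothetical; the exact API shape in the 1.9 timeframe may differ from this later form):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Hypothetical local PV handed to pods as a raw block device:
	// volumeMode Block means Kubernetes does not mount a filesystem.
	blockMode := corev1.PersistentVolumeBlock
	pv := &corev1.PersistentVolume{
		ObjectMeta: metav1.ObjectMeta{Name: "example-local-block-pv"},
		Spec: corev1.PersistentVolumeSpec{
			Capacity: corev1.ResourceList{
				corev1.ResourceStorage: resource.MustParse("100Gi"),
			},
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			StorageClassName: "local-block",
			VolumeMode:       &blockMode,
			PersistentVolumeSource: corev1.PersistentVolumeSource{
				Local: &corev1.LocalVolumeSource{Path: "/dev/sdb"},
			},
		},
	}
	fmt.Println(pv.Name, string(*pv.Spec.VolumeMode), pv.Spec.Local.Path)
}
```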
1.9 Planned Features
• For all PVs, not just local volumes
• PVC binding is delayed until a pod is scheduled (StorageClass sketch below)
• Decision of which PV to bind to is moved to the default scheduler
• Scheduler initiates binding (PV prebind) and provisioning
• Binding transaction and PV lifecycle management remain in the PV controller
• Impact on custom schedulers, controllers, operators, etc. that use PVCs
• Deprecation needs to be announced well ahead of time
• Users can pull the new binding library into their implementations
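A sketch of how a StorageClass opts into this delayed-binding behavior, using the current k8s.io/api/storage/v1 Go types (the class name is hypothetical; in the 1.9 timeframe this was an alpha feature behind the VolumeScheduling feature gate, so the field may live in a pre-GA API group there):

```go
package main

import (
	"fmt"

	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// With WaitForFirstConsumer, PVC binding waits until a pod using the
	// claim is scheduled, so the scheduler can pick a PV that also satisfies
	// the pod's other constraints.
	mode := storagev1.VolumeBindingWaitForFirstConsumer
	sc := &storagev1.StorageClass{
		ObjectMeta: metav1.ObjectMeta{Name: "local-storage"},
		// Statically created local PVs have no dynamic provisioner.
		Provisioner:       "kubernetes.io/no-provisioner",
		VolumeBindingMode: &mode,
	}
	fmt.Println(sc.Name, string(*sc.VolumeBindingMode))
}
```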
1.9 Planned Features
• Design issue: scheduler can be bypassed
• Possible solutions need prototyping to determine viability and scope
• A few ugly workarounds possible
• Need to work with the scheduler team to come up with a long-term solution
Future
…the correct node/zone where the Pod wants to be scheduled
• Local volume dynamic provisioning (LVM?)
PV taints and tolerations
• Give up (and lose!) my volume if it’s inaccessible or unhealthy
• Local volume health monitoring
Inline PV
• PV with the lifetime of a Pod
• Not enough root disk capacity for EmptyDir
• Use dedicated local disk for IOPS isolation