Native Computing Foundation • Docker 1.12+ • by Docker Inc. • Compose + Swarm are somewhat legacy, so they are not covered in this talk • Mesos • by Apache Software Foundation • only with Marathon; DC/OS is not covered (the scope of the latter is larger)
and conventions • like a “Spring Framework” in the container ecosystem • Design • master • api-server, scheduler, controller-manager • node • kubelet, kube-proxy • independent binaries • Pros: modular, transparent, manageable • Cons: a little complex to set up (1.4 is much better now) • network & volume plugins • driven by control loops
powered by swarmkit • SwarmKit Design • built-in data store • manager • several components built into one binary • control loop driven • worker • uses a pull model to connect to the manager WARNING: SwarmKit is currently a primitive project; expect this part to change
• Create object in Raft-based memory store • github.com/coreos/etcd/raft for consensus • github.com/hashicorp/go-memdb for in-memory object storage • state, cluster, node, service, task, network … $ docker service create API Store SwarmKit Manager
Task: “start a container” etc • Reconcile loop for Service objects • Control Theory again Orchestrator API Store Orchestrator Service (replica=2) Task Task check if replica=2 or not SwarmKit Manager
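The reconcile loop on this slide can be sketched in a few lines. This is a minimal illustration, not SwarmKit's actual code: the `Service`, `Store`, and `reconcile` names and the task ID format are hypothetical, and the store is a plain dict standing in for the manager's object store.

```python
from dataclasses import dataclass, field
from itertools import count

_task_ids = count(1)  # hypothetical task ID generator


@dataclass
class Service:
    name: str
    replicas: int  # desired state


@dataclass
class Store:
    """Stand-in for the manager's store: service name -> list of running task IDs."""
    tasks: dict = field(default_factory=dict)


def reconcile(service: Service, store: Store) -> list:
    """Compare desired and observed replica counts; return the actions taken."""
    running = store.tasks.setdefault(service.name, [])
    actions = []
    while len(running) < service.replicas:   # too few tasks: create more
        task_id = f"{service.name}.{next(_task_ids)}"
        running.append(task_id)
        actions.append(("start", task_id))
    while len(running) > service.replicas:   # too many tasks: remove the newest
        actions.append(("stop", running.pop()))
    return actions


store = Store()
web = Service("web", replicas=2)
first = reconcile(web, store)    # starts two tasks
second = reconcile(web, store)   # already converged: no actions
web.replicas = 1
third = reconcile(web, store)    # stops one task
```

Running `reconcile` repeatedly is safe because it only acts on the difference between desired and observed state, which is the control-theory idea the slide refers to.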
allocate volumes in the future) • VIP and ports for Service • IP for all endpoints (veth pairs) in the network the task is attached to Orchestrator Dispatcher Scheduler API Store Allocator SwarmKit Manager Network Create
search the heap to find the best node that meets the constraints && has the lightest workload • ReadyFilter, ResourceFilter, ConstraintFilter Orchestrator Dispatcher API Store Scheduler Allocator SwarmKit Manager
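A rough sketch of the filter-then-pick idea on this slide: run each candidate node through a chain of filters, then use a heap to pull out the node with the fewest tasks. The filter names mirror the slide, but the `Node` and `Task` fields, the CPU-only resource model, and the label-equality constraints are assumptions made for illustration, not SwarmKit's real data model.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    ready: bool
    free_cpus: float
    labels: dict = field(default_factory=dict)
    task_count: int = 0  # used here as the "workload" measure (assumption)


@dataclass
class Task:
    cpus: float
    constraints: dict = field(default_factory=dict)  # label -> required value


def ready_filter(node, task):
    return node.ready


def resource_filter(node, task):
    return node.free_cpus >= task.cpus


def constraint_filter(node, task):
    return all(node.labels.get(k) == v for k, v in task.constraints.items())


FILTERS = (ready_filter, resource_filter, constraint_filter)


def pick_node(nodes, task):
    """Return the least-loaded node that passes every filter, or None."""
    heap = [
        (n.task_count, i, n)              # index i breaks ties without comparing Nodes
        for i, n in enumerate(nodes)
        if all(f(n, task) for f in FILTERS)
    ]
    if not heap:
        return None
    heapq.heapify(heap)
    return heap[0][2]                     # heap root = lightest workload


nodes = [
    Node("n1", True, 4, {"disk": "ssd"}, task_count=3),
    Node("n2", True, 2, {"disk": "ssd"}, task_count=1),
    Node("n3", False, 8, {"disk": "ssd"}, task_count=0),   # not ready
    Node("n4", True, 8, {"disk": "hdd"}, task_count=0),    # wrong label
]
chosen = pick_node(nodes, Task(cpus=2, constraints={"disk": "ssd"}))
```

Here `n3` and `n4` are filtered out even though they are idle, so the heap only ranks `n1` and `n2`.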
to run big data jobs • core idea: fine-grained resource sharing • Mesos Design • Master + Slave + ZooKeeper • two level scheduling • scheduler + executor = framework • need to use frameworks like Marathon for orchestration and management • containerizer • multiple container runtime & image support (>=1.0)
Mesos master Mesos slave MPI executor Mesos slave MPI executor task task Resource offer Pick framework to offer resources to *Animate: Operating Systems and Systems Programming Lecture 24 Anthony D. Joseph https://cs162.eecs.berkeley.edu/
Mesos master Mesos slave MPI executor Mesos slave MPI executor task task Pick framework to offer resources to Resource offer Resource offer = list of (node, availableResources) E.g. { (node1, <2 CPUs, 4 GB>), (node2, <3 CPUs, 2 GB>) }
Mesos master Mesos slave MPI executor Hadoop executor Mesos slave MPI executor task task Pick framework to offer resources to task Framework-specific scheduling Resource offer Launches and isolates executors
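The two levels in these diagrams can be sketched as two functions: the master builds a resource offer as a list of (node, availableResources), and the framework's own scheduler decides which tasks to launch on it. This is a simplified illustration under assumptions, not the Mesos API: `make_offer`, `framework_schedule`, and the `Resources` tuple are hypothetical names, and the numbers reuse the example offer from the slide.

```python
from typing import NamedTuple


class Resources(NamedTuple):
    cpus: int
    mem_gb: int


def make_offer(free):
    """Level 1 (master): offer whatever is free on each node."""
    return [(node, res) for node, res in free.items()
            if res.cpus > 0 and res.mem_gb > 0]


def framework_schedule(offer, task_need, pending):
    """Level 2 (framework): greedily place pending tasks onto offered nodes."""
    launches = []
    for node, res in offer:
        cpus, mem = res
        while pending > 0 and cpus >= task_need.cpus and mem >= task_need.mem_gb:
            launches.append(node)          # one entry per task launched on this node
            cpus -= task_need.cpus
            mem -= task_need.mem_gb
            pending -= 1
    return launches


# Example offer from the slide: (node1, <2 CPUs, 4 GB>), (node2, <3 CPUs, 2 GB>)
free = {"node1": Resources(2, 4), "node2": Resources(3, 2)}
offer = make_offer(free)
launches = framework_schedule(offer, Resources(cpus=1, mem_gb=2), pending=3)
```

The master never needs to know what a task is; it only hands out resources, and each framework applies its own placement policy to the offer.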
loops driven (but in single binary) | two level scheduling
(columns: Kubernetes | Docker SwarmKit | Mesos + Marathon)
Coordination: etcd | built-in Raft | ZooKeeper
Container Runtime: multiple | single, but has potential for more OCI runtimes | multiple
Container Image: Docker Image, ACI, more in future | Docker Image | Docker Image, ACI, more in future
Docker Daemon: no need | need | no need
to understand & debug fewer round trips hard to do backup/restore, migration, monitoring/audit easy to do performance tuning lack of mgmt API like: etcd admin guide
to do next throughout the automated workflow” • workload management • secret management • configuration management • scale and autoscaling • stateful workload • … and more
--replicas=5 … • $ docker service scale SERVICE=REPLICAS • $ docker service update [OPTIONS] SERVICE • rolling update • 30+ update options are supported • --container-label-add value • --container-label-rm value • --env-add value • --env-rm value • --image string • …
stored in etcd • consumed by ENV or volume • Docker SwarmKit • under discussion: https://github.com/docker/swarmkit/issues/1329 • Mesos + Marathon • only in DC/OS • stored in ZooKeeper, exposed as ENV in Marathon • Another similar feature is Configuration Management
Metrics: • user-defined endpoint, e.g. http://localhost:9100/metrics • shares the same metric data structure as CNCF projects like Prometheus • Docker SwarmKit • not yet: https://github.com/docker/swarmkit/issues/486#issuecomment-219133613 • Mesos + Marathon • a stand-by `marathon-autoscale.py` • autoscales applications based on the utilization metrics from Mesos
Balance • Docker SwarmKit • Load Balancer • ipvs NAT mode • External Access • Routing Mesh • Name Service • embedded DNS server • for service and task • Two kinds of sandboxes • ingress: on every worker • container: on workers where task lives • Two networks are needed • ingress overlay • user-defined overlay
[Diagram labels: Container 1, Container 2, ingress sandbox, port mapping, iptables, ipvs, DNS: svc->vip, outside traffic (when service created with -p), internal traffic, gossip to update the iptables & ipvs rules]
no need
(columns: Kubernetes | Docker SwarmKit | Mesos + Marathon)
LB: iptables random mode | ipvs NAT mode | HAProxy
External Access: nodeIP:port, Ingress, external IP/LB | Routing Mesh (ingress overlay) | same as exposing HAProxy to the public
Update: watch etcd | gossip | marathon_lb.py & template
but why? • Multi-Scheduler • pod1: scheduler1, pod2: scheduler2 • QoS tiers • anyone remember the core idea of Borg? • Guaranteed (requests == limit) • Burstable (requests < limit) • Best-Effort (no request & limit) • More Borg features are on the way • equivalence class, pod level resource boundary … Burstable Pod
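The three QoS tiers listed on this slide map directly onto a small classification function. This sketch follows only the slide's three rules for a single resource value; the `qos_tier` name is hypothetical, real Kubernetes classifies per pod across containers and resource types, and treating a partially specified request/limit pair as Burstable is an assumption of this sketch rather than something the slide states.

```python
from typing import Optional


def qos_tier(request: Optional[float], limit: Optional[float]) -> str:
    """Classify one resource setting using the slide's three rules."""
    if request is None and limit is None:
        return "Best-Effort"          # no request & limit
    if request is not None and limit is not None and request == limit:
        return "Guaranteed"           # requests == limit
    # requests < limit; partially specified pairs also land here (assumption)
    return "Burstable"


tiers = [
    qos_tier(None, None),   # nothing declared
    qos_tier(0.5, 0.5),     # request equals limit
    qos_tier(0.25, 1.0),    # request below limit
]
```

The tier matters when a node is under pressure: workloads that declared less are the first candidates to be throttled or evicted, which is the Borg idea the slide alludes to.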
in plan) • Multi-Scheduler • Mesos is designed to run multiple frameworks (schedulers) • Strategy • Two level scheduling (the killer feature of Mesos) • Twitter scale … • fine-grained resource sharing (like Borg) • QoS tiers • of course • And much more • task eviction, data locality, max-min fairness, priority, offer rejection, delay scheduling • and Big Data of course
Right Way” • $ hyper run mysql • $ hyper run --link mysql wordpress • $ hyper fip attach 22.33.44.55 wordpress • But Hyper.sh is powered by Kubernetes • and also maintains Kubernetes features
is what’s backing Hyper.sh: • HyperContainer runtime • Multi-tenant network based on Neutron • Custom Cinder plugin with Ceph backend • Custom HAproxy based Service • Kubernetes is truly extensible and configurable
individual developer/org, trying to find something that is friendly and just works • I use Docker SwarmKit • I have a “Twitter scale” cluster to manage or I am a Big Data user • I need Mesos • But if what I need is an infrastructure layer to build my systems on top of in the right way • Kubernetes is the choice