“Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.”
Source: NIST Definition of Cloud Computing
The word scheduling is believed to originate from the Latin word schedula around the 14th century, when it meant a papyrus strip or a slip of paper with writing on it. In the 15th century it came to mean a timetable, and from there it was adopted to mean the scheduler we currently use in computer science.
Scheduling in computing is the process of deciding how to allocate resources to a set of processes.
Source: Wikipedia
Scheduling is at the heart of modern computers.
- Ineffective resource management is unaffordable at cloud scale.
- New challenges and opportunities arise from:
  - Virtualization
  - Consumption patterns
  - New workloads
"Scheduling, it turns out, comes down to deciding how to spend money."
Source: Towards a cloud computing research agenda. K. Birman et al. SIGACT'09
notation, scheduling can be expressed as

    Map<VM, PM> = f(Set<VM>, Set<PM>, context)

context can be:
- Performance model
- Heterogeneity of resources
- Network information
come up with function f? One that:
- Saves energy in the data center while maintaining SLAs
- Improves network scalability and performance
- Saves battery on mobile devices
- Saves cost in a multi-cloud environment
effectively saving the power consumed by the data center.
- Consolidate virtual machines effectively based on their resource usage.
- Maximize utilization of physical machines, and put the remaining machines into standby mode by migrating their VMs onto other physical machines.
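$$RV_{vm}(PM) = \langle E_{cpu}, E_{mem}, E_{disk}, E_{bw} \rangle, \quad \text{where } E_x = \frac{x \text{ used by VM}}{\max x \text{ capacity of PM}} \tag{1}$$

Utilization U combines multiple resources (CPU, memory, disk and network) into a single measure:

$$U = \alpha \times E_{cpu} + \beta \times E_{mem} + \gamma \times E_{disk} + \delta \times E_{bw}$$

where $\alpha, \beta, \gamma, \delta \in [0, 1]$ and $\alpha + \beta + \gamma + \delta = 1$.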
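A minimal Python sketch of the utilization measure U. The names `ResourceVector`, `resource_vector` and `utilization`, the dictionary layout, and the equal default weights are assumptions made for illustration, not part of the original work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceVector:
    """Fraction of PM capacity used per resource, each in [0, 1]."""
    cpu: float
    mem: float
    disk: float
    bw: float


def resource_vector(used, capacity):
    """Build E_x = (x used by VM) / (max x capacity of PM) for each resource."""
    keys = ("cpu", "mem", "disk", "bw")
    return ResourceVector(*(used[k] / capacity[k] for k in keys))


def utilization(rv, alpha=0.25, beta=0.25, gamma=0.25, delta=0.25):
    """Weighted utilization U. Equal default weights are an assumption."""
    weights = (alpha, beta, gamma, delta)
    if any(w < 0 or w > 1 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    return alpha * rv.cpu + beta * rv.mem + gamma * rv.disk + delta * rv.bw


# Hypothetical VM usage and PM capacity (units cancel out in the ratio).
rv = resource_vector(
    used={"cpu": 2, "mem": 4, "disk": 50, "bw": 100},
    capacity={"cpu": 8, "mem": 16, "disk": 200, "bw": 1000},
)
u = utilization(rv)
```

Because every component is a percentage of the host's capacity, the same code applies to machines of different sizes.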
similarity

Method 1: based on dissimilarity (lower is better) between the RV of the incoming VM and RV_PM.

$$\text{similarity} = \frac{RV_{vm}(PM) \cdot RV_{PM}}{\lVert RV_{vm}(PM) \rVert \, \lVert RV_{PM} \rVert}$$

Method 2: based on similarity (higher is better) between the RV of the incoming VM and PM_free.

$$\text{similarity} = \frac{RV_{vm}(PM) \cdot PM_{free}}{\lVert RV_{vm}(PM) \rVert \, \lVert PM_{free} \rVert}$$
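Both methods have the shape of a cosine similarity between two resource vectors. The sketch below illustrates that reading; treating PM_free as one minus the used fraction per resource is an assumption, as are the function names and the sample vectors.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)


def free_vector(rv_pm):
    """Assumed definition of PM_free: unused fraction of each resource."""
    return [1.0 - x for x in rv_pm]


rv_vm = [0.4, 0.1, 0.1, 0.1]  # CPU-heavy incoming VM (cpu, mem, disk, bw)
rv_pm = [0.7, 0.2, 0.2, 0.2]  # PM whose current load is also CPU-heavy

method1 = cosine_similarity(rv_vm, rv_pm)               # lower is better
method2 = cosine_similarity(rv_vm, free_vector(rv_pm))  # higher is better
```

Here Method 1 scores high and Method 2 scores low, and both say the same thing: this PM is a poor match for another CPU-heavy VM.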
for all VM ∈ VMs to be allocated do
    for all PM ∈ Running PMs do
        similarityPM = calculateSimilarity(RVvm(PM), RVPM)
        add similarityPM to queue
    end for
    sort queue ascending/descending using similarityPM
    for all similarityPM in queue do
        targetPM = PM corresponding to similarityPM
        if U after allocation on targetPM < (Uup − buffer) then
            allocate(VM, targetPM)
            return SUCCESS
        end if
    end for
    return FAILURE
end for
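The allocation loop can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: it places a single VM per call, assumes homogeneous PMs so the VM's vector is the same on every host, uses equal weights for U, and adds a capacity guard (`max(after) <= 1.0`) that the pseudocode does not mention.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def utilization(rv, weights=(0.25, 0.25, 0.25, 0.25)):
    return sum(w * e for w, e in zip(weights, rv))


def allocate_vm(rv_vm, running_pms, u_up=0.75, buffer=0.05, method=1):
    """Place one VM; return the chosen PM name, or None on failure.

    running_pms maps PM name -> current resource vector (4 fractions) and is
    updated in place when the VM is placed.
    """
    def score(rv_pm):
        if method == 1:
            return cosine(rv_vm, rv_pm)                  # versus used resources
        return cosine(rv_vm, [1.0 - e for e in rv_pm])   # versus free resources

    # Method 1 sorts ascending (least similar first), Method 2 descending.
    ranked = sorted(running_pms, key=lambda name: score(running_pms[name]),
                    reverse=(method == 2))
    for name in ranked:
        after = [p + v for p, v in zip(running_pms[name], rv_vm)]
        # Capacity guard is an added assumption; the U test is from the slide.
        if max(after) <= 1.0 and utilization(after) < u_up - buffer:
            running_pms[name] = after
            return name
    return None


pms = {"pm1": [0.6, 0.1, 0.1, 0.1], "pm2": [0.1, 0.6, 0.1, 0.1]}
chosen = allocate_vm([0.3, 0.05, 0.05, 0.05], pms)
```

A `None` result corresponds to FAILURE in the pseudocode and is the caller's cue to bring up another machine.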
if U > Uup then
    VM = VM with max U on that PM
    Allocation Algorithm(VM)
end if
if Allocation Algorithm fails to allocate VM then
    target PM = add a standby machine to running machines
    allocate(VM, target PM)
end if
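A rough Python sketch of the scale-up step for one physical machine. The data layout (`running` as a dict of PM name to per-VM utilization), the use of the sum of VM utilizations as the PM's U, and the `first_fit` stand-in for the allocation algorithm are all assumptions made for illustration.

```python
def first_fit(vm, u, running, exclude, u_up=0.75):
    """Hypothetical stand-in for the allocation algorithm: first PM with room."""
    for name, vms in running.items():
        if name != exclude and sum(vms.values()) + u < u_up:
            return name
    return None


def scale_up(pm_name, running, standby, try_allocate=first_fit, u_up=0.75):
    """Relieve one overloaded PM; return the target PM, or None if not overloaded."""
    vms = running[pm_name]
    if sum(vms.values()) <= u_up:
        return None                      # PM is not overloaded
    vm = max(vms, key=vms.get)           # VM with max U on that PM
    u = vms.pop(vm)
    target = try_allocate(vm, u, running, exclude=pm_name)
    if target is None:
        if not standby:
            vms[vm] = u                  # nowhere to go: undo and report
            raise RuntimeError("no standby machine available")
        target = standby.pop(0)          # wake a standby machine
        running[target] = {}
    running[target][vm] = u
    return target


running = {"pm1": {"a": 0.5, "b": 0.4}, "pm2": {"c": 0.6}}
standby = ["pm3"]
target = scale_up("pm1", running, standby)
```

Keeping `standby` ordered from least to most recently used would make `pop(0)` spread load evenly across machines.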
EC2, TCP/UDP throughput experienced by applications can fluctuate rapidly between 1 Gb/s and zero.
Packet delay variations among Amazon EC2 instances are abnormally large.
Source: G. Wang et al. The impact of virtualization on network performance of Amazon EC2 data center. INFOCOM 2010
to millions of requests.
- Network traffic at higher layers poses a significant challenge for data center network scaling.
- New applications in the data center are pushing the need for traffic localization in the data center network.
VMs based on their traffic exchange patterns.
How to place? A placement algorithm that places VMs so as to localize internal data center traffic and improve application performance.
group of VMs that has a large communication cost ($c_{ij}$) over time period T.

$$c_{ij} = AccessRate_{ij} \times Delay_{ij}$$

$AccessRate_{ij}$ is the rate of data exchange between $VM_i$ and $VM_j$, and $Delay_{ij}$ is the communication delay between them.
$$AccessMatrix = \begin{bmatrix} 0 & c_{12} & \cdots & c_{1n} \\ c_{21} & 0 & \cdots & c_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ c_{n1} & c_{n2} & \cdots & 0 \end{bmatrix}$$

$c_{ij}$ is maintained over time period T in a moving-window fashion, and the mean is taken as its value.

for each row Ai ∈ AccessMatrix do
    if maxElement(Ai) > (1 + opt_threshold) ∗ avg_comm_cost then
        form a new VMCluster from non-zero elements of Ai
    end if
end for
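A small Python sketch of cluster formation from the access matrix. It assumes that avg_comm_cost is the mean of the off-diagonal entries, that the VM owning the row belongs to its own cluster, and that duplicate clusters are dropped; none of these details are stated on the slide.

```python
def form_clusters(access_matrix, opt_threshold=0.5):
    """Return a list of VM index sets, one per detected VMCluster."""
    n = len(access_matrix)
    costs = [access_matrix[i][j] for i in range(n) for j in range(n) if i != j]
    avg_comm_cost = sum(costs) / len(costs) if costs else 0.0

    clusters = []
    for i, row in enumerate(access_matrix):
        if max(row) > (1 + opt_threshold) * avg_comm_cost:
            # Row owner plus every VM it exchanges traffic with.
            cluster = {i} | {j for j, c in enumerate(row) if c > 0}
            if cluster not in clusters:
                clusters.append(cluster)
    return clusters


# Hypothetical costs c_ij = AccessRate_ij * Delay_ij for four VMs.
matrix = [
    [0, 10, 0, 0],
    [10, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
clusters = form_clusters(matrix)
```

VMs 0 and 1 exchange far more than the average and form a cluster, while the light traffic between VMs 2 and 3 stays below the threshold.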
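Which VM to migrate?

$$VMtoMigrate = \arg\max_{VM_i} \sum_{j=1}^{|VMCluster|} c_{ij}$$

Where can we migrate?

$$CandidateSet_i(VMCluster_j) = \{c \mid c \text{ and } VMCluster_j \text{ have a common ancestor at level } i\} - CandidateSet_{i+1}(VMCluster_j)$$

Will the effort be worth it?

$$PerfGain = \sum_{j=1}^{|VMCluster|} \frac{c_{ij} - c'_{ij}}{c_{ij}}$$

Here $c'_{ij}$ denotes the communication cost after the migration.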
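A minimal sketch of the selection and gain computations, assuming costs are held in a square list-of-lists and that the second cost term in PerfGain is the cost after migration. The function names and sample numbers are illustrative only.

```python
def vm_to_migrate(cluster, cost):
    """VM in the cluster with the largest total communication cost."""
    members = sorted(cluster)  # deterministic tie-breaking by VM index
    return max(members, key=lambda i: sum(cost[i][j] for j in members if j != i))


def perf_gain(vm, cluster, cost_before, cost_after):
    """Sum of relative cost reductions between vm and the rest of the cluster."""
    gain = 0.0
    for j in cluster:
        if j == vm or cost_before[vm][j] == 0:
            continue  # skip self and pairs with no traffic
        gain += (cost_before[vm][j] - cost_after[vm][j]) / cost_before[vm][j]
    return gain


cluster = {0, 1, 2}
before = [[0, 4, 6], [4, 0, 1], [6, 1, 0]]
after = [[0, 2, 3], [2, 0, 1], [3, 1, 0]]   # hypothetical costs after moving VM 0

vm = vm_to_migrate(cluster, before)
gain = perf_gain(vm, cluster, before, after)
```

Halving both of VM 0's pairwise costs yields a gain of 0.5 + 0.5.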
for each VMClusterj do
    Select VMtoMigrate
    for i from leaf to root do
        Form CandidateSeti(VMClusterj − VMtoMigrate)
        for PM ∈ CandidateSeti do
            if UtilAfterMigration(PM, VMtoMigrate) < overload_threshold AND PerfGain(PM, VMtoMigrate) > significance_threshold then
                migrate VM to PM
                continue to next VMCluster
            end if
        end for
    end for
end for
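The leaf-to-root search can be sketched as below. The topology encoding (each PM described by its tuple of ancestor switches, root first), the `anchor_path` standing for the location of the rest of the cluster, and the `util_after` and `perf_gain` callables are hypothetical; each level also excludes every PM already examined at a deeper level.

```python
def candidate_sets(anchor_path, pm_paths):
    """Yield candidate PM sets from the leaf level up to the root."""
    seen = set()
    for level in range(len(anchor_path), 0, -1):
        prefix = anchor_path[:level]
        level_set = {pm for pm, path in pm_paths.items()
                     if path[:level] == prefix} - seen
        seen |= level_set
        yield level_set


def pick_target(vm, anchor_path, pm_paths, util_after, perf_gain,
                overload_threshold=0.8, significance_threshold=0.1):
    """Return the first PM that is not overloaded and gives a significant gain."""
    for candidates in candidate_sets(anchor_path, pm_paths):
        for pm in sorted(candidates):
            if (util_after(pm, vm) < overload_threshold
                    and perf_gain(pm, vm) > significance_threshold):
                return pm
    return None  # no worthwhile migration: leave the VM where it is


pm_paths = {
    "pm1": ("core", "agg1", "tor1"),
    "pm2": ("core", "agg1", "tor1"),
    "pm3": ("core", "agg1", "tor2"),
    "pm4": ("core", "agg2", "tor3"),
}
anchor = ("core", "agg1", "tor1")
load = {"pm1": 0.9, "pm2": 0.95, "pm3": 0.5, "pm4": 0.2}
target = pick_target("vm7", anchor, pm_paths,
                     util_after=lambda pm, vm: load[pm],
                     perf_gain=lambda pm, vm: 0.5)
```

Searching from the leaf upward prefers the closest host that can take the VM, which keeps the cluster's traffic as local as possible.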
to traditional placement approaches like Vespa [1] and previous network-aware algorithms like Piao's approach [2].
- Extended NetworkCloudSim [3] to support SDN.
- Floodlight
- Servers are assumed to be HP ProLiant ML110 G5 (1 x Xeon 3075, 2660 MHz, 2 cores, 4 GB), connected through 1G links using HP ProCurve switches.
- Traces from three real-world data centers: two from universities (uni1, uni2) and one from a private data center (prv1).
Property                                   Uni1    Uni2    Prv1
Number of short non-I/O-intensive jobs      513    3637    3152
Number of short I/O-intensive jobs          223    1834    1798
Number of medium non-I/O-intensive jobs     135     628     173
Number of medium I/O-intensive jobs         186     864     231
Number of long non-I/O-intensive jobs       112     319      59
Number of long I/O-intensive jobs           160     418     358
Number of servers                           500    1093    1088
Number of devices                            22      36      96
Over-subscription                           2:1    47:1     8:3
localization) helps in network scaling.
- The VM scheduler should be aware of migrations.
- Think rationally while scheduling; you may not want all the migrations.
Scheduling of Virtual Machines in Cloud Data Centers. Dharmesh Kakadia, Radheyshyam Nanduri and Vasudeva Varma. Unpublished manuscript.
2. MECCA: Mobile, Efficient Cloud Computing Workload Adoption Framework using Scheduler Customization and Workload Migration Decisions. Dharmesh Kakadia, Prasad Saripalli and Vasudeva Varma. In MobileCloud '13.
3. Energy Efficient Data Center Networks - A SDN based approach. Dharmesh Kakadia and Vasudeva Varma. In I-CARE'12.
4. Optimizing Partition Placement in Virtualized Environments. Dharmesh Kakadia and Nandish Kopri. Patent P13710918.
5. Network-aware Virtual Machine Consolidation for Large Data Centers. Dharmesh Kakadia, Nandish Kopri and Vasudeva Varma. In NDM, collocated with SC'13.
6. MultiStack. http://MultiStack.org
observation over a period of time, to avoid unstable behavior.
- Predict utilization on the destination machine, to avoid SLA violations and unstable behavior.
- Use buffers to help guard against wrong decisions.
- Percentage (not absolute) utilization means the algorithms work unchanged for heterogeneous data centers.
- Pick the least recently used machine when scaling up, so that all machines are used uniformly and hotspots are avoided.
- The difference between Uup and Udown should be sufficiently large to avoid jitter.
Parameter                                                Value
Scale-up threshold, Uup                                  [0.25, 1.0]
Scale-down threshold, Udown                              [0.0, 0.4]
buffer                                                   [0.05, 0.5]
Similarity threshold                                     [0, 1]
Similarity method                                        Method 1 or 2
Number of physical machines                              100
Specifications of physical machines                      Heterogeneous
Time period for which resource usage of a VM is logged
for exact RVvm calculation, ∆                            5 minutes
Uup should be neither too high nor too low (optimal around 0.70-0.80).
- A high Uup means many more SLA violations.
- If Uup is low, the scale-up algorithm will run more machines than necessary.
using SDN counters
for each Switch s in S such that Utilization(s) < threshold θ over time t do
    if canMigrate(s, S − s) then
        pFlows = prioritizeFlows(s)
        incrementalMigration(pFlows)
        Poweroff(s)
    end if
end for
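The switch consolidation loop can be sketched in Python with the controller-facing operations passed in as callables. Every callable here (`utilization`, `can_migrate`, `prioritize_flows`, `migrate_flow`, `power_off`) is a hypothetical placeholder rather than a real SDN controller API, and `utilization` is assumed to already be averaged over the observation window t.

```python
def consolidate_switches(switches, utilization, can_migrate, prioritize_flows,
                         migrate_flow, power_off, theta=0.2):
    """Power off under-utilized switches after moving their flows elsewhere."""
    active = set(switches)
    powered_off = []
    for s in switches:
        if utilization(s) >= theta:
            continue                       # busy enough: keep it on
        others = active - {s}
        if can_migrate(s, others):
            for flow in prioritize_flows(s):
                migrate_flow(flow, others)  # incremental, highest priority first
            power_off(s)
            active.discard(s)
            powered_off.append(s)
    return powered_off


# Toy stand-ins for the controller operations.
util = {"s1": 0.1, "s2": 0.5, "s3": 0.05}
migrated, off = [], []
result = consolidate_switches(
    ["s1", "s2", "s3"],
    utilization=util.get,
    can_migrate=lambda s, others: len(others) >= 2,
    prioritize_flows=lambda s: [f"{s}-f1", f"{s}-f2"],
    migrate_flow=lambda flow, others: migrated.append(flow),
    power_off=off.append,
)
```

Tracking the `active` set means a switch that has just been powered off is never counted as a migration target for the next one.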
of mobile apps will use cloud back-end services.
Source: http://www.gartner.com/newsroom/id/2463615
- Cloud-enabled apps: Dropbox, Evernote, Instagram, ...; Siri, Google Voice, ...; Kindle, ...
- Traditional apps: GIMP, Firefox, games
powerful, but rich applications are increasingly hungry for resources.
- The cloud offers practically unlimited resources.
- The cloud is programmable.
- The cloud is always on.
- Only a handful of apps leverage the cloud.
cloud-aware, but can be migrated.
Can we create a mobile cloud framework that leverages cloud resources:
- Without making the app cloud-aware
- Without annoying the user
- Adaptive
- Personalized
- Works in autopilot mode
ci) / mi
- mi: cost of running the application on the mobile device (0 to 1)
- ci: cost of running the application on the cloud (0 to 1)

Performance gain:

$$Gain = \frac{\sum_i (w_i \times f_i)}{\sum_i w_i}$$

- wi: weight of the i-th feature gain, normalized to unity
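Reading the gain expression as a weighted average of per-feature gains, a short Python illustration follows. The interpretation of f_i as the gain of the i-th feature and the function name are assumptions.

```python
def performance_gain(features):
    """Weighted average of feature gains; features is a list of (w_i, f_i)."""
    total_weight = sum(w for w, _ in features)
    if total_weight == 0:
        raise ValueError("at least one weight must be non-zero")
    return sum(w * f for w, f in features) / total_weight


# Two hypothetical features with equal weight.
gain = performance_gain([(0.5, 0.2), (0.5, 0.6)])
```

Dividing by the sum of the weights keeps the result on the same scale as the individual gains even when the weights do not add up to exactly one.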
loss function learned in an online setting.
- Used Vowpal Wabbit, a fast online learning toolkit (https://github.com/JohnLangford/vowpal_wabbit/)
- Features:
  - High-level features
  - App features
  - Network features
  - Other apps
  - Device static features
  - Cloud provider features
features that concern the user, including battery status, date and time, user location (moving/stable), etc.
- Application features: capture application usage habits, including frequency of use, length of each usage session, use of local and remote data, etc.
- Network status: the network condition between the cloud and the mobile device, including bandwidth, latency and stability.
- Resource usage by other applications running on the device: a combined vector over all individual applications.
hardware and software configuration of the device.
  - CPU frequency
  - CPU power steps
  - Operating frequency, etc.
- Cloud configuration: captures characteristics of the cloud provider.
  - Monetary cost
  - Provider performance statistics
mobile device
- The Linux traffic control utility (tc) is used to simulate various network conditions.
- OpenStack is used as the IaaS cloud provider.

Property                    Value
Cloud operating system      Ubuntu 12.04 (kernel 3.2)
Cloud VM configuration      4 GB, 2.66 GHz
Device operating system     Android 4.2
Device configuration        1 GB, 1.5 GHz
(and only superficial) interoperability.
- Each cloud is very different (architecture, SLA, abstraction, API, ...).
- This is likely to stay the same, due to conflicts of interest.
- It can lead to lock-in, data loss and cost increases.
- Many new applications are bursty in nature.
Think of it as an OS for multiple clouds.
- To identify problems and evaluate solutions for a multi-cloud platform.
- More challenging than data center scheduling.
- Big data as the first use case.
run across multiple cloud providers
- Priority-based job scheduling for minimizing cost and completion time
- Performance optimization with storage integration
- Client tools
- More frameworks (Spark, Hive, Pig, Oozie, Drill, MLlib, ...)
- Other schedulers (autoscaling, spot instances, job-profile based)