BATS: Budget-Constrained Autoscaling for Cloud Performance Optimization

A. Hasan Mahmud, Florida International University
Yuxiong He, Microsoft Research
Shaolei Ren, Florida International University

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). Copyright is held by the author/owner(s).
SIGMETRICS’14, June 16–20, 2014, Austin, Texas, USA.
ACM 978-1-4503-2789-3/14/06.
http://dx.doi.org/10.1145/2591971.2592019.

1. INTRODUCTION

In recent years, cloud computing has experienced exponential growth, providing a wide range of services such as scientific computing and web hosting. Cloud service users desire two major benefits: good performance and low operational costs. Autoscaling is a key property of cloud systems that aims to achieve this goal. Using autoscaling, users can dynamically request more resources (e.g., virtual machines, or VMs) and pay more when their demand increases, and vice versa. In this paper, we propose to optimize cloud application performance by leveraging autoscaling while satisfying a long-term budget constraint (e.g., a monthly or yearly budget). Such budget constraints are commonly applied to businesses, universities and governments, which typically allocate annual IT operational budgets at the beginning of each fiscal year [6]. We focus on delay-sensitive cloud applications (e.g., web services) where application performance is measured by the delay of responses. While optimizing delay performance under a budget constraint is important for cloud users, it is practically challenging. Requesting more VMs at the current time reduces the budget available for future use; hence, optimally scaling VM acquisitions requires complete offline information (e.g., future workload demand) over the entire budgeting period, which is very difficult, if not impossible, to obtain in advance, especially considering highly dynamic workloads and possible traffic spikes due to breaking events.

The default autoscaling mechanisms offered by major cloud service providers, such as Amazon EC2 and Windows Azure, typically scale VM instances based on resource utilization indicators such as CPU and memory usage: e.g., adding a new VM instance or switching to a bigger VM instance that has more virtual CPU cores when the current CPU utilization exceeds a certain user-specified threshold [1, 3]. However, when considering the budget constraint, it is difficult to decide the optimal resource usage threshold a priori, because the threshold value depends on the user budget and the workloads during the entire budgeting period (e.g., setting a low threshold increases performance but may violate the budget constraint, and vice versa). Recent efforts on autoscaling for optimizing performance under long-term budget constraints have primarily focused on evenly distributing budgets across time or predicting the long-term future workloads, neither of which applies to highly dynamic delay-sensitive workloads in practice [4, 5].

In this paper, in view of the practical difficulty in accurately predicting long-term future workloads, we develop an online autoscaling system, called BATS (Budget-constrained AuToScaling), that dynamically scales VM instances to optimize the delay performance while satisfying the user's budget constraint in the long run. The core of BATS is an online autoscaling algorithm, which only requires the past and instantaneous workload information to make effective scaling decisions. The key idea of our algorithm is to keep track of the budget deficit online and incorporate it into the online autoscaling decision: if the actual VM expenses exceed the expected cost thus far, BATS will request fewer and/or smaller VM instances subject to the minimum delay performance requirement, such that the budget deficit can be decreased. Thus, the budget deficit tracked online serves as a feedback mechanism to guide BATS towards satisfying the long-term budget constraint.

The contribution of our work is that we provide a full-fledged autoscaling solution to optimize delay performance while meeting users’ long-term budget constraints that widely exist in practice. In particular, we offer a provably-efficient autoscaling algorithm, a user-friendly automated system implemented on Windows Azure, and a comprehensive performance evaluation showing the advantages of the proposed solution compared with the state of the art, using both system implementation and simulation.

2. SYSTEM IMPLEMENTATION

We describe the software architecture of the BATS autoscaler. It has three main modules: (1) Monitor: gathers performance metrics and resource utilization indicators of the hosted cloud application, such as the number of web server connections; (2) Scheduler: decides the optimal number of VMs for the cloud application based on the BATS algorithm, with the goal of minimizing delay while meeting the budget constraint; and (3) Scaler: executes the Scheduler's decision and submits the scaling request to Windows Azure. The Scheduler has two submodules that use the output from the Monitor: a Predictor that predicts the workload arrival using an auto-regressive model, and a Performance Watcher that detects the current workload status. The Performance Watcher ensures that the autoscaler meets the delay performance goal in case of a workload prediction error or a sudden workload spike. While we use Windows Azure as an example, the modular design of our autoscaler is also applicable to other cloud platforms such as Amazon EC2 by modifying the Scaler and Monitor modules.
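
The budget-deficit feedback described in Section 1 and applied by the Scheduler can be sketched as follows. This is a minimal sketch, not the BATS algorithm itself: the text does not give the exact update rule, so the queue-style deficit update, the weighted objective, the names `DeficitTracker` and `choose_vm_count`, and any delay values passed in are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DeficitTracker:
    """Tracks spending above an even per-slot allowance (hypothetical helper)."""
    total_budget: float   # budget for the whole budgeting period, in dollars
    num_slots: int        # number of time slots in the budgeting period
    deficit: float = 0.0  # accumulated overspend so far, never negative

    @property
    def per_slot_allowance(self) -> float:
        # Expected cost per slot if the budget were spent evenly (an assumption).
        return self.total_budget / self.num_slots

    def record(self, slot_cost: float) -> None:
        # Assumed update: overspending raises the deficit, underspending pays it down.
        self.deficit = max(0.0, self.deficit + slot_cost - self.per_slot_allowance)


def choose_vm_count(delay_by_vms, price_per_vm_slot, deficit, delay_weight, max_delay):
    """Pick a VM count by trading predicted delay against deficit-weighted cost.

    delay_by_vms maps a candidate VM count to its predicted average delay (seconds).
    The weighted-sum objective is an assumed stand-in for the paper's algorithm.
    """
    # Keep only configurations that respect the maximum tolerable delay.
    feasible = {n: d for n, d in delay_by_vms.items() if d <= max_delay}
    if not feasible:
        # No configuration meets the bound: fall back to the largest one offered.
        return max(delay_by_vms)
    # A larger deficit makes each dollar of VM cost weigh more in the decision.
    return min(
        feasible,
        key=lambda n: delay_weight * feasible[n] + deficit * n * price_per_vm_slot,
    )
```

With a deficit of zero the sketch picks the lowest-delay configuration; as the deficit grows, the cost term dominates and the choice moves toward fewer VMs, while the `max_delay` filter stands in for the minimum delay performance requirement.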
3. EXPERIMENTAL RESULT

First, we discuss our experimental setup. We consider a time-slotted model and generate workload arrivals based on the MSR (Microsoft Research) workload trace to drive our experiment. To keep the experiment short, the budgeting period in our study consists of 48 time slots and the duration of each time slot is 1 hour. The default budget for our experiment is $8.5, while the cost for one VM instance per hour is $0.02. We deploy the RUBiS [2] benchmarking application into the web role of Azure Cloud Service and scale the underlying resources by varying the number of VM instances. We use extra-small VM instances. The performance goal is to achieve an average delay of 520 ms, while the maximum tolerable average delay is 1500 ms. To make the optimal scaling decision (i.e., number of VM instances) at each time slot, BATS needs to determine the resulting delay for the incoming workload under various scaling configurations (i.e., numbers of VM instances). We model the delay of the RUBiS workload using a delay lookup table. Specifically, we vary the number of clients from the workload generator and obtain the average delay for each scaling configuration. Thus, we build the delay lookup table for various combinations of workload arrival rates and numbers of VM instances.

We now describe three benchmarks against which we compare BATS: (1) EqualSC: evenly divides the available budget across all the time slots and obtains the number of VM instances that can be reserved for the entire budgeting period based on discounted pricing (20% for reserved instances [3]); (2) ReactSC: a widely adopted autoscaler [1, 3] that constantly monitors the average workload arrival rate measured over the last 5 minutes and uses the delay lookup table to determine the minimum number of VM instances such that the resulting delay is less than a specified threshold (520 ms); and (3) OptOffline: optimally divides the whole budget among time slots with perfect workload arrival information over the entire budgeting period.

Next, we compare the performance of BATS with the benchmark algorithms and show the cumulative average delay and cumulative cost in Fig. 1(a) and Fig. 1(b), respectively. The cumulative average value for a time slot t is the corresponding average value over time slots 0 to t. We show the average delay and cost comparisons for each individual time slot in Fig. 1(c) and Fig. 1(d), respectively. As shown in Fig. 1(a), BATS reduces delay by 34% compared to EqualSC while meeting the same budget constraint, even though EqualSC receives discounted pricing. This is mainly because EqualSC evenly divides the budget across the time slots and reserves 11 VM instances without considering the workload variation. As a result, when there is a workload spike (e.g., in the 4th time slot), the delay becomes very large, as shown in Fig. 1(c). The average delay reduction of BATS is 10% compared to ReactSC. The degraded performance of ReactSC comes from its long lag: it takes up to 5 minutes to detect a change in system status (e.g., workload variation) and, even after detection, it takes up to 10 minutes to acquire a new VM instance. During this lag, all incoming workloads experience longer average delays. This demonstrates the importance of proactively predicting the near-future (e.g., hour-ahead) workloads as done by BATS, and highlights the limitations of reactive autoscaling rules. Moreover, Fig. 1(b) shows that the cost saving of BATS is 10% compared to ReactSC. This is mainly because ReactSC ignores the budget constraint and always makes scaling decisions such that the resulting average delay equals 520 ms. ReactSC also violates the $8.5 budget constraint. We also observe that the average delay of BATS is very close to that of OptOffline (with a difference of less than 4%), while Fig. 1(b) shows that the cost is almost the same. Thus, the results demonstrate the effectiveness of BATS.

Figure 1: Comparing BATS with other algorithms (BATS, EqualSC, OptOffline, ReactSC). (a) Average delay (sec) vs. time slot. (b) Total cost ($) vs. time slot. (c) Instantaneous delay (sec) vs. time slot. (d) Instantaneous cost ($) vs. time slot.

Finally, we summarize the simulation study that demonstrates the benefits and robustness of BATS over a wide range of scenarios. Even with high workload prediction errors, the performance of BATS is still robust: e.g., for a 20% prediction error, the resulting average delay only increases by 1.4%. Moreover, by leveraging both horizontal and vertical scaling, BATS can further improve delay performance while satisfying the budget constraint. BATS can also incorporate reserved instances in its operation to lower operational cost and improve delay performance. The simulation study for other workload traces also demonstrates that BATS performs better even under fairly smooth workloads.

4. REFERENCES

[1] Amazon EC2. http://aws.amazon.com/autoscaling/.
[2] RUBiS. http://rubis.ow2.org/.
[3] Windows Azure. http://www.windowsazure.com/.
[4] M. Malawski, G. Juve, E. Deelman, and J. Nabrzyski. Cost- and deadline-constrained provisioning for scientific workflow ensembles in IaaS clouds. In SC, 2012.
[5] M. Mao, J. Li, and M. Humphrey. Cloud auto-scaling with deadline and budget constraints. In GRID, 2010.
[6] S. VanRoekel. The FY14 President’s IT budget: Innovate, deliver, protect. https://cio.gov/the-fy14-presidents-it-budget-innovate-deliver-protect/.
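
The two simpler baselines of Section 3 can be checked with a short sketch. The EqualSC sizing uses only figures stated in the text ($8.5 budget, 48 one-hour slots, $0.02 per VM-hour, 20% reserved discount) and reproduces the 11 reserved instances reported there. The ReactSC lookup is a sketch under assumptions: the function names, the table layout, and the rule of rounding the arrival rate up to the next tabulated rate are hypothetical, since the text does not specify how the delay lookup table is indexed.

```python
import math


def equalsc_reserved_vms(budget, num_slots, hourly_price, discount, slot_hours=1.0):
    """Number of VMs that can be reserved for the whole period at a discounted price."""
    discounted_price = hourly_price * (1.0 - discount)
    # Cost of keeping one VM running for every slot of the budgeting period.
    cost_per_vm = discounted_price * slot_hours * num_slots
    return math.floor(budget / cost_per_vm)


def min_vms_for_delay(delay_table, arrival_rate, threshold):
    """Smallest VM count whose tabulated average delay is below the threshold.

    delay_table maps a tabulated arrival rate to {vm_count: average delay in seconds}.
    Assumption: the measured rate is rounded up to the next tabulated rate, and the
    highest tabulated rate is used when the measured rate exceeds all of them.
    """
    rates = sorted(delay_table)
    bucket = next((r for r in rates if r >= arrival_rate), rates[-1])
    row = delay_table[bucket]
    for vm_count in sorted(row):
        if row[vm_count] < threshold:
            return vm_count
    # No tabulated configuration meets the threshold: return the largest one.
    return max(row)


if __name__ == "__main__":
    # Figures from Section 3: $8.5 over 48 hourly slots, $0.02 per VM-hour, 20% discount.
    print(equalsc_reserved_vms(8.5, 48, 0.02, 0.20))
```

At the discounted price of $0.016 per VM-hour, one VM costs $0.768 over the 48 slots, so the $8.5 budget covers 11 VMs with a small remainder, which agrees with the 11 instances reported for EqualSC.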
