
NOTES ON CLOUD AVAILABILITY

DRAFT v01.00.01

Yakov Simkin, AWS SAA.

May 8, 2018

ISSUE: Public cloud availability, as committed in the SLA, is below 5-NINES (typically 3-4 NINES),
which falls well short of Carrier-Grade requirements.

OVERVIEW:

If we want to design a carrier-grade network, one design criterion to include is an overall
availability of at least 99.999 percent (five-nines), which corresponds to a down time of no more
than about 5 minutes per year.

Carrier-grade means extremely high availability. In fact, the requirement for availability in
commercial telephony networks is 99.999 percent. In other words, a carrier-grade network
should be operational at least 99.999 percent of the time. This corresponds to a down time of no
more than about five minutes per year and is known as five nines reliability.
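
The "about five minutes per year" figure follows directly from the percentage. A minimal sketch of the arithmetic, assuming a 365-day year; the function name is illustrative, and the three-nines and four-nines rows are included only for comparison with the cloud SLA levels discussed later in these notes:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Maximum yearly down time (in minutes) allowed by an availability percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


for label, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    print(f"{label} ({pct}%): {downtime_minutes_per_year(pct):.2f} minutes/year")
```

Five nines works out to roughly 5.26 minutes per year, while three nines allows about 525.6 minutes, or close to nine hours.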

Node Availability: Each node on the network should provide at least 99.999 percent availability.
One way to gauge the availability of a node is for the vendor to provide Mean Time Between
Failure (MTBF) values for each component of a given node as well as for the node as a whole.
Often, we will find that MTBF values are in tens of years, which means that we could expect
decades to pass before a given component fails. Although this sounds like an impressive
duration, do not be overly impressed. The MTBF for a complete node will be much
lower than the lowest MTBF value for a single component. Assume, for example, that a gateway
has five cards, with MTBF values of 5 years, 10 years, 10 years, 20 years, and 20 years. Then in
a given 100-year period, we can expect a total of 50 failures (20 + 10 + 10 + 5 + 5). Thus, our
overall MTBF is two years (100 years / 50 failures).
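
The gateway example can be reproduced with a short calculation. This is a sketch under two assumptions that the text implies but does not state: the cards fail independently, and any single card failure counts as a node failure. The function name is illustrative.

```python
def node_mtbf_years(component_mtbf_years):
    """Combined MTBF of a node, assuming independent components and that
    any single component failure counts as a failure of the node."""
    # Failure rates (failures per year) add up across components.
    total_failure_rate = sum(1.0 / mtbf for mtbf in component_mtbf_years)
    return 1.0 / total_failure_rate


cards = [5, 10, 10, 20, 20]  # MTBF of each card in years, from the example

# Expected failures in a 100-year period: 20 + 10 + 10 + 5 + 5
failures_per_century = sum(100 / mtbf for mtbf in cards)
print(f"Expected failures in 100 years: {failures_per_century:.0f}")
print(f"Node MTBF: {node_mtbf_years(cards):.1f} years")
```

The node MTBF comes out at about two years, well below the 5-year MTBF of the weakest card.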

In contrast to carrier-grade telecom systems, public cloud SLAs provide 3-4 NINES availability,
even though many claims are made that the cloud is built or designed for High Availability (HA).
Is this promise accurate in light of several multi-hour outages of AWS, Azure, and others in the
past? See the article below by James Alan Miller.

https://searchstorage.techtarget.com/feature/Many-enterprises-arent-prepared-for-a-cloud-service-outage
James Alan Miller
A cloud service outage is no small matter. The cloud is as essential to business operations as it is
to the modern IT toolkit. Minutes down, let alone hours or even days, can have a profound
impact on your bottom line.

A cloud service outage can affect customer satisfaction and revenue, and -- depending on how
much you rely on the cloud -- workload testing, DevOps and data access, among other areas. It
can impede the ability of a business to comply with standards and regulations, which can lead to
fines and penalties. Compliance has taken on significant urgency as the May 25 deadline for the
European Union's General Data Protection Regulation approaches.

While cloud service providers of all stripes are responsible for getting their infrastructures,
including storage, up and running as fast as possible after an outage, the story doesn't always
play out the way a customer might expect, or want. This is particularly true for cloud-based data,
applications and other workloads.

Market research firm Vanson Bourne surveyed 600 IT and 600 business decision makers in a
study commissioned by Veritas Technologies in July and August 2017. Veritas presented the
results in its 2017 "The Truth in Cloud Report," released earlier this spring, which highlighted
the persistence of a number of misconceptions about the cloud. These misperceptions include
whether the service provider or customer is responsible for protecting applications and data from
a cloud service outage.

The report revealed more than half of those surveyed believe their cloud service providers are
primarily responsible for handling a cloud service outage. About four-fifths of respondents said
their cloud service providers should be held accountable for safeguarding their applications and
data against cloud outages. In reality, this isn't exactly the case.

(Chart source: Vanson Bourne for Veritas Technologies)

Take a close look at your service agreement, and you will likely find areas where your
organization is more responsible in the event of an outage than you might think. According to
Veritas, these agreements usually only address the infrastructure layer of the cloud service. That
is, the cloud service provider is responsible for getting its infrastructure up and running again
should an outage occur.

Recovering from a cloud service outage is often about more than infrastructure, though. And
getting data and workloads up and running is a different story. Here, the onus may fall partly or
even fully on the customer. Veritas cited excerpts from an unspecified major cloud service
provider's customer server agreement to make this point. The first excerpt referred to how there's
no guarantee service will be uninterrupted and error-free. The second made the point that the
provider can't be held accountable for lost revenue, profits, customers and business. And the
third highlighted how the provider has no liability when it comes to business interruptions.

Public cloud vendors claim HA, but the actual SLA or contract goes even below 98%
(for the 25% payback).
Paying back a penalty is good, but it will not attract customers if they are looking for
carrier-grade availability. So what makes us believe that AWS S3 provides HA?
How can you support the statement on slide 15, “The high availability concepts used in Clusters
make Cloud Computing Highly Available”, if the SLA says that it actually is not?
Of course, telephone switches do actually provide High Availability; that was part of the contract
with our customers as well. But I don’t see any support for HA in cloud-based systems, except
on-premises.

Use of the term “designed for high availability” obfuscates the real measure: the actual
availability that could be promised in the contract, as is done for telecom systems. It seems like
a marketing ploy to avoid stating the actual availability. This is not a real parameter, agree?

The other part of the question is why S3 would not support 5-NINES if storage vendors provide
6-NINES, as I point out below.
Is the network a barrier? Then why would AWS not exclude it from the SLA and specify 5-NINES
for the internal cloud network?
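
One way to frame the network question is to treat the end-to-end path as independent elements in series, so that the overall availability is the product of the element availabilities. The sketch below is a simplified model, not a description of how AWS computes its SLA. The six-nines storage figure is the one quoted in these notes; the 99.9% network figure is a hypothetical value chosen only for illustration.

```python
from math import prod


def series_availability(availabilities_percent):
    """Availability (percent) of independent elements that must all be up."""
    return 100.0 * prod(a / 100.0 for a in availabilities_percent)


storage = 99.9999  # six nines, as quoted for storage vendors in these notes
network = 99.9     # hypothetical figure, for illustration only

print(f"End-to-end: {series_availability([storage, network]):.4f}%")
```

Under this model the weakest element dominates: a six-nines store reached over a three-nines network delivers slightly less than three nines end to end.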

The reason is that it does not make sense to support HA for all services or traffic. It only makes
sense to provide HA for critical services such as VoIP (and VoLTE) or Public Safety
communications such as POC.
