The IT Guide to Cloud Uptime


When you move mission-critical or even moderately critical applications to the cloud, uptime (and performance) is everything.

The cloud seems simple – after all, it acts as one big server and storage array in the sky, right? At a certain level, that is true. But this view obscures real complexity in choosing the right cloud service, having a proper WAN, and managing and maintaining your cloud connections. At the same time, you need enough bandwidth so your latency-sensitive applications, such as VoIP, perform properly.

Half a decade ago, it seemed cloud services were going down left and right. Amazon Web Services was out for five days; Yahoo Mail had problems, as did VMware’s Cloud Foundry development cloud. There are still sporadic issues on the cloud provider side, but at a much lower rate than five years ago. Even so, the largest enterprises find this unacceptable, and with their large budgets they can afford a second provider to fail over to.

CloudHarmony, a Gartner company, tracked the major cloud service providers for the entire 365 days of 2015. AWS came out on top, besting Google and Microsoft.

“Through 365 days of monitoring last year, CloudHarmony recorded 56 outages at AWS across four major services – virtual compute and storage, plus content delivery network and domain name service – for a total downtime of about two hours and 30 minutes,” the company found. “By comparison, Microsoft Azure and Google Cloud Platform each had more than five-fold the downtime. Azure experienced 71 outages totaling 10 hours and 49 minutes in services, while 167 outages across 11 hours and 34 minutes were recorded in Google’s cloud.”
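To put those totals in perspective, the downtime figures can be converted into availability percentages. The sketch below is only a rough illustration: it treats each provider's reported total as if it were the downtime of a single service over the full year, which is a simplification, since the totals are summed across several services. The function and variable names are made up for this example.

```python
from datetime import timedelta

YEAR = timedelta(days=365)  # the 2015 monitoring period cited above


def availability(downtime: timedelta, period: timedelta = YEAR) -> float:
    """Return uptime as a percentage of the monitoring period."""
    return 100.0 * (1 - downtime / period)


# Downtime totals as quoted from CloudHarmony above.
reported = {
    "AWS": timedelta(hours=2, minutes=30),
    "Azure": timedelta(hours=10, minutes=49),
    "Google": timedelta(hours=11, minutes=34),
}

for name, down in reported.items():
    print(f"{name}: {availability(down):.3f}% available")
```

Even the providers with the most recorded downtime come out well above 99.8 percent under this simplified reading.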

And 2016 so far has had a number of problematic outages. Here are a few:

  • In January, a Verizon cloud data center had a power outage, which created chaos for JetBlue Airways. The airline was forced to cancel flights and rebook passengers, and it came away with a major black eye.
  • Also in January, Twitter had an eight-hour outage. While most don’t see Twitter as mission critical, it can be a major part of an organization’s public relations and marketing efforts.
  • That same month, Microsoft Office 365 experienced an outage that for some customers lasted over a week. The main issue was with email, arguably the most important of the Microsoft cloud apps.
  • Salesforce.com, the leading cloud CRM, had an outage in March that kept some European customers locked out for some ten hours. In May, there was another problem that deleted four hours’ worth of user data. It took days to fix the problem.

Despite these glitches, the uptime record of cloud providers is now actually very good.

Picking a Provider

Very few organizations have the money to contract with two providers to create redundancy. So the one provider you choose needs to be the best you can afford. Here, research such as that from CloudHarmony is helpful. There are some other issues to consider:

  • SLAs: If you are running serious applications you should have an SLA – one with real teeth so you are actually compensated for downtime or sub-par performance.
  • References: References are important and you should strive to get them. Try to get references from organizations similar in size and needs to your own. At the same time, do your own research on how the provider’s network is architected, how applications are hosted, its approach to security, and what measures it takes to ensure performance and uptime.
  • Location: The location of the cloud provider is important, but the truly critical issue is the proximity of the hop nearest to your organization or location. That will help determine how fast your application runs.
  • Visibility: What tools does the provider offer that will allow you to see how your application is operating and how your data is moving?
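
When comparing SLAs, it can help to translate an uptime percentage into the downtime it actually permits. The following is a minimal sketch, assuming a 30-day billing month; the function name and the sample targets are illustrative and are not taken from any particular provider's contract.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted by an uptime target over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


# Hypothetical SLA targets, chosen only to show how quickly the budget shrinks.
for target in (99.0, 99.9, 99.95, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes per 30 days")
```

Comparing that allowance against the compensation terms shows whether an SLA really has teeth.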

All this information can be useful in choosing the most reliable cloud vendors. But while there are still provider problems, the vast majority of cloud downtime has to do with the networks that connect cloud users to their cloud – no matter who the provider is. The area your organization controls the most is your own network. Trying to keep remote workers’ connections to the cloud running smoothly is a more complex matter.

Beefing Up Your Network

To gain adequate performance, you can increase your organization’s WAN bandwidth. You can also buy hardware that boosts speed through traffic shaping, WAN optimization, and WAN acceleration. This is called cloud boosting. The last two functions can also be acquired as a cloud service. Of course, these measures primarily help maintain the cloud service to your organization’s main location, but they also benefit remote employees if they access cloud services via your VPN.

Depending on your budget, there are several ways to make your network redundant. One key approach is to have redundant WAN or internet links. These can be set up in two key ways. The easiest approach is to have the secondary link inactive until the primary link goes down. This is called active/inactive. Here your secondary link can be slower, and thus less expensive, since it is only used in an emergency.

You can also set the links up as active/active, making the most use of available bandwidth. When one link fails, the other active link takes over all of the traffic.
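
The difference between the two setups can be sketched in a few lines. This is a simplified model rather than a real router configuration: the `Link` class, the `links_in_use` function, and the mode strings are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Link:
    name: str
    healthy: bool


def links_in_use(primary: Link, secondary: Link, mode: str) -> List[Link]:
    """Return the links that carry traffic for the given redundancy mode."""
    if mode == "active/inactive":
        # The secondary link stays idle unless the primary is down.
        if primary.healthy:
            return [primary]
        return [secondary] if secondary.healthy else []
    if mode == "active/active":
        # Both links share traffic; a failed link simply drops out.
        return [link for link in (primary, secondary) if link.healthy]
    raise ValueError(f"unknown mode: {mode}")


primary = Link("primary", healthy=False)
secondary = Link("secondary", healthy=True)
print([link.name for link in links_in_use(primary, secondary, "active/inactive")])
print([link.name for link in links_in_use(primary, secondary, "active/active")])
```

In both modes a primary failure leaves the secondary link carrying everything. The difference is that active/active also uses the secondary link's bandwidth during normal operation.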

You can also have redundant NICs for your servers, and even backup routers or switches. This is all based on budget and your need for absolute uptime.

Cloud Network Monitoring and Management – the Key to Uptime

The most vulnerable element in your cloud setup is the network. If AWS only had two and a half hours of downtime in all of last year, you have bigger things to worry about than their downtime. Your downtime due to your network is the big concern.

Managing that network is your responsibility. To monitor for problems, both current and pending, you need visibility across your internal network, as well as any WAN or internet connections you might have.

Problems can come from a number of sources. The cause could be a router, a NIC, or another network infrastructure component.
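
As a very basic illustration of that kind of visibility, the sketch below times a TCP connection to each hop you care about, such as an internal gateway or a cloud application endpoint. It is a minimal example using only the Python standard library, not a substitute for a monitoring product, and the function names are made up for this example.

```python
import socket
import time
from typing import Dict, Iterable, Optional, Tuple


def check_endpoint(host: str, port: int = 443, timeout: float = 3.0) -> Optional[float]:
    """Return the TCP connect time in milliseconds, or None if unreachable."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return None


def probe(targets: Iterable[Tuple[str, int]]) -> Dict[str, Optional[float]]:
    """Check every (host, port) pair and map "host:port" to its result."""
    return {f"{host}:{port}": check_endpoint(host, port) for host, port in targets}
```

Calling `probe` with a list of your own hosts and ports on a schedule, and comparing the results for internal and external targets, gives a first hint as to whether a slowdown sits inside your network or beyond it.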

Traditional monitoring tools tend to operate in silos. They are often focused on on-premises networks, specific applications (including cloud apps), OSes, virtual servers, or bits of network gear such as routers and switches. This makes it tough to pinpoint the root cause of cloud networking problems.

Meanwhile, unified cloud monitoring and management provides deep visibility into the network, and the management aspect provides remediation. By spotting problems in the network that compromise cloud uptime, IT can take quick action, keeping that network healthy so end users don’t even know there was an issue.

This visibility also lets IT understand the overall health of the entire network, including applications, virtual servers, routers, etc., and make the proper upgrades to make it more reliable. If IT regularly runs into a problem with a network element, they can replace it. And if downtime is due to having a single connection out to the cloud that fails, IT can set up a backup connection.

Monitor and Fix Your Cloud with Traverse

The cloud presents special network management challenges for IT. That’s because internal IT doesn’t have full control of the provider’s cloud infrastructure or a full view of all the network pieces that support these cloud applications and services.

And while IT struggles to monitor and manage the cloud, it still needs to take care of internal networks and even hybrid cloud configurations. Kaseya Traverse is a full-featured network monitoring solution designed to holistically monitor performance across on-premises, cloud and hybrid infrastructure.

With Kaseya Traverse, IT staff can view even the most complex infrastructure based on service-level views. This service-oriented view enables fast root cause analysis, so network and service problems, especially regarding the cloud, are quickly resolved and don’t hold operations up.

Learn more about Kaseya Traverse here.

Posted by Doug Barney
Doug Barney was the founding editor of Redmond Magazine, Redmond Channel Partner, Redmond Developer News and Virtualization Review. Doug also served as Executive Editor of Network World, Editor in Chief of AmigaWorld, and Editor in Chief of Network Computing.