Use Site Reliability Engineering to Address Cloud Instability

Cloud platforms, as remotely managed services, come with a service-level agreement (SLA) that guarantees an uptime percentage or your money back. These SLAs, together with the shift of responsibility for infrastructure maintenance from your organisation or colocation provider to the cloud providers it uses, have created an expectation that cloud services will “just work”, even though reality often falls short of that.

Computing infrastructure has become faster and cheaper over time, but a server today is not meaningfully more reliable than one from a decade ago, because the root causes of outages are often environmental or the result of third-party error.

Some outages over the past two years have been eyebrow-raising in their origin, effect or circumstances.

The fire that destroyed OVHcloud’s Strasbourg SBG2 facility in March 2021 was the result of a faulty repair to an uninterruptible power supply. Cooling systems failed to keep pace with the London heatwave in July 2022, leading to outages at Google Cloud Platform and Oracle Cloud Infrastructure. The 2020 Nashville bombing, although not cloud-specific, damaged a significant amount of telecoms equipment, leading to regional outages.

Given a rise in global temperatures owing to climate change – and a rise in political temperatures – the potential for climate- or extremism-related outages is real.

Of course, comparatively mundane factors also lead to outages, such as bad software deployments, software supply chain problems, power failures and networking issues ranging in severity from tripped-over cables to fibre cuts. Naturally, no discussion of outages would be complete without a mention of DNS- and BGP-related outages, which were cited as the root cause of incidents at Microsoft Azure, Salesforce, Facebook and Rogers Communications over the past two years.

Engineer like a storm is coming

If your application is mission-critical, deployment and instrumentation should reflect that. Consider where the single points of failure are – deploying only to one region in a single cloud provides no redundancy. A content delivery network (CDN) can serve cached versions of pages in the event of an outage, which is useful for relatively static content, though a CDN alone will not maintain full feature availability.
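To make the CDN point concrete, here is a minimal sketch of the "serve the last good copy when the origin is down" behaviour, similar in spirit to the stale-if-error idea in HTTP caching. The class name StaleOnErrorCache and the fetch_origin callable are hypothetical, introduced only for illustration; a real CDN also handles expiry, invalidation and much more.

```python
from typing import Callable, Dict


class StaleOnErrorCache:
    """Hypothetical helper: remember the last good response for each path
    and serve it if the origin later fails."""

    def __init__(self, fetch_origin: Callable[[str], str]) -> None:
        # fetch_origin is an assumed callable that returns the page body
        # or raises an exception when the origin is unavailable.
        self._fetch_origin = fetch_origin
        self._last_good: Dict[str, str] = {}

    def get(self, path: str) -> str:
        try:
            body = self._fetch_origin(path)
        except Exception:
            # Origin failed: fall back to the cached copy if one exists.
            if path in self._last_good:
                return self._last_good[path]
            # Nothing cached for this path, so the outage is visible.
            raise
        # Origin succeeded: refresh the cached copy and return it.
        self._last_good[path] = body
        return body
```

Pages that were fetched before the outage keep being served; anything never cached, or anything that must be computed per request such as a checkout, still fails. That is the limit the paragraph above describes.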

Deploying to multiple regions in a single cloud is the lowest-friction means of ensuring availability, although architecting a scalable application whose constituent components can be distributed involves significant engineering time and infrastructure cost. Operating and maintaining individual service units – including data stores – that are deployed to geographically separate facilities is a significant endeavour that needs thoughtful planning and institutional support to accomplish.
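A rough way to see why a second region helps is to estimate combined availability. The sketch below rests on two strong assumptions that are not from the article: regions fail independently, and failover between them is instantaneous and itself never fails. Shared dependencies such as DNS, BGP or a common control plane break the independence assumption, so treat the result as an upper bound. The function name combined_availability is illustrative.

```python
def combined_availability(per_region: float, regions: int) -> float:
    """Probability that at least one of `regions` independent regions is up.

    per_region: availability of a single region as a fraction (0.99 = 99%).
    regions:    number of regions able to serve traffic on their own.
    """
    if not 0.0 <= per_region <= 1.0:
        raise ValueError("per_region must be between 0 and 1")
    if regions < 1:
        raise ValueError("regions must be at least 1")
    # The service is down only if every region is down at the same time.
    all_down = (1.0 - per_region) ** regions
    return 1.0 - all_down


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, f"{combined_availability(0.99, n):.6f}")
```

Under these assumptions, two regions at 99% each give roughly 99.99%. In practice the gain is smaller, because correlated failures and failover errors eat into it.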

Arguments could be made here for multicloud: operating parallel infrastructure to eliminate a single point of failure is enticing, but expensive, complex and repetitive, requiring institutional knowledge of two different cloud platforms and accommodating both as equals in every step of your production processes.

Similarly, compelling arguments could be made in these circumstances for hybrid cloud, but this too is complex. Some of this complexity can be managed through initiatives such as AWS Outposts, Azure Stack Hub and IBM Cloud Satellite, which provide consistent operating environments across public and private infrastructure.

Using these offerings as the sole hedge against outages is short-sighted – it exchanges reliability problems for complexity problems, introducing a new avenue from which outages could occur.

You need site reliability engineering

By adopting site reliability engineering (SRE) to create scalable and reliable systems, it is possible to usefully embrace complexity and increase reliability with careful planning, clearly articulated roles and well-defined incident management processes.

Site reliability engineers are generally tasked with reducing “toil” – repetitive, manual work directly tied to running a service – as well as defining and measuring reliability goals: the service-level indicators and service-level objectives that are tied to the SLAs of a cloud or infrastructure provider. Measuring these, and application performance generally, is achieved with observability tools, which let site reliability engineers and other troubleshooters ask questions about an environment without having to know, before an incident, what will need to be asked.
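As an illustration of how a service-level indicator (SLI) relates to a service-level objective (SLO), the sketch below computes a request-based availability SLI and the share of the "error budget" still unspent. The names ErrorBudget and allowed_downtime_minutes, the 30-day window and the request counts are assumptions made for the example, not values from any provider's SLA.

```python
from dataclasses import dataclass


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full downtime a time-based SLO tolerates over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


@dataclass(frozen=True)
class ErrorBudget:
    slo_target: float      # e.g. 0.999 for "99.9% of requests succeed"
    total_requests: int    # requests observed in the measurement window
    failed_requests: int   # requests that counted as failures

    @property
    def sli(self) -> float:
        """Measured fraction of successful requests."""
        if self.total_requests == 0:
            return 1.0
        return 1.0 - self.failed_requests / self.total_requests

    @property
    def allowed_failures(self) -> float:
        """Failures the SLO permits for this volume of traffic."""
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def budget_remaining(self) -> float:
        """Fraction of the error budget left; negative means the SLO is missed."""
        allowed = self.allowed_failures
        if allowed == 0:
            return 1.0 if self.failed_requests == 0 else 0.0
        return 1.0 - self.failed_requests / allowed


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"SLI: {budget.sli:.5f}")
    print(f"Budget remaining: {budget.budget_remaining:.0%}")
    print(f"Downtime allowed at 99.9% over 30 days: {allowed_downtime_minutes(0.999):.1f} min")
```

A budget that is shrinking quickly is a signal to slow risky changes; one that is largely unspent suggests there is room to ship faster.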

Although there are different approaches to implementing SRE – and by extension, defining the responsibilities of reliability engineers – there is a distinction between site reliability engineers and platform teams. Platform teams are tasked with building out the infrastructure in an IT estate; site reliability engineers fill multidisciplinary roles tasked with ensuring reliability in the infrastructure, applications and tooling used by an organisation to deliver a product or service to customers.

Assume the worst, but hope for the best

The ubiquity of cloud platforms leads to visibility among consumers that datacentre operators do not have – services such as Downdetector illustrate the relationship between cloud outages and outages of the consumer brands that use those cloud platforms. Downdetector and, internally, observability tools provide a real-time understanding of cloud outages that may not be reflected in the service status pages of a cloud platform.

The supplier-provided dashboards require manual intervention to acknowledge a service degradation or outage, making them an editorial product, not an automated real-time view of the service status of a cloud platform. That is not to imply wrongdoing – there are useful reasons to limit information, particularly to avoid tipping off threat actors about the degree to which a service is stressed by an attack.

Cloud platform operators are, naturally, working to improve reliability and reduce the effect of outages. Microsoft’s introduction of Azure Availability Zones to separate infrastructure into physically distinct locations within the same region is one attempt to improve overall reliability, and IBM’s work to strengthen platform reliability has reduced major incidents by 90% in a year.

Disruptions in cloud platforms, network hiccups – for infrastructure or users – and the unpredictable effects of software changes or “code rot” all mean there is practically no way to guarantee perfect uptime of an application. But thoughtful planning and resource allocation can reduce the severity of incidents. Proactively engineering for instability requires upfront investment, but this is preferable to emergency firefighting.

James Sanders is a principal analyst, cloud and infrastructure, at CCS Insight.
