100% system uptime, like a perfect circle, is more of a theoretical concept than a physical reality in system architecture. For this reason, we traditionally measure system uptime by the number of “nines” or its “class of nines.” For example, a database that operates without interruption 99.9% of the time has “3 nines” (“class three” of reliability).
That may sound pretty good…after all, it would mean the system is only down for 1.44 minutes each day. But to give you some perspective, a minute of downtime would equate to about $66,000 for Amazon—and that was in 2013. With Class Three reliability, those 1.44 minutes of downtime would have meant losses of almost $100,000 per day, every day of that year (for over $34 million in all). Wild, I know.
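The arithmetic behind those figures is simple enough to check yourself. A minimal sketch (using the article's 2013 estimate of $66,000 per minute for Amazon):

```python
# Back-of-the-envelope cost of "three nines" downtime, per the figures above.
per_minute_loss = 66_000       # estimated revenue lost per minute of downtime (Amazon, 2013)
daily_downtime_min = 1.44      # daily downtime at 99.9% availability

daily_loss = per_minute_loss * daily_downtime_min   # ~ $95,000 per day
yearly_loss = daily_loss * 365                      # ~ $34.7 million per year
```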
This is why “High Availability” (HA) is sought after by anyone who works with software, IT, and system architecture. In some situations, any downtime whatsoever can lead to catastrophic impacts. We like to see as many nines as possible, with “five nines” often held up as the optimal goal:
The Five Nines of System Uptime
| Reliability Class | System Uptime | Yearly Downtime | Monthly Downtime | Daily Downtime |
|---|---|---|---|---|
| One | 90% | 36.53 days | 73.05 hours | 2.40 hours |
| Two | 99% | 3.65 days | 7.31 hours | 14.40 minutes |
| Three | 99.9% | 8.77 hours | 43.83 minutes | 1.44 minutes |
| Four | 99.99% | 52.60 minutes | 4.38 minutes | 8.64 seconds |
| Five | 99.999% | 5.26 minutes | 26.30 seconds | 0.86 seconds |
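The table's figures fall out of one small calculation. A sketch, assuming a 365.25-day year (which is how the yearly numbers above were derived):

```python
def downtime(availability_pct):
    """Return (yearly, monthly, daily) downtime in minutes for a given uptime %."""
    down_fraction = 1 - availability_pct / 100
    daily = down_fraction * 24 * 60       # minutes of downtime per day
    yearly = daily * 365.25               # assumes a 365.25-day year
    monthly = yearly / 12
    return yearly, monthly, daily

# Class Three: ~525.96 min/year (8.77 hours), 43.83 min/month, 1.44 min/day
yearly, monthly, daily = downtime(99.9)
```

Each added nine cuts every figure by a factor of ten, which is why the jump from "four nines" to "five nines" is so expensive relative to the seconds it buys.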
So, what does it take to get as close to five-nines of availability as possible for your system?
Two Styles of Redundancy
The more single points of failure you can eliminate, the better. The common example is load balancing between multiple servers: if you've got two servers and one goes down, traffic gets sent over to the other. But there are really two kinds of redundancy:
- Passive: A system with enough excess capacity to absorb a decline in performance. Say an office has twelve fluorescent lights overhead: if one blows out, the remaining eleven still put out enough lumens to adequately light the room for the rest of the day.
- Active: Equivalent systems are built into the design and can be switched to in order to prevent a decline in performance. If one system fails, it can be bypassed in favor of a working one. Think of packing a spare parachute (in case the first one doesn't open). Internet routing and other complex systems use this style of redundancy, redirecting load to working elements. At OpenWater, we did this with our Hull Swap and Tugboat tools.
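The active style above boils down to "try the next equivalent replica until one works." A minimal, transport-agnostic sketch (the replica callables are stand-ins for whatever your real servers expose):

```python
def call_with_failover(replicas):
    """Active redundancy: try each equivalent replica until one succeeds."""
    last_err = None
    for replica in replicas:
        try:
            return replica()          # first healthy replica wins
        except Exception as err:
            last_err = err            # this replica is down; try the next one
    raise RuntimeError("all replicas failed") from last_err
```

A load balancer is essentially this loop plus health checks and traffic spreading; the failover path is what keeps a single dead server from becoming user-visible downtime.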
Instead of worrying about what would happen if a hurricane knocked out the power to your server room, invest in redundant servers, backups, and read replicas spread across multiple geographic zones. That way, a regional outage has less chance of impacting overall system availability.
Virtual backup servers in the cloud are incredibly hard to break (at least physically). They can still be digitally overloaded, but their processing power isn't tied to one physical location that has to worry about local downtime.
More and more applications are moving toward microservices. In other words, rather than one self-contained system, each function is broken out into its own application, and the pieces are integrated with each other. If one piece goes down, the rest of the system experiences only partial impact.
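That "partial impact" property is usually implemented as graceful degradation: each service is called independently, and a failure hollows out only its own section of the response. A minimal sketch (the service names are made up for illustration):

```python
def render_dashboard(services):
    """Call each independent service; a failure degrades only its own section."""
    sections = {}
    for name, call in services.items():
        try:
            sections[name] = call()
        except Exception:
            sections[name] = "unavailable"   # partial impact, not a total outage
    return sections
```

Compare this to a monolith, where the same failure would typically take down the entire response rather than one section of it.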
All of these can be distilled down to the old saying, “don’t put all your eggs in one basket.” System uptime can only benefit from more redundancy up and down the levels of your stack—passively, actively, physically, virtually, and internally. Availability goes up when the impact of a single point of failure goes down—it’s as simple as that.