Uptime and Nines
2020-11-03 12:35:00
We’ve talked about putting “service” into Service Level Agreements, but there's another word we need to talk about, and that's uptime.
Uptime is usually defined in nines, which refer to the percentage of time your service is up: one nine is 90% uptime, two nines is 99%, and so on. Most companies don't offer one nine because you could be down more than a month a year (10% of a year is about 36.5 days), which works out to roughly two and a half hours a day. Customers are not going to accept that level of downtime.
On the other extreme, what customers would love is something like nine nines (99.9999999%), which means you're only going to be down for about 31 milliseconds a year. This is an impossible standard to meet, so you have to choose something between what your customers will accept and what you can actually deliver.
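To make the arithmetic concrete, here is a minimal sketch that turns a number of nines into a yearly downtime budget. The function name `allowed_downtime_seconds` is made up for this post, and the sketch assumes a 365-day year.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds, ignoring leap years


def allowed_downtime_seconds(nines: int) -> float:
    """Return the yearly downtime budget, in seconds, for a given number of nines."""
    # One nine leaves 10% of the year as downtime, two nines 1%, and so on.
    downtime_fraction = 10 ** -nines
    return SECONDS_PER_YEAR * downtime_fraction


for nines in (1, 2, 3, 4, 5, 9):
    seconds = allowed_downtime_seconds(nines)
    print(f"{nines} nines: {seconds:.3f} seconds per year ({seconds / 86400:.4f} days)")
```

One nine comes out to 36.5 days a year, and nine nines comes out to roughly 31.5 milliseconds, which matches the figures above.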
When we think about actual outages, we have to understand what constitutes “down.” We know what completely down looks like: the service doesn't work at all. But there are also things that impact performance enough to be considered “down” even when functional tests are still passing.
Slow response times: If it takes too long for the payload to come back, it could be as good as down because transactions are not being completed.
Incorrect responses: If you have unexpected results, like getting XML back instead of JSON or getting the wrong payload back, that still constitutes an outage even though the system appears to be up.
Security breaches: Everything may be working perfectly, but if there’s a breach we may have to shut the system down. Even if we don't shut it down, we still have to escalate, notify, and basically follow all the same processes that we would if we were down.
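The first two cases can be sketched as a health probe that looks past “did it respond?” This is only an illustration: the `ProbeResult` fields, the two-second latency threshold, and the expectation of a JSON content type are all assumptions for this example, not values from any particular SLA or monitoring tool.

```python
from dataclasses import dataclass
from typing import Optional

MAX_LATENCY_SECONDS = 2.0  # assumed threshold; use whatever your SLA promises


@dataclass
class ProbeResult:
    """Hypothetical summary of one monitoring request against the service."""
    status_code: int
    latency_seconds: float
    content_type: str
    payload_matches: bool  # True if the body was the payload we expected


def outage_reason(probe: ProbeResult) -> Optional[str]:
    """Return why this probe counts as 'down', or None if the service looks healthy."""
    if probe.status_code >= 500:
        return "completely down"
    if probe.latency_seconds > MAX_LATENCY_SECONDS:
        return "slow response"
    if not probe.content_type.startswith("application/json"):
        return "incorrect response format"
    if not probe.payload_matches:
        return "incorrect payload"
    return None


# The service answered with a 200, but it sent XML instead of JSON.
print(outage_reason(ProbeResult(200, 0.3, "application/xml", True)))
```

A probe like this would flag the XML response as an outage even though a simple “is it up?” check would pass.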
These unplanned outages all count as violations of the agreement, but not all outages violate the SLA. You can write planned outages into your contracts for things like scheduled maintenance. Your SLA could say that the service is going to be down every third Saturday. It can also say that, as long as customers get prior notice, you can shut the service down for a period of time without it counting against your SLA.
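Here is a rough sketch of how excluding planned maintenance changes the math. It assumes one common convention, where planned windows are removed from the measurement period entirely; your contract may define it differently. The function names and the numbers in the example are invented for illustration.

```python
def measured_uptime(total_minutes: float, planned_minutes: float,
                    unplanned_minutes: float) -> float:
    """Uptime as a fraction of scheduled time, with planned outages excluded."""
    scheduled_minutes = total_minutes - planned_minutes
    if scheduled_minutes <= 0:
        raise ValueError("planned outages cover the whole measurement period")
    return 1 - unplanned_minutes / scheduled_minutes


def meets_sla(uptime: float, nines: int) -> bool:
    """Check a measured uptime fraction against a target expressed in nines."""
    return uptime >= 1 - 10 ** -nines


# Example: a 30-day month with a 4-hour planned window and 40 minutes of unplanned downtime.
month_minutes = 30 * 24 * 60
with_exclusion = measured_uptime(month_minutes, planned_minutes=240, unplanned_minutes=40)
without_exclusion = measured_uptime(month_minutes, planned_minutes=0, unplanned_minutes=280)

print(f"planned window excluded: {with_exclusion:.5f}, three nines met: {meets_sla(with_exclusion, 3)}")
print(f"everything counted:      {without_exclusion:.5f}, three nines met: {meets_sla(without_exclusion, 3)}")
```

In this made-up month, the same service meets three nines when the maintenance window is written into the agreement and misses it when every minute of downtime counts.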
We should also consider “unusual use patterns” or “customer-generated outages.” If a customer is getting super slow responses because they're using the service in an unexpected way, it may be a nuisance, but it isn't necessarily a violation of what was promised for uptime.
Remember, when we make promises about uptime, we have to understand how much downtime each level of nines allows, we have to understand what counts as downtime, and we need to allow for planned outages in our agreements and in our communications with our customers.