Nobody likes to be dragged out of bed at two in the morning because the system went down. Whether it’s a marketing site or a full suite of applications that drive your business, we strive for uptime and fight downtime.
Of course, you sometimes hear arguments like, “well, no one is really looking at that site, so it’s not that urgent,” or “it’s just a prototype, so we aren’t committed to keeping it live.” These are bad arguments because they don’t address the right question: “What would our customers reasonably expect?”
Setting that expectation is key and, unfortunately, it’s usually done after the fact.
A Service Level Agreement exists to support the business rules. We all want 100% reliability and we should design for it, but the SLA is about reality: things break, and we have to prepare our customers for that fact, agree on what we consider “broken,” and define how we handle a break when it happens.
Identify Business Requirements
We often start with IT DevOps tools when talking about managing an SLA, but in reality we have to understand the business needs first: they drive the architecture of the system and the metrics we use for “uptime,” and only then can we think about how we monitor our systems.
Business requirements should include:
- Audience: Who is going to be looking at this, and how critical are 24/7 access and response time to them?
- Return on Investment (ROI): If we have to build a multi-datacenter installation with a 24/7 response team, will that investment pay for itself?
- Outage Loss Calculation: How much does it really cost to be down? If you run ecommerce, how much revenue do you lose for every minute or hour that you can’t take orders, and will you get those orders back once the site returns?
- Intangible Losses: Even if you do get those orders later, what negative effect does the outage have? Some customer relationships can absorb it, everyone laughs and stays friends, but usually eroding your uptime erodes your business.
- Business Remedy/Penalties: You may have actual contractual obligations to provide uptime for online services, and those contracts may define penalties for not meeting them. Even if it’s just one customer who can hit you with penalties, that cost may be too great to bear for a shaky system that goes down regularly.
Define SLA Metrics
Now that we have a clear understanding of what we’re obligated to do, we can define the actual metrics that measure whether we’re meeting those obligations. This includes:
- Uptime: Defining your uptime in “nines” is common, but the main goal is to understand how much downtime is acceptable (an hour a day? an hour a month? an hour a year?) and how you measure it.
- Response times: It doesn’t have to be “down” to be unusable; if it takes two minutes for a page to load and people are bailing, you’re as good as down. Different audiences have different needs, so defining “slow” is important in defining “functionally down” and avoiding subjective arguments about what we think is “slow.”
- User Experience: Just like “slow” can be subjective, if the customer feels the system isn’t working as advertised, it can still be considered “down.” Odd responses, links that can’t be followed, modals hidden from view (but maybe accessible through monitors) can all start the phones ringing and the complaints flooding in.
- Failover Experience: When we build Highly Available systems we expect them to fail over seamlessly – that is, the customer never even knows they moved from one server to another. Unfortunately, that failover isn’t always perfect. If the business requirements say customers shouldn’t lose their work mid-session, then the failover needs to be designed and monitored to support a clean transition.
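The “nines” arithmetic in the Uptime bullet is worth making concrete. Here’s a minimal sketch of the conversion; the availability targets shown are examples, not recommendations, and a real SLA should state its own measurement window:

```python
# Convert an availability target ("nines") into allowed downtime.
# Assumes a 365-day measurement year for simplicity.

SECONDS_PER_YEAR = 365 * 24 * 60 * 60

def allowed_downtime_seconds(availability: float,
                             window_seconds: int = SECONDS_PER_YEAR) -> float:
    """Seconds of downtime permitted per window at a given availability (e.g. 0.999)."""
    return window_seconds * (1.0 - availability)

for target in (0.99, 0.999, 0.9999):
    hours = allowed_downtime_seconds(target) / 3600
    print(f"{target:.2%} uptime allows {hours:.2f} hours of downtime per year")
```

Two nines sounds respectable until you see it permits roughly 87 hours of downtime a year; each extra nine cuts the budget by a factor of ten, which is why the business needs to pick the number.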
Build the Operations Toolset
NOW we can talk about monitoring. We’ve defined the business goals and expectations, we’ve defined the actual numbers to use as our performance indicators, and we understand what the customer should expect. The actual Operations toolset should include:
- Alert Triggers: When we look at the metrics we know we need to monitor more than a 200 OK response; we need to have triggers on all those metrics like response times, functional flows, failovers, etc.
- Response Process: That 2AM alert often results in a fire drill where people are trying to figure out who to call and when to escalate. While boring, we need to define clear response procedures with clear steps to identify the issue, up-to-date contact info (including timezones, so we know where it’s 2AM), a way to keep everyone updated, and a clear understanding of whether the incident violates the SLA or not.
- Disaster Recovery: We know we should have regular backups (how regular needs to be defined by the business) and we should have an easy way to spin up everything from scratch in the worst case scenario. While I’ve seen plenty of recovery processes that work, all too often it’s because of the heroic effort of one or two people and a bit of luck. The process needs to be easy and regularly tested so you can get back to bed and no one really knows it was a disaster.
- User Notifications: This, again, needs to be defined by the business rules because we have a tendency to not want to admit mistakes and failure. Sometimes an outage isn’t really an outage and we don’t have to publish it on an uptime report. Sometimes we have contractual obligations to a few key clients who we notify but don’t bother with a public announcement. Sometimes we need to do a press release… The point is we need it defined ahead of time and there needs to be a clear stakeholder responsible for managing that customer communication around outages.
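To make the Alert Triggers point concrete, here’s a minimal sketch of a check that treats a slow-but-200 response as an SLA failure rather than a pass. The URL, the two-second threshold, and the split between probing and evaluating are all hypothetical placeholders; the real thresholds come from the business requirements above:

```python
import time
import urllib.request
from dataclasses import dataclass

@dataclass
class CheckResult:
    ok: bool
    reason: str

# Hypothetical SLA threshold -- in practice this number comes from the
# business requirements, not from an engineering guess.
MAX_RESPONSE_SECONDS = 2.0

def evaluate(status_code: int, elapsed_seconds: float,
             max_seconds: float = MAX_RESPONSE_SECONDS) -> CheckResult:
    """A 200 that arrives too late is still "functionally down"."""
    if status_code != 200:
        return CheckResult(False, f"bad status {status_code}")
    if elapsed_seconds > max_seconds:
        return CheckResult(False,
                           f"slow: {elapsed_seconds:.2f}s > {max_seconds:.2f}s")
    return CheckResult(True, "healthy")

def probe(url: str) -> CheckResult:
    """Fetch the URL once and evaluate it against the SLA thresholds."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=10) as resp:
        status = resp.status
    return evaluate(status, time.monotonic() - start)
```

Separating `evaluate` from `probe` keeps the SLA logic testable without a network, and the same pattern extends to functional flows and failover checks: measure, compare against the agreed numbers, and alert on the comparison rather than on a bare status code.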
A lot of this stuff seems super basic, and in our guts we usually feel we have it covered. But as legislation adds rules about what we define as “ours” and “our customers’,” the responsibility grows. And as our customers become more savvy and demanding, being prepared for failure means the difference between “Oops” and “Ops.”