When we say "SLA," we often forget that the S stands for "Service."
A Service Level Agreement means that we have agreed to provide a certain level of service to our customers. As long as we're meeting that agreement, we're building trust with our customers -- and trust is the core of what we do.
You set the level of service in your agreement, and you can start small. 7-Eleven was originally open from 7 a.m. to 11 p.m., not 24/7. Once the company had built the competency to stay open around the clock, it changed its Service Level Agreement from an "uptime" of 16 hours a day to 24 hours a day.
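To put numbers on that, here is a minimal sketch of how a promised number of hours per day translates into an availability target, and how much downtime a target leaves room for. The function names and the 30-day period are assumptions chosen for illustration, not part of any standard.

```python
HOURS_PER_DAY = 24
MINUTES_PER_DAY = HOURS_PER_DAY * 60


def uptime_fraction(promised_hours_per_day: float) -> float:
    """Fraction of each day the service is promised to be available."""
    if not 0 <= promised_hours_per_day <= HOURS_PER_DAY:
        raise ValueError("promised hours must be between 0 and 24")
    return promised_hours_per_day / HOURS_PER_DAY


def allowed_downtime_minutes(target: float, period_days: int = 30) -> float:
    """Minutes the service may be unavailable in a period and still meet the target."""
    if not 0 <= target <= 1:
        raise ValueError("target must be a fraction between 0 and 1")
    return (1.0 - target) * period_days * MINUTES_PER_DAY


# The 7-to-11 schedule promises 16 of 24 hours, roughly a 66.7% target.
seven_to_eleven = uptime_fraction(16)

# A hypothetical 99.9% target leaves about 43 minutes of downtime in 30 days.
three_nines_budget = allowed_downtime_minutes(0.999)
```

The smaller the promise, the larger the downtime budget, which is why starting small gives you room to build competency before you tighten the agreement.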
We can write whatever we feel comfortable with in our agreements, provided the customers will accept it. We should think through each area of our SLA starting not with the technical part, but with the business delivery.
First, we need to consider the audience: who we are going to deliver this service to, what their expectations are, and whether there is enough of an audience and enough engagement to generate a return on investment, so we can judge whether the service will pay for itself.
Then we need to look at what it is going to cost if the service goes down. This isn’t just the obvious cost of labor or materials to fix the problem; there are also intangible costs such as damaged reputation and lost customers. There are direct penalties too, which sometimes come from the contractual agreement and sometimes from laws and regulations like GDPR, NYDFS or CCPA.
Once we know those rules, we can build metrics to make sure we're meeting them.
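As a rough illustration of the tangible part of that cost, the sketch below adds up lost revenue, repair labor, and a contractual penalty for one outage. Every input is a hypothetical figure you would supply from your own business, and the intangible costs (reputation, lost customers) are deliberately left out because they cannot be computed this simply.

```python
def tangible_downtime_cost(
    hours_down: float,
    revenue_per_hour: float,
    engineers_on_incident: int,
    engineer_hourly_rate: float,
    contractual_penalty: float = 0.0,
) -> dict:
    """Break down the directly measurable cost of a single outage.

    All parameters are assumptions for illustration; reputation damage,
    churn, and regulatory fines are not modeled here.
    """
    lost_revenue = hours_down * revenue_per_hour
    labor = hours_down * engineers_on_incident * engineer_hourly_rate
    total = lost_revenue + labor + contractual_penalty
    return {
        "lost_revenue": lost_revenue,
        "labor": labor,
        "contractual_penalty": contractual_penalty,
        "total": total,
    }
```

Even a crude breakdown like this makes it easier to decide how much the promised level of service is worth protecting.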
Uptime and Response Times
We need to monitor our uptime to make sure the service is available at the level we promised. This isn’t just an up/down check; response times need to be fast enough for real-world use. If a service is too slow, it might as well be down.
The user experience is part of that contract, and testing needs to make sure the consumer is getting the right payloads back in the right formats when they make a request. We also need to understand our failover experience: if a node goes down, how long does it take for the next one to come up? Are we losing transactions? The very processes we have in place to avoid an outage could give the appearance of an outage to the consumer.
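One way to express "too slow counts as down" is to fold a latency limit into the availability calculation itself. The sketch below assumes you already collect probe results somewhere; the `Probe` type and the 1000 ms limit are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Probe:
    ok: bool           # did the request succeed at all?
    latency_ms: float  # how long the response took


def is_usable(probe: Probe, max_latency_ms: float) -> bool:
    """A probe counts as 'up' only if it succeeded AND was fast enough."""
    return probe.ok and probe.latency_ms <= max_latency_ms


def availability(probes: Sequence[Probe], max_latency_ms: float) -> float:
    """Fraction of probes that were usable from the consumer's point of view."""
    if not probes:
        raise ValueError("need at least one probe to compute availability")
    usable = sum(1 for p in probes if is_usable(p, max_latency_ms))
    return usable / len(probes)


# Hypothetical sample: three probes succeeded, but one took 4.5 seconds.
sample = [
    Probe(ok=True, latency_ms=120),
    Probe(ok=True, latency_ms=4500),
    Probe(ok=False, latency_ms=0),
    Probe(ok=True, latency_ms=200),
]

# A plain up/down check would report 3 of 4 probes as up; with a
# 1000 ms limit only 2 of 4 count as usable.
measured = availability(sample, max_latency_ms=1000)
```

Measuring it this way keeps the reported number aligned with what the consumer actually experienced rather than with what the server thinks happened.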
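The failover question ("how long does it take for the next one to come up?") can be answered from the consumer's side by probing the service at a steady interval during a failover test and measuring the longest run of failures. This is a minimal sketch under that assumption; the sample format of `(timestamp_seconds, ok)` pairs is hypothetical.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, bool]  # (timestamp in seconds, did the probe succeed?)


def longest_outage_seconds(samples: Sequence[Sample]) -> float:
    """Longest gap the consumer saw, from first failed probe to next success.

    Samples must be sorted by timestamp. An outage still in progress at the
    end of the data is measured up to the last sample.
    """
    longest = 0.0
    outage_start = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp  # outage begins
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None  # service is back
    if outage_start is not None:
        longest = max(longest, samples[-1][0] - outage_start)
    return longest


# Hypothetical failover test probed every 5 seconds: a node is stopped,
# probes fail at t=5 and t=10, and the replacement answers at t=15.
failover_test = [(0, True), (5, False), (10, False), (15, True), (20, True)]
gap = longest_outage_seconds(failover_test)  # 10 seconds as seen by the consumer
```

The resolution is limited by the probe interval, so the real gap may be slightly longer or shorter than the measured one. Counting lost transactions would need a separate check, such as comparing requests sent with requests recorded.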
Alerting and Escalating
Once we know these things, we can set up our monitors to tell us when something is going wrong. By now, everything has been defined, so we know how the service is failing to deliver, and we have to be able to alert the business and internal stakeholders to the outage. It needs to be clear when to escalate, who escalates, and who decides whether or not a notification needs to go to the customers. Separately, the internal escalation process should make clear when to escalate and who does it. If the simple recovery processes don’t work, we need to have our complete disaster recovery plan in place to get back up and running, because once the service is down, the clock is ticking against that SLA.
A Service Level Agreement is about trust. It's about doing what you say you're going to do. Most importantly, it's about building and maintaining things that you actually can maintain to the level that you've promised.
Make sure that you're doing what you've agreed to, know that you can measure it so you know you're actually doing it, know how to fix it if it goes wrong, and know how to make it right for the customer so that you maintain that level of trust.
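Writing the escalation rules down as data is one way to make "when" and "who" unambiguous. The sketch below is only an illustration: the roles, time thresholds, and actions are hypothetical placeholders for whatever your own agreement and organization require.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes of outage before this step applies
    role: str           # who is responsible at this step
    action: str         # what that person is expected to do


# Hypothetical policy; replace roles and thresholds with your own.
POLICY = [
    EscalationStep(0, "on-call engineer", "acknowledge alert, try simple recovery"),
    EscalationStep(15, "incident manager", "alert internal stakeholders, decide on customer notification"),
    EscalationStep(30, "head of operations", "invoke the disaster recovery plan"),
]


def steps_due(policy: Sequence[EscalationStep], minutes_down: float) -> List[EscalationStep]:
    """Return every step whose threshold has been reached, earliest first."""
    due = [step for step in policy if minutes_down >= step.after_minutes]
    return sorted(due, key=lambda step: step.after_minutes)


# Twenty minutes into an outage, the first two steps should already be active.
active = steps_due(POLICY, minutes_down=20)
```

Because the policy is plain data, it can be reviewed by the business side and checked by the monitoring system against the same definition.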