Towards Operational Excellence
I want to express my gratitude to my colleagues and friends Ricardo Sueiras, Matt Fitzgerald, and Boaz Ziniman for their valuable feedback.
Since I published my blog series Towards Operational Excellence, I have received quite a lot of feedback and requests. One, in particular, stood out:
“Can you share an operational excellence review template?”
Operational Readiness Review
In this blog post, I will share with you my “lightweight” (but not so lightweight) Operational Readiness Review (ORR) template.
An ORR is a rigorous, evidence-based assessment that evaluates a service’s operational state. It is often highly specific to a company, its culture, and its tools. Yet all ORRs share the same goal: helping you find blind spots in your operations.
This template, which I hope will help you get started, is based on my two decades of experience writing application software, deploying servers, and managing large-scale architectures. I have refined it over the years while helping customers operate software systems in the AWS cloud.
This ORR template is by no means complete. Instead, treat it as a starting point to get the ball rolling for you and your company. The most important thing is that it makes you think about the different aspects of software operations, to minimize the risk of failure once the code hits production.
How should you use this ORR template?
As mentioned previously, this is not THE template — it is A template — so treat it more as a mechanism for regularly evaluating your workloads, identifying high-risk issues, and recording your improvements.
More importantly, make it yours. Add your own experience to it. Adapt it to your culture, to your needs.
Can you have the right answers to all questions?
Very unlikely at first, but over time it should be your goal. Again, it is more of a learning path to support continuous improvement. Running ORRs regularly makes it easy to capture point-in-time milestones and track improvements to your operations.
Who should do an ORR?
An ORR should preferably be done with the entire service team: the product owner, the technical product manager, backend and frontend developers, designers, architects, and everyone else who was involved with the service in one way or another. The more diversity, the better: we want to avoid confirmation bias as much as possible.
When should you do an ORR?
A formal ORR should be done before the initial service launch and after any significant technological change. It should be repeated periodically (about once per year) to ensure that things haven’t drifted away from operational expectations but instead improved over time.
How does an ORR differ from an AWS Well-Architected review?
While there is some overlap, the AWS Well-Architected review gives customers and partners a means to evaluate architectures and implement designs that can scale over time. It describes the key concepts, design principles, and architectural best practices for designing and running workloads in the cloud. An ORR, on the other hand, focuses on the operational aspects of a particular service.
Operational Readiness Review Template
The ORR template is organized as follows:
1 — Service Definition and Goals
2 — Architecture
3 — Failures and Impact
4 — Risk Assessment
5 — Monitoring, Metrics, and Alarms
6 — Testing
7 — Deployment
8 — Operations
9 — Disaster Recovery
NOTE: As you may have noticed, I didn’t include security in there! And for a good reason — security must have its own, in-depth review.
1 — Service Definition and Goals
Describe what your service does from the customer’s point of view.
Describe your operational goals for the service.
What is the SLA of the service?
What are the business scaling drivers correlated with your service? (e.g., number of users, sales, marketing, ad-hoc, …)
Are you conducting an in-depth security review of your service?
2 — Architecture
Describe the architecture of your service. Call out the critical functionalities. Identify the different components of the system and how they interact with one another.
Describe each component of your system.
Does your service support auto-scaling? Describe the mechanisms and expectations (see the sketch after this list).
Does your architecture handle a sudden surge of traffic?
What parts of your architectural design reduce the blast radius of failures? (discuss bulkheads, cells, shards, etc.)
Do you have any single points of failure? If you do, explain why, and what is done to minimize the impact of a failure.
Explain the different database and storage choices.
List all customer-facing endpoints, and explain what each does and which components and dependencies it has.
List all dependencies that your service takes.
What is the anticipated request volume for each component and dependency of your system?
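To keep the auto-scaling question concrete, it helps to attach the actual scaling configuration to the review rather than a verbal description. Below is a minimal boto3 sketch of a target-tracking policy; the ECS cluster and service names, capacity limits, and target value are hypothetical placeholders, and your own mechanism (EC2 Auto Scaling groups, Lambda concurrency, a Kubernetes HPA, …) will look different.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical example: scale an ECS service between 2 and 20 tasks,
# targeting 60% average CPU utilization.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=20,
)
autoscaling.put_scaling_policy(
    PolicyName="cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,   # scale out quickly on a surge
        "ScaleInCooldown": 120,   # scale in more conservatively
    },
)
```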
3 — Failures and Impact
Explain how your service is impacted by the failure of each of your components and dependencies.
What is the failure mode for each of the components? (fail-open vs. fail-closed)
Explain the impact on customer experience for the failure of each component and each dependency.
What are the limits imposed on your service by your dependencies? How are these limits tracked?
Do you communicate your scaling requirements to teams that own services you’ve taken dependencies on?
Does your service impose limits on customer resources?
Can you increase limits without making a deployment?
Can you increase limits on a per-customer basis?
Describe the resilience to failure of each of your components (discuss in particular multi-AZ, self-healing, retries, timeouts, back-off, throttles, and limits put in place)
Can the service tolerate an availability zone (AZ) failure without impact?
Can your service sustain production traffic with one AZ down? (ref. static stability)
What is the retry/back-off strategy for each of your dependencies? (see the sketch after this list)
What happens when your customers hit limits and get throttled? Can they raise them? How?
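To illustrate the retry/back-off question, here is a minimal Python sketch of capped exponential back-off with full jitter. TransientError and the parameter values are placeholders; in practice, prefer the retry support your SDK already ships with, and make sure retries respect the timeouts and throttles discussed above.

```python
import random
import time


class TransientError(Exception):
    """Placeholder for whatever retryable error your dependency raises."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.2, max_delay=5.0):
    """Retry a flaky dependency call with capped exponential back-off and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and let the caller decide (fail-open vs. fail-closed)
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))  # full jitter avoids synchronized retry storms
```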
4 — Risk Assessment
What are your operational risks?
What scalability concerns are you worried about?
What features did you cut to meet your deadline?
What are the top three things that you believe will catch fire first?
Do you keep track of your dependencies and their criticality? Do you review them regularly?
Do you understand the cost/economics relationship of the service to scaling?
5 — Monitoring, Metrics, and Alarms
How do you measure and monitor the end-to-end customer experience?
Do you monitor for single-customer experience?
Do you alarm on poor overall customer experience?
Do you alarm on poor single-customer experience?
How do you trace customer requests in your system?
What are you alarming on? List all of your alarms, with the period, threshold, and severity of each (see the sketch after this list).
Are your dashboards clear? Does everyone know what to look at?
Are there metrics you monitor that don’t have alarms? Which? Why?
What kind of health-checks does your system monitor? (discuss in particular whether they are shallow or deep, whether they use a cache, async vs. sync, etc., and the associated risks)
Do you monitor each external dependency and alarm on failure conditions?
Do you monitor your dependency usage and remaining allowance?
Do you monitor the hosts for disk failure?
Do you monitor disk space utilization?
Do you have log-rotation in place?
Do you monitor for host CPU and memory utilization?
Do you monitor for certificate expiration?
Do you monitor the latency of synchronous and asynchronous calls?
Do you auto-cut tickets on alarms?
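To make the alarm inventory concrete, here is a minimal boto3 sketch of a single CloudWatch alarm on p99 latency. The alarm name, load balancer dimension, threshold, and SNS topic ARN are all hypothetical and must be replaced with your own values.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical alarm: page the on-call when the p99 target response time
# of an Application Load Balancer stays above 2 seconds for 5 minutes.
cloudwatch.put_metric_alarm(
    AlarmName="checkout-api-p99-latency-high",
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/checkout/1234567890abcdef"}],
    ExtendedStatistic="p99",
    Period=60,
    EvaluationPeriods=5,
    Threshold=2.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",  # missing data often means something is very wrong
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-page"],
)
```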
6 — Testing
Describe the overall test strategy you follow.
When do you run tests? Do you have tests before and after conducting code review? Do they run automatically, or are developers running tests manually?
Do you test using “fake” accounts?
What’s the percentage of public-facing APIs covered by tests?
Do you test your dependencies? What assumptions do you make on these?
How do you verify that your service’s monitoring and alarming function as expected?
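One lightweight way to start answering that last question is to test the monitoring configuration itself in your test suite. The sketch below assumes CloudWatch and a hypothetical list of required alarm names; it only verifies that the alarms exist and actually page someone, so it complements, rather than replaces, game days and synthetic canaries.

```python
import boto3
import pytest

# Hypothetical alarm names; generate this list from your own alarm inventory.
REQUIRED_ALARMS = [
    "checkout-api-p99-latency-high",
    "checkout-api-5xx-rate-high",
]


@pytest.mark.parametrize("alarm_name", REQUIRED_ALARMS)
def test_alarm_exists_and_pages_someone(alarm_name):
    cloudwatch = boto3.client("cloudwatch")
    alarms = cloudwatch.describe_alarms(AlarmNames=[alarm_name])["MetricAlarms"]
    assert alarms, f"Missing alarm: {alarm_name}"
    assert alarms[0]["ActionsEnabled"], f"Actions are disabled on {alarm_name}"
    assert alarms[0]["AlarmActions"], f"{alarm_name} fires, but nobody gets notified"
```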
7 — Deployment
How does your deployment procedure work? List the actions and estimated time of each step in the deployment pipeline.
What are the manual touch-points in your system? Why aren’t they automated? What are the risks associated with each of these touch-points?
What is your procedure to define and approve a change in production?
Do you have a mandatory code review for each change? How do these changes get approved? Do you have several people approving changes?
How do you roll back a change?
Do you test the rollback procedures before deployment?
How do you deploy the configuration to different stages?
Do you error-and-syntax check your configuration before deployment? (see the sketch after this list)
What are the dependencies for deployment?
Does your deployment support immutability? Or does it update/upgrade software in place?
Do you perform load testing before deploying to production?
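For the error-and-syntax check question, even a tiny validation step in the pipeline catches a surprising number of broken deployments. A minimal sketch, assuming JSON or YAML configuration files and PyYAML being available:

```python
import json
import sys

import yaml  # pip install pyyaml


def validate_config(path):
    """Fail the pipeline early if a configuration file is syntactically broken."""
    with open(path) as f:
        if path.endswith((".yaml", ".yml")):
            yaml.safe_load(f)
        elif path.endswith(".json"):
            json.load(f)
        else:
            raise ValueError(f"Unknown configuration format: {path}")


if __name__ == "__main__":
    for config_path in sys.argv[1:]:
        validate_config(config_path)
        print(f"OK: {config_path}")
```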
8 — Operations
Describe what the on-call rotation for your service looks like.
Do you have easily available and complete links to the documentation for the service?
Do you have well defined, documented, and accessible recovery procedures?
Describe the escalation path in the event of an outage (include timing expectations).
Does the trouble-ticketing system integrate with the monitoring system?
Does the paging system integrate with the monitoring system?
9 — Disaster Recovery
Do your on-calls have full access to connect to, debug, and configure the service?
Are you preventing/discouraging your team from using full-admin access roles except when absolutely necessary?
Do you have read-only roles for your team to use for non-critical situations?
Do you have up-to-date escalation policies easily accessible by anyone in the company?
How do you keep the escalation policies up-to-date?
Do you have platform-wide locks that prevent or delay routine tasks in case of an active disaster?
Do you have a well-defined process for DR situations? (e.g., war rooms, isolation, calls, internal & external communication)
Are you practicing your disaster recovery procedure?
Do you have measured and verified RTO and RPO?
Are your DNS TTLs set to sane values? (see the sketch after this list)
Do you have verified and tested tools deployed to query logs to measure the impact on customers?
Do you have a process for identifying the causes of outages? (e.g., postmortem, correction-of-error, etc.)
Do you back up critical data?
Do you practice backup restoration regularly?
Do you regularly practice fail-overs?
Can your on-call team enable throttles to protect the service from user load?
Can your on-call team increase limits in case of emergencies?
Do you update run-books as the service changes?
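For the DNS TTL question, a quick check like the one below (using the dnspython package, with a hypothetical record name and threshold) can run alongside your other operational audits:

```python
import dns.resolver  # pip install dnspython

MAX_ACCEPTABLE_TTL = 60  # seconds; pick a value that matches your failover expectations

# Hypothetical record; check every customer-facing endpoint you listed earlier.
answer = dns.resolver.resolve("api.example.com", "A")
ttl = answer.rrset.ttl
if ttl > MAX_ACCEPTABLE_TTL:
    print(f"WARNING: TTL is {ttl}s; failover via DNS will be slower than expected")
else:
    print(f"OK: TTL is {ttl}s")
```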
That’s all for now, folks. If you want to download, fork, or suggest some changes, this template is on my GitHub account here. Please contribute and help me improve it.
I hope you’ve enjoyed this post. I would love to hear what works and what doesn’t work for you, so please don’t hesitate to share your feedback and opinions. Thanks a lot for reading :-)
— Adrian