TL;DR notes from articles I read today.
Designing resilient systems beyond retries: architecture patterns and engineering chaos
- Incorporate idempotency: an idempotent endpoint returns the same result for the same parameters, with either no side effects or side effects that execute only once - this makes retries safe. If an operation has side effects but cannot distinguish unique calls on its own, add an idempotency key parameter that the client must supply; without it, a retry cannot be made safely.
- Use asynchronous responses for ‘deferrable’ work: instead of depending on a successful call to a dependency that might fail, have the service return a successful or partial response to the client and retry the dependency in the background. This insulates the endpoint from downstream errors and reduces latency and resource use.
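A sketch of deferring work behind a queue, under assumed names (`handle_signup`, `drain_once`, an email dependency) that are illustrative, not from the post - the endpoint answers immediately and a background step retries the flaky call:

```python
import queue

# Deferred-work queue: the endpoint enqueues the dependency call and
# responds at once; a background worker drains the queue with retries.
work_q: "queue.Queue[str]" = queue.Queue()

def handle_signup(user: str) -> dict:
    """Respond immediately; the dependency call is deferred to the queue."""
    work_q.put(user)
    return {"status": "accepted"}

def drain_once(send_email) -> None:
    """One background step: attempt a deferred call, re-enqueue on failure."""
    user = work_q.get()
    try:
        send_email(user)
    except Exception:
        work_q.put(user)  # retry later instead of surfacing the error
```

In production the drain step would run in a worker loop with backoff; here it is a single function so the behaviour is easy to see.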
- Apply chaos engineering to test resiliency as a best practice: deliberately introduce latency or simulate outages in parts of the system so it fails and you can learn where to improve. However, minimize the ‘blast radius’ of chaos experiments in production - in practice, an experiment should be the opposite of chaotic:
- Define a steady state. Your hypothesis is that the steady state will not change during the experiment.
- Pick an experiment that mirrors real-world situations: a server shutting down, a lost network connection to a DB, auto-scaling events, a hardware switch.
- Pick a control group (which does not change) and an experiment group from the backend servers.
- Introduce a failure in an aspect or component of the system and attempt to disprove the hypothesis by analyzing metrics between control and experiment groups.
- If the hypothesis is disproved, the affected parts are in need of improvement. After making changes, repeat your experiment until confidence is achieved.
- Automate your chaos experiments, including automatically disabling the experiment if it exceeds the acceptable blast radius.
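The steps above can be sketched as a single check (all names and thresholds here are illustrative assumptions): compare a steady-state metric between control and experiment groups, and abort automatically if the blast radius is exceeded.

```python
import statistics

def run_experiment(control_metrics, experiment_metrics,
                   max_error_rate=0.05, tolerance=0.10) -> str:
    """Compare latency samples (seconds) from control vs experiment groups."""
    # Automatic kill switch: abort if too many requests breach 1s.
    error_rate = sum(m > 1.0 for m in experiment_metrics) / len(experiment_metrics)
    if error_rate > max_error_rate:
        return "aborted: blast radius exceeded"
    # Hypothesis: steady state (median latency) is unchanged under failure.
    drift = abs(statistics.median(experiment_metrics)
                - statistics.median(control_metrics))
    if drift > tolerance * statistics.median(control_metrics):
        return "hypothesis disproved: improve and repeat"
    return "hypothesis holds"
```

Real chaos tooling compares many metrics and halts the injected failure itself, but the control-vs-experiment comparison and the automatic abort are the core of the loop.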
Full post here, 6 mins read
Continuous testing - creating a testable CI/CD pipeline
For continuous testing, focus on confidence, implementation, maintainability, monitoring and speed (CIMMS):
- For greater confidence, pair testers with developers as they write code to review unit tests for coverage and to add service tests for business logic and error handling.
- To implement, use tools that give rapid feedback by running repeatable tests quickly. For service-level tests, inject specific responses/inputs into Docker containers or pass stubbed responses from integration points. For integration tests, run both services in paired Docker containers within the same network. Limit full-environment tests.
- Ensure tests are maintained and up to date. Create tests with human-readable logging, meaningful naming and commented descriptions.
- To monitor, use testing tools that integrate with CI/CD pipeline tools to make failures/successes visible and even send email alerts automatically. In production, labeling logs to trace a user’s path and capturing details of the user’s environment makes debugging easier.
- For speed, keep the test suite minimal. Let each test focus on only one thing and split tests to run in parallel where needed. Segregate tests to cover only changed areas, skipping those with no cross-dependencies.
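As a sketch of the stubbed-integration-point idea above (`PriceService` and the rates client are hypothetical names, not from the post): inject a canned dependency so the service test runs fast with no network.

```python
# Service under test: the rates client is injected, so tests can stub it.
class PriceService:
    def __init__(self, rates_client):
        self.rates = rates_client

    def price_in(self, amount: float, currency: str) -> float:
        return round(amount * self.rates.get_rate(currency), 2)

class StubRates:
    """Stubbed integration point: canned responses, no network call."""
    def get_rate(self, currency: str) -> float:
        return {"EUR": 0.9, "GBP": 0.8}[currency]
```

The same injection point lets an integration test swap in the real client running in a paired container.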
Avoid automating everything. Run manual exploratory tests at each stage to understand new behaviours and determine which of those need automated tests.
When pushing to a new environment, test environment rollback: reversing changes should not impact users or data integrity. Test the rollout process itself for production and run smoke tests. Keep monitoring by triggering known error conditions, and ensure monitoring captures them with sufficient information for easy debugging.
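A minimal smoke-test sketch of the kind the post suggests after a rollout - the endpoint list and `get` callable are assumptions for illustration; each check hits a critical path and reports enough detail to debug a failure:

```python
def smoke_test(get) -> list[str]:
    """Run quick post-rollout checks; return a list of failure descriptions."""
    failures = []
    for path, expected in [("/health", 200), ("/login", 200), ("/checkout", 200)]:
        status = get(path)  # `get` returns the HTTP status for a path
        if status != expected:
            failures.append(f"{path}: got {status}, expected {expected}")
    return failures
```

An empty result means the rollout passed; a non-empty one can gate an automatic rollback.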
Full post here, 7 mins read
Get these notes directly in your inbox every weekday by signing up for my newsletter, in.snippets().