I have an existing code base in which I need to replace a low-level layer, and I want to think through an error handling strategy for the new layer.
For handling interfaces to external systems, it makes sense to reject unexpected inputs and carry on. I think it also makes sense that errors in inputs passed from a higher layer to a lower layer are thrown upward as exceptions from the lower layer. But what about internal inputs/outputs that are passed between code within a layer?
Before I go into what I think fail-fast programming is, I need to cover design by contract.
Design by Contract
Design by contract is an approach to reducing the number of bugs in code. The idea is to specify pre-conditions, post-conditions, and invariants that must hold for the code to execute correctly. Below I define these terms and give a code example. Note that the code examples are a bit silly and certainly don't represent best coding practices but hopefully explain the concept simply.
Pre-conditions
A pre-condition is a condition that must be true before executing a code block. The code block won't execute as expected if the pre-conditions are false. A pre-condition can express what is required of the inputs or expectations about the internal state of a class.
    class Haystack {
        private enum Objects {
            Needle = 0, Pen = 1, Pencil = 2
        }

        private List<Objects> _buried;

        internal Haystack() {
            _buried = new List<Objects> {Objects.Needle};
        }

        internal int FindAndRemove(int obj) {
            // Pre-condition: obj must be 0, 1, 2
            int found = -1;
            for (int ii = 0; ii < _buried.Count; ++ii) {
                if ((int) _buried[ii] == obj) {
                    found = 1;
                    break;
                }
            }
            // Needles are too dangerous to remove!
            if (found == 1 && obj != (int) Objects.Needle) {
                _buried.Remove((Objects) obj);
            }
            return found;
        }
    }
Post-conditions
A post-condition is something that must be true once a code block has executed.
    class Haystack {
        private enum Objects {
            Needle = 0, Pen = 1, Pencil = 2
        }

        private List<Objects> _buried;

        internal Haystack() {
            _buried = new List<Objects> {Objects.Needle};
        }

        internal int FindAndRemove(int obj) {
            int found = -1;
            for (int ii = 0; ii < _buried.Count; ++ii) {
                if ((int) _buried[ii] == obj) {
                    found = 1;
                    break;
                }
            }
            // Needles are too dangerous to remove!
            if (found == 1 && obj != (int) Objects.Needle) {
                _buried.Remove((Objects) obj);
            }
            // Post-condition: found is either 1 or -1
            return found;
        }
    }
Invariants
An invariant always holds true after a certain point in the code (typically after initialisation code is executed).
    class Haystack {
        private enum Objects {
            Needle = 0, Pen = 1, Pencil = 2
        }

        private List<Objects> _buried;

        internal Haystack() {
            _buried = new List<Objects> {Objects.Needle};
            // Invariant: _buried is not null
            // Invariant: a needle is always buried
        }

        internal int FindAndRemove(int obj) {
            int found = -1;
            // Invariant: _buried is not null
            for (int ii = 0; ii < _buried.Count; ++ii) {
                // Invariant: ii is less than the length of _buried
                if ((int) _buried[ii] == obj) {
                    found = 1;
                    break;
                }
            }
            // Needles are too dangerous to remove!
            if (found == 1 && obj != (int) Objects.Needle) {
                _buried.Remove((Objects) obj);
            }
            // Invariant: a needle is always buried
            return found;
        }
    }
Fail-fast Strategy
A fail-fast strategy means the program crashes immediately when a pre-condition, post-condition or invariant is false. Instead of using comments to document the condition, "asserts" are used to crash the program. For example, a condition can be tested ...
    Trace.Assert(_buried != null);
    Trace.Assert(obj >= 0, $"obj is {obj}");
    Trace.Assert(obj <= 2, $"obj is {obj}");
... which crashes the program with a stack trace and a helpful line number. If a message is included, the message is output.
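Putting the pieces together, here is a minimal sketch of how the Haystack class above might look with its contract comments turned into asserts. The class name AssertingHaystack, the public visibility, and the extra buried Pen are assumptions made so the example can be exercised. What a failed Trace.Assert actually does depends on the runtime and on the configured trace listeners, so treat this as an illustration rather than a definitive implementation.

```csharp
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical variant of the Haystack class above with the
// contract comments replaced by asserts.
public class AssertingHaystack {
    public enum Objects {
        Needle = 0, Pen = 1, Pencil = 2
    }

    private readonly List<Objects> _buried;

    public AssertingHaystack() {
        // Assumption: a Pen is buried as well, so removal can be shown.
        _buried = new List<Objects> { Objects.Needle, Objects.Pen };
    }

    public int FindAndRemove(int obj) {
        // Pre-conditions: state is initialised and obj is 0, 1 or 2
        Trace.Assert(_buried != null);
        Trace.Assert(obj >= 0 && obj <= 2, $"obj is {obj}");

        int found = _buried.Contains((Objects) obj) ? 1 : -1;

        // Needles are too dangerous to remove!
        if (found == 1 && obj != (int) Objects.Needle) {
            _buried.Remove((Objects) obj);
        }

        // Invariant: a needle is always buried
        Trace.Assert(_buried.Contains(Objects.Needle), "needle went missing");
        // Post-condition: found is either 1 or -1
        Trace.Assert(found == 1 || found == -1, $"found is {found}");
        return found;
    }
}
```

With valid inputs none of the asserts fire: the first FindAndRemove(1) returns 1 and removes the pen, a second call returns -1, and FindAndRemove(0) keeps returning 1 because the needle is never removed.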
I have seen this strategy create very reliable code in a very large code base that has run 24/7 for over 10 years.
This only results in reliable code if some pretty big commitments are made, namely extensive testing. Automated test cases are the ideal, but they need to be of good quality and properly maintained. Code coverage metrics must be collected and analysed to ensure coverage is adequate.
Pros
It's hard to ignore a crash.
Developers implicitly become testers of all fail-fast code. A sufficiently trained developer with adequate time will at least raise a bug, if not investigate it.
Traces of crashes help developers zero in as close as possible to the cause of the crash.
Code is simpler because the code to check conditions is often less verbose than using try/catch blocks.
Code is better documented and the documentation is actively enforced. It can't go out of date because if it does, it crashes the program.
Code that has pre-conditions, post-conditions, and invariants leads to fewer bugs being introduced when the code is later modified. This is especially true when developers only read a fragment of code before making a change.
There is no error recovery code. Ideally if the system is critical, there is a fault-tolerance mechanism at the system level that triggers to recover from the fault.
Cons
It's hard to ignore a crash.
Under pressure, developers may choose to remove pre-conditions, post-conditions, and invariants if a root cause for a crash cannot be found.
Testing may be inadequate, leading to more bugs getting through to production. This is obviously a very big con because crashes in production are generally considered bad!
Catch-all Strategy
A catch-all strategy tries to continue on in the face of fault conditions. This usually involves surrounding all code blocks with try/catch statements.
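As an illustration, a catch-all wrapper might look like the sketch below. The names Importer, ParseQuantity and TotalQuantity are hypothetical and not taken from any real code base. The handler logs the problem and substitutes a default so that processing continues, and the caller has no way to tell that the total is wrong.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical example of the catch-all style.
public static class Importer {
    // Stands in for messages that would normally go to a log file.
    public static readonly List<string> Log = new List<string>();

    public static int ParseQuantity(string text) {
        try {
            return int.Parse(text);
        } catch (Exception e) {
            // Swallow the fault, log it, and carry on with a default.
            Log.Add($"bad quantity '{text}': {e.GetType().Name}");
            return 0;
        }
    }

    public static int TotalQuantity(IEnumerable<string> fields) {
        int total = 0;
        foreach (string field in fields) {
            total += ParseQuantity(field);
        }
        return total;
    }
}
```

Calling Importer.TotalQuantity(new[] { "2", "three", "4" }) returns 6 and adds one entry to Log. Nothing crashes, but the answer silently excludes the bad field.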
Pros
The system is unlikely to crash in production.
Error conditions can be recovered from if the exception handlers do more than simply log an error message.
Cons
The system can silently malfunction in production. When the system silently malfunctions in test and in development environments, it is a lost opportunity to catch and fix bugs.
It is harder to track down the cause of a malfunction without good log messages. If the bug cannot be reproduced, turning on tracing will not help.
Most developers will only look at their own log messages and not look for other error messages. Unless developers are very disciplined, log files become noisy with errors even though the system appears to be behaving, leading to what I like to call "log-error-blindness".
Testing error recovery code littered throughout the code base is difficult.
Optional Fail-Fast
Another possibility is that fail-fast can be optionally switched on or off either at build time or at runtime.
Pros
Fail-fast can be disabled in critical production environments.
Fail-fast can be enabled in internal or external test environments and developer environments.
Most of the benefits of fail-fast can be realised without risking stability in production.
Cons
Running different code in production compared to test and development potentially introduces the risk of error conditions emerging in production that don't occur in development and test. For example, side-effects in condition testing could lead to faults when condition testing is turned off.
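The side-effect hazard can be shown with a small sketch. Checks, Enabled, Require and PendingJobs are hypothetical names, and a runtime flag stands in for whatever build-time or runtime switch is actually used. The helper throws instead of crashing so that the sketch is easy to exercise. The condition passed to Require also removes the job, so the method only does its work while checks are switched on.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical runtime switch for contract checks.
public static class Checks {
    public static bool Enabled = true;

    // The condition is only evaluated when checks are enabled.
    public static void Require(Func<bool> condition, string message) {
        if (Enabled && !condition()) {
            throw new InvalidOperationException(message);
        }
    }
}

public class PendingJobs {
    private readonly HashSet<string> _jobs = new HashSet<string>();

    public int Count => _jobs.Count;

    public void Add(string job) => _jobs.Add(job);

    // BUG: the removal happens inside the check, so it is skipped
    // entirely when Checks.Enabled is false.
    public void Complete(string job) {
        Checks.Require(() => _jobs.Remove(job), $"{job} was not pending");
    }
}
```

With Checks.Enabled set to true, completing a pending job leaves Count at 0. With it set to false, the same call leaves the job pending. One way to avoid this is to perform the removal first, store the result in a local variable, and check only that variable.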
Fail-fast with Error Recovery
Fail-fast can be implemented with special exceptions thrown up to a level that can handle and try to recover from the error. Helper methods can be implemented to more succinctly document and enforce detection of internal error conditions.
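A minimal sketch of such helper methods follows. ContractException, Guard, Scaler and Dispatcher are names invented for illustration; they are not part of any library. The low-level code states its conditions in one line each and contains no recovery logic, while a single higher-level handler decides what to do about a violation.

```csharp
using System;

// Hypothetical exception type reserved for internal contract violations.
public class ContractException : Exception {
    public ContractException(string message) : base(message) { }
}

// Hypothetical helpers that document and enforce a condition in one line.
public static class Guard {
    public static void Requires(bool condition, string message) {
        if (!condition) throw new ContractException("Pre-condition failed: " + message);
    }

    public static void Ensures(bool condition, string message) {
        if (!condition) throw new ContractException("Post-condition failed: " + message);
    }
}

// Low-level code: fails fast and contains no recovery logic.
public static class Scaler {
    public static int PercentOf(int value, int percent) {
        Guard.Requires(value >= 0, $"value is {value}");
        Guard.Requires(percent >= 0 && percent <= 100, $"percent is {percent}");
        // Multiply as long so large values cannot overflow.
        int result = (int) ((long) value * percent / 100);
        Guard.Ensures(result >= 0 && result <= value, $"result is {result}");
        return result;
    }
}

// Higher level: the single place that decides how to recover.
public static class Dispatcher {
    public static string Handle(int value, int percent) {
        try {
            return "ok: " + Scaler.PercentOf(value, percent);
        } catch (ContractException e) {
            // Recovery in this sketch is simply rejecting the request.
            return "rejected: " + e.Message;
        }
    }
}
```

Here Dispatcher.Handle(200, 50) returns "ok: 100", while Dispatcher.Handle(200, 150) returns "rejected: Pre-condition failed: percent is 150". Catching only ContractException keeps this distinct from the catch-all strategy, because unrelated exceptions still propagate.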
Pros
The dream is that you get the best of all strategies.
Cons
Testing error recovery in code is difficult. The system could become even more unreliable if a fault occurs in the error recovery code.
Testing error recovery code at the system level can be difficult unless an error can be triggered in some way. This con can be mitigated if there is a system-level fault-tolerance mechanism (rather than trying to handle the error in code) that is easily testable. But this can become a con again if the error condition persists, leading to frequent engagement of the fault-tolerance mechanism, which itself can cause instability.
Discussion
There are probably other approaches. If you can think of any others, please let me know in the comments.
I don't think there is a one-size-fits-all implementation strategy for design by contract. I think in my case, the new layer will use a strategy of throwing exceptions up to higher layers when pre-conditions, post-conditions and invariants are violated. It will then be up to the higher layers to decide what to do.