Misconceptions with testing paradigms and how to address them

Zachary Hamm - Sep 5 - Dev Community

The realm of quality is a much-misunderstood one, and conclusions are often drawn without understanding its ability to protect a product. It should not be sacrificed in order to meet deadlines or bring that product to market -- what if it does not work as intended? Many of us have seen bugs appear after a release, and would probably admit that there wasn't enough time to test for them. Testing is a major part of a quality plan and of many people's workloads, so being able to explain why it is necessary could very well alter opinion in its favor.

That is why I want to present some possibly controversial statements about ideas around testing. Some of these points are not the first thing that comes to mind when thinking of how testing is performed, but they are incredibly important to understand. I hope this article gives you insight when the hard questions are asked, and puts the power back into the quality process.

It is separated into three parts.


The first part deals with how to run tests and what they are.

Testing is mainly used as simply checking for a pass/fail state, but it is just one part of a larger method.

Summary: Application testing is an alert mechanism for the rate of change between the system and a set of defined actions performed upon it. It is but one part of the scientific method: most kinds of software testing, however, begin with expectations rather than predictions.

Testing, in itself, is just an experiment: it executes a trial to determine whether a prediction will occur based on earlier observations. It is but one part of the scientific method. In software engineering, a check, or assertion, is code that verifies whether certain effects from an action meet given criteria. It usually completes with a pass or fail state. Questions are answered in the format of,

Under X circumstances, does the set of Y actions cause Z outcome to occur?

These answers define a moment when our predictions are met successfully. If, at a later point, a change causes those checks to fail expectations, we are alerted that the change caused an unhandled event. In other words, a test demonstrates the rate of change between when a system's observed behavior did or did not meet expectations under certain conditions, as opposed to measuring every possible outcome that could occur.
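As a minimal sketch of that question-and-check format, consider a tiny (entirely hypothetical) shopping cart: the test states the circumstances, performs the action, and asserts the predicted outcome as a pass/fail state.

```python
# A minimal check in the "under X circumstances, does Y cause Z?" format.
# The Cart class is invented for illustration; it is not from the article.

class Cart:
    def __init__(self):
        self.items = []

    def add(self, item):
        self.items.append(item)

def test_adding_an_item_increases_the_cart_count():
    # X: an empty cart (the initial circumstances)
    cart = Cart()
    # Y: the action under test
    cart.add("book")
    # Z: the predicted outcome, asserted as pass/fail
    assert len(cart.items) == 1
```

If a later change breaks this check, the failure is the alert -- it tells us an unhandled change occurred, not necessarily where the bug lives.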

Concluding that a failing test found an issue in the application is inaccurate. There could very well be a bug, but there could also be an issue with the test not accounting for a new change. A failing test means that the path it expected was not followed. That is why viewing tests as an alert mechanism for an unhandled change is more accurate.

Take, for example, comparing a trial involving a chemical reaction with software functional testing. The early phases of a chemical trial may involve recording outcomes without requiring explicit behaviors; what is desired would be defined later, based on those observations. Building an app, however, may have requirements defined at the earliest phases of development. Many types of software testing need those checks in place early, since they begin with a relation to a known behavior, like a product requirement.

This is all for asserting upon knowable items. There are other categories that are named "testing," but their lifecycle begins much earlier. They initially pose questions that are open-ended and outcome-oriented. Performance testing measures the ability of a system to perform under duress, so it can ask questions like, "Can it process data in X time when it is stressed?" Chaos testing examines the stability of a system to ask, "Will the system remain available if X happens to it?" Frontend accessibility testing observes behaviors to predict whether the application can be fully utilized in a non-standard format.
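Those open-ended questions can eventually be narrowed into repeatable checks. As a sketch, a performance question like "can it process data in X time?" might become a timed assertion; the workload and the one-second budget here are both invented for illustration.

```python
import time

def process(records):
    # Stand-in for a real workload; hypothetical.
    return [r * 2 for r in records]

def test_processes_100k_records_within_budget():
    records = list(range(100_000))
    start = time.perf_counter()
    process(records)
    elapsed = time.perf_counter() - start
    # The one-second budget is an assumed requirement, not a universal rule.
    assert elapsed < 1.0
```

The question began outcome-oriented ("how fast is it?"); only after observation did it become a prediction we can re-run tomorrow.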

Each of these categories demonstrates that testing is not made just for the present. When a test is added, it is made to be repeatable. If the system is behaving today, how can we know that it will behave tomorrow?

Testing a feature is an approximation of a situation.

Summary: There are many ways to reach an outcome with some being closer to how an actual situation could be observed. Therefore, one must estimate efficiently how to cause and detect it.

For example, imagine checking a theme switcher button in a web app. Our test asks, "Did clicking the button once change the theme?" There are many ways to answer this depending on which actions we take and upon what we assert. Even at a functional level, we could try

  • clicking the button
  • performing a touch action for a mobile device
  • invoking the click listener directly
  • forcing the emit of the click event

Since the underlying click actions were originally tested within their source code, we will assume that a click is the easiest path. The other ways are also correct, however. What changes is how much we "wrap" around the thing we want to test -- the more “wrapping code” we remove, the less we rely on other functionality that also needs to be correct in order to observe the effects from the true "source" of our test.
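The "wrapping" idea can be sketched with a toy theme switcher in Python. Each test below exercises the same effect through a different amount of surrounding code; the `ThemeSwitcher` class and its listener are invented for illustration.

```python
class ThemeSwitcher:
    """Hypothetical model of the theme switcher button."""

    def __init__(self):
        self.theme = "light"

    def toggle(self):
        # The core behavior we ultimately care about.
        self.theme = "dark" if self.theme == "light" else "light"

    def on_click(self, event):
        # A listener layer that wraps the core behavior.
        self.toggle()

def test_via_listener():
    # More wrapping: also relies on the listener plumbing being correct.
    switcher = ThemeSwitcher()
    switcher.on_click(event={"type": "click"})
    assert switcher.theme == "dark"

def test_via_direct_call():
    # Less wrapping: observes the effect closest to its "source".
    switcher = ThemeSwitcher()
    switcher.toggle()
    assert switcher.theme == "dark"
```

Both answers are correct; they differ only in how much other functionality must also be correct before we can trust the observation.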

Writing assertions can be even more complicated, because they map our predictions to an answer for our question. We could check for

  • a data ID on an element that denotes "light" or "dark" theme
  • the color of an element that is affected by theme
  • stubbing the click listener to test if the method is called

In feature testing, these assertions are much less direct because we attempt to map them as answers to a business requirement. A requirement might state that an element's color shall be light gray or dark gray based on the mode, and another might state that a Theme element is updated when a button is clicked. We would need to answer the question with a test that best represents the user interaction.

As scenarios become more complex and integrated in highly connected environments, the paths toward asserting an outcome become much more variable. Therefore, an engineer should choose the most direct way to evaluate the scenario in the context of its environment. In this case, clicking the button and checking whether the color changed is probably the easiest path -- but is it the most accurate and stable? Or does another path provide better insight?
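Continuing the hypothetical theme switcher (here given an assumed color mapping), the three assertion options from the list above can be sketched side by side; which best represents the user interaction is a judgment call.

```python
from unittest.mock import MagicMock

# Assumed theme-to-color mapping, invented for illustration.
THEME_COLORS = {"light": "#eeeeee", "dark": "#222222"}

class ThemeSwitcher:
    def __init__(self):
        self.theme = "light"

    def toggle(self):
        self.theme = "dark" if self.theme == "light" else "light"

    @property
    def color(self):
        return THEME_COLORS[self.theme]

def test_asserts_on_theme_id():
    # Option 1: assert on the data that denotes the theme.
    switcher = ThemeSwitcher()
    switcher.toggle()
    assert switcher.theme == "dark"

def test_asserts_on_color():
    # Option 2: assert on the visible effect, closer to the requirement.
    switcher = ThemeSwitcher()
    switcher.toggle()
    assert switcher.color == "#222222"

def test_asserts_listener_was_called():
    # Option 3: stub the handler; verifies the call, not the effect.
    switcher = ThemeSwitcher()
    switcher.toggle = MagicMock()
    switcher.toggle()
    switcher.toggle.assert_called_once()
```

Note how option 3 proves the least about the outcome a user would observe, while option 2 depends on the most surrounding functionality being correct.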

A test is arbitrary unless sufficiently described.

Summary: The title of a test needs to succinctly describe its actions or else it loses value.

Test code, like application code, is essentially grouped functions that execute a static flow. It runs our entire trial from start to finish. Yet, unlike application code, these flows are situational. Remember from earlier that a test function can be used as an alert mechanism. It is imperative, then, that test code is described succinctly in terms of what is being checked.

Unlike application code, test code needs to define its situation, or scenario. Application code has established paradigms by which it can be separated into understandable and maintainable parts -- meaning smaller functions -- which helps to explain each individual action with high specificity. However, because a test function is the culmination of a chain of actions, it can be difficult to ascertain what it does. Its name explains its value.

Naming a test function clickingTheSubmitButtonLogsInAnExistingUser rather than clickSubmitTest adds succinct definition. Extracting value from clickSubmitTest is impossible: its actions, its outcome, anything past the notion that it is a test — it communicates little compared to the first option.

Luckily, there are some suggestions that help define test code. The Arrange-Act-Assert pattern explains how to arrange a test, and frameworks like JavaScript’s Jest or Mocha, Ruby’s RSpec, and Python’s Pytest allow for organizing tests into contexts. These contexts build a hierarchy of explaining each function at a human-readable level (assuming the tests are written to be, well, human-readable). BDD frameworks, like Gherkin, can describe both a scenario and its actions in human-readable format, and require a higher degree of using effective language.
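As a sketch of those suggestions in pytest, a context class groups related scenarios, each test name states its action and outcome, and Arrange-Act-Assert structures the body. The login form here is invented for illustration.

```python
class Login:
    """Hypothetical login form backing the submit button."""

    def __init__(self, known_users):
        self.known_users = known_users

    def submit(self, username):
        return username in self.known_users

class TestClickingTheSubmitButton:
    # The class name supplies the context; each method names its scenario.

    def test_logs_in_an_existing_user(self):
        # Arrange
        form = Login(known_users={"ada"})
        # Act
        result = form.submit("ada")
        # Assert
        assert result is True

    def test_rejects_an_unknown_user(self):
        form = Login(known_users={"ada"})
        assert form.submit("zed") is False
```

Read top to bottom, the hierarchy tells a human-readable story: clicking the submit button logs in an existing user and rejects an unknown one.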


The second part here deals with where tests are run. Everything here deals with an environment — for a test, the environment is the place that provides the conditions of its initial state: our "control" situation.

Higher degrees of manageability are directly related to greater assumptions of reality.

Summary: Testing closer to the source code involves assumptions of a situation that may not match what actually happens.

Let’s start by stating that with more externalities come more possibilities to assume. That means that testing closer to the source leads to less variability and more reliable tests. The highest degree of control is over what you can change directly.

Moving up in environments, a test’s initial state becomes more variable. Each external item introduced is incorporated into the initial state of our test. These include other systems and applications, but also the number of users allowed to access them. Each user also has their own unique circumstances that form the conditions surrounding their own environment. When testing, we assume parts of the initial state to be accurate without actually knowing that they are. Even unit testing on one system may not match a second system, since they are not the same environment.

Now, as external items are integrated, that degree of control is reduced and variability is heightened. Because we normally cannot directly control any external items outside of the application, we must make assumptions about their behaviors. User permissions, connectivity issues, bugs, and system availability limit our control over them. We may be able to indirectly influence them by injecting data and other flows that set them into a wanted state, but we cannot force an outcome. If we can influence one into a wanted state, then we assume it works as expected, and thus we regain some control over our test. This is not always possible, though.
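One common way to regain that control in a lower environment is to replace the external item with a stub set into the wanted state. The sketch below uses Python's `unittest.mock` to stand in for a payment service we cannot control directly; the service, its `charge` interface, and the checkout flow are all hypothetical.

```python
from unittest.mock import Mock

def checkout(cart_total, payment_service):
    # Application code that depends on an external system.
    receipt = payment_service.charge(cart_total)
    return receipt["status"] == "ok"

def test_checkout_succeeds_when_payment_service_behaves():
    # Encode our assumption of the external system's wanted state.
    payment_service = Mock()
    payment_service.charge.return_value = {"status": "ok"}
    assert checkout(42, payment_service) is True
    # We influenced, not forced, the outcome: the real service may still
    # behave differently in a higher environment.
    payment_service.charge.assert_called_once_with(42)
```

The stub restores control, but only by assuming the real system matches its description -- exactly the trust relationship discussed below.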

Finally, production is just another environment where our application exists. Because what matters is where a user interacts with it, a user’s environment is an extension of our understanding of it. That leads to a very crucial sub-point...

Sub-point: Production is not reality.

Summary: Reality is the situation in which an end user observed the behavior of the application. The only way to measure that situation is by approximating it. Production is just a name for a place that allows the application to be available to the general public, but is just a layer within the user's own environment.

“Production” is just the name of an environment where we’ve agreed to release the application to the greatest set of users: the public. They could be either human users or other systems whose actions alter reality. However, production is only an abstraction of reality, because only the environment in which a user interacts with the application will produce outcomes that affect others. Production makes the assumptions that still allow for expected system behavior, like any other environment below it, yet it does not define the user’s environment. Our production application is influenced by everything within the user's environment.

A medical drug could be effective for 99 percent of people but cause side effects in the last one percent. A drug, once produced, does not cause effects until it is taken by a person. Compared to a software production environment, having a full test set may catch those "side effects" that happen for our one percent of users. We make the assumption that the most reliable parts of the system (meaning the parts with the most trusted tests) should work for any user.

When a software bug is raised by a customer, the only way to determine how it was caused is to approximate how that person observed its occurrence. We can only ever approximate a scenario within an environment because we did not observe it originally. Thus, we estimate as closely as possible to what actually happened.

For a web app, this means we attempt to account for the conditions that created the initial state of a user's observation: their browser type, network connectivity, timings, physical location, device hardware, etc. But again, production is not necessarily the user’s environment. Production is an abstraction of one because it exists as only a single layer there. Thus, if a user encounters an issue, then that user experienced it in their own version of reality constructed with their own conditions, actions, timings, and assumptions.

We may not be able to encounter the bug even when executing those same actions, unless their initial state can be mimicked. If we cannot, then we did not make accurate assumptions about their environment. Being unable to recreate an issue does not make it less real; it means we cannot define its situation.

Production is not the end result. It is where “public” users can experience it within the conditions of their environment.

Integrating external systems assumes a level of trust has been met.

Summary: Integrating a system involves trust that it acts as described by its maintainers. Testing is a commonly agreed method for which trust can be derived. Thus, utilizing a system has an implicit agreement where the engineer believes that a certain level of trust has been fulfilled.

If we involve an external system that we do not directly test, then we agree that it works for our use cases. Each integration of our application into a higher environment means we need to account for a greater number of systems that we cannot directly control. It also means we trust in the functionality of those systems because their expectations meet their descriptions.

There are many ways trust can be derived, some more transparent than others: the developers of the application could say it works without making their code public, a third party could use it and give it a stamp of approval, or I could use the application directly and check that it works how I want it to. However, each of these ways involves executing trials, recorded or not. If no tests or trials are performed, then how can one predict what will occur in reality? And since test code must use application code to make a prediction, wouldn’t it be an independent entity from the application?

If we agree that test code forms trust surrounding the functionality of the system, we also agree that the test code can form a basis of trust in itself. Its inherent purpose is to conduct trials and report the results, so passing tests mean that we expect its trials to be conducted properly.

Sub-point: Test code may be considered an independent entity from the application.

Summary: Test execution code is considered trustworthy and accurate when it performs its duties to ensure the stability of a system — thus, when effective, it could be considered an external system for which trust is derived in itself.

Being able to see the results of other engineers' tests builds trust, since we agree there is no need for additional testing of that system. But we did not write that test code, and running the tests may not explain how they were conducted. What if they are inaccurate?

Test code is implied to be correct and carries a degree of honesty with it. Application code does not validate its own functionality alone. To write tests that make correct predictions about that functionality, the writers must be trusted to be honest, since we may not see the actions taken in the test. The test code is a third party that has no bond with the application, and it also exists as an external system itself because it both requires and affects the application.

Unreliable tests, unaccounted-for situations, and misleading test titles lessen that bond. Undefined functionality and scenarios are understood to be hard to test, but core functionality, as described by its maintainers, must be thoroughly tested for well-defined situations. Anything less can harm the integrity of the owners of the software. Writing tests to always pass, regardless of their trials, is deceptive behavior. At that point, the external system is untrustworthy because it fails to explain the behavior of the application it supports.

I always like to ask the question, “Who tests the tests?” It's hard to do -- the chain of tests would grow forever, since the test code would itself have to be checked for correctness. Unreliable functionality may not mean the test writer is dishonest, but it does mean that the application is difficult to trust. There may be real reasons why certain scenarios are missed — they are unable to be recreated, external systems misbehave, or the application’s behavior is not well-defined enough to design a test trial. A test’s value is determined as a cost relationship: how much value do its checks provide versus its complexity to maintain? It is subjective and hard to calculate.


This third and final part deals with the who and when of running tests. Testing is just one part of quality, and knowing where it fits is of utmost importance.

Everyone owns quality, but those who provide the product bear the ultimate responsibility.

Summary: While testers perform the fullest duties of quality, it is something everyone should advocate and own. The ones who provide the product to others take the fullest responsibility of communicating that the product works as intended.

It is the responsibility of everyone involved in the making of a product to ensure it works as described. Quality is owned by everyone. If quality must be restricted in order to release a product, it is not the fault of that department when faults are discovered. Those who provide the product to others must answer for any problems that arise.

This goes for letting improper tests exist as well. If a tester doesn’t produce accurate or trustworthy tests, then that person is responsible for their own actions; they harm the product and the organization. However, letting those tests remain in place to sign off on behaviors is the responsibility of those who allow them to exist there. Tests that are missed, or removed for the sake of meeting deadlines, likewise fall into the realm of those who allowed them to be removed.

Measuring the usefulness of quality in the present is unrealized, but constant discovery can bring future stability.

Summary: Tests need to have a specific focus, large test sets need to cover expected situations, and discoverability allows for constant attention to observing new behaviors. Alone, they are not enough, but together, they forge the path to ensuring high levels of application stability and sustainability. Testing does not complete -- it is an ongoing effort.

Writing a test that passes for the present is great! We need to know it works. However, writing tests to make sure those checks are maintained through future changes to an application is necessary. Each change to the base application brings in uncertainty, much like adding external sources. If test code is meant to be trustworthy, a test producing a consistent and reliable check creates that trust. A failing test means an unexpected change was detected, and that is it.

Adding more tests means making extra predictions (or business scenarios), but it does not inherently create reliable coverage. A combination of multiple test disciplines carries more trust than a permutation of a single type of test. There needs to be a focus placed — an intention defined — or we cannot understand what we evaluate.

However, a test’s cost is determined by how accurately it answers a prediction, weighed against its complexity to maintain.

Designing scenarios in the present that are executed in the future is great, but those scenarios become static — they only check for what we already know. They can never account for new behaviors unless new tests are created. Thus, testing is not static — it is ongoing, a source of discovering new scenarios, new behaviors, new outcomes. Just because it works as intended for us does not mean it works as intended for others.

Choosing to skip testing for the present may mean deadlines are hit, but the effects of it will cost more over time. A bug found tomorrow is an issue not fixed today.

Quality is a combination of many facets, and testing is a major portion. Quality is always important as it does not end at the application’s release to “production.” It does not end with the user. It continues for as long as the application exists. Therefore, always continue to test as if its users require greater trust.


Conclusion

  • Testing is mainly used as simply checking for a pass/fail state, but it is just one part of a larger method.
    • Summary: Application testing is an alert mechanism for the rate of change between the system and a set of defined actions performed upon it. It is just one part of the scientific method: most kinds of software testing, however, begin with expectations rather than predictions.
  • A test is arbitrary unless sufficiently described.
    • Summary: The title of a test needs to succinctly describe its actions or else it loses value.
  • Higher degrees of manageability are directly related to greater assumptions of reality.
    • Summary: Testing closer to the source code involves assumptions of a situation that may not match what actually happens.
    • Sub-point: Production is not reality.
    • Summary: Reality is the situation in which an end user observed the behavior of the application. The only way to measure that situation is by approximating it. Production is just a name for a place that allows the application to be available to the general public, but is just a layer within the user's own environment.
  • Integrating external systems assumes a level of trust has been met.
    • Summary: Integrating a system involves trust that it acts as described by its maintainers. Testing is a commonly agreed method for which trust can be derived. Thus, utilizing a system has an implicit agreement where the engineer believes that a certain level of trust has been fulfilled.
    • Sub-point: Test code may be considered an independent entity from the application.
    • Summary: Test execution code is considered trustworthy and accurate when it performs its duties to ensure the stability of a system — thus, it could be considered an external system for which trust is derived in itself.
  • Everyone owns quality, but those who provide the product bear the ultimate responsibility.
    • Summary: While testers perform the fullest duties of quality, it is something everyone should advocate and own. The ones who provide the product to others take the fullest responsibility of communicating that the product works as intended.
  • Measuring the usefulness of quality in the present is unrealized, but constant discovery can bring future stability.
    • Summary: Tests need to have a specific focus, large test sets need to cover expected situations, and discoverability allows for constant attention to observing new behaviors. Alone, they are not enough, but together, they forge the path to ensuring high levels of application stability and sustainability. Testing does not complete -- it is an ongoing effort.