Automated testing has become an essential part of software development. Good test support is a core feature of many modern application frameworks, and there is rarely a debate about whether tests should be written or not, but rather on how and which tests should be written.
However, there is also a reality of some existing systems having few - if any - automated tests in place. While this might not be a problem per se, a degrading understanding of a system can make it harder to confidently make necessary changes - up to a point where no one dares to touch anything anymore, in fear of not noticing that a change breaks existing functionality. Having a suite of automated tests in place can greatly reduce that risk and thereby enable the continuous evolution of the system.

To explore this in some more detail, three questions will be discussed in the following:

  1. What are properties that automated tests should have in order to maximize the value they provide over the lifetime of a system?
  2. How can systems be designed to facilitate the creation of useful tests?
  3. What can be done if a system is not (yet) designed in a way that accommodates such tests?

Most of the aspects reflected on below are described in depth, among others, in the following books:

  • Working Effectively with Legacy Code, M. C. Feathers, 2005 - an often-recommended classic motivating why to create automated tests, and how to make it possible [Feathers, 2005]
  • Software Engineering at Google, T. Winters, T. Manshreck, and H. Wright, 2020 - devoting several chapters to the way tests are used in that huge software organization [Winters et al., 2020]
  • Clean Architecture, R. C. Martin, 2017 - containing general advice on how to structure (testable) systems [Martin, 2017]
  • Object-Oriented Reengineering Patterns, S. Demeyer, S. Ducasse, O. Nierstrasz, 2003 - a collection of approaches for restructuring existing systems, showing ways to improve the design, and describing strategies to use and grow test-suites [Demeyer et al., 2003]

1. Working with legacy code

Most relevant software lives for long periods of time, and its continuous adaptation to new requirements is an ongoing challenge.

With growing size, age, and complexity of a system, it is not unusual for even seemingly simple changes to take longer and longer to implement, and to carry an increasing risk of breaking existing functionality. Of course, this holds especially true for systems developed by just relying on “working with care” (sometimes also known as “professionalism”). When it comes to making further changes, [Feathers, 2005] describes developers as resorting to “edit and pray” that the change does not break anything, building confidence upon their experience with the system and some exploratory manual testing.
Obviously this only gets riskier once different people start to extend different parts of the system, and once experienced contributors leave the team while new developers join. Slowly, the architectural vision of the design starts to blur, and the understanding of how the system works gets lost, up to the point where no one really knows anymore what is going on. Needless to say, this just continues to increase the risk of breaking changes (ultimately, how could one do any regression testing without even knowing how the system is supposed to behave?).

As an alternative, Feathers describes an approach of covering existing functionality with automated tests, providing a “safety net” which allows for controlled refactorings as well as the controlled addition of new features or fixes. This way, the chance of breaking existing functionality is reduced, and all developers (experienced or not) gain confidence in the correctness of the software.

So what could those automated tests look like?

2. Properties of useful automated tests

[Winters et al., 2020] mention different dimensions to classify automated tests:

  • scope (describing the amount of validated code)
    • narrow (e.g. class, or even single function)
    • medium (e.g. multiple classes)
    • large (e.g. system / end-to-end tests, verifying the interaction of sub-systems)
  • size (describing the amount of resources the test needs)
    • small (the test and its dependencies all run inside a single thread)
    • medium (everything runs on the local machine; separate processes, network calls to localhost, and file-system access are allowed)
    • large (calls to external systems are allowed)

Fast and deterministic

The execution time of a test is typically determined by its size, and the larger the test, the more flaky it tends to be, since e.g. network calls to external systems may time out (flakiness being the extent to which a test sometimes fails without any actually problematic code change).

To keep a test suite fast and deterministic, it is recommended to rely on a majority of small tests whenever possible (leaving no room for flakiness and even allowing simple parallel execution across multiple threads). Furthermore, [Winters et al., 2020] emphasize that only fast, small tests are practical to run as part of the normal development workflow, and experience shows that longer-running tests tend not to be executed. However, the importance of larger tests is also acknowledged, as they cover aspects that small tests cannot verify.

Robust and maintainable

When it comes to scope, there are different aspects to keep in mind:

  • On the one hand, narrow-scoped tests (e.g. testing a single implementation class) usually allow for a quick and precise analysis of the root-cause in the event of failure.
  • On the other hand, narrow-scoped tests tend to be brittle (they break on unrelated changes), since a simple redistribution of some logic between collaborating classes can already cause a lot of test failures.

[Winters et al., 2020] recommend testing business-relevant behaviour via the public API instead of directly depending on implementation details, to avoid having to change the tests frequently (“Don’t depend on volatile things!” - [Martin, 2017] even recommends a dedicated testing API to shield tests from changing implementation details).
Pure refactorings should not break tests; if they do, this may indicate an inappropriate level of abstraction (test behaviour - not methods or classes). The same holds true for changes that introduce new features or fix bugs: they should not require adjusting existing tests either. With respect to scope, following these recommendations will typically mean testing at least several related classes together.
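
As a small illustration of testing behaviour through the public API (the DiscountCalculator class and its business rule are hypothetical), such a test keeps working no matter how the logic is distributed across internal helpers:

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class DiscountCalculatorTest {

  @Test
  void ordersAboveOneHundredEurosGetTenPercentDiscount() {
    DiscountCalculator calculator = new DiscountCalculator();

    // only the public API is exercised; how the calculator distributes its
    // logic across collaborating classes is irrelevant to this test
    assertEquals(108.0, calculator.discountedTotal(120.0), 0.001);
  }
}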

Another interesting aspect of scope is that it is defined by the amount of validated code, as opposed to executed code. In particular, [Winters et al., 2020] argue that - if possible - a test should stick with the real implementations of the dependencies of the tested code instead of replacing them with test doubles by default (preferring classical over mockist testing):

“Using real implementations can cause your test to fail if there is a bug in the real implementation. This is good! You want your tests to fail in such cases because it indicates that your code won’t work properly in production.”

This is especially true when the dependency itself is not properly tested on its own.

Obviously, dependencies on things running outside the test thread (e.g. external services, databases) must be replaced by test doubles in order to keep the test small. Here, [Winters et al., 2020] prefer the usage of lightweight fake implementations over mocks to be able to test state instead of interactions.
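
As a small sketch of that preference (the ResultRepository dependency is made up for illustration), a lightweight fake is simply a working in-memory implementation of the dependency:

import java.util.HashMap;
import java.util.Map;

// hypothetical dependency of the code under test
interface ResultRepository {
  void save(String key, int value);
  Integer find(String key);
}

// lightweight fake: a real, working in-memory implementation
class InMemoryResultRepository implements ResultRepository {

  private final Map<String, Integer> storage = new HashMap<>();

  @Override
  public void save(String key, int value) {
    storage.put(key, value);
  }

  @Override
  public Integer find(String key) {
    return storage.get(key);
  }
}

A test can then hand the fake to the code under test and afterwards assert on the resulting state (what was stored), instead of verifying interactions (which methods were called with which arguments), as a mock-based test would.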

With respect to scope, [Feathers, 2005] also mentions the importance of narrow-scoped tests, among other reasons because of the simplicity with which failure causes can be located.
While I certainly like this property, I also fear that it tends to motivate me to slavishly create tests for each and every single class, leading to the brittleness problems described above. More often than not, it should be easy to spot the root cause of a problem even among a couple of collaborators. This does of course not mean that I would oppose occasionally making some pure function package-private in order to quickly test some nasty regex in isolation.

3. How to make code untestable

While it may sound just great to have a blazingly fast, deterministic, robust, and maintainable test suite in place, it needs to be kept in mind that such a suite must not be an afterthought during development. If testability is not carefully taken into account, it is surprisingly easy to end up with code which makes it incredibly hard to add any useful (fast, small) automated tests ex-post (an endeavour hard to describe as anything other than painful).

Consider the following Java sample:


@Service
class CalculationServiceImpl implements CalculationService {

  /**
   * @return true, if successful
   */
  @Override
  public boolean calculate(int input) {
    Result result = new FirstCalculator().calculateFirstPart(input); // 1.
    SecondProcessor.calculateSecondPart(result, input); // 2.
    SessionContext.store("myResult1", result); // 3.
    boolean isSuccessful = result.getValue32() == 13; // 4.
    if (isSuccessful) {
      ThirdProcessor.calculateThirdPart(); // 5.
      NotificationService.sendKafkaMessageToCalculationsTopic(); // 6.
    }
    return isSuccessful;
  }
}

class FirstCalculator {
  Result calculateFirstPart(int input) {
    int baseValue = new MariaDbDatabaseAccess().getBaseValue(); // 1.1
    // [...] some calculation logic
    return result;
  }
}

class SecondProcessor {
  static void calculateSecondPart(Result result, int input) {
    int extraInfo = CalculationHelperWebService.getExtraInfo(); // 2.1
    // [...] some calculation logic
  }
}

class ThirdProcessor {
  static void calculateThirdPart() {
    Result result = (Result) (SessionContext.get("myResult1")); // 5.1
    // [...] some calculation logic
  }
}

  // [...] various other classes

Imagine that we’d want to write a test for the calculate functionality of the CalculationService (public API).
We notice that the actual calculation logic is distributed over at least three different collaborator classes, and the distribution seems rather arbitrary. In order to avoid creating a brittle test which is broken by the first upcoming refactoring, we resist the temptation to test any of the three calculation parts in isolation.

  1. new FirstCalculator().calculateFirstPart(input)
    First, there is a call to a collaborating class (FirstCalculator), which in turn collects some additional information from a database (1.1):
    int baseValue = new MariaDbDatabaseAccess().getBaseValue();
    This already poses a first problem, since there is no simple way to replace the MariaDB with a lightweight fake implementation: we would need to either run the complete test against a real MariaDB (making the test larger), or resort to advanced mocking library features (which are hopefully available).

  2. SecondProcessor.calculateSecondPart(result, input);
    Then the collaborator SecondProcessor is called, which first fetches some needed information from a remote service via network before running its calculations (2.1):
    int extraInfo = CalculationHelperWebService.getExtraInfo();
    Again, replacing this static call may require advanced mocking magic to be available (possibly including surprising side-effects in case the static mock is not properly cleaned up at the end of the test).

  3. SessionContext.store("myResult1", result);
    This shows another possibly hard-to-replace static call that depends on some framework-provided (thread-local?) session state to be set up, so that later parts of the calculation logic can then retrieve that state (e.g. 5.1 and 6.). Relying on side-effects like this is a great way to create a sufficiently confusing data flow which does not exactly ease creating relevant test cases.
    Why would one ever resort to using some sort of SessionContext here at all? Well, at least it allowed us to share information between different places without having to refactor much existing logic which would have been risky to touch…

  4. boolean isSuccessful = result.getValue32() == 13;
    At 4., the orchestrating logic of the calculate method is interrupted by some core business logic, breaking the abstraction level of the method. Apart from making the now even more widespread calculation logic harder to understand, this does not really hinder creating the test. It merely serves as an example of needed refactoring, and is probably again a symptom of missing tests, since it was presumably too risky to put this logic elsewhere in the first place. It also illustrates the importance of testing via stable interfaces (public API), since implementation details such as the distribution of logic among collaborating classes are likely to change - especially in systems where missing tests prevented adding new features at the right places.

Finally, 5. just represents another dependency on the SessionContext, and 6. another hard-coded dependency on an external system. Obviously, the overall sample is kept short, and the length of realistic methods won’t make writing tests any easier.

Summing up some general problems which may add up over time and result in hard-to-test code:

  • hard-wired dependencies to external systems (hard to fake/mock)
  • dependencies hidden deep in the core business logic
  • behavior relying on side-effects and arcane features of the used framework
  • mixing different aspects and abstraction levels, just to avoid changing code at other places

4. Design for testability

As shown in the previous section, testability should be a key concern during system design and implementation. So how can code be structured to allow for useful tests?

Providing Seams

The Seam is a central concept of [Feathers, 2005], described as “a place to alter behavior without editing that place”. In particular, Object Seams are recommended, i.e. places that allow replacing problematic dependencies with subtypes.
Dependency injection (DI) is a central feature of many popular frameworks such as Spring or Quarkus (to give some Java examples), and both of these DI containers allow replacing existing bindings of (problematic) implementation classes with mocks or custom fake implementations. However, when using constructor injection, it is also possible to simply construct instances of the classes under test by hand, without relying on any DI framework functionality. Of course, this manual construction has the disadvantage of having to create the complete dependency graph by hand, but it may speed up test execution significantly.

For the problematic FirstCalculator above, a better testable version could look like this:

class FirstCalculator {

  private final MariaDbDatabaseAccess dbAccess;

  // constructor allows to provide fake/mock dependencies
  FirstCalculator(MariaDbDatabaseAccess dbAccess) {
    this.dbAccess = dbAccess;
  }

  Result calculateFirstPart(int input) {
    int baseValue = dbAccess.getBaseValue();
    // [...] some calculation logic
    return result;
  }
}

This way, a fake implementation of MariaDbDatabaseAccess could be provided (subclassing it and overriding the problematic behavior).
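
A small test using this seam could then look roughly like this (JUnit 5 is assumed, as well as MariaDbDatabaseAccess being non-final, getBaseValue() being overridable, and its constructor not opening a real database connection; the concrete values are made up):

import static org.junit.jupiter.api.Assertions.assertNotNull;

import org.junit.jupiter.api.Test;

class FirstCalculatorTest {

  // fake subclass overriding only the problematic database call
  // (assumes MariaDbDatabaseAccess is not final and can be constructed without a live database)
  static class FakeDatabaseAccess extends MariaDbDatabaseAccess {
    @Override
    public int getBaseValue() {
      return 42; // fixed value instead of a real MariaDB query
    }
  }

  @Test
  void calculatesFirstPartFromFakedBaseValue() {
    FirstCalculator calculator = new FirstCalculator(new FakeDatabaseAccess());

    Result result = calculator.calculateFirstPart(7);

    // the concrete assertion depends on the actual calculation logic,
    // which is elided in the sample
    assertNotNull(result);
  }
}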

Additionally, the CalculationServiceImpl would also need to offer a seam by providing a constructor that allows injecting the FirstCalculator instance with the faked database access dependency:

@Service
class CalculationServiceImpl implements CalculationService {

  private final FirstCalculator firstCalculator;

  // constructor allows to provide test-specific dependencies
  CalculationServiceImpl(FirstCalculator firstCalculator) {
    this.firstCalculator = firstCalculator;
  }

  @Override
  public boolean calculate(int input) {
    Result result = firstCalculator.calculateFirstPart(input);
    // [...]
  }
}

(Luckily, Lombok may save us at least some of this constructor boilerplate.)
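
For illustration, Lombok’s @RequiredArgsConstructor generates a constructor for all final fields, so the seam stays in place without the hand-written boilerplate:

import lombok.RequiredArgsConstructor;

@Service
@RequiredArgsConstructor // Lombok generates the constructor for all final fields
class CalculationServiceImpl implements CalculationService {

  private final FirstCalculator firstCalculator;

  // [...] calculate method as above
}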

Clean architecture

[Martin, 2017] describes clean architecture, a general approach to structure a system so that central business logic is properly separated from external dependencies. The main idea is also at the core of similar concepts like hexagonal architecture or onion architecture.
In particular, a Dependency Rule is formulated which states that:

“Source code dependencies must point only inward, toward higher-level policies.”

When core business logic needs to invoke functionality from outer layers, this dependency must be inverted (Dependency Inversion Principle), so that the source-code dependency still only points inward, opposing the flow of control.

Consider the sample above:

[figure: outward-dependency]

A cleaner version of the code sample shown above could be realized as follows, replacing the low-level MariaDB dependency of the core logic with the abstract interface BaseValueProvider:

[figure: inverted-dependency]

// core business logic
class FirstCalculator {

  private final BaseValueProvider baseValueProvider; // abstract dependency

  // constructor allows to provide fake/mock dependencies
  FirstCalculator(BaseValueProvider baseValueProvider) {
    this.baseValueProvider = baseValueProvider;
  }

  Result calculateFirstPart(int input) {
    // core logic directly invokes functionality from outer layers,
    // but has no source-code dependency
    int baseValue = baseValueProvider.getBaseValue();
    // [...] some calculation logic
    return result;
  }

  interface BaseValueProvider { // part of the core business logic
    int getBaseValue(); // abstract functionality needed by the core logic
  }
}

// ---

// implementation detail, outside the core logic
class MariaDbDatabaseAccess implements FirstCalculator.BaseValueProvider {
  @Override
  public int getBaseValue() {
    // [...] actual MariaDB access logic
  }
}

As a result, the central business logic can be kept independent from external influences and low-level details, be it frameworks, UI, or databases. This not only makes it simpler to replace a specific database or UI technology when a popular new one emerges, but also makes it easy to provide seams which allow swapping out problematic dependencies for mocks or lightweight fake implementations. Consequently, the creation of small, useful automated tests is facilitated.
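
Since BaseValueProvider declares only a single method, a small test can even provide the fake as a lambda (a minimal sketch assuming JUnit 5; the values are made up):

import static org.junit.jupiter.api.Assertions.assertNotNull;

import org.junit.jupiter.api.Test;

class FirstCalculatorSmallTest {

  @Test
  void calculatesFirstPartWithFixedBaseValue() {
    // the single-method BaseValueProvider interface is faked with a lambda, no database involved
    FirstCalculator calculator = new FirstCalculator(() -> 42);

    assertNotNull(calculator.calculateFirstPart(7));
  }
}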

When implementing clean architectures, it may also be helpful to enforce the dependency rule, e.g. in Java with the help of ArchUnit tests.
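
A minimal sketch of such an ArchUnit rule could look like this (using ArchUnit’s JUnit 5 support; the package names com.example.calculation, ..core.., and ..infrastructure.. are assumptions for this example):

import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.noClasses;

import com.tngtech.archunit.junit.AnalyzeClasses;
import com.tngtech.archunit.junit.ArchTest;
import com.tngtech.archunit.lang.ArchRule;

@AnalyzeClasses(packages = "com.example.calculation")
class DependencyRuleTest {

  // core business logic must not depend on outer-layer implementation details
  @ArchTest
  static final ArchRule coreMustNotDependOnInfrastructure =
      noClasses().that().resideInAPackage("..core..")
          .should().dependOnClassesThat().resideInAPackage("..infrastructure..");
}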

Sidenote: on the overuse of Java interfaces

Another, somewhat related and sometimes observed problem in Java is a general tendency towards the overuse of interfaces. Consider the sample introduced above:

@Service
class CalculationServiceImpl implements CalculationService {
  @Override
  public boolean calculate(int input) {
    // [...]
  }
}

Imagine the calculate method being called exclusively by some CalculationRestController upon an HTTP request initiated by user interaction. In that case, there would not be any problem with a direct dependency from the CalculationRestController (low-level detail) towards the CalculationServiceImpl (core business logic). In fact, we might just be able to reduce some clutter by removing the useless CalculationService interface as well as the annoying Impl postfix. In many cases, using interfaces is not required and should be a conscious decision rather than the default.
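
Without the superfluous interface, the sketch from above simply boils down to:

@Service
class CalculationService { // no interface, no Impl postfix

  public boolean calculate(int input) {
    // [...]
  }
}

Should a second implementation or a test-specific replacement ever become necessary, an interface can still be extracted at that point.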

5. Reengineering for testability

Given some system that was not designed with clean architecture and useful automated tests in mind, how can we still get there? [Demeyer et al., 2003] give a number of useful recommendations on reengineering, i.e. on how to restructure systems in an improved form.

One chapter touches on the question of which parts of a system to prioritize. Understandably, reengineering efforts should not focus on stable, flawlessly working parts, but rather on the faulty ones which require change and suffer the worst from reliance on outdated technologies, developer fluctuation, insufficient documentation, duplicated code, or tangled structure.

Starting with the most problematic parts, the core business functionalities as well as dependencies and auxiliary functions need to be analyzed to identify a cleaner, more testable target design as well as the corresponding target scope of useful automated tests.

In order to safely make the necessary code changes (e.g. introducing seams to break and invert dependencies), [Demeyer et al., 2003] recommend incrementally introducing tests for the parts of the system which are changed.

However, isn’t there a chicken-and-egg problem here: don’t we already require tests to safely make the very changes that enable the creation of tests in the first place?

Yes, that’s what [Feathers, 2005] calls the legacy code dilemma. To alleviate this problem, a two-step approach can be taken:

  1. start with larger-sized tests which allow keeping as many dependencies in place as possible, minimizing the amount of code changes necessary to create the tests
  2. refactor the covered code so that creating small tests of the core business logic becomes feasible

At first, this may e.g. involve running tests against an existing database or other external services. Even though the larger tests may take time to run and will be subject to flakiness, they still provide the necessary safety net to incrementally move towards a cleaner design with faster tests.

[Demeyer et al., 2003] advise to start with black-box tests of big abstractions, focusing on business value instead of individual sub-components. In particular, one recommendation is to record business rules as tests, aiming to represent core functionality by a set of canonical examples with well-defined actions and clear, observable results. Since covering all rules may not be feasible (depending on their number and the runtime of the larger tests), it is suggested to start with the essential cases. The 80/20 rule may apply here as well: maybe 80% of production cases only exercise 20% of the business logic?
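
A sketch of how such recorded business rules could look as a larger-sized, parameterized test (JUnit 5 is assumed; the inputs and expected outcomes are invented and would in practice come from domain experts or observed production behaviour):

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Tag;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.CsvSource;

@Tag("large") // runs against the real database and helper services
class CalculationBusinessRulesTest {

  @ParameterizedTest
  @CsvSource({
      "7, true",   // canonical successful calculation (values made up)
      "0, false",  // canonical unsuccessful calculation (values made up)
  })
  void calculationBehavesAccordingToRecordedBusinessRule(int input, boolean expectedSuccess) {
    // the untouched original implementation with all its hard-wired dependencies
    CalculationService service = new CalculationServiceImpl();

    assertEquals(expectedSuccess, service.calculate(input));
  }
}

Tagging the test as “large” makes it easy to exclude it from the fast local test runs while still executing it regularly, e.g. on the CI server.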

Having the larger-sized tests in place, the necessary refactorings can be made to break the problematic dependencies and introduce a cleaner architecture. Subsequently, the test scenarios implemented in the larger-sized tests can be nicely reused to create fast-running small tests, swapping out problematic dependencies for lightweight fake implementations. Furthermore, more small tests can be added to further increase the number of covered business rules.

While the small tests can be run quickly and often as part of the local development workflow, the larger-sized tests should still be run on a regular basis in order to verify the functionality against the actual dependencies. (Drawing from my own experience, errors seem to stem as often from self-developed business logic as from unexpected behavior of dependencies - be it caused by actual bugs or just by unclear documentation.)

Summing up

  • To be fast and not flaky, automated tests must be small in size (running inside a single thread).
  • To be maintainable and not brittle, automated tests should test through stable interfaces (public API), focusing on business requirements instead of being too narrow-scoped. This leaves room to freely refactor the internal implementation.
  • Being able to build useful automated tests requires conscious management of source code dependencies. This needs to be kept in mind when designing and implementing a system.
  • Adding useful automated tests to a grown system ex-post can be a laborious - yet worthwhile - endeavour, which may benefit from first creating larger-sized tests as an intermediate step towards fast and deterministic smaller tests.

[figure: small-sized_larger-sized_tests]