[Textbook] Tests

Joel Lim
Oct 10, 2021

So, going back to pyramids versus honeycombs: when I read advocates of the honeycomb and similar shapes, I usually hear them criticize the excessive use of mocks and talk about the various problems this leads to. From this I infer that their definition of “unit test” is specifically what I would call a solitary unit test. Similarly, their notion of an integration test sounds very much like what I would call a sociable unit test. This makes the pyramid versus honeycomb discussion largely moot, since the descriptions I’ve heard of the test pyramid allow unit tests to be either sociable or solitary.

This semantic picture is made even muddier by the even looser definition of “integration test”, which makes “unit test” look tightly defined by comparison. The take-away here is that when anyone starts talking about testing categories, dig deeper into what they mean by their words, because they probably don’t use them the same way as the last person you read did.

A more important distinction is whether the unit you’re testing should be sociable or solitary [2]. Imagine you’re testing an order class’s price method. The price method needs to invoke some functions on the product and customer classes. If you like your unit tests to be solitary, you don’t want to use the real product or customer classes here, because a fault in the customer class would cause the order class’s tests to fail. Instead you use TestDoubles for the collaborators.
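To make the solitary style concrete, here is a minimal sketch. The Order, Product, and Customer classes and their methods (price, unitPrice, discountRate) are hypothetical, invented just for illustration, and Mockito with JUnit 5 is only one possible choice of test-double tooling; none of it comes from the article itself.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.junit.jupiter.api.Test;

// A solitary test: the collaborators are test doubles, so a fault in
// Product or Customer cannot make this Order test fail.
class OrderPriceSolitaryTest {

    @Test
    void priceAppliesCustomerDiscountToProductPrice() {
        Product product = mock(Product.class);
        Customer customer = mock(Customer.class);
        when(product.unitPrice()).thenReturn(100.0);
        when(customer.discountRate()).thenReturn(0.1);

        Order order = new Order(product, customer, 2);

        // 2 units at 100.0 each, minus a 10% customer discount
        assertEquals(180.0, order.price(), 0.001);
    }
}
```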

But not all unit testers use solitary unit tests. Indeed, when xunit testing began in the 90s, we made no attempt to go solitary unless communicating with the collaborators was awkward (such as a remote credit card verification system). We didn’t find it difficult to track down the actual fault, even if it caused neighboring tests to fail, so we felt that allowing our tests to be sociable didn’t lead to problems in practice.
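For contrast, a sociable version of the same (still hypothetical) test would use the real collaborators, accepting that a fault in Customer could also make the Order test fail:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

// A sociable test: real Product and Customer objects stand in for
// themselves. If Customer is broken, this test may fail too, and in
// practice that has rarely made the real fault hard to track down.
class OrderPriceSociableTest {

    @Test
    void priceAppliesCustomerDiscountToProductPrice() {
        Product product = new Product("widget", 100.0);
        Customer customer = new Customer("Acme", 0.1);

        Order order = new Order(product, customer, 2);

        assertEquals(180.0, order.price(), 0.001);
    }
}
```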

The test pyramid comes up a lot in Agile testing circles and while its core message is sound, there is much more to say about building a well-balanced test portfolio. A common problem is that teams conflate the concepts of end-to-end tests, UI tests, and customer facing tests. These are all orthogonal characteristics. For example a rich javascript UI should have most of its UI behavior tested with javascript unit tests using something like Jasmine. A complex set of business rules could have tests captured in a customer-facing form, but run just on the relevant module much as unit tests are.

I always argue that high-level tests are there as a second line of test defense. If you get a failure in a high-level test, not only do you have a bug in your functional code, you also have a missing or incorrect unit test. Thus I advise that before fixing a bug exposed by a high-level test, you should replicate the bug with a unit test. Then the unit test ensures the bug stays dead.

You have self-testing code when you can run a series of automated tests against the code base and be confident that, should the tests pass, your code is free of any substantial defects. One way I think of it is that as well as building your software system, you simultaneously build a bug detector that’s able to detect any faults inside the system. Should anyone in the team accidentally introduce a bug, the detector goes off. By running the test suite frequently, at least several times a day, you’re able to detect such bugs soon after they are introduced, so you can just look in the recent changes, which makes it much easier to find them. No programming episode is complete without working code and the tests to keep it working. Our attitude is to assume that any non-trivial code without tests is broken.

Self-testing code is a key part of Continuous Integration, indeed I say that you aren’t really doing continuous integration unless you have self-testing code. As a pillar of Continuous Integration, it is also a necessary part of Continuous Delivery.

If you make a certain level of coverage a target, people will try to attain it. The trouble is that high coverage numbers are too easy to reach with low-quality testing. At the most absurd level you have AssertionFreeTesting. But even without that, you get lots of tests looking for things that rarely go wrong, distracting you from testing the things that really matter.
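For illustration, an assertion-free test might look like the sketch below. OrderService is a made-up class; the point is simply that a test like this executes plenty of code, and so inflates coverage numbers, while checking nothing at all.

```java
import org.junit.jupiter.api.Test;

// Every line that placeOrder() touches now counts as "covered", yet the
// test passes as long as no exception is thrown. High coverage, almost
// no information.
class OrderServiceCoverageTest {

    @Test
    void placeOrderRuns() {
        OrderService service = new OrderService();
        service.placeOrder("widget", 2); // no assertions follow
    }
}
```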

Like most aspects of programming, testing requires thoughtfulness. TDD is a very useful, but certainly not sufficient, tool to help you get good tests. If you are testing thoughtfully and well, I would expect a coverage percentage in the upper 80s or 90s. I would be suspicious of anything like 100% — it would smell of someone writing tests to make the coverage numbers happy, but not thinking about what they are doing.

The reason, of course, why people focus on coverage numbers is because they want to know if they are testing enough. Certainly low coverage numbers, say below half, are a sign of trouble. But high numbers don’t necessarily mean much, and lead to ignorance-promoting dashboards. Sufficiency of testing is a much more complicated attribute than coverage can measure. I would say you are doing enough testing if the following is true:

You rarely get bugs that escape into production, and

You are rarely hesitant to change some code for fear it will cause production bugs.

Can you test too much? Sure you can. You are testing too much if you can remove tests while still having enough. But this is a difficult thing to sense. One sign you are testing too much is if your tests are slowing you down. If it seems like a simple change to code causes excessively long changes to tests, that’s a sign that there’s a problem with the tests. This may not be so much that you are testing too many things, but that you have duplication in your tests.

Since tests don’t have tests, it should be easy for humans to manually inspect them for correctness, even at the expense of greater code duplication. This means that the DRY principle often isn’t a good fit for unit tests, even though it is a best practice for production code. In tests we can use the DAMP principle (“Descriptive and Meaningful Phrases”), which emphasizes readability over uniqueness. Applying this principle can introduce code redundancy (e.g., by repeating similar code), but it makes tests more obviously correct.

Note that the DRY principle is still relevant in tests; for example, using a helper function for creating value objects can increase clarity by removing redundant details from the test body. Ideally, test code should be both readable and unique, but sometimes there’s a trade-off. When writing unit tests and faced with a choice between the DRY and DAMP principles, lean more heavily toward DAMP.
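A rough sketch of that trade-off, using a Forum and User invented purely for illustration: each test spells out its own scenario in full (DAMP), while a small helper keeps irrelevant constructor details out of the test body (a little DRY where it helps readability).

```java
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

class ForumTest {

    // A touch of DRY: hide the constructor arguments that don't matter
    // to the tests, but keep the meaningful ones visible at the call site.
    private static User newUser(String name, boolean banned) {
        return new User(name, banned, "unused-email@example.com");
    }

    // DAMP: each test repeats similar setup, but a reader can check each
    // one for correctness on its own, without chasing shared fixtures.
    @Test
    void bannedUsersCannotPost() {
        Forum forum = new Forum();
        User bannedUser = newUser("alice", true);
        assertFalse(forum.canPost(bannedUser));
    }

    @Test
    void activeUsersCanPost() {
        Forum forum = new Forum();
        User activeUser = newUser("bob", false);
        assertTrue(forum.canPost(activeUser));
    }
}
```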

Tests create a feedback loop that informs the developer whether the product is working or not. The ideal feedback loop has several properties:

It’s fast. No developer wants to wait hours or days to find out if their change works. Sometimes the change does not work — nobody is perfect — and the feedback loop needs to run multiple times. A faster feedback loop leads to faster fixes. If the loop is fast enough, developers may even run tests before checking in a change.

It’s reliable. No developer wants to spend hours debugging a test, only to find out it was a flaky test. Flaky tests reduce the developer’s trust in the test, and as a result flaky tests are often ignored, even when they find real product issues.

It isolates failures. To fix a bug, developers need to find the specific lines of code causing the bug. When a product contains millions of lines of code, and the bug could be anywhere, it’s like trying to find a needle in a haystack.

If you want to reduce your test mass, the number one thing you should do is look at the tests that have never failed in a year and consider throwing them away. They are producing no information for you — or at least very little information. The value of the information they produce may not be worth the expense of maintaining and running the tests. This is the first set of tests to throw away — whether they are unit tests, integration tests, or system tests.

Another client of mine also had too many unit tests. I pointed out to them that this would decrease their velocity, because every change to a function should require a coordinated change to the test. They informed me that they had written their tests in such a way that they didn’t have to change the tests when the functionality changed. That of course means that the tests weren’t testing the functionality, so whatever they were testing was of little value.

Don’t underestimate the intelligence of your people, but don’t underestimate the collective stupidity of many people working together in a complex domain. You probably think you would never do what the team above did, but I am always finding more and more things like this that almost defy belief. It’s likely that you have some of these skeletons in your closet. Hunt them out, have a good laugh at yourself, fix them, and move on. If you have tests like this, that’s the second set of tests to throw away.

The third set of tests to throw away is the tautological ones. I see more of these than you can imagine, particularly in shops following what they call test-driven development. (Testing for this being non-null on entry to a method is, by the way, not a tautological test, and can be very informative. However, as with most unit tests, it’s better to make this an assertion than to pepper your test framework with such checks. More on that below.)
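The sketch below illustrates both halves of that parenthetical with made-up Account, Payment, and PaymentProcessor classes: first a genuinely tautological test that merely restates a setter, then a non-null expectation expressed as a precondition assertion in the production code rather than as yet another unit test.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.Objects;

import org.junit.jupiter.api.Test;

class AccountTest {

    // Tautological: this restates the setter's implementation, so it can
    // only fail if the language itself is broken. It will never catch a
    // real defect.
    @Test
    void setNameStoresName() {
        Account account = new Account();
        account.setName("Alice");
        assertEquals("Alice", account.getName());
    }
}

class PaymentProcessor {

    // The non-null expectation lives in the production code as a
    // precondition assertion instead of as a separate unit test.
    public Receipt process(Payment payment) {
        Objects.requireNonNull(payment, "payment must not be null");
        // ... actual processing would go here ...
        return new Receipt(payment);
    }
}
```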

Running databases for unit testing used to suck

In the past, setting up a database to use with unit testing was a hassle. It meant running a database service on either a physical or virtual server, and even once you got a server, the install process that followed was generally long in its own right.

In big companies, this could mean working with many teams, which drags the process out even further. That alone is a common deterrent to using a real database within unit tests.

Even after going through the process, it is often cost- or time-prohibitive to get more than one server. This makes it even more difficult to use a real database within a CI pipeline.

Most applications have many developers. Trying to use a single database service for every developer doesn’t work. It’s very difficult to coordinate more than one build at a time.

With the speed of development these days, running a pipeline one build at a time is a major hindrance. It’s also a waste of time.

What I’ve seen people do in response is run these tests only on the main branch build, so pull requests and local unit test runs skip them. This means less testing is performed on pull requests and more testing happens after the merge, which is fundamentally flawed.

The point of executing builds on pull requests is to ensure that new changes work with existing code. Once merged, everything should just work. If pull requests are lacking in tests, then there is a higher likelihood of breaking the main build.

Breaking the main build means rolling back changes, which is difficult on high-velocity repositories: with many pull requests being submitted and merged on the same day, it can be very hard to find which change broke the build.
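For what it is worth, one common way teams sidestep the shared-server problem these days is to start a throwaway database per build inside a container. The sketch below is only an illustration of that idea: it assumes Docker is available on the build machine and uses the Testcontainers library with JUnit 5 and PostgreSQL, none of which are mentioned in the text above.

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.sql.Connection;
import java.sql.DriverManager;

import org.junit.jupiter.api.Test;
import org.testcontainers.containers.PostgreSQLContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;
import org.testcontainers.utility.DockerImageName;

// Each build gets its own disposable Postgres instance, so there is no
// shared database server for pull requests and local runs to queue behind.
@Testcontainers
class ThrowawayDatabaseTest {

    @Container
    static final PostgreSQLContainer<?> postgres =
            new PostgreSQLContainer<>(DockerImageName.parse("postgres:15-alpine"));

    @Test
    void canTalkToItsOwnDatabase() throws Exception {
        try (Connection conn = DriverManager.getConnection(
                postgres.getJdbcUrl(), postgres.getUsername(), postgres.getPassword())) {
            assertTrue(conn.isValid(2));
        }
    }
}
```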

I think you’re missing the point when you refer to Fowler’s testing pyramid. Martin Fowler sees a unit as a unit of behaviour, not a unit of code. This means what Kent Dodds refers to as an “integration test” is basically the same as what Martin Fowler sees as a “unit test”. Martin Fowler writes unit tests that don’t mock out all collaborating code, and that’s the key to it all. The problem is that so many people consider a unit to be a unit of code, and that’s where you end up with tests that are tightly coupled to implementation details and start down the road of diminishing returns.
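A small sketch of that difference, with a made-up ShoppingCart (and, in the second test, a hypothetical internal PriceCalculator collaborator): the first test pins down observable behaviour, while the second pins down implementation details and will break under a perfectly harmless refactoring.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;

import org.junit.jupiter.api.Test;

class ShoppingCartTest {

    // Unit of behaviour: assert on the observable result. Changing how the
    // total is computed internally does not break this test.
    @Test
    void totalSumsItemPrices() {
        ShoppingCart cart = new ShoppingCart();
        cart.add(new Item("book", 12.50));
        cart.add(new Item("pen", 2.50));
        assertEquals(15.00, cart.total(), 0.001);
    }

    // Unit of code: verify which internal collaborator was called. Rename
    // or restructure that internal and this test fails even though the
    // behaviour is unchanged.
    @Test
    void totalDelegatesToCalculator() {
        PriceCalculator calculator = mock(PriceCalculator.class);
        ShoppingCart cart = new ShoppingCart(calculator);
        cart.total();
        verify(calculator).sumOf(cart.items());
    }
}
```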
