Case Study: Facing the Growing Complexity of a Software Project by Overcoming Flaky Tests

Executive summary

This article started as a case study on flaky tests and CI automation, but the investigation quickly revealed a deeper problem.

Flaky tests are rarely the root cause. They are often a symptom of growing software complexity. As systems evolve, feedback loops that engineers rely on become less trustworthy. Tests fail when nothing is broken, pipelines become noisy and teams gradually lose confidence in the signals used to validate changes.

Once trust in feedback is lost, the cost of every change starts increasing. Engineers spend more time investigating failures, planning becomes less predictable and maintenance consumes an ever larger portion of development capacity.

This publication examines flaky tests through the broader perspective of feedback loops, software complexity and long-term maintainability. It also describes practical measures implemented in a real project and discusses the trade-offs between testing costs, architectural quality and delivery predictability.

The central argument is simple: flaky tests are not merely a testing problem. They are an early warning signal that system complexity is growing faster than the engineering practices used to control it.

Context

Feedback loop as a basis for learning

A feedback loop is the basis of our learning process. We try a dish, and if we don't like it, we avoid it in the future; if we like it, we cook it more often. The feedback here is the taste of the dish we cooked five minutes ago, and based on it we make a conscious decision whether to cook it again.

Feedback loop in software development

In software development we receive such feedback in several ways:

The compiler in the IDE builds our code without errors
Running the code doesn't reveal any errors either
Tests pass

Other, less obvious signals could be added here as well: we get paid for our work, we get moral satisfaction from solving a problem, we get promoted, the customer gives us positive feedback and so on.

It is a bit off-topic for this publication, but it is worth mentioning a fundamental difference between positive feedback for a developer and positive feedback for a business owner.

While engineers are usually happy with the things above, the customer's main drivers are different: end users are happy with the product and stakeholders get a sufficient return on their investment.

When these two sets of incentives are aligned, the customer and the engineering team move in the same direction. But by default there is a gap between them which has to be overcome. That is one of the key points I have been advocating for more than a decade.

Feedback loop built into the process

Engineers go even further and integrate such feedback loops into the development process itself -- continuous integration, continuous delivery, continuous deployment.

Each commit to the codebase introduces new changes and therefore opens a possibility of breaking something. CI addresses this by automatically compiling the code and executing any other tasks the engineers decide to add.

Testing and its automation as a feedback loop amplifier

This was mentioned earlier, but it is important to stress the point again. While a successful compilation tells us that everything is fine from the compiler's perspective, the program itself could still be far from a working product. Tests, and especially automated tests, strengthen this feedback further.

Issue #1: tests stop serving as a reliable indicator

When we have implemented tests which cover all key user scenarios, passing all tests means our product performs all user actions correctly -- the product does its job. In this case, looking at the passed tests is enough to move forward.

But as the project grows, we start to face situations where tests are flaky -- they fail even though the codebase is fine. They require more time and often changes in architecture (there is a view that flaky tests are caused by issues in the software architecture), but the point is that you stop relying on them as a source of confidence that your product is doing well. You are no longer sure whether your commit broke something or not.

By the time production incidents increase, delivery slows down and maintenance costs become visible, architectural complexity has often been growing for years. Flaky tests frequently appear much earlier and may therefore serve as one of the first warning signals that engineering practices are no longer keeping pace with system growth.
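
To make the term concrete, here is a minimal sketch of one common kind of flaky test: a check whose result depends on execution order because two checks share mutable state. The production function is correct in both runs; only the order changes. All names (`get_discount`, `check_vip_discount` and so on) are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List

# Shared mutable state, visible to every check in this module.
_discounts: Dict[str, int] = {}


def set_discount(user: str, percent: int) -> None:
    _discounts[user] = percent


def get_discount(user: str) -> int:
    # Production code under test: users without an entry get 10 percent.
    return _discounts.get(user, 10)


def check_default_discount() -> None:
    assert get_discount("alice") == 10


def check_vip_discount() -> None:
    # Mutates the shared state and never restores it.
    set_discount("alice", 25)
    assert get_discount("alice") == 25


def run(checks: List[Callable[[], None]]) -> List[str]:
    """Run the checks in the given order and return the names of failed ones."""
    _discounts.clear()  # every run starts from a clean state
    failed = []
    for check in checks:
        try:
            check()
        except AssertionError:
            failed.append(check.__name__)
    return failed


# Same code, same checks, different order, different verdict.
order_a = run([check_default_discount, check_vip_discount])
order_b = run([check_vip_discount, check_default_discount])
print(order_a)
print(order_b)
```

In the first order nothing fails; in the second, `check_default_discount` fails although `get_discount` has not changed. A test runner that parallelises or shuffles tests can switch between these two orders from one pipeline run to the next.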

Issue #2: maintaining tests takes more and more time

Another issue comes from the fact that tests are also software: they also require engineers and engineering time to implement and maintain. They can also have bugs, become obsolete and require refactoring. As the main codebase grows, the number of tests grows too, and with it the cost of their development and maintenance.

When issue #1 appears together with the costs of issue #2, we come to the question -- is it worth investing in tests at all? Do we need them?

Solution

The answer is yes, we still need them. Software needs to be verified to work as expected, and all of this verification, or at least part of it, should be automated, because automation makes verification cheaper. Tests have served us well up to this point; the current issues mean we may eventually need to invent something else, but first we should improve the situation under the current circumstances.

My investigation, with the help of ChatGPT, showed that software engineers worldwide face the same issue. Tests work up to a point, but beyond a certain threshold of project complexity they start to fail and require more and more time to maintain. No matter how talented the engineers are or how brilliant the software architecture is, the complexity of a growing project causes this issue. Maintaining 100% test coverage is very expensive, and my personal approach of end-to-end integration tests for all user scenarios plus a unit test for each bug found stops working too. If others face the same issues, which options have they tried, and which did they end up with?

A survey of the best practices of market leaders revealed several interesting ideas, but let me describe the aggregated solution which I found reasonable for my case. Each company and project is different, which is why directly transplanting practices and recipes is often a bad idea. A working solution is always a creative process based on your experience and on the limitations that exist at the moment and in the near future. I prefer to do it the lean way, experimenting on a small part or project before scaling further.

Rerunning tests

The first and probably the simplest mitigation strategy is rerunning failed tests automatically. If a test occasionally fails because of timing issues, unstable external dependencies, network latency or other non-deterministic factors, executing it one more time often produces a successful result.

This approach does not solve the root cause of the problem and therefore should not be considered a final solution. However, it significantly reduces the noise level in the feedback loop and allows engineers to focus on real failures instead of investigating false alarms every day.

In our case we implemented a separate CI stage which automatically reruns failed tests several times and collects additional metrics. This allows us to distinguish deterministic failures from flaky ones and build statistics around test stability. As a result engineers spend less time restarting pipelines manually and receive a cleaner signal from the testing system.
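
The rerun-and-classify idea can be sketched in a few lines. This is an illustration under stated assumptions, not the actual CI stage from the project: a test is modelled as a plain callable that raises on failure, and the names `classify_test`, `run_stage` and `make_unstable_check` are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict


def classify_test(test: Callable[[], None], max_reruns: int = 3) -> str:
    """Run a test, rerun it on failure, and classify the outcome.

    "passed": green on the first attempt
    "flaky":  failed at least once, then passed on a rerun
    "failed": failed on the first attempt and on every rerun
    """
    for attempt in range(1 + max_reruns):
        try:
            test()
        except Exception:
            continue  # try again until the rerun budget is exhausted
        return "passed" if attempt == 0 else "flaky"
    return "failed"


def run_stage(tests: Dict[str, Callable[[], None]], max_reruns: int = 3) -> Dict[str, str]:
    """Classify every test in the stage."""
    return {name: classify_test(test, max_reruns) for name, test in tests.items()}


def make_unstable_check(failures_before_pass: int) -> Callable[[], None]:
    """Simulate a flaky test: fail a fixed number of times, then pass."""
    remaining = [failures_before_pass]

    def check() -> None:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise AssertionError("simulated timing failure")

    return check


def always_passes() -> None:
    pass


def always_fails() -> None:
    raise AssertionError("simulated real defect")


results = run_stage({
    "stable": always_passes,
    "unstable": make_unstable_check(failures_before_pass=2),
    "broken": always_fails,
})
summary = Counter(results.values())  # stability statistics for the stage
print(results)
print(summary)
```

Only the "failed" class needs immediate attention from the author of the change. The "flaky" class is what feeds the stability statistics and, later, the quarantine list.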

Quarantining tests as a separate stage

The second idea comes from practices used by many large software companies. Once a test is identified as flaky, it should stop blocking delivery of new functionality. Otherwise the entire engineering process becomes hostage to unreliable automation.

This does not mean the test should be removed. Instead it is moved into a dedicated quarantine stage. Such tests are still executed, their results are still collected and monitored, but they no longer determine whether the entire pipeline succeeds or fails.

This approach creates a clear separation between two different goals: validating business functionality and improving test quality. Product development can continue while engineering teams gradually reduce the technical debt accumulated in the testing infrastructure.

In practice this solution turned out to be surprisingly effective because it restores trust in the main CI pipeline. If the main pipeline is green, engineers can reasonably assume that the product is healthy. Failures in the quarantine stage become a separate stream of work instead of random interruptions to daily development.
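
The gating rule behind a quarantine stage is simple enough to express directly. The sketch below assumes test results are available as a name-to-verdict mapping and that the quarantine list is a plain set of test names; `PipelineReport` and `evaluate_pipeline` are hypothetical names, not part of any CI product.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class PipelineReport:
    passed: bool                    # verdict of the blocking stage only
    blocking_failures: List[str]    # failures that stop delivery
    quarantine_failures: List[str]  # failures that are recorded but do not block


def evaluate_pipeline(results: Dict[str, bool], quarantined: Set[str]) -> PipelineReport:
    """Split failures into blocking and quarantined ones.

    `results` maps a test name to True (passed) or False (failed).
    Quarantined tests are still executed and reported, but they do not
    influence the overall verdict.
    """
    failed = sorted(name for name, ok in results.items() if not ok)
    blocking = [name for name in failed if name not in quarantined]
    quarantine = [name for name in failed if name in quarantined]
    return PipelineReport(
        passed=not blocking,
        blocking_failures=blocking,
        quarantine_failures=quarantine,
    )


quarantined = {"email_delivery"}

# A known-flaky test fails: the pipeline stays green, the failure is still visible.
green = evaluate_pipeline(
    {"checkout": True, "login": True, "email_delivery": False}, quarantined
)

# A regular test fails: the pipeline is red regardless of the quarantine list.
red = evaluate_pipeline(
    {"checkout": True, "login": False, "email_delivery": False}, quarantined
)
print(green)
print(red)
```

Keeping the quarantine failures in the report rather than discarding them is the part that matters: it is what turns them into a separate, visible stream of work.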

Investing in building a more testable architecture

Eventually every team discovers that flaky tests are often symptoms rather than the disease itself. The deeper cause frequently lies in software architecture.

Systems with hidden dependencies, shared mutable state, asynchronous race conditions and tightly coupled components are difficult to test reliably. The testing framework simply exposes problems that already exist inside the design of the system.

Therefore part of the solution is investing in architecture that is easier to verify. Clear interfaces, dependency injection, separation of concerns, deterministic behaviour and well-defined boundaries between components reduce both production defects and testing complexity.
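
As a small illustration of dependency injection and deterministic behaviour, consider code that depends on the current time. The sketch below is hypothetical (`Session`, `FakeClock` and `system_clock` are invented for this example): the clock is passed in through the constructor, so a test can control time explicitly instead of sleeping and hoping the timing works out.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable

Clock = Callable[[], datetime]


def system_clock() -> datetime:
    """Default dependency used in production: the real wall clock."""
    return datetime.now(timezone.utc)


class Session:
    """A session that expires after a fixed time to live."""

    def __init__(self, ttl: timedelta, clock: Clock = system_clock) -> None:
        self._clock = clock  # injected, not looked up globally
        self._expires_at = clock() + ttl

    def is_expired(self) -> bool:
        return self._clock() >= self._expires_at


class FakeClock:
    """Test double: time moves only when the test says so."""

    def __init__(self, start: datetime) -> None:
        self.now = start

    def __call__(self) -> datetime:
        return self.now

    def advance(self, delta: timedelta) -> None:
        self.now += delta


clock = FakeClock(datetime(2024, 1, 1, tzinfo=timezone.utc))
session = Session(ttl=timedelta(minutes=30), clock=clock)

expired_at_start = session.is_expired()
clock.advance(timedelta(minutes=29))
expired_after_29 = session.is_expired()
clock.advance(timedelta(minutes=1))
expired_after_30 = session.is_expired()
print(expired_at_start, expired_after_29, expired_after_30)
```

The same check written against the real clock would need a 30-minute wait or a tight timing window, which is exactly the kind of test that tends to become flaky.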

Such investments rarely provide immediate visible business value, which is why they are often postponed. However, over time they reduce maintenance costs, increase delivery speed and improve confidence in every change made to the codebase.

Moving to TDD as a prerequisite for testable architecture by default

Looking further into the future, I came to the conclusion that testability should not be added after the architecture is already built. It should be one of the forces shaping the architecture from the very beginning.

As said earlier, many issues with tests come from issues with software architecture -- the architecture wasn't designed with testability in mind. Many teams separate the engineers who write the main code from the engineers who write automated tests. This is considered cost-effective, but this topic reveals a fundamental issue with it -- you cannot design a proper architecture unless you develop the test coverage for the product yourself. This causes the whole cascade of issues described in this publication.

This is one of the strongest arguments in favour of Test Driven Development. Regardless of discussions around productivity, TDD forces engineers to think about interfaces, boundaries and behavior before implementation details appear.

When a component is difficult to test, it is often a sign that its design can be improved. TDD turns testability into a design constraint and therefore naturally encourages creation of loosely coupled and more maintainable systems.

I would not claim that TDD eliminates flaky tests completely. Complexity still grows, external integrations still exist and large systems remain challenging. However, TDD shifts the architecture in a direction where verification becomes cheaper and reliability remains manageable for a much longer period of time.

TDD as a common engineering practice might be a topic for a holy war. I hold my proposal here loosely, but the key point is that testability must influence architecture, and that is where I stand firm.

Post scriptum: complexity of projects, exponentially increasing cost of fixing bugs vs linearly increasing cost of tests, trade-offs

There is another aspect of this discussion which is worth mentioning. Whenever engineers talk about automated testing, the conversation often becomes polarized: either tests are considered mandatory for everything, or they are treated as unnecessary overhead slowing down delivery.

My experience suggests that reality is much more nuanced. Both approaches have costs, and software engineering is largely a discipline of managing trade-offs between them.

Software developed without sufficient automated verification creates a high level of uncertainty. Engineers spend less time writing and maintaining tests today, but they pay for this decision later through debugging, production incidents, regression defects and unpredictable delivery dates.

The most problematic consequence is not even the bugs themselves. It is the loss of predictability. When the team cannot reliably estimate the effort required to validate changes, planning becomes increasingly difficult. Delivery dates become less trustworthy, maintenance activities consume a growing portion of engineering capacity and every modification carries additional risk.

Automated tests are not a silver bullet either. Tests are software themselves. They require implementation, maintenance, refactoring and occasionally complete redesign. As demonstrated by flaky tests, testing infrastructure can become a source of complexity on its own.

However, while tests introduce an additional maintenance cost, they also create a more predictable development environment. Teams receive faster feedback, defects are discovered earlier and the cost of investigating failures becomes more manageable. In practice this additional investment often pays for itself through reduced uncertainty.

The deeper issue behind many struggling software projects is the hidden growth of development costs over time. As systems become larger, engineers gradually spend less time delivering new functionality and more time understanding existing code, fixing regressions, maintaining integrations, resolving incidents and preserving compatibility with previous decisions.

This phenomenon has been observed repeatedly across the industry. Studies and industry reports consistently show that technical debt, architectural complexity and insufficient quality controls reduce delivery performance and increase operational costs. In many projects the transition happens so gradually that it remains unnoticed until maintenance consumes the majority of engineering effort.

From this perspective, flaky tests are not merely a testing problem. They are an early warning signal indicating that system complexity is growing faster than the engineering practices used to control it.

The goal is therefore not to maximize or minimize testing. The goal is to maintain a sustainable balance where the cost of verification remains lower than the uncertainty it removes. Different teams, products and business environments will find this balance at different points.

In the end, software development remains a process of learning. Reliable feedback allows teams to learn faster, make better decisions and maintain confidence while systems continue to grow. Automated testing, despite all its imperfections, remains one of the most effective mechanisms currently available to support that learning process.