<aside> 💡 Hola friends! We recently created a Discord for Software Automation where we discuss topics regarding software automation.
Responses come from people with 3 to 30+ years of experience in the industry, and we wanted to share this with the community!
</aside>
A flaky test is a test that produces inconsistent results, failing or passing unpredictably without any modifications to the code under test. Unlike reliable tests, which yield the same result consistently, flaky tests create uncertainty and pose challenges for software development teams.
<aside> 💡 Below are the gathered discussions from our Discord server, enjoy!
</aside>
Test flakiness is inevitable, especially with UI tests. Here are a few things that we use to try to mitigate the issue:
- Created a wrapper around our test scripts (i.e. phpunit) that reruns failed tests up to 3 times. If the test fails 3 times, we are pretty certain it's a real failure (a sketch of this rerun idea, translated to Java, follows this list).
- Very basic, but keep metrics! You can't fix what you don't know. Most testing libraries can output summaries. Save those results to a DB. Create a dashboard using DataDog, Grafana, or any other observability tooling. Trigger alerts if a test is failing above a certain threshold.
- Stress-test new or modified tests as a preventative measure. We created a GitHub Action that we use for our test suites. It checks the modified files and automatically stress-tests new or modified tests (i.e. it runs the test 100 or 1000 times). If those tests don't pass above a certain threshold, we fail the CI check and the author has to fix them.
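The wrapper above is around phpunit, but the rerun-before-declaring-failure idea translates to other stacks. Here is a minimal sketch as a JUnit 4 `TestRule`; the class name and attempt count are illustrative, not the actual wrapper described above:

```java
import org.junit.rules.TestRule;
import org.junit.runner.Description;
import org.junit.runners.model.Statement;

// Reruns a failing test up to maxAttempts times; only if every attempt fails
// do we surface the failure, which we then treat as a real one.
public class RetryRule implements TestRule {
    private final int maxAttempts;

    public RetryRule(int maxAttempts) {
        this.maxAttempts = Math.max(1, maxAttempts);
    }

    @Override
    public Statement apply(Statement base, Description description) {
        return new Statement() {
            @Override
            public void evaluate() throws Throwable {
                Throwable lastFailure = null;
                for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                    try {
                        base.evaluate();
                        return; // passed on this attempt
                    } catch (Throwable t) {
                        lastFailure = t;
                        System.err.println(description.getDisplayName()
                                + " failed on attempt " + attempt + "/" + maxAttempts);
                    }
                }
                throw lastFailure; // failed every attempt: likely a real failure
            }
        };
    }
}
```

A test class opts in with `@Rule public RetryRule retry = new RetryRule(3);`, so only the suites you choose get the retry behavior, and the failure counts still show up in your metrics.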
Poll, poll, poll. Every time I think `FindElement` (implicit wait) is going to work, I am later proven wrong and I have to put a Wait in.
Avoid all of those sleeps, snoozes, etc.; they're unsustainable. Poll for elements to exist, be visible, etc. instead.
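In Selenium 4's Java bindings, that polling looks like an explicit `WebDriverWait`. A minimal sketch, where the helper class name and timeout are arbitrary choices rather than anything from the discussion:

```java
import java.time.Duration;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public final class Waits {
    private Waits() {}

    // Poll until the element is actually visible instead of sleeping for a fixed time.
    public static WebElement waitForVisible(WebDriver driver, By locator, Duration timeout) {
        return new WebDriverWait(driver, timeout)
                .until(ExpectedConditions.visibilityOfElementLocated(locator));
    }
}
```

`Waits.waitForVisible(driver, By.id("save"), Duration.ofSeconds(10)).click();` then replaces a fixed `Thread.sleep`: it returns as soon as the element shows up and fails loudly if it never does.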
Locators: we need to be better at creating resilient locators, and it's hard to trust developers will add or maintain IDs on elements.
I almost never check for exact text. If it's "Save User" and I'm sure this page will have no other save button, then I'll just find the button containing "Save"... at least until a developer gets clever and renames it to "Update". They're sneaky like that.
Create well-thought-out but relaxed CSS selectors or XPaths. Quite often this means learning how to construct them; generated locators are flaky and break quickly if the page is generated output from something like React or Angular.
Flaky tests are a tricky beast: while they most commonly occur in the testing layer of the stack (surface level), they can also be indicative of problems in other parts of the stack, such as the application itself or the infrastructure layer that hosts it.
It's really important to dig into every single flaky test failure to at least figure out where the problem lies. This sometimes becomes a team effort the more you rule out common flaky issues. Depending on your team's test culture/discipline, success against flaky tests will vary.
It's rare, but it has happened enough times: a flaky test turned out to be a rare edge-case bug that later showed up in production.
It's also really important to dig into the problem as soon as it occurs, because reproducing it later could prove difficult if the problem turns out to be something deep and complex (like an infrastructure-related issue that rears its symptoms once or twice a day in the form of a test failure).
In reality, the most common flakiness is due to poorly written tests. Common examples I see are forgetting to use polling mechanisms, not writing a test to be parallel-friendly, or having tests rely on one another. These problems emerge the more you scale your test-suite and introduce more workers/resources.
For example, if your test has a 100% success rate when running by itself, but a 20% success rate when running 2 at a time in parallel, then you're gonna have a bad time when you decide you want to run 15 at a time, or 40.
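A common reason for exactly that pattern is tests fighting over shared fixture data. A minimal JUnit 5 sketch of the fix, with a made-up class and helper, is to have every test generate its own data instead of reusing a hard-coded record:

```java
import static org.junit.jupiter.api.Assertions.assertNotEquals;

import java.util.UUID;
import org.junit.jupiter.api.Test;

class ParallelFriendlyDataTest {

    // Hypothetical helper: each test builds its own user instead of sharing a "testuser" fixture.
    private String uniqueUsername() {
        return "user-" + UUID.randomUUID();
    }

    @Test
    void eachTestGetsItsOwnData() {
        // Two parallel workers can never collide on the same username,
        // unlike a hard-coded record shared by every test in the suite.
        assertNotEquals(uniqueUsername(), uniqueUsername());
    }
}
```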
The second type, which is less common but easier to describe, is application bugs: poorly written app code. There are times when I'll have to ask a domain expert to investigate the test failure when I'm not sure what the business logic should be.
Again, these are less common, but they happen enough that I make them part of my checklist when addressing flaky tests.
Nothing stings more than a flaky test you ignored coming back to bite you in prod lol.
The remaining types of flakiness I put into the "Environment/Infra" bucket. These failures might only happen in certain environments (Dev or CI).
For example, a test may fail because the process ran out of memory, but it only occurs in CI because the CI machine may have different configs and the extra overhead of running automation frameworks.
Maybe you're not isolating your testing databases and one test is leaking into another, causing a data failure.
Maybe a dev introduced a bad data migration which is now breaking tests for everybody, or just some tests, etc.
Maybe a new dependency that was installed on the machines is now causing an image rendering problem 1 out of 100 times.
These last types of failures are really hard to pin down so I try to simplify and isolate my testing environment as much as possible and deduce the easiest things first.
Kinda like a hospital, I try to keep things sanitized and not share needles, to prevent weird edge cases that I never saw coming, which sometimes lead to me chasing ghosts (problems that appear/disappear on their own).
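For tests that talk to the database directly, one concrete way to stop them leaking into each other through shared data (the scenario mentioned above) is to give every test its own transaction and roll it back afterwards. A minimal JUnit 5 + JDBC sketch, assuming a placeholder Postgres test database URL and credentials:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;

class IsolatedDbTest {

    private Connection connection;

    @BeforeEach
    void openTransaction() throws Exception {
        // Placeholder connection details; point this at your own test database.
        connection = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/testdb", "test", "test");
        connection.setAutoCommit(false); // everything the test writes stays inside this transaction
    }

    @AfterEach
    void rollBack() throws Exception {
        connection.rollback(); // undo the test's writes so they can't leak into other tests
        connection.close();
    }

    // Tests in this class run their queries through `connection`
    // and leave no trace behind for the next test.
}
```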
In terms of actually addressing/preventing them, here's what we've done in the past, which others mentioned in more detail, so I'll attempt to keep it short:
- Treat test code like any other type of code. Have it go through your normal code review process and treat it as first-class-citizen code.
- Stress test any newly introduced test in parallel to sniff out any common issues (like needing to add a polling mechanism or make it parallel safe).
- Metrics for all your flaky tests and your test-suite in general. This is critical for debugging/addressing flaky tests.
- Think of the environment. Isolate/sanitize your tests as much as possible to help prevent weird issues in the first place.
- Good TEAM TEST culture to actually ADDRESS the flaky tests as they come up and not just sweep them under the rug.
- Error reporting (like Sentry or Datadog) for your tests to see if any flaky tests share common failures.
- Have a mechanism to quarantine your tests temporarily while addressing the test (please only do this if you are actually going to address the problem).
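One lightweight way to implement that quarantine, if you're on JUnit 5, is a tag that the blocking CI job excludes while a separate, non-blocking job keeps running it; the tag name and test below are made up for the sketch:

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Tag;
import org.junit.jupiter.api.Test;

class CheckoutTest {

    // Quarantined: tracked in a ticket, excluded from the blocking CI job
    // (e.g. by filtering out the "quarantine" tag in the build configuration),
    // but still executed elsewhere so we keep collecting data on it.
    @Tag("quarantine")
    @Test
    void appliesDiscountCode() {
        assertTrue(true); // placeholder body for the sketch
    }
}
```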
Once you reduce the flakiness, you'll increase confidence in the test-suite, which will in turn increase team/developer buy-in. Getting that momentum and critical mass can be difficult if you have a lot of flaky tests or a poor test culture.
Also, CI/CD becomes a well-oiled machine and easier to maintain, since you reduce a lot of "ghosts" in the system. The more work you put into it, the easier test-suite maintenance will eventually get.
Schedule some time with devs and learn how to run the app locally like they do. Then learn how to add `data-testid` selectors yourself. Then add some and create a pull request for them to review and merge in. This will solve the flaky selector issue.
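Once those attributes are in the app, the test-side locator is trivial. A hedged sketch, where the attribute value is a made-up example:

```java
import org.openqa.selenium.By;

public final class SaveUserLocators {
    private SaveUserLocators() {}

    // Assumes the devs merged a change like: <button data-testid="save-user">Save User</button>
    // The attribute survives copy changes ("Save" -> "Update") and regenerated CSS class names.
    public static final By SAVE_USER_BUTTON = By.cssSelector("[data-testid='save-user']");
}
```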
Other than that, I agree with @Xeek above: build your framework so that it polls. If you're using Selenium, wrap the driver (you should be doing this anyway) and use that to create your page objects.
At one of the companies I worked for, whenever we instantiated a page object, the constructor would wait for the last element on the page to load before beginning the test.
This was using Java/Selenium, where explicitly waiting for things is usually quite a big deal. That helped with a lot of the flake we previously saw where pages weren't in a state for a test to actually start.
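A minimal sketch of that pattern in Java/Selenium, with a hypothetical page and a made-up "last element to load":

```java
import java.time.Duration;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class UserProfilePage {

    // Hypothetical "last element to load" on this page; for a real page,
    // pick whatever element reliably renders last.
    private static final By ACTIVITY_FEED = By.id("activity-feed");

    private final WebDriver driver;

    public UserProfilePage(WebDriver driver) {
        this.driver = driver;
        // Block in the constructor until the page is actually ready,
        // so no test starts interacting with a half-loaded page.
        new WebDriverWait(driver, Duration.ofSeconds(15))
                .until(ExpectedConditions.visibilityOfElementLocated(ACTIVITY_FEED));
    }
}
```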
Aside from that, I couldn't agree more with @Xeek above. Explicit waits will save your life. Are you closing out of a modal? Explicitly wait for the modal to either not be present, or for the next element you are going to interact with on your page to be visible/clickable (you can actually create flake if you're not careful though, for example if an element is visible even with said modal being present).
Are you navigating to another page/header? Explicitly wait for the next page's last element that loads to be visible. Are you opening a new tab? Wait for something! You get the idea.
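For the modal case, a hedged sketch of those explicit waits; the locators are placeholders for your app's markup:

```java
import java.time.Duration;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public final class ModalHelpers {
    private ModalHelpers() {}

    // Placeholder locators; adjust to your application's markup.
    private static final By MODAL = By.cssSelector(".modal");
    private static final By CLOSE_BUTTON = By.cssSelector(".modal .close");

    public static void closeModal(WebDriver driver) {
        WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(10));
        wait.until(ExpectedConditions.elementToBeClickable(CLOSE_BUTTON)).click();
        // Don't just click and move on: wait for the modal to actually disappear
        // before the test interacts with anything underneath it.
        wait.until(ExpectedConditions.invisibilityOfElementLocated(MODAL));
    }
}
```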
Whenever possible I like to have tests be able to run in isolation. Chaining a bunch of tests off of each other is just begging for flake. If you have a sequence of tests that all depend on each other, then as soon as one fails, the rest in the sequence are going to fail. In situations where you have to have tests run off of each other, try and come up with ways in a before or after method to get the application back to a state where the next test will pass.
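As a closing sketch of that isolation and before/after reset idea (everything here is illustrative: the URL is a placeholder and the setup assumes a local chromedriver), each test starts from a known state and cleans up whether it passed or failed:

```java
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

class IsolatedUiTest {

    private WebDriver driver;

    @BeforeEach
    void startFromKnownState() {
        // Fresh browser per test: no cookies, storage, or open tabs leaking in
        // from whichever test happened to run before this one.
        driver = new ChromeDriver();
        driver.get("https://app.example.test/login"); // placeholder URL for your app
    }

    @AfterEach
    void cleanUp() {
        // Runs whether the test passed or failed, so a failure can't poison the next test.
        if (driver != null) {
            driver.quit();
        }
    }
}
```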