The execution of tests on CI should be as deterministic as possible, apart from potential issues coming from infrastructure or external dependencies. When a test sometimes fails and sometimes passes, apparently at random and unrelated to the changes in a PR, it has very negative implications:
- The results of the whole suite are less valuable because we can no longer trust the information given by the suite execution. Changes are added to the codebase without actually knowing if they are going to break something.
- As a consequence, we can't be completely sure about the state of the product at release time: it may or may not be ok.
- The development process is affected negatively. Developers need to wait until a lucky execution aligns all the required tests in green, which can take more than a week in some cases.
In summary, the CI system stops helping the product evolution and instead becomes an obstacle.
We have introduced a quarantine methodology to set aside the tests that don't behave deterministically until they are fixed, so that we can keep the rest of the suite as healthy as possible. You can read more about test quarantine methodology in 1 and 2, and about an actual implementation in 3.
The purpose of applying the methodology described in this document is to increase the stability of the CI suite. We consider the CI stable if the whole test suite has a failure rate below 10%.
To obtain the passing rate of the whole suite we need to multiply the passing rates of the individual tests, assuming the tests fail independently. With a failure rate of 5% per flaky test, two flaky tests already put the suite at a 9.75% failure rate (0.95 ** 2 = 0.9025 passing rate), so more than 2 flaky tests would lead to an overall test suite failure rate above the 10% goal. And, for instance, 17 flaky tests with a 5% failure rate each would lead to a terrible 41.81% passing rate of the whole suite (0.95 ** 17 = 0.4181).
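The arithmetic above can be reproduced with a short sketch. It assumes that flaky tests fail independently and all share the same failure rate; the function names are illustrative and not part of any existing tooling.

```python
def suite_pass_rate(per_test_failure_rate: float, flaky_tests: int) -> float:
    """Probability that every flaky test passes in a single suite run.

    Assumes failures are independent and equally likely for each test.
    """
    return (1.0 - per_test_failure_rate) ** flaky_tests


def max_flaky_tests(per_test_failure_rate: float, max_suite_failure_rate: float) -> int:
    """Largest number of flaky tests that keeps the suite below the failure goal."""
    if per_test_failure_rate <= 0.0:
        raise ValueError("per_test_failure_rate must be positive")
    count = 0
    # Keep adding flaky tests while the suite failure rate stays below the goal.
    while 1.0 - suite_pass_rate(per_test_failure_rate, count + 1) < max_suite_failure_rate:
        count += 1
    return count


if __name__ == "__main__":
    print(round(suite_pass_rate(0.05, 2), 4))   # 0.9025
    print(round(suite_pass_rate(0.05, 17), 4))  # 0.4181
    print(max_flaky_tests(0.05, 0.10))          # 2
```

Under these assumptions the budget is very small: at a 5% individual failure rate, the 10% goal tolerates at most two flaky tests in the stable suite.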
To remove as much as possible the influence of changes in PRs when determining the stability of the suite, we will take into account only results from the periodics that run e2e tests from main (these jobs can be checked on testgrid) and presubmits that are executed on merged code (on tide merge batches as reported by flakefinder).
We will consider test failures only in jobs where fewer than 5 tests failed, so that we don't take into account systemic failures caused, for instance, by an infrastructure problem.
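The filter described above could be sketched as follows. The `JobRun` structure and the job and test names are hypothetical; real data would come from the CI reporting tools. Runs with no failures are kept here on the assumption that they should still count toward each test's total executions.

```python
from dataclasses import dataclass, field

# Runs with this many failed tests or more are treated as systemic failures.
SYSTEMIC_FAILURE_THRESHOLD = 5


@dataclass
class JobRun:
    """One execution of a CI job (hypothetical structure, for illustration)."""
    job_name: str
    failed_tests: list[str] = field(default_factory=list)


def countable_runs(runs: list[JobRun]) -> list[JobRun]:
    """Keep only the runs where fewer than 5 tests failed."""
    return [run for run in runs if len(run.failed_tests) < SYSTEMIC_FAILURE_THRESHOLD]


def failures_per_test(runs: list[JobRun]) -> dict[str, int]:
    """Count failures per test, ignoring runs that look like systemic failures."""
    counts: dict[str, int] = {}
    for run in countable_runs(runs):
        for test_name in run.failed_tests:
            counts[test_name] = counts.get(test_name, 0) + 1
    return counts
```

For example, a run in which five tests fail at once is dropped entirely, so none of those five failures count against the individual tests.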
A test must be put in quarantine when any of these conditions is met:
- It has a failure rate higher than 5% in the last two weeks.
- It has a failure rate higher than 20% in the last 3 days.
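As a minimal sketch, the two conditions above can be evaluated over a list of past executions of a single test. The `Execution` record and the function names are assumptions made for illustration; a test with no executions in a window is treated here as having a 0% failure rate.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Execution:
    """One recorded execution of a test (hypothetical structure)."""
    finished_at: datetime
    passed: bool


def failure_rate(executions: list[Execution], now: datetime, window: timedelta) -> float:
    """Fraction of failed executions within the window that ends at `now`."""
    recent = [e for e in executions if now - window <= e.finished_at <= now]
    if not recent:
        return 0.0
    failed = sum(1 for e in recent if not e.passed)
    return failed / len(recent)


def should_quarantine(executions: list[Execution], now: datetime) -> bool:
    """True when either quarantine condition is met."""
    over_two_weeks = failure_rate(executions, now, timedelta(weeks=2)) > 0.05
    over_three_days = failure_rate(executions, now, timedelta(days=3)) > 0.20
    return over_two_weeks or over_three_days
```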
A PR will be proposed on Mondays every two weeks with a batch of the tests that meet the first condition. A PR can be proposed at any time for the tests that meet the second condition. In both cases the PR will add the text [QUARANTINE] to each test's description in the code. An email will be sent to the owners of the suspected tests.
After the PR with the quarantine candidates is proposed, there is a grace period of 2 days to prepare and land a fix for a test in the batch. If at least 5 consecutive executions with the fix pass, the test can be removed from the batch.
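The check for removing a test from the batch could look like the sketch below. It assumes the outcomes of the executions that include the fix are available in chronological order, and it takes the stricter reading that the most recent executions must form the passing streak. The names are illustrative.

```python
REQUIRED_CONSECUTIVE_PASSES = 5


def trailing_pass_streak(outcomes: list[bool]) -> int:
    """Number of consecutive passes at the end of the list (True means pass)."""
    streak = 0
    for passed in reversed(outcomes):
        if not passed:
            break
        streak += 1
    return streak


def fix_is_verified(outcomes_with_fix: list[bool]) -> bool:
    """True when the latest executions with the fix include at least 5 passes in a row."""
    return trailing_pass_streak(outcomes_with_fix) >= REQUIRED_CONSECUTIVE_PASSES
```

With this reading, a failure after five passes resets the count, so the fix has to prove itself again before the test leaves the batch.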
Each quarantined test must have a team owner. The PR will add the text [sig-{compute,network,storage,operator}] to each test's description.
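A small consistency check on test descriptions might look like the sketch below. It relies only on the two text markers named in this document; the function name and the example descriptions are hypothetical.

```python
import re
from typing import Optional

QUARANTINE_TAG = "[QUARANTINE]"
# Matches one of the owner tags: [sig-compute], [sig-network], [sig-storage], [sig-operator].
SIG_TAG = re.compile(r"\[sig-(compute|network|storage|operator)\]")


def quarantine_owner(description: str) -> Optional[str]:
    """Return the owning team of a quarantined test, or None if it is not quarantined.

    Raises ValueError when a quarantined test has no team owner tag.
    """
    if QUARANTINE_TAG not in description:
        return None
    match = SIG_TAG.search(description)
    if match is None:
        raise ValueError(f"quarantined test without team owner: {description!r}")
    return match.group(1)
```

For instance, `quarantine_owner("[QUARANTINE][sig-network] should reach the pod")` returns `"network"`, while a description that carries [QUARANTINE] but no owner tag raises an error.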
When a test marked with [release-blocker] meets the conditions to be quarantined we will:
- Create a GitHub issue with the comment /release-blocker main to ensure that the issue is addressed before a new release is cut.
- Ensure that the GitHub issue is assigned to an individual who will own bringing the blocker to completion within a quick time frame.
A member of the team assigned to each quarantined test should propose a fix for the test or, after investigating the source of the errors, determine that the test itself doesn't need changes to be fixed (the fix may need to be made in other parts of the code base or in a separate repo). In any case, the assigned team must communicate when the test is expected to be stable.
A test must be put out of quarantine when:
- It hasn't failed on any of the periodic lanes in the two weeks after the time indicated by the team assigned to bring the test back to the stable suite.
After two weeks of successful executions have passed, a quarantined test will be ready to join the stable suite again. A member of the team assigned to the quarantined test will propose a PR to remove the text [QUARANTINE] from the test description in the code. After merging this PR the test will be out of quarantine.
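The exit condition can be sketched as a check over the failure timestamps recorded on the periodic lanes. The function and parameter names are assumptions; `expected_stable_at` stands for the time indicated by the assigned team.

```python
from datetime import datetime, timedelta

OBSERVATION_PERIOD = timedelta(weeks=2)


def ready_to_leave_quarantine(
    expected_stable_at: datetime,
    periodic_failure_times: list[datetime],
    now: datetime,
) -> bool:
    """True when two weeks have passed since the indicated time without periodic failures."""
    window_end = expected_stable_at + OBSERVATION_PERIOD
    if now < window_end:
        # The two-week observation period is not over yet.
        return False
    return not any(
        expected_stable_at <= failed_at <= window_end
        for failed_at in periodic_failure_times
    )
```

Failures recorded before the indicated time are ignored, since they belong to the period in which the test was still known to be unstable.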