Check precise Windows GIX_TEST_IGNORE_ARCHIVES expectations on CI #1663

Merged

@EliahKagan EliahKagan commented Nov 9, 2024

This modifies the CI test-fixtures-windows job introduced in #1657, by making it so that, instead of checking that no more than the expected number of test cases fail, it checks that the exact test cases expected to fail are the ones that do fail. (The other check done before as part of this job, that no tests report errors, is still done.) The approach taken is:

  1. To collect actual failures, parse the XML output saved by cargo nextest to get the package name and test name of each failing test and save them, one per line, to a file.

    Getting this information is pretty straightforward in PowerShell. When iterating through testcase nodes, which are filtered by the presence of a failure subnode, $_.name gives the name of the test case, including :: qualifications within the containing package, and $_.classname gives the name of the containing package. Thus "$($_.classname) $($_.name)" identifies a test in the same way it is identified in human-readable cargo nextest output.

    The unintuitive attribute name classname is because cargo nextest XML is in the JUnit format, which originated for Java tests where test cases are methods in classes. An alternative approach is to use $_.ParentNode.name instead of $_.classname. I've verified that this works too, but I don't know of any strong reason to prefer one over another, so I went with the more compact $_.classname.

  2. To collect expected failures, use gh (the GitHub CLI) to download the text of #1358, and parse the first ```text code block with regex, extracting the list of failing tests and saving them, one per line, to a file.

    One of the tests currently listed there, gix-ref-tests::refs packed::iter::performance, is a performance test that seems not to fail on CI (anymore?), though I still find it to fail locally. If it were to start failing on CI, we would want to know about it. So, before these changes, it was not included in the count. I've carried that over to the precise matching done here: tests that appear to be performance tests due to having performance in their names (not as part of any longer \w+ word) are not regarded as expected to fail.

    It is not obvious that the general approach of downloading the information from an issue is the best one. Although I like this approach and I think it's not too cumbersome, this is admittedly the less elegant step. Other approaches, including hard-coding the expected failures in the workflow or another file, could be used if this approach is not wanted.

  3. To compare them, use git diff --no-index with various other options, printing the entire list of failures as context if they differ. The job fails if there are any differences.

    Because this is neither a required check for PR auto-merge nor a dependency of one, I think failing not only if tests unexpectedly fail but also if any unexpectedly pass is unlikely to cause any significant problems, and that knowing whenever the set of failing tests changes in any way is worthwhile.
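As a rough illustration of the three steps above, here is a minimal Python sketch. It is not the workflow's code, which is written in PowerShell and compares with `git diff --no-index`; here `difflib` stands in for the diff step. The sample XML, the second test name, and the assumption that the issue's `text` block lists one test identifier per line are all hypothetical.

```python
import difflib
import re
import xml.etree.ElementTree as ET

FENCE = "`" * 3  # built at runtime so this example contains no literal fence

# Hypothetical JUnit-style report; only the first test name comes from the PR text.
SAMPLE_XML = """\
<testsuites>
  <testsuite name="gix-ref-tests::refs">
    <testcase name="packed::iter::performance" classname="gix-ref-tests::refs">
      <failure type="test failure"/>
    </testcase>
    <testcase name="hypothetical::passing_test" classname="gix-ref-tests::refs"/>
  </testsuite>
</testsuites>
"""


def actual_failures(junit_xml: str) -> list[str]:
    """Step 1: "classname name" for every <testcase> with a <failure> child."""
    root = ET.fromstring(junit_xml)
    return sorted(
        f"{case.get('classname')} {case.get('name')}"
        for case in root.iter("testcase")
        if case.find("failure") is not None
    )


def expected_failures(issue_body: str) -> list[str]:
    """Step 2: non-blank lines of the first fenced block tagged `text`.

    Assumes one test identifier per line, which may not match the real issue.
    """
    match = re.search(rf"{FENCE}text\r?\n(.*?){FENCE}", issue_body, re.DOTALL)
    if match is None:
        # Fail loudly instead of indexing into a missing match.
        raise ValueError("no text code block found in issue body")
    return sorted(
        line.strip() for line in match.group(1).splitlines() if line.strip()
    )


def compare(expected: list[str], actual: list[str]) -> list[str]:
    """Step 3: unified diff with full context; an empty list means agreement."""
    context = max(len(expected), len(actual))
    return list(
        difflib.unified_diff(
            expected, actual, "expected", "actual", n=context, lineterm=""
        )
    )
```

A caller would treat a non-empty result from `compare` as a job failure, which covers both newly failing tests (`+` lines) and newly passing ones (`-` lines).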

I intentionally excluded performance tests late in the process, so that there would be a commit whose CI results could be inspected that would verify that the check is capable of failing when the diff is nonempty (and that the output would display in a useful way, with effective colorization, etc., in this situation). That can be observed in this run output and compared to passing runs, such as the runs in this PR.

This modifies the `test-fixtures-windows` job that tests on Windows
with `GIX_TEST_IGNORE_ARCHIVES=1` so that, instead of checking that
no more than 14 failures occur, it checks that the failing tests
are exactly those that are documented in GitoxideLabs#1358 as expected to fail.

The initial check that no tests have *error* status is preserved,
with only stylistic changes, and kept separate from the subsequent
logic so that the output is clearer.

The new steps are no longer conditional on `nextest` having exited
with a failure status, since (a) that was probably unnecessary
before and definitely unnecessary now, (b) at least for now, the
comparison is precise, so it would be strange to pass if the diff
were to have changes on *all* lines, and (c) this makes it slightly
less likely that GitoxideLabs#1358 will accidentally stay open even once fixed.

The current approach is to actually retrieve the list of tests
expected to fail on Windows with `GIX_TEST_IGNORE_ARCHIVES=1` from
the GitoxideLabs#1358 issue body. This has the advantage that it automatically
keeps up to date with changes made to that issue description, but
this is of course not the only possible approach for populating the
expected value.

Two changes should be made before this is ready:

- As noted in the "FIXME" comment, the job should currently fail
  because the performance test reported to fail in GitoxideLabs#1358 is not
  being filtered out from the expected failures list. It's left in
  as of this commit, to verify that the job is capable of failing.

  (After that, the performance test should either be filtered out
  or removed from the list in GitoxideLabs#1358, but the former approach is
  currently preferable because I have not done diverse enough
  testing to check if the failure on my main Windows system is due
  to that system being too slow rather than a performance bug.)

- The scratchwork file should be removed once no longer needed.

This is to fix the error:

    gh: To use GitHub CLI in a GitHub Actions workflow, set the GH_TOKEN environment variable. Example:
      env:
        GH_TOKEN: ${{ github.token }}
    InvalidOperation: D:\a\_temp\ba9d92b7-3a94-4f7c-b541-d19003f40e19.ps1:9
    Line |
       9 |  $expected_failures = $match_info.Matches.Groups[1].Value -split "`n"  …
         |  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         | Cannot index into a null array.
    Error: Process completed with exit code 1.

Including color in the diff of expected vs. actual failed tests
makes the output easier to see (much as color helps in `nextest`).

This commit also makes some stylistic changes to the command so it
is easier to read.

This omits tests containing `performance` (and not as part of a
larger "word", not even with `_`) from being expected to fail on CI
with `GIX_TEST_IGNORE_ARCHIVES=1` on Windows.
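The whole-word rule described above can be illustrated with a short Python sketch. This is an assumption about how such a filter could look, not the pattern used in the workflow; the second and third test names in the usage note are hypothetical. Because `_` counts as a word character for `\b`, a name such as `high_performance_mode` does not match.

```python
import re

# `\b` is a boundary between a word character (letters, digits, `_`) and
# anything else, so `performance` must stand alone as a complete word.
PERFORMANCE = re.compile(r"\bperformance\b")


def without_performance_tests(tests: list[str]) -> list[str]:
    """Drop tests whose identifier contains `performance` as a whole word."""
    return [test for test in tests if not PERFORMANCE.search(test)]
```

For example, `without_performance_tests` would drop `gix-ref-tests::refs packed::iter::performance` but keep `example-pkg::suite high_performance_mode` and `example-pkg::suite performances`.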

Currently there is one such test listed in GitoxideLabs#1358,
`gix-ref-tests::refs packed::iter::performance`.

The relevant code is in the `test-fixtures-windows` CI job and it
is working (both to fail the job when there is a mismatch and to
have the job succeed when there is agreement).

@Byron Byron left a comment


Thanks a lot for making this happen!

I am very impressed by what can be done with powershell, it clearly is .NET in a shell :).

And indeed, as predicted, reading the text block from the issue breaks isolation too much, which is why I'd love to see these extracted into a file, for example, if that's OK with you.

Thanks again.

@EliahKagan EliahKagan requested a review from Byron November 9, 2024 08:32

@Byron Byron left a comment


Wonderful, thanks so much!

@Byron Byron merged commit 1df68e4 into GitoxideLabs:main Nov 9, 2024
17 checks passed
@EliahKagan EliahKagan deleted the run-ci/test-fixtures-windows-precise branch November 9, 2024 09:34
EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Jan 22, 2025
Since GitoxideLabs#1663, the `test-fixtures-windows` CI job checks actual
failures against a list of specific tests that are known to fail on
Windows when `GIX_TEST_IGNORE_ARCHIVES=1`. It is therefore capable
of providing useful information about new failures, or newly
passing tests that should be removed from the list, if the job ever
does fail.

The job also seems not to fail. This is to say that while GitoxideLabs#1358 is
not fixed, the `test-fixtures-windows` job has a very low rate of
failure and, if it does fail, something new and interesting would
be happening such that we would want to know about it and probably
not immediately merge a PR that caused it without checking how and
why that happened.

This adds `test-fixtures-windows` to the list of jobs that are
dependencies of a required check for branch protection based PR
auto-merge.
EliahKagan added a commit to EliahKagan/gitoxide that referenced this pull request Jul 13, 2025
On CI, `test-fixtures-windows` does `GIX_TEST_IGNORE_ARCHIVES=1`
test runs on Windows, which have a number of known and expected
failures, as described in GitoxideLabs#1358. Since GitoxideLabs#1663, these are listed in
`etc/test-fixtures-windows-expected-failures-see-issue-1358.txt`
and `test-fixtures-windows` compares those expected failures to the
actual failures of a run. If there are any differences, it shows a
unified diff with full context and fails the run.

The default Git foreground color scheme has been used for the diff,
where any tests that are no longer failing are shown as `-` lines
in red, and any tests that are newly failing are shown as `+` lines
in green. These are the usual default `git diff` colors; we did not
pick those colors specifically to be meaningful to this scenario.

While the use of `-` and `+` lines and the specific unified diff
format chosen are intuitive and informative, this is less so for
the color scheme. That is for two reasons:

1. By default and unless another theme is chosen explicitly or
   selected based on device settings, the GitHub web interface
   uses a light theme. In this theme, as currently coded/styled, on
   the pale background in GitHub Actions, foreground normal-styled
   green text can be hard to distinguish visually from foreground
   normal-styled black text. So it is often not immediately clear
   which test failures are new. (New test failures are the typical
   way a `test-fixtures-windows` job fails, when it does.)

2. A test that is no longer failing would usually mean that a test
   or implementation bug has been fixed, or that compatibility has
   otherwise been improved. `test-fixtures-windows` fails on these
   so that we attend to them, usually by dropping newly passing
   tests from the list of expected failures. Likewise, a test that
   is failing, especially a newly failing test, usually if not
   always means that something is not right, even if it is merely
   the test itself that needs to be improved.

   But showing things that have gotten better in red, and things
   that have gotten worse in green, is unintuitive. This is partly
   because these colors have the opposite meaning in numerous
   contexts, including in some parts of the GitHub Actions
   interface. It is more specifically because they have the
   opposite meaning in the context of previous step output, where
   `nextest` shows stderr in red (and specifically shows stderr on
   tests that have failed), shows `FAIL` in red, and shows `PASS`
   in green.

It may be possible to make red and green show up better. These
colors are clear in the `nextest` output. However, the options for
`diff` colors, at least when configuring them with `color.diff.*`,
are limited. Furthermore, that would only improve (1), not (2).

This bolds both `-` and `+` lines, shows `-` lines in magenta, and
shows `+` lines in blue. These colors seem to be visible on both
light and dark backgrounds, and they seem to be at least as
distinguishable from each other as are red and green, in general.

This change only applies to the specific diff that shows newly
passing and newly failing tests in `test-fixtures-windows`.
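
A minimal sketch of how such a color override could be scoped to a single diff, assuming it is passed with `git -c` rather than written to any config file. The helper names, the actual-failures path in the usage note, and the context size are hypothetical; `color.diff.old` and `color.diff.new` are the Git settings for `-` and `+` lines.

```python
import subprocess


def diff_command(expected_path: str, actual_path: str, context_lines: int = 100) -> list[str]:
    """Build a `git diff --no-index` command with bold magenta/blue colors."""
    return [
        "git",
        "-c", "color.diff.old=magenta bold",  # `-` lines: tests no longer failing
        "-c", "color.diff.new=blue bold",     # `+` lines: newly failing tests
        "diff",
        "--no-index",       # compare two files outside any repository index
        "--color=always",   # keep color even when output is not a terminal
        f"--unified={context_lines}",
        "--",
        expected_path,
        actual_path,
    ]


def run_diff(expected_path: str, actual_path: str) -> bool:
    """Return True when the files agree (git exits 0), False when they differ."""
    return subprocess.run(diff_command(expected_path, actual_path)).returncode == 0
```

For example, `run_diff("etc/test-fixtures-windows-expected-failures-see-issue-1358.txt", "actual-failures.txt")` would print the colored diff and return `False` on any mismatch. Since the `-c` options apply only to that one invocation, other `git diff` output keeps its default colors.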