nexus add/remove live test fails #7822

@davepacheco

The live test we have for Nexus add/removal currently fails:

root@oxz_switch:~# TMPDIR=/var/tmp ./cargo-nextest nextest run --profile=live-tests          --archive-file live-tests-archive/omicron-live-tests.tar.zst          --workspace-remap live-tests-archive
  Extracting 1 binary, 1 build script output directory, and 3 linked paths to /var/tmp/nextest-archive-UDJMeA
   Extracted 46 files to /var/tmp/nextest-archive-UDJMeA in 1.07s
info: experimental features enabled: setup-scripts
------------
 Nextest run ID 5b49c339-0df6-49d6-ab3c-8d8ed1b7df4d with nextest profile: live-tests
    Starting 1 test across 1 binary
        SLOW [> 60.000s] omicron-live-tests::test_nexus_add_remove test_nexus_add_remove
        SLOW [>120.000s] omicron-live-tests::test_nexus_add_remove test_nexus_add_remove
        FAIL [ 122.103s] omicron-live-tests::test_nexus_add_remove test_nexus_add_remove
---- STDOUT:             omicron-live-tests::test_nexus_add_remove test_nexus_add_remove

running 1 test
test test_nexus_add_remove has been running for over 60 seconds
test test_nexus_add_remove ... FAILED

failures:

failures:
    test_nexus_add_remove

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 122.06s

---- STDERR:             omicron-live-tests::test_nexus_add_remove test_nexus_add_remove
log file: /var/tmp/test_nexus_add_remove-3ad37aa113db9b44-test_nexus_add_remove.28927.0.log
note: configured to log to "/var/tmp/test_nexus_add_remove-3ad37aa113db9b44-test_nexus_add_remove.28927.0.log"
note: using DNS server for subnet fd00:1122:3344::/48

thread 'test_nexus_add_remove' panicked at live-tests/tests/test_nexus_add_remove.rs:180:6:
called `Result::unwrap()` on an `Err` value: TimedOut(60.063460023s)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Pool dropped without invoking `terminate`

  Cancelling due to test failure
------------
     Summary [ 122.111s] 1 test run: 0 passed, 1 failed, 0 skipped
        FAIL [ 122.103s] omicron-live-tests::test_nexus_add_remove test_nexus_add_remove
error: test run failed

The test times out after 60s while waiting for a Nexus instance to recover the saga that was running on the Nexus instance the test just expunged.

The problem is that the current blueprint shows the Nexus zone as expunged but not yet ready for cleanup:

# omdb nexus blueprints show current | grep -i nexus
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:1122:3344:101::6]:12221
    oxp_ad5f9396-95d1-43cd-8109-17dbe94437f5/crypt/zone/oxz_nexus_7801b712-dbcd-476d-9aa8-5f188539a209             2e7954c2-85c6-4f08-80b7-5e16de7cfe9a   expunged      none      none          off        
    oxp_c53bb8e5-2cf4-4c0e-a943-609a824c60aa/crypt/zone/oxz_nexus_c6bc048f-bfef-40b0-9ebd-763d0714b9e0             be232f76-4156-406c-b511-95b69572f669   in service    none      none          off        
    nexus             7801b712-dbcd-476d-9aa8-5f188539a209   install dataset   expunged ⏳     fd00:1122:3344:103::21
    nexus             c6bc048f-bfef-40b0-9ebd-763d0714b9e0   install dataset   in service     fd00:1122:3344:103::5 
    oxp_d446e628-c624-4b0b-a617-627449e71681/crypt/zone/oxz_nexus_ae79633f-feee-48f2-b7ad-f14ce5a54e47             7aaa721d-c7b8-45e9-95d2-8dedd52c0f59   in service    none      none          off        
    nexus             ae79633f-feee-48f2-b7ad-f14ce5a54e47   install dataset   in service    fd00:1122:3344:101::6
    oxp_ba8d35c8-c4b0-49e5-b3bc-87dbf005e05e/crypt/zone/oxz_nexus_83bd1f6d-11db-4642-bbc2-a4a4f69755df             d5d4d9c0-6325-4a33-b532-d00a879cbbe9   in service    none      none          off        
    nexus             83bd1f6d-11db-4642-bbc2-a4a4f69755df   install dataset   in service    fd00:1122:3344:102::5

Note the ⏳ -- that means the zone is not yet ready for cleanup.
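
As a rough illustration of this check, the sketch below scans text shaped like the omdb output above and returns the IDs of Nexus zones that are expunged but still carry the ⏳ marker. The row layout is assumed from the excerpt in this issue, not from any documented format, and `zones_pending_cleanup` is a hypothetical helper, not part of omdb.

```rust
/// Returns the IDs of Nexus zones shown as expunged but not yet ready for
/// cleanup (marked with ⏳). Assumes zone rows shaped like the excerpt above:
/// `nexus <zone-id> <image source...> <disposition> [⏳] <address>`.
fn zones_pending_cleanup(omdb_output: &str) -> Vec<String> {
    omdb_output
        .lines()
        .filter_map(|line| {
            let mut fields = line.split_whitespace();
            // Only zone rows start with the bare zone kind `nexus`;
            // dataset rows start with an `oxp_...` path instead.
            if fields.next()? != "nexus" {
                return None;
            }
            let id = fields.next()?;
            let rest: Vec<&str> = fields.collect();
            // A zone is pending cleanup if it is expunged and still flagged.
            let pending = rest.contains(&"expunged") && rest.contains(&"⏳");
            pending.then(|| id.to_string())
        })
        .collect()
}
```

Fed the output above, this would report only zone `7801b712-dbcd-476d-9aa8-5f188539a209`, assuming the layout holds.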

I expect this has been broken since #7713. After that PR, this test should first wait for an inventory collection to reflect that the Nexus zone is really gone, then generate a new blueprint (which should show the zone as ready for cleanup), then make that blueprint the target, and only then wait for the saga to be recovered.
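
The ordering described above can be sketched with a small, self-contained model. Everything here is hypothetical: `FakeRack`, `wait_for`, and `run_sequence` are stand-ins written for this illustration and are not Omicron APIs, and the rule that a saga is recovered only once the target blueprint marks the zone ready for cleanup is an assumption taken from the description in this issue.

```rust
use std::time::{Duration, Instant};

/// Error from a bounded poll. `TimedOut` mirrors the shape of the error in
/// the panic message above; the type itself exists only for this sketch.
#[derive(Debug)]
enum PollError {
    TimedOut(Duration),
}

/// Calls `check` every `interval` until it returns true or `max` elapses.
fn wait_for(
    mut check: impl FnMut() -> bool,
    interval: Duration,
    max: Duration,
) -> Result<(), PollError> {
    let start = Instant::now();
    loop {
        if check() {
            return Ok(());
        }
        if start.elapsed() >= max {
            return Err(PollError::TimedOut(start.elapsed()));
        }
        std::thread::sleep(interval);
    }
}

/// Hypothetical in-memory stand-in for the system under test.
struct FakeRack {
    /// Inventory polls remaining before the expunged zone disappears.
    polls_until_zone_gone: u32,
    /// Whether the target blueprint marks the zone ready for cleanup.
    target_ready_for_cleanup: bool,
}

impl FakeRack {
    /// Simulates checking the latest inventory: true once the zone is gone.
    fn zone_gone_from_inventory(&mut self) -> bool {
        self.polls_until_zone_gone = self.polls_until_zone_gone.saturating_sub(1);
        self.polls_until_zone_gone == 0
    }

    /// Simulates planning: the new blueprint marks the zone ready for
    /// cleanup only if inventory already shows the zone gone.
    fn generate_blueprint(&self) -> bool {
        self.polls_until_zone_gone == 0
    }

    fn set_target(&mut self, ready_for_cleanup: bool) {
        self.target_ready_for_cleanup = ready_for_cleanup;
    }

    /// Assumption: the saga is recovered only once the target blueprint
    /// marks the expunged zone ready for cleanup.
    fn saga_recovered(&self) -> bool {
        self.target_ready_for_cleanup
    }
}

/// Runs the sequence; `wait_for_inventory = false` models the current test.
fn run_sequence(rack: &mut FakeRack, wait_for_inventory: bool) -> Result<(), PollError> {
    let (interval, max) = (Duration::from_millis(1), Duration::from_millis(50));
    if wait_for_inventory {
        // Step 1: wait until inventory reflects that the zone is gone.
        wait_for(|| rack.zone_gone_from_inventory(), interval, max)?;
    }
    // Steps 2 and 3: generate a new blueprint and make it the target.
    let ready = rack.generate_blueprint();
    rack.set_target(ready);
    // Step 4: wait for the saga to be recovered.
    wait_for(|| rack.saga_recovered(), interval, max)
}
```

In this model, `run_sequence(&mut rack, true)` succeeds, while skipping the inventory wait leaves the zone not ready for cleanup and the final wait ends in `TimedOut`, which is the same shape of failure as the one reported here.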
