Some signers have KES errors during key registration #2771

@jpraynaud

Why

Some SPOs have reported on Discord that the key registration of their signer nodes has been failing since the upgrade to version 0.2.273, which was released in the 2543.0 distribution. The error message logged by the aggregator is:

{
  "msg": "register_signer::failed_signer_registration",
  "v": 0,
  "name": "mithril-aggregator",
  "level": 40,
  "time": "2025-11-06T17:07:14.586128148Z",
  "hostname": "e3d0a95d9f5a",
  "pid": 1,
  "src": "http_server",
  "error": "KeyRegwrapper can not register signer with party_id: 'Some(\"pool1dzxc7pqsqfs7dru7xdrkvkdf9s3kd4y2tqsdv063d2lfcxw6zmg\")'\n\nCaused by:\n    KES signature verification error: CurrentKesPeriod=0, StartKesPeriod=823\n\nStack backtrace:\n   0: <E as anyhow::context::ext::StdError>::ext_context\n   1: <mithril_aggregator::services::signer_registration::verifier::MithrilSignerRegistrationVerifier as mithril_aggregator::services::signer_registration::api::SignerRegistrationVerifier>::verify::{{closure}}\n   2: <mithril_aggregator::services::signer_registration::leader::MithrilSignerRegistrationLeader as mithril_aggregator::services::signer_registration::api::SignerRegisterer>::register_signer::{{closure}}\n   3: <warp::filter::and_then::AndThenFuture<T,F> as core::future::future::Future>::poll\n   4: <warp::filter::or::EitherFuture<T,U> as core::future::future::Future>::poll\n   5: <warp::filter::and::AndFuture<T,U> as core::future::future::Future>::poll\n   6: <warp::filter::recover::RecoverFuture<T,F> as core::future::future::Future>::poll\n   7: <warp::filter::and::AndFuture<T,U> as core::future::future::Future>::poll\n   8: <warp::filters::cors::internal::WrappedFuture<F> as core::future::future::Future>::poll\n   9: <warp::filters::log::internal::WithLogFuture<FN,F> as core::future::future::Future>::poll\n  10: <hyper_util::service::oneshot::Oneshot<S,Req> as core::future::future::Future>::poll\n  11: hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T>::poll_catch\n  12: <hyper::server::conn::http1::UpgradeableConnection<I,S> as core::future::future::Future>::poll\n  13: <hyper_util::server::conn::auto::UpgradeableConnection<I,S,E> as core::future::future::Future>::poll\n  14: <hyper_util::server::graceful::GracefulConnectionFuture<C,F> as core::future::future::Future>::poll\n  15: <warp::server::run::Graceful<Fut> as warp::server::run::Run>::run::{{closure}}::{{closure}}\n  16: tokio::runtime::task::core::Core<T,S>::poll\n  17: tokio::runtime::task::harness::Harness<T,S>::poll\n  18: tokio::runtime::scheduler::multi_thread::worker::Context::run_task\n  19: tokio::runtime::scheduler::multi_thread::worker::Context::run\n  20: tokio::runtime::context::scoped::Scoped<T>::set\n  21: tokio::runtime::context::runtime::enter_runtime\n  22: tokio::runtime::scheduler::multi_thread::worker::run\n  23: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll\n  24: tokio::runtime::task::core::Core<T,S>::poll\n  25: tokio::runtime::task::harness::Harness<T,S>::poll\n  26: tokio::runtime::blocking::pool::Inner::run\n  27: std::sys::backtrace::__rust_begin_short_backtrace\n  28: core::ops::function::FnOnce::call_once{{vtable.shim}}\n  29: std::sys::pal::unix::thread::Thread::new::thread_start\n  30: <unknown>\n  31: __clone"
}

Analysis

We changed how the key registration is stored in PR #2739: the protocol initializer is now stored only once per epoch and reused if the node is restarted. This avoids using different keys for registration within the same epoch.
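To make the "stored only once per epoch and reused" behavior concrete, here is a minimal sketch. It is not the code from PR #2739: `InitializerStore`, `StoredInitializer`, and `get_or_create` are hypothetical names, and the key material is reduced to a plain string for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a stored protocol initializer.
#[derive(Debug, Clone, PartialEq)]
struct StoredInitializer {
    epoch: u64,
    key_material: String,
}

/// Hypothetical in-memory store keyed by epoch (the real signer uses a database).
struct InitializerStore {
    by_epoch: HashMap<u64, StoredInitializer>,
}

impl InitializerStore {
    fn new() -> Self {
        Self { by_epoch: HashMap::new() }
    }

    /// Returns the record already stored for `epoch`, or creates and stores
    /// one by calling `generate` if none exists yet.
    fn get_or_create(
        &mut self,
        epoch: u64,
        generate: impl FnOnce() -> String,
    ) -> &StoredInitializer {
        self.by_epoch.entry(epoch).or_insert_with(|| StoredInitializer {
            epoch,
            key_material: generate(),
        })
    }
}

fn main() {
    let mut store = InitializerStore::new();
    // First registration in epoch 5 creates the record.
    let first = store.get_or_create(5, || "key-a".to_string()).clone();
    // After a restart in the same epoch, the stored record is reused.
    let second = store.get_or_create(5, || "key-b".to_string()).clone();
    println!("reused: {}", first == second);
}
```

In this sketch the second call ignores its generator because a record for the epoch already exists, which is what keeps the keys stable across restarts. It is also why anything time-sensitive stored in the record, such as a KES signature, can be stale by the time it is reused.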

Here is the scenario that explains the problem:

  • The signer registers at the beginning of epoch N and stores its protocol initializer in its database
  • The signer is upgraded to version 0.2.273 more than 1 KES period after the first registration
  • The signer attempts to re-register its keys after restart and uses the record that was previously created
  • The record contains the KES signature which is valid at most 1 KES period after creation, and is thus rejected by the aggregator upon reception
  • The signer is then unable to sign until the transition to the next epoch

On mainnet and preprod, a KES period lasts 1.5 days and an epoch lasts 5 days, so the problem is most likely to occur in the last 2 days of the epoch. On preview, an epoch lasts 1 day, which is shorter than a KES period, so the problem has not been detected in the testing-preview and pre-release-preview networks.
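The timing argument can be checked with a small calculation. The sketch below assumes one slot per second and models "valid at most 1 KES period after creation" as "at most one KES period boundary crossed"; both are assumptions made for illustration, and the function names are hypothetical.

```rust
/// Slots per day, assuming one slot per second (an assumption for this sketch).
const SLOTS_PER_DAY: u64 = 86_400;

/// Index of the KES period that contains the given absolute slot.
fn kes_period_of(slot: u64, slots_per_kes_period: u64) -> u64 {
    slot / slots_per_kes_period
}

/// Returns true when a KES signature created at `created_slot` is still
/// accepted at `checked_slot` (expected to be >= `created_slot`), i.e. when
/// at most one KES period boundary lies between the two slots.
fn signature_still_accepted(
    created_slot: u64,
    checked_slot: u64,
    slots_per_kes_period: u64,
) -> bool {
    let created = kes_period_of(created_slot, slots_per_kes_period);
    let checked = kes_period_of(checked_slot, slots_per_kes_period);
    checked.saturating_sub(created) <= 1
}

fn main() {
    let kes_period = SLOTS_PER_DAY * 3 / 2; // 1.5 days
    let mainnet_epoch = SLOTS_PER_DAY * 5; // 5 days
    let epoch_start = 10 * mainnet_epoch; // arbitrary epoch for the example

    // Restart half a day after the first registration.
    let early = epoch_start + SLOTS_PER_DAY / 2;
    // Restart 3.5 days after the first registration.
    let late = epoch_start + SLOTS_PER_DAY * 7 / 2;

    println!("early restart accepted: {}", signature_still_accepted(epoch_start, early, kes_period));
    println!("late restart accepted: {}", signature_still_accepted(epoch_start, late, kes_period));
}
```

With a 1-day epoch, two slots of the same epoch are less than one KES period apart, so they can never be separated by more than one boundary, which is consistent with the problem not showing up on preview.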

The aggregator appears to be too restrictive when it verifies a signer registration: it requires the KES period to be exactly the one the signer would use if it signed at that instant, whereas it should accept every KES period that can occur during the current epoch.
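The difference between the two checks can be sketched as follows. This is not the aggregator's verifier: `strict_accepts`, `relaxed_accepts`, and `kes_periods_in_epoch` are hypothetical names, and the slot values assume one slot per second with the durations quoted above.

```rust
use std::ops::RangeInclusive;

/// KES periods overlapped by an epoch that starts at `epoch_first_slot`
/// and lasts `epoch_length` slots (`epoch_length` must be non-zero).
fn kes_periods_in_epoch(
    epoch_first_slot: u64,
    epoch_length: u64,
    slots_per_kes_period: u64,
) -> RangeInclusive<u64> {
    let epoch_last_slot = epoch_first_slot + epoch_length - 1;
    (epoch_first_slot / slots_per_kes_period)..=(epoch_last_slot / slots_per_kes_period)
}

/// Strict check: only the KES period of the current instant is accepted.
fn strict_accepts(signature_kes_period: u64, current_kes_period: u64) -> bool {
    signature_kes_period == current_kes_period
}

/// Relaxed check: any KES period that occurs during the current epoch is accepted.
fn relaxed_accepts(signature_kes_period: u64, epoch_periods: &RangeInclusive<u64>) -> bool {
    epoch_periods.contains(&signature_kes_period)
}

fn main() {
    let slots_per_kes_period = 129_600; // 1.5 days at one slot per second (assumed)
    let epoch_length = 432_000; // 5 days at one slot per second (assumed)
    let epoch_first_slot = 10 * epoch_length; // arbitrary epoch for the example

    let epoch_periods = kes_periods_in_epoch(epoch_first_slot, epoch_length, slots_per_kes_period);
    let signature_period = *epoch_periods.start(); // signed at the start of the epoch
    let current_period = signature_period + 2; // verified two KES periods later

    println!("strict: {}", strict_accepts(signature_period, current_period));
    println!("relaxed: {}", relaxed_accepts(signature_period, &epoch_periods));
}
```

The relaxed variant still rejects KES periods that lie entirely outside the current epoch, so it only widens the window to what a signer could legitimately have used since the epoch began.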

What

Investigate and fix the source of the errors during signer registration.

How

  • Investigate the source of the problem
  • Provide a workaround for signers affected by the problem until a fix is released: roll back to the previous signer version (0.2.268, released with distribution 2537.0)
  • Create a fix
  • Create a hotfix release 2543.1-hotfix
    • Pre-release
    • Release
  • Update dev blog post of 2543 distribution
