## Why
Some SPOs have reported on Discord that the key registration of their signer nodes was failing after upgrading to version `0.2.273`, released in the `2543.0` distribution. The error message reported by the aggregator is:
```json
{
  "msg": "register_signer::failed_signer_registration",
  "v": 0,
  "name": "mithril-aggregator",
  "level": 40,
  "time": "2025-11-06T17:07:14.586128148Z",
  "hostname": "e3d0a95d9f5a",
  "pid": 1,
  "src": "http_server",
  "error": "KeyRegwrapper can not register signer with party_id: 'Some(\"pool1dzxc7pqsqfs7dru7xdrkvkdf9s3kd4y2tqsdv063d2lfcxw6zmg\")'\n\nCaused by:\n KES signature verification error: CurrentKesPeriod=0, StartKesPeriod=823\n\nStack backtrace:\n 0: <E as anyhow::context::ext::StdError>::ext_context\n 1: <mithril_aggregator::services::signer_registration::verifier::MithrilSignerRegistrationVerifier as mithril_aggregator::services::signer_registration::api::SignerRegistrationVerifier>::verify::{{closure}}\n 2: <mithril_aggregator::services::signer_registration::leader::MithrilSignerRegistrationLeader as mithril_aggregator::services::signer_registration::api::SignerRegisterer>::register_signer::{{closure}}\n 3: <warp::filter::and_then::AndThenFuture<T,F> as core::future::future::Future>::poll\n 4: <warp::filter::or::EitherFuture<T,U> as core::future::future::Future>::poll\n 5: <warp::filter::and::AndFuture<T,U> as core::future::future::Future>::poll\n 6: <warp::filter::recover::RecoverFuture<T,F> as core::future::future::Future>::poll\n 7: <warp::filter::and::AndFuture<T,U> as core::future::future::Future>::poll\n 8: <warp::filters::cors::internal::WrappedFuture<F> as core::future::future::Future>::poll\n 9: <warp::filters::log::internal::WithLogFuture<FN,F> as core::future::future::Future>::poll\n 10: <hyper_util::service::oneshot::Oneshot<S,Req> as core::future::future::Future>::poll\n 11: hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T>::poll_catch\n 12: <hyper::server::conn::http1::UpgradeableConnection<I,S> as core::future::future::Future>::poll\n 13: <hyper_util::server::conn::auto::UpgradeableConnection<I,S,E> as core::future::future::Future>::poll\n 14: <hyper_util::server::graceful::GracefulConnectionFuture<C,F> as core::future::future::Future>::poll\n 15: <warp::server::run::Graceful<Fut> as warp::server::run::Run>::run::{{closure}}::{{closure}}\n 16: tokio::runtime::task::core::Core<T,S>::poll\n 17: tokio::runtime::task::harness::Harness<T,S>::poll\n 18: tokio::runtime::scheduler::multi_thread::worker::Context::run_task\n 19: tokio::runtime::scheduler::multi_thread::worker::Context::run\n 20: tokio::runtime::context::scoped::Scoped<T>::set\n 21: tokio::runtime::context::runtime::enter_runtime\n 22: tokio::runtime::scheduler::multi_thread::worker::run\n 23: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll\n 24: tokio::runtime::task::core::Core<T,S>::poll\n 25: tokio::runtime::task::harness::Harness<T,S>::poll\n 26: tokio::runtime::blocking::pool::Inner::run\n 27: std::sys::backtrace::__rust_begin_short_backtrace\n 28: core::ops::function::FnOnce::call_once{{vtable.shim}}\n 29: std::sys::pal::unix::thread::Thread::new::thread_start\n 30: <unknown>\n 31: __clone"
}
```
## Analysis
We updated the way the key registration is stored in PR #2739: the protocol initializer is now stored only once per epoch and reused if the node is restarted. This avoids using different keys for registrations within the same epoch.
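The "store once per epoch" behaviour can be sketched as follows. This is an illustrative model, not the actual `mithril-signer` code: all type and field names (`ProtocolInitializer`, `kes_period_at_creation`, `Store`) are hypothetical. It shows why the KES period baked into the stored record survives a restart:

```rust
use std::collections::HashMap;

type Epoch = u64;

// Hypothetical stand-in for the persisted protocol initializer record.
// The embedded KES signature is bound to the KES period at creation time.
#[derive(Clone, PartialEq, Debug)]
struct ProtocolInitializer {
    kes_period_at_creation: u64,
}

struct Store {
    by_epoch: HashMap<Epoch, ProtocolInitializer>,
}

impl Store {
    // Reuse the record for the epoch if it exists, create it otherwise.
    fn get_or_create(&mut self, epoch: Epoch, current_kes_period: u64) -> ProtocolInitializer {
        self.by_epoch
            .entry(epoch)
            .or_insert_with(|| ProtocolInitializer {
                kes_period_at_creation: current_kes_period,
            })
            .clone()
    }
}

fn main() {
    let mut store = Store { by_epoch: HashMap::new() };
    // First registration at the beginning of the epoch, KES period 823.
    let first = store.get_or_create(500, 823);
    // After a restart later in the same epoch (KES period is now 824),
    // the old record and its stale KES period are reused, not re-created.
    let reused = store.get_or_create(500, 824);
    assert_eq!(first, reused);
    assert_eq!(reused.kes_period_at_creation, 823);
    println!("reused record from KES period {}", reused.kes_period_at_creation);
}
```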
Here is the scenario that explains the problem:
- The signer registers at the beginning of epoch N and stores its protocol initializer in its database
- The signer is upgraded to version `0.2.273` more than 1 KES period after the first registration
- The signer attempts to re-register its keys after the restart and uses the record that was previously created
- The record contains a KES signature that is valid for at most 1 KES period after creation, so it is rejected by the aggregator upon reception
- The signer is then unable to sign until the transition to the next epoch
On mainnet and preprod, a KES period lasts 1.5 days while an epoch lasts 5 days, so the problem is most likely to occur in the last 2 days of the epoch. On preview, an epoch lasts only 1 day, so a KES period is longer than an epoch; this is why the problem was not detected on the testing-preview and pre-release-preview networks.
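A back-of-the-envelope check of these durations, assuming mainnet-like parameters (1 slot = 1 second; the slot counts below are assumptions consistent with the 1.5-day and 5-day figures above, not taken from the mithril code):

```rust
fn main() {
    let slots_per_kes_period: u64 = 129_600; // 1.5 days at 1 slot/second
    let slots_per_epoch: u64 = 432_000; // 5 days (mainnet/preprod)
    assert_eq!(slots_per_kes_period, 3 * 86_400 / 2);
    assert_eq!(slots_per_epoch, 5 * 86_400);

    // Roughly 3.33 KES periods fit in one mainnet epoch, so a registration
    // made early in the epoch is very likely to be more than one KES period
    // old before the epoch ends.
    let periods_per_epoch = slots_per_epoch as f64 / slots_per_kes_period as f64;
    assert!(periods_per_epoch > 3.0 && periods_per_epoch < 3.5);

    // On preview the epoch (1 day) is shorter than a KES period, so the
    // reused signature never crosses a period boundary within the epoch.
    let preview_epoch_slots: u64 = 86_400;
    assert!(preview_epoch_slots < slots_per_kes_period);

    println!("{periods_per_epoch:.2} KES periods per mainnet epoch");
}
```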
It appears that the aggregator is too restrictive in the verification it performs when receiving a signer registration: it requires the KES period to be exactly the one the signer would use if it signed at this instant, whereas it should accept any KES period that could occur during the current epoch.
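The difference between the strict check and an epoch-tolerant one can be sketched like this. Both functions are illustrative (not the actual `MithrilSignerRegistrationVerifier` logic), using the KES period numbers from the scenario above:

```rust
// Strict check, as the aggregator currently behaves: the KES period of the
// signature must match the period the signer would use right now.
fn strict_check(current_period: u64, signed_period: u64) -> bool {
    signed_period == current_period
}

// Epoch-tolerant check, as suggested by the analysis: accept any KES period
// that can occur during the current epoch, up to the current period.
fn tolerant_check(epoch_first_period: u64, current_period: u64, signed_period: u64) -> bool {
    (epoch_first_period..=current_period).contains(&signed_period)
}

fn main() {
    // The epoch starts in KES period 822; the signer registered in 823 and
    // re-registers after a restart once the current period has reached 824.
    assert!(!strict_check(824, 823)); // rejected today
    assert!(tolerant_check(822, 824, 823)); // accepted with tolerance
    assert!(!tolerant_check(822, 824, 821)); // periods before the epoch still rejected
}
```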
## What
Investigate and fix the source of the errors during signer registration.
## How
- Investigate the source of the problem
- Create a workaround for signers hitting the problem until a fix is released: roll back to the previous signer version (`0.2.268`, released with distribution `2537.0`)
- Create a fix
- Create a hotfix release `2543.1-hotfix`
- Pre-release
- Release
- Update the dev blog post of the `2543` distribution