-
Notifications
You must be signed in to change notification settings - Fork 470
Description
Bug report
When running an rclcpp_action::Client<T>
in using spin_some()
and a rclcpp::Rate
, if the server you connect to publishes feedback at a faster rate than your rclcpp::Rate
, your client waits for the goal to be accepted forever, effectively becoming deadlocked.
Required Info:
- Operating System:
- Ubuntu 20.04
- Installation type:
- Binaries
- Version or commit hash:
- Galactic
- DDS implementation:
- Cyclone DDS
- Client library (if applicable):
- rclcpp
Note: based on my understanding of the root cause (see Additional information
), I believe this bug exists in rolling too, but I've only tested on Galactic as that's what I have access to at the moment.
Steps to reproduce issue
The below server and client reproduce 100% of the time on my machine when running over loopback. Compilation settings shouldn't matter. Once built, they can be run with ros2 run ...
or directly with ./path/to/binary
. I'm using action_tutorials_interfaces/action/Fibonacci
, but this should repro with any action interface.
Server
#include <action_tutorials_interfaces/action/fibonacci.hpp>
#include <rclcpp/rclcpp.hpp>
#include <rclcpp_action/rclcpp_action.hpp>
int main(int argc, char *argv[])
{
rclcpp::init(argc, argv);
auto node = rclcpp::Node::make_shared("fast_feedback_server");
using Action = action_tutorials_interfaces::action::Fibonacci;
std::shared_ptr<rclcpp_action::ServerGoalHandle<Action>> goal_handle = nullptr;
auto server = rclcpp_action::create_server<Action>(
node, "fast_feedback",
[&](const rclcpp_action::GoalUUID &, std::shared_ptr<const Action::Goal>)
{ return rclcpp_action::GoalResponse::ACCEPT_AND_EXECUTE; },
[&](const std::shared_ptr<rclcpp_action::ServerGoalHandle<Action>>)
{
RCLCPP_INFO(node->get_logger(), "Cancellation requested, accepting.");
goal_handle = nullptr;
return rclcpp_action::CancelResponse::ACCEPT;
},
[&](const std::shared_ptr<rclcpp_action::ServerGoalHandle<Action>> gh)
{
RCLCPP_INFO(node->get_logger(), "Goal was accepted");
goal_handle = gh;
});
rclcpp::Rate rate(100.0); // 100 Hz
rclcpp::executors::SingleThreadedExecutor executor;
executor.add_node(node);
RCLCPP_INFO(node->get_logger(), "Server spinning, ready to accept goals.");
while (rclcpp::ok())
{
executor.spin_some();
// NOTE: this server intentionally does not send a result
if (goal_handle)
{
goal_handle->publish_feedback(std::make_unique<Action::Feedback>());
}
rate.sleep();
}
return 0;
}
Notes:
- The server spins at 100 Hz.
- Feedback is published at the same rate.
- The goal is never completed as it's not relevant to this bug.
Client
#include <action_tutorials_interfaces/action/fibonacci.hpp>
#include <rclcpp/rclcpp.hpp>
#include <rclcpp_action/rclcpp_action.hpp>
int main(int argc, char *argv[])
{
rclcpp::init(argc, argv);
auto node = rclcpp::Node::make_shared("fast_feedback_client");
using Action = action_tutorials_interfaces::action::Fibonacci;
using Client = rclcpp_action::Client<Action>;
auto client = rclcpp_action::create_client<Action>(node, "fast_feedback");
if (!client->wait_for_action_server())
{
RCLCPP_ERROR(node->get_logger(), "Action server not available after waiting");
return 1;
}
// NOTE: no feedback_callback is set
Client::SendGoalOptions send_goal_options;
bool goal_response = false;
send_goal_options.goal_response_callback = [&](std::shared_future<Client::GoalHandle::SharedPtr>)
{
RCLCPP_INFO(node->get_logger(), "Goal response.");
goal_response = true;
};
RCLCPP_INFO(node->get_logger(), "Sending goal.");
auto future = client->async_send_goal(Action::Goal(), send_goal_options);
rclcpp::Rate rate(10.0); // 10 Hz, 10% of the server's rate
rclcpp::executors::SingleThreadedExecutor executor;
executor.add_node(node);
while (rclcpp::ok() && !goal_response)
{
RCLCPP_INFO_THROTTLE(node->get_logger(), *node->get_clock(), 10000, "Waiting for goal response.");
executor.spin_some();
rate.sleep();
}
RCLCPP_INFO_STREAM(node->get_logger(), "Goal response received: " << std::boolalpha << goal_response);
return 0;
}
Notes
- The client's rate in this example is 10% of the server's rate to make the repro foolproof for different machines/network conditions, but I've seen the bug reproduce reliably when the client is >90% of the server's rate, and even repro most of the time when they have the same rate.
- No feedback callback is set on
SendGoalOptions
as the deadlock is not dependent on one. - The client exits once the goal is accepted since that's all that's relevant to this bug.
Expected behavior
The action client will accept the goal, perhaps after working through a short backlog of feedback up to the queue depth in the feedback subscription's QoS.
Actual behavior
The server has accepted and started working on the goal but the client never sees that. The client continuously logs Waiting for goal response.
in the repro and debug logs have Received feedback for unknown goal. Ignoring...
. Debug logs show Client in wait set is ready
, which I presume to be the goal response, but it's never taken, so the client deadlocks.
Additional information
I did some digging and think I have tracked down the root cause, some other contributing factors, and some related impacts.
rclcpp_action::Client<T>
derives from rclcpp_action::ClientBase
derives from rclcpp::Waitable
, and these are scheduled for work in an rclcpp::Executor
as a single executable, not multiple executables of their constituent parts. When using rclcpp::Executor::spin_some
, which only collects work from ready entities once per call, the rclcpp::Waitable
is scheduled once, regardless of the amount of work it has ready. This should be fine, but the rclcpp::ClientBase::take_data
implementation only yields a single executable per call and therefore the subsequent execute
invoked by rclcpp::Executor
only performs one thing on the client rather than everything that's ready.
rclcpp/rclcpp_action/src/client.cpp
Lines 550 to 687 in 5e14a28
std::shared_ptr<void> | |
ClientBase::take_data() | |
{ | |
if (pimpl_->is_feedback_ready) { | |
std::shared_ptr<void> feedback_message = this->create_feedback_message(); | |
rcl_ret_t ret = rcl_action_take_feedback( | |
pimpl_->client_handle.get(), feedback_message.get()); | |
return std::static_pointer_cast<void>( | |
std::make_shared<std::tuple<rcl_ret_t, std::shared_ptr<void>>>( | |
ret, feedback_message)); | |
} else if (pimpl_->is_status_ready) { | |
std::shared_ptr<void> status_message = this->create_status_message(); | |
rcl_ret_t ret = rcl_action_take_status( | |
pimpl_->client_handle.get(), status_message.get()); | |
return std::static_pointer_cast<void>( | |
std::make_shared<std::tuple<rcl_ret_t, std::shared_ptr<void>>>( | |
ret, status_message)); | |
} else if (pimpl_->is_goal_response_ready) { | |
rmw_request_id_t response_header; | |
std::shared_ptr<void> goal_response = this->create_goal_response(); | |
rcl_ret_t ret = rcl_action_take_goal_response( | |
pimpl_->client_handle.get(), &response_header, goal_response.get()); | |
return std::static_pointer_cast<void>( | |
std::make_shared<std::tuple<rcl_ret_t, rmw_request_id_t, std::shared_ptr<void>>>( | |
ret, response_header, goal_response)); | |
} else if (pimpl_->is_result_response_ready) { | |
rmw_request_id_t response_header; | |
std::shared_ptr<void> result_response = this->create_result_response(); | |
rcl_ret_t ret = rcl_action_take_result_response( | |
pimpl_->client_handle.get(), &response_header, result_response.get()); | |
return std::static_pointer_cast<void>( | |
std::make_shared<std::tuple<rcl_ret_t, rmw_request_id_t, std::shared_ptr<void>>>( | |
ret, response_header, result_response)); | |
} else if (pimpl_->is_cancel_response_ready) { | |
rmw_request_id_t response_header; | |
std::shared_ptr<void> cancel_response = this->create_cancel_response(); | |
rcl_ret_t ret = rcl_action_take_cancel_response( | |
pimpl_->client_handle.get(), &response_header, cancel_response.get()); | |
return std::static_pointer_cast<void>( | |
std::make_shared<std::tuple<rcl_ret_t, rmw_request_id_t, std::shared_ptr<void>>>( | |
ret, response_header, cancel_response)); | |
} else { | |
throw std::runtime_error("Taking data from action client but nothing is ready"); | |
} | |
} | |
std::shared_ptr<void> | |
ClientBase::take_data_by_entity_id(size_t id) | |
{ | |
// Mark as ready the entity from which we want to take data | |
switch (static_cast<EntityType>(id)) { | |
case EntityType::GoalClient: | |
pimpl_->is_goal_response_ready = true; | |
break; | |
case EntityType::ResultClient: | |
pimpl_->is_result_response_ready = true; | |
break; | |
case EntityType::CancelClient: | |
pimpl_->is_cancel_response_ready = true; | |
break; | |
case EntityType::FeedbackSubscription: | |
pimpl_->is_feedback_ready = true; | |
break; | |
case EntityType::StatusSubscription: | |
pimpl_->is_status_ready = true; | |
break; | |
} | |
return take_data(); | |
} | |
void | |
ClientBase::execute(std::shared_ptr<void> & data) | |
{ | |
if (!data) { | |
throw std::runtime_error("'data' is empty"); | |
} | |
if (pimpl_->is_feedback_ready) { | |
auto shared_ptr = std::static_pointer_cast<std::tuple<rcl_ret_t, std::shared_ptr<void>>>(data); | |
auto ret = std::get<0>(*shared_ptr); | |
pimpl_->is_feedback_ready = false; | |
if (RCL_RET_OK == ret) { | |
auto feedback_message = std::get<1>(*shared_ptr); | |
this->handle_feedback_message(feedback_message); | |
} else if (RCL_RET_ACTION_CLIENT_TAKE_FAILED != ret) { | |
rclcpp::exceptions::throw_from_rcl_error(ret, "error taking feedback"); | |
} | |
} else if (pimpl_->is_status_ready) { | |
auto shared_ptr = std::static_pointer_cast<std::tuple<rcl_ret_t, std::shared_ptr<void>>>(data); | |
auto ret = std::get<0>(*shared_ptr); | |
pimpl_->is_status_ready = false; | |
if (RCL_RET_OK == ret) { | |
auto status_message = std::get<1>(*shared_ptr); | |
this->handle_status_message(status_message); | |
} else if (RCL_RET_ACTION_CLIENT_TAKE_FAILED != ret) { | |
rclcpp::exceptions::throw_from_rcl_error(ret, "error taking status"); | |
} | |
} else if (pimpl_->is_goal_response_ready) { | |
auto shared_ptr = std::static_pointer_cast< | |
std::tuple<rcl_ret_t, rmw_request_id_t, std::shared_ptr<void>>>(data); | |
auto ret = std::get<0>(*shared_ptr); | |
pimpl_->is_goal_response_ready = false; | |
if (RCL_RET_OK == ret) { | |
auto response_header = std::get<1>(*shared_ptr); | |
auto goal_response = std::get<2>(*shared_ptr); | |
this->handle_goal_response(response_header, goal_response); | |
} else if (RCL_RET_ACTION_CLIENT_TAKE_FAILED != ret) { | |
rclcpp::exceptions::throw_from_rcl_error(ret, "error taking goal response"); | |
} | |
} else if (pimpl_->is_result_response_ready) { | |
auto shared_ptr = std::static_pointer_cast< | |
std::tuple<rcl_ret_t, rmw_request_id_t, std::shared_ptr<void>>>(data); | |
auto ret = std::get<0>(*shared_ptr); | |
pimpl_->is_result_response_ready = false; | |
if (RCL_RET_OK == ret) { | |
auto response_header = std::get<1>(*shared_ptr); | |
auto result_response = std::get<2>(*shared_ptr); | |
this->handle_result_response(response_header, result_response); | |
} else if (RCL_RET_ACTION_CLIENT_TAKE_FAILED != ret) { | |
rclcpp::exceptions::throw_from_rcl_error(ret, "error taking result response"); | |
} | |
} else if (pimpl_->is_cancel_response_ready) { | |
auto shared_ptr = std::static_pointer_cast< | |
std::tuple<rcl_ret_t, rmw_request_id_t, std::shared_ptr<void>>>(data); | |
auto ret = std::get<0>(*shared_ptr); | |
pimpl_->is_cancel_response_ready = false; | |
if (RCL_RET_OK == ret) { | |
auto response_header = std::get<1>(*shared_ptr); | |
auto cancel_response = std::get<2>(*shared_ptr); | |
this->handle_cancel_response(response_header, cancel_response); | |
} else if (RCL_RET_ACTION_CLIENT_TAKE_FAILED != ret) { | |
rclcpp::exceptions::throw_from_rcl_error(ret, "error taking cancel response"); | |
} | |
} else { | |
throw std::runtime_error("Executing action client but nothing is ready"); | |
} | |
} |
In most scenarios, this is fine, but it deadlocks in this specific scenario: a server publishing feedback faster than the client is spinning (specifically, faster than it calls rclcpp::Executor::wait_for_work
). This can happen with any of the spin*
implementations but happens most readily with spin_some
. Why? Because rclcpp::ClientBase::take_data
only returns the first ready executable and it prioritizes feedback over responses. The implementation of rclcpp::ClientBase::take_data
and execute
take and execute in this order: feedback, status, goal response, result response, cancel response.
When you combine the implementation of rclcpp::Waitable
that only returns execution data representing a single unit of work, the implicit prioritization mechanism built into ClientBase
, and an action server whose feedback topic publishes faster than the spin rate of the client, you get a client that always has ready feedback for an unknown goal, ironically never executing the goal response that feedback is for.
Some ancillary effects of this: the goal response is never taken, the client never creates the goal handle, and therefore cancellation isn't easy. Cancellation can be done through rclcpp_action::Client<T>::async_cancel_all_goals
but that may have adverse side effects depending on your scenario. Additionally, the client never becomes "result aware" for the goal since that occurs when the goal response is processed, so result responses are permanently lost if they occur during the deadlock. Feedback is also always dropped because it's for an unknown goal.