This repository was archived by the owner on May 31, 2025. It is now read-only.

tf_remap very CPU-intensive #175

@peci1

Description

I was wondering what's eating my CPU when replaying bagfiles, and I found out it's tf_remap. In the bagfile, I have TF messages at about 1-2 kHz, and the code I'm running spawns like 10 TF listeners. This is enough to saturate 2 CPU cores on my machine just by running the tf_remap node.

I verified the remapping itself is not the bottleneck - the bottleneck really is rospy.Publisher.publish(), which probably cannot keep up publishing at such a high rate from Python to that many subscribers.
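To illustrate why the remapping itself is cheap, here is a minimal pure-Python sketch of the per-message work a remapper does. It uses mock dicts instead of real TFMessage objects and my own function names, not the node's actual code; the mapping format mirrors the node's _mappings parameter:

```python
# Sketch of the frame-ID rewriting a TF remapper performs.
# Pure Python, no ROS dependencies; transforms are mocked as dicts.

def build_mapping(mappings):
    """Turn [{"old": "b", "new": "c"}, ...] into a lookup dict {"b": "c"}."""
    return {m["old"]: m["new"] for m in mappings}

def remap_transforms(transforms, mapping):
    """Rewrite frame_id/child_frame_id of each transform per the mapping."""
    out = []
    for t in transforms:
        t = dict(t)  # copy; a real node could mutate the incoming message
        t["frame_id"] = mapping.get(t["frame_id"], t["frame_id"])
        t["child_frame_id"] = mapping.get(t["child_frame_id"], t["child_frame_id"])
        out.append(t)
    return out

mapping = build_mapping([{"old": "b", "new": "c"}])
msgs = [{"frame_id": "a", "child_frame_id": "b"}]
print(remap_transforms(msgs, mapping))
# Per message this is just two dict lookups - the expensive part is
# serializing and writing the republished message to every subscriber.
```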

So I've written an experimental C++ tf_remap, and the CPU load is much lower (about 30-40% of one CPU core). This package also solves a few more pain points of tf_remap, like the inability to work with tf_static, or the inability to remove TF frames.
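Frame removal, which the Python node cannot do, amounts to dropping any transform whose frame maps to an empty name. A hypothetical sketch of that semantics (my own naming and convention, not necessarily what the C++ node implements):

```python
def remap_or_drop(transforms, mapping):
    """Apply mapping; an empty "new" name removes transforms touching that frame."""
    out = []
    for t in transforms:
        new_parent = mapping.get(t["frame_id"], t["frame_id"])
        new_child = mapping.get(t["child_frame_id"], t["child_frame_id"])
        if new_parent == "" or new_child == "":
            continue  # frame mapped to the empty name => drop this transform
        out.append({"frame_id": new_parent, "child_frame_id": new_child})
    return out

mapping = {"b": ""}  # remove frame "b" entirely
msgs = [{"frame_id": "a", "child_frame_id": "b"},
        {"frame_id": "a", "child_frame_id": "d"}]
print(remap_or_drop(msgs, mapping))
```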

Now the question is: do we want to merge this package into geometry2 as a tool (not into geometry, since it's TF2-native), or should I release it as a standalone package instead? I assume completely replacing the current Python node with the C++ one is not an option, for many reasons...

A follow-up once the previous question is settled: should tf/tf_remap carry a warning that it is known to be inefficient, with a link to the C++ node for users who observe performance problems?


Here's a script that easily reproduces the high-load situation I'm describing (note that tf/static_transform_publisher does not use /tf_static; it periodically publishes to /tf, in this case at 1 kHz):

#!/bin/bash

# Kill the whole process group (roscore, publishers, listeners) on exit.
trap "trap - SIGTERM && kill -- -$$" SIGINT SIGTERM EXIT

# Start a roscore on the port extracted from ROS_MASTER_URI.
roscore -p $(echo $ROS_MASTER_URI | cut -d':' -f3) &

sleep 5

# Spawn 10 TF listeners so tf_remap has many subscribers.
for i in `seq 1 10`; do
	rosrun tf tf_monitor 2>/dev/null 1>/dev/null &
done

# Publish two transforms to /tf_old with a 1 ms period, i.e. at 1 kHz each.
rosrun tf static_transform_publisher 0 0 0 0 0 0 a b 1 /tf:=/tf_old &
rosrun tf static_transform_publisher 0 0 0 0 0 0 d e 1 /tf:=/tf_old &

# Remap frame "b" to "c" while forwarding /tf_old to /tf.
rosrun tf tf_remap _mappings:='[{"old": "b", "new": "c"}]' &

# Watch the CPU usage of the tf_remap process.
htop -p $(ps aux | grep tf_remap | grep -v grep | sed 's/\s\+/ /g' | cut -d' ' -f2)


For completeness, I attach profiler outputs from profiling the original Python node and the C++ node: profilers.zip . In Python, thread.lock.acquire accounts for almost all the self time, followed by write_data at 1.5%. In C++, pthread_mutex_lock is also the most time-consuming call, but it takes only 17% of the self time.
