
Producer/consumer connect corrupts libuv async queue #161

@trevorr

Description

We recently switched from KafkaJS to this library, which has mostly been great, but we've been hitting an intermittent hang. I don't have a shareable reproduction yet, but I can explain where and why the issue arises. It affects every OS and Node version.

The regression was introduced in 34e4bd3, which moved the ActivateDispatchers calls from HandleOKCallback to Execute in ProducerConnect (here) and KafkaConsumerConnect (here). ActivateDispatchers ultimately calls uv_async_init, which is not thread-safe and must be called from the event loop thread. HandleOKCallback and HandleErrorCallback run on that thread, but Execute runs on a worker thread.

To demonstrate the issue, I added an assertion that dumps a stack trace whenever uv_async_init is called off the event loop thread. It ends up failing here:

0   confluent-kafka-javascript.node     0x000000011a0186f8 _ZN9NodeKafka18assert_main_threadEv + 228
1   confluent-kafka-javascript.node     0x000000011a0047a4 _ZN9NodeKafka9Callbacks10Dispatcher8ActivateEv + 64
2   confluent-kafka-javascript.node     0x000000011a029180 _ZN9NodeKafka4Conf6listenEv + 64
3   confluent-kafka-javascript.node     0x000000011a02e228 _ZN9NodeKafka13KafkaConsumer19ActivateDispatchersEv + 32
4   confluent-kafka-javascript.node     0x000000011a040010 _ZN9NodeKafka7Workers20KafkaConsumerConnect7ExecuteEv + 40
5   confluent-kafka-javascript.node     0x000000011a02c8f8 _ZN3Nan12AsyncExecuteEP9uv_work_s + 44
6   node                                0x00000001033616d4 worker + 248
7   libsystem_pthread.dylib             0x00000001867372e4 _pthread_start + 136
8   libsystem_pthread.dylib             0x00000001867320fc thread_start + 8
Assertion failed: (false), function assert_main_thread, file common.cc, line 46.

I believe those ActivateDispatchers calls should be moved back to HandleOKCallback. If that runs too late for the scenario mentioned in the comment, they could instead be moved to the worker constructor, which also runs on the event loop thread. I'm happy to submit a PR for either approach, but wanted to start the discussion first.

Background

When I attached a debugger to our hung Node process, it was spinning in an infinite loop here in libuv (uv__async_io in async.c):

  while (!uv__queue_empty(&queue)) {
    q = uv__queue_head(&queue);
    h = uv__queue_data(q, uv_async_t, queue);

    uv__queue_remove(q);
    uv__queue_insert_tail(&loop->async_handles, q);

    /* Atomically fetch and clear pending flag */
    pending = (_Atomic int*) &h->pending;
    if (atomic_exchange(pending, 0) == 0)
      continue;

That loop is supposed to move all current async handles to a temporary queue, then put them back on the async handles queue one at a time, invoking each handle's callback if one is defined and its pending flag was set. Instead, it ends up with the same handle every time:

* thread #1, queue = 'com.apple.main-thread', stop reason = step over
    frame #0: 0x0000000105154cc8 node`uv__async_io(loop=0x00000001087a0c90, w=<unavailable>, events=<unavailable>) at async.c:165:5 [opt]
   162 	    q = uv__queue_head(&queue);
   163 	    h = uv__queue_data(q, uv_async_t, queue);
   164 	
-> 165 	    uv__queue_remove(q);
   166 	    uv__queue_insert_tail(&loop->async_handles, q);
   167 	
   168 	    /* Atomically fetch and clear pending flag */
(llnode) p queue
(uv__queue) {
  next = 0x0000000134807748
  prev = 0x000000012381f790
}
(llnode) n
Process 22379 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = step over
    frame #0: 0x0000000105154cd8 node`uv__async_io(loop=0x00000001087a0c90, w=<unavailable>, events=<unavailable>) at async.c:166:5 [opt]
   163 	    h = uv__queue_data(q, uv_async_t, queue);
   164 	
   165 	    uv__queue_remove(q);
-> 166 	    uv__queue_insert_tail(&loop->async_handles, q);
   167 	
   168 	    /* Atomically fetch and clear pending flag */
   169 	    pending = (_Atomic int*) &h->pending;
(llnode) p queue
(uv__queue) {
  next = 0x00000001087a0ee8
  prev = 0x000000012381f790
}
(llnode) n
Process 22379 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = step over
    frame #0: 0x0000000105154cec node`uv__async_io(loop=0x00000001087a0c90, w=<unavailable>, events=<unavailable>) at async.c:169:34 [opt]
   166 	    uv__queue_insert_tail(&loop->async_handles, q);
   167 	
   168 	    /* Atomically fetch and clear pending flag */
-> 169 	    pending = (_Atomic int*) &h->pending;
   170 	    if (atomic_exchange(pending, 0) == 0)
   171 	      continue;
   172 	
(llnode) p queue
(uv__queue) {
  next = 0x0000000134807748 // it's back in queue?????????????????
  prev = 0x000000012381f790
}

This should only be possible if the two queues share nodes, which they never should. Sure enough, queue->next == loop->async_handles->prev:

(llnode) p queue // temp queue
(uv__queue) {
  next = 0x0000000134807748 // !!!
  prev = 0x000000012381f790
}
(llnode) p loop->async_handles // event loop async queue
(uv__queue) {
  next = 0x00000001087a0dc0
  prev = 0x0000000134807748 // !!!
}

I believe this is caused by calling uv_async_init from the wrong thread. Our app creates several Kafka producers and consumers at startup, which makes the race more likely to hit. A standalone reproduction would likely involve connecting and disconnecting producers and/or consumers in a loop until the process hangs.

Metadata

Labels

bug (Something isn't working) · fixed-present-in-next-release (the fix is in the development branch but not yet released) · great report
