Description
I have been looking at what the Collector does in erroneous situations, such as when an exporter's destination is temporarily unavailable, and found a number of places in our codebase where we probably do not handle error cases correctly. This can result in data loss even when that loss is preventable.
This is a summary task where I want to capture a list of things I think can be improved.
Receivers
Some (most?) receivers currently do not follow the Error Handling contract, which says:
// If the error is non-Permanent then the nextConsumer.Consume*() call should be retried
// with the same data.
Here are open issues for some receivers:
- [otlp] Ensure OTLP receiver handles consume errors correctly #4335
- [otlp] Add throttling / backpressure and correct error handling to OTLP receiver #669
- [filelog] [pkg/stanza] Support back-pressure from downstream consumers opentelemetry-collector-contrib#20511
If you discover more such receivers, please add to the list above.
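For illustration, here is a minimal sketch of what honoring this contract looks like on the receiver side. This is not code from any existing receiver; the function name and backoff values are made up, but `consumererror.IsPermanent` is the existing way to distinguish permanent errors.

```go
package receiverexample

import (
	"context"
	"time"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/consumer/consumererror"
	"go.opentelemetry.io/collector/pdata/plog"
)

// consumeWithRetry keeps calling the next consumer with the same data until
// it succeeds or the error is Permanent. The fixed backoff is a placeholder;
// a real receiver would use a proper backoff policy.
func consumeWithRetry(ctx context.Context, next consumer.Logs, ld plog.Logs) error {
	backoff := 500 * time.Millisecond
	for {
		err := next.ConsumeLogs(ctx, ld)
		if err == nil {
			return nil
		}
		// Permanent errors mean the data will never be accepted: stop retrying.
		if consumererror.IsPermanent(err) {
			return err
		}
		// Non-permanent error: wait and retry the same payload, honoring shutdown.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff < 30*time.Second {
			backoff *= 2
		}
	}
}
```

The key point is that the same payload is passed on every attempt, and the loop only gives up on permanent errors or shutdown.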
I would also like to add test helpers that allow us to easily verify receivers' compliance with the contract. Here is a task to add such a helper:
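As a rough illustration of the direction, such a helper could be built around a "flaky" consumer test double along these lines (hypothetical package and type names; the eventual helper may look quite different):

```go
package contracttest

import (
	"context"
	"errors"
	"sync"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/plog"
)

var errTemporarilyUnavailable = errors.New("next consumer temporarily unavailable")

// flakyLogsConsumer rejects the first failFirstN calls with a plain
// (non-permanent) error and records everything it accepts afterwards.
type flakyLogsConsumer struct {
	mu         sync.Mutex
	failFirstN int
	calls      int
	received   []plog.Logs
}

func (f *flakyLogsConsumer) Capabilities() consumer.Capabilities {
	return consumer.Capabilities{MutatesData: false}
}

func (f *flakyLogsConsumer) ConsumeLogs(_ context.Context, ld plog.Logs) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.calls++
	if f.calls <= f.failFirstN {
		// Non-permanent error: a compliant receiver must retry the same payload.
		return errTemporarilyUnavailable
	}
	f.received = append(f.received, ld)
	return nil
}
```

A test would plug this in as the receiver-under-test's next consumer, feed the receiver some input, and assert that every record eventually shows up in `received` exactly once despite the initial failures.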
Another part of the receiver contract is acknowledgement and checkpointing handling. This issue needs to be resolved to ensure that part of the contract is fulfilled:
Exporters
The QueuedRetry exporter helper does not have requeuingEnabled=true when the in-memory queue is used. This results in data being dropped after a few retries, which we don't want: we want to keep retrying, just like we do when the persistent queue is enabled. Perhaps we should make requeuingEnabled a user-visible option if we don't want to hard-code the requeuing behavior.
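To make the difference concrete, here is a conceptual sketch of what requeuingEnabled changes in the queue consumer loop. The types and names below are simplified stand-ins, not the actual exporterhelper code.

```go
package exporterexample

import "context"

// request and queue are simplified stand-ins for the exporter helper's
// internal types; this is not the real exporterhelper API.
type request interface {
	Export(ctx context.Context) error
}

type queue interface {
	Produce(req request) bool // returns false when the queue is full
	Consume(fn func(req request))
}

type queuedSender struct {
	queue            queue
	requeuingEnabled bool // today effectively enabled only with the persistent queue
}

// start drains the queue. Assume the retry helper has already exhausted its
// attempts inside req.Export by the time an error surfaces here.
func (qs *queuedSender) start(ctx context.Context) {
	qs.queue.Consume(func(req request) {
		if err := req.Export(ctx); err == nil {
			return
		}
		if qs.requeuingEnabled {
			// Put the failed request back on the queue so it is not lost.
			// With the in-memory queue this still only protects until a restart.
			qs.queue.Produce(req)
			return
		}
		// requeuingEnabled=false: the data is dropped at this point, which is
		// the behavior this issue wants to avoid for the in-memory queue.
	})
}
```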
Here is the list of tasks to resolve:
- Should in-memory QueuedRetry exporter helper use requeuingEnabled=true? #7480
- Add test helper to verify exporter behavior on errors #7479
- [otlpexporter] Add test to verify behavior on destination errors #7481
- [otlphttpexporter] Add test to verify behavior on destination errors #7482
Batching
The batchprocessor currently drops the accumulated batch if the next consumer returns a non-permanent error. This is not correct behavior.
We have 2 possible solutions:
- Quick fix: make sure batchprocessor stops accepting new data and blocks the input when it is full, and retries sending the accumulated batch to the next consumer if previous attempts return non-permanent errors (see the sketch after this list).
- Longer fix: eliminate the batch processor and move the batching functionality to the exporter helper; see the proposal here:
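Here is a rough sketch of the quick-fix idea (hypothetical code, not the real batchprocessor; the merging of items into size-based batches is elided):

```go
package batchexample

import (
	"context"
	"time"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/consumer/consumererror"
	"go.opentelemetry.io/collector/pdata/plog"
)

type blockingBatcher struct {
	next  consumer.Logs
	input chan plog.Logs // bounded: senders block when the batcher is full
}

// ConsumeLogs blocks instead of dropping when the batcher cannot keep up,
// propagating backpressure to the receiver.
func (b *blockingBatcher) ConsumeLogs(ctx context.Context, ld plog.Logs) error {
	select {
	case b.input <- ld:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// flushLoop sends accumulated data and keeps retrying on non-permanent errors
// so a temporarily unavailable destination does not cause data loss.
func (b *blockingBatcher) flushLoop(ctx context.Context) {
	for batch := range b.input {
		for {
			err := b.next.ConsumeLogs(ctx, batch)
			if err == nil || consumererror.IsPermanent(err) {
				break
			}
			select {
			case <-ctx.Done():
				return
			case <-time.After(time.Second):
			}
		}
	}
}
```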
End-to-end Testing
There are many places in our codebase where incorrect implementations and bugs may cause loss of data, especially in stressed and erroneous situations. To ensure the Collector works correctly, we want to add end-to-end integration tests that verify the operation of the Collector as a whole:
- In typical recommended configuration (memorylimiter and batch processors)
- In memory limited mode and when destination is unavailable.
Without such tests it is hard to be confident that we do not drop data somewhere due to a bug or due to an incorrectly implemented component. This should include testing what happens when the memory limit is hit.
Here is a task to implement this:
Clarify the Design
Here is the list of issues to resolve: