
Conversation

@masseyke
Member

This catches StackOverflowError if one is thrown from an especially bad gsub processor pattern on an especially bad document, and rethrows it as something that upstream code handles better.
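For reference, a minimal sketch of the shape of this change (the method name and the replacement field are illustrative assumptions, not the exact diff; the thrown exception matches the snippet reviewed below):

String doGsub(String value) {
    try {
        return pattern.matcher(value).replaceAll(replacement);
    } catch (StackOverflowError e) {
        // java.util.regex matching is recursive, so a pathological pattern/value combination
        // can exhaust the stack. Convert the Error into an exception that per-document failure
        // handling (e.g. an on_failure block) already knows how to deal with.
        throw new ElasticsearchException("Caught a StackOverflowError while processing gsub pattern: [" + pattern + "]", e);
    }
}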

@masseyke masseyke added >bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v8.14.0 labels Mar 27, 2024
@masseyke masseyke marked this pull request as draft March 27, 2024 22:14
@elasticsearchmachine
Collaborator

Hi @masseyke, I've created a changelog YAML for you.

@masseyke masseyke requested a review from joegallo March 28, 2024 20:57
@masseyke masseyke marked this pull request as ready for review March 28, 2024 20:58
@masseyke masseyke requested a review from a team as a code owner March 28, 2024 20:58
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Mar 28, 2024
@masseyke masseyke removed the request for review from a team March 28, 2024 20:58
* StackOverflowError, so we rethrow it as an Exception instead. This way the document fails this processor, but processing
* can carry on.
*/
throw new ElasticsearchException("Caught a StackOverflowError while processing gsub pattern: [" + pattern + "]", e);
Contributor

I've reminded myself of why this ended up being a little more intricate over on KeyValueProcessor with logAndBuildException. I think we should go with the same approach here.

Member Author

This is now ongoing in #106931

Member

You may need to hide the underlying SOE. If this exception bubbles up to the threadpool's exception handler, that handler searches the stack trace for errors and would then cause ES to exit. See #102394
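To illustrate the concern (a sketch only, assuming it is acceptable to drop the original stack trace): if the Error is chained as the cause, handlers that walk the exception chain looking for fatal errors (e.g. ExceptionsHelper.maybeError-style checks) can still find it and terminate the node, so one option is to not attach it at all:

} catch (StackOverflowError e) {
    // Deliberately do NOT pass "e" as the cause: code that inspects getCause() for Errors
    // would otherwise still see the StackOverflowError and could take down the process.
    throw new ElasticsearchException("Caught a StackOverflowError while processing gsub pattern: [" + pattern + "]");
}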

Member Author

Yeah, I also realized it's not serializable within the ElasticsearchException. I'm going to call AbstractProcessor.logAndBuildException() from #106931 instead once that goes in.

@masseyke masseyke requested review from joegallo and rjernst March 29, 2024 21:51
* can carry on. The value would be useful to log here, but we do not do so because we do not want to write potentially
* sensitive data to the logs.
*/
throw logAndBuildException("Caught a StackOverflowError while processing gsub pattern: [" + pattern + "]", e);
Member

I left a comment over on #106931. I don't think we want to be logging every bad regex; it's not in the server's control, it's a user error.

Member Author

These errors are exceedingly rare in practice (like I've seen evidence of something like this maybe 4 times in 3 years). It's a combination of a bad regex + bad data. We can't log the data, but logging the regex at least gives us and our support team something to go by. And it could very well be Elastic's own regexes from some integration causing the failure. Right now we just have nothing to go by. We could potentially throttle this, or write these failures to a custom index or something, but that all seems like overkill for something that's pretty rare, doesn't it?

Member

While it may be rare for us to be told about these errors, are we sure they aren't occurring in the wild with more frequency?

If this occurs, presumably the user is receiving the exception to send to support. Why is the log the only expected means to get at the regex?

Member Author

Before this PR, the StackOverflowError took down the server and the user received no response. That's why I'm pretty sure it's rare. We can potentially change logAndBuildException to log a different message -- or are you saying that we ought not to log anything on this kind of error? It's arguably a bug in the system that we allow a StackOverflowError to happen inside of the database process in the first place, so it seems log-worthy. For example, if we start seeing this a lot in the logs we might want to prioritize a different fix.
Just to tie them together, the details of how logAndBuildException works were worked out as part of #99493.

Member

I'm not against catching the SOE, we should definitely do that. But I don't think logging the bad regex is necessary. It's a user concern: it's their bad regex. The fact that we must resort to catching SOE is an unfortunate result of using java's Pattern. In painless we try to be more aggressive in rejecting complex regexes (see #63029), but even then we still catch SOE there.
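For anyone unfamiliar with why a regex can overflow the stack at all: java.util.regex.Pattern matches alternations under quantifiers recursively, so a heavily backtracking pattern/input combination can exhaust the stack rather than merely match slowly. A hedged, self-contained illustration (the pattern and input are arbitrary examples; whether it actually overflows depends on the thread stack size):

import java.util.regex.Pattern;

public class RegexStackOverflowDemo {
    public static void main(String[] args) {
        // An alternation under a quantifier makes java.util.regex recurse per repetition,
        // so a long enough input can throw StackOverflowError. The exact threshold
        // depends on the JVM's -Xss setting.
        Pattern pattern = Pattern.compile("(a|b)*c");
        String input = "ab".repeat(100_000);
        try {
            System.out.println(pattern.matcher(input).matches());
        } catch (StackOverflowError e) {
            System.out.println("regex matching overflowed the stack");
        }
    }
}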

Contributor

I share Ryan's concerns here. We should push errors like this back onto the client, which (a) knows a lot more context about the erroneous request than we do within ES and (b) can actually do something to address the problem. If it's our own integrations then we can rely on them handling or exposing this sort of error properly, including sharing all the context needed to work out where things are going wrong. That would help the support team engage directly with the folks that own the problem rather than unnecessarily escalating things to the ES team.

If nobody is noticing these errors then the situation seems fairly hopeless no matter where these things are logged.

Contributor

Maybe a silly question:

  • Elastic Agent / Logstash sends a doc in a bulk request to an ingest pipeline. The ingest pipeline contains a bad regex. What would that look like? Would it be a failed request in the bulk, so that the agent / Logstash should log it, similarly to a mapping exception?

@ruflin do you know how this will impact logs+ with the failure store inside ES, instead of logging back to the agent that there was an issue with this document? Would the document then contain an error.ingest_failure or something like that, telling us what went wrong?

Contributor

In general I would expect that when the failure store is enabled, the "bad" document will end up in the failure store, and ideally the document itself will contain some info about what broke. As for what is returned to the client, I need @dakrone to chime in on this one.

Member

I think this sounds like a good use for the failure store (and perhaps overlapping with those users that would be likely to use the failure store in the first place). Yes, the failure store document will contain the exception if thrown during indexing (assuming it's enabled, of course). I could see a situation where we log it, a user sends many of these, and they pollute the logs a lot. Logging anything on a per-document basis can potentially add up to a lot of log volume.

What about changing the logging to be trace-only (so it's still accessible if absolutely necessary), and still returning the message in the regular thrown exception (so it makes it into the failure store)?

Member Author

Alright, I changed to trace-only logging, with the regex in the exception message and in the trace-only log message. Does that sound ok to everyone else?

@joegallo joegallo requested review from dakrone and removed request for joegallo April 2, 2024 15:20
@masseyke masseyke requested a review from rjernst April 9, 2024 14:48
*/
String message = "Caught a StackOverflowError while processing gsub pattern: [" + pattern + "]";
logger.trace(message, e);
throw new ElasticsearchException(message);
Member

Can this be an IllegalArgumentException? I think if ElasticsearchException is used it will result in a 500 instead of a 4xx.

Member Author

It seems like a stretch of the definition of IllegalArgumentException, but I'm willing to do it.

Member

It doesn't have to be IAE, but that maps to a 400. You could also create a new subclass of ElasticsearchException and specify the status explicitly, or look for other existing subclasses that return 4xx.
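A minimal sketch of the subclass option, assuming the usual ElasticsearchException/RestStatus pattern (the class name is hypothetical, and a real subclass would also need the wire-serialization registration that is omitted here):

import org.elasticsearch.ElasticsearchException;
import org.elasticsearch.rest.RestStatus;

// Hypothetical class name, shown only to illustrate overriding status().
public class GsubStackOverflowException extends ElasticsearchException {

    public GsubStackOverflowException(String message) {
        super(message);
    }

    @Override
    public RestStatus status() {
        // Report this as a client error (400) instead of the default 500.
        return RestStatus.BAD_REQUEST;
    }
}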

@masseyke masseyke requested a review from rjernst April 10, 2024 15:37
Member

@dakrone dakrone left a comment

LGTM

@masseyke masseyke merged commit ef16be9 into elastic:main Apr 11, 2024
@masseyke masseyke deleted the fix/gsub-stackoverflow branch April 11, 2024 20:59