-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Catching StackOverflowErrors from bad regexes in GsubProcessor #106851
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi @masseyke, I've created a changelog YAML for you. |
…earch into fix/gsub-stackoverflow
|
Pinging @elastic/es-data-management (Team:Data Management) |
This reverts commit ff916a6.
| * StackOverflowError, so we rethrow it as an Exception instead. This way the document fails this processor, but processing | ||
| * can carry on. | ||
| */ | ||
| throw new ElasticsearchException("Caught a StackOverflowError while processing gsub pattern: [" + pattern + "]", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reminded myself of why this ended up being a little more intricate over on KeyValueProcessor with logAndBuildException . I think we should go with the same approach here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now ongoing in #106931
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may need to hide the underlying SOE. If this exception bubbles up to the threadpool's exeption handler, that searches the stack trace for errors, and would then cause ES to exit. See #102394
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I also realized it's not serializable within the ElasticsearchException. I'm going to call AbtractProcessor.logAndBuildException() from #106931 instead once that goes in.
| * can carry on. The value would be useful to log here, but we do not do so for because we do not want to write potentially | ||
| * sensitive data to the logs. | ||
| */ | ||
| throw logAndBuildException("Caught a StackOverflowError while processing gsub pattern: [" + pattern + "]", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a comment over on #106931. I don't think we want to be logging every bad regex, it's not in the server's control, it's a user error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These errors are exceedingly rare in practice (like I've seen evidence of something like this maybe 4 times in 3 years). It's a combination of a bad regex + bad data. We can't log the data, but logging the regex at least gives us and our support team something to go by. And it could very well be Elastic's own regexes from some integration causing the failure. Right now we just have nothing to go by. We could potentially throttle this, or write these failures to a custom index or something, but that all seems like overkill for something that's pretty rare, doesn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While it may be rare for us to be told about these errors, are we sure they aren't occurring in the wild with more frequency?
If this occurs, presumably the user is receiving the exception to send to support. Why is the log the only expected means to get at the regex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before this PR, the StackOverflowError takes down the server and the user received no response. That's why I'm pretty sure it's rare. We can potentially change logAndBuildException to log a different message -- or are you saying that we ought to not be logging anything on this kind of error? It's arguably a bug in the system that we allow a StackOverflowError to happen in the first place inside of the database process, so it seems log-worthy. For example if we start seeing this a lot in the logs we might want to prioritize a different fix.
Just to tie them together, the details of how logAndBuildException works were worked out as part of #99493.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not against catching the SOE, we should definitely do that. But I don't think logging the bad regex is necessary. It's a user concern: it's their bad regex. The fact that we must resort to catching SOE is an unfortunate result of using java's Pattern. In painless we try to be more aggressive in rejecting complex regexes (see #63029), but even then we still catch SOE there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I share Ryan's concerns here. We should push errors like this back onto the client, which (a) knows a lot more context about the erroneous request than we do within ES and (b) can actually do something to address the problem. If it's our own integrations then we can rely on them handling or exposing this sort of error properly, including sharing all the context needed to work out where things are going wrong. That would help the support team engage directly with the folks that own the problem rather than unnecessarily escalating things to the ES team.
If nobody is noticing these errors then the situation seems fairly hopeless no matter where these things are logged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a silly question:
- Elastic Agent / Logstash sends a doc in a bulk request to an ingest pipeline. Ingest pipeline contains a bad regex. How would that look like? It will be a failed request in the bulk and therefore agent / logstash should log it, similiarly to a mapping exception?
@ruflin do you know how this here will impact the logs+ with the failure store inside ES instead of logging back to the agent that there was an issue with this document. Would the document then contain a error.ingest_failure or something like that telling us what went wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I will expect when the failure store is enabled, the "bad" document will end up in the failure store and ideally the document itself will contain some info around what broke. What is returned to the client, I need @dakrone to chime on this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this sounds like a good use for the failure store (and perhaps overlapping with those users that would be likely to use the failure store in the first place). Yes, the failure store document will contain the exception if thrown during indexing (assuming it's enabled of course). I could see a situation where we log it, and a user sends many of these and they pollute the logs a lot. Logging anything on a per-document basis can potentially be really large.
What about changing the logging to be trace-only (so it's still accessible if absolutely necessary), and still returning the message in the regular thrown exception (so it makes it into the failure store)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I changed to trace-only logging, with the regex in the exception message and in the trace-only log message. Does that sound ok to everyone else?
| */ | ||
| String message = "Caught a StackOverflowError while processing gsub pattern: [" + pattern + "]"; | ||
| logger.trace(message, e); | ||
| throw new ElasticsearchException(message); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be an IllegalArgumentException? I think if ElasticsearchException is used it will result in a 500 instead of a 4xx.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like a stretch of the definition of IllegalArgumentException, but I'm willing to do iot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't have to be IAE, but that maps to a 400. You could also create a new subclass of ElasticsearchException and specify the status explicitly, or look for other existing subclasses that return 4xx.
dakrone
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This catches StackOverflowError if one is thrown from an especially bad gsub processor pattern on an especially bad document, and rethrows it as something that upstream code handles better.