
Conversation

@masseyke
Member

This catches StackOverflowError if one is thrown from an especially bad gsub processor pattern on an especially bad document, and rethrows it as something that upstream code handles better.
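For reference, a minimal sketch of the shape of this change (the method name and the replacement field are illustrative assumptions, not the exact diff; the thrown exception matches the snippet reviewed below):

String doGsub(String value) {
    try {
        return pattern.matcher(value).replaceAll(replacement);
    } catch (StackOverflowError e) {
        // java.util.regex matching is recursive, so a pathological pattern/value combination
        // can exhaust the stack. Convert the Error into an exception that per-document failure
        // handling (e.g. an on_failure block) already knows how to deal with.
        throw new ElasticsearchException("Caught a StackOverflowError while processing gsub pattern: [" + pattern + "]", e);
    }
}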

@masseyke masseyke added >bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v8.14.0 labels Mar 27, 2024
@masseyke masseyke marked this pull request as draft March 27, 2024 22:14
@elasticsearchmachine
Collaborator

Hi @masseyke, I've created a changelog YAML for you.

@masseyke masseyke requested a review from joegallo March 28, 2024 20:57
@masseyke masseyke marked this pull request as ready for review March 28, 2024 20:58
@masseyke masseyke requested a review from a team as a code owner March 28, 2024 20:58
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Mar 28, 2024
@masseyke masseyke removed the request for review from a team March 28, 2024 20:58
* StackOverflowError, so we rethrow it as an Exception instead. This way the document fails this processor, but processing
* can carry on.
*/
throw new ElasticsearchException("Caught a StackOverflowError while processing gsub pattern: [" + pattern + "]", e);
Contributor

I've reminded myself of why this ended up being a little more intricate over on KeyValueProcessor with logAndBuildException. I think we should go with the same approach here.

Member Author

This is now ongoing in #106931

Member

You may need to hide the underlying SOE. If this exception bubbles up to the threadpool's exception handler, that handler searches the stack trace for errors and would then cause ES to exit. See #102394
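To illustrate the concern (a sketch only, assuming it is acceptable to drop the original stack trace): if the Error is chained as the cause, handlers that walk the exception chain looking for fatal errors (e.g. ExceptionsHelper.maybeError-style checks) can still find it and terminate the node, so one option is to not attach it at all:

} catch (StackOverflowError e) {
    // Deliberately do NOT pass "e" as the cause: code that inspects getCause() for Errors
    // would otherwise still see the StackOverflowError and could take down the process.
    throw new ElasticsearchException("Caught a StackOverflowError while processing gsub pattern: [" + pattern + "]");
}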

Member Author

Yeah, I also realized it's not serializable within the ElasticsearchException. I'm going to call AbstractProcessor.logAndBuildException() from #106931 instead once that goes in.

@masseyke masseyke requested review from joegallo and rjernst March 29, 2024 21:51
* can carry on. The value would be useful to log here, but we do not do so because we do not want to write potentially
* sensitive data to the logs.
*/
throw logAndBuildException("Caught a StackOverflowError while processing gsub pattern: [" + pattern + "]", e);
Member

I left a comment over on #106931. I don't think we want to be logging every bad regex; it's not in the server's control, it's a user error.

Member Author

These errors are exceedingly rare in practice (like I've seen evidence of something like this maybe 4 times in 3 years). It's a combination of a bad regex + bad data. We can't log the data, but logging the regex at least gives us and our support team something to go by. And it could very well be Elastic's own regexes from some integration causing the failure. Right now we just have nothing to go by. We could potentially throttle this, or write these failures to a custom index or something, but that all seems like overkill for something that's pretty rare, doesn't it?

Member

While it may be rare for us to be told about these errors, are we sure they aren't occurring in the wild with more frequency?

If this occurs, presumably the user is receiving the exception to send to support. Why is the log the only expected means to get at the regex?

Member Author

Before this PR, the StackOverflowError took down the server and the user received no response. That's why I'm pretty sure it's rare. We can potentially change logAndBuildException to log a different message -- or are you saying that we ought not to log anything on this kind of error? It's arguably a bug in the system that we allow a StackOverflowError to happen inside of the database process in the first place, so it seems log-worthy. For example, if we start seeing this a lot in the logs we might want to prioritize a different fix.
Just to tie them together, the details of how logAndBuildException works were worked out as part of #99493.

Member

I'm not against catching the SOE, we should definitely do that. But I don't think logging the bad regex is necessary. It's a user concern: it's their bad regex. The fact that we must resort to catching SOE is an unfortunate result of using java's Pattern. In painless we try to be more aggressive in rejecting complex regexes (see #63029), but even then we still catch SOE there.
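For anyone unfamiliar with why a regex can overflow the stack at all: java.util.regex.Pattern matches alternations under quantifiers recursively, so a heavily backtracking pattern/input combination can exhaust the stack rather than merely match slowly. A hedged, self-contained illustration (the pattern and input are arbitrary examples; whether it actually overflows depends on the thread stack size):

import java.util.regex.Pattern;

public class RegexStackOverflowDemo {
    public static void main(String[] args) {
        // An alternation under a quantifier makes java.util.regex recurse per repetition,
        // so a long enough input can throw StackOverflowError. The exact threshold
        // depends on the JVM's -Xss setting.
        Pattern pattern = Pattern.compile("(a|b)*c");
        String input = "ab".repeat(100_000);
        try {
            System.out.println(pattern.matcher(input).matches());
        } catch (StackOverflowError e) {
            System.out.println("regex matching overflowed the stack");
        }
    }
}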

Contributor

I share Ryan's concerns here. We should push errors like this back onto the client, which (a) knows a lot more context about the erroneous request than we do within ES and (b) can actually do something to address the problem. If it's our own integrations then we can rely on them handling or exposing this sort of error properly, including sharing all the context needed to work out where things are going wrong. That would help the support team engage directly with the folks that own the problem rather than unnecessarily escalating things to the ES team.

If nobody is noticing these errors then the situation seems fairly hopeless no matter where these things are logged.

Contributor

Maybe a silly question:

  • Elastic Agent / Logstash sends a doc in a bulk request to an ingest pipeline. The ingest pipeline contains a bad regex. What would that look like? Would it be a failed request in the bulk, so that the agent / Logstash should log it, similarly to a mapping exception?

@ruflin do you know how this will impact logs+ with the failure store inside ES, instead of logging back to the agent that there was an issue with this document? Would the document then contain an error.ingest_failure or something like that, telling us what went wrong?

Contributor

In general I would expect that when the failure store is enabled, the "bad" document will end up in the failure store, and ideally the document itself will contain some info about what broke. As for what is returned to the client, I need @dakrone to chime in on this one.

Member

I think this sounds like a good use for the failure store (and perhaps overlapping with those users that would be likely to use the failure store in the first place). Yes, the failure store document will contain the exception if thrown during indexing (assuming it's enabled, of course). I could see a situation where we log it, a user sends many of these, and they pollute the logs a lot. Logging anything on a per-document basis can potentially add up to a lot of log volume.

What about changing the logging to be trace-only (so it's still accessible if absolutely necessary), and still returning the message in the regular thrown exception (so it makes it into the failure store)?

Member Author

Alright, I changed to trace-only logging, with the regex in the exception message and in the trace-only log message. Does that sound ok to everyone else?

@joegallo joegallo requested review from dakrone and removed request for joegallo April 2, 2024 15:20
@masseyke masseyke requested a review from rjernst April 9, 2024 14:48
*/
String message = "Caught a StackOverflowError while processing gsub pattern: [" + pattern + "]";
logger.trace(message, e);
throw new ElasticsearchException(message);
Member

Can this be an IllegalArgumentException? I think if ElasticsearchException is used it will result in a 500 instead of a 4xx.

Member Author

It seems like a stretch of the definition of IllegalArgumentException, but I'm willing to do it.

Member

It doesn't have to be IAE, but that maps to a 400. You could also create a new subclass of ElasticsearchException and specify the status explicitly, or look for other existing subclasses that return 4xx.
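A minimal sketch of the subclass option, assuming the usual ElasticsearchException/RestStatus pattern (the class name is hypothetical, and a real subclass would also need the wire-serialization registration that is omitted here):

import org.elasticsearch.ElasticsearchException;
import org.elasticsearch.rest.RestStatus;

// Hypothetical class name, shown only to illustrate overriding status().
public class GsubStackOverflowException extends ElasticsearchException {

    public GsubStackOverflowException(String message) {
        super(message);
    }

    @Override
    public RestStatus status() {
        // Report this as a client error (400) instead of the default 500.
        return RestStatus.BAD_REQUEST;
    }
}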

@masseyke masseyke requested a review from rjernst April 10, 2024 15:37
Member

@dakrone dakrone left a comment

LGTM

@masseyke masseyke merged commit ef16be9 into elastic:main Apr 11, 2024
@masseyke masseyke deleted the fix/gsub-stackoverflow branch April 11, 2024 20:59