Skip to content

Conversation

@howardhuanghua
Copy link
Contributor

@howardhuanghua howardhuanghua commented Aug 1, 2020

Issue

In our production environment, we try to rolling restart nodes to update from 6.4.3 to 6.8.2, after upgrading one node to 6.8.2, we found lots of follow query fetch exceptions on this upgraded node:

[2020-07-31T15:18:12,101][DEBUG][o.e.a.s.TransportSearchAction] [1566368687200189609] [187102244] Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException: [1566368687200189509][9.33.214.172:9300][indices:data/read/search[phase/fetch/id]]
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -3
        at java.lang.String.substring(Unknown Source) ~[?:1.8.0_181]
        at org.elasticsearch.common.logging.DeprecationLogger.extractWarningValueFromWarningHeader(DeprecationLogger.java:271) ~[elasticsearch-6.8.2.jar:6.8.2-SNAPSHOT]
        at java.util.stream.ReferencePipeline$3$1.accept(Unknown Source) ~[?:1.8.0_181]
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(Unknown Source) ~[?:1.8.0_181]
        at java.util.stream.AbstractPipeline.copyInto(Unknown Source) ~[?:1.8.0_181]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(Unknown Source) ~[?:1.8.0_181]
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(Unknown Source) ~[?:1.8.0_181]
        at java.util.stream.AbstractPipeline.evaluate(Unknown Source) ~[?:1.8.0_181]
        at java.util.stream.ReferencePipeline.collect(Unknown Source) ~[?:1.8.0_181]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ThreadContextStruct.putResponse(ThreadContext.java:491) ~[elasticsearch-6.8.2.jar:6.8.2-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ThreadContextStruct.access$1100(ThreadContext.java:366) ~[elasticsearch-6.8.2.jar:6.8.2-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ThreadContext.addResponseHeader(ThreadContext.java:294) ~[elasticsearch-6.8.2.jar:6.8.2-SNAPSHOT]
        at org.elasticsearch.common.logging.DeprecationLogger.deprecated(DeprecationLogger.java:313) ~[elasticsearch-6.8.2.jar:6.8.2-SNAPSHOT]
        at org.elasticsearch.common.logging.DeprecationLogger.deprecated(DeprecationLogger.java:298) ~[elasticsearch-6.8.2.jar:6.8.2-SNAPSHOT]
        at org.elasticsearch.common.logging.DeprecationLogger.deprecated(DeprecationLogger.java:128) ~[elasticsearch-6.8.2.jar:6.8.2-SNAPSHOT]
        at org.elasticsearch.search.fetch.subphase.DocValueFieldsFetchSubPhase.hitsExecute(DocValueFieldsFetchSubPhase.java:84) ~[elasticsearch-6.8.2.jar:6.8.2-SNAPSHOT]
        at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:225) ~[elasticsearch-6.8.2.jar:6.8.2-SNAPSHOT]
        at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:550) ~[elasticsearch-6.8.2.jar:6.8.2-SNAPSHOT]
        at org.elasticsearch.action.search.SearchTransportService$11.messageReceived(SearchTransportService.java:439) ~[elasticsearch-6.8.2.jar:6.8.2-SNAPSHOT]
        at org.elasticsearch.action.search.SearchTransportService$11.messageReceived(SearchTransportService.java:436) ~[elasticsearch-6.8.2.jar:6.8.2-SNAPSHOT]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.8.2.jar:6.8.2-SNAPSHOT]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1605) ~[elasticsearch-6.8.2.jar:6.8.2-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) ~[elasticsearch-6.8.2.jar:6.8.2-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.8.2.jar:6.8.2-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41) ~[elasticsearch-6.8.2.jar:6.8.2-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.8.2.jar:6.8.2-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:1.8.0_181]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:1.8.0_181]
        at java.lang.Thread.run(Unknown Source) [?:1.8.0_181]

We also could see several above exceptions on other data node, but not that much as this upgraded node.

Analysis

After try catch the StringIndexOutOfBoundsException exception string on 6.4.3 nodes, we found the exception string is:
299 Elasticsearch-6.8.2-bf84795 "'y' year should be replaced with 'u'. Use 'y' for year-of-era. Prefix your date format with '8' to use the new specifier."

The above string is generated on 6.8.2 node, it's because we used joda date formatter string "yyyy-MM-dd HH:mm:ss" in query request body. The deprecation log will be sent to target data node in query phase response header:

next.addResponseHeader("Warning", warningHeaderValue);

However, 6.4.3 version expect double strings in warning formatter, it has date formatter in the end:

This caused 6.4.3 cannot parse the incoming 6.8.2 deprecation log, penultimateQuote and firstQuote are equal:

final String warningValue = s.substring(firstQuote + 1, penultimateQuote - 2);

This will cause all the queries/fetches with depercation logging failed on 6.4.3 data nodes that received the 6.8.2 depercation log. As the responseHeaders of ThreadContext is common in threadpool, and extractWarningValueFromWarningHeader method always be called before 6.8.2 version.

Solution

To avoid date format performance issue in #37622, add fake date time string at the end of warning headers to bwc with versions before 6.8.

@dliappis dliappis added the :Core/Infra/Logging Log management and logging utilities label Aug 3, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (:Core/Infra/Logging)

@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Aug 3, 2020
@howardhuanghua
Copy link
Contributor Author

Based on #37597, use startup time to replace the fake date string. Also backport #38780.

@pgomulka pgomulka self-assigned this Aug 24, 2020
@pgomulka pgomulka added the >bug label Aug 24, 2020
@pgomulka
Copy link
Contributor

pgomulka commented Aug 24, 2020

@howardhuanghua The date part is optional as per https://tools.ietf.org/html/rfc7234#section-5.5
I will investigate your PR as a bug report though
introduced by #37622

@pgomulka
Copy link
Contributor

@howardhuanghua I can see how this can fail in 6.4. Can you please include high level/rest reproduce steps ?

@howardhuanghua
Copy link
Contributor Author

howardhuanghua commented Aug 24, 2020

Hi @pgomulka thanks for looking this issue. The reproduce steps should be simple:

  1. Create 6.4.3 cluster.
  2. Create some indices with timestamp field.
  3. Scale out some 6.8.2 nodes, then the cluster has both 6.4.3 and 6.8.2 nodes, also partital shards should be relocated from 6.4.3 to 6.8.2. (Or you could upgrade partial existing 6.4.3 nodes to 6.8.2 nodes directly)
  4. Send a request use joda date formatter string "yyyy-MM-dd HH:mm:ss" in query request body to some 6.8.2 nodes (must use 6.8.2 as coordinate node), and shard sub query request will transfer to 6.4.3 nodes, you would see the above exception logs on 6.4.3 nodes.

@pgomulka
Copy link
Contributor

pgomulka commented Aug 26, 2020

@howardhuanghua to reproduce there has to be a warning emitted on both upgraded and old parts of the cluster. https://github.com/elastic/elasticsearch/blob/6.4/server/src/main/java/org/elasticsearch/common/util/concurrent/ThreadContext.java#L491
The validation is only done when a new warning response header is added.

If you look closely the stracktrace that you pasted on 1 comment actually emits:

        - 'There are doc-value fields which are not using a format. The output will change in 7.0 when doc value fields get formatted based on mappings by default. It is recommended to pass [format=use_field_mapping] with a doc value field in order to opt in for the future behaviour and ease the migration to 7.0: [count]'

the stactrace lines fits old node, but it is printed in upgraded node logs.
see https://github.com/elastic/elasticsearch/pull/61580/files where I emit a joda warning on upgraded node and use_field_mapping on old node.

@jasontedor do you think we should merge this PR in 6.8?
We have a regex with timestamp as an optional part in both 6.8 and 7.0. This would allow to drop emitting timestamps in 7.

Or maybe we should only send a timestamp when serialising ThreadContext to 6.x node (from 6.8)

@howardhuanghua
Copy link
Contributor Author

@pgomulka I would like to confirm that the warning header is always added based on the follow logic?

next.addResponseHeader("Warning", warningHeaderValue);

the stactrace lines fits old node, but it is printed in upgraded node logs.

Since 6.8 removed the timestamp field from waning header, then the older (6.4) node could not parse the incoming depercation log. That means once user has depercation usage on 6.8, and if cluster also has older version node, then it would be impacted?

@howardhuanghua
Copy link
Contributor Author

Hi @pgomulka would you please help to check that if we could merge this fix to 6.8? It's easy to hit this issue in upgrading case in our production env.

@pgomulka
Copy link
Contributor

@howardhuanghua I discussed this internally and we decided that even thought this is indeed a valid bug report we won't be fixing it.
I understand that situation occurs only within a mixed cluster environment, so once you upgraded fully it should not be a problem anymore.
I will follow up with a known issue doc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>bug :Core/Infra/Logging Log management and logging utilities Team:Core/Infra Meta label for core/infra team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants