Skip to content

[ML] Data frame cannot be deleted if it was stopped while it was finishing #42995

@dolaru

Description

@dolaru

Elasticsearch version (bin/elasticsearch --version): 7.3.0 and 8.0.0

Plugins installed: ['repository-gcs']

JVM version (java -version): openjdk 11.0.2

OS version (uname -a if on a Unix-like system): Darwin Kernel Version 18.6.0

Description of the problem including expected versus actual behavior:
From Slack discussions:

[Ben Trent]
Getting an NPE in Lucene.
The best we can tell it was added here: 63ee61d...cb527c2

What we are doing in data-frame is running a DBQ terms query against the .data-frame-internal-1. Anytime that DBQ is running, a new doc could be added to the index with that same term.

Unsure if this same thing happens in 7.3 or 7.2. We are still testing

[Henning Andersen]
An educated guess is that it is the prune ID merge policy change that changed this. If you can reproduce this locally, it is easy to disable this at InternalEngine:2180, simply replace new PrunePostingsMergePolicy(mergePolicy, IdFieldMapper.NAME) with mergePolicy. Curious to know the result...
The problems is likely the slice on _id, I think a null check is missing in TermsSliceQuery since we risk a segment with no _id now (but this is way beyond my knowledge)...

Based on investigations so far, it looks like this issue is caused by #40741.

Steps to reproduce:

  1. Create a data frame and start it
  2. Send a stop request while the data frame is finishing
  3. Try to delete the data frame
  4. Notice that the delete request fails with a 500, and the reason is an NPE.

Provide logs (if relevant):

Delete data frame API response
{
	"phase": "query",
	"failed_shards": [{
		"node": "qSY9saZrRjm_7YULzSw-RQ",
		"reason": {
			"reason": null,
			"type": "null_pointer_exception"
		},
		"index": ".data-frame-internal-1",
		"shard": 0
	}],
	"reason": "all shards failed",
	"grouped": true,
	"caused_by": {
		"reason": null,
		"caused_by": {
			"reason": null,
			"type": "null_pointer_exception"
		},
		"type": "null_pointer_exception"
	},
	"suppressed": [{
		"phase": "query",
		"failed_shards": [{
			"node": "qSY9saZrRjm_7YULzSw-RQ",
			"reason": {
				"reason": null,
				"type": "null_pointer_exception"
			},
			"index": ".data-frame-internal-1",
			"shard": 0
		}],
		"reason": "all shards failed",
		"grouped": true,
		"caused_by": {
			"reason": null,
			"caused_by": {
				"reason": null,
				"type": "null_pointer_exception"
			},
			"type": "null_pointer_exception"
		},
		"type": "search_phase_execution_exception"
	}, {
		"phase": "query",
		"failed_shards": [{
			"node": "qSY9saZrRjm_7YULzSw-RQ",
			"reason": {
				"reason": null,
				"type": "null_pointer_exception"
			},
			"index": ".data-frame-internal-1",
			"shard": 0
		}],
		"reason": "all shards failed",
		"grouped": true,
		"caused_by": {
			"reason": null,
			"caused_by": {
				"reason": null,
				"type": "null_pointer_exception"
			},
			"type": "null_pointer_exception"
		},
		"type": "search_phase_execution_exception"
	}, {
		"phase": "query",
		"failed_shards": [{
			"node": "qSY9saZrRjm_7YULzSw-RQ",
			"reason": {
				"reason": null,
				"type": "null_pointer_exception"
			},
			"index": ".data-frame-internal-1",
			"shard": 0
		}],
		"reason": "all shards failed",
		"grouped": true,
		"caused_by": {
			"reason": null,
			"caused_by": {
				"reason": null,
				"type": "null_pointer_exception"
			},
			"type": "null_pointer_exception"
		},
		"type": "search_phase_execution_exception"
	}, {
		"phase": "query",
		"failed_shards": [{
			"node": "qSY9saZrRjm_7YULzSw-RQ",
			"reason": {
				"reason": null,
				"type": "null_pointer_exception"
			},
			"index": ".data-frame-internal-1",
			"shard": 0
		}],
		"reason": "all shards failed",
		"grouped": true,
		"caused_by": {
			"reason": null,
			"caused_by": {
				"reason": null,
				"type": "null_pointer_exception"
			},
			"type": "null_pointer_exception"
		},
		"type": "search_phase_execution_exception"
	}],
	"type": "search_phase_execution_exception",
	"root_cause": [{
		"reason": null,
		"type": "null_pointer_exception"
	}]
}
ES Logs
...
[2019-06-07T16:10:56,041][TRACE][o.e.x.d.t.p.SchemaUtil   ] [dolaru-elastic-mbp.local] Extracted type for [responsetime] : [float] from index [farequote]
[2019-06-07T16:10:56,041][TRACE][o.e.x.d.t.p.SchemaUtil   ] [dolaru-elastic-mbp.local] Extracted type for [airline] : [keyword] from index [farequote]
[2019-06-07T16:10:56,041][DEBUG][o.e.x.d.t.p.SchemaUtil   ] [dolaru-elastic-mbp.local] Deduced mapping for: [avg_responsetime], agg type [avg] to [double]
[2019-06-07T16:10:56,041][DEBUG][o.e.x.d.t.p.SchemaUtil   ] [dolaru-elastic-mbp.local] Deduced mapping for: [airline] to [keyword]
[2019-06-07T16:10:56,054][INFO ][o.e.c.m.MetaDataCreateIndexService] [dolaru-elastic-mbp.local] [farequote_dft_start] creating index, cause [api], templates [], shards [1]/[1], mappings [_doc]
[2019-06-07T16:10:56,055][INFO ][o.e.c.r.a.AllocationService] [dolaru-elastic-mbp.local] updating number_of_replicas to [0] for indices [farequote_dft_start]
[2019-06-07T16:10:56,276][TRACE][o.e.x.d.t.p.SchemaUtil   ] [dolaru-elastic-mbp.local] Extracted type for [avg_responsetime] : [double] from index [farequote_dft_start]
[2019-06-07T16:10:56,276][TRACE][o.e.x.d.t.p.SchemaUtil   ] [dolaru-elastic-mbp.local] Extracted type for [airline] : [keyword] from index [farequote_dft_start]
[2019-06-07T16:10:56,277][INFO ][o.e.x.d.t.DataFrameTransformTask] [dolaru-elastic-mbp.local] Updating state for data frame transform [test_dft_start] to [{"task_state":"started","indexer_state":"stopped","checkpoint":0}]
[2019-06-07T16:10:56,317][DEBUG][o.e.x.d.t.DataFrameTransformTask] [dolaru-elastic-mbp.local] Successfully updated state for data frame transform [test_dft_start] to [{"task_state":"started","indexer_state":"stopped","checkpoint":0}]
[2019-06-07T16:10:56,318][DEBUG][o.e.x.d.t.DataFrameTransformTask] [dolaru-elastic-mbp.local] Data frame indexer [data_frame/transforms/schedule_test_dft_start] schedule has triggered, state: [STARTED]
[2019-06-07T16:10:56,319][INFO ][o.e.x.d.t.DataFrameTransformPersistentTasksExecutor] [dolaru-elastic-mbp.local] Successfully completed and scheduled task in node operation
[2019-06-07T16:10:56,376][INFO ][o.e.x.d.t.DataFrameTransformTask] [dolaru-elastic-mbp.local] Finished indexing for data frame transform [test_dft_start] checkpoint [1]
[2019-06-07T16:10:56,379][DEBUG][o.e.x.d.t.DataFrameTransformTask] [dolaru-elastic-mbp.local] Updating persistent state of transform [test_dft_start] to [{"task_state":"started","indexer_state":"started","checkpoint":1,"progress":{"total_docs":86274,"docs_remaining":0,"percent_complete":100.0}}]
[2019-06-07T16:10:56,583][DEBUG][o.e.x.d.t.DataFrameTransformTask] [dolaru-elastic-mbp.local] Updating persistent state of transform [test_dft_start] to [{"task_state":"stopped","indexer_state":"stopped","checkpoint":1,"progress":{"total_docs":86274,"docs_remaining":0,"percent_complete":100.0}}]
[2019-06-07T16:10:56,639][INFO ][o.e.x.d.t.DataFrameTransformTask] [dolaru-elastic-mbp.local] Data frame transform [test_dft_start] indexer has stopped
[2019-06-07T16:10:56,740][DEBUG][o.e.a.s.TransportSearchAction] [dolaru-elastic-mbp.local] [.data-frame-internal-1][0], node[C0FEW7SlS7aMxaIxZygbuQ], [P], s[STARTED], a[id=YVgYHIXPSuC0IJDzsLuWcQ]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[.data-frame-internal-1], indicesOptions=IndicesOptions[ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=true], routing='null', preference='null', requestCache=null, scroll=Scroll{keepAlive=5m}, maxConcurrentShardRequests=0, batchedReduceSize=512, preFilterShardSize=128, allowPartialSearchResults=true, localClusterAlias=null, getOrCreateAbsoluteStartMillis=-1, ccsMinimizeRoundtrips=true, source={"size":1000,"query":{"term":{"id":{"value":"test_dft_start","boost":1.0}}},"version":false,"seq_no_primary_term":true,"_source":false,"sort":[{"_doc":{"order":"asc"}}],"slice":{"field":"_id","id":1,"max":5}}}]
org.elasticsearch.transport.RemoteTransportException: [dolaru-elastic-mbp.local][10.4.20.60:9300][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.search.query.QueryPhaseExecutionException: Query Failed [Failed to execute main query]
	at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:305) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:113) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:335) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:360) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$1(SearchService.java:340) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.action.ActionListener.lambda$map$2(ActionListener.java:145) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.search.SearchService$2.doRun(SearchService.java:1052) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:757) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: java.lang.NullPointerException
	at org.elasticsearch.search.slice.TermsSliceQuery.build(TermsSliceQuery.java:81) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.search.slice.TermsSliceQuery$1.scorer(TermsSliceQuery.java:63) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.apache.lucene.search.Weight.scorerSupplier(Weight.java:147) ~[lucene-core-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:34:03]
	at org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorerSupplier(LRUQueryCache.java:727) ~[lucene-core-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:34:03]
	at org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.scorerSupplier(IndicesQueryCache.java:157) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:374) ~[lucene-core-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:34:03]
	at org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:340) ~[lucene-core-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:34:03]
	at org.apache.lucene.search.Weight.bulkScorer(Weight.java:181) ~[lucene-core-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:34:03]
	at org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:334) ~[lucene-core-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:34:03]
	at org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:808) ~[lucene-core-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:34:03]
	at org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.bulkScorer(IndicesQueryCache.java:163) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.search.internal.ContextIndexSearcher$1.bulkScorer(ContextIndexSearcher.java:166) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:649) ~[lucene-core-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:34:03]
	at org.apache.lucene.search.XIndexSearcher.search(XIndexSearcher.java:44) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:34:03]
	at org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:177) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:443) ~[lucene-core-8.1.0.jar:8.1.0 dbe5ed0b2f17677ca6c904ebae919363f2d36a0a - ishan - 2019-05-09 19:34:03]
	at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:270) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:113) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:335) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:360) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.search.SearchService.lambda$executeQueryPhase$1(SearchService.java:340) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.action.ActionListener.lambda$map$2(ActionListener.java:145) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.search.SearchService$2.doRun(SearchService.java:1052) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:757) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:835) ~[?:?]
...

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions