Commit ba62eb3

Allow truncation of clean translog (#47866)
Today the `elasticsearch-shard remove-corrupted-data` tool will only truncate a translog it determines to be corrupt. However, there may be other cases in which it is desirable to truncate the translog, for instance if an operation in the translog cannot be replayed for some reason other than corruption. This commit adds a `--truncate-clean-translog` option to skip the corruption check on the translog and blindly truncate it.
1 parent 58289a8 commit ba62eb3
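
The behaviour described in the commit message can be sketched in isolation as follows. This is a simplified illustration, not the code from the commit: `TranslogStatusSketch`, `translogStatus`, `failIfNothingToDo` and the `Supplier`-based analyser are hypothetical names, and the value returned for an unrecoverable index is an assumption, since the diff does not show how that case is initialised. Only the `CleanStatus` constants, the flag name and the wording of the error hint are taken from the change itself.

```java
import java.util.function.Supplier;

// Illustrative sketch only; everything except the CleanStatus constants and the
// flag name is hypothetical and not part of the Elasticsearch code base.
final class TranslogStatusSketch {

    enum CleanStatus { CLEAN, CLEAN_WITH_CORRUPTED_MARKER, CORRUPTED, UNRECOVERABLE, OVERRIDDEN }

    static final String TRUNCATE_CLEAN_TRANSLOG_FLAG = "truncate-clean-translog";

    // Decide the translog status. The analyser only runs when the flag is absent
    // and the index is recoverable, mirroring the if / else-if in the diff.
    static CleanStatus translogStatus(boolean flagGiven, CleanStatus indexStatus, Supplier<CleanStatus> analyser) {
        if (flagGiven) {
            return CleanStatus.OVERRIDDEN; // corruption check skipped entirely
        }
        if (indexStatus != CleanStatus.UNRECOVERABLE) {
            return analyser.get(); // translog can only be checked against a recoverable index
        }
        // Assumption: the diff does not show what the status is in this case.
        return CleanStatus.UNRECOVERABLE;
    }

    // Refuse to run when both parts are clean, hinting at the new flag.
    static void failIfNothingToDo(CleanStatus indexStatus, CleanStatus translogStatus) {
        if (indexStatus == CleanStatus.CLEAN && translogStatus == CleanStatus.CLEAN) {
            throw new IllegalStateException("Shard does not seem to be corrupted"
                + " (pass --" + TRUNCATE_CLEAN_TRANSLOG_FLAG + " to truncate the translog anyway)");
        }
    }

    public static void main(String[] args) {
        Supplier<CleanStatus> cleanTranslog = () -> CleanStatus.CLEAN;
        CleanStatus withFlag = translogStatus(true, CleanStatus.CLEAN, cleanTranslog);
        failIfNothingToDo(CleanStatus.CLEAN, withFlag); // does not throw: status is OVERRIDDEN
        System.out.println(withFlag); // prints OVERRIDDEN
    }
}
```

Because `OVERRIDDEN` is a distinct status rather than a reuse of `CLEAN` or `CORRUPTED`, the "nothing to do" check falls through without any extra condition when the flag is given.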

3 files changed: +79 -20 lines changed

docs/reference/commands/shard-tool.asciidoc

Lines changed: 31 additions & 15 deletions
@@ -1,22 +1,36 @@
 [[shard-tool]]
 == elasticsearch-shard
 
-In some cases the Lucene index or translog of a shard copy can become
-corrupted. The `elasticsearch-shard` command enables you to remove corrupted
-parts of the shard if a good copy of the shard cannot be recovered
-automatically or restored from backup.
+In some cases the Lucene index or translog of a shard copy can become corrupted.
+The `elasticsearch-shard` command enables you to remove corrupted parts of the
+shard if a good copy of the shard cannot be recovered automatically or restored
+from backup.
 
 [WARNING]
 You will lose the corrupted data when you run `elasticsearch-shard`. This tool
 should only be used as a last resort if there is no way to recover from another
 copy of the shard or restore a snapshot.
 
-When Elasticsearch detects that a shard's data is corrupted, it fails that
-shard copy and refuses to use it. Under normal conditions, the shard is
-automatically recovered from another copy. If no good copy of the shard is
-available and you cannot restore from backup, you can use `elasticsearch-shard`
-to remove the corrupted data and restore access to any remaining data in
-unaffected segments.
+[float]
+=== Synopsis
+
+[source,shell]
+--------------------------------------------------
+bin/elasticsearch-shard remove-corrupted-data
+([--index <Index>] [--shard-id <ShardId>] | [--dir <IndexPath>])
+[--truncate-clean-translog]
+[-E <KeyValuePair>]
+[-h, --help] ([-s, --silent] | [-v, --verbose])
+--------------------------------------------------
+
+[float]
+=== Description
+
+When {es} detects that a shard's data is corrupted, it fails that shard copy and
+refuses to use it. Under normal conditions, the shard is automatically recovered
+from another copy. If no good copy of the shard is available and you cannot
+restore one from a snapshot, you can use `elasticsearch-shard` to remove the
+corrupted data and restore access to any remaining data in unaffected segments.
 
 [WARNING]
 Stop Elasticsearch before running `elasticsearch-shard`.
@@ -31,7 +45,7 @@ There are two ways to specify the path:
 translog files.
 
 [float]
-=== Removing corrupted data
+==== Removing corrupted data
 
 `elasticsearch-shard` analyses the shard copy and provides an overview of the
 corruption found. To proceed you must then confirm that you want to remove the
@@ -91,17 +105,19 @@ POST /_cluster/reroute
 ]
 }
 
-You must accept the possibility of data loss by changing parameter `accept_data_loss` to `true`.
+You must accept the possibility of data loss by changing the `accept_data_loss` parameter to `true`.
 
 Deleted corrupt marker corrupted_FzTSBSuxT7i3Tls_TgwEag from /var/lib/elasticsearchdata/nodes/0/indices/P45vf_YQRhqjfwLMUvSqDw/0/index/
 
 --------------------------------------------------
 
 When you use `elasticsearch-shard` to drop the corrupted data, the shard's
 allocation ID changes. After restarting the node, you must use the
-<<cluster-reroute,cluster reroute API>> to tell Elasticsearch to use the new
-ID. The `elasticsearch-shard` command shows the request that
-you need to submit.
+<<cluster-reroute,cluster reroute API>> to tell Elasticsearch to use the new ID.
+The `elasticsearch-shard` command shows the request that you need to submit.
 
 You can also use the `-h` option to get a list of all options and parameters
 that the `elasticsearch-shard` tool supports.
+
+Finally, you can use the `--truncate-clean-translog` option to truncate the
+shard's translog even if it does not appear to be corrupt.

server/src/main/java/org/elasticsearch/index/shard/RemoveCorruptedShardDataCommand.java

Lines changed: 13 additions & 5 deletions
@@ -80,6 +80,7 @@ public class RemoveCorruptedShardDataCommand extends EnvironmentAwareCommand {
     private final OptionSpec<String> folderOption;
     private final OptionSpec<String> indexNameOption;
     private final OptionSpec<Integer> shardIdOption;
+    static final String TRUNCATE_CLEAN_TRANSLOG_FLAG = "truncate-clean-translog";
 
     private final RemoveCorruptedLuceneSegmentsAction removeCorruptedLuceneSegmentsAction;
     private final TruncateTranslogAction truncateTranslogAction;
@@ -99,6 +100,8 @@ public RemoveCorruptedShardDataCommand() {
             .withRequiredArg()
             .ofType(Integer.class);
 
+        parser.accepts(TRUNCATE_CLEAN_TRANSLOG_FLAG, "Truncate the translog even if it is not corrupt");
+
         namedXContentRegistry = new NamedXContentRegistry(ClusterModule.getNamedXWriteables());
 
         removeCorruptedLuceneSegmentsAction = new RemoveCorruptedLuceneSegmentsAction();
@@ -320,8 +323,11 @@ public void write(int b) {
             terminal.println("");
 
             ////////// Translog
-            // as translog relies on data stored in an index commit - we have to have non unrecoverable index to truncate translog
-            if (indexCleanStatus.v1() != CleanStatus.UNRECOVERABLE) {
+            if (options.has(TRUNCATE_CLEAN_TRANSLOG_FLAG)) {
+                translogCleanStatus = Tuple.tuple(CleanStatus.OVERRIDDEN,
+                    "Translog was not analysed and will be truncated due to the --" + TRUNCATE_CLEAN_TRANSLOG_FLAG + " flag");
+            } else if (indexCleanStatus.v1() != CleanStatus.UNRECOVERABLE) {
+                // translog relies on data stored in an index commit so we have to have a recoverable index to check the translog
                 terminal.println("");
                 terminal.println("Opening translog at " + translogPath);
                 terminal.println("");
@@ -344,7 +350,8 @@ public void write(int b) {
             final CleanStatus translogStatus = translogCleanStatus.v1();
 
             if (indexStatus == CleanStatus.CLEAN && translogStatus == CleanStatus.CLEAN) {
-                throw new ElasticsearchException("Shard does not seem to be corrupted at " + shardPath.getDataPath());
+                throw new ElasticsearchException("Shard does not seem to be corrupted at " + shardPath.getDataPath()
+                    + " (pass --" + TRUNCATE_CLEAN_TRANSLOG_FLAG + " to truncate the translog anyway)");
             }
 
             if (indexStatus == CleanStatus.UNRECOVERABLE) {
@@ -493,7 +500,7 @@ private void printRerouteCommand(ShardPath shardPath, Terminal terminal, boolean
         terminal.println("");
         terminal.println("POST /_cluster/reroute\n" + Strings.toString(commands, true, true));
         terminal.println("");
-        terminal.println("You must accept the possibility of data loss by changing parameter `accept_data_loss` to `true`.");
+        terminal.println("You must accept the possibility of data loss by changing the `accept_data_loss` parameter to `true`.");
         terminal.println("");
     }
 
@@ -509,7 +516,8 @@ public enum CleanStatus {
         CLEAN("clean"),
         CLEAN_WITH_CORRUPTED_MARKER("marked corrupted, but no corruption detected"),
         CORRUPTED("corrupted"),
-        UNRECOVERABLE("corrupted and unrecoverable");
+        UNRECOVERABLE("corrupted and unrecoverable"),
+        OVERRIDDEN("to be truncated regardless of whether it is corrupt");
 
         private final String msg;
 

server/src/test/java/org/elasticsearch/index/shard/RemoveCorruptedShardDataCommandTests.java

Lines changed: 35 additions & 0 deletions
@@ -56,6 +56,8 @@
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;
 
+import static org.elasticsearch.index.shard.RemoveCorruptedShardDataCommand.TRUNCATE_CLEAN_TRANSLOG_FLAG;
+import static org.hamcrest.Matchers.allOf;
 import static org.hamcrest.Matchers.containsString;
 import static org.hamcrest.Matchers.either;
 import static org.hamcrest.Matchers.equalTo;
@@ -349,6 +351,39 @@ public void testResolveIndexDirectory() throws Exception {
             shardPath -> assertThat(shardPath.resolveIndex(), equalTo(indexPath)));
     }
 
+    public void testFailsOnCleanIndex() throws Exception {
+        indexDocs(indexShard, true);
+        closeShards(indexShard);
+
+        final RemoveCorruptedShardDataCommand command = new RemoveCorruptedShardDataCommand();
+        final MockTerminal t = new MockTerminal();
+        final OptionParser parser = command.getParser();
+
+        final OptionSet options = parser.parse("-d", translogPath.toString());
+        t.setVerbosity(Terminal.Verbosity.VERBOSE);
+        assertThat(expectThrows(ElasticsearchException.class, () -> command.execute(t, options, environment)).getMessage(),
+            allOf(containsString("Shard does not seem to be corrupted"), containsString("--" + TRUNCATE_CLEAN_TRANSLOG_FLAG)));
+        assertThat(t.getOutput(), containsString("Lucene index is clean"));
+        assertThat(t.getOutput(), containsString("Translog is clean"));
+    }
+
+    public void testTruncatesCleanTranslogIfRequested() throws Exception {
+        indexDocs(indexShard, true);
+        closeShards(indexShard);
+
+        final RemoveCorruptedShardDataCommand command = new RemoveCorruptedShardDataCommand();
+        final MockTerminal t = new MockTerminal();
+        final OptionParser parser = command.getParser();
+
+        final OptionSet options = parser.parse("-d", translogPath.toString(), "--" + TRUNCATE_CLEAN_TRANSLOG_FLAG);
+        t.addTextInput("y");
+        t.setVerbosity(Terminal.Verbosity.VERBOSE);
+        command.execute(t, options, environment);
+        assertThat(t.getOutput(), containsString("Lucene index is clean"));
+        assertThat(t.getOutput(), containsString("Translog was not analysed and will be truncated"));
+        assertThat(t.getOutput(), containsString("Creating new empty translog"));
+    }
+
     public void testCleanWithCorruptionMarker() throws Exception {
         // index some docs in several segments
         final int numDocs = indexDocs(indexShard, true);
