
Commit 6943732

Allow truncation of clean translog (#47866)
Today the `elasticsearch-shard remove-corrupted-data` tool will only truncate a translog it determines to be corrupt. However, there may be other cases in which it is desirable to truncate the translog, for instance if an operation in the translog cannot be replayed for some reason other than corruption. This commit adds a `--truncate-clean-translog` option to skip the corruption check on the translog and blindly truncate it.
1 parent ef9fe6c commit 6943732


3 files changed, with 79 additions and 20 deletions.


docs/reference/commands/shard-tool.asciidoc

Lines changed: 31 additions & 15 deletions
@@ -1,22 +1,36 @@
 [[shard-tool]]
 == elasticsearch-shard

-In some cases the Lucene index or translog of a shard copy can become
-corrupted. The `elasticsearch-shard` command enables you to remove corrupted
-parts of the shard if a good copy of the shard cannot be recovered
-automatically or restored from backup.
+In some cases the Lucene index or translog of a shard copy can become corrupted.
+The `elasticsearch-shard` command enables you to remove corrupted parts of the
+shard if a good copy of the shard cannot be recovered automatically or restored
+from backup.

 [WARNING]
 You will lose the corrupted data when you run `elasticsearch-shard`. This tool
 should only be used as a last resort if there is no way to recover from another
 copy of the shard or restore a snapshot.

-When Elasticsearch detects that a shard's data is corrupted, it fails that
-shard copy and refuses to use it. Under normal conditions, the shard is
-automatically recovered from another copy. If no good copy of the shard is
-available and you cannot restore from backup, you can use `elasticsearch-shard`
-to remove the corrupted data and restore access to any remaining data in
-unaffected segments.
+[float]
+=== Synopsis
+
+[source,shell]
+--------------------------------------------------
+bin/elasticsearch-shard remove-corrupted-data
+([--index <Index>] [--shard-id <ShardId>] | [--dir <IndexPath>])
+[--truncate-clean-translog]
+[-E <KeyValuePair>]
+[-h, --help] ([-s, --silent] | [-v, --verbose])
+--------------------------------------------------
+
+[float]
+=== Description
+
+When {es} detects that a shard's data is corrupted, it fails that shard copy and
+refuses to use it. Under normal conditions, the shard is automatically recovered
+from another copy. If no good copy of the shard is available and you cannot
+restore one from a snapshot, you can use `elasticsearch-shard` to remove the
+corrupted data and restore access to any remaining data in unaffected segments.

 [WARNING]
 Stop Elasticsearch before running `elasticsearch-shard`.
@@ -31,7 +45,7 @@ There are two ways to specify the path:
 translog files.

 [float]
-=== Removing corrupted data
+==== Removing corrupted data

 `elasticsearch-shard` analyses the shard copy and provides an overview of the
 corruption found. To proceed you must then confirm that you want to remove the
@@ -91,17 +105,19 @@ POST /_cluster/reroute
 ]
 }

-You must accept the possibility of data loss by changing parameter `accept_data_loss` to `true`.
+You must accept the possibility of data loss by changing the `accept_data_loss` parameter to `true`.

 Deleted corrupt marker corrupted_FzTSBSuxT7i3Tls_TgwEag from /var/lib/elasticsearchdata/indices/P45vf_YQRhqjfwLMUvSqDw/0/index/

 --------------------------------------------------

 When you use `elasticsearch-shard` to drop the corrupted data, the shard's
 allocation ID changes. After restarting the node, you must use the
-<<cluster-reroute,cluster reroute API>> to tell Elasticsearch to use the new
-ID. The `elasticsearch-shard` command shows the request that
-you need to submit.
+<<cluster-reroute,cluster reroute API>> to tell Elasticsearch to use the new ID.
+The `elasticsearch-shard` command shows the request that you need to submit.

 You can also use the `-h` option to get a list of all options and parameters
 that the `elasticsearch-shard` tool supports.
+
+Finally, you can use the `--truncate-clean-translog` option to truncate the
+shard's translog even if it does not appear to be corrupt.
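As a sketch of how the new option fits into an invocation, the Java helper below assembles the argument list for `remove-corrupted-data` using `--dir`, following the synopsis added in this commit. The class `ShardToolArgs`, its `buildArgs` method and the example path are hypothetical. The helper only builds the list and does not run the tool, which the docs say must only be run while Elasticsearch is stopped.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that assembles arguments for elasticsearch-shard (sketch only). */
final class ShardToolArgs {

    // Builds: bin/elasticsearch-shard remove-corrupted-data --dir <path> [--truncate-clean-translog]
    static List<String> buildArgs(String shardPath, boolean truncateCleanTranslog) {
        List<String> cmd = new ArrayList<>();
        cmd.add("bin/elasticsearch-shard");
        cmd.add("remove-corrupted-data");
        cmd.add("--dir");
        cmd.add(shardPath);
        if (truncateCleanTranslog) {
            // Skip the translog corruption check and truncate it unconditionally.
            cmd.add("--truncate-clean-translog");
        }
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = buildArgs("/path/to/shard/translog", true);
        // The list could be handed to new ProcessBuilder(cmd).inheritIO().start() once the node is stopped.
        System.out.println(String.join(" ", cmd));
    }
}
```

Running `main` prints `bin/elasticsearch-shard remove-corrupted-data --dir /path/to/shard/translog --truncate-clean-translog`.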

server/src/main/java/org/elasticsearch/index/shard/RemoveCorruptedShardDataCommand.java

Lines changed: 13 additions & 5 deletions
@@ -84,6 +84,7 @@ public class RemoveCorruptedShardDataCommand extends EnvironmentAwareCommand {
 private final OptionSpec<String> folderOption;
 private final OptionSpec<String> indexNameOption;
 private final OptionSpec<Integer> shardIdOption;
+static final String TRUNCATE_CLEAN_TRANSLOG_FLAG = "truncate-clean-translog";

 private final RemoveCorruptedLuceneSegmentsAction removeCorruptedLuceneSegmentsAction;
 private final TruncateTranslogAction truncateTranslogAction;
@@ -103,6 +104,8 @@ public RemoveCorruptedShardDataCommand() {
 .withRequiredArg()
 .ofType(Integer.class);

+parser.accepts(TRUNCATE_CLEAN_TRANSLOG_FLAG, "Truncate the translog even if it is not corrupt");
+
 namedXContentRegistry = new NamedXContentRegistry(
 Stream.of(ClusterModule.getNamedXWriteables().stream(), IndicesModule.getNamedXContents().stream())
 .flatMap(Function.identity())
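Because `parser.accepts(...)` is called here without `withRequiredArg()`, `truncate-clean-translog` is a presence-only flag, and the command later tests it with `options.has(...)`. The real parser is jopt-simple's `OptionParser`; the dependency-free sketch below only mimics that presence-only behaviour. The `FlagSketch` class is hypothetical and is not the jopt-simple API. For simplicity it ignores short options such as `-d` and their values, and it rejects unregistered long options.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a presence-only flag parser; not the jopt-simple API. */
final class FlagSketch {
    static final String TRUNCATE_CLEAN_TRANSLOG_FLAG = "truncate-clean-translog";

    private final Set<String> accepted = new HashSet<>();

    // Register a long flag that takes no argument.
    void accepts(String name) {
        accepted.add(name);
    }

    // Returns the registered flags that appear as "--name"; other tokens (e.g. "-d <path>") are ignored.
    Set<String> parse(String... argv) {
        Set<String> present = new HashSet<>();
        for (String arg : argv) {
            if (arg.startsWith("--")) {
                String name = arg.substring(2);
                if (!accepted.contains(name)) {
                    throw new IllegalArgumentException("Unrecognized option: " + arg);
                }
                present.add(name);
            }
        }
        return present;
    }

    public static void main(String[] args) {
        FlagSketch parser = new FlagSketch();
        parser.accepts(TRUNCATE_CLEAN_TRANSLOG_FLAG);
        Set<String> options = parser.parse("-d", "/path/to/translog", "--" + TRUNCATE_CLEAN_TRANSLOG_FLAG);
        System.out.println(options.contains(TRUNCATE_CLEAN_TRANSLOG_FLAG)); // true
    }
}
```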
@@ -308,8 +311,11 @@ public void write(int b) {
 terminal.println("");

 ////////// Translog
-// as translog relies on data stored in an index commit - we have to have non unrecoverable index to truncate translog
-if (indexCleanStatus.v1() != CleanStatus.UNRECOVERABLE) {
+if (options.has(TRUNCATE_CLEAN_TRANSLOG_FLAG)) {
+translogCleanStatus = Tuple.tuple(CleanStatus.OVERRIDDEN,
+"Translog was not analysed and will be truncated due to the --" + TRUNCATE_CLEAN_TRANSLOG_FLAG + " flag");
+} else if (indexCleanStatus.v1() != CleanStatus.UNRECOVERABLE) {
+// translog relies on data stored in an index commit so we have to have a recoverable index to check the translog
 terminal.println("");
 terminal.println("Opening translog at " + translogPath);
 terminal.println("");
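The new translog branch can be read as a small decision: the flag is checked first, so it applies even when the Lucene index is unrecoverable, and only otherwise is the translog analysed. The plain-Java sketch below models that ordering. `TranslogDecision`, `translogStatus` and the `Supplier` standing in for the real translog analysis are hypothetical. An empty `Optional` stands for the case this hunk does not fill in (unrecoverable index, no flag), in which the translog is not examined.

```java
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical model of the translog branch added in this commit (sketch only). */
final class TranslogDecision {

    // Mirrors the constants of CleanStatus in the diff, without their messages.
    enum CleanStatus { CLEAN, CLEAN_WITH_CORRUPTED_MARKER, CORRUPTED, UNRECOVERABLE, OVERRIDDEN }

    /**
     * Returns the translog status, or Optional.empty() when the translog is not examined.
     *
     * @param truncateCleanTranslog whether --truncate-clean-translog was passed
     * @param indexStatus           outcome of the Lucene index check
     * @param translogCheck         stand-in for the real translog analysis, run only if needed
     */
    static Optional<CleanStatus> translogStatus(boolean truncateCleanTranslog,
                                                CleanStatus indexStatus,
                                                Supplier<CleanStatus> translogCheck) {
        if (truncateCleanTranslog) {
            // Checked first: the translog is not analysed and will be truncated unconditionally.
            return Optional.of(CleanStatus.OVERRIDDEN);
        } else if (indexStatus != CleanStatus.UNRECOVERABLE) {
            // The translog relies on an index commit, so it is only analysed if the index is recoverable.
            return Optional.of(translogCheck.get());
        }
        // Unrecoverable index and no flag: the translog is left unexamined.
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(translogStatus(true, CleanStatus.UNRECOVERABLE, () -> CleanStatus.CLEAN)); // Optional[OVERRIDDEN]
        System.out.println(translogStatus(false, CleanStatus.CLEAN, () -> CleanStatus.CLEAN));        // Optional[CLEAN]
    }
}
```

Passing the check as a `Supplier` makes the "not analysed" part observable: when the flag is set, the supplier is never called.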
@@ -332,7 +338,8 @@ public void write(int b) {
 final CleanStatus translogStatus = translogCleanStatus.v1();

 if (indexStatus == CleanStatus.CLEAN && translogStatus == CleanStatus.CLEAN) {
-throw new ElasticsearchException("Shard does not seem to be corrupted at " + shardPath.getDataPath());
+throw new ElasticsearchException("Shard does not seem to be corrupted at " + shardPath.getDataPath()
+    + " (pass --" + TRUNCATE_CLEAN_TRANSLOG_FLAG + " to truncate the translog anyway)");
 }

 if (indexStatus == CleanStatus.UNRECOVERABLE) {
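With the flag set, the translog status is `OVERRIDDEN` rather than `CLEAN`, so the "does not seem to be corrupted" guard no longer fires. Without the flag, the error message now points at it. Below is a plain-Java sketch of that guard. It uses `IllegalStateException` in place of `ElasticsearchException` so it runs without Elasticsearch on the classpath, and the class and method names are hypothetical.

```java
/** Hypothetical, dependency-free version of the "nothing to do" guard (sketch only). */
final class NothingToDoGuard {

    enum CleanStatus { CLEAN, CLEAN_WITH_CORRUPTED_MARKER, CORRUPTED, UNRECOVERABLE, OVERRIDDEN }

    static final String TRUNCATE_CLEAN_TRANSLOG_FLAG = "truncate-clean-translog";

    // Throws only when both parts are CLEAN; an OVERRIDDEN translog lets the command proceed.
    static void ensureSomethingToDo(CleanStatus indexStatus, CleanStatus translogStatus, String dataPath) {
        if (indexStatus == CleanStatus.CLEAN && translogStatus == CleanStatus.CLEAN) {
            throw new IllegalStateException("Shard does not seem to be corrupted at " + dataPath
                + " (pass --" + TRUNCATE_CLEAN_TRANSLOG_FLAG + " to truncate the translog anyway)");
        }
    }

    public static void main(String[] args) {
        // With the flag the translog status is OVERRIDDEN, so no exception is thrown.
        ensureSomethingToDo(CleanStatus.CLEAN, CleanStatus.OVERRIDDEN, "/path/to/shard");
        try {
            ensureSomethingToDo(CleanStatus.CLEAN, CleanStatus.CLEAN, "/path/to/shard");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```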
@@ -481,7 +488,7 @@ private void printRerouteCommand(ShardPath shardPath, Terminal terminal, boolean
 terminal.println("");
 terminal.println("POST /_cluster/reroute\n" + Strings.toString(commands, true, true));
 terminal.println("");
-terminal.println("You must accept the possibility of data loss by changing parameter `accept_data_loss` to `true`.");
+terminal.println("You must accept the possibility of data loss by changing the `accept_data_loss` parameter to `true`.");
 terminal.println("");
 }

@@ -497,7 +504,8 @@ public enum CleanStatus {
 CLEAN("clean"),
 CLEAN_WITH_CORRUPTED_MARKER("marked corrupted, but no corruption detected"),
 CORRUPTED("corrupted"),
-UNRECOVERABLE("corrupted and unrecoverable");
+UNRECOVERABLE("corrupted and unrecoverable"),
+OVERRIDDEN("to be truncated regardless of whether it is corrupt");

 private final String msg;
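Only the constants and the `msg` field of `CleanStatus` appear in this hunk. The stand-alone reconstruction below assumes a constructor that stores the message and adds a hypothetical `getMessage()` accessor. It illustrates why appending `OVERRIDDEN` also changes the terminator after `UNRECOVERABLE` from `;` to `,`.

```java
/** Reconstruction for illustration; the constructor and accessor are assumptions, not the real source. */
enum CleanStatusSketch {
    CLEAN("clean"),
    CLEAN_WITH_CORRUPTED_MARKER("marked corrupted, but no corruption detected"),
    CORRUPTED("corrupted"),
    // No longer the last constant, so it now ends with ',' instead of ';'.
    UNRECOVERABLE("corrupted and unrecoverable"),
    OVERRIDDEN("to be truncated regardless of whether it is corrupt");

    private final String msg;

    // Assumed constructor: stores the human-readable message for each constant.
    CleanStatusSketch(String msg) {
        this.msg = msg;
    }

    // Hypothetical accessor name.
    String getMessage() {
        return msg;
    }

    public static void main(String[] args) {
        for (CleanStatusSketch status : values()) {
            System.out.println(status + ": " + status.getMessage());
        }
    }
}
```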

server/src/test/java/org/elasticsearch/index/shard/RemoveCorruptedShardDataCommandTests.java

Lines changed: 35 additions & 0 deletions
@@ -64,6 +64,8 @@
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;

+import static org.elasticsearch.index.shard.RemoveCorruptedShardDataCommand.TRUNCATE_CLEAN_TRANSLOG_FLAG;
+import static org.hamcrest.Matchers.allOf;
 import static org.hamcrest.Matchers.containsString;
 import static org.hamcrest.Matchers.either;
 import static org.hamcrest.Matchers.equalTo;
@@ -373,6 +375,39 @@ public void testResolveIndexDirectory() throws Exception {
 shardPath -> assertThat(shardPath.resolveIndex(), equalTo(indexPath)));
 }

+public void testFailsOnCleanIndex() throws Exception {
+indexDocs(indexShard, true);
+closeShards(indexShard);
+
+final RemoveCorruptedShardDataCommand command = new RemoveCorruptedShardDataCommand();
+final MockTerminal t = new MockTerminal();
+final OptionParser parser = command.getParser();
+
+final OptionSet options = parser.parse("-d", translogPath.toString());
+t.setVerbosity(Terminal.Verbosity.VERBOSE);
+assertThat(expectThrows(ElasticsearchException.class, () -> command.execute(t, options, environment)).getMessage(),
+allOf(containsString("Shard does not seem to be corrupted"), containsString("--" + TRUNCATE_CLEAN_TRANSLOG_FLAG)));
+assertThat(t.getOutput(), containsString("Lucene index is clean"));
+assertThat(t.getOutput(), containsString("Translog is clean"));
+}
+
+public void testTruncatesCleanTranslogIfRequested() throws Exception {
+indexDocs(indexShard, true);
+closeShards(indexShard);
+
+final RemoveCorruptedShardDataCommand command = new RemoveCorruptedShardDataCommand();
+final MockTerminal t = new MockTerminal();
+final OptionParser parser = command.getParser();
+
+final OptionSet options = parser.parse("-d", translogPath.toString(), "--" + TRUNCATE_CLEAN_TRANSLOG_FLAG);
+t.addTextInput("y");
+t.setVerbosity(Terminal.Verbosity.VERBOSE);
+command.execute(t, options, environment);
+assertThat(t.getOutput(), containsString("Lucene index is clean"));
+assertThat(t.getOutput(), containsString("Translog was not analysed and will be truncated"));
+assertThat(t.getOutput(), containsString("Creating new empty translog"));
+}
+
 public void testCleanWithCorruptionMarker() throws Exception {
 // index some docs in several segments
 final int numDocs = indexDocs(indexShard, true);
