-
Notifications
You must be signed in to change notification settings - Fork 3.4k
HBASE-28882 Backup restores are broken if the backup has moved locations #6294
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
hbase-backup/src/main/java/org/apache/hadoop/hbase/backup/impl/BackupManifest.java
Show resolved
Hide resolved
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
7cdd9ee to
bcbd129
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
A question (before having had the time to go through your changes): |
|
Did your test also involve moving the backup location? For example, moving the backup from an S3 bucket in us-east-1 to us-east-2 prior to attempting the restore? Edit to clarify further: the simple backup case that you've described above definitely does work. Taking a backup and then restoring from that backup works well. But the more complicated case of:
Does not work because we stop using the specified location at this point in the restore internals. |
|
Converting to a draft PR while I investigate what looks like a related test failures (testBackupHistory) Edit: I made a precondition too strict for the variety of paths that might be passed into a BackupManifest constructor. Working on an update now, and will test more thoroughly locally |
No, my testcase was "can a fresh HBase cluster restore a backup".
This is where I'm confused. A fresh cluster won't have any record stored of where backups were previously stored. So why is this an issue for a cluster that created the now-moved backup? (To be clear: I believe you saying this is an issue, I just don't understand yet where it comes from.) |
👍
This is a good question that I don't have a certain answer to because the test that I'm running internally actually restores to the existing cluster in-place (it drops the existing table, and then restores from backup). But I believe it would still be a problem for a restore to a new cluster because we don't load the backup manifest from metadata on the HBase cluster — we load the backup manifest from its on disk persistence in the given backup. // load and set manifest field from file content
long len = subFile.getLen();
byte[] pbBytes = new byte[(int) len];
try (FSDataInputStream in = fs.open(subFile.getPath())) {
in.readFully(pbBytes);
} catch (IOException e) {
throw new BackupException(e.getMessage());
}
BackupProtos.BackupImage proto = null;
try {
proto = BackupProtos.BackupImage.parseFrom(pbBytes);
} catch (Exception e) {
throw new BackupException(e);
}For example:
This PR fixes this by intervening in step 5. After we parse the proto we'll hydrate the BackupImage (and its ancestors) with the backup root that was passed into the constructor, and that ensures that we stay aligned on one root |
bcbd129 to
78080c3
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
78080c3 to
8ba12e2
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
8ba12e2 to
2d6e9a8
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
| } | ||
| Preconditions.checkArgument(backupPath.getName().equals(backupId), | ||
| String.format("Backup path %s must end in backupId %s", backupPath, backupId)); | ||
| return new Path( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd simplify this to return backupPath.getParent()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea
|
Browsed the code, looks good to me. I now also get why this wasn't an issue for my tests when restoring a backup on a fresh cluster: the backup root hadn't changed in that case, so the paths in the backup manifests were still correct. We could go one step further though. The core issue is that the protofied(?) |
|
Good idea, I don't obviously see a reason to persist the root on the BackupImage either, and any subtraction in this code is a big addition imo. I'll take a look at cleaning that up |
|
@DieterDP-ng I've thought about it a little bit more, and I'm less sure about taking the path out of the BackupManifest. I like that the backups can be used, even if the original cluster and/or its backup system table has been lost. If we removed this metadata from the backup itself, and the system table were lost, then we'd put more burden on the operator to provide additional details/args |
I'm not seeing the detriment. What exactly would become harder if the path is no longer stored inside the manifest?
|
|
Yeah you're right, there wouldn't be a regression in usability — not sure I can even articulate my half-baked thought process from yesterday. Never mind |
hbase-backup/src/main/java/org/apache/hadoop/hbase/backup/BackupInfo.java
Show resolved
Hide resolved
|
Sorry that I've fallen off here. I'll try to get this fixed up for a re-review this week. |
177b8bf to
d70c8dd
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
Seems like a legitimate test failure, will dig in |
|
Ok I've fixed the test failure by ensuring that we persist the merged backup's manifest file when deleting the tmp directory. Also added some logging to help with debuggability, and I think it's worth just keeping that around. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
| if ( | ||
| fileName.indexOf(FSTableDescriptors.TABLEINFO_DIR) > 0 | ||
| || fileName.indexOf(HRegionFileSystem.REGION_INFO_FILE) > 0 | ||
| || fileName.indexOf(BackupManifest.MANIFEST_FILE_NAME) > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This becomes necessary because the manifest file would implicitly move to the tmp directory during the merge process, since it follows the application's working root dir rather than its original root dir. So, before this changeset, rewriting the manifest would always update the original root dir (where we want this eventually), but with this changeset rewriting the manifest would put it in the tmp dir. Since the tmp dir will be purged, we need to make sure we copy the manifest file out first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a test that asserts that we haven't lost any critical components of a backup image during processing it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not so specifically I suppose. But the larger tests do implicitly validate that — for example, I realized this change was necessary due to test failures
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I agree with this change, I can't find what was changed that previously put the manifest in the correct location. Just curious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, this was confusing to me too. Before we stripped the root directory from the BackupImage, rewriting the manifest would implicitly always use the original root directory. So this only worked because we were unable to relocate backups
ndimiduk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. One more ping toward @DieterDP-ng, just in case.
…ons (apache#6294) Co-authored-by: Ray Mattingly <[email protected]> Signed-off-by: Nick Dimiduk <[email protected]> Reviewed-by: Dieter De Paepe <[email protected]>
…ons (apache#6294) Co-authored-by: Ray Mattingly <[email protected]> Signed-off-by: Nick Dimiduk <[email protected]> Reviewed-by: Dieter De Paepe <[email protected]>
…ons (apache#6294) Co-authored-by: Ray Mattingly <[email protected]> Signed-off-by: Nick Dimiduk <[email protected]> Reviewed-by: Dieter De Paepe <[email protected]>
…ons (#6294) Co-authored-by: Ray Mattingly <[email protected]> Signed-off-by: Nick Dimiduk <[email protected]> Reviewed-by: Dieter De Paepe <[email protected]>
…ons (#6294) Co-authored-by: Ray Mattingly <[email protected]> Signed-off-by: Nick Dimiduk <[email protected]> Reviewed-by: Dieter De Paepe <[email protected]>
…ons (#6294) Co-authored-by: Ray Mattingly <[email protected]> Signed-off-by: Nick Dimiduk <[email protected]> Reviewed-by: Dieter De Paepe <[email protected]>
…ons (apache#6294) Co-authored-by: Ray Mattingly <[email protected]> Signed-off-by: Nick Dimiduk <[email protected]> Reviewed-by: Dieter De Paepe <[email protected]>
…ons (apache#6294) Co-authored-by: Ray Mattingly <[email protected]> Signed-off-by: Nick Dimiduk <[email protected]> Reviewed-by: Dieter De Paepe <[email protected]>
…ons (apache#6294) Co-authored-by: Ray Mattingly <[email protected]> Signed-off-by: Nick Dimiduk <[email protected]> Reviewed-by: Dieter De Paepe <[email protected]>
…ons (apache#6294) (#134) Reviewed-by: Dieter De Paepe <[email protected]> Signed-off-by: Nick Dimiduk <[email protected]> Co-authored-by: Ray Mattingly <[email protected]>
…ons (apache#6294) Co-authored-by: Ray Mattingly <[email protected]> Signed-off-by: Nick Dimiduk <[email protected]> Reviewed-by: Dieter De Paepe <[email protected]>
https://issues.apache.org/jira/browse/HBASE-28882
My company runs a few hundred HBase clusters. We want to take backups everyday in one public cloud region, and then use said cloud's native replication solution to "backup our backups" in a secondary region. This is how we plan for region-wide disaster recovery.
This system should work, but doesn't because of the way that BackupManifests are constructed.
Backing up a bit (no pun intended): when we replicate backups verbatim, the manifest file continues to point to the original backup root. This shouldn't matter, because when taking a restore one passes a RestoreRequest to the RestoreTablesClient — and this RestoreRequest includes a BackupRootDir field. This works as you would expect initially, but eventually we build a BackupManifest that fails to interpolate this provided root directory and, instead, falls back to what it finds on disk in the backup (which would point back to the primary backup location, even if reading a replicated backup).
To fix this, I'm proposing that we properly interpolate the request's root directory field when building BackupManifests.
I've added a unit test for the new behavior, and I've used this updated client to successfully run a table restore of a replicated backup in our test environment.
cc @ndimiduk @charlesconnell @aalhour @ksravista @DieterDP-ng