- 
                Notifications
    You must be signed in to change notification settings 
- Fork 3.4k
HBASE-22627 Port HBASE-22617 (Recovered WAL directories not getting cleaned up) to branch-1 #339
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
9ed92f3    to
    ced4487      
    Compare
  
    | 💔 -1 overall 
 This message was automatically generated. | 
| 💔 -1 overall 
 This message was automatically generated. | 
ae1d198    to
    fc7dc76      
    Compare
  
    | Updated patch fixes javadoc and checkstyle warn from precommit, also a compile problem not caught by my IDE (unclean workspace I guess) | 
| I missed the original patch, but let me review this here (and try to remember why this was done originally). | 
| @z-york This fixes an issue where a path component to the wal dir is missing after HBASE-20734, so we get files under /<root>/<namespace> instead of /<root>/data/<namespace> with the default config, and the erroneous location is not expected by the hfile archiver or table delete procedure, so they are never cleaned up. We had a production incident where we hit the HDFS directory 1 million entry limit and couldn't online regions. | 
| * Create the region splits directory. | ||
| */ | ||
| void createSplitsDir() throws IOException { | ||
| void createSplitsDir(HRegionInfo daughterA, HRegionInfo daughterB) throws IOException { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is unclear to me how this is related to the rest of the changes. Could you provide some insight?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Apache9 included this in the branch-2/master patch, including it here in the backport too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this fixed an issue with a hbck test so precommit for the branch-2 version could be green. The daughter directories would not always be created. While technically an unrelated change we allowed it there and should allow it here. Or do an addendum to break it out into a separate commit (here and in branch-2 and up).
| // Delete the directory on wal filesystem | ||
| FileSystem walFs = mfs.getWALFileSystem(); | ||
| Path tableWALDir = FSUtils.getWALTableDir(env.getMasterConfiguration(), tableName); | ||
| if (walFs.exists(tableWALDir) && !walFs.delete(tableWALDir, true)) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this have any unintended side-effects(rollbacks?)? Will this eventually be cleaned up by the cleaners?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are no rollback provisions for an IO exception late in the delete procedure, that is true. It's specific to branch-1 because branch-2 and up uses the procedure state machine to manage this better. Perhaps we could look at this in a follow up issue.
I added this at the very tail of the procedure so if there is an IO exception here at least it did not interfere with the deletion of the table directories. However if there is an IO exception deleting the table directories, we don't reach this code. We have this concern with the master procedures in branch-1 in a number of places.
| 💔 -1 overall 
 
 
 This message was automatically generated. | 
| +1 from me. It looks like the test failure was due to timeout. Can you run it locally to verify before committing? | 
| 💔 -1 overall 
 
 
 This message was automatically generated. | 
fc7dc76    to
    cd946e8      
    Compare
  
    | TestReplicationDroppedTables timeout could be related, but it is a known flake and doesn't seem healthy in latest branch-1 when I run it locally. I think it should be removed. I filed HBASE-22629 for followup. Pushed an update to fix the reported checkstyle and javadoc issues. | 
cd946e8    to
    0c297b2      
    Compare
  
    …leaned up) to branch-1 HBASE-22617 Recovered WAL directories not getting cleaned up (Duo Zhang)
| Latest patch drops the parts of the branch-2+ patch which automatically clean up the WAL dirs. To be conservative I worry about the potential impact to replication. We will still delete the wrong region dir upon open after replay. Only recovered edits files should have been written there. | 
| 💔 -1 overall 
 
 
 This message was automatically generated. | 
| TestVisibilityLabelsWithACL failure is unrelated. | 
| Thanks @z-york , merging now | 
| 💔 -1 overall 
 
 
 This message was automatically generated. | 
| 
 It should be safe as you said, only recovered edits will be rewritten there, and replication will not read recovered edits. | 
…leaned up) to branch-1 (#339) HBASE-22617 Recovered WAL directories not getting cleaned up (Duo Zhang) Signed-off-by: Zach York <[email protected]>
…leaned up) to branch-1 (#339) HBASE-22617 Recovered WAL directories not getting cleaned up (Duo Zhang) Signed-off-by: Zach York <[email protected]>
HBASE-22617 Recovered WAL directories not getting cleaned up (Duo Zhang)