- 
                Notifications
    You must be signed in to change notification settings 
- Fork 9.1k
HADOOP-18596. Distcp -update to use modification time while checking for file skip. #5308
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| 🎊 +1 overall 
 
 This message was automatically generated. | 
| if (sameLength && (source.getLen() > 0) && sameBlockSize && | ||
| source.getModificationTime() < target.getModificationTime()) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the addition of the getLen() > 0? We want to always copy if its an empty file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I actually had to add a check of if the file size is 0 to skip it every time before this check, forgot to add it in this version locally 😅. Good catch.
| 🎊 +1 overall 
 
 This message was automatically generated. | 
| 🎊 +1 overall 
 
 This message was automatically generated. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commented, mainly on the test.
I am wondering if we should actually provide a constant allowing a cool that to declare an offset to somehow allow for the source/dest time to be slipped a bit. Either via getTimeDuration() or with a signed integer in seconds for more complex settings (why?)
Then a caller could set some offset value like "1h" and then all files whose source time is up to 1h older than the dest timestamp can be offset. This could compensate for clock drift.
        
          
                ...hadoop-distcp/src/test/java/org/apache/hadoop/tools/contract/AbstractContractDistCpTest.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpConstants.java
          
            Show resolved
            Hide resolved
        
              
          
                ...hadoop-distcp/src/test/java/org/apache/hadoop/tools/contract/AbstractContractDistCpTest.java
          
            Show resolved
            Hide resolved
        
      | 🎊 +1 overall 
 
 This message was automatically generated. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- happy with the test changes
- nit about an import
- i do think we need to update the markdown distcp doc to explain this feature and how to turn off
        
          
                ...hadoop-distcp/src/test/java/org/apache/hadoop/tools/contract/AbstractContractDistCpTest.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                ...hadoop-distcp/src/test/java/org/apache/hadoop/tools/contract/AbstractContractDistCpTest.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                ...hadoop-distcp/src/test/java/org/apache/hadoop/tools/contract/AbstractContractDistCpTest.java
          
            Show resolved
            Hide resolved
        
      | 🎊 +1 overall 
 
 This message was automatically generated. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
patch is good, but the docs made me realise that the timestamp is also used for checksum-based validation. I don't think that should change from what is offered today.
|  | ||
| * `distcp.update.modification.time` can be used alongside the checksum check | ||
| in stores with same checksum algorithm as well. if set to true we check | ||
| both modification time and checksum between the files, but if this property | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
really? I think if checksums are matching then timestamps shouldn't be compared at all. If two files' checksums match, that is sufficient to say "they are the same"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timestamps are only used alongside checksums if we have set the config to true, else we would follow the default way that is offered today(So, we can switch off in cases where we know checksums would work).
Since S3A/ABFS has checksums disabled we are returned null for the checksum value, we'll always see true for that case, but it can be true for cases where the checksums actually are identical too, so if we rely on checksum check to be true and then don't compare the timestamp, that can give false skips.
So, should we check the timestamps inside of the checksum check instead? Like if the checksums for both source and target are not null and if we have the property set to true then do the mod time check? This would add few more changes as we would need to change the params inside different classes to pass the config value as well.
We can always have the default value as false and use the property in the cases we want as well to keep the default way as the one offered today too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. and the default option is "don't use checksums". as i was thinking if we would want to have this on automatically if you are on -skipCrc or the formats are incompatible.
but if we leave it something to explicitly ask for, your code looks right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think i see your reasoning now. but as check is set as a default, behaviour is changing even on stores with checksums.
we could just say "this comes with -skipCrc", but as distcp switches to that if either of the stores has no checksum algorithm, then i think we should want to be consistent.
but if both stores have checksums, that checksum test should be all that is used -so we are consistent with big distcp jobs today.
        
          
                hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpConstants.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
      |  | ||
| * `distcp.update.modification.time` can be used alongside the checksum check | ||
| in stores with same checksum algorithm as well. if set to true we check | ||
| both modification time and checksum between the files, but if this property | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. and the default option is "don't use checksums". as i was thinking if we would want to have this on automatically if you are on -skipCrc or the formats are incompatible.
but if we leave it something to explicitly ask for, your code looks right
        
          
                hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
      | 🎊 +1 overall 
 
 This message was automatically generated. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couple of queries about production code; minor test change
| private boolean maybeUseModTimeToCompare( | ||
| CopyListingFileStatus source, FileStatus target) { | ||
| if (useModTimeToUpdate) { | ||
| return source.getModificationTime() < target.getModificationTime(); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be <= ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, good point.
just thinking if there would ever be a scenario when the source file is updated at the same time as it is synced to a different store, so we can have "=" to skip the copy...
| checksumComparison = checksumsAreEqual(sourceFS, source, sourceChecksum, | ||
| targetFS, target, srcLen); | ||
| // If Checksum comparison is false set it to false, else set to true. | ||
| boolean checksumResult = !checksumComparison.equals(CopyMapper.ChecksumComparison.FALSE); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this outcome right. as L632 should be reached for any outcome other than True.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll be setting "checksumResult" to be true for both "INCOMPATIBLE" and "TRUE" result from checksumsAreEqual() method else false and go through L632, so, we would be following the same flow as before since incompatible result from this method was true earlier too.
| CopyListingFileStatus sourceCurrStatus = | ||
| new CopyListingFileStatus(fs.getFileStatus(sourcePath)); | ||
| Assert.assertFalse(DistCpUtils.checksumsAreEqual( | ||
| Assert.assertFalse(!DistCpUtils.checksumsAreEqual( | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to assertEquals here
| Have made the changes @steveloughran suggested including changing "<" to "<=". Feel like we can have both strictly greater or greater equals for the check, the latter we would be taking a slight risk that the source file may have changed at the same time the last sync took place and we would be skipping the copy in that case, and the former in which we can have an additional copy even if there's no content changed but the mod time is same for both source and target. Shouldn't we prioritize accuracy here? | 
| 🎊 +1 overall 
 
 This message was automatically generated. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one change to test code; flip the expected/equal values
| fs, new Path(sourceBase + srcFilename), null, | ||
| fs, new Path(targetBase + srcFilename), | ||
| sourceCurrStatus.getLen()), | ||
| CopyMapper.ChecksumComparison.FALSE); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good test, but you need to put the expected value first, so that assertEquals prints the right "expected 1 actual 2" message. bit of PITA
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ooh, this is an old test, I changed the assertFalse to asserEquals but didn't realize the mistake I made. Thanks.
| 🎊 +1 overall 
 
 This message was automatically generated. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
+1
| ok, you can backport to 3.3, but not to the 3.3.5 branch | 
…for file skip. (apache#5308) Adding toggleable support for modification time during distcp -update between two stores with incompatible checksum comparison. Contributed by: Mehakmeet Singh <[email protected]>
…for file skip. (apache#5308) Adding toggleable support for modification time during distcp -update between two stores with incompatible checksum comparison. Contributed by: Mehakmeet Singh <[email protected]>
Description of PR
Using modification time as a way to add more checks to determine if distcp -update should skip a file or not.
In specific cases like the same file name, and size but different content we used to incorrectly skip files in update since there is no checksum comparison between object stores with different algorithm for it, to mitigate this we introduce comparing modification time between the target file and the source.
How was this patch tested?
Manually tested on an environment after reproducing the scenario where we might incorrectly skip a file.
Added a test in
AbstractContractDistCpTest.javato test by changing the target file's modification time to emulate the scenario.Tested on S3A(ap-south-1), ABFS(us-west-2), and LocalFS, and the test was successful.
For code changes:
LICENSE,LICENSE-binary,NOTICE-binaryfiles?