HADOOP-19604. ABFS: BlockId generation based on blockCount along with full blob md5 computation change #7777
Conversation
Reviewer: Production code review. Will review the test code in a separate iteration.
(Several resolved review threads on AbfsBlobBlock.java and AbfsBlobClient.java.)
if (rawBlockId.length() < rawLength) {
  rawBlockId = String.format("%-" + rawLength + "s", rawBlockId)
      .replace(' ', '_');
} else if (rawBlockId.length() > rawLength) {
  rawBlockId = rawBlockId.substring(0, rawLength);
}
Reviewer: should we use ternary logic here?
Author: that would make readability a bit difficult.
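For illustration, the nested-ternary form the reviewer suggested would look roughly like the sketch below (the method and class names here are made up, not the PR's actual code); the if/else form was kept in the PR for readability:

```java
public class BlockIdPadSketch {
    // Nested-ternary equivalent of the if/else padding/truncation logic:
    // pad short ids with '_' to rawLength, truncate long ones, pass through
    // ids that are already the right length.
    static String normalize(String rawBlockId, int rawLength) {
        return rawBlockId.length() < rawLength
            ? String.format("%-" + rawLength + "s", rawBlockId).replace(' ', '_')
            : rawBlockId.length() > rawLength
                ? rawBlockId.substring(0, rawLength)
                : rawBlockId;
    }
}
```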
byte[] digest = null;
String fullBlobMd5 = null;
try {
  // Clone the MessageDigest to avoid resetting the original state
  MessageDigest clonedMd5 = (MessageDigest) getAbfsOutputStream().getFullBlobContentMd5().clone();
  digest = clonedMd5.digest();
} catch (CloneNotSupportedException e) {
  LOG.warn("Failed to clone MessageDigest instance", e);
}
if (digest != null && digest.length != 0) {
  fullBlobMd5 = Base64.getEncoder().encodeToString(digest);
}
Reviewer: Since this code is common to both the DFS and Blob ingress handler classes, maybe we can add it as a protected helper method in the abstract class AzureIngressHandler?
Author: taken
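A minimal sketch of the shared helper suggested above, assuming an illustrative name (computeFullBlobMd5) and class; the actual method added to AzureIngressHandler may differ:

```java
import java.security.MessageDigest;
import java.util.Base64;

public class Md5HelperSketch {

    // Returns the Base64-encoded MD5 of all bytes fed to the digest so far.
    // The digest is cloned first so the caller's running state is not reset;
    // returns null if the digest cannot be cloned or is empty.
    protected static String computeFullBlobMd5(MessageDigest fullBlobMd5) {
        byte[] digest = null;
        try {
            MessageDigest cloned = (MessageDigest) fullBlobMd5.clone();
            digest = cloned.digest();
        } catch (CloneNotSupportedException e) {
            // Fall through: the request is simply sent without an MD5 value.
        }
        return (digest == null || digest.length == 0)
            ? null
            : Base64.getEncoder().encodeToString(digest);
    }
}
```

Cloning matters because the output stream keeps feeding the same digest as more blocks are appended; calling digest() on the original would reset it mid-stream.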
 * @param blockCount the count of blocks to set
 */
- public void setBlockCount(final long blockCount) {
+ protected void setBlockCount(final long blockCount) {
Reviewer: why this change?
Author: The modifier level was incorrect earlier; corrected it.
Mockito.anyString(),
Mockito.nullable(ContextEncryptionAdapter.class),
- Mockito.any(TracingContext.class)
+ Mockito.any(TracingContext.class), Mockito.anyString()
Reviewer: should this be Mockito.nullable(String.class), since the md5 here can be null?
Author: taken
md = MessageDigest.getInstance(MD5);
} catch (NoSuchAlgorithmException e) {
  // MD5 algorithm not available; md will remain null
  // Log this in production code if needed
Reviewer: we can remove this line.
Author: taken
pos += appendWithOffsetHelper(os, client, path, data, fs, pos, ONE_MB);
pos += appendWithOffsetHelper(os, client, path, data, fs, pos, MB_2);
appendWithOffsetHelper(os, client, path, data, fs, pos, MB_4 - 1);
pos += appendWithOffsetHelper(os, client, path, data, fs, pos, 0, getMd5(data, 0, data.length));
Reviewer: nit: double spaces.
Author: taken
 * Gets the activeBlock and the blockId.
 *
 * @param outputStream AbfsOutputStream instance.
 * @param offset Used to generate blockId based on offset.
Reviewer: Add the newly added parameter to the method comment.
Author: taken
 * @param leaseId leaseId of the blob to be appended
 * @param isExpectHeaderEnabled true if the expect header is enabled
 * @param blobParams parameters specific to append operation on Blob Endpoint.
 * @param md5 The Base64-encoded MD5 hash of the block for data integrity validation.
Reviewer: NIT: the format can be corrected; there is an extra space before @param and after md5.
Author: taken
this.blockId = generateBlockId(offset);
this.blockIndex = blockIndex;
String streamId = outputStream.getStreamID();
UUID streamIdGuid = UUID.nameUUIDFromBytes(streamId.getBytes(StandardCharsets.UTF_8));
Reviewer: Can streamId be null? streamId.getBytes() can raise a NullPointerException; better to handle it.
Author: streamId can never be null, as it is set in the constructor of AbfsOutputStream itself: this.outputStreamId = createOutputStreamId();
requestHeaders.add(new AbfsHttpHeader(X_MS_LEASE_ID, requestParameters.getLeaseId()));
}
if (isChecksumValidationEnabled()) {
  addCheckSumHeaderForWrite(requestHeaders, requestParameters);
Reviewer: NIT: formatting required; there should be one tab at the start.
Author: taken
final String leaseId,
final ContextEncryptionAdapter contextEncryptionAdapter,
- final TracingContext tracingContext) throws AzureBlobFileSystemException {
+ final TracingContext tracingContext, String blobMd5) throws AzureBlobFileSystemException {
Reviewer: Add the new argument to the @param comments. Please make this change wherever required.
Author: taken
final String eTag,
ContextEncryptionAdapter contextEncryptionAdapter,
- final TracingContext tracingContext) throws AzureBlobFileSystemException {
+ final TracingContext tracingContext, String blobMd5) throws AzureBlobFileSystemException {
Reviewer: Same as above.
Author: taken
 * @param tracingContext for tracing the server calls.
 * @return executed rest operation containing response from server.
 * @throws AzureBlobFileSystemException if rest operation fails.
 * Flushes previously uploaded data to the specified path.
Reviewer: NIT: the format can be made consistent across places.
Author: taken
AbfsOutputStream out;
out = Mockito.spy(new AbfsOutputStream(
Reviewer: can we have this on the same line?
Author: taken
 * Test compatibility between ABFS client and WASB client.
 */
public class ITestWasbAbfsCompatibility extends AbstractAbfsIntegrationTest {
Reviewer: Nit: we can remove these spaces here.
Author: taken
return new String(Base64.encodeBase64(blockIdByteArray), StandardCharsets.UTF_8);
UUID streamIdGuid = UUID.nameUUIDFromBytes(streamId.getBytes(StandardCharsets.UTF_8));
long blockIndex = os.getBlockManager().getBlockCount();
String rawBlockId = String.format("%s-%06d", streamIdGuid, blockIndex);
Reviewer: we can use the constant BLOCK_ID_FORMAT.
Reviewer: Added review for the test code.
 * @return The Base64-encoded MD5 checksum of the specified data, or null if the digest is empty.
 * @throws IllegalArgumentException If the offset or length is invalid for the given byte array.
 */
public String getMd5(byte[] data, int off, int length) {
Reviewer: A similar method is present in the production code. Can we use that in the tests so that the production method is also covered in the test flow? If some issue is found later, someone might fix it only here while the production code remains buggy.
Author: makes sense, taken
 * @return String representing the block ID generated.
 */
- private String generateBlockId(AbfsOutputStream os, long position) {
+ private String generateBlockId(AbfsOutputStream os) {
Reviewer: Better to use the production code in tests as well; any issue in the code is better caught in production and fixed there itself.
Author: taken
(Resolved review threads on AzureIngressHandler.java, AzureDFSIngressHandler.java, AbfsBlobClient.java, and ITestWasbAbfsCompatibility.java.)
// Write
try (FSDataOutputStream nativeFsStream = abfs.create(path, true)) {
  nativeFsStream.write(TEST_CONTEXT.getBytes());
Reviewer: The scenario name says we need to write via ABFS, but it seems we are writing via WASB.
Author: The output stream name was confusing; corrected it.
Reviewer: Can we make this correction for all the tests? I still see some places where an output stream created via ABFS is still named native.
 * @return Total number of files and directories.
 * @throws IOException If an error occurs while accessing the file system.
 */
public static int listAllFilesAndDirs(FileSystem fs, Path path) throws IOException {
Reviewer: can we make it private?
Author: taken
Reviewer: +1. Thanks for the patch. Just one minor comment around variable naming in tests: the streams created by ABFS are named native, which sounds confusing. Please fix that for all the tests in the ITestWasbAbfsCompatibility class.
Reviewer: +1 LGTM
… full blob md5 computation change (apache#7777) Contributed by Anmol Asrani
Jira :- https://issues.apache.org/jira/browse/HADOOP-19604
Block ID computation needs to be consistent across clients for PutBlock and PutBlockList, so it now uses blockCount instead of offset.
Block IDs were previously derived from the data offset, which could lead to inconsistency across different clients. The change now uses blockCount (i.e., the index of the block) to compute the Block ID, ensuring deterministic and consistent ID generation for both PutBlock and PutBlockList operations across clients.
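The scheme described above can be sketched as follows. This is an illustrative reconstruction, not the PR's exact code: the fixed raw length (42 here) and the value of the BLOCK_ID_FORMAT constant are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.UUID;

public class BlockIdGenerationSketch {
    private static final int RAW_LENGTH = 42;               // assumed fixed length
    private static final String BLOCK_ID_FORMAT = "%s-%06d"; // assumed format

    // Derive a deterministic block id from the stream id and the block index
    // (blockCount), so any client producing the same block sequence generates
    // the same ids for both PutBlock and PutBlockList.
    public static String generateBlockId(String streamId, long blockCount) {
        UUID streamIdGuid =
            UUID.nameUUIDFromBytes(streamId.getBytes(StandardCharsets.UTF_8));
        String rawBlockId = String.format(BLOCK_ID_FORMAT, streamIdGuid, blockCount);
        // All block ids within a blob must decode to the same length:
        // pad short ids with '_', truncate long ones.
        if (rawBlockId.length() < RAW_LENGTH) {
            rawBlockId = String.format("%-" + RAW_LENGTH + "s", rawBlockId)
                .replace(' ', '_');
        } else if (rawBlockId.length() > RAW_LENGTH) {
            rawBlockId = rawBlockId.substring(0, RAW_LENGTH);
        }
        return Base64.getEncoder()
            .encodeToString(rawBlockId.getBytes(StandardCharsets.UTF_8));
    }
}
```

Because the id depends only on the stream id and the block index, not on the byte offset, two clients that replay the same sequence of blocks agree on every id regardless of individual block sizes.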
Restrict URL encoding of certain JSON metadata during setXAttr calls.
When setting extended attributes (xAttrs), the JSON metadata (hdi_permission) was previously URL-encoded, which could cause unnecessary escaping or compatibility issues. This change ensures that only required metadata are encoded.
Maintain the MD5 hash of the whole blob to validate data integrity during flush.
During flush operations, the MD5 hash of the entire blob's data is computed and stored. This hash is later used to validate that the blob was correctly persisted, ensuring data integrity and helping detect corruption or transmission errors.
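A small self-contained sketch of the property this relies on (not Hadoop code): a single MessageDigest updated once per appended block yields the same MD5 as hashing the whole payload at once, which is what makes flush-time validation of the full blob possible.

```java
import java.security.MessageDigest;
import java.util.Base64;

public class FullBlobMd5Demo {
    // Hash any number of chunks incrementally, then Base64-encode the digest.
    public static String md5Base64(byte[]... chunks) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        for (byte[] chunk : chunks) {
            md.update(chunk); // one update per appended block
        }
        return Base64.getEncoder().encodeToString(md.digest());
    }
}
```

The output stream can therefore keep one running digest across appends and compare the final Base64 value against the service-side checksum at flush time.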