@xuanyuanking
Member

What changes were proposed in this pull request?

Initial implementation of RocksDBCheckpointMetadata. It persists the metadata for RocksDBFileManager.

Why are the changes needed?

The RocksDBCheckpointMetadata persists the metadata for each committed batch in JSON format. The object contains all RocksDB file names and the number of total keys.
The metadata binds closely with the directory structure of RocksDBFileManager, as described in the design doc - Directory Structure and Format for Files stored in DFS.

Does this PR introduce any user-facing change?

No. Internal implementation only.

How was this patch tested?

New UT added.

@SparkQA

SparkQA commented Apr 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42254/

@SparkQA

SparkQA commented Apr 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42254/

@SparkQA

SparkQA commented Apr 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42258/

@SparkQA

SparkQA commented Apr 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42258/

@SparkQA

SparkQA commented Apr 21, 2021

Test build #137727 has finished for PR 32272 at commit 7ce24ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • case class RocksDBCheckpointMetadata(
      • sealed trait RocksDBImmutableFile

@SparkQA

SparkQA commented Apr 21, 2021

Test build #137731 has finished for PR 32272 at commit 4b61526.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

cc @viirya

Comment on lines +43 to +44
// We turn this field into a null to avoid writing an empty logFiles field in the json.
val nullified = if (logFiles.isEmpty) this.copy(logFiles = null) else this
Member

Why do we need to avoid it?

Member Author

It's related to how RocksDB is used: we don't always have log files, but we must have SST files.

Contributor

I think the point here is excluding the empty field entirely (correct?) vs. leaving it in the output as []. Seems like a small optimization.

Member Author

Yes, the logFiles field does not always have a value.
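
To make the trade-off in this thread concrete, here is a minimal sketch of dropping an empty logFiles field from the JSON output instead of writing it as []. It deliberately does not reproduce the serializer code from the PR: FileEntry, CheckpointMetadata, MetadataJson, the dfsFileName field, and the hand-rolled JSON writer are all hypothetical stand-ins, and the field layout is an assumption made for illustration.

```scala
// Hypothetical stand-ins for the classes under review; the field names are assumptions.
final case class FileEntry(localFileName: String, dfsFileName: String, sizeBytes: Long)

final case class CheckpointMetadata(
    sstFiles: Seq[FileEntry],
    logFiles: Seq[FileEntry],
    numKeys: Long)

object MetadataJson {
  // Quotes a string for JSON, escaping backslashes and double quotes only.
  private def quote(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  private def fileJson(f: FileEntry): String =
    "{\"localFileName\":" + quote(f.localFileName) +
      ",\"dfsFileName\":" + quote(f.dfsFileName) +
      ",\"sizeBytes\":" + f.sizeBytes + "}"

  private def arrayJson(files: Seq[FileEntry]): String =
    files.map(fileJson).mkString("[", ",", "]")

  /** Compact form: the logFiles key is omitted entirely when there are no log files. */
  def compact(m: CheckpointMetadata): String = {
    val fields = Seq(
      Some("\"sstFiles\":" + arrayJson(m.sstFiles)),
      if (m.logFiles.isEmpty) None else Some("\"logFiles\":" + arrayJson(m.logFiles)),
      Some("\"numKeys\":" + m.numKeys)
    ).flatten
    fields.mkString("{", ",", "}")
  }
}

object MetadataJsonDemo {
  def main(args: Array[String]): Unit = {
    val sst = FileEntry("00001.sst", "00001-uuid.sst", 12L)
    // No log files: the output contains only sstFiles and numKeys.
    println(MetadataJson.compact(CheckpointMetadata(Seq(sst), Seq.empty, 5L)))
  }
}
```

In this sketch the omission is a size and readability choice only; a reader of the file would have to treat a missing logFiles key as an empty list.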


/**
* A RocksDBImmutableFile maintains a mapping between a local RocksDB file name and the name of
* its copy on DFS. Since these files are immutable, their DFS copies can be reused.
Member

Does it mean that a DFS copy can be mapped to more than one local file name?

Member

When do we reuse the DFS copies?

Member Author

Yes. It can be mapped to more than one local file, but for different tasks. The most common scenario is a task/stage retry.

Comment on lines +113 to +115
def isSameFile(otherFile: File): Boolean = {
  otherFile.getName == localFileName && otherFile.length() == sizeBytes
}
Member

If a DFS copy can be mapped to more than one local file name, shouldn't two local files be treated as the same file when their DFS file names are the same, even if their local file names differ?

Member Author

The DFS file name contains a UUID, so it shouldn't be the same. Normally we use the local file name to check whether the file already exists locally.
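
As an illustration of the answer above, the following is a rough sketch of the two ideas in play: the local match checks only the local file name and size (as in the isSameFile snippet under review), while each DFS copy gets a UUID-based name. ImmutableFileEntry, forLocalFile, and the exact DFS naming scheme are assumptions for this example and are not taken from the PR.

```scala
import java.io.File
import java.nio.file.Files
import java.util.UUID

// Hypothetical stand-in; only localFileName and sizeBytes appear in the reviewed snippet.
final case class ImmutableFileEntry(localFileName: String, dfsFileName: String, sizeBytes: Long) {
  // Same check as the reviewed method: match on local name and size only.
  def isSameFile(otherFile: File): Boolean =
    otherFile.getName == localFileName && otherFile.length() == sizeBytes
}

object ImmutableFileEntry {
  /** Builds an entry for a local file, giving the DFS copy a fresh UUID-based name (assumed scheme). */
  def forLocalFile(file: File): ImmutableFileEntry = {
    val name = file.getName
    val dot = name.lastIndexOf('.')
    val (base, ext) =
      if (dot > 0) (name.substring(0, dot), name.substring(dot)) else (name, "")
    ImmutableFileEntry(name, s"$base-${UUID.randomUUID()}$ext", file.length())
  }
}

object IsSameFileDemo {
  def main(args: Array[String]): Unit = {
    val dir = Files.createTempDirectory("rocksdb-demo").toFile
    val local = new File(dir, "000008.sst")
    Files.write(local.toPath, Array[Byte](1, 2, 3))

    val first = ImmutableFileEntry.forLocalFile(local)
    val second = ImmutableFileEntry.forLocalFile(local)

    println(first.isSameFile(local))                 // true: local name and size match
    println(first.dfsFileName != second.dfsFileName) // DFS names differ because of the UUID

    local.delete()
    dir.delete()
  }
}
```

Under these assumptions, two entries created from the same local file never share a DFS name, which is why the sketch compares local name and size rather than DFS names.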

@HeartSaVioR
Contributor

Sorry for getting to this late. I just went through the design doc and left some comments. It would be nice if we could resolve the comments on the design doc and reflect them in the current and following PRs. Thanks!

@xuanyuanking
Member Author

@HeartSaVioR Thanks for the advice. Comments have been resolved and yes, it makes sense to reflect them in the PRs. The current implementation of RocksDBCheckpointMetadata corresponds to the metadata file in ${batchId}.zip.

Contributor

@HeartSaVioR left a comment

Thanks for your efforts on this PR!

It looks OK overall, but there is some uncertainty in reviewing because there's no reference PR. In other words, we are reviewing methods without knowing how they will be used.

It would be nice to have a PR containing everything (it's OK if it falls out of sync later during review) so that reviewers could refer to it for the overall picture. I'm also OK with reviewing PRs one by one with some uncertainty (on faith) and revisiting all the changes in the last phase.

mapper.writeValueAsString(nullified)
}

def prettyJson: String = Serialization.writePretty(this)(RocksDBCheckpointMetadata.format)
Contributor

Would it produce the same output as json? This one doesn't handle the empty logFiles field. If not, is it intentional that json and prettyJson are handled differently?

Member Author

The only difference is the logFiles field. The prettyJson field provides a readable string for logging; the json field is for writing to files.

Contributor

OK I see where it is used. Just for logging - got it.

// scalastyle:on line.size.limit
}

private def withTempDirectory(f: File => Unit): Unit = {
Contributor

If I remember correctly, withTempDir is defined in SparkFunSuite so you can just leverage the method.

Member Author

Ah yes. Let me update.
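
For readers unfamiliar with this kind of helper, below is a minimal, self-contained sketch of the loan pattern that a temp-directory test helper typically follows. It is not the implementation from SparkFunSuite or from this PR; TempDirLoan and deleteRecursively are hypothetical names, and the cleanup strategy is an assumption.

```scala
import java.io.File
import java.nio.file.Files

object TempDirLoan {
  // Recursively deletes a file or directory; listFiles returns null for non-directories.
  private def deleteRecursively(file: File): Unit = {
    Option(file.listFiles()).foreach(_.foreach(deleteRecursively))
    file.delete()
  }

  /** Creates a temporary directory, lends it to `f`, and cleans it up even if `f` throws. */
  def withTempDir(f: File => Unit): Unit = {
    val dir = Files.createTempDirectory("test").toFile
    try f(dir) finally deleteRecursively(dir)
  }
}

object TempDirLoanDemo {
  def main(args: Array[String]): Unit = {
    TempDirLoan.withTempDir { dir =>
      val file = new File(dir, "metadata.json")
      Files.write(file.toPath, "{}".getBytes("UTF-8"))
      println(file.exists()) // true while the directory is on loan
    }
  }
}
```

Reusing a shared helper of this shape keeps the cleanup logic in one place instead of duplicating it in each test suite.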

@xuanyuanking
Member Author

there is some uncertainty in reviewing because there's no reference PR. In other words, we are reviewing methods without knowing how they will be used.
I'm also OK with reviewing PRs one by one with some uncertainty (on faith) and revisiting all the changes in the last phase.

Yes, agreed on both. I propose that, for now, we mark the uncertain methods and the ones without a caller in this PR. When I submit the reference PR, I can link these comments to the newly created PRs. That should help our review and make sure I don't forget to explain any uncertainty raised during the review.

@SparkQA

SparkQA commented May 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43027/

@SparkQA

SparkQA commented May 13, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43027/

@SparkQA

SparkQA commented May 13, 2021

Test build #138507 has finished for PR 32272 at commit 3b91a26.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@viirya left a comment

About the file name, RocksDBFileManager.scala doesn't contain any RocksDBFileManager. Shall we rename it?

Member

@viirya left a comment

Looks okay per this change. But as @HeartSaVioR said, I think we still need to look at how this is going to be used in the full context.

@xuanyuanking
Member Author

About the file name, RocksDBFileManager.scala doesn't contain any RocksDBFileManager. Shall we rename it?

All of this checkpointing metadata is for RocksDBFileManager. My plan is for the next PR to cover the save path of RocksDBFileManager.

I think we still need to look at how this is going to be used in the full context.

Agree, I plan to use this comment as a demo. In the next PR, I'll reference this comment to provide the full context.

@xuanyuanking
Member Author

To provide more context for the functions in this PR, I created the WIP PR (#32582) and referenced the comment there. Please check whether we can ship this for now. Thanks :) @HeartSaVioR @viirya

Contributor

@HeartSaVioR left a comment

+1
#32582 covers the uncertain things from my side.

I'd still need to wait for @viirya before signing-off to see whether #32582 covers the same for @viirya as well.

@viirya
Member

viirya commented May 20, 2021

Thanks @HeartSaVioR. I will take another look with #32582 tomorrow.

@viirya
Member

viirya commented May 22, 2021

Sorry for the delay. I will find some time over the weekend to look at this.

@xuanyuanking
Member Author

Sorry for the delay. I will find some time over the weekend to look at this.

No worries, thanks for the detailed review! Take your time.

@HeartSaVioR
Contributor

Looks like there's no further comment so I'm going to merge this once the test passes.

@HeartSaVioR
Contributor

retest this, please

@HeartSaVioR
Contributor

@xuanyuanking Could you please push an empty commit in case Jenkins doesn't work? Thanks in advance!

@xuanyuanking
Member Author

@HeartSaVioR Sure, thanks for the reminder.

@SparkQA

SparkQA commented May 27, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43535/

@SparkQA

SparkQA commented May 27, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43535/

@SparkQA

SparkQA commented May 27, 2021

Test build #139018 has finished for PR 32272 at commit f52adac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/139018/

@HeartSaVioR
Contributor

Jenkins passed. Thanks! Merging to master.

@xuanyuanking xuanyuanking deleted the SPARK-35172 branch May 28, 2021 03:09
@xuanyuanking
Member Author

Thanks for the review and help!
