[SPARK-35172][SS] The implementation of RocksDBCheckpointMetadata #32272
Conversation
Kubernetes integration test starting
Kubernetes integration test status failure
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #137727 has finished for PR 32272 at commit
Test build #137731 has finished for PR 32272 at commit
cc @viirya
```scala
// We turn this field into a null to avoid writing an empty logFiles field in the json.
val nullified = if (logFiles.isEmpty) this.copy(logFiles = null) else this
```
Why do we need to avoid it?
It's related to how RocksDB is used: we don't always have log files, but we must have SST files.
I think the point here is excluding the empty field (correct?) vs. leaving an empty field with []. Seems like a small optimization.
Yes, the logFiles field does not always have a value.
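To make the pattern concrete, here is a minimal, hypothetical sketch of the null-ification trick discussed above. The case class fields and the Jackson mapper configuration are simplified assumptions, not the PR's exact code:

```scala
import com.fasterxml.jackson.annotation.JsonInclude.Include
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Simplified stand-in for RocksDBCheckpointMetadata: sstFiles is always
// present, logFiles may be empty.
case class Metadata(sstFiles: Seq[String], logFiles: Seq[String], numKeys: Long) {
  def json: String = {
    val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
    // NON_ABSENT drops null fields from the output, so nullifying an empty
    // logFiles list removes the key entirely instead of writing "logFiles": [].
    mapper.setSerializationInclusion(Include.NON_ABSENT)
    val nullified = if (logFiles.isEmpty) copy(logFiles = null) else this
    mapper.writeValueAsString(nullified)
  }
}
```

With an empty list, the serialized string contains no logFiles key at all; with a non-empty list, the field is written normally.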
```scala
/**
 * A RocksDBImmutableFile maintains a mapping between a local RocksDB file name and the name of
 * its copy on DFS. Since these files are immutable, their DFS copies can be reused.
 */
```
Does it mean that a DFS copy can be mapped to more than one local file name?
When do we reuse the DFS copies?
Yes, it can be mapped to more than one local file, but for different tasks. The most common scenario is task/stage retry.
```scala
def isSameFile(otherFile: File): Boolean = {
  otherFile.getName == localFileName && otherFile.length() == sizeBytes
}
```
If a DFS copy can be mapped to more than one local file name, shouldn't two local files be considered the same file even when their local names differ, as long as their DFS file names are the same?
The DFS file name contains a UUID, so it shouldn't be the same. Normally we use the local file name to check whether the file already exists locally.
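As a sketch of the naming scheme described here — the UUID placement and the `fromLocalFile` helper are illustrative assumptions; only `isSameFile` mirrors the quoted code:

```scala
import java.io.File
import java.util.UUID

// Hypothetical sketch: each upload gives the DFS copy a UUID-based name, so
// retried tasks uploading the same local file never collide on DFS, while
// local-file identity is checked by name and size only.
case class RocksDBImmutableFile(localFileName: String, dfsFileName: String, sizeBytes: Long) {
  def isSameFile(otherFile: File): Boolean =
    otherFile.getName == localFileName && otherFile.length() == sizeBytes
}

object RocksDBImmutableFile {
  def fromLocalFile(localFile: File): RocksDBImmutableFile =
    RocksDBImmutableFile(
      localFile.getName,
      s"${UUID.randomUUID()}-${localFile.getName}", // unique DFS name per upload
      localFile.length())
}
```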
Sorry to get to this late. I just went through the design doc and left some comments. It would probably be good to resolve the comments on the design doc and reflect them in the current/following PRs. Thanks!
@HeartSaVioR Thanks for the advice. The comments have been resolved, and yes, it makes sense to reflect them in the PRs. The current implementation of RocksDBCheckpointMetadata covers the metadata files in the ${batchId}.zip.
HeartSaVioR left a comment
Thanks for your efforts on this PR!
It looks OK overall, but there is some uncertainty in reviewing, since there's no reference PR. In other words, we are reviewing methods without knowing how they will be used.
It would be nice to have a PR containing everything (it's OK for it to go out of sync during review) so that reviewers could refer to it for the overall view. I'm also OK to review PRs one by one with uncertainty (on faith) and revisit all changes at the last phase.
```scala
  mapper.writeValueAsString(nullified)
}

def prettyJson: String = Serialization.writePretty(this)(RocksDBCheckpointMetadata.format)
```
Would it produce the same output as json? This doesn't manipulate the empty logFiles field. Or is it intentional to handle json and prettyJson differently?
The only difference is the logFiles field. The prettyJson field provides a readable string for logging; the json field is for writing files.
OK I see where it is used. Just for logging - got it.
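A sketch of the asymmetry under discussion, assuming json4s for the pretty form (simplified field set; not the PR's exact code). Since prettyJson serializes the object as-is, an empty logFiles still appears as `[]`, which is acceptable for log output:

```scala
import org.json4s.{Formats, NoTypeHints}
import org.json4s.jackson.Serialization

case class Metadata(sstFiles: Seq[String], logFiles: Seq[String], numKeys: Long) {
  implicit val format: Formats = Serialization.formats(NoTypeHints)
  // Unlike json, this does not nullify the empty list first, so the output
  // keeps an empty "logFiles" : [ ] entry — fine for a human-readable log line.
  def prettyJson: String = Serialization.writePretty(this)
}
```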
```scala
  // scalastyle:on line.size.limit
}

private def withTempDirectory(f: File => Unit): Unit = {
```
If I remember correctly, withTempDir is defined in SparkFunSuite, so you can just leverage that method.
Ah yes. Let me update.
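For reference, a loan-pattern helper in the spirit of SparkFunSuite's withTempDir can be sketched as follows. This is a simplified assumption: the real helper deletes recursively via Spark's Utils, while this version only handles a flat directory:

```scala
import java.io.File
import java.nio.file.Files

// Loan pattern: create a temp directory, pass it to the test body, and
// clean it up afterwards even if the body throws.
def withTempDir(f: File => Unit): Unit = {
  val dir = Files.createTempDirectory("spark-test").toFile
  try f(dir)
  finally {
    // Best-effort cleanup of a flat directory; a real implementation
    // would delete recursively.
    Option(dir.listFiles()).foreach(_.foreach(_.delete()))
    dir.delete()
  }
}
```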
Yes, agreed on both. I propose that we mark the uncertain methods, or the ones without a caller side for now, in the PR. When I submit the reference PR, I can link the comments to the newly created PRs. That should help our review and make sure I don't miss explaining any uncertainty during the review.
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #138507 has finished for PR 32272 at commit
viirya left a comment
About the file name: RocksDBFileManager.scala doesn't contain any RocksDBFileManager. Shall we rename it?
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBFileManager.scala
viirya left a comment
Looks okay per this change. But as @HeartSaVioR said, I think we still need to look at how this is going to be used in the full context.
All this checkpointing metadata is for RocksDBFileManager. In my plan, the next PR covers the save path of RocksDBFileManager. Agreed; I plan to use this comment as a demo, and in the next PR I'll reference it to provide the full context.
To provide more context for the functions in this PR, I created the WIP PR (#32582) and referenced this comment there. Please check whether we can ship this for now. Thanks :) @HeartSaVioR @viirya
HeartSaVioR left a comment
Thanks @HeartSaVioR. I will take another look with #32582 tomorrow.
Sorry for the delay. I will find some time over the weekend to look at this.
No worries, thanks for the detailed review! Take your time.
Looks like there's no further comment, so I'm going to merge this once the tests pass.
retest this, please
@xuanyuanking Could you please push an empty commit in case Jenkins doesn't work? Thanks in advance!
@HeartSaVioR Sure, thanks for the reminder.
Kubernetes integration test starting
Kubernetes integration test status success
Test build #139018 has finished for PR 32272 at commit
Refer to this link for build results (access rights to CI server needed):
Jenkins passed. Thanks! Merging to master.
Thanks for the review and help!
What changes were proposed in this pull request?
Initial implementation of RocksDBCheckpointMetadata. It persists the metadata for RocksDBFileManager.
Why are the changes needed?
RocksDBCheckpointMetadata persists the metadata for each committed batch in JSON format. The object contains all RocksDB file names and the total number of keys.
The metadata binds closely with the directory structure of RocksDBFileManager, as described in the design doc - Directory Structure and Format for Files stored in DFS.
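For illustration, the per-batch metadata file might look roughly like this. The field names are taken from the code quoted in the review; the exact structure, DFS naming, and values are assumptions, and an empty logFiles field is omitted entirely, per the serialization discussion above:

```json
{
  "sstFiles" : [
    { "localFileName" : "000009.sst", "dfsFileName" : "<uuid>-000009.sst", "sizeBytes" : 92912 }
  ],
  "numKeys" : 54
}
```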
Does this PR introduce any user-facing change?
No. Internal implementation only.
How was this patch tested?
New UT added.