[SPARK-15698][SQL][Streaming] Add the ability to remove the old MetadataLog in FileStreamSource #13513
Conversation
Test build #60000 has finished for PR 13513 at commit
I'd move (was $compactInterval) to the end of the message.
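For example (hypothetical message text and variable names, showing the suggested ordering):

```scala
// Hypothetical validation message with the "(was ...)" parenthetical moved
// to the end of the message, as suggested.
def checkInterval(newInterval: Int, compactInterval: Int): Unit =
  require(newInterval > 0,
    s"Invalid new compact interval: $newInterval (was $compactInterval)")
```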
Jenkins, retest this please.
Test build #60060 has finished for PR 13513 at commit
Test build #60071 has finished for PR 13513 at commit
@jerryshao the approach seems good to me. Could you refactor the code to avoid copying code from FileStreamSinkLog? Duplicated code is hard to maintain.
Sure, I will change the code.
798c450 to c2aad87 (force-push)
Test build #65244 has finished for PR 13513 at commit
@zsxwing, thanks a lot for your comments; I did several refactorings:

Please help to review again, thanks a lot.
Test build #65245 has finished for PR 13513 at commit
Test build #65246 has finished for PR 13513 at commit
Just noticed that I think we need to store
You could just move the metadata deletion logic from FileStreamSinkLog into CompactibleFileStreamLog. Then FileStreamSource could issue DELETE log records for files that are older than
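Whatever the elided age threshold is, the suggested layering looks roughly like this toy sketch (not Spark's actual classes; cleanup-delay handling is omitted):

```scala
// Toy sketch of the suggested layering: compaction and expired-log deletion
// live once in the shared base class, and both logs inherit them.
abstract class CompactibleLog[T](compactInterval: Int) {
  protected val batches = scala.collection.mutable.LinkedHashMap[Long, Array[T]]()

  def isCompactionBatch(batchId: Long): Boolean =
    (batchId + 1) % compactInterval == 0

  // Drop per-batch files that the latest compaction batch already folded in.
  // (The real code would also wait for a cleanup delay before deleting.)
  def deleteExpiredLog(currentBatchId: Long): Unit = {
    val latestCompact = ((currentBatchId + 1) / compactInterval) * compactInterval - 1
    batches.keys.toList
      .filter(id => id < latestCompact && !isCompactionBatch(id))
      .foreach(batches.remove)
  }
}

// Both logs reuse the same cleanup machinery with their own entry types.
class SourceLog extends CompactibleLog[String](10)
class SinkLog extends CompactibleLog[Int](10)
```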
@zsxwing @frreiss thanks a lot for your comments. I think the semantics of

Yes, it could be slow to get a batch when it happens to be a compact batch. I think we could have two solutions:

Both solutions need extra work; what do you think?
Ah, now I fully understand @zsxwing's earlier comment about the semantics of

In the short run, I think that @jerryshao's changes here are ok with respect to
Sorry, replied to the wrong PR. Deleting.
@frreiss SPARK-17165 (#14728) uses
Could you make VERSION a constructor parameter, in order to support changing the source or sink format separately?
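A toy sketch of that suggestion (illustrative classes, not the PR's actual ones):

```scala
// Making the serialization version a constructor parameter lets the source
// and sink logs bump their on-disk formats independently.
abstract class VersionedLog(val metadataLogVersion: String) {
  def serialize(lines: Seq[String]): String =
    (metadataLogVersion +: lines).mkString("\n") // version header goes first
}

class SourceLog extends VersionedLog("v1")
class SinkLog extends VersionedLog("v1")
```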
@jerryshao here is a test case to show the issue about getBatch:

```scala
test("getBatch") {
  withTempDirs { case (src, tmp) =>
    withSQLConf(
      SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL.key -> "2",
      // Force deleting the old logs
      SQLConf.FILE_SOURCE_LOG_CLEANUP_DELAY.key -> "1"
    ) {
      val fileStream = createFileStream("text", src.getCanonicalPath)
      val filtered = fileStream.filter($"value" contains "keep")
      testStream(filtered)(
        AddTextFileData("keep1", src, tmp),
        CheckAnswer("keep1"),
        AddTextFileData("keep2", src, tmp),
        CheckAnswer("keep1", "keep2"),
        AddTextFileData("keep3", src, tmp),
        CheckAnswer("keep1", "keep2", "keep3"),
        AssertOnQuery("check getBatch") { execution: StreamExecution =>
          val _sources = PrivateMethod[Seq[Source]]('sources)
          val fileSource =
            (execution invokePrivate _sources()).head.asInstanceOf[FileStreamSource]
          assert(fileSource.getBatch(None, LongOffset(2)).as[String].collect() ===
            List("keep1", "keep2", "keep3"))
          assert(fileSource.getBatch(Some(LongOffset(0)), LongOffset(2)).as[String].collect() ===
            List("keep2", "keep3"))
          assert(fileSource.getBatch(Some(LongOffset(1)), LongOffset(2)).as[String].collect() ===
            List("keep3"))
        }
      )
    }
  }
}
```
Thanks a lot @zsxwing and @frreiss for your comments.

For the slow scan problem of the compact batch: originally I planned to not merge the latest batch, as I did before and as suggested above, but after several different tries it is hard to implement with small changes. So for now I still choose the same implementation, with a simple cache layer added to overcome this problem; the basic compaction algorithm is still the same as

For the problem of broken semantics: I realized that it is really a problem, but the current code didn't touch it. So I changed to scan the compacted batch files to retrieve missing batches. It is a little time-consuming, and the current logic of
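To make the scan idea concrete, a simplified sketch (hypothetical field and helper names; it assumes each log entry records the batch in which it was added):

```scala
// Simplified sketch: if each entry carries the batch id it was added in,
// a batch that was folded into a compact file can be recovered by filtering
// that file's entries.
object CompactScan {
  case class FileEntry(path: String, timestamp: Long, batchId: Long)

  def entriesForBatch(compacted: Seq[FileEntry], batchId: Long): Seq[FileEntry] =
    compacted.filter(_.batchId == batchId)
}
```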
Test build #65365 has finished for PR 13513 at commit
Test build #65368 has finished for PR 13513 at commit
zsxwing left a comment:
The cache idea looks good to me. For the current PR, I suggest using a new log class FileSourceLogEntry for the file source, with the following benefits:
- Avoid storing unnecessary info for file sink log.
- Avoid changing the format for file sink log.
- The code will be a bit cleaner.
drop doesn't change the original map:

```scala
scala> val m = scala.collection.mutable.LinkedHashMap[Int, Int]()
m: scala.collection.mutable.LinkedHashMap[Int,Int] = Map()

scala> m(2) = 1

scala> m
res1: scala.collection.mutable.LinkedHashMap[Int,Int] = Map(2 -> 1)

scala> m.drop(1)
res2: scala.collection.mutable.LinkedHashMap[Int,Int] = Map()

scala> m
res3: scala.collection.mutable.LinkedHashMap[Int,Int] = Map(2 -> 1)
```

I think it should be Java LinkedHashMap. This is an example:
spark/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala, line 45 (at 03d46aa):

```scala
private[ui] val batchTimeToOutputOpIdSparkJobIdPair =
```
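For reference, a minimal sketch of the eviction behavior that java.util.LinkedHashMap provides (hypothetical names, not the PR's code):

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

object BoundedCacheDemo extends App {
  // A size-bounded cache via java.util.LinkedHashMap: unlike Scala's
  // LinkedHashMap, it supports eviction through removeEldestEntry.
  def boundedCache[K, V](maxEntries: Int): JMap[K, V] =
    new JLinkedHashMap[K, V]() {
      override def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
        size() > maxEntries
    }

  val cache = boundedCache[Long, String](2)
  cache.put(1L, "batch-1")
  cache.put(2L, "batch-2")
  cache.put(3L, "batch-3") // inserting a third entry evicts the eldest (1L)
  println(cache.keySet())  // prints [2, 3]
}
```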
I see, sorry for this issue.
How about we use the following class for FileStreamSource?

```scala
case class FileSourceLogEntry(batchId: Long, entries: Seq[FileEntry])
```

I think this will make the code here simpler.
I think that here in the parent class CompactibleFileStreamLog we assume the metadata type is Array[T], which currently suits both the file source and sink logs. If we change to use FileSourceLogEntry, the base class type parameter would have to be T rather than Array[T], which would make the two subclasses diverge.
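A toy sketch of the shape constraint described above (illustrative element types, not Spark's actual ones):

```scala
// The shared base class assumes "a batch is an Array[T] of entries", which
// fits both logs today; a single wrapper value per batch (FileSourceLogEntry)
// would break that shared shape.
abstract class ArrayBatchedLog[T] {
  def add(batchId: Long, entries: Array[T]): Boolean
}

class SinkLog extends ArrayBatchedLog[Int] {
  override def add(batchId: Long, entries: Array[Int]): Boolean = true
}

class SourceLog extends ArrayBatchedLog[String] {
  override def add(batchId: Long, entries: Array[String]): Boolean = true
}
```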
Thanks for clarifying.
cb4194e to be1abfa (force-push)
Test build #65547 has finished for PR 13513 at commit
```scala
override def add(batchId: Long, logs: Array[FileEntry]): Boolean = {
  if (super.add(batchId, logs) && isCompactionBatch(batchId, compactInterval)) {
```
This is wrong. If super.add(batchId, logs) is false, then we should always return false.
yes, you're right, I will fix it.
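For illustration, the corrected control flow looks roughly like this (toy classes, not Spark's; the point is that super.add's result is propagated unconditionally):

```scala
// Minimal sketch of the fix: a failed underlying add is never reported as
// success, regardless of whether this is a compaction batch.
class BaseLog {
  def add(batchId: Long, logs: Array[String]): Boolean = batchId >= 0
}

class CompactLog(compactInterval: Int) extends BaseLog {
  private def isCompactionBatch(batchId: Long): Boolean =
    (batchId + 1) % compactInterval == 0

  override def add(batchId: Long, logs: Array[String]): Boolean = {
    if (super.add(batchId, logs)) {
      if (isCompactionBatch(batchId)) {
        // compaction-specific bookkeeping would go here
      }
      true
    } else {
      false // always surface a failed add
    }
  }
}
```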
zsxwing left a comment:
Overall LGTM. Just one minor issue.
Test build #65628 has finished for PR 13513 at commit
Test build #65631 has finished for PR 13513 at commit
LGTM. Thanks! Merging to master and 2.0.
```diff
 * old files while another one keeps retrying. Setting a reasonable cleanup delay could avoid it.
 */
- private val fileCleanupDelayMs = sparkSession.sessionState.conf.fileSinkLogCleanupDelay
+ protected override val fileCleanupDelayMs =
```
I just noticed some conflicts here. Could you submit a follow-up PR to use the previous sparkSession.sessionState.conf.fileSinkLogCleanupDelay? Same for the other confs. This only exists in the master branch, so we don't need to fix branch-2.0.
Oh, sorry about it, will fix it now.
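For illustration, the requested follow-up amounts to restoring the conf lookup from the removed line in the diff above (a fragment, not the exact patch):

```scala
protected override val fileCleanupDelayMs =
  sparkSession.sessionState.conf.fileSinkLogCleanupDelay
```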
[SPARK-15698][SQL][Streaming] Add the ability to remove the old MetadataLog in FileStreamSource (branch-2.0)

What changes were proposed in this pull request?
Backport #13513 to branch 2.0.

How was this patch tested?
Jenkins

Author: jerryshao <[email protected]>
Closes #15163 from zsxwing/SPARK-15698-spark-2.0.
What changes were proposed in this pull request?
Currently, metadataLog in FileStreamSource adds a checkpoint file for each batch but has no ability to remove/compact them, which leads to a large number of small files when running for a long time. So here I propose to compact the old logs into one file. This method is quite similar to FileStreamSinkLog but simpler.
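To make the proposal concrete, here is a toy, self-contained illustration of the compaction scheme (not Spark's implementation; names are simplified, and the real code delays deletion by a cleanup interval):

```scala
// Every compactInterval-th batch is written as a "compact" batch that folds
// in all earlier entries, so the older per-batch files can be deleted and
// the number of small metadata files stays bounded.
object CompactionDemo extends App {
  val compactInterval = 3

  def isCompactionBatch(batchId: Long): Boolean =
    (batchId + 1) % compactInterval == 0

  // batchId -> entries written for that batch
  val log = scala.collection.mutable.LinkedHashMap[Long, Seq[String]]()

  def add(batchId: Long, entries: Seq[String]): Unit = {
    if (isCompactionBatch(batchId)) {
      // Fold all surviving entries plus the new ones into one compact batch.
      val all = log.values.flatten.toSeq ++ entries
      log.clear()
      log(batchId) = all
    } else {
      log(batchId) = entries
    }
  }

  (0L to 5L).foreach(b => add(b, Seq(s"file-$b")))
  println(log) // Map(5 -> List(file-0, ..., file-5)): one compact file left
}
```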
How was this patch tested?

Unit test added.