
Conversation

@jerryshao (Contributor)

What changes were proposed in this pull request?

The current metadata log in FileStreamSource adds a checkpoint file for each batch but has no ability to remove or compact old entries, which leads to a large number of small files when a query runs for a long time. This PR proposes to compact the old logs into one file. The method is quite similar to FileStreamSinkLog's, but simpler.
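
For reference, the compaction rule can be sketched as follows (a minimal sketch assuming a compactInterval setting; names mirror the PR but the code is illustrative, not the exact implementation):

    // Every compactInterval-th batch is a compaction batch whose log file
    // folds in the entries of all earlier batches.
    def isCompactionBatch(batchId: Long, compactInterval: Int): Boolean =
      (batchId + 1) % compactInterval == 0

    // E.g. with compactInterval = 10, batches 9, 19, 29, ... are compaction
    // batches, so at most compactInterval - 1 small files sit between compact files.
    assert(isCompactionBatch(9, 10) && !isCompactionBatch(10, 10))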

How was this patch tested?

Unit test added.


SparkQA commented Jun 5, 2016

Test build #60000 has finished for PR 13513 at commit 2ed1115.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FileStreamSourceLog(sparkSession: SparkSession, path: String)

Contributor:

I'd move (was $compactInterval) to the end of the message.

@jerryshao (Contributor Author)

Jenkins, retest this please.


SparkQA commented Jun 6, 2016

Test build #60060 has finished for PR 13513 at commit 2ed1115.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FileStreamSourceLog(sparkSession: SparkSession, path: String)


SparkQA commented Jun 6, 2016

Test build #60071 has finished for PR 13513 at commit 798c450.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao (Contributor Author)

@tdas @zsxwing, what are your thoughts on this PR? Thanks a lot.


zsxwing commented Sep 8, 2016

@jerryshao the approach seems good to me. Could you refactor the code to avoid copying from FileStreamSinkLog? Duplicated code is hard to maintain.

@jerryshao (Contributor Author)

Sure, I will change the code.


SparkQA commented Sep 12, 2016

Test build #65244 has finished for PR 13513 at commit c2aad87.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class FileEntry(path: String, timestamp: Timestamp, action: String = ADD_ACTION)
    • class FileStreamSourceLog(sparkSession: SparkSession, path: String)

@jerryshao (Contributor Author)

@zsxwing, thanks a lot for your comments. I did several refactorings:

  1. Abstracted and consolidated FileStreamSinkLog and FileStreamSourceLog; they now share the same code path for compaction (see the sketch below).
  2. Changed FileStreamSourceLog to use JSON instead of binary encoding, adding compatibility and flexibility for future extension.
  3. Improved the logic for fetching all metadata logs: if a compact log exists, only the compact log is scanned.

Please take another look, thanks a lot.
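
As a rough, self-contained sketch of refactoring 1 (an in-memory map stands in for HDFSMetadataLog; the real class signatures in the PR differ):

    import scala.collection.mutable

    // Illustrative stand-in for the shared compaction path; the real
    // CompactibleFileStreamLog extends HDFSMetadataLog and persists to HDFS.
    abstract class CompactibleLogSketch[T] {
      protected def compactInterval: Int

      private val store = mutable.SortedMap.empty[Long, Seq[T]] // batchId -> entries

      def isCompactionBatch(batchId: Long): Boolean =
        (batchId + 1) % compactInterval == 0

      def add(batchId: Long, logs: Seq[T]): Boolean = {
        val toWrite =
          if (isCompactionBatch(batchId)) {
            // Fold everything since the previous compaction batch (which
            // already subsumes its own predecessors) into one record.
            val from = math.max(0L, batchId - compactInterval)
            store.range(from, batchId).values.flatten.toSeq ++ logs
          } else {
            logs
          }
        store.put(batchId, toWrite).isEmpty // false if this batch already existed
      }

      def get(batchId: Long): Option[Seq[T]] = store.get(batchId)
    }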


SparkQA commented Sep 12, 2016

Test build #65245 has finished for PR 13513 at commit f179349.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class CompactibleFileStreamLog[T: ClassTag](


SparkQA commented Sep 12, 2016

Test build #65246 has finished for PR 13513 at commit 31340b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


zsxwing commented Sep 12, 2016

Just noticed that FileStreamSource.getBatch(start: Option[Offset], end: Offset) is broken in this PR. start could be an arbitrary offset.

I think we need to store the batchId together with its file paths in the metadata log. FileStreamSource.getBatch(start: Option[Offset], end: Offset) could be very slow when all batches are in the same file, because we would need to parse the whole file to get the mapping from batchId to files. However, in most cases FileStreamSource.getBatch only queries the latest batch, so if we don't compact the latest metadata file, we can make it pretty fast by reading one small file in most cases. When recovering from failure, the performance of FileStreamSource.getBatch doesn't really matter.
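
Concretely, this is essentially the FileEntry shape the PR ends up with (see the Sep 14 build summary below); the type alias and sentinel here are assumptions made to keep the sketch self-contained:

    object FileEntrySketch {
      type Timestamp = Long      // assumed alias for an epoch-millis timestamp
      val NOT_SET: Long = -1L    // assumed sentinel for "batch id unknown"

      case class FileEntry(path: String, timestamp: Timestamp, batchId: Long = NOT_SET)

      // With batch ids stored per entry, getBatch(start, end) can filter a
      // compacted log instead of replaying every per-batch file.
      def entriesInRange(all: Seq[FileEntry], start: Long, end: Long): Seq[FileEntry] =
        all.filter(e => e.batchId > start && e.batchId <= end)
    }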


frreiss commented Sep 12, 2016

You could just move the metadata deletion logic from FileStreamSinkLog into CompactibleFileStreamLog. Then FileStreamSource could issue DELETE log records for files that are older than FileStreamSource.lastPurgeTimestamp.
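
A minimal sketch of that idea, reusing the add/delete action strings in the style of the sink log (a FileEntry with an action field matches the earlier build summary; the lastPurgeTimestamp handling is elided):

    object DeleteAtCompactionSketch {
      val ADD_ACTION = "add"
      val DELETE_ACTION = "delete"
      case class FileEntry(path: String, timestamp: Long, action: String = ADD_ACTION)

      // At compaction time, entries with a matching DELETE record are dropped,
      // so purged files never carry over into the compacted log.
      def compactLogs(logs: Seq[FileEntry]): Seq[FileEntry] = {
        val deleted = logs.collect { case e if e.action == DELETE_ACTION => e.path }.toSet
        logs.filter(e => e.action == ADD_ACTION && !deleted.contains(e.path))
      }
    }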

@jerryshao (Contributor Author)

@zsxwing @frreiss thanks a lot for your comments.

I think the semantics of FileStreamSource.getBatch(start: Option[Offset], end: Offset) stay the same, since I overrode the get method in FileStreamSourceLog to filter out the compacted data.

Yes, it could be slow to get a batch that happens to be a compact batch. I think we have two possible solutions:

  1. Compact at the batch after the latest metadata file (as I did before); this helps most scenarios in FileStreamSource.
  2. Put this batch's own data at the beginning when compacting, so we don't need to scan the whole file to get this batch's metadata.

Both solutions need extra work; what do you think?


frreiss commented Sep 13, 2016

Ah, now I fully understand @zsxwing's earlier comment about the semantics of Source.getBatch(). Those semantics have a design flaw; see the email thread I started at http://apache-spark-developers-list.1001551.n3.nabble.com/Source-API-requires-unbounded-distributed-storage-tt18551.html. Basically, it's impossible to implement a Source to the written API spec without keeping unbounded state. I have an open PR to fix this problem at #14553.

In the short run, I think @jerryshao's changes here are OK with respect to Source.getBatch. The approach in this PR will work as long as the internal structure of the StreamExecution class doesn't change and as long as Spark does not have to recover from an outage longer than the compaction interval. The recent changes to FileStreamSource under SPARK-17165 (#14728) have the same problem, and those changes are already committed.


zsxwing commented Sep 13, 2016

Sorry, replied to the wrong PR. Deleting.


zsxwing commented Sep 13, 2016

@frreiss SPARK-17165 (#14728) uses SeenFilesMap.lastPurgeTimestamp to ignore files. When recovering from failure, SeenFilesMap.lastPurgeTimestamp is set from the files in the metadata log. File paths that are no longer stored in memory but are older than SeenFilesMap.lastPurgeTimestamp won't be processed. Therefore, it doesn't need to store unbounded state.
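
A simplified sketch of that mechanism (the real SeenFilesMap lives in FileStreamSource; the field names here follow the description above):

    class SeenFilesMapSketch(maxAgeMs: Long) {
      private val map = new java.util.HashMap[String, Long]()  // path -> mod time
      private var latestTimestamp = 0L
      private var lastPurgeTimestamp = 0L

      def add(path: String, timestamp: Long): Unit = {
        map.put(path, timestamp)
        latestTimestamp = math.max(latestTimestamp, timestamp)
      }

      // Files older than the purge watermark are treated as already processed,
      // even if they were evicted from the in-memory map.
      def isNewFile(path: String, timestamp: Long): Boolean =
        timestamp >= lastPurgeTimestamp && !map.containsKey(path)

      def purge(): Unit = {
        lastPurgeTimestamp = latestTimestamp - maxAgeMs
        val it = map.entrySet().iterator()
        while (it.hasNext) if (it.next().getValue < lastPurgeTimestamp) it.remove()
      }
    }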

Member:

Could you make VERSION a constructor parameter, in order to support changing the source or sink format separately?
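
One possible shape (illustrative only; the actual constructor in the PR differs): the version moves from a fixed VERSION constant into a constructor parameter, so the source and sink logs can evolve their formats independently:

    // Hypothetical sketch: each subclass chooses its own serialized format version.
    abstract class VersionedLogSketch[T](metadataLogVersion: String) {
      protected def serializeEntries(entries: Seq[T]): Seq[String]

      final def serialize(entries: Seq[T]): String =
        (metadataLogVersion +: serializeEntries(entries)).mkString("\n")
    }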


zsxwing commented Sep 13, 2016

@jerryshao here is a test case showing the issue with getBatch:

  test("getBatch") {
    withTempDirs { case (src, tmp) =>
      withSQLConf(
        SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL.key -> "2",
        // Force deleting the old logs
        SQLConf.FILE_SOURCE_LOG_CLEANUP_DELAY.key -> "1"
      ) {
        val fileStream = createFileStream("text", src.getCanonicalPath)
        val filtered = fileStream.filter($"value" contains "keep")

        testStream(filtered)(
          AddTextFileData("keep1", src, tmp),
          CheckAnswer("keep1"),
          AddTextFileData("keep2", src, tmp),
          CheckAnswer("keep1", "keep2"),
          AddTextFileData("keep3", src, tmp),
          CheckAnswer("keep1", "keep2", "keep3"),
          AssertOnQuery("check getBatch") { execution: StreamExecution =>
            val _sources = PrivateMethod[Seq[Source]]('sources)
            val fileSource =
              (execution invokePrivate _sources()).head.asInstanceOf[FileStreamSource]
            assert(fileSource.getBatch(None, LongOffset(2)).as[String].collect() ===
              List("keep1", "keep2", "keep3"))
            assert(fileSource.getBatch(Some(LongOffset(0)), LongOffset(2)).as[String].collect() ===
              List("keep2", "keep3"))
            assert(fileSource.getBatch(Some(LongOffset(1)), LongOffset(2)).as[String].collect() ===
              List("keep3"))
          }
        )
      }
    }
  }


jerryshao commented Sep 14, 2016

Thanks a lot @zsxwing and @frreiss for your comments.

For the slow-scan problem with compact batches: originally I planned not to merge the latest batch (as I did before, and as suggested above), but after several different tries it is hard to implement with small changes. So for now I keep the same implementation, with a simple cache layer to overcome the problem; the basic compaction algorithm is still the same as FileStreamSinkLog's. I think it is easier to maintain.

For the broken-semantics problem: I realize it is a real problem, though the current code didn't hit it. I changed the code to scan the compacted batch files to retrieve missing batches. This is a little time-consuming, but the current logic of FileStreamSource will not touch this path.


SparkQA commented Sep 14, 2016

Test build #65365 has finished for PR 13513 at commit f9a4bcb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FileStreamSinkLog(
    • case class FileEntry(path: String, timestamp: Timestamp, batchId: Long = NOT_SET)
    • class FileStreamSourceLog(


SparkQA commented Sep 14, 2016

Test build #65368 has finished for PR 13513 at commit cb4194e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member) left a comment:

The cache idea looks good to me. For the current PR, I suggest using a new log class FileSourceLogEntry for the file source, with the following benefits:

  • Avoid storing unnecessary info for file sink log.
  • Avoid changing the format for file sink log.
  • The code will be a bit cleaner.

Member:

drop doesn't change the original map.

scala> val m = scala.collection.mutable.LinkedHashMap[Int, Int]()
m: scala.collection.mutable.LinkedHashMap[Int,Int] = Map()

scala> 

scala> m(2) = 1

scala> m
res1: scala.collection.mutable.LinkedHashMap[Int,Int] = Map(2 -> 1)

scala> m.drop(1)
res2: scala.collection.mutable.LinkedHashMap[Int,Int] = Map()

scala> m
res3: scala.collection.mutable.LinkedHashMap[Int,Int] = Map(2 -> 1)

I think it should be Java LinkedHashMap. This is an example:

private[ui] val batchTimeToOutputOpIdSparkJobIdPair =
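
For reference, a minimal sketch of the suggested pattern: a java.util.LinkedHashMap whose removeEldestEntry caps the size (cacheSize and the value type are illustrative):

    object BatchCacheSketch {
      import java.util.{LinkedHashMap => JLinkedHashMap}
      import java.util.Map.Entry

      // Unlike Scala's drop (which returns a new collection), this map mutates
      // in place and evicts the eldest batch once the cap is exceeded.
      def newBatchCache[T](cacheSize: Int): JLinkedHashMap[Long, Seq[T]] =
        new JLinkedHashMap[Long, Seq[T]] {
          override def removeEldestEntry(eldest: Entry[Long, Seq[T]]): Boolean =
            size() > cacheSize
        }
    }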

Contributor Author:

I see, sorry for this issue.

Member:

How about we use the following class for FileStreamSource?

case class FileSourceLogEntry(batchId: Long, entries: Seq[FileEntry])

I think this will make the code here simpler.

Contributor Author:

I think the parent class CompactibleFileStreamLog assumes the metadata type is Array[T], which currently suits both the file source and sink logs. If we changed to FileSourceLogEntry, the base type would have to be T rather than Array[T], which would make the two subclasses diverge.

Member:

Thanks for clarifying.


SparkQA commented Sep 18, 2016

Test build #65547 has finished for PR 13513 at commit be1abfa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  override def add(batchId: Long, logs: Array[FileEntry]): Boolean = {
    if (super.add(batchId, logs) && isCompactionBatch(batchId, compactInterval)) {
Member:

This is wrong. If super.add(batchId, logs) is false, then we should always return false.

Contributor Author:

Yes, you're right, I will fix it.
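
A sketch of the corrected control flow (the compaction branch's body is not shown in the diff above, so it is marked as elided here):

    override def add(batchId: Long, logs: Array[FileEntry]): Boolean = {
      if (super.add(batchId, logs)) {
        if (isCompactionBatch(batchId, compactInterval)) {
          // post-compaction bookkeeping elided (not shown in the diff)
        }
        true
      } else {
        // the write failed or the batch already exists: always report false
        false
      }
    }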

@zsxwing (Member) left a comment:

Overall LGTM. Just one minor issue.


SparkQA commented Sep 20, 2016

Test build #65628 has finished for PR 13513 at commit bddbc7f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Sep 20, 2016

Test build #65631 has finished for PR 13513 at commit 84d3d27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


zsxwing commented Sep 20, 2016

LGTM. Thanks! Merging to master and 2.0.

   * old files while another one keeps retrying. Setting a reasonable cleanup delay could avoid it.
   */
  private val fileCleanupDelayMs = sparkSession.sessionState.conf.fileSinkLogCleanupDelay
  protected override val fileCleanupDelayMs =
Member:

I just noticed some conflicts here. Could you submit a follow-up PR to use the previous sparkSession.sessionState.conf.fileSinkLogCleanupDelay? Same for the other confs. This only exists in the master branch, so we don't need to fix branch-2.0.

Contributor Author:

Oh, sorry about that, I will fix it now.
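
Presumably the follow-up simply restores the conf-backed definition quoted above, along the lines of:

    // as in the diff above, but keeping the override (sketch; class context elided)
    protected override val fileCleanupDelayMs =
      sparkSession.sessionState.conf.fileSinkLogCleanupDelay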

asfgit pushed a commit that referenced this pull request Sep 20, 2016
…ataLog in FileStreamSource (branch-2.0)

## What changes were proposed in this pull request?

Backport #13513 to branch 2.0.

## How was this patch tested?

Jenkins

Author: jerryshao <[email protected]>

Closes #15163 from zsxwing/SPARK-15698-spark-2.0.
