[SPARK-16963] [STREAMING] [SQL] Changes to Source trait and related implementation classes #14553

frreiss · 2016-08-09T02:39:01Z

What changes were proposed in this pull request?

This PR contains changes to the Source trait such that the scheduler can notify data sources when it is safe to discard buffered data. Summary of changes:

Added a method commit(end: Offset) that tells the Source that it is OK to discard all offsets up to end, inclusive.
Changed the semantics of a None value for the getBatch method to mean "from the very beginning of the stream", as opposed to "all data present in the Source's buffer".
Added notes that the upper layers of the system will never call getBatch with a start value less than the last value passed to commit.
Added a lastCommittedOffset method to allow the scheduler to query the status of each Source on restart. This addition is not strictly necessary, but it seemed like a good idea -- Sources will be maintaining their own persistent state, and there may be bugs in the checkpointing code.
The scheduler in StreamExecution.scala now calls commit on its stream sources after marking each batch as complete in its checkpoint.
MemoryStream now cleans committed batches out of its internal buffer.
TextSocketSource now cleans committed batches from its internal buffer.
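
The contract summarized above can be illustrated with a minimal sketch. This is a toy model, not Spark's actual Source trait: `ToySource`, `BufferedSource`, `bufferedCount`, and the simplified `LongOffset` are hypothetical names, and offsets are assumed to be zero-based row indices.

```scala
object SourceSketch {
  // Hypothetical stand-in for an offset; Spark's real Offset types differ.
  final case class LongOffset(offset: Long)

  trait ToySource[T] {
    /** Latest available offset, or None if no data has arrived yet. */
    def getOffset: Option[LongOffset]
    /** Rows in (start, end]; start = None means "from the very beginning of the stream". */
    def getBatch(start: Option[LongOffset], end: LongOffset): Seq[T]
    /** Tells the source it may discard everything up to and including `end`. */
    def commit(end: LongOffset): Unit
  }

  final class BufferedSource[T] extends ToySource[T] {
    private var buffer = Vector.empty[T] // buffer(0) is the row at offset lastCommitted + 1
    private var lastCommitted = -1L
    private var lastAdded = -1L

    def add(row: T): LongOffset = synchronized {
      buffer = buffer :+ row
      lastAdded += 1
      LongOffset(lastAdded)
    }

    override def getOffset: Option[LongOffset] = synchronized {
      if (lastAdded < 0) None else Some(LongOffset(lastAdded))
    }

    override def getBatch(start: Option[LongOffset], end: LongOffset): Seq[T] = synchronized {
      val from = start.map(_.offset).getOrElse(-1L)
      // Callers are expected never to ask for data older than the last commit.
      require(from >= lastCommitted, s"start $from precedes last committed offset $lastCommitted")
      buffer.slice((from - lastCommitted).toInt, (end.offset - lastCommitted).toInt)
    }

    override def commit(end: LongOffset): Unit = synchronized {
      require(end.offset >= lastCommitted && end.offset <= lastAdded, s"cannot commit $end")
      buffer = buffer.drop((end.offset - lastCommitted).toInt)
      lastCommitted = end.offset
    }

    /** Number of rows still held in memory (for inspection only). */
    def bufferedCount: Int = synchronized(buffer.size)
  }
}
```

In this sketch, committing offset 1 after adding three rows leaves a single buffered row, and a later `getBatch(Some(LongOffset(1)), LongOffset(2))` still returns it.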

How was this patch tested?

Existing regression tests already exercise the new code.

frreiss · 2016-08-22T23:19:02Z

These changes are now ready for review. The contents of this PR pass regression tests on my machines. Can one of the committers please start a Jenkins build?

wangmiao1981 · 2016-08-23T22:14:06Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MetadataLog.scala

   */
  def getLatest(): Option[(Long, T)]
+
+


Extra blank line

Fixed in my local copy.


frreiss · 2016-08-29T16:35:04Z

@rxin and @marmbrus, would it be possible to get this PR reviewed soon? I can split it into smaller chunks if that would make things easier; I just need to know.


frreiss · 2016-08-31T18:25:52Z

@ScrapCodes, would you mind triggering a build of this PR?

ScrapCodes · 2016-09-01T03:30:07Z

ok to test

ScrapCodes · 2016-09-01T03:40:37Z

retest this please

ScrapCodes · 2016-09-06T07:07:45Z

I have tested the PR with my MQTT connector. It looks like I do not have sufficient privileges to command Jenkins.

vanzin · 2016-09-07T23:44:36Z

ok to test

SparkQA · 2016-09-08T01:18:21Z

Test build #65062 has finished for PR 14553 at commit 7c6a30d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait Source

zsxwing

LGTM except one nit.

zsxwing · 2016-10-21T21:42:00Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

-        offsetLog.purge(currentBatchId)
+        // the batch before the previous batch, and it is safe to discard the old metadata.
+        // Note that purge is exclusive, i.e. it purges everything before the target ID.
+        offsetLog.purge(currentBatchId - 1)


nit: this can be offsetLog.purge(currentBatchId), it's exclusive, then you can revert changes to StreamingQuerySuite.

I can move this change to another JIRA if you'd like, but we really should change currentBatchId to currentBatchId - 1 at some point. The call to offsetLog.purge(currentBatchId), which I introduced in my PR for SPARK-17513, contains a subtle bug. The recovery logic in populateStartOffsets() reads the last and second-to-last entries in offsetLog and uses them to populate availableOffsets and committedOffsets, respectively. Calling offsetLog.purge(currentBatchId) at line 350/366 truncates the offsetLog to one entry. As a result, committedOffsets is left empty on recovery, and the first call to getBatch() for any source has None as its first argument. Sources that do not prune buffered data in their commit() methods will return previously committed data in response to such a getBatch() call.

I see. Thanks for clarifying.
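
The off-by-one described in this thread can be reproduced with a small in-memory model. This is a sketch under stated assumptions, not the StreamExecution code: `ToyOffsetLog`, `recover`, and `logAfterPurge` are hypothetical names, offsets are plain `Long` values, and the only behavior taken from the discussion is that purge is exclusive and that recovery reads the last two log entries.

```scala
import scala.collection.mutable

object PurgeSketch {
  /** Hypothetical in-memory stand-in for the offset log: batch ID -> offset. */
  final class ToyOffsetLog {
    private val entries = mutable.TreeMap.empty[Long, Long]

    def add(batchId: Long, offset: Long): Unit = entries.update(batchId, offset)

    /** Exclusive purge: removes every entry whose ID is strictly below the threshold. */
    def purge(thresholdBatchId: Long): Unit =
      entries --= entries.keys.filter(_ < thresholdBatchId).toList

    /** The last `n` entries, newest first. */
    def latest(n: Int): List[(Long, Long)] = entries.toList.reverse.take(n)
  }

  /** Recovery as described in the thread: (availableOffset, committedOffset). */
  def recover(log: ToyOffsetLog): (Option[Long], Option[Long]) = {
    val lastTwo = log.latest(2)
    (lastTwo.headOption.map(_._2), lastTwo.drop(1).headOption.map(_._2))
  }

  /** Builds a log with batches 0..currentBatchId, then purges with the given threshold. */
  def logAfterPurge(currentBatchId: Long, threshold: Long): ToyOffsetLog = {
    val log = new ToyOffsetLog
    (0L to currentBatchId).foreach(id => log.add(id, id * 10))
    log.purge(threshold)
    log
  }
}
```

With `currentBatchId = 3`, purging at `3` leaves one entry, so the recovered committed offset is empty. Purging at `2` keeps two entries, and recovery sees both an available and a committed offset.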

SparkQA · 2016-10-21T22:37:10Z

Test build #67350 has finished for PR 14553 at commit 47eee52.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2016-10-24T18:02:32Z

retest this please

zsxwing

LGTM pending tests

SparkQA · 2016-10-24T19:41:43Z

Test build #67463 has finished for PR 14553 at commit 47eee52.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2016-10-24T20:11:30Z

@frreiss you need to reset lastOffsetCommitted in MemoryStream.reset. That's why the test fails.

brkyvz · 2016-10-25T00:29:25Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/memory.scala

  }

+  override def commit(end: Offset): Unit = synchronized {
+    if (end.isInstanceOf[LongOffset]) {


nit:

end match { case newOffset: LongOffset => ... case _ => sys.error(...) }

Corrected in my local copy.
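
The pattern-match form suggested in this nit might look roughly like the sketch below. It is not the code in memory.scala: `ToyMemoryStream`, `OtherOffset`, and `committed` are hypothetical, and the `Offset` hierarchy is a simplified stand-in. The `reset()` method reflects the earlier review note that the last committed offset has to be reset along with the stream.

```scala
object CommitSketch {
  // Hypothetical stand-ins; Spark's real Offset hierarchy is richer than this.
  sealed trait Offset
  final case class LongOffset(offset: Long) extends Offset
  final case class OtherOffset(label: String) extends Offset

  final class ToyMemoryStream {
    private var lastOffsetCommitted = LongOffset(-1L)

    def commit(end: Offset): Unit = synchronized {
      end match {
        case newOffset: LongOffset =>
          // A real source would also drop buffered batches up to newOffset here.
          lastOffsetCommitted = newOffset
        case _ =>
          // Only the first literal interpolates a value, so only it needs the s prefix.
          sys.error(s"ToyMemoryStream.commit() received an offset ($end) that did not " +
            "originate with an instance of this class")
      }
    }

    /** Clears commit state so a reused stream starts from the beginning again. */
    def reset(): Unit = synchronized { lastOffsetCommitted = LongOffset(-1L) }

    def committed: LongOffset = synchronized(lastOffsetCommitted)
  }
}
```

Compared with an `isInstanceOf` check followed by a cast, the match binds the narrowed value directly and keeps the error branch next to the success branch.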

brkyvz · 2016-10-25T00:29:40Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/memory.scala

+      lastOffsetCommitted = newOffset
+    } else {
+      sys.error(s"MemoryStream.commit() received an offset ($end) that did not originate with " +
+        s"an instance of this class")


nit: the s interpolator is unnecessary on the second string.

Corrected in my local copy.

brkyvz · 2016-10-25T00:32:20Z

LGTM as well!

zsxwing · 2016-10-26T00:18:43Z

@frreiss any update?

frreiss · 2016-10-26T22:02:52Z

Updated the branch and addressed new review comments. Looks like my last push missed a one-line change to memory.scala. Tests are running now.

SparkQA · 2016-10-27T00:22:12Z

Test build #67603 has finished for PR 14553 at commit 0a56e4a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2016-10-27T00:32:34Z

LGTM. Merging to master and 2.0. Thanks!

…lementation classes

## What changes were proposed in this pull request?

This PR contains changes to the Source trait such that the scheduler can notify data sources when it is safe to discard buffered data. Summary of changes:

* Added a method `commit(end: Offset)` that tells the Source that it is OK to discard all offsets up to `end`, inclusive.
* Changed the semantics of a `None` value for the `getBatch` method to mean "from the very beginning of the stream", as opposed to "all data present in the Source's buffer".
* Added notes that the upper layers of the system will never call `getBatch` with a start value less than the last value passed to `commit`.
* Added a `lastCommittedOffset` method to allow the scheduler to query the status of each Source on restart. This addition is not strictly necessary, but it seemed like a good idea -- Sources will be maintaining their own persistent state, and there may be bugs in the checkpointing code.
* The scheduler in `StreamExecution.scala` now calls `commit` on its stream sources after marking each batch as complete in its checkpoint.
* `MemoryStream` now cleans committed batches out of its internal buffer.
* `TextSocketSource` now cleans committed batches from its internal buffer.

## How was this patch tested?

Existing regression tests already exercise the new code.

Author: frreiss <[email protected]>

Closes #14553 from frreiss/fred-16963.

(cherry picked from commit 5b27598)
Signed-off-by: Shixiong Zhu <[email protected]>

…ion"

## What changes were proposed in this pull request?

A follow up PR for #14553 to fix the flaky test. It's flaky because the file list API doesn't guarantee any order of the return list.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <[email protected]>

Closes #15661 from zsxwing/fix-StreamingQuerySuite.

…ion"

## What changes were proposed in this pull request?

A follow up PR for #14553 to fix the flaky test. It's flaky because the file list API doesn't guarantee any order of the return list.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <[email protected]>

Closes #15661 from zsxwing/fix-StreamingQuerySuite.

(cherry picked from commit 79fd0cc)
Signed-off-by: Shixiong Zhu <[email protected]>


frreiss added 9 commits August 8, 2016 19:28

Initial version of changes to Source trait

6c9acde

Changes to files that depend on the Source trait

dae72ff

Merge branch 'master' of https://github.com/apache/spark into fred-16963

f78b4d5

Added method to garbage-collect the metadata log.

cf426fa

Merge branch 'master' of https://github.com/apache/spark into fred-16963

c028432

Fixing problems with building from Maven.

f92a9a7

Various bug fixes.

4cd181d

Merge branch 'master' of https://github.com/apache/spark into fred-16963

fcc90bd

Merge branch 'master' of https://github.com/apache/spark into fred-16963

35cdae9

frreiss changed the title ~~[WIP] [SPARK-16963] Initial version of changes to Source trait~~ [SPARK-16963] Changes to Source trait and related implementation classes Aug 22, 2016

wangmiao1981 reviewed Aug 23, 2016

frreiss mentioned this pull request Aug 26, 2016

[SPARK-17235][SQL] Support purging of old logs in MetadataLog #14802

Closed

frreiss added 3 commits August 26, 2016 17:51

Merge branch 'master' of https://github.com/apache/spark into fred-16…

9096c56

…963. Also addressed minor review comments.

Merge branch 'master' of https://github.com/apache/spark into fred-16963

ecaf732

Merge branch 'master' of https://github.com/apache/spark into fred-16963

5638281

frreiss added 4 commits August 29, 2016 09:38

Removed a few blank lines.

43ffbf3

Additional whitespace cleanup.

f5c15f8

Merge branch 'master' of https://github.com/apache/spark into fred-16963

a79c557

Narrowing the size of the diff by moving some changes out to future w…

7c6a30d

…ork.

frreiss added 2 commits October 21, 2016 13:44

Changes to address review comments.

c726549

Merge branch 'master' of https://github.com/apache/spark into fred-16963

47eee52

zsxwing reviewed Oct 21, 2016


zsxwing approved these changes Oct 24, 2016


zsxwing mentioned this pull request Oct 24, 2016

[SPARK-17604][SQL][Streaming] Supprt purging aged file entries in FileStreamSourceLog #15210

Closed

brkyvz reviewed Oct 25, 2016


frreiss added 3 commits October 26, 2016 14:44

Commit before merge.

46f6411

Merge branch 'master' of https://github.com/apache/spark into fred-16963

d9eaf5a

Addressing review comments.

0a56e4a

zsxwing mentioned this pull request Oct 27, 2016

[SPARK-17813][SQL][KAFKA] Maximum data per trigger #15527

Closed

asfgit closed this in 5b27598 Oct 27, 2016

zsxwing mentioned this pull request Oct 27, 2016

[SPARK-16963][SQL]Fix test "StreamExecution metadata garbage collection" #15661

Closed
