[SPARK-18657][SPARK-18668] Make StreamingQuery.id persists across restart and not auto-generate StreamingQuery.name #16113

tdas · 2016-12-02T03:57:11Z

What changes were proposed in this pull request?

Here are the major changes in this PR.

Added the ability to recover StreamingQuery.id from checkpoint location, by writing the id to checkpointLoc/metadata.
Added StreamingQuery.runId which is unique for every query started and does not persist across restarts. This is to identify each restart of a query separately (same as earlier behavior of id).
Removed auto-generation of StreamingQuery.name. The purpose of name was to provide an identifier that is stable across restarts, but since id now does precisely that, there is no need for an auto-generated name. This means name becomes purely cosmetic, and is null by default.
Added runId to StreamingQueryListener events and StreamingQueryProgress.
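
To illustrate how the three identifiers differ from a consumer's point of view, here is a minimal sketch in plain Scala. `QueryEvent` and `QueryIdentity` are hypothetical stand-ins written for this example, not classes from this PR; the sketch only assumes that each event carries an `id`, a `runId`, and a possibly-null `name`, as described above.

```scala
import java.util.UUID

// Hypothetical stand-in for the (id, runId, name) triple carried by listener events.
final case class QueryEvent(id: UUID, runId: UUID, name: String)

object QueryIdentity {
  // Display label: `name` is cosmetic and may be null, so fall back to the persistent id.
  def label(e: QueryEvent): String = Option(e.name).getOrElse(e.id.toString)

  // Number of distinct runs (starts or restarts) observed for each persistent query id.
  def runsPerQuery(events: Seq[QueryEvent]): Map[UUID, Int] =
    events.groupBy(_.id).map { case (id, es) => id -> es.map(_.runId).distinct.size }
}
```

Grouping by `id` ties together every restart of the same query, while counting distinct `runId` values within a group tells the restarts apart.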

Implementation details

Renamed existing StreamExecutionMetadata to OffsetSeqMetadata, and moved it to the file OffsetSeq.scala, because that is what this metadata is tied to. Also did some refactoring to make the code cleaner (got rid of a lot of .json and .getOrElse("{}")).
Added the id as the new StreamMetadata.
When a StreamingQuery is created it gets or writes the StreamMetadata from checkpointLoc/metadata.
All internal logging in StreamExecution uses (name, id, runId) instead of just name
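
The "gets or writes" step can be sketched roughly as follows. This is not the code in this PR: `StreamMetadataSketch` is a hypothetical name, it uses the local filesystem through `java.nio` rather than whatever filesystem the checkpoint location actually lives on, and the single-field JSON layout and regex-based parsing are assumptions made to keep the example dependency-free.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path}
import java.util.UUID

// Hypothetical sketch mirroring the shape of StreamMetadata(id: String).
final case class StreamMetadataSketch(id: String)

object StreamMetadataSketch {
  // Matches the "id" field of the assumed JSON layout.
  private val IdField = "\"id\"\\s*:\\s*\"([^\"]+)\"".r

  // Assumed on-disk format: a single JSON object such as {"id":"<uuid>"}.
  def toJson(m: StreamMetadataSketch): String = s"""{"id":"${m.id}"}"""

  def fromJson(json: String): Option[StreamMetadataSketch] =
    IdField.findFirstMatchIn(json).map(m => StreamMetadataSketch(m.group(1)))

  // Reuse the id stored under <checkpointDir>/metadata, or generate and persist a new one.
  def getOrWrite(checkpointDir: Path): StreamMetadataSketch = {
    val file = checkpointDir.resolve("metadata")
    if (Files.exists(file)) {
      val json = new String(Files.readAllBytes(file), UTF_8)
      fromJson(json).getOrElse(
        throw new IllegalStateException(s"Unreadable metadata file: $file"))
    } else {
      val fresh = StreamMetadataSketch(UUID.randomUUID().toString)
      Files.createDirectories(checkpointDir)
      Files.write(file, toJson(fresh).getBytes(UTF_8))
      fresh
    }
  }
}
```

Calling `getOrWrite` twice on the same directory returns the same id, which is the property that lets the id survive a restart; a fresh directory yields a fresh id.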

TODO

Test handling of name=null in json generation of StreamingQueryProgress
Test handling of name=null in json generation of StreamingQueryListener events
Test python API of runId
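
For the name=null cases, the behavior worth testing is that a missing name is rendered as a JSON null rather than as the string "null" or a crash. A minimal sketch of that idea, assuming a hand-rolled renderer (`ProgressJsonSketch` is hypothetical, and the field order is chosen for this example only):

```scala
import java.util.UUID

object ProgressJsonSketch {
  // Quote a string for JSON, escaping only backslashes and double quotes (enough for this sketch).
  private def quote(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  // Render the identity fields; a null name becomes a JSON null, not the string "null".
  def identityJson(name: String, id: UUID, runId: UUID): String = {
    val nameJson = Option(name).map(quote).getOrElse("null")
    s"""{"id":${quote(id.toString)},"runId":${quote(runId.toString)},"name":$nameJson}"""
  }
}
```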

How was this patch tested?

Updated unit tests and new unit tests


SparkQA · 2016-12-02T04:24:27Z

Test build #69538 has finished for PR 16113 at commit aca547f.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
\"(class org.apache.spark.sql.streaming.StreamingQueryListener$\") =>
case class UsingJoin(tpe: JoinType, usingColumns: Seq[String]) extends JoinType

SparkQA · 2016-12-02T04:46:32Z

Test build #69540 has finished for PR 16113 at commit bec2fb3.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class StreamMetadata(id: String)

SparkQA · 2016-12-02T21:19:19Z

Test build #69588 has finished for PR 16113 at commit 0554e5e.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-12-02T21:35:19Z

Test build #69589 has finished for PR 16113 at commit 6e1bbdd.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

tdas · 2016-12-02T23:41:34Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

- * Time unit: milliseconds
+ * @param id  unique id of the [[StreamingQuery]] that needs to be persisted across restarts
 */
-case class StreamExecutionMetadata(


This has been renamed to OffsetSeqMetadata and moved to OffsetSeq.scala

marmbrus

Only minor comments. LGTM

marmbrus · 2016-12-03T00:00:46Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/OffsetSeq.scala

+ * @param batchTimestampMs: The current batch processing timestamp.
+ * Time unit: milliseconds
+ */
+case class OffsetSeqMetadata(var batchWatermarkMs: Long = 0, var batchTimestampMs: Long = 0) {


Can we put this in its own file?

Sure. But it's a small class and closely tied to OffsetSeq, so I thought it's not worth having a separate file for this.

Not worth moving these 6 lines of code into a new file.

marmbrus · 2016-12-03T00:02:45Z

sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala

+   * Returns the unique id of this query that persists across restarts from checkpoint data.
+   * That is, this id is generated when a query is started for the first time, and
+   * will be the same every time it is restarted from checkpoint data.
+   * There can only be one query with the same id active in a Spark cluster.


This makes it sound like it's okay to have more than one running as long as they aren't on the same Spark cluster.

I will just remove that line.

marmbrus · 2016-12-03T00:03:40Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala

-   * JSON string representation of this object.
-   */
-  def json: String = Serialization.write(this)
+case class StreamMetadata(id: String) {


Can we move these to their own file?

SparkQA · 2016-12-03T00:16:11Z

Test build #69590 has finished for PR 16113 at commit c9224ef.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

tdas · 2016-12-03T00:31:25Z

sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/StreamMetadataSuite.scala

+      StreamMetadata("d366a8bf-db79-42ca-b5a4-d9ca0a11d63e"))
+  }
+
+  private def readForResource(fileName: String): StreamMetadata = {


note to self: readFromResource

SparkQA · 2016-12-03T01:51:11Z

Test build #69593 has finished for PR 16113 at commit afd5c0f.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-12-03T02:11:58Z

Test build #69594 has finished for PR 16113 at commit 7ee4cf1.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-12-05T22:57:41Z

Test build #69693 has started for PR 16113 at commit 19461e1.

SparkQA · 2016-12-06T01:47:05Z

Test build #69695 has finished for PR 16113 at commit 4041a22.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tdas · 2016-12-06T02:16:46Z

Merging to master and 2.1

…tart and not auto-generate StreamingQuery.name

Here are the major changes in this PR.

- Added the ability to recover `StreamingQuery.id` from checkpoint location, by writing the id to `checkpointLoc/metadata`.
- Added `StreamingQuery.runId` which is unique for every query started and does not persist across restarts. This is to identify each restart of a query separately (same as earlier behavior of `id`).
- Removed auto-generation of `StreamingQuery.name`. The purpose of name was to have the ability to define an identifier across restarts, but since id is precisely that, there is no need for a auto-generated name. This means name becomes purely cosmetic, and is null by default.
- Added `runId` to `StreamingQueryListener` events and `StreamingQueryProgress`.

Implementation details

- Renamed existing `StreamExecutionMetadata` to `OffsetSeqMetadata`, and moved it to the file `OffsetSeq.scala`, because that is what this metadata is tied to. Also did some refactoring to make the code cleaner (got rid of a lot of `.json` and `.getOrElse("{}")`).
- Added the `id` as the new `StreamMetadata`.
- When a StreamingQuery is created it gets or writes the `StreamMetadata` from `checkpointLoc/metadata`.
- All internal logging in `StreamExecution` uses `(name, id, runId)` instead of just `name`

TODO

- [x] Test handling of name=null in json generation of StreamingQueryProgress
- [x] Test handling of name=null in json generation of StreamingQueryListener events
- [x] Test python API of runId

Updated unit tests and new unit tests

Author: Tathagata Das <[email protected]>

Closes #16113 from tdas/SPARK-18657.

(cherry picked from commit bb57bfe)
Signed-off-by: Tathagata Das <[email protected]>


tdas added 5 commits December 1, 2016 19:05

Made StreamingQuery.id persist across restart, added StreamingQuery.runId

26f3cf9

Fixed test

f103def

Made codahale metrics use id instead of name

f55b852

Made name to be default null

c20f4fe

Merge remote-tracking branch 'apache-github/master' into HEAD

aca547f

tdas changed the title ~~[SPARK-18657][SPARK-18668] Make StreamingQuery.id persists across restart, and not auto-generate name~~ [SPARK-18657][SPARK-18668] Make StreamingQuery.id persists across restart and not auto-generate StreamingQuery.name Dec 2, 2016

Some more changes

bec2fb3

Added tests

0554e5e

Fix mima

6e1bbdd

tdas added 2 commits December 2, 2016 14:06

Fix python style

c9224ef

Improve docs

afd5c0f

tdas commented Dec 2, 2016


Fix indent

7ee4cf1

marmbrus approved these changes Dec 3, 2016


tdas commented Dec 3, 2016


tdas mentioned this pull request Dec 3, 2016

[SPARK-18694][SS]Add StreamingQuery.explain and exception to Python and fix StreamingQueryException #16125

Closed

tdas added 3 commits December 2, 2016 23:10

Addressed comments

a2a6d90

Removed python test

0e3b0ac

Merge remote-tracking branch 'apache-github/master' into SPARK-18657

19461e1

Fix test

4041a22

asfgit closed this in bb57bfe Dec 6, 2016
