@tdas tdas commented Dec 2, 2016

What changes were proposed in this pull request?

Here are the major changes in this PR.

  • Added the ability to recover StreamingQuery.id from checkpoint location, by writing the id to checkpointLoc/metadata.
  • Added StreamingQuery.runId which is unique for every query started and does not persist across restarts. This is to identify each restart of a query separately (same as earlier behavior of id).
  • Removed auto-generation of StreamingQuery.name. The purpose of name was to provide an identifier that survives restarts, but since id now does precisely that, there is no need for an auto-generated name. This means name becomes purely cosmetic, and is null by default.
  • Added runId to StreamingQueryListener events and StreamingQueryProgress.
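The difference between the two identifiers can be sketched without Spark. This is only an illustration under stated assumptions: `CheckpointIdSketch`, `QueryIds`, `getOrCreateId`, and `start` are hypothetical names, and storing a bare UUID string in the `metadata` file is a simplification, not necessarily the format this PR writes.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path}
import java.util.UUID

object CheckpointIdSketch {
  // Hypothetical stand-in for a started query: only its two identifiers.
  final case class QueryIds(id: UUID, runId: UUID)

  // Returns the id stored in <checkpointDir>/metadata, creating the file
  // with a fresh random id if it does not exist yet.
  // Assumption: the file holds a bare UUID string.
  def getOrCreateId(checkpointDir: Path): UUID = {
    val metadataFile = checkpointDir.resolve("metadata")
    if (Files.exists(metadataFile)) {
      UUID.fromString(new String(Files.readAllBytes(metadataFile), UTF_8).trim)
    } else {
      val id = UUID.randomUUID()
      Files.createDirectories(checkpointDir)
      Files.write(metadataFile, id.toString.getBytes(UTF_8))
      id
    }
  }

  // Simulates starting (or restarting) a query from a checkpoint location:
  // the id comes from the checkpoint, the runId is new on every call.
  def start(checkpointDir: Path): QueryIds =
    QueryIds(id = getOrCreateId(checkpointDir), runId = UUID.randomUUID())
}
```

Calling `start` twice on the same directory models a restart: both results carry the same `id` but different `runId` values.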

Implementation details

  • Renamed existing StreamExecutionMetadata to OffsetSeqMetadata, and moved it to the file OffsetSeq.scala, because that is what this metadata is tied to. Also did some refactoring to make the code cleaner (got rid of a lot of .json and .getOrElse("{}")).
  • Added the id as the new StreamMetadata.
  • When a StreamingQuery is created, it gets or writes the StreamMetadata from checkpointLoc/metadata.
  • All internal logging in StreamExecution uses (name, id, runId) instead of just name.
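As a rough sketch of the logging change, a log prefix built from `(name, id, runId)` has to tolerate a null `name`, since the name is no longer auto-generated. The object and method names below are hypothetical, and the exact prefix layout is an assumption rather than the format used in `StreamExecution`.

```scala
import java.util.UUID

object QueryLogPrefixSketch {
  // Builds a log prefix from (name, id, runId).
  // `name` may be null; in that case only the ids are printed.
  def logPrefix(name: String, id: UUID, runId: UUID): String =
    Option(name).map(_ + " ").getOrElse("") + s"[id = $id, runId = $runId]"
}
```

Wrapping the possibly-null name in `Option` keeps the literal text "null" out of log lines for unnamed queries.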

TODO

  • Test handling of name=null in json generation of StreamingQueryProgress
  • Test handling of name=null in json generation of StreamingQueryListener events
  • Test python API of runId
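The name=null cases above could be exercised along these lines. This is a hand-rolled sketch, not the JSON code path of `StreamingQueryProgress`: `ProgressJsonSketch`, `jsonString`, and `toJson` are hypothetical, and rendering a missing name as a JSON `null` is an assumption about the desired output.

```scala
import java.util.UUID

object ProgressJsonSketch {
  // Renders a string as a JSON string literal, or as JSON null when absent.
  private def jsonString(s: String): String =
    if (s == null) "null"
    else "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  // Minimal JSON object carrying only the three identifying fields.
  def toJson(name: String, id: UUID, runId: UUID): String =
    s"""{"id":${jsonString(id.toString)},"runId":${jsonString(runId.toString)},"name":${jsonString(name)}}"""
}
```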

How was this patch tested?

Updated existing unit tests and added new ones.

@tdas tdas changed the title from "[SPARK-18657][SPARK-18668] Make StreamingQuery.id persists across restart, and not auto-generate name" to "[SPARK-18657][SPARK-18668] Make StreamingQuery.id persists across restart and not auto-generate StreamingQuery.name" on Dec 2, 2016
SparkQA commented Dec 2, 2016

Test build #69538 has finished for PR 16113 at commit aca547f.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • \"(class org.apache.spark.sql.streaming.StreamingQueryListener$\") =>
  • case class UsingJoin(tpe: JoinType, usingColumns: Seq[String]) extends JoinType

SparkQA commented Dec 2, 2016

Test build #69540 has finished for PR 16113 at commit bec2fb3.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class StreamMetadata(id: String)

SparkQA commented Dec 2, 2016

Test build #69588 has finished for PR 16113 at commit 0554e5e.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 2, 2016

Test build #69589 has finished for PR 16113 at commit 6e1bbdd.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Time unit: milliseconds
* @param id unique id of the [[StreamingQuery]] that needs to be persisted across restarts
*/
case class StreamExecutionMetadata(
Contributor Author

This has been renamed to OffsetSeqMetadata and moved to OffsetSeq.scala.

Contributor

@marmbrus marmbrus left a comment

Only minor comments. LGTM

* @param batchTimestampMs: The current batch processing timestamp.
* Time unit: milliseconds
*/
case class OffsetSeqMetadata(var batchWatermarkMs: Long = 0, var batchTimestampMs: Long = 0) {
Contributor

Can we put this in its own file?

Contributor Author

Sure. But it's a small class and closely tied to OffsetSeq, so I thought it wasn't worth having a separate file for this.

Contributor Author

Not worth moving these 6 lines of code into a new file.

* Returns the unique id of this query that persists across restarts from checkpoint data.
* That is, this id is generated when a query is started for the first time, and
* will be the same every time it is restarted from checkpoint data.
* There can only be one query with the same id active in a Spark cluster.
Contributor

This makes it sound like it's okay to have more than one running as long as they aren't on the same Spark cluster.

Contributor Author

@tdas tdas Dec 3, 2016

I will just remove that line.

* JSON string representation of this object.
*/
def json: String = Serialization.write(this)
case class StreamMetadata(id: String) {
Contributor

Can we move these to their own file?

Contributor Author

Sure.

SparkQA commented Dec 3, 2016

Test build #69590 has finished for PR 16113 at commit c9224ef.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

StreamMetadata("d366a8bf-db79-42ca-b5a4-d9ca0a11d63e"))
}

private def readForResource(fileName: String): StreamMetadata = {
Contributor Author

note to self: readFromResource

SparkQA commented Dec 3, 2016

Test build #69593 has finished for PR 16113 at commit afd5c0f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 3, 2016

Test build #69594 has finished for PR 16113 at commit 7ee4cf1.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 5, 2016

Test build #69693 has started for PR 16113 at commit 19461e1.

SparkQA commented Dec 6, 2016

Test build #69695 has finished for PR 16113 at commit 4041a22.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

tdas commented Dec 6, 2016

Merging to master and 2.1

@asfgit asfgit closed this in bb57bfe Dec 6, 2016
asfgit pushed a commit that referenced this pull request Dec 6, 2016
…tart and not auto-generate StreamingQuery.name

Here are the major changes in this PR.
- Added the ability to recover `StreamingQuery.id` from checkpoint location, by writing the id to `checkpointLoc/metadata`.
- Added `StreamingQuery.runId` which is unique for every query started and does not persist across restarts. This is to identify each restart of a query separately (same as earlier behavior of `id`).
- Removed auto-generation of `StreamingQuery.name`. The purpose of name was to provide an identifier that survives restarts, but since id now does precisely that, there is no need for an auto-generated name. This means name becomes purely cosmetic, and is null by default.
- Added `runId` to `StreamingQueryListener` events and `StreamingQueryProgress`.

Implementation details
- Renamed existing `StreamExecutionMetadata` to `OffsetSeqMetadata`, and moved it to the file `OffsetSeq.scala`, because that is what this metadata is tied to. Also did some refactoring to make the code cleaner (got rid of a lot of `.json` and `.getOrElse("{}")`).
- Added the `id` as the new `StreamMetadata`.
- When a StreamingQuery is created it gets or writes the `StreamMetadata` from `checkpointLoc/metadata`.
- All internal logging in `StreamExecution` uses `(name, id, runId)` instead of just `name`.

TODO
- [x] Test handling of name=null in json generation of StreamingQueryProgress
- [x] Test handling of name=null in json generation of StreamingQueryListener events
- [x] Test python API of runId

Updated unit tests and new unit tests

Author: Tathagata Das <[email protected]>

Closes #16113 from tdas/SPARK-18657.

(cherry picked from commit bb57bfe)
Signed-off-by: Tathagata Das <[email protected]>
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017