Conversation

@tdas (Contributor) commented Feb 17, 2018

What changes were proposed in this pull request?

  • Added clear information about triggers
  • Made the semantic guarantees of watermarks clearer for streaming aggregations and stream-stream joins.

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

@tdas (Contributor, Author) commented Feb 17, 2018

@zsxwing can you take a look?

@SparkQA commented Feb 17, 2018

Test build #87515 has finished for PR 20631 at commit 31bf653.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.trigger(continuous='1 second')
.start()

{% endhighlight %}
A member commented:

R examples:

# Default trigger (runs micro-batch as soon as it can)
write.stream(df, "console")

# ProcessingTime trigger with two-second micro-batch interval
write.stream(df, "console", trigger.processingTime = "2 seconds")

# One-time trigger
write.stream(df, "console", trigger.once = TRUE)

# Continuous trigger is not yet supported

The author replied:

Thank you!!
Can you add support for continuous trigger in R APIs?

@SparkQA commented Feb 19, 2018

Test build #87542 has finished for PR 20631 at commit 4086237.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member) commented Feb 19, 2018 via email


- However, the guarantee is strict only in one direction. Data delayed by more than 2 hours is
not guaranteed to be dropped; it may or may not get aggregated. The more delayed the data is,
the less likely the engine is to process it.
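
To make the quoted semantics concrete, here is a minimal Scala sketch of a 2-hour watermark on a windowed aggregation; the rate source and console sink are stand-ins and not part of this PR:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.master("local[2]").appName("WatermarkSketch").getOrCreate()
import spark.implicits._

// Stand-in event stream: the rate source's `timestamp` column plays the role of event time.
val events = spark.readStream.format("rate").load().withColumnRenamed("timestamp", "eventTime")

// With a 2-hour watermark, data up to 2 hours late is guaranteed to be aggregated;
// data later than that may or may not be aggregated, and the engine may drop it.
val counts = events
  .withWatermark("eventTime", "2 hours")
  .groupBy(window($"eventTime", "10 minutes"))
  .count()

counts.writeStream.outputMode("update").format("console").start()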
A contributor commented:

However, the guarantee is strict only in one direction. Data delayed by more than 2 hours is not guaranteed to be dropped

This might contradict an earlier statement, from "Handling Late Data and Watermarking", that says

"In other words, late data within the threshold will be aggregated, but data later than the threshold will be dropped"

The author replied:

good catch. let me fix it.

@SparkQA commented Feb 20, 2018

Test build #87561 has finished for PR 20631 at commit 4f13e40.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member) left a review:

LGTM except some nits

<td>
The query will execute *only one* micro-batch to process all the available data and then
stop on its own. This is useful in scenarios you want to periodically spin up a cluster,
process everything that is available since the last period, and then the shutdown the

nit: then the shutdown
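
For reference, the one-time trigger described in that passage can be sketched in Scala as follows; the rate source, parquet sink, and paths are hypothetical stand-ins:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.master("local[2]").appName("OnceTriggerSketch").getOrCreate()
val df = spark.readStream.format("rate").load()

// One-time trigger: process everything available in a single micro-batch, then stop.
// The checkpoint location lets the next run resume where this one left off.
val query = df.writeStream
  .format("parquet")
  .option("path", "/tmp/once-output")
  .option("checkpointLocation", "/tmp/once-checkpoint")
  .trigger(Trigger.Once())
  .start()
query.awaitTermination() // returns once the single micro-batch has completed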

.format("console")
.start()

// ProcessingTime trigger with two-second micro-batch interval

nit: two-seconds
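
For completeness, a self-contained Scala sketch of the processing-time trigger under review; the rate source is a stand-in:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.master("local[2]").appName("ProcessingTimeSketch").getOrCreate()
val df = spark.readStream.format("rate").load()

// ProcessingTime trigger: start a new micro-batch every two seconds,
// provided the previous micro-batch has finished.
df.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .start()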

// Continuous trigger with one-second checkpointing interval
df.writeStream
.format("console")
.trigger(Trigger.Continuous())

Trigger.Continuous() -> Trigger.Continuous("1 second")
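
With the nit applied, the snippet becomes a runnable Scala sketch; the rate source is a stand-in that supports continuous processing:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.master("local[2]").appName("ContinuousTriggerSketch").getOrCreate()
val df = spark.readStream.format("rate").load()

// Continuous trigger with a one-second checkpointing interval,
// i.e. Trigger.Continuous("1 second") rather than the argument-less call quoted above.
df.writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))
  .start()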

.format("console")
.start();

// ProcessingTime trigger with two-second micro-batch interval

ditto

// Continuous trigger with one-second checkpointing interval
df.writeStream
.format("console")
.trigger(Trigger.Continuous())

ditto

.format("console") \
.start()

# ProcessingTime trigger with two-second micro-batch interval

ditto

@zsxwing (Member) commented Feb 20, 2018

LGTM

@SparkQA commented Feb 20, 2018

Test build #87570 has finished for PR 20631 at commit 6ad07d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Feb 21, 2018
…treaming programming guide

## What changes were proposed in this pull request?

- Added clear information about triggers
- Made the semantic guarantees of watermarks clearer for streaming aggregations and stream-stream joins.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <[email protected]>

Closes #20631 from tdas/SPARK-23454.

(cherry picked from commit 601d653)
Signed-off-by: Tathagata Das <[email protected]>
@asfgit closed this in 601d653 on Feb 21, 2018
peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018
…treaming programming guide