
Conversation

@tdas (Contributor) commented Oct 4, 2018

What changes were proposed in this pull request?

Added

  • Python foreach
  • Scala, Java and Python foreachBatch
  • Multiple watermark policy
  • The semantics of what changes are allowed to a streaming query between restarts.

How was this patch tested?

No tests

@tdas (Contributor, Author) commented Oct 4, 2018

@zsxwing

@SparkQA commented Oct 4, 2018

Test build #96934 has finished for PR 22627 at commit f61c13e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • In Scala, you have to extend the class ForeachWriter ([docs](api/scala/index.html#org.apache.spark.sql.ForeachWriter)).
  • In Java, you have to extend the class ForeachWriter ([docs](api/java/org/apache/spark/sql/ForeachWriter.html)).

@SparkQA commented Oct 4, 2018

Test build #96936 has finished for PR 22627 at commit d16cfeb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor) commented Oct 4, 2018

I think we should consider this for backport to 2.4 given that it documents new behaviour in 2.4 unless folks object.

@zsxwing (Member) left a comment

Overall looks good. Left some comments.


{% highlight java %}
streamingDatasetOfString.writeStream.foreachBatch(
new VoidFunction2<Dataset<String>, long> {
Member:

long -> Long. I noticed the current Java API actually is wrong. Submitted #22633 to fix it.

{% highlight java %}
streamingDatasetOfString.writeStream.foreachBatch(
new VoidFunction2<Dataset<String>, long> {
void call(Dataset<String> dataset, long batchId) {
Member:

ditto

<div data-lang="java" markdown="1">

{% highlight java %}
streamingDatasetOfString.writeStream.foreachBatch(
Member:

nit: writeStream()

batchDF.cache()
batchDF.write.format(...).save(...) // location 1
batchDF.write.format(...).save(...) // location 2
batchDF.uncache()
Member:

uncache() -> unpersist()
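For context, a minimal self-contained sketch of the corrected pattern in Scala; the rate source, output paths, and formats are placeholders, not part of the PR.

{% highlight scala %}
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("foreachBatchMultiSink").getOrCreate()

// Any streaming DataFrame works; the built-in rate source keeps the sketch runnable.
val streamingDF = spark.readStream.format("rate").load()

// Defining the function separately keeps the foreachBatch overloads unambiguous.
val writeToTwoLocations: (DataFrame, Long) => Unit = (batchDF, batchId) => {
  batchDF.persist()                                                  // reuse the micro-batch output
  batchDF.write.mode("append").format("parquet").save("/tmp/out1")   // location 1
  batchDF.write.mode("append").format("json").save("/tmp/out2")      // location 2
  batchDF.unpersist()                                                // unpersist(), not uncache()
}

streamingDF.writeStream.foreachBatch(writeToTwoLocations).start()
{% endhighlight %}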


In Java, you have to extend the class `ForeachWriter` ([docs](api/java/org/apache/spark/sql/ForeachWriter.html)).
{% highlight java %}
streamingDF.writeStream.foreach(
Member:

streamingDF.writeStream -> streamingDatasetOfString.writeStream()

// Write row to connection. This method is NOT optional in Python.

def close(self, error):
// Close the connection. This method in optional in Python.
Member:

ditto

by a unique tuple (partition_id, epoch_id) is guaranteed to have the same data.
Hence, (partition_id, epoch_id) can be used to deduplicate and/or transactionally commit
data and achieve exactly-once guarantees. However, if the streaming query is being executed
in the continuous mode, then this guarantee does not hold and therefore should not be used for deduplication.
Member:

I think continuous processing will always reprocess the whole epoch after recovery and the user should be able to use (partition_id, epoch_id) to deduplicate. Is it not true?

@HeartSaVioR (Contributor), Oct 4, 2018:

If my understanding is right, continuous processing doesn't guarantee that the same epoch id processes the same offset range of the source (since it processes as many records as possible before it receives the epoch marker), so the epoch id can't be used for deduplication.

Contributor Author:

I agree with @HeartSaVioR.
In continuous processing, when an epoch is reprocessed, the engine's offset tracking ensures that the starting offset of that epoch is the same as what was recorded for the previous epoch, but the ending offset is not guaranteed to be the same as what was processed before the failure. It may happen that epoch E of partition P processed offsets X to Y (and the output of partition P was written), but the query failed before Y was recorded (as other partitions may not have completed epoch E). So after restarting, the re-executed epoch E may process offsets X to Y + Z before the epoch is incremented.

Member:

Gotcha. Thanks for your explanation.
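To make the micro-batch side of this discussion concrete, below is a hypothetical idempotent sink sketch in Scala that keys its commits on (partitionId, epochId). The ExternalStore object is invented for the example (an in-memory stand-in for a transactional store), and, per the thread above, this deduplication only applies to micro-batch execution, not continuous mode.

{% highlight scala %}
import org.apache.spark.sql.ForeachWriter
import scala.collection.concurrent.TrieMap
import scala.collection.mutable.ArrayBuffer

// Invented stand-in for an external transactional store, kept in memory for the sketch.
object ExternalStore {
  private val committed = TrieMap.empty[(Long, Long), Seq[String]]
  def alreadyCommitted(partitionId: Long, epochId: Long): Boolean =
    committed.contains((partitionId, epochId))
  def commit(partitionId: Long, epochId: Long, rows: Seq[String]): Unit =
    committed.put((partitionId, epochId), rows)   // overwriting makes retried commits idempotent
}

// In micro-batch mode the same (partitionId, epochId) always carries the same rows,
// so skipping already-committed pairs deduplicates re-executed batches.
class IdempotentWriter extends ForeachWriter[String] {
  private var partitionId: Long = _
  private var epochId: Long = _
  private val buffer = ArrayBuffer.empty[String]

  override def open(partitionId: Long, epochId: Long): Boolean = {
    this.partitionId = partitionId
    this.epochId = epochId
    buffer.clear()
    !ExternalStore.alreadyCommitted(partitionId, epochId)   // returning false skips this partition/epoch
  }

  override def process(record: String): Unit = buffer += record

  override def close(errorOrNull: Throwable): Unit = {
    if (errorOrNull == null) ExternalStore.commit(partitionId, epochId, buffer.toSeq)
  }
}

// Usage: streamingDatasetOfString.writeStream.foreach(new IdempotentWriter).start()
{% endhighlight %}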


- Changes to output directory of a file sink is not allowed: `sdf.writeStream.format("parquet").option("path", "/somePath")` to `sdf.writeStream.format("parquet").option("path", "/anotherPath")`

- Changes to output topic is allowed: `sdf.writeStream.format("kafka").option("topic", "someTopic")` to `sdf.writeStream.format("kafka").option("path", "anotherTopic")`
Member:

nit: path -> topic


- Addition / deletion of filters is allowed: `sdf.selectExpr("a")` to `sdf.where(...).selectExpr("a").filter(...)`.

- Changes in projections with same output schema is allowed: `sdf.selectExpr("stringColumn AS json").writeStream` to `sdf.select(to_json(...).as("json")).writeStream`.
Member:

this example changes the schema. Right? From string to struct?

Contributor Author:

Right.


- *Changes in stateful operations*: Some operations in streaming queries need to maintain
state data in order to continuously update the result. Structured Streaming automatically checkpoints
the state data to fault-tolerant storage (for example, DBFS, AWS S3, Azure Blob storage) and restores it after restart.
Member:

remove DBFS?

Contributor Author:

replaced with HDFS
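As a small illustration of the checkpointing described above, a hedged Scala sketch; the rate source and the HDFS path are placeholders, and any fault-tolerant, HDFS-compatible file system can hold the checkpoint.

{% highlight scala %}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("statefulCheckpointSketch").getOrCreate()

// A stateful aggregation: its state is checkpointed to the configured location.
val counts = spark.readStream.format("rate").load()
  .groupBy("value")
  .count()

counts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/rate-counts")  // placeholder path
  .start()
{% endhighlight %}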


- **Apply additional DataFrame operations** - Many DataFrame and Dataset operations are not supported
in streaming DataFrames because Spark does not support generating incremental plans in those cases.
Using foreachBatch() you can apply some of these operations on each micro-batch output. However, you will have to reason about the end-to-end semantics of doing that operation yourself.
Member:

Not a big deal but methods like foreachBatch are sometimes rendered that way, and sometimes without code font like foreachBatch(). It's nice to back-tick-quote class and method names if you are doing another pass.

Contributor Author:

Yes. I missed a few, and I want to fix them all.
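To illustrate that point with one concrete, hedged case: the JDBC writer is batch-only, but inside foreachBatch each micro-batch can use it. The connection details below are placeholders, and a matching JDBC driver is assumed to be on the classpath.

{% highlight scala %}
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("foreachBatchJdbcSketch").getOrCreate()

val streamingDF = spark.readStream.format("rate").load()

// The batch-only JDBC writer, applied per micro-batch via foreachBatch.
val writeToJdbc: (DataFrame, Long) => Unit = (batchDF, batchId) => {
  batchDF.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/mydb")  // placeholder connection
    .option("dbtable", "events")
    .option("user", "user")
    .option("password", "password")
    .mode("append")
    .save()
}

streamingDF.writeStream.foreachBatch(writeToJdbc).start()
{% endhighlight %}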

// Open connection
}

def process(record: String) = {
Member:

nit: return type Unit

1. The function takes a row as input.

{% highlight python %}
def processRow(row):
Member:

processRow -> process_row


{% highlight python %}
def foreachBatchFunction(df, epochId):
# Transform and write batchDF
Member:

4 space indentation

<div data-lang="python" markdown="1">

{% highlight python %}
def foreachBatchFunction(df, epochId):
Member:

foreachBatchFunction -> foreach_batch_function

Member:

epochId -> epoch_id

@tdas (Contributor, Author) commented Oct 8, 2018

@holdenk yeah, I intend to backport this to 2.4.

@SparkQA commented Oct 8, 2018

Test build #97129 has finished for PR 22627 at commit 222bfc6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member) left a comment

LGTM except some nits.

In Scala, you have to extend the class `ForeachWriter` ([docs](api/scala/index.html#org.apache.spark.sql.ForeachWriter)).

{% highlight scala %}
streamingDF.writeStream.foreach(
Member:

nit: streamingDF -> streamingDatasetOfString.


In Java, you have to extend the class `ForeachWriter` ([docs](api/java/org/apache/spark/sql/ForeachWriter.html)).
{% highlight java %}
streamingDF.writeStream().foreach(
Member:

ditto

<div data-lang="python" markdown="1">

{% highlight python %}
def foreachBatchFunction(df, epoch_id):
Member:

nit: foreachBatchFunction -> foreach_batch_function

# Transform and write batchDF
pass

streamingDF.writeStream.foreachBatch(foreachBatchFunction).start()
Member:

nit: foreachBatchFunction -> foreach_batch_function

@zsxwing (Member) commented Oct 8, 2018

LGTM

@SparkQA commented Oct 8, 2018

Test build #97131 has finished for PR 22627 at commit 9d60534.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Oct 8, 2018
…ultiple watermarks

## What changes were proposed in this pull request?

Added
- Python foreach
- Scala, Java and Python foreachBatch
- Multiple watermark policy
- The semantics of what changes are allowed to the streaming between restarts.

## How was this patch tested?
No tests

Closes #22627 from tdas/SPARK-25639.

Authored-by: Tathagata Das <[email protected]>
Signed-off-by: Tathagata Das <[email protected]>
(cherry picked from commit f9935a3)
Signed-off-by: Tathagata Das <[email protected]>
@asfgit closed this in f9935a3 Oct 8, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…ultiple watermarks

## What changes were proposed in this pull request?

Added
- Python foreach
- Scala, Java and Python foreachBatch
- Multiple watermark policy
- The semantics of what changes are allowed to the streaming between restarts.

## How was this patch tested?
No tests

Closes apache#22627 from tdas/SPARK-25639.

Authored-by: Tathagata Das <[email protected]>
Signed-off-by: Tathagata Das <[email protected]>