[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options #16468
Test build #70852 has finished for PR 16468 at commit
Test build #70853 has finished for PR 16468 at commit
For structured-streaming-watermark-update-mode.png, in the third table (under 12:20), 12:00 - 12:10 cat 1 should be gray.
Do you want to document that without aggregation or window operators, update mode is same as append mode? Never mind. Just found that update mode requires aggregation.
| <td>Append</td> | ||
| <td>Append, Update</td> | ||
| <td> | ||
| Complete mode note supported as it is infeasible to keep all data in the Result Table. |
nit: Complete mode not supported
| <tr> | ||
| <td colspan="2" valign="middle"><br/>Queries without aggregation</td> | ||
| <td>Append</td> | ||
| <td>Append, Update</td> |
Good catch about non-aggregation queries. We should support update mode, which is the same as append mode. I will fix that in a follow-up PR.
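To make the point in this thread concrete, here is a toy pure-Python model of what a sink would receive per trigger under each output mode. It is only a sketch of the semantics being discussed, not Spark code: `run_counts` and `run_passthrough` are hypothetical names, and Append mode for aggregations is left out because it depends on watermarks.

```python
from collections import Counter


def run_counts(batches, mode):
    """Toy streaming count: returns what the sink receives at each trigger."""
    if mode not in ("complete", "update"):
        raise ValueError("this sketch models only 'complete' and 'update'")
    result = Counter()  # stands in for the unbounded Result Table
    emitted = []
    for batch in batches:
        changed = set()
        for key in batch:
            result[key] += 1
            changed.add(key)
        if mode == "complete":
            # Complete: the whole Result Table is written every trigger.
            emitted.append(dict(result))
        else:
            # Update: only rows that changed since the last trigger.
            emitted.append({key: result[key] for key in changed})
    return emitted


def run_passthrough(batches, mode):
    """Toy query without aggregation (a simple per-row transformation)."""
    if mode not in ("append", "update"):
        # Complete would require keeping every row ever seen.
        raise ValueError("complete mode is not modeled for non-aggregation queries")
    # Rows already emitted never change, so "new rows" and "updated rows"
    # are the same set: Update degenerates to Append.
    return [[row.upper() for row in batch] for batch in batches]


batches = [["cat", "dog"], ["cat"]]
update_out = run_counts(batches, "update")
complete_out = run_counts(batches, "complete")
append_rows = run_passthrough(batches, "append")
update_rows = run_passthrough(batches, "update")
```

In this model the second trigger emits only the changed `cat` row in Update mode but the full table in Complete mode, while the non-aggregation query produces identical output for Append and Update.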
| (<a href="api/scala/index.html#org.apache.spark.sql.streaming.DataStreamReader">Scala</a>/<a href="api/java/org/apache/spark/sql/streaming/DataStreamReader.html">Java</a>/<a href="api/python/pyspark.sql.html#pyspark.sql.streaming.DataStreamReader">Python</a>). | ||
| E.g. for "parquet" format options see <code>DataStreamReader.parquet()</code></td> | ||
| <td>Yes</td> | ||
| <td>Supports regular expressions, but does not support multiple comma-separated paths/expressions.</td> |
nit: regular expressions -> glob paths
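The difference matters because the same pattern string means different things under the two syntaxes. A minimal illustration using Python's standard `fnmatch` and `re` modules follows; the paths are made up, and `fnmatch` is only a stand-in for glob matching in general (its details, such as `*` matching across `/`, may differ from how Spark resolves glob paths).

```python
import fnmatch
import re

# Hypothetical file paths, for illustration only.
paths = [
    "/data/2017-01-03/part-0.parquet",
    "/data/2017-01-04/part-0.parquet",
    "/data/logs/readme.txt",
]

pattern = "/data/2017-*/part-*.parquet"

# Interpreted as a glob, '*' means "any run of characters".
glob_hits = [p for p in paths if fnmatch.fnmatchcase(p, pattern)]

# Interpreted as a regular expression, '-*' means "zero or more hyphens"
# and '.' means "any single character", so nothing matches here.
regex_hits = [p for p in paths if re.fullmatch(pattern, p)]

# A regular expression that expresses the intended glob explicitly.
equivalent_regex = r"/data/2017-[^/]*/part-[^/]*\.parquet"
equivalent_hits = [p for p in paths if re.fullmatch(equivalent_regex, p)]
```

Here the glob reading selects the two Parquet files, while the regex reading of the same string selects nothing, which is why calling the option a "regular expression" would mislead readers.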
LGTM
Test build #70895 has finished for PR 16468 at commit
@tdas I think it should be “not all of the operations … are supported in … yet” instead of “all of the operations … are not supported in … yet”. You might want to fix this minor issue in this PR.
@david-weiluo-ren Yeah, the wording can be better. Maybe “all of the operations ... are not yet supported”.
| - **Memory sink (for debugging)** - The output is stored in memory as an in-memory table. | ||
| Both, Append and Complete output modes, are supported. This should be used for debugging purposes | ||
| on low data volumes as the entire output is collected and stored in the driver's memory after |
This is slightly repetitive, it says "[...] the entire output is collected and stored in the driver's memory [...]" and then says: "Note that the current implementations saves all the data in the driver memory".
If we want to say this twice to make sure people read it; maybe we can move the "note" reminder into the Notes column in the table a few lines down? :D
I kind of agree it's repetitive, but I don't want people to miss this. I will rewrite this.
Aah sorry, I misunderstood. I thought the note above the table and the Notes in the table were the repetition. But that's not the case. My bad.
| <tr> | ||
| <th>Source</th> | ||
| <th>Options</th> | ||
| <th>Fault-tolerant</th> |
Can we link back to #fault-tolerance-semantics here?
I don't want to make the table heading a link, but I will do something.
| <th>Supported Output Modes</th> | ||
| <th style="width:30%">Usage</th> | ||
| <th>Options</th> | ||
| <th>Fault-tolerant</th> |
Ditto
Thank you very much @thomaso-mirodin @david-weiluo-ren @zsxwing
Test build #70952 has finished for PR 16468 at commit
…e for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. It's more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

Author: Tathagata Das <[email protected]>

Closes #16468 from tdas/SPARK-19074.

(cherry picked from commit b59cdda)
Signed-off-by: Tathagata Das <[email protected]>
Hi TD, Kafka was added as a source as part of the 2.1.0 release. The reason I am asking about Kinesis support is that we currently use Kinesis with Spark Streaming on Spark 1.6 and are planning to upgrade to Spark 2 Structured Streaming, so we are eager to know when we can expect Kinesis support in Structured Streaming. Thanks in advance,
What changes were proposed in this pull request?
Updates