From 42951371259cc1ef1dd39f1e6a2ebb5867326704 Mon Sep 17 00:00:00 2001
From: Shixiong Zhu
Date: Wed, 23 Dec 2015 11:11:56 -0800
Subject: [PATCH 1/5] Update Streaming configurations for 1.6

---
 docs/configuration.md               | 27 +++++++++++++++++++++++++++
 docs/streaming-programming-guide.md |  5 +++++
 2 files changed, 32 insertions(+)

diff --git a/docs/configuration.md b/docs/configuration.md
index a9ef37a9b1cd9..9dc427964cba2 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1600,6 +1600,33 @@ Apart from these, the following properties are also available, and may be useful
     How many batches the Spark Streaming UI and status APIs remember before garbage collecting.
   </td>
 </tr>
+<tr>
+  <td><code>spark.streaming.driver.writeAheadLog.closeFileAfterWrite</code></td>
+  <td>false</td>
+  <td>
+    Whether to close the file after writing a write ahead log record in driver. Because S3 doesn't
+    support flushing of data, when using S3 for checkpointing, you should enable it to achieve read
+    after write consistency.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.streaming.receiver.writeAheadLog.closeFileAfterWrite</code></td>
+  <td>false</td>
+  <td>
+    Whether to close the file after writing a write ahead log record in receivers. Because S3
+    doesn't support flushing of data, when using S3 for checkpointing, you should enable it to
+    achieve read after write consistency.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.streaming.driver.writeAheadLog.allowBatching</code></td>
+  <td>false</td>
+  <td>
+    Whether to batch write ahead logs in driver to write. When using S3 for checkpointing, write
+    operations in driver usually take too long. Enable batching write ahead logs will improve
+    the performance of writing.
+  </td>
+</tr>
 </table>

 #### SparkR
diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md
index 3b071c7da5596..202f39db30e64 100644
--- a/docs/streaming-programming-guide.md
+++ b/docs/streaming-programming-guide.md
@@ -2029,6 +2029,11 @@ If the data is being received by the receivers faster than what can be processed
 you can limit the rate by setting the [configuration parameter](configuration.html#spark-streaming)
 `spark.streaming.receiver.maxRate`.

+If using S3 for checkpointing, please remember to enable `spark.streaming.driver.writeAheadLog.closeFileAfterWrite`
+and `spark.streaming.receiver.writeAheadLog.closeFileAfterWrite`. You can also enable
+`spark.streaming.driver.writeAheadLog.allowBatching` to improve the performance of writing write
+ahead logs in driver. See [Spark Streaming Configuration](configuration.html#spark-streaming) or more details.
+
 ***

 ## Monitoring Applications

From bce7a29de2966024103258031eeecb369e6d45b4 Mon Sep 17 00:00:00 2001
From: Shixiong Zhu
Date: Tue, 29 Dec 2015 15:59:10 -0800
Subject: [PATCH 2/5] Address comments

---
 docs/configuration.md               | 12 ++++++------
 docs/streaming-programming-guide.md |  6 +++---
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index 9dc427964cba2..7c94f9058b325 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1604,7 +1604,7 @@ Apart from these, the following properties are also available, and may be useful
   <td><code>spark.streaming.driver.writeAheadLog.closeFileAfterWrite</code></td>
   <td>false</td>
   <td>
-    Whether to close the file after writing a write ahead log record in driver. Because S3 doesn't
+    Whether to close the file after writing a write ahead log record on the driver. Because S3 doesn't
     support flushing of data, when using S3 for checkpointing, you should enable it to achieve read
     after write consistency.
   </td>
@@ -1613,18 +1613,18 @@ Apart from these, the following properties are also available, and may be useful
   <td><code>spark.streaming.receiver.writeAheadLog.closeFileAfterWrite</code></td>
   <td>false</td>
   <td>
-    Whether to close the file after writing a write ahead log record in receivers. Because S3
+    Whether to close the file after writing a write ahead log record on the receivers. Because S3
     doesn't support flushing of data, when using S3 for checkpointing, you should enable it to
     achieve read after write consistency.
   </td>
 </tr>
 <tr>
   <td><code>spark.streaming.driver.writeAheadLog.allowBatching</code></td>
-  <td>false</td>
+  <td>true</td>
   <td>
-    Whether to batch write ahead logs in driver to write. When using S3 for checkpointing, write
-    operations in driver usually take too long. Enable batching write ahead logs will improve
-    the performance of writing.
+    Whether to batch write ahead logs on the driver to write. When using S3 for checkpointing, write
+    operations on the driver usually take too long. Enabling batching write ahead logs will improve
+    the performance of write operations.
   </td>
 </tr>
 </table>
diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md
index 202f39db30e64..b97035d4fef31 100644
--- a/docs/streaming-programming-guide.md
+++ b/docs/streaming-programming-guide.md
@@ -2030,9 +2030,9 @@ you can limit the rate by setting the [configuration parameter](configuration.ht
 `spark.streaming.receiver.maxRate`.

 If using S3 for checkpointing, please remember to enable `spark.streaming.driver.writeAheadLog.closeFileAfterWrite`
-and `spark.streaming.receiver.writeAheadLog.closeFileAfterWrite`. You can also enable
-`spark.streaming.driver.writeAheadLog.allowBatching` to improve the performance of writing write
-ahead logs in driver. See [Spark Streaming Configuration](configuration.html#spark-streaming) or more details.
+and `spark.streaming.receiver.writeAheadLog.closeFileAfterWrite`. By default,
+`spark.streaming.driver.writeAheadLog.allowBatching` is enabled to improve the performance of writing write
+ahead logs on the driver. See [Spark Streaming Configuration](configuration.html#spark-streaming) for more details.

 ***

From 7d9b0389ddc2b03b259f9f2fa6b657b93cd5f3ea Mon Sep 17 00:00:00 2001
From: Shixiong Zhu
Date: Wed, 30 Dec 2015 15:04:28 -0800
Subject: [PATCH 3/5] Address Burak's comment

---
 docs/configuration.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index 7c94f9058b325..dba3d3a165ba7 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1624,7 +1624,8 @@ Apart from these, the following properties are also available, and may be useful
   <td>
     Whether to batch write ahead logs on the driver to write. When using S3 for checkpointing, write
     operations on the driver usually take too long. Enabling batching write ahead logs will improve
-    the performance of write operations.
+    the performance of write operations. Moreover, it's also very helpful to scale to a large number
+    of receivers.
   </td>
 </tr>
 </table>
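For orientation while reading the configuration changes above, here is a minimal, hypothetical sketch of how a streaming application could enable these options when its checkpoint and write ahead log directory lives on S3. It is not part of the patch: the application name, batch interval, and bucket path are placeholders, the two `closeFileAfterWrite` keys are the ones documented above, and `spark.streaming.receiver.writeAheadLog.enable` is the pre-existing switch that turns the receiver-side log on.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical application; the WAL-related keys mirror the documentation above.
val conf = new SparkConf()
  .setAppName("S3CheckpointedStream")                                        // placeholder name
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")              // turn the receiver WAL on
  .set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", "true")   // S3 cannot flush
  .set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite", "true") // S3 cannot flush

val ssc = new StreamingContext(conf, Seconds(10))    // placeholder batch interval
ssc.checkpoint("s3a://my-bucket/spark/checkpoints")  // placeholder S3 path
```

With these settings the driver and receiver write ahead logs close their files after every record written, which is what gives read-after-write consistency on a store such as S3 that does not support flushing.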
From 4d55b03af0c6cfb73833c8fe86fb7bf97f7c2c38 Mon Sep 17 00:00:00 2001
From: Shixiong Zhu
Date: Thu, 7 Jan 2016 13:32:35 -0800
Subject: [PATCH 4/5] Address TD's comments

---
 docs/configuration.md               | 22 ++++++----------------
 docs/streaming-programming-guide.md | 17 +++++------------
 2 files changed, 11 insertions(+), 28 deletions(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index dba3d3a165ba7..aed64334ca538 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1604,28 +1604,18 @@ Apart from these, the following properties are also available, and may be useful
   <td><code>spark.streaming.driver.writeAheadLog.closeFileAfterWrite</code></td>
   <td>false</td>
   <td>
-    Whether to close the file after writing a write ahead log record on the driver. Because S3 doesn't
-    support flushing of data, when using S3 for checkpointing, you should enable it to achieve read
-    after write consistency.
+    Whether to close the file after writing a write ahead log record on the driver. Set this to 'true'
+    when you want to use S3 (or any file system that does not support flushing) for the metadata WAL
+    on the driver.
   </td>
 </tr>
 <tr>
   <td><code>spark.streaming.receiver.writeAheadLog.closeFileAfterWrite</code></td>
   <td>false</td>
   <td>
-    Whether to close the file after writing a write ahead log record on the receivers. Because S3
-    doesn't support flushing of data, when using S3 for checkpointing, you should enable it to
-    achieve read after write consistency.
-  </td>
-</tr>
-<tr>
-  <td><code>spark.streaming.driver.writeAheadLog.allowBatching</code></td>
-  <td>true</td>
-  <td>
-    Whether to batch write ahead logs on the driver to write. When using S3 for checkpointing, write
-    operations on the driver usually take too long. Enabling batching write ahead logs will improve
-    the performance of write operations. Moreover, it's also very helpful to scale to a large number
-    of receivers.
+    Whether to close the file after writing a write ahead log record on the receivers. Set this to 'true'
+    when you want to use S3 (or any file system that does not support flushing) for the data WAL
+    on the receivers.
   </td>
 </tr>
 </table>
diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md
index b97035d4fef31..b005bcb1e0cc1 100644
--- a/docs/streaming-programming-guide.md
+++ b/docs/streaming-programming-guide.md
@@ -1985,7 +1985,11 @@ To run a Spark Streaming applications, you need to have the following.
   to increase aggregate throughput. Additionally, it is recommended that the replication of the
   received data within Spark be disabled when the write ahead log is enabled as the log is already
   stored in a replicated storage system. This can be done by setting the storage level for the
-  input stream to `StorageLevel.MEMORY_AND_DISK_SER`.
+  input stream to `StorageLevel.MEMORY_AND_DISK_SER`. While using S3 (or any file system that
+  does not support flushing) for Write Ahead Logs, please remember to enable
+  `spark.streaming.driver.writeAheadLog.closeFileAfterWrite` and
+  `spark.streaming.receiver.writeAheadLog.closeFileAfterWrite`. See
+  [Spark Streaming Configuration](configuration.html#spark-streaming) for more details.

 - *Setting the max receiving rate* - If the cluster resources is not large enough for the streaming
   application to process data as fast as it is being received, the receivers can be rate limited
@@ -2023,17 +2027,6 @@ contains serialized Scala/Java/Python objects and trying to deserialize objects
 modified classes may lead to errors. In this case, either start the upgraded app with a different
 checkpoint directory, or delete the previous checkpoint directory.

-### Other Considerations
-{:.no_toc}
-If the data is being received by the receivers faster than what can be processed,
-you can limit the rate by setting the [configuration parameter](configuration.html#spark-streaming)
-`spark.streaming.receiver.maxRate`.
-
-If using S3 for checkpointing, please remember to enable `spark.streaming.driver.writeAheadLog.closeFileAfterWrite`
-and `spark.streaming.receiver.writeAheadLog.closeFileAfterWrite`. By default,
-`spark.streaming.driver.writeAheadLog.allowBatching` is enabled to improve the performance of writing write
-ahead logs on the driver. See [Spark Streaming Configuration](configuration.html#spark-streaming) for more details.
-
 ***

 ## Monitoring Applications

From 28a750d61c058e537a8ca44babb3ff0f4b54f3b3 Mon Sep 17 00:00:00 2001
From: Shixiong Zhu
Date: Thu, 7 Jan 2016 14:47:57 -0800
Subject: [PATCH 5/5] Address more

---
 docs/streaming-programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md
index b005bcb1e0cc1..1edc0fe34706b 100644
--- a/docs/streaming-programming-guide.md
+++ b/docs/streaming-programming-guide.md
@@ -1986,7 +1986,7 @@ To run a Spark Streaming applications, you need to have the following.
   received data within Spark be disabled when the write ahead log is enabled as the log is already
   stored in a replicated storage system. This can be done by setting the storage level for the
   input stream to `StorageLevel.MEMORY_AND_DISK_SER`. While using S3 (or any file system that
-  does not support flushing) for Write Ahead Logs, please remember to enable
+  does not support flushing) for _write ahead logs_, please remember to enable
   `spark.streaming.driver.writeAheadLog.closeFileAfterWrite` and
   `spark.streaming.receiver.writeAheadLog.closeFileAfterWrite`. See
   [Spark Streaming Configuration](configuration.html#spark-streaming) for more details.
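Patches 4 and 5 fold the S3 note into the guide's bullet on write ahead logs, which also recommends disabling in-Spark replication of received data once the log is enabled. A hedged sketch of that storage-level recommendation follows; the socket source, host, and port are placeholders, not part of the patch.

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext

// When the write ahead log is enabled it already keeps a replicated copy of the
// received data, so the guide suggests a single-replica, serialized storage level
// for the receiver input stream instead of the default replicated one.
def receiverLines(ssc: StreamingContext) =
  ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER) // placeholder source
```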