@@ -428,9 +428,9 @@ KafkaUtils.createStream(javaStreamingContext, kafkaParams, ...);
</div>
</div>

-For more details on these additional sources, see the corresponding [API documentation]
-(#where-to-go-from-here). Furthermore, you can also implement your own custom receiver
-for your sources. See the [Custom Receiver Guide](streaming-custom-receivers.html).
+For more details on these additional sources, see the corresponding [API documentation](#where-to-go-from-here).
+Furthermore, you can also implement your own custom receiver for your sources. See the
+[Custom Receiver Guide](streaming-custom-receivers.html).

## Operations
There are two kinds of DStream operations - _transformations_ and _output operations_. Similar to
@@ -520,9 +520,8 @@ The last two transformations are worth highlighting again.

<h4>UpdateStateByKey Operation</h4>

-The `updateStateByKey` operation allows
-you to main arbitrary stateful computation, where you want to maintain some state data and
-continuously update it with new information. To use this, you will have to do two steps.
+The `updateStateByKey` operation allows you to maintain arbitrary state while continuously updating
+it with new information. To use this, you will have to do two steps.

1. Define the state - The state can be of arbitrary data type.
1. Define the state update function - Specify with a function how to update the state using the
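
For illustration, here is a minimal Scala sketch of the two steps described above; the `updateFunction` name and the `pairs` DStream of `(word, 1)` pairs are assumptions for the example, not part of this patch.

```scala
import org.apache.spark.streaming.StreamingContext._  // enables operations on pair DStreams

// Step 1: the state is a running count per key (an Int).
// Step 2: the update function folds newly received values into the previous state.
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  Some(runningCount.getOrElse(0) + newValues.sum)
}

// `pairs` is assumed to be a DStream[(String, Int)]; a checkpoint directory
// must also be set, since updateStateByKey is a stateful operation.
val runningCounts = pairs.updateStateByKey[Int](updateFunction _)
```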
@@ -925,7 +924,7 @@ exception saying so.
## Monitoring
Besides Spark's in-built [monitoring capabilities](monitoring.html),
the progress of a Spark Streaming program can also be monitored using the [StreamingListener]
-(streaming/index.html#org.apache.spark.scheduler.StreamingListener) interface,
+(api/streaming/index.html#org.apache.spark.scheduler.StreamingListener) interface,
which allows you to get statistics of batch processing times, queueing delays,
and total end-to-end delays. Note that this is still an experimental API and it is likely to be
improved upon (i.e., more information reported) in the future.
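
As a rough sketch of how such a listener could be registered (the class name, printed fields, and log format are assumptions and may vary across Spark versions):

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Logs per-batch scheduling and processing delays as batches complete.
class BatchStatsListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted) {
    val info = batchCompleted.batchInfo
    println("Batch " + info.batchTime +
      ": scheduling delay = " + info.schedulingDelay.getOrElse(-1L) + " ms" +
      ", processing time = " + info.processingDelay.getOrElse(-1L) + " ms")
  }
}

// Register the listener on the StreamingContext (`ssc`) before starting the computation.
ssc.addStreamingListener(new BatchStatsListener)
```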
@@ -1000,11 +999,11 @@ Since all data is modeled as RDDs with their lineage of deterministic operations
for output operations.

## Failure of the Driver Node
-To allows a streaming application to operate 24/7, Spark Streaming allows a streaming computation
+For a streaming application to operate 24/7, Spark Streaming allows a streaming computation
to be resumed even after the failure of the driver node. Spark Streaming periodically writes the
metadata information of the DStreams setup through the `StreamingContext` to a
HDFS directory (can be any Hadoop-compatible filesystem). This periodic
-*checkpointing* can be enabled by setting a the checkpoint
+*checkpointing* can be enabled by setting the checkpoint
directory using `ssc.checkpoint(<checkpoint directory>)` as described
[earlier](#rdd-checkpointing). On failure of the driver node,
the lost `StreamingContext` can be recovered from this information, and restarted.
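
A minimal sketch of enabling this metadata checkpointing; the master URL, app name, batch interval, and HDFS path below are placeholders, not values prescribed by the guide.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Create a context and point periodic checkpointing at a Hadoop-compatible directory.
val ssc = new StreamingContext("local[2]", "CheckpointedApp", Seconds(1))
ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")  // placeholder path
```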
@@ -1105,8 +1104,8 @@ classes. So, if you are using `getOrCreate`, then make sure that the checkpoint
explicitly deleted every time recompiled code needs to be launched.

This failure recovery can be done automatically using Spark's
-[standalone cluster mode](spark-standalone.html), which allows any Spark
-application's driver to be as well as, ensures automatic restart of the driver on failure (see
+[standalone cluster mode](spark-standalone.html), which allows the driver of any Spark application
+to be launched within the cluster and be restarted on failure (see
[supervise mode](spark-standalone.html#launching-applications-inside-the-cluster)). This can be
tested locally by launching the above example using the supervise mode in a
local standalone cluster and killing the java process running the driver (will be shown as
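
To sketch the `getOrCreate` recovery pattern referred to above (the function name and checkpoint path are assumptions for illustration):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Builds a fresh context with its DStream setup; used only when no checkpoint exists.
def createContext(): StreamingContext = {
  val ssc = new StreamingContext("local[2]", "RecoverableApp", Seconds(1))
  ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")  // placeholder path
  // ... define the input DStreams and transformations here ...
  ssc
}

// Recover the StreamingContext from checkpoint data if present, otherwise create it.
val ssc = StreamingContext.getOrCreate("hdfs://namenode:8020/spark/checkpoints", createContext _)
ssc.start()
ssc.awaitTermination()
```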
@@ -1123,7 +1122,7 @@ There are two different failure behaviors based on which input sources are used.
1. _Using HDFS files as input source_ - Since the data is reliably stored on HDFS, all data can
be re-computed and therefore no data will be lost due to any failure.
1. _Using any input source that receives data through a network_ - The received input data is
-replicated in memory to multiple nodes. Since, all the data in the Spark worker's memory is lost
+replicated in memory to multiple nodes. Since all the data in the Spark worker's memory is lost
when the Spark driver fails, the past input data will not be accessible when the driver recovers.
Hence, if stateful and window-based operations are used
(like `updateStateByKey`, `window`, `countByValueAndWindow`, etc.), then the intermediate state
@@ -1133,11 +1132,11 @@ In future releases, we will support full recoverability for all input sources. N
11331132non-stateful transformations like ` map ` , ` count ` , and ` reduceByKey ` , with _ all_ input streams,
11341133the system, upon restarting, will continue to receive and process new data.
11351134
1136- To better understand the behavior of the system under driver failure with a HDFS source, lets
1135+ To better understand the behavior of the system under driver failure with a HDFS source, let's
11371136consider what will happen with a file input stream. Specifically, in the case of the file input
11381137stream, it will correctly identify new files that were created while the driver was down and
11391138process them in the same way as it would have if the driver had not failed. To explain further
1140- in the case of file input stream, we shall use an example. Lets say, files are being generated
1139+ in the case of file input stream, we shall use an example. Let's say, files are being generated
11411140every second, and a Spark Streaming program reads every new file and output the number of lines
11421141in the file. This is what the sequence of outputs would be with and without a driver failure.
11431142
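As a hedged illustration of the file-reading program described in this example (the monitored directory and application name are placeholders):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Watches a directory for new files and prints the number of lines
// in the files that arrived during each one-second batch.
val ssc = new StreamingContext("local[2]", "FileLineCounts", Seconds(1))
val lines = ssc.textFileStream("hdfs://namenode:8020/incoming")  // placeholder directory
lines.count().print()
ssc.start()
ssc.awaitTermination()
```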