18 changes: 10 additions & 8 deletions docs/streaming-programming-guide.md
@@ -522,9 +522,9 @@ common ones are as follows.
<td> <b>reduceByKey</b>(<i>func</i>, [<i>numTasks</i>]) </td>
<td> When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the
values for each key are aggregated using the given reduce function. <b>Note:</b> By default,
-this uses Spark's default number of parallel tasks (2 for local machine, 8 for a cluster) to
-do the grouping. You can pass an optional <code>numTasks</code> argument to set a different
-number of tasks.</td>
+this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number
+is determined by the config property <code>spark.default.parallelism</code>) to do the grouping.
+You can pass an optional <code>numTasks</code> argument to set a different number of tasks.</td>
</tr>
<tr>
<td> <b>join</b>(<i>otherStream</i>, [<i>numTasks</i>]) </td>
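
For illustration, a minimal Scala sketch of passing `numTasks` to `reduceByKey`, assuming a hypothetical `pairs` DStream of (word, 1) tuples built from a socket source; the host, port, batch interval, and task count are placeholders:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

// Placeholder setup: 1-second batches, local master, socket source.
val ssc = new StreamingContext("local[2]", "NumTasksExample", Seconds(1))

// Hypothetical DStream of (word, 1) pairs.
val pairs = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// With no second argument the shuffle uses the default parallelism;
// numTasks = 16 overrides it for this operation only.
val wordCounts = pairs.reduceByKey((a, b) => a + b, 16)
```
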
@@ -743,8 +743,9 @@ said two parameters - <i>windowLength</i> and <i>slideInterval</i>.
<td> When called on a DStream of (K, V) pairs, returns a new DStream of (K, V)
pairs where the values for each key are aggregated using the given reduce function <i>func</i>
over batches in a sliding window. <b>Note:</b> By default, this uses Spark's default number of
-parallel tasks (2 for local machine, 8 for a cluster) to do the grouping. You can pass an optional
-<code>numTasks</code> argument to set a different number of tasks.
+parallel tasks (2 for local mode, and in cluster mode the number is determined by the config
+property <code>spark.default.parallelism</code>) to do the grouping. You can pass an optional
+<code>numTasks</code> argument to set a different number of tasks.
</td>
</tr>
<tr>
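
Under the same assumptions as the sketch above, the windowed variant with an explicit task count might look like:

```scala
// Reusing the hypothetical pairs stream: aggregate counts over the last
// 30 seconds of data, every 10 seconds, shuffling into 16 tasks.
val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // func
  Seconds(30),               // windowLength
  Seconds(10),               // slideInterval
  16                         // numTasks
)
```
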
@@ -956,9 +957,10 @@ before further processing.
### Level of Parallelism in Data Processing
Cluster resources maybe under-utilized if the number of parallel tasks used in any stage of the
computation is not high enough. For example, for distributed reduce operations like `reduceByKey`
-and `reduceByKeyAndWindow`, the default number of parallel tasks is 8. You can pass the level of
-parallelism as an argument (see the
-[`PairDStreamFunctions`](api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions)
+and `reduceByKeyAndWindow`, the default number of parallel tasks is decided by the [config property]
+(configuration.html#spark-properties) `spark.default.parallelism`. You can pass the level of
+parallelism as an argument (see [`PairDStreamFunctions`]
+(api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions)
documentation), or set the [config property](configuration.html#spark-properties)
`spark.default.parallelism` to change the default.
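
A sketch of the config-property route, assuming a standalone deployment; the master URL, app name, and the value 32 are placeholders rather than recommendations:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Placeholder values throughout; any reduce that omits numTasks will
// now shuffle into 32 parallel tasks by default.
val conf = new SparkConf()
  .setMaster("spark://master:7077")
  .setAppName("ParallelismExample")
  .set("spark.default.parallelism", "32")
val ssc = new StreamingContext(conf, Seconds(1))
```
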
