
Commit 7482c7b

zsxwing authored and tdas committed
[SPARK-12814][DOCUMENT] Add deploy instructions for Python in flume integration doc
This PR added instructions to get flume assembly jar for Python users in the flume integration page like Kafka doc.

Author: Shixiong Zhu <[email protected]>

Closes #10746 from zsxwing/flume-doc.

(cherry picked from commit a973f48)
Signed-off-by: Tathagata Das <[email protected]>
1 parent 8c2b67f · commit 7482c7b

2 files changed: +13 −4 lines changed

docs/streaming-flume-integration.md

Lines changed: 11 additions & 2 deletions
@@ -71,7 +71,16 @@ configuring Flume agents.
     cluster (Mesos, YARN or Spark Standalone), so that resource allocation can match the names and launch
     the receiver in the right machine.

-3. **Deploying:** Package `spark-streaming-flume_{{site.SCALA_BINARY_VERSION}}` and its dependencies (except `spark-core_{{site.SCALA_BINARY_VERSION}}` and `spark-streaming_{{site.SCALA_BINARY_VERSION}}` which are provided by `spark-submit`) into the application JAR. Then use `spark-submit` to launch your application (see [Deploying section](streaming-programming-guide.html#deploying-applications) in the main programming guide).
+3. **Deploying:** As with any Spark applications, `spark-submit` is used to launch your application. However, the details are slightly different for Scala/Java applications and Python applications.
+
+	For Scala and Java applications, if you are using SBT or Maven for project management, then package `spark-streaming-flume_{{site.SCALA_BINARY_VERSION}}` and its dependencies into the application JAR. Make sure `spark-core_{{site.SCALA_BINARY_VERSION}}` and `spark-streaming_{{site.SCALA_BINARY_VERSION}}` are marked as `provided` dependencies as those are already present in a Spark installation. Then use `spark-submit` to launch your application (see [Deploying section](streaming-programming-guide.html#deploying-applications) in the main programming guide).
+
+	For Python applications which lack SBT/Maven project management, `spark-streaming-flume_{{site.SCALA_BINARY_VERSION}}` and its dependencies can be directly added to `spark-submit` using `--packages` (see [Application Submission Guide](submitting-applications.html)). That is,
+
+	    ./bin/spark-submit --packages org.apache.spark:spark-streaming-flume_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
+
+	Alternatively, you can also download the JAR of the Maven artifact `spark-streaming-flume-assembly` from the
+	[Maven repository](http://search.maven.org/#search|ga|1|a%3A%22spark-streaming-flume-assembly_{{site.SCALA_BINARY_VERSION}}%22%20AND%20v%3A%22{{site.SPARK_VERSION_SHORT}}%22) and add it to `spark-submit` with `--jars`.

 ## Approach 2: Pull-based Approach using a Custom Sink
 Instead of Flume pushing data directly to Spark Streaming, this approach runs a custom Flume sink that allows the following.
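For context, a Python application deployed with `--packages` as the new text describes might look like this minimal push-based sketch. The file name, app name, host, and port are illustrative assumptions, not part of this commit:

    # flume_wordcount.py -- hypothetical example; launched with, e.g.:
    #   ./bin/spark-submit --packages org.apache.spark:spark-streaming-flume_2.10:1.6.1 flume_wordcount.py
    # (the artifact version shown is illustrative)
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.flume import FlumeUtils

    sc = SparkContext(appName="PythonFlumeWordCount")
    ssc = StreamingContext(sc, 2)  # 2-second batches

    # Flume's Avro sink pushes events to this host:port (assumed values).
    events = FlumeUtils.createStream(ssc, "localhost", 4141)

    # Each record is a (headers, body) pair; count words in the event bodies.
    counts = (events.map(lambda event: event[1])
                    .flatMap(lambda body: body.split(" "))
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()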
@@ -157,7 +166,7 @@ configuring Flume agents.
 Note that each input DStream can be configured to receive data from multiple sinks.

-3. **Deploying:** Package `spark-streaming-flume_{{site.SCALA_BINARY_VERSION}}` and its dependencies (except `spark-core_{{site.SCALA_BINARY_VERSION}}` and `spark-streaming_{{site.SCALA_BINARY_VERSION}}` which are provided by `spark-submit`) into the application JAR. Then use `spark-submit` to launch your application (see [Deploying section](streaming-programming-guide.html#deploying-applications) in the main programming guide).
+3. **Deploying:** This is same as the first approach.
docs/streaming-kafka-integration.md

Lines changed: 2 additions & 2 deletions
@@ -71,7 +71,7 @@ Next, we discuss how to use this approach in your streaming application.
     ./bin/spark-submit --packages org.apache.spark:spark-streaming-kafka_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...

 Alternatively, you can also download the JAR of the Maven artifact `spark-streaming-kafka-assembly` from the
-[Maven repository](http://search.maven.org/#search|ga|1|a%3A%22spark-streaming-kafka-assembly_2.10%22%20AND%20v%3A%22{{site.SPARK_VERSION_SHORT}}%22) and add it to `spark-submit` with `--jars`.
+[Maven repository](http://search.maven.org/#search|ga|1|a%3A%22spark-streaming-kafka-assembly_{{site.SCALA_BINARY_VERSION}}%22%20AND%20v%3A%22{{site.SPARK_VERSION_SHORT}}%22) and add it to `spark-submit` with `--jars`.

 ## Approach 2: Direct Approach (No Receivers)
 This new receiver-less "direct" approach has been introduced in Spark 1.3 to ensure stronger end-to-end guarantees. Instead of using receivers to receive data, this approach periodically queries Kafka for the latest offsets in each topic+partition, and accordingly defines the offset ranges to process in each batch. When the jobs to process the data are launched, Kafka's simple consumer API is used to read the defined ranges of offsets from Kafka (similar to read files from a file system). Note that this is an experimental feature introduced in Spark 1.3 for the Scala and Java API, in Spark 1.4 for the Python API.
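A minimal Python use of the direct approach described in that context might look like the following sketch; the topic name and broker address are illustrative assumptions:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="PythonKafkaDirectExample")
    ssc = StreamingContext(sc, 2)

    # No receivers: Spark itself tracks offsets and queries the brokers directly.
    stream = KafkaUtils.createDirectStream(
        ssc, ["my-topic"], {"metadata.broker.list": "kafka-host:9092"})

    # Each record is a (key, value) pair, decoded as UTF-8 strings by default.
    stream.map(lambda kv: kv[1]).pprint()

    ssc.start()
    ssc.awaitTermination()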
@@ -207,4 +207,4 @@ Next, we discuss how to use this approach in your streaming application.
 Another thing to note is that since this approach does not use Receivers, the standard receiver-related (that is, [configurations](configuration.html) of the form `spark.streaming.receiver.*` ) will not apply to the input DStreams created by this approach (will apply to other input DStreams though). Instead, use the [configurations](configuration.html) `spark.streaming.kafka.*`. An important one is `spark.streaming.kafka.maxRatePerPartition` which is the maximum rate (in messages per second) at which each Kafka partition will be read by this direct API.

-3. **Deploying:** This is same as the first approach, for Scala, Java and Python.
+3. **Deploying:** This is same as the first approach.
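Since the surrounding context mentions `spark.streaming.kafka.maxRatePerPartition`, here is a sketch of setting it from Python; the rate of 1000 is an illustrative assumption:

    from pyspark import SparkConf, SparkContext

    # Cap each Kafka partition at 1000 messages/sec for the direct API;
    # 1000 is an assumed example value, tune it for your workload.
    conf = (SparkConf()
            .setAppName("PythonKafkaDirectExample")
            .set("spark.streaming.kafka.maxRatePerPartition", "1000"))
    sc = SparkContext(conf=conf)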
