
Commit b9e41f4

andrewor14 authored and pwendell committed
[SPARK-1753 / 1773 / 1814] Update outdated docs for spark-submit, YARN, standalone etc.
YARN
- SparkPi was updated to not take in master as an argument; we should update the docs to reflect that.
- The default YARN build guide should be in maven, not sbt.
- This PR also adds a paragraph on steps to debug a YARN application.

Standalone
- Emphasize spark-submit more. Right now it's one small paragraph preceding the legacy way of launching through `org.apache.spark.deploy.Client`.
- The way we set configurations / environment variables according to the old docs is outdated. This needs to reflect changes introduced by the Spark configuration changes we made.

In general, this PR also adds a little more documentation on the new spark-shell, spark-submit, spark-defaults.conf etc here and there.

Author: Andrew Or <[email protected]>

Closes #701 from andrewor14/yarn-docs and squashes the following commits:

e2c2312 [Andrew Or] Merge in changes in #752 (SPARK-1814)
25cfe7b [Andrew Or] Merge in the warning from SPARK-1753
a8c39c5 [Andrew Or] Minor changes
336bbd9 [Andrew Or] Tabs -> spaces
4d9d8f7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-docs
041017a [Andrew Or] Abstract Spark submit documentation to cluster-overview.html
3cc0649 [Andrew Or] Detail how to set configurations + remove legacy instructions
5b7140a [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-docs
85a51fc [Andrew Or] Update run-example, spark-shell, configuration etc.
c10e8c7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-docs
381fe32 [Andrew Or] Update docs for standalone mode
757c184 [Andrew Or] Add a note about the requirements for the debugging trick
f8ca990 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-docs
924f04c [Andrew Or] Revert addition of --deploy-mode
d5fe17b [Andrew Or] Update the YARN docs

(cherry picked from commit 2ffd1ea)
Signed-off-by: Patrick Wendell <[email protected]>
1 parent 5ef24a0 commit b9e41f4

13 files changed, +184 -125 lines changed

conf/spark-defaults.conf.template

Lines changed: 2 additions & 1 deletion
@@ -2,6 +2,7 @@
 # This is useful for setting default environmental settings.

 # Example:
-# spark.master spark://master:7077
+# spark.master spark://master:7077
 # spark.eventLog.enabled true
 # spark.eventLog.dir hdfs://namenode:8021/directory
+# spark.serializer org.apache.spark.serializer.KryoSerializer
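As a rough sketch of how this template is typically used (the commands and edits below are illustrative, with placeholder values), the template is copied to the location that `spark-submit` and `spark-shell` actually read, and the example lines are uncommented:

    # Copy the template into place
    cp conf/spark-defaults.conf.template conf/spark-defaults.conf
    # Then edit conf/spark-defaults.conf and uncomment the settings you need, e.g.:
    #   spark.master            spark://master:7077
    #   spark.serializer        org.apache.spark.serializer.KryoSerializer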

conf/spark-env.sh.template

Lines changed: 2 additions & 2 deletions
@@ -30,11 +30,11 @@

 # Options for the daemons used in the standalone deploy mode:
 # - SPARK_MASTER_IP, to bind the master to a different IP address or hostname
-# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports
+# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
 # - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
 # - SPARK_WORKER_CORES, to set the number of cores to use on this machine
 # - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
-# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT
+# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
 # - SPARK_WORKER_INSTANCES, to set the number of worker processes per node
 # - SPARK_WORKER_DIR, to set the working directory of worker processes
 # - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
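For context, a minimal `spark-env.sh` using the variables documented above might look like the following sketch; the concrete values (address, ports, cores, memory) are placeholders to be adapted to your cluster:

    # conf/spark-env.sh (illustrative placeholder values)
    SPARK_MASTER_IP=192.168.1.10        # bind the master to this address
    SPARK_MASTER_PORT=7177              # non-default master port
    SPARK_MASTER_WEBUI_PORT=8090        # non-default master web UI port
    SPARK_WORKER_CORES=4                # cores each worker may use
    SPARK_WORKER_MEMORY=2g              # total memory workers can give to executors
    SPARK_WORKER_PORT=7178              # non-default worker port
    SPARK_WORKER_WEBUI_PORT=8091        # non-default worker web UI port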

docs/building-with-maven.md

Lines changed: 7 additions & 0 deletions
@@ -129,6 +129,13 @@ Java 8 tests are run when -Pjava8-tests profile is enabled, they will run in spi
 For these tests to run your system must have a JDK 8 installation.
 If you have JDK 8 installed but it is not the system default, you can set JAVA_HOME to point to JDK 8 before running the tests.

+## Building for PySpark on YARN ##
+
+PySpark on YARN is only supported if the jar is built with maven. Further, there is a known problem
+with building this assembly jar on Red Hat based operating systems (see SPARK-1753). If you wish to
+run PySpark on a YARN cluster with Red Hat installed, we recommend that you build the jar elsewhere,
+then ship it over to the cluster. We are investigating the exact cause for this.
+
 ## Packaging without Hadoop dependencies for deployment on YARN ##

 The assembly jar produced by "mvn package" will, by default, include all of Spark's dependencies, including Hadoop and some of its ecosystem projects. On YARN deployments, this causes multiple versions of these to appear on executor classpaths: the version packaged in the Spark assembly and the version on each node, included with yarn.application.classpath. The "hadoop-provided" profile builds the assembly without including Hadoop-ecosystem projects, like ZooKeeper and Hadoop itself.
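A plausible workflow for the PySpark-on-YARN note above, assuming the `-Pyarn` profile and a Hadoop 2.2 cluster (the profile and version strings should be matched to your environment, and the destination host and path are placeholders):

    # Build the assembly with Maven rather than sbt
    mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package

    # If the build machine is affected by SPARK-1753 (Red Hat based OS), build elsewhere
    # and copy the resulting assembly jar over to the cluster (output path may differ
    # by Spark/Scala version):
    scp assembly/target/scala-*/spark-assembly-*.jar user@cluster-node:/opt/spark/assembly/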

docs/cluster-overview.md

Lines changed: 45 additions & 28 deletions
@@ -66,62 +66,76 @@ script as shown here while passing your jar.
 For Python, you can use the `pyFiles` argument of SparkContext
 or its `addPyFile` method to add `.py`, `.zip` or `.egg` files to be distributed.

-### Launching Applications with ./bin/spark-submit
+### Launching Applications with Spark submit

 Once a user application is bundled, it can be launched using the `spark-submit` script located in
 the bin directory. This script takes care of setting up the classpath with Spark and its
-dependencies, and can support different cluster managers and deploy modes that Spark supports.
-It's usage is
+dependencies, and can support different cluster managers and deploy modes that Spark supports:

-./bin/spark-submit --class path.to.your.Class [options] <app jar> [app options]
+./bin/spark-submit \
+  --class <main-class>
+  --master <master-url> \
+  --deploy-mode <deploy-mode> \
+  ... // other options
+  <application-jar>
+  [application-arguments]

-When calling `spark-submit`, `[app options]` will be passed along to your application's
-main class. To enumerate all options available to `spark-submit` run it with
-the `--help` flag. Here are a few examples of common options:
+main-class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
+master-url: The URL of the master node (e.g. spark://23.195.26.187:7077)
+deploy-mode: Whether to deploy this application within the cluster or from an external client (e.g. client)
+application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes.
+application-arguments: Space delimited arguments passed to the main method of <main-class>, if any
+
+To enumerate all options available to `spark-submit` run it with the `--help` flag. Here are a few
+examples of common options:

 {% highlight bash %}
 # Run application locally
 ./bin/spark-submit \
-  --class my.main.ClassName
+  --class org.apache.spark.examples.SparkPi
   --master local[8] \
-  my-app.jar
+  /path/to/examples.jar \
+  100

 # Run on a Spark standalone cluster
 ./bin/spark-submit \
-  --class my.main.ClassName
-  --master spark://mycluster:7077 \
+  --class org.apache.spark.examples.SparkPi
+  --master spark://207.184.161.138:7077 \
   --executor-memory 20G \
   --total-executor-cores 100 \
-  my-app.jar
+  /path/to/examples.jar \
+  1000

 # Run on a YARN cluster
-HADOOP_CONF_DIR=XX /bin/spark-submit \
-  --class my.main.ClassName
+HADOOP_CONF_DIR=XX ./bin/spark-submit \
+  --class org.apache.spark.examples.SparkPi
   --master yarn-cluster \ # can also be `yarn-client` for client mode
   --executor-memory 20G \
   --num-executors 50 \
-  my-app.jar
+  /path/to/examples.jar \
+  1000
 {% endhighlight %}

 ### Loading Configurations from a File

-The `spark-submit` script can load default `SparkConf` values from a properties file and pass them
-onto your application. By default it will read configuration options from
-`conf/spark-defaults.conf`. Any values specified in the file will be passed on to the
-application when run. They can obviate the need for certain flags to `spark-submit`: for
-instance, if `spark.master` property is set, you can safely omit the
+The `spark-submit` script can load default [Spark configuration values](configuration.html) from a
+properties file and pass them on to your application. By default it will read configuration options
+from `conf/spark-defaults.conf`. For more detail, see the section on
+[loading default configurations](configuration.html#loading-default-configurations).
+
+Loading default Spark configurations this way can obviate the need for certain flags to
+`spark-submit`. For instance, if the `spark.master` property is set, you can safely omit the
 `--master` flag from `spark-submit`. In general, configuration values explicitly set on a
-`SparkConf` take the highest precedence, then flags passed to `spark-submit`, then values
-in the defaults file.
+`SparkConf` take the highest precedence, then flags passed to `spark-submit`, then values in the
+defaults file.

-If you are ever unclear where configuration options are coming from. fine-grained debugging
-information can be printed by adding the `--verbose` option to `./spark-submit`.
+If you are ever unclear where configuration options are coming from, you can print out fine-grained
+debugging information by running `spark-submit` with the `--verbose` option.

 ### Advanced Dependency Management
-When using `./bin/spark-submit` the app jar along with any jars included with the `--jars` option
-will be automatically transferred to the cluster. `--jars` can also be used to distribute .egg and .zip
-libraries for Python to executors. Spark uses the following URL scheme to allow different
-strategies for disseminating jars:
+When using `spark-submit`, the application jar along with any jars included with the `--jars` option
+will be automatically transferred to the cluster. Spark uses the following URL scheme to allow
+different strategies for disseminating jars:

 - **file:** - Absolute paths and `file:/` URIs are served by the driver's HTTP file server, and
 every executor pulls the file from the driver HTTP server.
@@ -135,6 +149,9 @@ This can use up a significant amount of space over time and will need to be clea
 is handled automatically, and with Spark standalone, automatic cleanup can be configured with the
 `spark.worker.cleanup.appDataTtl` property.

+For python, the equivalent `--py-files` option can be used to distribute .egg and .zip libraries
+to executors.
+
 # Monitoring

 Each driver program has a web UI, typically on port 4040, that displays information about running
docs/configuration.md

Lines changed: 41 additions & 23 deletions
@@ -5,35 +5,51 @@ title: Spark Configuration

 Spark provides three locations to configure the system:

-* [Spark properties](#spark-properties) control most application parameters and can be set by passing
-a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object to SparkContext, or through Java
-system properties.
+* [Spark properties](#spark-properties) control most application parameters and can be set by
+passing a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object to SparkContext,
+or through the `conf/spark-defaults.conf` properties file.
 * [Environment variables](#environment-variables) can be used to set per-machine settings, such as
 the IP address, through the `conf/spark-env.sh` script on each node.
 * [Logging](#configuring-logging) can be configured through `log4j.properties`.


 # Spark Properties

-Spark properties control most application settings and are configured separately for each application.
-The preferred way to set them is by passing a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf)
-class to your SparkContext constructor.
-Alternatively, Spark will also load them from Java system properties, for compatibility with old versions
-of Spark.
-
-SparkConf lets you configure most of the common properties to initialize a cluster (e.g., master URL and
-application name), as well as arbitrary key-value pairs through the `set()` method. For example, we could
-initialize an application as follows:
+Spark properties control most application settings and are configured separately for each
+application. The preferred way is to set them through
+[SparkConf](api/scala/index.html#org.apache.spark.SparkConf) and passing it as an argument to your
+SparkContext. SparkConf allows you to configure most of the common properties to initialize a
+cluster (e.g. master URL and application name), as well as arbitrary key-value pairs through the
+`set()` method. For example, we could initialize an application as follows:

 {% highlight scala %}
-val conf = new SparkConf().
-setMaster("local").
-setAppName("My application").
-set("spark.executor.memory", "1g")
+val conf = new SparkConf
+.setMaster("local")
+.setAppName("CountingSheep")
+.set("spark.executor.memory", "1g")
 val sc = new SparkContext(conf)
 {% endhighlight %}

-Most of the properties control internal settings that have reasonable default values. However,
+## Loading Default Configurations
+
+In the case of `spark-shell`, a SparkContext has already been created for you, so you cannot control
+the configuration properties through SparkConf. However, you can still set configuration properties
+through a default configuration file. By default, `spark-shell` (and more generally `spark-submit`)
+will read configuration options from `conf/spark-defaults.conf`, in which each line consists of a
+key and a value separated by whitespace. For example,
+
+spark.master spark://5.6.7.8:7077
+spark.executor.memory 512m
+spark.eventLog.enabled true
+spark.serializer org.apache.spark.serializer.KryoSerializer
+
+Any values specified in the file will be passed on to the application, and merged with those
+specified through SparkConf. If the same configuration property exists in both `spark-defaults.conf`
+and SparkConf, then the latter will take precedence as it is the most application-specific.
+
+## All Configuration Properties
+
+Most of the properties that control internal settings have reasonable default values. However,
 there are at least five properties that you will commonly want to control:

 <table class="table">
@@ -101,9 +117,9 @@ Apart from these, the following properties are also available, and may be useful
 <td>spark.default.parallelism</td>
 <td>
 <ul>
+<li>Local mode: number of cores on the local machine</li>
 <li>Mesos fine grained mode: 8</li>
-<li>Local mode: core number of the local machine</li>
-<li>Others: total core number of all executor nodes or 2, whichever is larger</li>
+<li>Others: total number of cores on all executor nodes or 2, whichever is larger</li>
 </ul>
 </td>
 <td>
@@ -187,7 +203,7 @@ Apart from these, the following properties are also available, and may be useful
 Comma separated list of filter class names to apply to the Spark web ui. The filter should be a
 standard javax servlet Filter. Parameters to each filter can also be specified by setting a
 java system property of spark.&lt;class name of filter&gt;.params='param1=value1,param2=value2'
-(e.g.-Dspark.ui.filters=com.test.filter1 -Dspark.com.test.filter1.params='param1=foo,param2=testing')
+(e.g. -Dspark.ui.filters=com.test.filter1 -Dspark.com.test.filter1.params='param1=foo,param2=testing')
 </td>
 </tr>
 <tr>
@@ -696,7 +712,9 @@ Apart from these, the following properties are also available, and may be useful
 ## Viewing Spark Properties

 The application web UI at `http://<driver>:4040` lists Spark properties in the "Environment" tab.
-This is a useful place to check to make sure that your properties have been set correctly.
+This is a useful place to check to make sure that your properties have been set correctly. Note
+that only values explicitly specified through either `spark-defaults.conf` or SparkConf will
+appear. For all other configuration properties, you can assume the default value is used.

 # Environment Variables

@@ -714,8 +732,8 @@ The following variables can be set in `spark-env.sh`:
 * `PYSPARK_PYTHON`, the Python binary to use for PySpark
 * `SPARK_LOCAL_IP`, to configure which IP address of the machine to bind to.
 * `SPARK_PUBLIC_DNS`, the hostname your Spark program will advertise to other machines.
-* Options for the Spark [standalone cluster scripts](spark-standalone.html#cluster-launch-scripts), such as number of cores
-to use on each machine and maximum memory.
+* Options for the Spark [standalone cluster scripts](spark-standalone.html#cluster-launch-scripts),
+such as number of cores to use on each machine and maximum memory.

 Since `spark-env.sh` is a shell script, some of these can be set programmatically -- for example, you might
 compute `SPARK_LOCAL_IP` by looking up the IP of a specific network interface.
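As a rough illustration of the precedence rules described above (the property values below are placeholders): defaults written to `conf/spark-defaults.conf` are picked up by `spark-shell` and `spark-submit`, anything set explicitly on a SparkConf in the application overrides them, and the effective values can be checked in the web UI's "Environment" tab.

    # Append illustrative defaults (placeholder values) to the properties file
    echo "spark.master            spark://5.6.7.8:7077" >> conf/spark-defaults.conf
    echo "spark.executor.memory   512m"                 >> conf/spark-defaults.conf

    # spark-shell picks these up automatically; verify them in the "Environment"
    # tab of the driver web UI at http://<driver>:4040.
    ./bin/spark-shell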

docs/hadoop-third-party-distributions.md

Lines changed: 10 additions & 4 deletions
@@ -9,12 +9,14 @@ with these distributions:

 # Compile-time Hadoop Version

-When compiling Spark, you'll need to
-[set the SPARK_HADOOP_VERSION flag](index.html#a-note-about-hadoop-versions):
+When compiling Spark, you'll need to specify the Hadoop version by defining the `hadoop.version`
+property. For certain versions, you will need to specify additional profiles. For more detail,
+see the guide on [building with maven](building-with-maven.html#specifying-the-hadoop-version):

-SPARK_HADOOP_VERSION=1.0.4 sbt/sbt assembly
+mvn -Dhadoop.version=1.0.4 -DskipTests clean package
+mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package

-The table below lists the corresponding `SPARK_HADOOP_VERSION` code for each CDH/HDP release. Note that
+The table below lists the corresponding `hadoop.version` code for each CDH/HDP release. Note that
 some Hadoop releases are binary compatible across client versions. This means the pre-built Spark
 distribution may "just work" without you needing to compile. That said, we recommend compiling with
 the _exact_ Hadoop version you are running to avoid any compatibility errors.
@@ -46,6 +48,10 @@ the _exact_ Hadoop version you are running to avoid any compatibility errors.
 </tr>
 </table>

+In SBT, the equivalent can be achieved by setting the SPARK_HADOOP_VERSION flag:
+
+SPARK_HADOOP_VERSION=1.0.4 sbt/sbt assembly
+
 # Linking Applications to the Hadoop Version

 In addition to compiling Spark itself against the right version, you need to add a Maven dependency on that
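Putting the two build paths above side by side, with `<vendor-hadoop-version>` standing in for whichever `hadoop.version` string the table lists for your CDH/HDP release (a sketch only):

    # Maven (the documented default)
    mvn -Dhadoop.version=<vendor-hadoop-version> -DskipTests clean package

    # SBT equivalent
    SPARK_HADOOP_VERSION=<vendor-hadoop-version> sbt/sbt assembly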

docs/index.md

Lines changed: 22 additions & 12 deletions
@@ -24,21 +24,31 @@ right version of Scala from [scala-lang.org](http://www.scala-lang.org/download/

 # Running the Examples and Shell

-Spark comes with several sample programs. Scala, Java and Python examples are in the `examples/src/main` directory.
-To run one of the Java or Scala sample programs, use `./bin/run-example <class> <params>` in the top-level Spark directory
-(the `bin/run-example` script sets up the appropriate paths and launches that program).
-For example, try `./bin/run-example org.apache.spark.examples.SparkPi local`.
-To run a Python sample program, use `./bin/pyspark <sample-program> <params>`. For example, try `./bin/pyspark ./examples/src/main/python/pi.py local`.
+Spark comes with several sample programs. Scala, Java and Python examples are in the
+`examples/src/main` directory. To run one of the Java or Scala sample programs, use
+`bin/run-example <class> [params]` in the top-level Spark directory. (Behind the scenes, this
+invokes the more general
+[Spark submit script](cluster-overview.html#launching-applications-with-spark-submit) for
+launching applications). For example,

-Each example prints usage help when run with no parameters.
+./bin/run-example SparkPi 10

-Note that all of the sample programs take a `<master>` parameter specifying the cluster URL
-to connect to. This can be a [URL for a distributed cluster](scala-programming-guide.html#master-urls),
-or `local` to run locally with one thread, or `local[N]` to run locally with N threads. You should start by using
-`local` for testing.
+You can also run Spark interactively through modified versions of the Scala shell. This is a
+great way to learn the framework.

-Finally, you can run Spark interactively through modified versions of the Scala shell (`./bin/spark-shell`) or
-Python interpreter (`./bin/pyspark`). These are a great way to learn the framework.
+./bin/spark-shell --master local[2]
+
+The `--master` option specifies the
+[master URL for a distributed cluster](scala-programming-guide.html#master-urls), or `local` to run
+locally with one thread, or `local[N]` to run locally with N threads. You should start by using
+`local` for testing. For a full list of options, run Spark shell with the `--help` option.
+
+Spark also provides a Python interface. To run an example Spark application written in Python, use
+`bin/pyspark <program> [params]`. For example,
+
+./bin/pyspark examples/src/main/python/pi.py local[2] 10
+
+or simply `bin/pyspark` without any arguments to run Spark interactively in a python interpreter.

 # Launching on a Cluster

docs/java-programming-guide.md

Lines changed: 1 addition & 4 deletions
@@ -215,7 +215,4 @@ Spark includes several sample programs using the Java API in
 [`examples/src/main/java`](https://github.com/apache/spark/tree/master/examples/src/main/java/org/apache/spark/examples). You can run them by passing the class name to the
 `bin/run-example` script included in Spark; for example:

-./bin/run-example org.apache.spark.examples.JavaWordCount
-
-Each example program prints usage help when run
-without any arguments.
+./bin/run-example JavaWordCount README.md

docs/python-programming-guide.md

Lines changed: 1 addition & 1 deletion
@@ -164,6 +164,6 @@ some example applications.
 PySpark also includes several sample programs in the [`examples/src/main/python` folder](https://github.com/apache/spark/tree/master/examples/src/main/python).
 You can run them by passing the files to `pyspark`; e.g.:

-./bin/spark-submit examples/src/main/python/wordcount.py
+./bin/spark-submit examples/src/main/python/wordcount.py local[2] README.md

 Each program prints usage help when run without arguments.
