Commit 3cc0649

Detail how to set configurations + remove legacy instructions
This commit removes the section on using org.apache.spark.deploy.Client to launch an application, which is subsumed by the `spark-submit` section immediately preceding it. It also clarifies how Spark configuration properties are set in the 1.0 world. Previously this was unclear, and the necessary details lived on the wrong page (cluster-overview.html) instead of where they belong (configuration.html).
1 parent: 5b7140a

4 files changed: 58 additions, 70 deletions

conf/spark-defaults.conf.template

Lines changed: 1 addition & 0 deletions
@@ -5,3 +5,4 @@
 # spark.master            spark://master:7077
 # spark.eventLog.enabled  true
 # spark.eventLog.dir      hdfs://namenode:8021/directory
+# spark.serializer        org.apache.spark.serializer.KryoSerializer
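For reference, the serializer setting added above can also be applied programmatically through SparkConf instead of the defaults file. A minimal sketch, assuming an application that builds its own SparkConf (the master and application name here are placeholders, not part of the template):

    import org.apache.spark.{SparkConf, SparkContext}

    // Programmatic equivalent of uncommenting the spark.serializer line above.
    // "local" and "KryoExample" are placeholder values.
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("KryoExample")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

Values set this way on SparkConf take precedence over anything in `conf/spark-defaults.conf`.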

docs/cluster-overview.md

Lines changed: 13 additions & 17 deletions
@@ -105,23 +105,19 @@ HADOOP_CONF_DIR=XX ./bin/spark-submit \
 
 ### Loading Configurations from a File
 
-The `spark-submit` script can load default [Spark configuration values](configuration.html) from
-a properties file and pass them on to your application. By default it will read configuration
-options from `conf/spark-defaults.conf`, in which each line consists of a key and a value separated
-by whitespace. For example,
-
-    spark.master            spark://5.6.7.8:7077
-    spark.executor.memory   512m
-    spark.eventLog.enabled  true
-
-Any values specified in the file will be passed on to the application. Loading default Spark
-configurations this way can obviate the need for certain flags to `spark-submit`. For instance,
-if `spark.master` property is set, you can safely omit the `--master` flag from `spark-submit`.
-In general, configuration values explicitly set on a `SparkConf` take the highest precedence,
-then flags passed to `spark-submit`, then values in the defaults file.
-
-If you are ever unclear where configuration options are coming from. fine-grained debugging
-information can be printed by running `spark-submit` with the `--verbose` option.
+The `spark-submit` script can load default [Spark configuration values](configuration.html) from a
+properties file and pass them on to your application. By default it will read configuration options
+from `conf/spark-defaults.conf`. For more detail, see the section on
+[loading default configurations](configuration.html#loading-default-configurations).
+
+Loading default Spark configurations this way can obviate the need for certain flags to
+`spark-submit`. For instance, if the `spark.master` property is set, you can safely omit the
+`--master` flag from `spark-submit`. In general, configuration values explicitly set on a
+`SparkConf` take the highest precedence, then flags passed to `spark-submit`, then values in the
+defaults file.
+
+If you are ever unclear where configuration options are coming from, you can print out fine-grained
+debugging information by running `spark-submit` with the `--verbose` option.
 
 ### Advanced Dependency Management
 When using `spark-submit`, the application jar along with any jars included with the `--jars` option
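To make the precedence and `--verbose` behavior above concrete, here is a hedged sketch: the defaults shown mirror the example in this diff, while the main class and jar name are hypothetical placeholders.

    # Hypothetical conf/spark-defaults.conf
    spark.master            spark://5.6.7.8:7077
    spark.executor.memory   512m
    spark.eventLog.enabled  true

    # spark.master is supplied by the defaults file, so --master can be omitted.
    # --verbose prints fine-grained debugging output about where each value came from.
    ./bin/spark-submit \
      --class my.main.Class \
      --verbose \
      myApp.jar

Flags passed on the command line would still override the file, and values set on SparkConf inside the application override both.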

docs/configuration.md

Lines changed: 40 additions & 22 deletions
@@ -5,35 +5,51 @@ title: Spark Configuration
 
 Spark provides three locations to configure the system:
 
-* [Spark properties](#spark-properties) control most application parameters and can be set by passing
-  a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object to SparkContext, or through Java
-  system properties.
+* [Spark properties](#spark-properties) control most application parameters and can be set by
+  passing a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object to SparkContext,
+  or through the `conf/spark-defaults.conf` properties file.
 * [Environment variables](#environment-variables) can be used to set per-machine settings, such as
   the IP address, through the `conf/spark-env.sh` script on each node.
 * [Logging](#configuring-logging) can be configured through `log4j.properties`.
 
 
 # Spark Properties
 
-Spark properties control most application settings and are configured separately for each application.
-The preferred way to set them is by passing a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf)
-class to your SparkContext constructor.
-Alternatively, Spark will also load them from Java system properties, for compatibility with old versions
-of Spark.
-
-SparkConf lets you configure most of the common properties to initialize a cluster (e.g., master URL and
-application name), as well as arbitrary key-value pairs through the `set()` method. For example, we could
-initialize an application as follows:
+Spark properties control most application settings and are configured separately for each
+application. The preferred way is to set them through
+[SparkConf](api/scala/index.html#org.apache.spark.SparkConf) and passing it as an argument to your
+SparkContext. SparkConf lets you configure most of the common properties to initialize a cluster
+(e.g., master URL and application name), as well as arbitrary key-value pairs through the `set()`
+method. For example, we could initialize an application as follows:
 
 {% highlight scala %}
-val conf = new SparkConf().
-  setMaster("local").
-  setAppName("My application").
-  set("spark.executor.memory", "1g")
+val conf = new SparkConf
+  .setMaster("local")
+  .setAppName("CountingSheep")
+  .set("spark.executor.memory", "1g")
 val sc = new SparkContext(conf)
 {% endhighlight %}
 
-Most of the properties control internal settings that have reasonable default values. However,
+## Loading Default Configurations
+
+In the case of `spark-shell`, a SparkContext has already been created for you, so you cannot control
+the configuration properties through SparkConf. However, you can still set configuration properties
+through a default configuration file. By default, `spark-shell` (and more generally `spark-submit`)
+will read configuration options from `conf/spark-defaults.conf`, in which each line consists of a
+key and a value separated by whitespace. For example,
+
+    spark.master            spark://5.6.7.8:7077
+    spark.executor.memory   512m
+    spark.eventLog.enabled  true
+    spark.serializer        org.apache.spark.serializer.KryoSerializer
+
+Any values specified in the file will be passed on to the application, and merged with those
+specified through SparkConf. If the same configuration property exists in both `spark-defaults.conf`
+and SparkConf, then the latter will take precedence as it is most application-specific.
+
+## All Configuration Properties
+
+Most of the properties that control internal settings have reasonable default values. However,
 there are at least five properties that you will commonly want to control:
 
 <table class="table">
@@ -101,9 +117,9 @@ Apart from these, the following properties are also available, and may be useful
 <td>spark.default.parallelism</td>
 <td>
   <ul>
+    <li>Local mode: number of cores on the local machine</li>
     <li>Mesos fine grained mode: 8</li>
-    <li>Local mode: core number of the local machine</li>
-    <li>Others: total core number of all executor nodes or 2, whichever is larger</li>
+    <li>Others: total number of cores on all executor nodes or 2, whichever is larger</li>
   </ul>
 </td>
 <td>
@@ -696,7 +712,9 @@ Apart from these, the following properties are also available, and may be useful
 ## Viewing Spark Properties
 
 The application web UI at `http://<driver>:4040` lists Spark properties in the "Environment" tab.
-This is a useful place to check to make sure that your properties have been set correctly.
+This is a useful place to check to make sure that your properties have been set correctly. Note
+that only values explicitly specified through either `spark-defaults.conf` or SparkConf will
+appear. For all other configuration properties, you can assume the default value is used.
 
 # Environment Variables
 
@@ -714,8 +732,8 @@ The following variables can be set in `spark-env.sh`:
 * `PYSPARK_PYTHON`, the Python binary to use for PySpark
 * `SPARK_LOCAL_IP`, to configure which IP address of the machine to bind to.
 * `SPARK_PUBLIC_DNS`, the hostname your Spark program will advertise to other machines.
-* Options for the Spark [standalone cluster scripts](spark-standalone.html#cluster-launch-scripts), such as number of cores
-  to use on each machine and maximum memory.
+* Options for the Spark [standalone cluster scripts](spark-standalone.html#cluster-launch-scripts),
+  such as number of cores to use on each machine and maximum memory.
 
 Since `spark-env.sh` is a shell script, some of these can be set programmatically -- for example, you might
 compute `SPARK_LOCAL_IP` by looking up the IP of a specific network interface.
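As a minimal sketch of the merge rule described in the new "Loading Default Configurations" section above, assuming the application is launched with `spark-submit` so that `conf/spark-defaults.conf` (including a `spark.executor.memory 512m` entry and a `spark.master` entry) is loaded; the application name is taken from the example in this diff:

    import org.apache.spark.{SparkConf, SparkContext}

    // Suppose conf/spark-defaults.conf contains: spark.executor.memory 512m
    // Setting the same property explicitly on SparkConf overrides the file value,
    // because SparkConf is the most application-specific source.
    val conf = new SparkConf()
      .setAppName("CountingSheep")
      .set("spark.executor.memory", "1g")
    val sc = new SparkContext(conf)

    // The effective value is "1g"; the 512m from the defaults file is overridden.
    println(sc.getConf.get("spark.executor.memory"))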

docs/spark-standalone.md

Lines changed: 4 additions & 31 deletions
@@ -178,37 +178,10 @@ The spark-submit script provides the most straightforward way to submit a compil
     application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes.
     application-arguments: Arguments passed to the main method of <main-class>
 
-Behind the scenes, this invokes the standalone Client to launch your application, which is also the legacy way to launch your application before Spark 1.0.
-
-    ./bin/spark-class org.apache.spark.deploy.Client launch
-      [client-options] \
-      <master-url> <application-jar> <main-class> \
-      [application-arguments]
-
-    client-options:
-      --memory <count> (amount of memory, in MB, allocated for your driver program)
-      --cores <count> (number of cores allocated for your driver program)
-      --supervise (whether to automatically restart your driver on application or node failure)
-      --verbose (prints increased logging output)
-
-Keep in mind that your driver program will be executed on a remote worker machine. You can control the execution environment in the following ways:
-
-* __Environment variables__: These are captured from the environment within which the client
-  is launched and applied when launching the driver program. These environment variables should be
-  exported in `conf/spark-env.sh`.
-* __Java options__: You can add java options by setting `SPARK_JAVA_OPTS` in the environment in
-  which you launch the submission client. (_Note_: as of Spark 1.0, application specific
-  [Spark configuration properties](configuration.html#spark-properties) should be specified through
-  `conf/spark-defaults.conf` loaded by `spark-submit`.)
-* __Dependencies__: If your application is launched through `spark-submit`, then the application
-  jar is automatically distributed to all worker nodes. Otherwise, you'll need to explicitly add the
-  jar through `sc.addJars`.
-
-Once you submit a driver program, it will appear in the cluster management UI at port 8080 and
-be assigned an identifier. If you'd like to prematurely terminate the program, you can do so as
-follows:
-
-    ./bin/spark-class org.apache.spark.deploy.Client kill <driverId>
+If your application is launched through `spark-submit`, then the application jar is automatically
+distributed to all worker nodes. Otherwise, you'll need to explicitly add the jar through
+`sc.addJars`. To control the application's configuration or execution environment, see
+[Spark Configuration](configuration.html).
 
 # Resource Scheduling
 