3 changes: 2 additions & 1 deletion conf/spark-defaults.conf.template
@@ -2,6 +2,7 @@
# This is useful for setting default environmental settings.

# Example:
# spark.master spark://master:7077
# spark.master spark://master:7077
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer
4 changes: 2 additions & 2 deletions conf/spark-env.sh.template
@@ -30,11 +30,11 @@

# Options for the daemons used in the standalone deploy mode:
# - SPARK_MASTER_IP, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_INSTANCES, to set the number of worker processes per node
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
7 changes: 7 additions & 0 deletions docs/building-with-maven.md
@@ -129,6 +129,13 @@ Java 8 tests are run when -Pjava8-tests profile is enabled, they will run in spi
For these tests to run your system must have a JDK 8 installation.
If you have JDK 8 installed but it is not the system default, you can set JAVA_HOME to point to JDK 8 before running the tests.

## Building for PySpark on YARN ##

PySpark on YARN is only supported if the jar is built with Maven. Further, there is a known problem
with building this assembly jar on Red Hat based operating systems (see SPARK-1753). If you wish to
run PySpark on a YARN cluster whose nodes run a Red Hat based OS, we recommend that you build the
jar elsewhere and then ship it to the cluster. We are investigating the exact cause of this problem.
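
As a concrete sketch (the `yarn` profile and Hadoop 2.2.0 used here are assumptions; substitute the
profiles and version that match your cluster), such an assembly jar can be built with:

    mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package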

## Packaging without Hadoop dependencies for deployment on YARN ##

The assembly jar produced by "mvn package" will, by default, include all of Spark's dependencies, including Hadoop and some of its ecosystem projects. On YARN deployments, this causes multiple versions of these to appear on executor classpaths: the version packaged in the Spark assembly and the version on each node, included with yarn.application.classpath. The "hadoop-provided" profile builds the assembly without including Hadoop-ecosystem projects, like ZooKeeper and Hadoop itself.
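
For example (a sketch; the `hadoop-provided` profile name comes from this section, while the other
profiles and the Hadoop version are assumptions carried over from the section above), the profile
can simply be added to an otherwise normal build:

    mvn -Pyarn -Phadoop-provided -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
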
73 changes: 45 additions & 28 deletions docs/cluster-overview.md
@@ -66,62 +66,76 @@ script as shown here while passing your jar.
For Python, you can use the `pyFiles` argument of SparkContext
or its `addPyFile` method to add `.py`, `.zip` or `.egg` files to be distributed.

### Launching Applications with ./bin/spark-submit
### Launching Applications with Spark submit

Once a user application is bundled, it can be launched using the `spark-submit` script located in
the bin directory. This script takes care of setting up the classpath with Spark and its
dependencies, and can support different cluster managers and deploy modes that Spark supports.
It's usage is
dependencies, and can support different cluster managers and deploy modes that Spark supports:

./bin/spark-submit --class path.to.your.Class [options] <app jar> [app options]
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  ... # other options
  <application-jar> \
  [application-arguments]

When calling `spark-submit`, `[app options]` will be passed along to your application's
main class. To enumerate all options available to `spark-submit` run it with
the `--help` flag. Here are a few examples of common options:
- `main-class`: The entry point for your application (e.g. `org.apache.spark.examples.SparkPi`)
- `master-url`: The URL of the master node (e.g. `spark://23.195.26.187:7077`)
- `deploy-mode`: Whether to deploy this application within the cluster or from an external client (e.g. `client`)
- `application-jar`: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes.
- `application-arguments`: Space-delimited arguments passed to the main method of `<main-class>`, if any

To enumerate all options available to `spark-submit` run it with the `--help` flag. Here are a few
examples of common options:

{% highlight bash %}
# Run application locally
./bin/spark-submit \
--class my.main.ClassName
--class org.apache.spark.examples.SparkPi \
--master local[8] \
my-app.jar
/path/to/examples.jar \
100

# Run on a Spark standalone cluster
./bin/spark-submit \
--class my.main.ClassName
--master spark://mycluster:7077 \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--executor-memory 20G \
--total-executor-cores 100 \
my-app.jar
/path/to/examples.jar \
1000

# Run on a YARN cluster
HADOOP_CONF_DIR=XX /bin/spark-submit \
--class my.main.ClassName
HADOOP_CONF_DIR=XX ./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \ # can also be `yarn-client` for client mode
--executor-memory 20G \
--num-executors 50 \
my-app.jar
/path/to/examples.jar \
1000
{% endhighlight %}

### Loading Configurations from a File

The `spark-submit` script can load default `SparkConf` values from a properties file and pass them
onto your application. By default it will read configuration options from
`conf/spark-defaults.conf`. Any values specified in the file will be passed on to the
application when run. They can obviate the need for certain flags to `spark-submit`: for
instance, if `spark.master` property is set, you can safely omit the
The `spark-submit` script can load default [Spark configuration values](configuration.html) from a
properties file and pass them on to your application. By default it will read configuration options
from `conf/spark-defaults.conf`. For more detail, see the section on
[loading default configurations](configuration.html#loading-default-configurations).

Loading default Spark configurations this way can obviate the need for certain flags to
`spark-submit`. For instance, if the `spark.master` property is set, you can safely omit the
`--master` flag from `spark-submit`. In general, configuration values explicitly set on a
`SparkConf` take the highest precedence, then flags passed to `spark-submit`, then values
in the defaults file.
`SparkConf` take the highest precedence, then flags passed to `spark-submit`, then values in the
defaults file.

If you are ever unclear where configuration options are coming from. fine-grained debugging
information can be printed by adding the `--verbose` option to `./spark-submit`.
If you are ever unclear where configuration options are coming from, you can print out fine-grained
debugging information by running `spark-submit` with the `--verbose` option.
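
As an illustrative sketch (the defaults-file contents and values here are hypothetical), suppose
`conf/spark-defaults.conf` sets `spark.master` and `spark.executor.memory 512m`. The invocation
below can then omit `--master`, overrides the memory setting from the defaults file with a flag,
and prints fine-grained debugging output because of `--verbose`:

    ./bin/spark-submit \
      --verbose \
      --class org.apache.spark.examples.SparkPi \
      --executor-memory 1g \
      /path/to/examples.jar \
      100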

### Advanced Dependency Management
When using `./bin/spark-submit` the app jar along with any jars included with the `--jars` option
will be automatically transferred to the cluster. `--jars` can also be used to distribute .egg and .zip
libraries for Python to executors. Spark uses the following URL scheme to allow different
strategies for disseminating jars:
When using `spark-submit`, the application jar along with any jars included with the `--jars` option
will be automatically transferred to the cluster. Spark uses the following URL scheme to allow
different strategies for disseminating jars:

- **file:** - Absolute paths and `file:/` URIs are served by the driver's HTTP file server, and
every executor pulls the file from the driver HTTP server.
@@ -135,6 +149,9 @@ This can use up a significant amount of space over time and will need to be clea
is handled automatically, and with Spark standalone, automatic cleanup can be configured with the
`spark.worker.cleanup.appDataTtl` property.

For Python, the equivalent `--py-files` option can be used to distribute .egg and .zip libraries
to executors.
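
As a sketch (the script and archive names below are hypothetical), a Python application with
dependencies can be submitted as:

    ./bin/spark-submit \
      --master local[2] \
      --py-files deps.egg,libs.zip \
      my_script.py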

# Monitoring

Each driver program has a web UI, typically on port 4040, that displays information about running
64 changes: 41 additions & 23 deletions docs/configuration.md
@@ -5,35 +5,51 @@ title: Spark Configuration

Spark provides three locations to configure the system:

* [Spark properties](#spark-properties) control most application parameters and can be set by passing
a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object to SparkContext, or through Java
system properties.
* [Spark properties](#spark-properties) control most application parameters and can be set by
passing a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object to SparkContext,
or through the `conf/spark-defaults.conf` properties file.
* [Environment variables](#environment-variables) can be used to set per-machine settings, such as
the IP address, through the `conf/spark-env.sh` script on each node.
* [Logging](#configuring-logging) can be configured through `log4j.properties`.


# Spark Properties

Spark properties control most application settings and are configured separately for each application.
The preferred way to set them is by passing a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf)
class to your SparkContext constructor.
Alternatively, Spark will also load them from Java system properties, for compatibility with old versions
of Spark.

SparkConf lets you configure most of the common properties to initialize a cluster (e.g., master URL and
application name), as well as arbitrary key-value pairs through the `set()` method. For example, we could
initialize an application as follows:
Spark properties control most application settings and are configured separately for each
application. The preferred way is to set them through a
[SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object, passing it as an argument to your
SparkContext. SparkConf allows you to configure most of the common properties to initialize a
cluster (e.g. master URL and application name), as well as arbitrary key-value pairs through the
`set()` method. For example, we could initialize an application as follows:

{% highlight scala %}
val conf = new SparkConf().
setMaster("local").
setAppName("My application").
set("spark.executor.memory", "1g")
val conf = new SparkConf()
.setMaster("local")
.setAppName("CountingSheep")
.set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)
{% endhighlight %}

Most of the properties control internal settings that have reasonable default values. However,
## Loading Default Configurations

In the case of `spark-shell`, a SparkContext has already been created for you, so you cannot control
the configuration properties through SparkConf. However, you can still set configuration properties
through a default configuration file. By default, `spark-shell` (and more generally `spark-submit`)
will read configuration options from `conf/spark-defaults.conf`, in which each line consists of a
key and a value separated by whitespace. For example,

spark.master spark://5.6.7.8:7077
spark.executor.memory 512m
spark.eventLog.enabled true
spark.serializer org.apache.spark.serializer.KryoSerializer

Any values specified in the file will be passed on to the application, and merged with those
specified through SparkConf. If the same configuration property exists in both `spark-defaults.conf`
and SparkConf, then the latter will take precedence as it is the most application-specific.

## All Configuration Properties

Most of the properties that control internal settings have reasonable default values. However,
there are at least five properties that you will commonly want to control:

<table class="table">
@@ -101,9 +117,9 @@ Apart from these, the following properties are also available, and may be useful
<td>spark.default.parallelism</td>
<td>
<ul>
<li>Local mode: number of cores on the local machine</li>
<li>Mesos fine grained mode: 8</li>
<li>Local mode: core number of the local machine</li>
<li>Others: total core number of all executor nodes or 2, whichever is larger</li>
<li>Others: total number of cores on all executor nodes or 2, whichever is larger</li>
</ul>
</td>
<td>
@@ -187,7 +203,7 @@ Apart from these, the following properties are also available, and may be useful
Comma separated list of filter class names to apply to the Spark web ui. The filter should be a
standard javax servlet Filter. Parameters to each filter can also be specified by setting a
java system property of spark.&lt;class name of filter&gt;.params='param1=value1,param2=value2'
(e.g.-Dspark.ui.filters=com.test.filter1 -Dspark.com.test.filter1.params='param1=foo,param2=testing')
(e.g. -Dspark.ui.filters=com.test.filter1 -Dspark.com.test.filter1.params='param1=foo,param2=testing')
</td>
</tr>
<tr>
@@ -696,7 +712,9 @@ Apart from these, the following properties are also available, and may be useful
## Viewing Spark Properties

The application web UI at `http://<driver>:4040` lists Spark properties in the "Environment" tab.
This is a useful place to check to make sure that your properties have been set correctly.
This is a useful place to check to make sure that your properties have been set correctly. Note
that only values explicitly specified through either `spark-defaults.conf` or SparkConf will
appear. For all other configuration properties, you can assume the default value is used.

# Environment Variables

@@ -714,8 +732,8 @@ The following variables can be set in `spark-env.sh`:
* `PYSPARK_PYTHON`, the Python binary to use for PySpark
* `SPARK_LOCAL_IP`, to configure which IP address of the machine to bind to.
* `SPARK_PUBLIC_DNS`, the hostname your Spark program will advertise to other machines.
* Options for the Spark [standalone cluster scripts](spark-standalone.html#cluster-launch-scripts), such as number of cores
to use on each machine and maximum memory.
* Options for the Spark [standalone cluster scripts](spark-standalone.html#cluster-launch-scripts),
such as number of cores to use on each machine and maximum memory.

Since `spark-env.sh` is a shell script, some of these can be set programmatically -- for example, you might
compute `SPARK_LOCAL_IP` by looking up the IP of a specific network interface.
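
A minimal sketch for `conf/spark-env.sh` (this assumes a Linux host and an interface named `eth0`;
both the interface name and the parsing of `ifconfig` output are assumptions to adapt):

    # Bind Spark to the address of a specific network interface
    SPARK_LOCAL_IP=$(/sbin/ifconfig eth0 | grep 'inet ' | awk '{print $2}' | sed 's/addr://')
    export SPARK_LOCAL_IP
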
14 changes: 10 additions & 4 deletions docs/hadoop-third-party-distributions.md
@@ -9,12 +9,14 @@ with these distributions:

# Compile-time Hadoop Version

When compiling Spark, you'll need to
[set the SPARK_HADOOP_VERSION flag](index.html#a-note-about-hadoop-versions):
When compiling Spark, you'll need to specify the Hadoop version by defining the `hadoop.version`
property. For certain versions, you will need to specify additional profiles. For more detail,
see the guide on [building with maven](building-with-maven.html#specifying-the-hadoop-version):

SPARK_HADOOP_VERSION=1.0.4 sbt/sbt assembly
mvn -Dhadoop.version=1.0.4 -DskipTests clean package
mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package

The table below lists the corresponding `SPARK_HADOOP_VERSION` code for each CDH/HDP release. Note that
The table below lists the corresponding `hadoop.version` code for each CDH/HDP release. Note that
some Hadoop releases are binary compatible across client versions. This means the pre-built Spark
distribution may "just work" without you needing to compile. That said, we recommend compiling with
the _exact_ Hadoop version you are running to avoid any compatibility errors.
@@ -46,6 +48,10 @@ the _exact_ Hadoop version you are running to avoid any compatibility errors.
</tr>
</table>

In SBT, the equivalent can be achieved by setting the SPARK_HADOOP_VERSION flag:

SPARK_HADOOP_VERSION=1.0.4 sbt/sbt assembly

# Linking Applications to the Hadoop Version

In addition to compiling Spark itself against the right version, you need to add a Maven dependency on that
34 changes: 22 additions & 12 deletions docs/index.md
@@ -24,21 +24,31 @@ right version of Scala from [scala-lang.org](http://www.scala-lang.org/download/

# Running the Examples and Shell

Spark comes with several sample programs. Scala, Java and Python examples are in the `examples/src/main` directory.
To run one of the Java or Scala sample programs, use `./bin/run-example <class> <params>` in the top-level Spark directory
(the `bin/run-example` script sets up the appropriate paths and launches that program).
For example, try `./bin/run-example org.apache.spark.examples.SparkPi local`.
To run a Python sample program, use `./bin/pyspark <sample-program> <params>`. For example, try `./bin/pyspark ./examples/src/main/python/pi.py local`.
Spark comes with several sample programs. Scala, Java and Python examples are in the
`examples/src/main` directory. To run one of the Java or Scala sample programs, use
`bin/run-example <class> [params]` in the top-level Spark directory. (Behind the scenes, this
invokes the more general
[Spark submit script](cluster-overview.html#launching-applications-with-spark-submit) for
launching applications). For example,

Each example prints usage help when run with no parameters.
./bin/run-example SparkPi 10

Note that all of the sample programs take a `<master>` parameter specifying the cluster URL
to connect to. This can be a [URL for a distributed cluster](scala-programming-guide.html#master-urls),
or `local` to run locally with one thread, or `local[N]` to run locally with N threads. You should start by using
`local` for testing.
You can also run Spark interactively through modified versions of the Scala shell. This is a
great way to learn the framework.

Finally, you can run Spark interactively through modified versions of the Scala shell (`./bin/spark-shell`) or
Python interpreter (`./bin/pyspark`). These are a great way to learn the framework.
./bin/spark-shell --master local[2]

The `--master` option specifies the
[master URL for a distributed cluster](scala-programming-guide.html#master-urls), or `local` to run
locally with one thread, or `local[N]` to run locally with N threads. You should start by using
`local` for testing. For a full list of options, run Spark shell with the `--help` option.

Spark also provides a Python interface. To run an example Spark application written in Python, use
`bin/pyspark <program> [params]`. For example,

./bin/pyspark examples/src/main/python/pi.py local[2] 10

or simply `bin/pyspark` without any arguments to run Spark interactively in a Python interpreter.

# Launching on a Cluster

5 changes: 1 addition & 4 deletions docs/java-programming-guide.md
@@ -215,7 +215,4 @@ Spark includes several sample programs using the Java API in
[`examples/src/main/java`](https://github.com/apache/spark/tree/master/examples/src/main/java/org/apache/spark/examples). You can run them by passing the class name to the
`bin/run-example` script included in Spark; for example:

./bin/run-example org.apache.spark.examples.JavaWordCount

Each example program prints usage help when run
without any arguments.
./bin/run-example JavaWordCount README.md
2 changes: 1 addition & 1 deletion docs/python-programming-guide.md
@@ -164,6 +164,6 @@ some example applications.
PySpark also includes several sample programs in the [`examples/src/main/python` folder](https://github.com/apache/spark/tree/master/examples/src/main/python).
You can run them by passing the files to `pyspark`; e.g.:

./bin/spark-submit examples/src/main/python/wordcount.py
./bin/spark-submit examples/src/main/python/wordcount.py local[2] README.md

Each program prints usage help when run without arguments.