From d5fe17b64bed383a26d56f40ad48b3bd314ea9af Mon Sep 17 00:00:00 2001 From: Andrew Or Date: Thu, 8 May 2014 15:19:05 -0700 Subject: [PATCH 01/11] Update the YARN docs --- docs/hadoop-third-party-distributions.md | 11 +++++++---- docs/running-on-yarn.md | 16 ++++++++++++---- 2 files changed, 19 insertions(+), 8 deletions(-) diff --git a/docs/hadoop-third-party-distributions.md b/docs/hadoop-third-party-distributions.md index 454877a7fa8a..2558c3c49e6d 100644 --- a/docs/hadoop-third-party-distributions.md +++ b/docs/hadoop-third-party-distributions.md @@ -9,12 +9,11 @@ with these distributions: # Compile-time Hadoop Version -When compiling Spark, you'll need to -[set the SPARK_HADOOP_VERSION flag](index.html#a-note-about-hadoop-versions): +When compiling Spark, you'll need to specify the Hadoop version by [defining the hadoop.version property](building-with-maven.html): - SPARK_HADOOP_VERSION=1.0.4 sbt/sbt assembly + mvn -Dhadoop.version=1.0.4 -DskipTests clean package -The table below lists the corresponding `SPARK_HADOOP_VERSION` code for each CDH/HDP release. Note that +The table below lists the corresponding `hadoop.version` code for each CDH/HDP release. Note that some Hadoop releases are binary compatible across client versions. This means the pre-built Spark distribution may "just work" without you needing to compile. That said, we recommend compiling with the _exact_ Hadoop version you are running to avoid any compatibility errors. @@ -46,6 +45,10 @@ the _exact_ Hadoop version you are running to avoid any compatibility errors. +In SBT, the equivalent can be achieved by setting the SPARK_HADOOP_VERSION flag: + + SPARK_HADOOP_VERSION=1.0.4 sbt/sbt assembly + # Linking Applications to the Hadoop Version In addition to compiling Spark itself against the right version, you need to add a Maven dependency on that diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md index 68183ee8b461..c3a56e676f62 100644 --- a/docs/running-on-yarn.md +++ b/docs/running-on-yarn.md @@ -43,18 +43,19 @@ Unlike in Spark standalone and Mesos mode, in which the master's address is spec To launch a Spark application in yarn-cluster mode: - ./bin/spark-submit --class path.to.your.Class --master yarn-cluster [options] [app options] + ./bin/spark-submit --class path.to.your.Class --master yarn-cluster --deploy-mode cluster [options] [app options] For example: $ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \ --master yarn-cluster \ + --deploy-mode cluster \ --num-executors 3 \ --driver-memory 4g \ --executor-memory 2g \ --executor-cores 1 examples/target/scala-{{site.SCALA_BINARY_VERSION}}/spark-examples-assembly-{{site.SPARK_VERSION}}.jar \ - yarn-cluster 5 + 10 The above starts a YARN client program which starts the default Application Master. Then SparkPi will be run as a child thread of Application Master. The client will periodically poll the Application Master for status updates and display them in the console. The client will exit once your application has finished running. Refer to the "Viewing Logs" section below for how to see driver and executor logs. @@ -68,11 +69,12 @@ In yarn-cluster mode, the driver runs on a different machine than the client, so $ ./bin/spark-submit --class my.main.Class \ --master yarn-cluster \ + --deploy-mode cluster \ --jars my-other-jar.jar,my-other-other-jar.jar my-main-jar.jar - yarn-cluster 5 + [app arguments] -# Viewing logs +# Debugging your Application In YARN terminology, executors and application masters run inside "containers". 
YARN has two modes for handling container logs after an application has completed. If log aggregation is turned on (with the yarn.log-aggregation-enable config), container logs are copied to HDFS and deleted on the local machine. These logs can be viewed from anywhere on the cluster with the "yarn logs" command. @@ -82,6 +84,12 @@ will print out the contents of all log files from all containers from the given When log aggregation isn't turned on, logs are retained locally on each machine under YARN_APP_LOGS_DIR, which is usually configured to /tmp/logs or $HADOOP_HOME/logs/userlogs depending on the Hadoop version and installation. Viewing logs for a container requires going to the host that contains them and looking in this directory. Subdirectories organize log files by application ID and container ID. +To review per container launch environment, increase yarn.nodemanager.delete.debug-delay-sec to a +large value (e.g. 36000), and then access the application cache through yarn.nodemanager.local-dirs +on the nodes on which containers are launched. This directory contains the launch script, jars, and +all environment variables used for launching each container. This process is useful for debugging +classpath problems in particular. + # Important notes - Before Hadoop 2.2, YARN does not support cores in container resource requests. Thus, when running against an earlier version, the numbers of cores given via command line arguments cannot be passed to YARN. Whether core requests are honored in scheduling decisions depends on which scheduler is in use and how it is configured. From 924f04cbc3801fcc915e7bd49ed64e34e509f535 Mon Sep 17 00:00:00 2001 From: Andrew Or Date: Thu, 8 May 2014 16:12:56 -0700 Subject: [PATCH 02/11] Revert addition of --deploy-mode --- docs/running-on-yarn.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md index c3a56e676f62..ebaae115ffc4 100644 --- a/docs/running-on-yarn.md +++ b/docs/running-on-yarn.md @@ -43,13 +43,12 @@ Unlike in Spark standalone and Mesos mode, in which the master's address is spec To launch a Spark application in yarn-cluster mode: - ./bin/spark-submit --class path.to.your.Class --master yarn-cluster --deploy-mode cluster [options] [app options] + ./bin/spark-submit --class path.to.your.Class --master yarn-cluster [options] [app options] For example: $ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \ --master yarn-cluster \ - --deploy-mode cluster \ --num-executors 3 \ --driver-memory 4g \ --executor-memory 2g \ @@ -69,10 +68,9 @@ In yarn-cluster mode, the driver runs on a different machine than the client, so $ ./bin/spark-submit --class my.main.Class \ --master yarn-cluster \ - --deploy-mode cluster \ --jars my-other-jar.jar,my-other-other-jar.jar my-main-jar.jar - [app arguments] + app_arg1 app_arg2 # Debugging your Application From 757c1842dfb83cb07fb22b16661af4b2b1aa3365 Mon Sep 17 00:00:00 2001 From: Andrew Or Date: Fri, 9 May 2014 12:11:49 -0700 Subject: [PATCH 03/11] Add a note about the requirements for the debugging trick --- docs/running-on-yarn.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md index be1488247d66..1331453ba602 100644 --- a/docs/running-on-yarn.md +++ b/docs/running-on-yarn.md @@ -82,11 +82,12 @@ will print out the contents of all log files from all containers from the given When log aggregation isn't turned on, logs are retained locally on each machine under 
YARN_APP_LOGS_DIR, which is usually configured to /tmp/logs or $HADOOP_HOME/logs/userlogs depending on the Hadoop version and installation. Viewing logs for a container requires going to the host that contains them and looking in this directory. Subdirectories organize log files by application ID and container ID. -To review per container launch environment, increase yarn.nodemanager.delete.debug-delay-sec to a +To review per-container launch environment, increase yarn.nodemanager.delete.debug-delay-sec to a large value (e.g. 36000), and then access the application cache through yarn.nodemanager.local-dirs on the nodes on which containers are launched. This directory contains the launch script, jars, and all environment variables used for launching each container. This process is useful for debugging -classpath problems in particular. +classpath problems in particular. (Note that enabling this requires admin privileges on cluster +settings and a restart of all node managers. Thus, this is not applicable to hosted clusters). # Important notes From 381fe32606491fdc5ea1fc73a653034cbd76c0b5 Mon Sep 17 00:00:00 2001 From: Andrew Or Date: Fri, 9 May 2014 13:51:09 -0700 Subject: [PATCH 04/11] Update docs for standalone mode --- conf/spark-env.sh.template | 4 +-- docs/spark-standalone.md | 58 ++++++++++++++++++++++++++------------ 2 files changed, 42 insertions(+), 20 deletions(-) diff --git a/conf/spark-env.sh.template b/conf/spark-env.sh.template index f906be611a93..5ec35ffed633 100755 --- a/conf/spark-env.sh.template +++ b/conf/spark-env.sh.template @@ -30,11 +30,11 @@ # Options for the daemons used in the standalone deploy mode: # - SPARK_MASTER_IP, to bind the master to a different IP address or hostname -# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports +# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master # - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y") # - SPARK_WORKER_CORES, to set the number of cores to use on this machine # - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g) -# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT +# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker # - SPARK_WORKER_INSTANCES, to set the number of worker processes per node # - SPARK_WORKER_DIR, to set the working directory of worker processes # - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y") diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md index dc7f206e0399..d1ffdc434124 100644 --- a/docs/spark-standalone.md +++ b/docs/spark-standalone.md @@ -70,7 +70,7 @@ Once you've set up this file, you can launch or stop your cluster with the follo - `sbin/start-slaves.sh` - Starts a slave instance on each machine specified in the `conf/slaves` file. - `sbin/start-all.sh` - Starts both a master and a number of slaves as described above. - `sbin/stop-master.sh` - Stops the master that was started via the `bin/start-master.sh` script. -- `sbin/stop-slaves.sh` - Stops the slave instances that were started via `bin/start-slaves.sh`. +- `sbin/stop-slaves.sh` - Stops all slave instances the machines specified in the `conf/slaves` file. - `sbin/stop-all.sh` - Stops both the master and the slaves as described above. Note that these scripts must be executed on the machine you want to run the Spark master on, not your local machine. 
@@ -92,12 +92,8 @@ You can optionally configure the cluster further by setting environment variable Port for the master web UI (default: 8080). - SPARK_WORKER_PORT - Start the Spark worker on a specific port (default: random). - - - SPARK_WORKER_DIR - Directory to run applications in, which will include both logs and scratch space (default: SPARK_HOME/work). + SPARK_MASTER_OPTS + Configuration properties that apply only to the master in the form "-Dx=y" (default: none). SPARK_WORKER_CORES @@ -107,6 +103,10 @@ You can optionally configure the cluster further by setting environment variable SPARK_WORKER_MEMORY Total amount of memory to allow Spark applications to use on the machine, e.g. 1000m, 2g (default: total memory minus 1 GB); note that each application's individual memory is configured using its spark.executor.memory property. + + SPARK_WORKER_PORT + Start the Spark worker on a specific port (default: random). + SPARK_WORKER_WEBUI_PORT Port for the worker web UI (default: 8081). @@ -120,13 +120,25 @@ You can optionally configure the cluster further by setting environment variable or else each worker will try to use all the cores. + + SPARK_WORKER_DIR + Directory to run applications in, which will include both logs and scratch space (default: SPARK_HOME/work). + + + SPARK_WORKER_OPTS + Configuration properties that apply only to the worker in the form "-Dx=y" (default: none). + SPARK_DAEMON_MEMORY Memory to allocate to the Spark master and worker daemons themselves (default: 512m). SPARK_DAEMON_JAVA_OPTS - JVM options for the Spark master and worker daemons themselves (default: none). + JVM options for the Spark master and worker daemons themselves in the form "-Dx=y" (default: none). + + + SPARK_PUBLIC_DNS + The public DNS name of the Spark master and workers (default: none). @@ -150,20 +162,30 @@ You can also pass an option `--cores ` to control the number of cores Spark supports two deploy modes. Spark applications may run with the driver inside the client process or entirely inside the cluster. -The spark-submit script described in the [cluster mode overview](cluster-overview.html) provides the most straightforward way to submit a compiled Spark application to the cluster in either deploy mode. For info on the lower-level invocations used to launch an app inside the cluster, read ahead. +The spark-submit script provides the most straightforward way to submit a compiled Spark application to the cluster in either deploy mode. For more detail, see the [cluster mode overview](cluster-overview.html). + + ./bin/spark-submit \ + --class + --master \ + --deploy-mode \ + ... // other options + + [application-arguments] -## Launching Applications Inside the Cluster + main-class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi) + master-url: The URL of the master node (e.g. spark://23.195.26.187:7077) + deploy-mode: Whether to deploy this application within the cluster or from an external client (e.g. client) + application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes. + application-arguments: Arguments passed to the main method of + +Behind the scenes, this invokes the standalone Client to launch your application, which is also the legacy way to launch your application before Spark 1.0. 
./bin/spark-class org.apache.spark.deploy.Client launch [client-options] \ - \ - [application-options] - - cluster-url: The URL of the master node. - application-jar-url: Path to a bundled jar including your application and all dependencies. Currently, the URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes. - main-class: The entry point for your application. + \ + [application-arguments] - Client Options: + client-options: --memory (amount of memory, in MB, allocated for your driver program) --cores (number of cores allocated for your driver program) --supervise (whether to automatically restart your driver on application or node failure) @@ -172,7 +194,7 @@ The spark-submit script described in the [cluster mode overview](cluster-overvie Keep in mind that your driver program will be executed on a remote worker machine. You can control the execution environment in the following ways: * _Environment variables_: These will be captured from the environment in which you launch the client and applied when launching the driver program. - * _Java options_: You can add java options by setting `SPARK_JAVA_OPTS` in the environment in which you launch the submission client. + * _Java options_: You can add java options by setting `SPARK_JAVA_OPTS` in the environment in which you launch the submission client. (Note: as of Spark 1.0, spark options should be specified through `conf/spark-defaults.conf`, which is only loaded through spark-submit.) * _Dependencies_: You'll still need to call `sc.addJar` inside of your program to make your bundled application jar visible on all worker nodes. Once you submit a driver program, it will appear in the cluster management UI at port 8080 and From 85a51fc68cb5ac68c3e343b80829bd4e8c16fd3b Mon Sep 17 00:00:00 2001 From: Andrew Or Date: Fri, 9 May 2014 19:04:18 -0700 Subject: [PATCH 05/11] Update run-example, spark-shell, configuration etc. --- docs/cluster-overview.md | 38 ++++++++++++++---------- docs/hadoop-third-party-distributions.md | 5 +++- docs/index.md | 23 +++++++------- docs/java-programming-guide.md | 5 +--- docs/python-programming-guide.md | 2 +- docs/quick-start.md | 4 ++- docs/running-on-yarn.md | 2 +- docs/scala-programming-guide.md | 13 ++++---- docs/spark-standalone.md | 19 ++++++++---- 9 files changed, 67 insertions(+), 44 deletions(-) diff --git a/docs/cluster-overview.md b/docs/cluster-overview.md index 162c415b5883..207ae7bd4918 100644 --- a/docs/cluster-overview.md +++ b/docs/cluster-overview.md @@ -66,7 +66,7 @@ script as shown here while passing your jar. For Python, you can use the `pyFiles` argument of SparkContext or its `addPyFile` method to add `.py`, `.zip` or `.egg` files to be distributed. -### Launching Applications with ./bin/spark-submit +### Launching Applications with Spark submit Once a user application is bundled, it can be launched using the `spark-submit` script located in the bin directory. This script takes care of setting up the classpath with Spark and its @@ -95,7 +95,7 @@ the `--help` flag. 
Here are a few examples of common options: my-app.jar # Run on a YARN cluster -HADOOP_CONF_DIR=XX /bin/spark-submit \ +HADOOP_CONF_DIR=XX ./bin/spark-submit \ --class my.main.ClassName --master yarn-cluster \ # can also be `yarn-client` for client mode --executor-memory 20G \ @@ -105,23 +105,28 @@ HADOOP_CONF_DIR=XX /bin/spark-submit \ ### Loading Configurations from a File -The `spark-submit` script can load default `SparkConf` values from a properties file and pass them -onto your application. By default it will read configuration options from -`conf/spark-defaults.conf`. Any values specified in the file will be passed on to the -application when run. They can obviate the need for certain flags to `spark-submit`: for -instance, if `spark.master` property is set, you can safely omit the -`--master` flag from `spark-submit`. In general, configuration values explicitly set on a -`SparkConf` take the highest precedence, then flags passed to `spark-submit`, then values -in the defaults file. +The `spark-submit` script can load default [Spark configuration values](configuration.html) from +a properties file and pass them on to your application. By default it will read configuration +options from `conf/spark-defaults.conf`, in which each line consists of a key and a value separated +by whitespace. For example, + + spark.master spark://5.6.7.8:7077 + spark.executor.memory 512m + spark.eventLog.enabled true + +Any values specified in the file will be passed on to the application. Loading default Spark +configurations this way can obviate the need for certain flags to `spark-submit`. For instance, +if `spark.master` property is set, you can safely omit the `--master` flag from `spark-submit`. +In general, configuration values explicitly set on a `SparkConf` take the highest precedence, +then flags passed to `spark-submit`, then values in the defaults file. If you are ever unclear where configuration options are coming from. fine-grained debugging -information can be printed by adding the `--verbose` option to `./spark-submit`. +information can be printed by running `spark-submit` with the `--verbose` option. ### Advanced Dependency Management -When using `./bin/spark-submit` the app jar along with any jars included with the `--jars` option -will be automatically transferred to the cluster. `--jars` can also be used to distribute .egg and .zip -libraries for Python to executors. Spark uses the following URL scheme to allow different -strategies for disseminating jars: +When using `spark-submit`, the application jar along with any jars included with the `--jars` option +will be automatically transferred to the cluster. Spark uses the following URL scheme to allow +different strategies for disseminating jars: - **file:** - Absolute paths and `file:/` URIs are served by the driver's HTTP file server, and every executor pulls the file from the driver HTTP server. @@ -135,6 +140,9 @@ This can use up a significant amount of space over time and will need to be clea is handled automatically, and with Spark standalone, automatic cleanup can be configured with the `spark.worker.cleanup.appDataTtl` property. +For python, the equivalent `--py-files` option can be used to distribute .egg and .zip libraries +to executors. 
+ # Monitoring Each driver program has a web UI, typically on port 4040, that displays information about running diff --git a/docs/hadoop-third-party-distributions.md b/docs/hadoop-third-party-distributions.md index 2558c3c49e6d..a0aeab5727bd 100644 --- a/docs/hadoop-third-party-distributions.md +++ b/docs/hadoop-third-party-distributions.md @@ -9,9 +9,12 @@ with these distributions: # Compile-time Hadoop Version -When compiling Spark, you'll need to specify the Hadoop version by [defining the hadoop.version property](building-with-maven.html): +When compiling Spark, you'll need to specify the Hadoop version by defining the `hadoop.version` +property. For certain versions, you will need to specify additional profiles. For more detail, +see the guide on [building with maven](building-with-maven.html#specifying-the-hadoop-version): mvn -Dhadoop.version=1.0.4 -DskipTests clean package + mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package The table below lists the corresponding `hadoop.version` code for each CDH/HDP release. Note that some Hadoop releases are binary compatible across client versions. This means the pre-built Spark diff --git a/docs/index.md b/docs/index.md index a2f1a84371ff..f5ad0eec8766 100644 --- a/docs/index.md +++ b/docs/index.md @@ -25,20 +25,23 @@ right version of Scala from [scala-lang.org](http://www.scala-lang.org/download/ # Running the Examples and Shell Spark comes with several sample programs. Scala, Java and Python examples are in the `examples/src/main` directory. -To run one of the Java or Scala sample programs, use `./bin/run-example ` in the top-level Spark directory -(the `bin/run-example` script sets up the appropriate paths and launches that program). -For example, try `./bin/run-example org.apache.spark.examples.SparkPi local`. -To run a Python sample program, use `./bin/pyspark `. For example, try `./bin/pyspark ./examples/src/main/python/pi.py local`. +To run one of the Java or Scala sample programs, use `bin/run-example [params]` in the top-level Spark directory. For example, -Each example prints usage help when run with no parameters. + ./bin/run-example SparkPi 10 -Note that all of the sample programs take a `` parameter specifying the cluster URL -to connect to. This can be a [URL for a distributed cluster](scala-programming-guide.html#master-urls), +You can also run Spark interactively through modified versions of the Scala shell. This is a great way to learn the framework. + + ./bin/spark-shell --master local[2] + +The `--master` option specifies the [master URL for a distributed cluster](scala-programming-guide.html#master-urls), or `local` to run locally with one thread, or `local[N]` to run locally with N threads. You should start by using -`local` for testing. +`local` for testing. For a full list of options, run Spark shell with the `--help` option. + +If Scala is not your cup of tea, you can also try out Spark using the python interface with `bin/pyspark [params]`. For example, + + ./bin/pyspark examples/src/main/python/pi.py local[2] 10 -Finally, you can run Spark interactively through modified versions of the Scala shell (`./bin/spark-shell`) or -Python interpreter (`./bin/pyspark`). These are a great way to learn the framework. +or simply `bin/pyspark` without any arguments to run Spark interactively in a python interpreter. 
# Launching on a Cluster diff --git a/docs/java-programming-guide.md b/docs/java-programming-guide.md index c34eb28fc06a..943fdd9d019f 100644 --- a/docs/java-programming-guide.md +++ b/docs/java-programming-guide.md @@ -215,7 +215,4 @@ Spark includes several sample programs using the Java API in [`examples/src/main/java`](https://github.com/apache/spark/tree/master/examples/src/main/java/org/apache/spark/examples). You can run them by passing the class name to the `bin/run-example` script included in Spark; for example: - ./bin/run-example org.apache.spark.examples.JavaWordCount - -Each example program prints usage help when run -without any arguments. + ./bin/run-example JavaWordCount README.md diff --git a/docs/python-programming-guide.md b/docs/python-programming-guide.md index 6813963bb080..e8fbcba64b80 100644 --- a/docs/python-programming-guide.md +++ b/docs/python-programming-guide.md @@ -164,6 +164,6 @@ some example applications. PySpark also includes several sample programs in the [`examples/src/main/python` folder](https://github.com/apache/spark/tree/master/examples/src/main/python). You can run them by passing the files to `pyspark`; e.g.: - ./bin/spark-submit examples/src/main/python/wordcount.py + ./bin/spark-submit examples/src/main/python/wordcount.py local[2] README.md Each program prints usage help when run without arguments. diff --git a/docs/quick-start.md b/docs/quick-start.md index 478b790f92e1..a4d01487bb49 100644 --- a/docs/quick-start.md +++ b/docs/quick-start.md @@ -18,7 +18,9 @@ you can download a package for any version of Hadoop. ## Basics Spark's interactive shell provides a simple way to learn the API, as well as a powerful tool to analyze datasets interactively. -Start the shell by running `./bin/spark-shell` in the Spark directory. +Start the shell by running the following in the Spark directory. + + ./bin/spark-shell Spark's primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let's make a new RDD from the text of the README file in the Spark source directory: diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md index 1331453ba602..66c330fdee73 100644 --- a/docs/running-on-yarn.md +++ b/docs/running-on-yarn.md @@ -60,7 +60,7 @@ The above starts a YARN client program which starts the default Application Mast To launch a Spark application in yarn-client mode, do the same, but replace "yarn-cluster" with "yarn-client". To run spark-shell: - $ MASTER=yarn-client ./bin/spark-shell + $ ./bin/spark-shell --master yarn-client ## Adding additional jars diff --git a/docs/scala-programming-guide.md b/docs/scala-programming-guide.md index f25e9cca8852..3ed86e460c01 100644 --- a/docs/scala-programming-guide.md +++ b/docs/scala-programming-guide.md @@ -56,7 +56,7 @@ The `master` parameter is a string specifying a [Spark, Mesos or YARN cluster UR to connect to, or a special "local" string to run in local mode, as described below. `appName` is a name for your application, which will be shown in the cluster web UI. It's also possible to set these variables [using a configuration file](cluster-overview.html#loading-configurations-from-a-file) -which avoids hard-coding the master name in your application. +which avoids hard-coding the master url in your application. In the Spark shell, a special interpreter-aware SparkContext is already created for you, in the variable called `sc`. 
Making your own SparkContext will not work. You can set which master the @@ -74,6 +74,11 @@ Or, to also add `code.jar` to its classpath, use: $ ./bin/spark-shell --master local[4] --jars code.jar {% endhighlight %} +For a complete list of options, run Spark shell with the `--help` option. Behind the scenes, +Spark shell invokes the more general [Spark submit script](cluster-overview.html#launching-applications-with-spark-submit) +used for launching applications, and passes on all of its parameters. As a result, these two scripts +share the same parameters. + ### Master URLs The master URL passed to Spark can be in one of the following formats: @@ -98,7 +103,7 @@ cluster mode. The cluster location will be inferred based on the local Hadoop co -If no master URL is specified, the spark shell defaults to "local[*]". +If no master URL is specified, the spark shell defaults to `local[*]`. # Resilient Distributed Datasets (RDDs) @@ -432,9 +437,7 @@ res2: Int = 10 You can see some [example Spark programs](http://spark.apache.org/examples.html) on the Spark website. In addition, Spark includes several samples in `examples/src/main/scala`. Some of them have both Spark versions and local (non-parallel) versions, allowing you to see what had to be changed to make the program run on a cluster. You can run them using by passing the class name to the `bin/run-example` script included in Spark; for example: - ./bin/run-example org.apache.spark.examples.SparkPi - -Each example program prints usage help when run without any arguments. + ./bin/run-example SparkPi For help on optimizing your program, the [configuration](configuration.html) and [tuning](tuning.html) guides provide information on best practices. They are especially important for diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md index d1ffdc434124..d498d7ebfb86 100644 --- a/docs/spark-standalone.md +++ b/docs/spark-standalone.md @@ -193,13 +193,20 @@ Behind the scenes, this invokes the standalone Client to launch your application Keep in mind that your driver program will be executed on a remote worker machine. You can control the execution environment in the following ways: - * _Environment variables_: These will be captured from the environment in which you launch the client and applied when launching the driver program. - * _Java options_: You can add java options by setting `SPARK_JAVA_OPTS` in the environment in which you launch the submission client. (Note: as of Spark 1.0, spark options should be specified through `conf/spark-defaults.conf`, which is only loaded through spark-submit.) - * _Dependencies_: You'll still need to call `sc.addJar` inside of your program to make your bundled application jar visible on all worker nodes. + * __Environment variables__: These are captured from the environment within which the client + is launched and applied when launching the driver program. These environment variables should be + exported in `conf/spark-env.sh`. + * __Java options__: You can add java options by setting `SPARK_JAVA_OPTS` in the environment in + which you launch the submission client. (_Note_: as of Spark 1.0, application specific + [Spark configuration properties](configuration.html#spark-properties) should be specified through + `conf/spark-defaults.conf` loaded by `spark-submit`.) + * __Dependencies__: If your application is launched through `spark-submit`, then the application + jar is automatically distributed to all worker nodes. Otherwise, you'll need to explicitly add the + jar through `sc.addJars`. 
Once you submit a driver program, it will appear in the cluster management UI at port 8080 and -be assigned an identifier. If you'd like to prematurely terminate the program, you can do so using -the same client: +be assigned an identifier. If you'd like to prematurely terminate the program, you can do so as +follows: ./bin/spark-class org.apache.spark.deploy.Client kill @@ -225,7 +232,7 @@ default for applications that don't set `spark.cores.max` to something less than Do this by adding the following to `conf/spark-env.sh`: {% highlight bash %} -export SPARK_JAVA_OPTS="-Dspark.deploy.defaultCores=" +export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=" {% endhighlight %} This is useful on shared clusters where users might not have configured a maximum number of cores From 3cc0649b3abe00c3b70f64f69b2c3f940a86b768 Mon Sep 17 00:00:00 2001 From: Andrew Or Date: Mon, 12 May 2014 17:35:58 -0700 Subject: [PATCH 06/11] Detail how to set configurations + remove legacy instructions This commit removes the section on using org.apache.spark.deploy.Client to launch an application. This is subsumed by the Spark submit section immediately preceding it. This commit also clarifies how we set the Spark configuration properties in the 1.0 world. Previously it was pretty unclear, and the necessary details were in the wrong page (cluster-overview.html) instead of where it is supposed to be (configuration.html). --- conf/spark-defaults.conf.template | 1 + docs/cluster-overview.md | 30 +++++++-------- docs/configuration.md | 62 ++++++++++++++++++++----------- docs/spark-standalone.md | 35 ++--------------- 4 files changed, 58 insertions(+), 70 deletions(-) diff --git a/conf/spark-defaults.conf.template b/conf/spark-defaults.conf.template index f840ff681d01..9bfec767767a 100644 --- a/conf/spark-defaults.conf.template +++ b/conf/spark-defaults.conf.template @@ -5,3 +5,4 @@ # spark.master spark://master:7077 # spark.eventLog.enabled true # spark.eventLog.dir hdfs://namenode:8021/directory +# spark.serializer org.apache.spark.serializer.KryoSerializer diff --git a/docs/cluster-overview.md b/docs/cluster-overview.md index 207ae7bd4918..c0a2d36f5142 100644 --- a/docs/cluster-overview.md +++ b/docs/cluster-overview.md @@ -105,23 +105,19 @@ HADOOP_CONF_DIR=XX ./bin/spark-submit \ ### Loading Configurations from a File -The `spark-submit` script can load default [Spark configuration values](configuration.html) from -a properties file and pass them on to your application. By default it will read configuration -options from `conf/spark-defaults.conf`, in which each line consists of a key and a value separated -by whitespace. For example, - - spark.master spark://5.6.7.8:7077 - spark.executor.memory 512m - spark.eventLog.enabled true - -Any values specified in the file will be passed on to the application. Loading default Spark -configurations this way can obviate the need for certain flags to `spark-submit`. For instance, -if `spark.master` property is set, you can safely omit the `--master` flag from `spark-submit`. -In general, configuration values explicitly set on a `SparkConf` take the highest precedence, -then flags passed to `spark-submit`, then values in the defaults file. - -If you are ever unclear where configuration options are coming from. fine-grained debugging -information can be printed by running `spark-submit` with the `--verbose` option. +The `spark-submit` script can load default [Spark configuration values](configuration.html) from a +properties file and pass them on to your application. 
By default it will read configuration options +from `conf/spark-defaults.conf`. For more detail, see the section on +[loading default configurations](configuration.html#loading-default-configurations). + +Loading default Spark configurations this way can obviate the need for certain flags to +`spark-submit`. For instance, if the `spark.master` property is set, you can safely omit the +`--master` flag from `spark-submit`. In general, configuration values explicitly set on a +`SparkConf` take the highest precedence, then flags passed to `spark-submit`, then values in the +defaults file. + +If you are ever unclear where configuration options are coming from, you can print out fine-grained +debugging information by running `spark-submit` with the `--verbose` option. ### Advanced Dependency Management When using `spark-submit`, the application jar along with any jars included with the `--jars` option diff --git a/docs/configuration.md b/docs/configuration.md index 5b034e3cb3d4..d0e0e94cf26a 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -5,9 +5,9 @@ title: Spark Configuration Spark provides three locations to configure the system: -* [Spark properties](#spark-properties) control most application parameters and can be set by passing - a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object to SparkContext, or through Java - system properties. +* [Spark properties](#spark-properties) control most application parameters and can be set by + passing a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object to SparkContext, + or through the `conf/spark-defaults.conf` properties file. * [Environment variables](#environment-variables) can be used to set per-machine settings, such as the IP address, through the `conf/spark-env.sh` script on each node. * [Logging](#configuring-logging) can be configured through `log4j.properties`. @@ -15,25 +15,41 @@ Spark provides three locations to configure the system: # Spark Properties -Spark properties control most application settings and are configured separately for each application. -The preferred way to set them is by passing a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) -class to your SparkContext constructor. -Alternatively, Spark will also load them from Java system properties, for compatibility with old versions -of Spark. - -SparkConf lets you configure most of the common properties to initialize a cluster (e.g., master URL and -application name), as well as arbitrary key-value pairs through the `set()` method. For example, we could -initialize an application as follows: +Spark properties control most application settings and are configured separately for each +application. The preferred way is to set them through +[SparkConf](api/scala/index.html#org.apache.spark.SparkConf) and passing it as an argument to your +SparkContext. SparkConf lets you configure most of the common properties to initialize a cluster +(e.g., master URL and application name), as well as arbitrary key-value pairs through the `set()` +method. For example, we could initialize an application as follows: {% highlight scala %} -val conf = new SparkConf(). - setMaster("local"). - setAppName("My application"). - set("spark.executor.memory", "1g") +val conf = new SparkConf + .setMaster("local") + .setAppName("CountingSheep") + .set("spark.executor.memory", "1g") val sc = new SparkContext(conf) {% endhighlight %} -Most of the properties control internal settings that have reasonable default values. 
However, +## Loading Default Configurations + +In the case of `spark-shell`, a SparkContext has already been created for you, so you cannot control +the configuration properties through SparkConf. However, you can still set configuration properties +through a default configuration file. By default, `spark-shell` (and more generally `spark-submit`) +will read configuration options from `conf/spark-defaults.conf`, in which each line consists of a +key and a value separated by whitespace. For example, + + spark.master spark://5.6.7.8:7077 + spark.executor.memory 512m + spark.eventLog.enabled true + spark.serializer org.apache.spark.serializer.KryoSerializer + +Any values specified in the file will be passed on to the application, and merged with those +specified through SparkConf. If the same configuration property exists in both `spark-defaults.conf` +and SparkConf, then the latter will take precedence as it is most application-specific. + +## All Configuration Properties + +Most of the properties that control internal settings have reasonable default values. However, there are at least five properties that you will commonly want to control: @@ -101,9 +117,9 @@ Apart from these, the following properties are also available, and may be useful diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md index 2a8ba9aefff8..eb3211b6b0e4 100644 --- a/docs/spark-standalone.md +++ b/docs/spark-standalone.md @@ -70,7 +70,7 @@ Once you've set up this file, you can launch or stop your cluster with the follo - `sbin/start-slaves.sh` - Starts a slave instance on each machine specified in the `conf/slaves` file. - `sbin/start-all.sh` - Starts both a master and a number of slaves as described above. - `sbin/stop-master.sh` - Stops the master that was started via the `bin/start-master.sh` script. -- `sbin/stop-slaves.sh` - Stops all slave instances the machines specified in the `conf/slaves` file. +- `sbin/stop-slaves.sh` - Stops all slave instances on the machines specified in the `conf/slaves` file. - `sbin/stop-all.sh` - Stops both the master and the slaves as described above. Note that these scripts must be executed on the machine you want to run the Spark master on, not your local machine. From 25cfe7b2fec608db98769c62e09002434e081030 Mon Sep 17 00:00:00 2001 From: Andrew Or Date: Mon, 12 May 2014 18:20:13 -0700 Subject: [PATCH 10/11] Merge in the warning from SPARK-1753 --- docs/building-with-maven.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/docs/building-with-maven.md b/docs/building-with-maven.md index b6dd553bbe06..8b44535d8240 100644 --- a/docs/building-with-maven.md +++ b/docs/building-with-maven.md @@ -129,6 +129,13 @@ Java 8 tests are run when -Pjava8-tests profile is enabled, they will run in spi For these tests to run your system must have a JDK 8 installation. If you have JDK 8 installed but it is not the system default, you can set JAVA_HOME to point to JDK 8 before running the tests. +## Building for PySpark on YARN ## + +PySpark on YARN is only supported if the jar is built with maven. Further, there is a known problem +with building this assembly jar on Red Hat based operating systems (see SPARK-1753). If you wish to +run PySpark on a YARN cluster with Red Hat installed, we recommend that you build the jar elsewhere, +then ship it over to the cluster. We are investigating the exact cause for this. 
+ ## Packaging without Hadoop dependencies for deployment on YARN ## The assembly jar produced by "mvn package" will, by default, include all of Spark's dependencies, including Hadoop and some of its ecosystem projects. On YARN deployments, this causes multiple versions of these to appear on executor classpaths: the version packaged in the Spark assembly and the version on each node, included with yarn.application.classpath. The "hadoop-provided" profile builds the assembly without including Hadoop-ecosystem projects, like ZooKeeper and Hadoop itself. From e2c2312b07120ef09f3a9d1906ed8bc841402893 Mon Sep 17 00:00:00 2001 From: Andrew Or Date: Mon, 12 May 2014 18:50:01 -0700 Subject: [PATCH 11/11] Merge in changes in #752 (SPARK-1814) --- docs/index.md | 19 +++++++++++++------ 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/docs/index.md b/docs/index.md index f5ad0eec8766..48182a27d28a 100644 --- a/docs/index.md +++ b/docs/index.md @@ -24,20 +24,27 @@ right version of Scala from [scala-lang.org](http://www.scala-lang.org/download/ # Running the Examples and Shell -Spark comes with several sample programs. Scala, Java and Python examples are in the `examples/src/main` directory. -To run one of the Java or Scala sample programs, use `bin/run-example [params]` in the top-level Spark directory. For example, +Spark comes with several sample programs. Scala, Java and Python examples are in the +`examples/src/main` directory. To run one of the Java or Scala sample programs, use +`bin/run-example [params]` in the top-level Spark directory. (Behind the scenes, this +invokes the more general +[Spark submit script](cluster-overview.html#launching-applications-with-spark-submit) for +launching applications). For example, ./bin/run-example SparkPi 10 -You can also run Spark interactively through modified versions of the Scala shell. This is a great way to learn the framework. +You can also run Spark interactively through modified versions of the Scala shell. This is a +great way to learn the framework. ./bin/spark-shell --master local[2] -The `--master` option specifies the [master URL for a distributed cluster](scala-programming-guide.html#master-urls), -or `local` to run locally with one thread, or `local[N]` to run locally with N threads. You should start by using +The `--master` option specifies the +[master URL for a distributed cluster](scala-programming-guide.html#master-urls), or `local` to run +locally with one thread, or `local[N]` to run locally with N threads. You should start by using `local` for testing. For a full list of options, run Spark shell with the `--help` option. -If Scala is not your cup of tea, you can also try out Spark using the python interface with `bin/pyspark [params]`. For example, +Spark also provides a Python interface. To run an example Spark application written in Python, use +`bin/pyspark [params]`. For example, ./bin/pyspark examples/src/main/python/pi.py local[2] 10
   spark.default.parallelism
-    • Local mode: core number of the local machine
+    • Local mode: number of cores on the local machine
     • Mesos fine grained mode: 8
-    • Others: total core number of all executor nodes or 2, whichever is larger
+    • Others: total number of cores on all executor nodes or 2, whichever is larger
@@ -696,7 +712,9 @@ Apart from these, the following properties are also available, and may be useful ## Viewing Spark Properties The application web UI at `http://:4040` lists Spark properties in the "Environment" tab. -This is a useful place to check to make sure that your properties have been set correctly. +This is a useful place to check to make sure that your properties have been set correctly. Note +that only values explicitly specified through either `spark-defaults.conf` or SparkConf will +appear. For all other configuration properties, you can assume the default value is used. # Environment Variables @@ -714,8 +732,8 @@ The following variables can be set in `spark-env.sh`: * `PYSPARK_PYTHON`, the Python binary to use for PySpark * `SPARK_LOCAL_IP`, to configure which IP address of the machine to bind to. * `SPARK_PUBLIC_DNS`, the hostname your Spark program will advertise to other machines. -* Options for the Spark [standalone cluster scripts](spark-standalone.html#cluster-launch-scripts), such as number of cores - to use on each machine and maximum memory. +* Options for the Spark [standalone cluster scripts](spark-standalone.html#cluster-launch-scripts), + such as number of cores to use on each machine and maximum memory. Since `spark-env.sh` is a shell script, some of these can be set programmatically -- for example, you might compute `SPARK_LOCAL_IP` by looking up the IP of a specific network interface. diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md index d498d7ebfb86..86aadb27b8d6 100644 --- a/docs/spark-standalone.md +++ b/docs/spark-standalone.md @@ -178,37 +178,10 @@ The spark-submit script provides the most straightforward way to submit a compil application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes. application-arguments: Arguments passed to the main method of -Behind the scenes, this invokes the standalone Client to launch your application, which is also the legacy way to launch your application before Spark 1.0. - - ./bin/spark-class org.apache.spark.deploy.Client launch - [client-options] \ - \ - [application-arguments] - - client-options: - --memory (amount of memory, in MB, allocated for your driver program) - --cores (number of cores allocated for your driver program) - --supervise (whether to automatically restart your driver on application or node failure) - --verbose (prints increased logging output) - -Keep in mind that your driver program will be executed on a remote worker machine. You can control the execution environment in the following ways: - - * __Environment variables__: These are captured from the environment within which the client - is launched and applied when launching the driver program. These environment variables should be - exported in `conf/spark-env.sh`. - * __Java options__: You can add java options by setting `SPARK_JAVA_OPTS` in the environment in - which you launch the submission client. (_Note_: as of Spark 1.0, application specific - [Spark configuration properties](configuration.html#spark-properties) should be specified through - `conf/spark-defaults.conf` loaded by `spark-submit`.) - * __Dependencies__: If your application is launched through `spark-submit`, then the application - jar is automatically distributed to all worker nodes. Otherwise, you'll need to explicitly add the - jar through `sc.addJars`. 
- -Once you submit a driver program, it will appear in the cluster management UI at port 8080 and -be assigned an identifier. If you'd like to prematurely terminate the program, you can do so as -follows: - - ./bin/spark-class org.apache.spark.deploy.Client kill +If your application is launched through `spark-submit`, then the application jar is automatically +distributed to all worker nodes. Otherwise, you'll need to explicitly add the jar through +`sc.addJars`. To control the application's configuration or execution environment, see +[Spark Configuration](configuration.html). # Resource Scheduling From 041017ab61a9a13cdde2c1e8a36725b529346f9e Mon Sep 17 00:00:00 2001 From: Andrew Or Date: Mon, 12 May 2014 17:54:46 -0700 Subject: [PATCH 07/11] Abstract Spark submit documentation to cluster-overview.html --- docs/cluster-overview.md | 39 ++++++++++++++++++++++++++------------- docs/spark-standalone.md | 31 ++++++++++--------------------- 2 files changed, 36 insertions(+), 34 deletions(-) diff --git a/docs/cluster-overview.md b/docs/cluster-overview.md index c0a2d36f5142..f05a755de7fe 100644 --- a/docs/cluster-overview.md +++ b/docs/cluster-overview.md @@ -70,37 +70,50 @@ or its `addPyFile` method to add `.py`, `.zip` or `.egg` files to be distributed Once a user application is bundled, it can be launched using the `spark-submit` script located in the bin directory. This script takes care of setting up the classpath with Spark and its -dependencies, and can support different cluster managers and deploy modes that Spark supports. -It's usage is +dependencies, and can support different cluster managers and deploy modes that Spark supports: - ./bin/spark-submit --class path.to.your.Class [options] [app options] + ./bin/spark-submit \ + --class + --master \ + --deploy-mode \ + ... // other options + + [application-arguments] -When calling `spark-submit`, `[app options]` will be passed along to your application's -main class. To enumerate all options available to `spark-submit` run it with -the `--help` flag. Here are a few examples of common options: + main-class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi) + master-url: The URL of the master node (e.g. spark://23.195.26.187:7077) + deploy-mode: Whether to deploy this application within the cluster or from an external client (e.g. client) + application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes. + application-arguments: Space delimited arguments passed to the main method of , if any + +To enumerate all options available to `spark-submit` run it with the `--help` flag. 
Here are a few +examples of common options: {% highlight bash %} # Run application locally ./bin/spark-submit \ - --class my.main.ClassName + --class org.apache.spark.examples.SparkPi --master local[8] \ - my-app.jar + /path/to/examples.jar \ + 100 # Run on a Spark standalone cluster ./bin/spark-submit \ - --class my.main.ClassName - --master spark://mycluster:7077 \ + --class org.apache.spark.examples.SparkPi + --master spark://207.184.161.138:7077 \ --executor-memory 20G \ --total-executor-cores 100 \ - my-app.jar + /path/to/examples.jar \ + 1000 # Run on a YARN cluster HADOOP_CONF_DIR=XX ./bin/spark-submit \ - --class my.main.ClassName + --class org.apache.spark.examples.SparkPi --master yarn-cluster \ # can also be `yarn-client` for client mode --executor-memory 20G \ --num-executors 50 \ - my-app.jar + /path/to/examples.jar \ + 1000 {% endhighlight %} ### Loading Configurations from a File diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md index 86aadb27b8d6..2a8ba9aefff8 100644 --- a/docs/spark-standalone.md +++ b/docs/spark-standalone.md @@ -160,27 +160,16 @@ You can also pass an option `--cores ` to control the number of cores # Launching Compiled Spark Applications -Spark supports two deploy modes. Spark applications may run with the driver inside the client process or entirely inside the cluster. - -The spark-submit script provides the most straightforward way to submit a compiled Spark application to the cluster in either deploy mode. For more detail, see the [cluster mode overview](cluster-overview.html). - - ./bin/spark-submit \ - --class - --master \ - --deploy-mode \ - ... // other options - - [application-arguments] - - main-class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi) - master-url: The URL of the master node (e.g. spark://23.195.26.187:7077) - deploy-mode: Whether to deploy this application within the cluster or from an external client (e.g. client) - application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes. - application-arguments: Arguments passed to the main method of - -If your application is launched through `spark-submit`, then the application jar is automatically -distributed to all worker nodes. Otherwise, you'll need to explicitly add the jar through -`sc.addJars`. To control the application's configuration or execution environment, see +Spark supports two deploy modes: applications may run with the driver inside the client process or +entirely inside the cluster. The +[Spark submit script](cluster-overview.html#launching-applications-with-spark-submit) provides the +most straightforward way to submit a compiled Spark application to the cluster in either deploy +mode. + +If your application is launched through Spark submit, then the application jar is automatically +distributed to all worker nodes. For any additional jars that your application depends on, you +should specify them through the `--jars` flag using comma as a delimiter (e.g. `--jars jar1,jar2`). +To control the application's configuration or execution environment, see [Spark Configuration](configuration.html). 
# Resource Scheduling From 336bbd9d660cd2a486b4bbe526eb4a8beaee211d Mon Sep 17 00:00:00 2001 From: Andrew Or Date: Mon, 12 May 2014 17:56:28 -0700 Subject: [PATCH 08/11] Tabs -> spaces --- conf/spark-defaults.conf.template | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/conf/spark-defaults.conf.template b/conf/spark-defaults.conf.template index 9bfec767767a..2779342769c1 100644 --- a/conf/spark-defaults.conf.template +++ b/conf/spark-defaults.conf.template @@ -2,7 +2,7 @@ # This is useful for setting default environmental settings. # Example: -# spark.master spark://master:7077 +# spark.master spark://master:7077 # spark.eventLog.enabled true # spark.eventLog.dir hdfs://namenode:8021/directory # spark.serializer org.apache.spark.serializer.KryoSerializer From a8c39c539a7030cc0ad65f763cfccfbe964a3aa8 Mon Sep 17 00:00:00 2001 From: Andrew Or Date: Mon, 12 May 2014 18:18:24 -0700 Subject: [PATCH 09/11] Minor changes --- docs/configuration.md | 10 +++++----- docs/spark-standalone.md | 2 +- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/configuration.md b/docs/configuration.md index d0e0e94cf26a..2eed96f704a4 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -18,9 +18,9 @@ Spark provides three locations to configure the system: Spark properties control most application settings and are configured separately for each application. The preferred way is to set them through [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) and passing it as an argument to your -SparkContext. SparkConf lets you configure most of the common properties to initialize a cluster -(e.g., master URL and application name), as well as arbitrary key-value pairs through the `set()` -method. For example, we could initialize an application as follows: +SparkContext. SparkConf allows you to configure most of the common properties to initialize a +cluster (e.g. master URL and application name), as well as arbitrary key-value pairs through the +`set()` method. For example, we could initialize an application as follows: {% highlight scala %} val conf = new SparkConf @@ -45,7 +45,7 @@ key and a value separated by whitespace. For example, Any values specified in the file will be passed on to the application, and merged with those specified through SparkConf. If the same configuration property exists in both `spark-defaults.conf` -and SparkConf, then the latter will take precedence as it is most application-specific. +and SparkConf, then the latter will take precedence as it is the most application-specific. ## All Configuration Properties @@ -203,7 +203,7 @@ Apart from these, the following properties are also available, and may be useful Comma separated list of filter class names to apply to the Spark web ui. The filter should be a standard javax servlet Filter. Parameters to each filter can also be specified by setting a java system property of spark.<class name of filter>.params='param1=value1,param2=value2' - (e.g.-Dspark.ui.filters=com.test.filter1 -Dspark.com.test.filter1.params='param1=foo,param2=testing') + (e.g. -Dspark.ui.filters=com.test.filter1 -Dspark.com.test.filter1.params='param1=foo,param2=testing')