docs/spark-standalone.md (40 additions, 18 deletions)
@@ -70,7 +70,7 @@ Once you've set up this file, you can launch or stop your cluster with the follo
 - `sbin/start-slaves.sh` - Starts a slave instance on each machine specified in the `conf/slaves` file.
 - `sbin/start-all.sh` - Starts both a master and a number of slaves as described above.
 - `sbin/stop-master.sh` - Stops the master that was started via the `bin/start-master.sh` script.
-- `sbin/stop-slaves.sh` - Stops the slave instances that were started via `bin/start-slaves.sh`.
+- `sbin/stop-slaves.sh` - Stops all slave instances on the machines specified in the `conf/slaves` file.
 - `sbin/stop-all.sh` - Stops both the master and the slaves as described above.

 Note that these scripts must be executed on the machine you want to run the Spark master on, not your local machine.
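As a quick illustration of how these scripts fit together, here is a minimal sketch that assumes two placeholder worker hosts in `conf/slaves` and drives the whole cluster from the master machine:

    # conf/slaves -- one worker host per line (example hosts, not real machines)
    #   worker1.example.com
    #   worker2.example.com

    # Run on the machine that should host the Spark master:
    ./sbin/start-all.sh   # starts the master plus one worker per entry in conf/slaves

    # Shut the whole cluster down again from the same machine:
    ./sbin/stop-all.sh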
@@ -92,12 +92,8 @@ You can optionally configure the cluster further by setting environment variable
     <td>Port for the master web UI (default: 8080).</td>
   </tr>
   <tr>
-    <td><code>SPARK_WORKER_PORT</code></td>
-    <td>Start the Spark worker on a specific port (default: random).</td>
-  </tr>
-  <tr>
-    <td><code>SPARK_WORKER_DIR</code></td>
-    <td>Directory to run applications in, which will include both logs and scratch space (default: SPARK_HOME/work).</td>
+    <td><code>SPARK_MASTER_OPTS</code></td>
+    <td>Configuration properties that apply only to the master in the form "-Dx=y" (default: none).</td>
   </tr>
   <tr>
     <td><code>SPARK_WORKER_CORES</code></td>
@@ -107,6 +103,10 @@ You can optionally configure the cluster further by setting environment variable
     <td><code>SPARK_WORKER_MEMORY</code></td>
     <td>Total amount of memory to allow Spark applications to use on the machine, e.g. <code>1000m</code>, <code>2g</code> (default: total memory minus 1 GB); note that each application's <i>individual</i> memory is configured using its <code>spark.executor.memory</code> property.</td>
   </tr>
+  <tr>
+    <td><code>SPARK_WORKER_PORT</code></td>
+    <td>Start the Spark worker on a specific port (default: random).</td>
+  </tr>
   <tr>
     <td><code>SPARK_WORKER_WEBUI_PORT</code></td>
     <td>Port for the worker web UI (default: 8081).</td>
@@ -120,13 +120,25 @@ You can optionally configure the cluster further by setting environment variable
     or else each worker will try to use all the cores.
     </td>
   </tr>
+  <tr>
+    <td><code>SPARK_WORKER_DIR</code></td>
+    <td>Directory to run applications in, which will include both logs and scratch space (default: SPARK_HOME/work).</td>
+  </tr>
+  <tr>
+    <td><code>SPARK_WORKER_OPTS</code></td>
+    <td>Configuration properties that apply only to the worker in the form "-Dx=y" (default: none).</td>
+  </tr>
   <tr>
     <td><code>SPARK_DAEMON_MEMORY</code></td>
     <td>Memory to allocate to the Spark master and worker daemons themselves (default: 512m).</td>
   </tr>
   <tr>
     <td><code>SPARK_DAEMON_JAVA_OPTS</code></td>
-    <td>JVM options for the Spark master and worker daemons themselves (default: none).</td>
+    <td>JVM options for the Spark master and worker daemons themselves in the form "-Dx=y" (default: none).</td>
+  </tr>
+  <tr>
+    <td><code>SPARK_PUBLIC_DNS</code></td>
+    <td>The public DNS name of the Spark master and workers (default: none).</td>
   </tr>
 </table>
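To make the table above concrete, a `conf/spark-env.sh` along the following lines would exercise the newly documented variables; the values, paths, and property names are illustrative assumptions rather than recommendations:

    # conf/spark-env.sh -- sourced by the standalone daemons at startup (example values)
    export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4"     # "-Dx=y" properties applied to the master only
    export SPARK_WORKER_PORT=7078                                # fix the worker port instead of picking one at random
    export SPARK_WORKER_DIR=/data/spark/work                     # application logs and scratch space
    export SPARK_WORKER_OPTS="-Dspark.worker.timeout=120"        # "-Dx=y" properties applied to the workers only
    # JVM options shared by the master and worker daemons (assumed recovery settings)
    export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM -Dspark.deploy.recoveryDirectory=/data/spark/recovery"
    export SPARK_PUBLIC_DNS=spark.example.com                    # public DNS name advertised by the master and workers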
@@ -150,20 +162,30 @@ You can also pass an option `--cores <numCores>` to control the number of cores

 Spark supports two deploy modes. Spark applications may run with the driver inside the client process or entirely inside the cluster.

-The spark-submit script described in the [cluster mode overview](cluster-overview.html) provides the most straightforward way to submit a compiled Spark application to the cluster in either deploy mode. For info on the lower-level invocations used to launch an app inside the cluster, read ahead.
+The spark-submit script provides the most straightforward way to submit a compiled Spark application to the cluster in either deploy mode. For more detail, see the [cluster mode overview](cluster-overview.html).
+
+    ./bin/spark-submit \
+      --class <main-class> \
+      --master <master-url> \
+      --deploy-mode <deploy-mode> \
+      ... # other options
+      <application-jar> \
+      [application-arguments]

-## Launching Applications Inside the Cluster
+    main-class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
+    master-url: The URL of the master node (e.g. spark://23.195.26.187:7077)
+    deploy-mode: Whether to deploy this application within the cluster or from an external client (e.g. client)
+    application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes.
+    application-arguments: Arguments passed to the main method of <main-class>
+
+Behind the scenes, this invokes the standalone Client to launch your application, which is also the legacy way to launch your application before Spark 1.0.
+
     ./bin/spark-class org.apache.spark.deploy.Client launch
        [client-options] \
-    application-jar-url: Path to a bundled jar including your application and all dependencies. Currently, the URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes.
-    main-class: The entry point for your application.
+       <master-url> <application-jar> <main-class> \
+       [application-arguments]

-Client Options:
+    client-options:
   --memory <count> (amount of memory, in MB, allocated for your driver program)
   --cores <count> (number of cores allocated for your driver program)
   --supervise (whether to automatically restart your driver on application or node failure)
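For instance, plugging the example values that the text above already uses into the spark-submit template gives an invocation like the following; the jar path and the final argument are placeholders for your own build and application arguments:

    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master spark://23.195.26.187:7077 \
      --deploy-mode client \
      /path/to/spark-examples.jar \
      100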
@@ -172,7 +194,7 @@ The spark-submit script described in the [cluster mode overvie
 Keep in mind that your driver program will be executed on a remote worker machine. You can control the execution environment in the following ways:

 * _Environment variables_: These will be captured from the environment in which you launch the client and applied when launching the driver program.
-* _Java options_: You can add java options by setting `SPARK_JAVA_OPTS` in the environment in which you launch the submission client.
+* _Java options_: You can add java options by setting `SPARK_JAVA_OPTS` in the environment in which you launch the submission client. (Note: as of Spark 1.0, Spark options should be specified through `conf/spark-defaults.conf`, which is only loaded through spark-submit.)
 * _Dependencies_: You'll still need to call `sc.addJar` inside of your program to make your bundled application jar visible on all worker nodes.

 Once you submit a driver program, it will appear in the cluster management UI at port 8080 and
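Since the note above steers applications toward `conf/spark-defaults.conf` rather than `SPARK_JAVA_OPTS`, a minimal sketch of that file might look like this (whitespace-separated key/value pairs; the values are illustrative, and the file is only picked up by spark-submit):

    # conf/spark-defaults.conf (example values)
    spark.master            spark://23.195.26.187:7077
    spark.executor.memory   2g
    spark.eventLog.enabled  true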
0 commit comments