
Commit af09e3e

Mention config file in docs and clean-up docs
1 parent b16e6a2 commit af09e3e

4 files changed: +146 -165 lines changed


docs/cluster-overview.md

Lines changed: 72 additions & 55 deletions
@@ -50,61 +50,78 @@ The system currently supports three cluster managers:
 In addition, Spark's [EC2 launch scripts](ec2-scripts.html) make it easy to launch a standalone
 cluster on Amazon EC2.
 
-# Launching Applications
-
-The recommended way to launch a compiled Spark application is through the spark-submit script (located in the
-bin directory), which takes care of setting up the classpath with Spark and its dependencies, as well as
-provides a layer over the different cluster managers and deploy modes that Spark supports. It's usage is
-
-    spark-submit `<app jar>` `<options>`
-
-Where options are any of:
-
-- **\--class** - The main class to run.
-- **\--master** - The URL of the cluster manager master, e.g. spark://host:port, mesos://host:port, yarn,
-  or local.
-- **\--deploy-mode** - "client" to run the driver in the client process or "cluster" to run the driver in
-  a process on the cluster. For Mesos, only "client" is supported.
-- **\--executor-memory** - Memory per executor (e.g. 1000M, 2G).
-- **\--executor-cores** - Number of cores per executor. (Default: 2)
-- **\--driver-memory** - Memory for driver (e.g. 1000M, 2G)
-- **\--name** - Name of the application.
-- **\--arg** - Argument to be passed to the application's main class. This option can be specified
-  multiple times to pass multiple arguments.
-- **\--jars** - A comma-separated list of local jars to include on the driver classpath and that
-  SparkContext.addJar will work with. Doesn't work on standalone with 'cluster' deploy mode.
-
-The following currently only work for Spark standalone with cluster deploy mode:
-
-- **\--driver-cores** - Cores for driver (Default: 1).
-- **\--supervise** - If given, restarts the driver on failure.
-
-The following only works for Spark standalone and Mesos only:
-
-- **\--total-executor-cores** - Total cores for all executors.
-
-The following currently only work for YARN:
-
-- **\--queue** - The YARN queue to place the application in.
-- **\--files** - Comma separated list of files to be placed in the working dir of each executor.
-- **\--archives** - Comma separated list of archives to be extracted into the working dir of each
-  executor.
-- **\--num-executors** - Number of executors (Default: 2).
-
-The master and deploy mode can also be set with the MASTER and DEPLOY_MODE environment variables.
-Values for these options passed via command line will override the environment variables.
-
-# Shipping Code to the Cluster
-
-The recommended way to ship your code to the cluster is to pass it through SparkContext's constructor,
-which takes a list of JAR files (Java/Scala) or .egg and .zip libraries (Python) to disseminate to
-worker nodes. You can also dynamically add new files to be sent to executors with `SparkContext.addJar`
-and `addFile`.
-
-## URIs for addJar / addFile
-
-- **file:** - Absolute paths and `file:/` URIs are served by the driver's HTTP file server, and every executor
-  pulls the file from the driver HTTP server
+# Bundling and Launching Applications
+
+### Bundling Your Application's Dependencies
+If your code depends on other projects, you will need to package them alongside
+your application in order to distribute the code to a Spark cluster. To do this,
+create an assembly jar (or "uber" jar) containing your code and its dependencies. Both
+[sbt](https://github.com/sbt/sbt-assembly) and
+[Maven](http://maven.apache.org/plugins/maven-shade-plugin/)
+have assembly plugins. When creating assembly jars, list Spark and Hadoop
+as `provided` dependencies; these need not be bundled since they are provided by
+the cluster manager at runtime. Once you have an assembled jar, you can call the `bin/spark-submit`
+script as shown here while passing your jar.
+
+For Python, you can use the `pyFiles` argument of SparkContext
+or its `addPyFile` method to add `.py`, `.zip` or `.egg` files to be distributed.
+
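As a concrete sketch of the `provided` packaging described above (not part of this commit; the Scala version and project name are assumptions carried over from the quick start, and the sbt-assembly plugin itself still has to be enabled in `project/plugins.sbt`), an sbt build definition could look like:

{% highlight scala %}
// Illustrative simple.sbt for an assembly build. Spark is marked "provided" so
// it is not bundled into the assembly jar; the cluster supplies it at runtime.
// Add a "provided" hadoop-client dependency as well if you compile against Hadoop APIs.
name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "{{site.SPARK_VERSION}}" % "provided"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
{% endhighlight %}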
+### Launching Applications with ./bin/spark-submit
+
+Once a user application is bundled, it can be launched using the `spark-submit` script located in
+the bin directory. This script takes care of setting up the classpath with Spark and its
+dependencies, and supports the different cluster managers and deploy modes that Spark supports.
+Its usage is
+
+    ./bin/spark-submit <app jar> --class path.to.your.Class [other options..]
+
+To enumerate all options available to `spark-submit`, run it with the `--help` flag.
+Here are a few examples of common options:
+
+{% highlight bash %}
+# Run application locally
+./bin/spark-submit my-app.jar \
+  --class my.main.ClassName \
+  --master local[8]
+
+# Run on a Spark cluster
+./bin/spark-submit my-app.jar \
+  --class my.main.ClassName \
+  --master spark://mycluster:7077 \
+  --executor-memory 20G \
+  --total-executor-cores 100
+
+# Run on a YARN cluster (use `yarn-client` instead of `yarn-cluster` for client mode)
+HADOOP_CONF_DIR=XX ./bin/spark-submit my-app.jar \
+  --class my.main.ClassName \
+  --master yarn-cluster \
+  --executor-memory 20G \
+  --num-executors 50
+{% endhighlight %}
+
+### Loading Configurations from a File
+
+The `spark-submit` script can load default `SparkConf` values from a properties file and pass them
+on to your application. By default it will read configuration options from
+`conf/spark-defaults.properties`. Any values specified in the file will be passed on to the
+application when run. They can obviate the need for certain flags to `spark-submit`: for
+instance, if the `spark.master` property is set, you can safely omit the
+`--master` flag from `spark-submit`. In general, configuration values explicitly set on a
+`SparkConf` take the highest precedence, then flags passed to `spark-submit`, then values
+in the defaults file.
+
+If you are ever unclear where configuration options are coming from, fine-grained debugging
+information can be printed by adding the `--verbose` option to `./bin/spark-submit`.
+
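Purely as an illustration (these are standard Spark property names, but the values are invented), a `conf/spark-defaults.properties` along these lines would let the cluster example above drop its `--master` and `--executor-memory` flags:

    spark.master            spark://mycluster:7077
    spark.executor.memory   20g

An explicit `SparkConf` setting or a command-line flag would still win over these values, per the precedence order described above.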
+### Advanced Dependency Management
+When using `./bin/spark-submit`, jars will be automatically transferred to the cluster. For many
+users this is sufficient. However, advanced users can add jars by calling `addFile` or `addJar`
+on an existing SparkContext. This can be used to distribute JAR files (Java/Scala) or .egg and
+.zip libraries (Python) to executors. Spark uses the following URL schemes to allow different
+strategies for disseminating jars:
+
+- **file:** - Absolute paths and `file:/` URIs are served by the driver's HTTP file server, and
+  every executor pulls the file from the driver HTTP server
 - **hdfs:**, **http:**, **https:**, **ftp:** - these pull down files and JARs from the URI as expected
 - **local:** - a URI starting with local:/ is expected to exist as a local file on each worker node. This
   means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker,
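To make the `addJar`/`addFile` calls and URI schemes described in this section concrete, here is an illustrative Scala sketch (the application name, paths, and host name are hypothetical; `addJar`, `addFile`, and the `SparkConf`/`SparkContext` setup are the existing APIs referred to above):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

object DependencySketch {
  def main(args: Array[String]) {
    // Master is set here only so the sketch runs on its own; when launched
    // through spark-submit it would normally be supplied there instead.
    val conf = new SparkConf().setAppName("Dependency Sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // file: an absolute path, served to executors by the driver's HTTP file server
    sc.addJar("/opt/libs/extra-lib.jar")
    // hdfs: pulled down from the URI by each executor
    sc.addJar("hdfs://namenode:8020/libs/extra-lib.jar")
    // local: expected to already exist at this path on every worker node
    sc.addFile("local:/opt/data/lookup.txt")

    sc.stop()
  }
}
{% endhighlight %}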

docs/quick-start.md

Lines changed: 49 additions & 87 deletions
@@ -104,10 +104,13 @@ that these same functions can be used on very large data sets, even when they ar
 tens or hundreds of nodes. You can also do this interactively by connecting `bin/spark-shell` to
 a cluster, as described in the [programming guide](scala-programming-guide.html#initializing-spark).
 
-# A Standalone App in Scala
+# A Standalone Application
 Now say we wanted to write a standalone application using the Spark API. We will walk through a
 simple application in both Scala (with SBT), Java (with Maven), and Python.
 
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
 We'll create a very simple Spark application in Scala. So simple, in fact, that it's
 named `SimpleApp.scala`:
 
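The body of `SimpleApp.scala` itself sits outside this hunk. For reference, a sketch along the lines of the quick start (not part of this diff); it simply counts the lines containing 'a' and 'b' in a local copy of Spark's README:

{% highlight scala %}
/*** SimpleApp.scala (sketch) ***/
import org.apache.spark.SparkContext

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val sc = new SparkContext("local", "Simple App")
    val logData = sc.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
{% endhighlight %}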
@@ -154,14 +157,7 @@ libraryDependencies += "org.apache.spark" %% "spark-core" % "{{site.SPARK_VERSIO
 resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
 {% endhighlight %}
 
-If you also wish to read data from Hadoop's HDFS, you will also need to add a dependency on
-`hadoop-client` for your version of HDFS:
-
-{% highlight scala %}
-libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "<your-hdfs-version>"
-{% endhighlight %}
-
-Finally, for sbt to work correctly, we'll need to layout `SimpleApp.scala` and `simple.sbt`
+For sbt to work correctly, we'll need to layout `SimpleApp.scala` and `simple.sbt`
 according to the typical directory structure. Once that is in place, we can create a JAR package
 containing the application's code, then use the `spark-submit` script to run our program.
 
@@ -188,22 +184,23 @@ $ YOUR_SPARK_HOME/bin/spark-submit target/scala-2.10/simple-project_2.10-1.0.jar
 Lines with a: 46, Lines with b: 23
 {% endhighlight %}
 
-# A Standalone App in Java
-Now say we wanted to write a standalone application using the Java API. This example will use
-maven to compile an application jar, but any similar build system will work.
+</div>
+<div data-lang="java" markdown="1">
+This example will use Maven to compile an application jar, but any similar build system will work.
 
 We'll create a very simple Spark application, `SimpleApp.java`:
 
 {% highlight java %}
 /*** SimpleApp.java ***/
 import org.apache.spark.api.java.*;
+import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.function.Function;
 
 public class SimpleApp {
   public static void main(String[] args) {
-    String logFile = "$YOUR_SPARK_HOME/README.md"; // Should be some file on your system
-    JavaSparkContext sc = new JavaSparkContext("local", "Simple App",
-      "$YOUR_SPARK_HOME", new String[]{"target/simple-project-1.0.jar"});
+    String logFile = "YOUR_SPARK_HOME/README.md"; // Should be some file on your system
+    SparkConf conf = new SparkConf().setAppName("Simple Application");
+    JavaSparkContext sc = new JavaSparkContext(conf);
     JavaRDD<String> logData = sc.textFile(logFile).cache();
 
     long numAs = logData.filter(new Function<String, Boolean>() {
@@ -219,9 +216,16 @@ public class SimpleApp {
 }
 {% endhighlight %}
 
-This program just counts the number of lines containing 'a' and the number containing 'b' in a text file. Note that you'll need to replace $YOUR_SPARK_HOME with the location where Spark is installed. As with the Scala example, we initialize a SparkContext, though we use the special `JavaSparkContext` class to get a Java-friendly one. We also create RDDs (represented by `JavaRDD`) and run transformations on them. Finally, we pass functions to Spark by creating classes that extend `spark.api.java.function.Function`. The [Java programming guide](java-programming-guide.html) describes these differences in more detail.
+This program just counts the number of lines containing 'a' and the number containing 'b' in a text
+file. Note that you'll need to replace YOUR_SPARK_HOME with the location where Spark is installed.
+As with the Scala example, we initialize a SparkContext, though we use the special
+`JavaSparkContext` class to get a Java-friendly one. We also create RDDs (represented by
+`JavaRDD`) and run transformations on them. Finally, we pass functions to Spark by creating classes
+that extend `spark.api.java.function.Function`. The
+[Java programming guide](java-programming-guide.html) describes these differences in more detail.
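For comparison only (not part of this commit), the `SparkConf`-based construction used by the new Java code reads like this in Scala:

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// Equivalent initialization to the Java example above
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
{% endhighlight %}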
 
-To build the program, we also write a Maven `pom.xml` file that lists Spark as a dependency. Note that Spark artifacts are tagged with a Scala version.
+To build the program, we also write a Maven `pom.xml` file that lists Spark as a dependency.
+Note that Spark artifacts are tagged with a Scala version.
 
 {% highlight xml %}
 <project>
@@ -247,16 +251,6 @@ To build the program, we also write a Maven `pom.xml` file that lists Spark as a
 </project>
 {% endhighlight %}
 
-If you also wish to read data from Hadoop's HDFS, you will also need to add a dependency on `hadoop-client` for your version of HDFS:
-
-{% highlight xml %}
-<dependency>
-  <groupId>org.apache.hadoop</groupId>
-  <artifactId>hadoop-client</artifactId>
-  <version>...</version>
-</dependency>
-{% endhighlight %}
-
 We lay out these files according to the canonical Maven directory structure:
 {% highlight bash %}
 $ find .
@@ -267,16 +261,25 @@ $ find .
 ./src/main/java/SimpleApp.java
 {% endhighlight %}
 
-Now, we can execute the application using Maven:
+Now, we can package the application using Maven and execute it with `./bin/spark-submit`.
 
 {% highlight bash %}
+# Package a jar containing your application
 $ mvn package
-$ mvn exec:java -Dexec.mainClass="SimpleApp"
+...
+[INFO] Building jar: {..}/{..}/target/simple-project-1.0.jar
+
+# Use spark-submit to run your application
+$ YOUR_SPARK_HOME/bin/spark-submit target/simple-project-1.0.jar \
+  --class "SimpleApp" \
+  --master local[4]
 ...
 Lines with a: 46, Lines with b: 23
 {% endhighlight %}
 
-# A Standalone App in Python
+</div>
+<div data-lang="python" markdown="1">
+
 Now we will show how to write a standalone application using the Python API (PySpark).
 
 As an example, we'll create a simple Spark application, `SimpleApp.py`:
@@ -285,7 +288,7 @@ As an example, we'll create a simple Spark application, `SimpleApp.py`:
 """SimpleApp.py"""
 from pyspark import SparkContext
 
-logFile = "$YOUR_SPARK_HOME/README.md" # Should be some file on your system
+logFile = "YOUR_SPARK_HOME/README.md" # Should be some file on your system
 sc = SparkContext("local", "Simple App")
 logData = sc.textFile(logFile).cache()
 
@@ -296,11 +299,15 @@ print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
 {% endhighlight %}
 
 
-This program just counts the number of lines containing 'a' and the number containing 'b' in a text file.
-Note that you'll need to replace $YOUR_SPARK_HOME with the location where Spark is installed.
+This program just counts the number of lines containing 'a' and the number containing 'b' in a
+text file.
+Note that you'll need to replace YOUR_SPARK_HOME with the location where Spark is installed.
 As with the Scala and Java examples, we use a SparkContext to create RDDs.
-We can pass Python functions to Spark, which are automatically serialized along with any variables that they reference.
-For applications that use custom classes or third-party libraries, we can add those code dependencies to SparkContext to ensure that they will be available on remote machines; this is described in more detail in the [Python programming guide](python-programming-guide.html).
+We can pass Python functions to Spark, which are automatically serialized along with any variables
+that they reference.
+For applications that use custom classes or third-party libraries, we can add those code
+dependencies to SparkContext to ensure that they will be available on remote machines; this is
+described in more detail in the [Python programming guide](python-programming-guide.html).
 `SimpleApp` is simple enough that we do not need to specify any code dependencies.
 
 We can run this application using the `bin/pyspark` script:
@@ -312,57 +319,12 @@ $ ./bin/pyspark SimpleApp.py
 Lines with a: 46, Lines with b: 23
 {% endhighlight python %}
 
-# Running on a Cluster
-
-There are a few additional considerations when running applicaitons on a
-[Spark](spark-standalone.html), [YARN](running-on-yarn.html), or
-[Mesos](running-on-mesos.html) cluster.
-
-### Including Your Dependencies
-If your code depends on other projects, you will need to ensure they are also
-present on the slave nodes. A popular approach is to create an
-assembly jar (or "uber" jar) containing your code and its dependencies. Both
-[sbt](https://github.com/sbt/sbt-assembly) and
-[Maven](http://maven.apache.org/plugins/maven-assembly-plugin/)
-have assembly plugins. When creating assembly jars, list Spark
-itself as a `provided` dependency; it need not be bundled since it is
-already present on the slaves. Once you have an assembled jar,
-add it to the SparkContext as shown here. It is also possible to add
-your dependent jars one-by-one using the `addJar` method of `SparkContext`.
-
-For Python, you can use the `pyFiles` argument of SparkContext
-or its `addPyFile` method to add `.py`, `.zip` or `.egg` files to be distributed.
-
-### Setting Configuration Options
-Spark includes several [configuration options](configuration.html#spark-properties)
-that influence the behavior of your application.
-These should be set by building a [SparkConf](api/core/index.html#org.apache.spark.SparkConf)
-object and passing it to the SparkContext constructor.
-For example, in Java and Scala, you can do:
-
-{% highlight scala %}
-import org.apache.spark.{SparkConf, SparkContext}
-val conf = new SparkConf()
-             .setMaster("local")
-             .setAppName("My application")
-             .set("spark.executor.memory", "1g")
-val sc = new SparkContext(conf)
-{% endhighlight %}
-
-Or in Python:
-
-{% highlight scala %}
-from pyspark import SparkConf, SparkContext
-conf = SparkConf()
-conf.setMaster("local")
-conf.setAppName("My application")
-conf.set("spark.executor.memory", "1g"))
-sc = SparkContext(conf = conf)
-{% endhighlight %}
+</div>
+</div>
 
-### Accessing Hadoop Filesystems
+# Where to go from here
+Congratulations on running your first Spark application!
 
-The examples here access a local file. To read data from a distributed
-filesystem, such as HDFS, include
-[Hadoop version information](index.html#a-note-about-hadoop-versions)
-in your build file. By default, Spark builds against HDFS 1.0.4.
+* For an in-depth overview of the API see the "Programming Guides" menu section.
+* For running applications on a cluster, head to the [deployment overview](cluster-overview.html).
+* For configuration options available to Spark applications see the [configuration page](configuration.html).
