[SPARK-1753 / 1773 / 1814] Update outdated docs for spark-submit, YARN, standalone etc.
YARN
- SparkPi was updated to no longer take the master as an argument; we should update the docs to reflect that.
- The default YARN build guide should use Maven, not SBT.
- This PR also adds a paragraph on steps to debug a YARN application.
Standalone
- Emphasize spark-submit more. Right now it's one small paragraph preceding the legacy way of launching through `org.apache.spark.deploy.Client`.
- The old docs' instructions for setting configurations / environment variables are outdated and need to reflect the recent Spark configuration changes.
In general, this PR also adds a little more documentation on the new spark-shell, spark-submit, spark-defaults.conf, etc. here and there.
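As a concrete illustration of the YARN debugging steps this PR documents (a sketch, assuming log aggregation is enabled on the cluster via `yarn.log-aggregation-enable`):

    # Fetch the aggregated container logs of a finished YARN application
    yarn logs -applicationId <app ID>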
Author: Andrew Or <[email protected]>
Closes #701 from andrewor14/yarn-docs and squashes the following commits:
e2c2312 [Andrew Or] Merge in changes in #752 (SPARK-1814)
25cfe7b [Andrew Or] Merge in the warning from SPARK-1753
a8c39c5 [Andrew Or] Minor changes
336bbd9 [Andrew Or] Tabs -> spaces
4d9d8f7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-docs
041017a [Andrew Or] Abstract Spark submit documentation to cluster-overview.html
3cc0649 [Andrew Or] Detail how to set configurations + remove legacy instructions
5b7140a [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-docs
85a51fc [Andrew Or] Update run-example, spark-shell, configuration etc.
c10e8c7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-docs
381fe32 [Andrew Or] Update docs for standalone mode
757c184 [Andrew Or] Add a note about the requirements for the debugging trick
f8ca990 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-docs
924f04c [Andrew Or] Revert addition of --deploy-mode
d5fe17b [Andrew Or] Update the YARN docs
(cherry picked from commit 2ffd1ea)
Signed-off-by: Patrick Wendell <[email protected]>
docs/building-with-maven.md (7 additions, 0 deletions)
@@ -129,6 +129,13 @@ Java 8 tests are run when -Pjava8-tests profile is enabled, they will run in spi
 For these tests to run your system must have a JDK 8 installation.
 If you have JDK 8 installed but it is not the system default, you can set JAVA_HOME to point to JDK 8 before running the tests.
 
+## Building for PySpark on YARN ##
+
+PySpark on YARN is only supported if the jar is built with maven. Further, there is a known problem
+with building this assembly jar on Red Hat based operating systems (see SPARK-1753). If you wish to
+run PySpark on a YARN cluster with Red Hat installed, we recommend that you build the jar elsewhere,
+then ship it over to the cluster. We are investigating the exact cause for this.
+
 ## Packaging without Hadoop dependencies for deployment on YARN ##
 
 The assembly jar produced by "mvn package" will, by default, include all of Spark's dependencies, including Hadoop and some of its ecosystem projects. On YARN deployments, this causes multiple versions of these to appear on executor classpaths: the version packaged in the Spark assembly and the version on each node, included with yarn.application.classpath. The "hadoop-provided" profile builds the assembly without including Hadoop-ecosystem projects, like ZooKeeper and Hadoop itself.
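For context, the YARN-enabled assembly this section refers to can be produced with a Maven invocation along these lines (a sketch: the `-Pyarn` profile matches this guide, while the Hadoop/YARN versions are assumptions to adjust for your cluster):

    # Build the Spark assembly with YARN support (versions are illustrative)
    mvn -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests clean package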
docs/cluster-overview.md

 When calling `spark-submit`, `[app options]` will be passed along to your application's
-main class. To enumerate all options available to `spark-submit` run it with
-the `--help` flag. Here are a few examples of common options:
+    main-class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
+    master-url: The URL of the master node (e.g. spark://23.195.26.187:7077)
+    deploy-mode: Whether to deploy this application within the cluster or from an external client (e.g. client)
+    application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes.
+    application-arguments: Space delimited arguments passed to the main method of <main-class>, if any
+
+To enumerate all options available to `spark-submit` run it with the `--help` flag. Here are a few
+examples of common options:
 
 {% highlight bash %}
 # Run application locally
 ./bin/spark-submit \
-  --class my.main.ClassName
+  --class org.apache.spark.examples.SparkPi \
   --master local[8] \
-  my-app.jar
+  /path/to/examples.jar \
+  100
 
 # Run on a Spark standalone cluster
 ./bin/spark-submit \
-  --class my.main.ClassName
-  --master spark://mycluster:7077 \
+  --class org.apache.spark.examples.SparkPi \
+  --master spark://207.184.161.138:7077 \
   --executor-memory 20G \
   --total-executor-cores 100 \
-  my-app.jar
+  /path/to/examples.jar \
+  1000
 
 # Run on a YARN cluster
-HADOOP_CONF_DIR=XX /bin/spark-submit \
-  --class my.main.ClassName
+HADOOP_CONF_DIR=XX ./bin/spark-submit \
+  --class org.apache.spark.examples.SparkPi \
   --master yarn-cluster \ # can also be `yarn-client` for client mode
   --executor-memory 20G \
   --num-executors 50 \
-  my-app.jar
+  /path/to/examples.jar \
+  1000
 {% endhighlight %}
 
 ### Loading Configurations from a File
 
-The `spark-submit` script can load default `SparkConf` values from a properties file and pass them
-onto your application. By default it will read configuration options from
-`conf/spark-defaults.conf`. Any values specified in the file will be passed on to the
-application when run. They can obviate the need for certain flags to `spark-submit`: for
-instance, if `spark.master` property is set, you can safely omit the
+The `spark-submit` script can load default [Spark configuration values](configuration.html) from a
+properties file and pass them on to your application. By default it will read configuration options
+from `conf/spark-defaults.conf`. For more detail, see the section on
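To make the defaults file concrete, a minimal `conf/spark-defaults.conf` might look like the sketch below (the property names are standard Spark configuration keys; the values are illustrative assumptions). With `spark.master` set here, the `--master` flag to `spark-submit` can be omitted:

    spark.master            spark://207.184.161.138:7077
    spark.executor.memory   512m
    spark.eventLog.enabled  true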
docs/index.md (22 additions, 12 deletions)
@@ -24,21 +24,31 @@ right version of Scala from [scala-lang.org](http://www.scala-lang.org/download/
 
 # Running the Examples and Shell
 
-Spark comes with several sample programs. Scala, Java and Python examples are in the `examples/src/main` directory.
-To run one of the Java or Scala sample programs, use `./bin/run-example <class> <params>` in the top-level Spark directory
-(the `bin/run-example` script sets up the appropriate paths and launches that program).
-For example, try `./bin/run-example org.apache.spark.examples.SparkPi local`.
-To run a Python sample program, use `./bin/pyspark <sample-program> <params>`. For example, try `./bin/pyspark ./examples/src/main/python/pi.py local`.
+Spark comes with several sample programs. Scala, Java and Python examples are in the
+`examples/src/main` directory. To run one of the Java or Scala sample programs, use
+`bin/run-example <class> [params]` in the top-level Spark directory. (Behind the scenes, this
+invokes the more general
+[Spark submit script](cluster-overview.html#launching-applications-with-spark-submit) for
+launching applications). For example,
 
-Each example prints usage help when run with no parameters.
+    ./bin/run-example SparkPi 10
 
-Note that all of the sample programs take a `<master>` parameter specifying the cluster URL
-to connect to. This can be a [URL for a distributed cluster](scala-programming-guide.html#master-urls),
-or `local` to run locally with one thread, or `local[N]` to run locally with N threads. You should start by using
-`local` for testing.
+You can also run Spark interactively through modified versions of the Scala shell. This is a
+great way to learn the framework.
 
-Finally, you can run Spark interactively through modified versions of the Scala shell (`./bin/spark-shell`) or
-Python interpreter (`./bin/pyspark`). These are a great way to learn the framework.
+    ./bin/spark-shell --master local[2]
+
+The `--master` option specifies the
+[master URL for a distributed cluster](scala-programming-guide.html#master-urls), or `local` to run
+locally with one thread, or `local[N]` to run locally with N threads. You should start by using
+`local` for testing. For a full list of options, run Spark shell with the `--help` option.
+
+Spark also provides a Python interface. To run an example Spark application written in Python, use
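A sketch of such a Python invocation (the choice of the `pi.py` sample and its argument are assumptions here):

    ./bin/pyspark examples/src/main/python/pi.py 10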
docs/java-programming-guide.md (1 addition, 4 deletions)
@@ -215,7 +215,4 @@ Spark includes several sample programs using the Java API in
 [`examples/src/main/java`](https://github.com/apache/spark/tree/master/examples/src/main/java/org/apache/spark/examples). You can run them by passing the class name to the
 `bin/run-example` script included in Spark; for example:
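A sketch of such an invocation (the choice of `JavaWordCount` and the `README.md` input file are assumptions here):

    ./bin/run-example JavaWordCount README.md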
docs/python-programming-guide.md (1 addition, 1 deletion)
@@ -164,6 +164,6 @@ some example applications.
 PySpark also includes several sample programs in the [`examples/src/main/python` folder](https://github.com/apache/spark/tree/master/examples/src/main/python).
 You can run them by passing the files to `pyspark`; e.g.:
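For instance, a sketch assuming the bundled `wordcount.py` sample and a local input file:

    ./bin/pyspark examples/src/main/python/wordcount.py README.md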
0 commit comments