    long numAs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();
  }
}
{% endhighlight %}

This program just counts the number of lines containing 'a' and the number containing 'b' in a text
file. Note that you'll need to replace YOUR_SPARK_HOME with the location where Spark is installed.
As with the Scala example, we initialize a SparkContext, though we use the special
`JavaSparkContext` class to get a Java-friendly one. We also create RDDs (represented by
`JavaRDD`) and run transformations on them. Finally, we pass functions to Spark by creating classes
that extend `spark.api.java.function.Function`. The
[Java programming guide](java-programming-guide.html) describes these differences in more detail.
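
Since the top of the Java example is elided above, here is a rough sketch of the setup that
paragraph describes; the class name, `local` master, and file path are illustrative, and the
imports follow the `spark.api.java` package names referenced in this guide:

{% highlight java %}
// Illustrative sketch only -- adjust package names and paths to your Spark version.
import spark.api.java.JavaRDD;
import spark.api.java.JavaSparkContext;

public class SimpleAppSketch {
  public static void main(String[] args) {
    // The Java-friendly context mentioned above.
    JavaSparkContext sc = new JavaSparkContext("local", "Simple App");
    // textFile() returns a JavaRDD, on which we run transformations such as filter().
    JavaRDD<String> logData = sc.textFile("YOUR_SPARK_HOME/README.md").cache();
    System.out.println("Total lines: " + logData.count());
  }
}
{% endhighlight %}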

To build the program, we also write a Maven `pom.xml` file that lists Spark as a dependency.
Note that Spark artifacts are tagged with a Scala version.

{% highlight xml %}
<project>
  ...
</project>
{% endhighlight %}
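
For reference, the Spark entry inside that `pom.xml` looks roughly like the following; the `_2.10`
suffix is the Scala version tag mentioned above, and the exact group ID, suffix, and version depend
on the Spark release you build against:

{% highlight xml %}
<dependency> <!-- Spark dependency -->
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>...</version>
</dependency>
{% endhighlight %}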

If you also wish to read data from Hadoop's HDFS, you will need to add a dependency on
`hadoop-client` for your version of HDFS:

{% highlight xml %}
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>...</version>
</dependency>
{% endhighlight %}

We lay out these files according to the canonical Maven directory structure:

{% highlight bash %}
$ find .
./pom.xml
./src
./src/main
./src/main/java
./src/main/java/SimpleApp.java
{% endhighlight %}

Now, we can package the application using Maven and execute it with `./bin/spark-submit`.

{% highlight bash %}
# Package a jar containing your application
$ mvn package
...
[INFO] Building jar: {..}/{..}/target/simple-project-1.0.jar

# Use spark-submit to run your application (the local[4] master URL is just an example)
$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/simple-project-1.0.jar
{% endhighlight %}

Now we will show how to write a standalone application using the Python API (PySpark).
As an example, we'll create a simple Spark application, `SimpleApp.py`:

{% highlight python %}
"""SimpleApp.py"""
from pyspark import SparkContext

logFile = "YOUR_SPARK_HOME/README.md"  # Should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()

# Count the lines containing 'a' and 'b' by filtering with Python lambdas
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()

print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
{% endhighlight %}

This program just counts the number of lines containing 'a' and the number containing 'b' in a
text file.
Note that you'll need to replace YOUR_SPARK_HOME with the location where Spark is installed.
As with the Scala and Java examples, we use a SparkContext to create RDDs.
We can pass Python functions to Spark, which are automatically serialized along with any variables
that they reference.
For applications that use custom classes or third-party libraries, we can add those code
dependencies to SparkContext to ensure that they will be available on remote machines; this is
described in more detail in the [Python programming guide](python-programming-guide.html).
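
As a rough sketch of what that would look like (the module names here are made up for
illustration):

{% highlight python %}
from pyspark import SparkContext

# "my_helpers.py" is a hypothetical local module used by the job; listing it in
# pyFiles ships it to the worker nodes along with the application.
sc = SparkContext("local", "App With Dependencies", pyFiles=["my_helpers.py"])

# Dependencies can also be added after the context has been created.
sc.addPyFile("more_helpers.py")
{% endhighlight %}
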
`SimpleApp` is simple enough that we do not need to specify any code dependencies.
We can run this application using the `bin/pyspark` script.
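
A minimal sketch of that invocation, assuming `SimpleApp.py` sits in the Spark installation
directory:

{% highlight bash %}
$ cd YOUR_SPARK_HOME
$ ./bin/pyspark SimpleApp.py
{% endhighlight %}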