---
layout: global
title: Getting Started with Apache Spark
---

Apache Spark is a fast and general-purpose cluster computing system.
It also supports a rich set of higher-level tools including [Shark](http://shark.cs.berkeley.edu) (Hive on Spark), [Spark SQL](sql-programming-guide.html) for structured data, [MLlib](mllib-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html) for stream processing.

Get Spark by visiting the [downloads page](http://spark.apache.org/downloads.html) of the Apache Spark site. This documentation is for Spark version {{site.SPARK_VERSION}}.

Spark runs on both Windows and Unix-like systems (e.g., Linux, Mac OS). All you need to run it is to have Java on your system `PATH`, or to point the `JAVA_HOME` environment variable at a Java installation.

Note: Some parts of the [Spark Programming Quick Start Guide](quick-start.html) and all of the [Spark Scala Programming Guide](scala-programming-guide.html) are written through a Scala lens, so Java and Python developers may wish to download and install Scala in order to work hands-on with the Scala examples.

For its Scala API, Spark {{site.SPARK_VERSION}} depends on Scala {{site.SCALA_BINARY_VERSION}}. If you write applications in
Scala, you will need to use a compatible Scala version (*e.g.*, {{site.SCALA_BINARY_VERSION}}.x) -- newer major versions may not work. You can get the appropriate version of Scala from [scala-lang.org](http://www.scala-lang.org/download/).
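
The same compatibility rule applies if you build your own Scala application against Spark's published artifacts. Below is a minimal sbt sketch; the project name and the exact version strings are placeholders, so substitute the Scala and Spark versions that match your installation:

    // build.sbt -- an illustrative sketch only; pin versions matching your Spark release
    name := "simple-spark-app"

    // Must stay on the same Scala line that Spark was built for (e.g., 2.10.x)
    scalaVersion := "2.10.4"

    // Spark's core artifact from Maven Central; the version shown is a placeholder
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"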

# Building

Spark uses the Hadoop-client library to talk to HDFS and other Hadoop-supported
storage systems. Because the HDFS protocol has changed in different versions of
Hadoop, you must build Spark against the same version that your cluster uses.

Spark is bundled with the [Simple Build Tool](http://www.scala-sbt.org) (SBT). To compile the code with SBT, linking against the default Hadoop version (1.0.4), run the following from the top-level Spark directory:

$ sbt/sbt assembly

You can change the Hadoop version that Spark links to by setting the
`SPARK_HADOOP_VERSION` environment variable when compiling. For example:

$ SPARK_HADOOP_VERSION=2.2.0 sbt/sbt assembly

If you wish to run Spark on [YARN](running-on-yarn.html), set
`SPARK_YARN` to `true`. For example:

$ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

Note: If you're using the Windows Command Prompt, run each command separately:

> set SPARK_HADOOP_VERSION=2.0.5-alpha
> set SPARK_YARN=true
> sbt/sbt assembly

# Running Spark Examples

Spark comes with a number of sample programs. Scala and Java examples are in the `examples` directory, and Python examples are in the `python/examples` directory.

To run one of the Java or Scala sample programs, in the top-level Spark directory:

$ ./bin/run-example <class> <params>

The `bin/run-example` script sets up the appropriate paths and launches the specified program.
For example, try this Scala program:

$ ./bin/run-example org.apache.spark.examples.SparkPi local

Or run this Java program:

$ ./bin/run-example org.apache.spark.examples.JavaSparkPi local

To run a Python sample program, in the top-level Spark directory:

$ ./bin/pyspark <sample-program> <params>

For example, try:

$ ./bin/pyspark ./python/examples/pi.py local

Each example prints usage help when run without parameters:

$ ./bin/run-example org.apache.spark.examples.JavaWordCount
Usage: JavaWordCount <master> <file>

$ ./bin/run-example org.apache.spark.examples.JavaWordCount local README.md

The README.md file is located in the top-level Spark directory.

Note that all of the sample programs take a `<master>` parameter specifying the cluster URL
to connect to. This can be a [URL for a distributed cluster](scala-programming-guide.html#master-urls),
`local` to run locally with one thread, or `local[N]` to run locally with N threads. We recommend starting by using
`local` for testing.
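
The same master value is what you pass when constructing a `SparkContext` in your own application. A minimal Scala sketch (the application name here is just a placeholder):

    import org.apache.spark.SparkContext

    // "local[2]" runs Spark locally with two worker threads; replace it with a
    // cluster URL such as "spark://host:7077" to connect to a standalone cluster.
    val sc = new SparkContext("local[2]", "MasterUrlExample")

    println(sc.parallelize(1 to 100).count())  // quick sanity check: prints 100
    sc.stop()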

# Using the Spark Shell

You can run Spark interactively through modified versions of the Scala shell or
the Python interpreter. These are great ways to learn the Spark framework.

The Spark Scala shell is discussed in greater detail in the [Spark Programming Quick Start Guide](quick-start.html) and the [Spark Scala Programming Guide](scala-programming-guide.html). The Spark Python interpreter is discussed in greater detail in the [Spark Python Programming Guide](python-programming-guide.html#interactive-use).

To run Spark's Scala shell, from the top-level Spark directory:

$ ./bin/spark-shell
...
scala>
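
Inside the shell a SparkContext is already created for you as `sc`, so you can experiment right away; for example (an illustrative snippet, not taken from the guide):

    scala> val data = sc.parallelize(1 to 1000)
    scala> data.filter(_ % 2 == 0).count()    // evaluates to 500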

To run Spark's Python interpreter, from the top-level Spark directory:

$ ./bin/pyspark
...
>>>

# Launching on a Cluster

The Spark [cluster mode overview](cluster-overview.html) explains the key concepts of running on a cluster.
Spark can run by itself or over several existing cluster managers. There are currently several
options for deployment:

* [Amazon EC2](ec2-scripts.html): our EC2 scripts let you launch a cluster in about 5 minutes
* [Standalone Deploy Mode](spark-standalone.html): the simplest way to deploy Spark on a private cluster
* [Apache Mesos](running-on-mesos.html)
* [Hadoop YARN](running-on-yarn.html)

# Where to Go from Here

**Programming Guides:**

We recommend that Scala, Java, and Python developers work through the [Spark Programming Quick Start Guide](quick-start.html) first, and then the [Spark Scala Programming Guide](scala-programming-guide.html).

Even though the [Spark Programming Quick Start Guide](quick-start.html) and the [Spark Scala Programming Guide](scala-programming-guide.html) are written through a Scala lens, Java and Python developers will find that these guides introduce key concepts worth understanding before diving into the [Spark Java Programming Guide](java-programming-guide.html) or the [Spark Python Programming Guide](python-programming-guide.html).

* [Spark Programming Quick Start Guide](quick-start.html): a quick introduction to the Spark API; start here!
* [Spark Scala Programming Guide](scala-programming-guide.html): an overview of Spark concepts through a Scala lens; then go here!
* [Spark Java Programming Guide](java-programming-guide.html): using Spark from Java
* [Spark Python Programming Guide](python-programming-guide.html): using Spark from Python
* [Spark Streaming](streaming-programming-guide.html): Spark's API for processing data streams
* [Spark SQL](sql-programming-guide.html): Support for running relational queries on Spark
* [MLlib (Machine Learning)](mllib-guide.html): Spark's built-in machine learning library
* [Bagel (Pregel on Spark)](bagel-programming-guide.html): simple graph processing model; will soon be superseded by [GraphX](graphx-programming-guide.html)
* [GraphX (Graphs on Spark)](graphx-programming-guide.html): Spark's new API for graphs

**API Docs:**


* [Spark Scala API (Scaladoc)](api/scala/index.html#org.apache.spark.package)
* [Spark Java API (Javadoc)](api/java/index.html)
* [Spark Python API (Epydoc)](api/python/index.html)

**Deployment Guides:**

* [Cluster Overview](cluster-overview.html): overview of concepts and components when running on a cluster
* [Amazon EC2](ec2-scripts.html): scripts that let you launch a cluster on EC2 in about 5 minutes
* [Standalone Deploy Mode](spark-standalone.html): launch a standalone cluster quickly without a third-party cluster manager
* [Mesos](running-on-mesos.html): deploy a private cluster using [Apache Mesos](http://mesos.apache.org)
* [YARN](running-on-yarn.html): deploy Spark on top of Hadoop NextGen (YARN)

**Other Documents:**

* [Configuration](configuration.html): customize Spark via its configuration system
* [Tuning Guide](tuning.html): best practices for optimizing performance and memory use
* [Security](security.html): Spark security support
* [Hardware Provisioning](hardware-provisioning.html): recommendations for cluster hardware
* [Job Scheduling](job-scheduling.html): scheduling resources across and within Spark applications
* [Building Spark with Maven](building-with-maven.html): build Spark using the Maven system
* [Contributing to Spark](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark): the Spark wiki page on how to contribute code or documentation, report issues, and more

**External Resources:**

* [Spark Homepage](http://spark.apache.org)
* [Shark](http://shark.cs.berkeley.edu): Apache Hive over Spark
* [AMP Camps](http://ampcamp.berkeley.edu/): a series of training camps at UC Berkeley that featured talks and
exercises about Spark, Shark, Mesos, and more. [Videos](http://ampcamp.berkeley.edu/agenda-2012),
[slides](http://ampcamp.berkeley.edu/agenda-2012) and [exercises](http://ampcamp.berkeley.edu/exercises-2012) are
available online for free.
* [Code Examples](http://spark.apache.org/examples.html): more are also available in the [examples subfolder](https://github.com/apache/spark/tree/master/examples/src/main/scala/) of the Apache Spark project
* [Paper Describing Spark](http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf)
* [Paper Describing Spark Streaming](http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf)
