diff --git a/core/pom.xml b/core/pom.xml
index bd6767e03bb9d..5f40224461227 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -35,7 +35,7 @@
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-client</artifactId>
-
+
       <groupId>net.java.dev.jets3t</groupId>
       <artifactId>jets3t</artifactId>
diff --git a/docs/openstack-integration.md b/docs/openstack-integration.md
new file mode 100644
index 0000000000000..ac5b5a34a141c
--- /dev/null
+++ b/docs/openstack-integration.md
@@ -0,0 +1,269 @@
---
layout: global
title: OpenStack Integration
---

* This will become a table of contents (this text will be scraped).
{:toc}


# Accessing OpenStack Swift from Spark

Spark's file interface allows it to process data in OpenStack Swift using the same URI
formats that are supported for Hadoop. You can specify a path in Swift as input through a
URI of the form `swift://container.PROVIDER/path`. You will also need to set your
Swift security credentials, through `core-site.xml` or via
`SparkContext.hadoopConfiguration`.
The OpenStack Swift driver was merged into Hadoop 2.3.0
([Swift driver](https://issues.apache.org/jira/browse/HADOOP-8545)); users of earlier
Hadoop versions will need to configure the Swift driver manually.
The current Swift driver requires Swift to use the Keystone authentication method. There is an
ongoing effort to support temp auth ([HADOOP-10420](https://issues.apache.org/jira/browse/HADOOP-10420)).

# Configuring Swift
The Swift proxy server must include the `list_endpoints` middleware. More information is
available
[here](https://github.com/openstack/swift/blob/master/swift/common/middleware/list_endpoints.py).

# Dependencies

Spark should be compiled with `hadoop-openstack-2.3.0.jar`, which is distributed with
Hadoop 2.3.0. For Maven builds, the `dependencyManagement` section of Spark's main
`pom.xml` should include:
{% highlight xml %}
<dependencyManagement>
  ...
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-openstack</artifactId>
    <version>2.3.0</version>
  </dependency>
  ...
</dependencyManagement>
{% endhighlight %}

In addition, both the `core` and `yarn` projects should add
`hadoop-openstack` to the `dependencies` section of their
`pom.xml`:
{% highlight xml %}
<dependencies>
  ...
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-openstack</artifactId>
  </dependency>
  ...
</dependencies>
{% endhighlight %}

# Configuration Parameters

Create `core-site.xml` and place it inside Spark's `conf` directory.
There are two main categories of parameters to configure: the declaration of the
Swift driver, and the parameters required by Keystone.

Configuring Hadoop to use the Swift file system is achieved via the following property:
<table class="table">
<tr><th>Property Name</th><th>Value</th></tr>
<tr>
  <td>fs.swift.impl</td>
  <td>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</td>
</tr>
</table>
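
If editing `core-site.xml` is not convenient, the same declaration can be made at runtime
through the standard Hadoop `Configuration` API. A minimal sketch for `spark-shell`,
mirroring the table above:

{% highlight scala %}
// Declare the Swift filesystem implementation at runtime rather than in
// core-site.xml; spark-shell already provides `sc`.
sc.hadoopConfiguration.set(
  "fs.swift.impl",
  "org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem")
{% endhighlight %}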
Additional parameters are required by Keystone and must be provided to the Swift driver; they
are used to authenticate against Keystone when accessing Swift. The following table contains a
list of these parameters (most of them mandatory). `PROVIDER` can be any name.
<table class="table">
<tr><th>Property Name</th><th>Meaning</th><th>Required</th></tr>
<tr>
  <td>fs.swift.service.PROVIDER.auth.url</td>
  <td>Keystone Authentication URL</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td>fs.swift.service.PROVIDER.auth.endpoint.prefix</td>
  <td>Keystone endpoints prefix</td>
  <td>Optional</td>
</tr>
<tr>
  <td>fs.swift.service.PROVIDER.tenant</td>
  <td>Tenant</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td>fs.swift.service.PROVIDER.username</td>
  <td>Username</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td>fs.swift.service.PROVIDER.password</td>
  <td>Password</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td>fs.swift.service.PROVIDER.http.port</td>
  <td>HTTP port</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td>fs.swift.service.PROVIDER.region</td>
  <td>Keystone region</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td>fs.swift.service.PROVIDER.public</td>
  <td>Indicates if all URLs are public</td>
  <td>Mandatory</td>
</tr>
</table>
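
Because every key is parameterized by the provider name, it can help to set them in one
place. The following is a hedged Scala sketch, not part of the driver's API:
`configureSwiftProvider` is a hypothetical helper, and all values are placeholders taken
from the example below.

{% highlight scala %}
import org.apache.hadoop.conf.Configuration

// Hypothetical helper: expand "fs.swift.service.<provider>.<suffix>" keys
// and set them on a Hadoop Configuration in one pass.
def configureSwiftProvider(conf: Configuration, provider: String,
                           settings: Map[String, String]): Unit = {
  settings.foreach { case (suffix, value) =>
    conf.set(s"fs.swift.service.$provider.$suffix", value)
  }
}

// Example usage in spark-shell (placeholder values):
configureSwiftProvider(sc.hadoopConfiguration, "SparkTest", Map(
  "auth.url"  -> "http://127.0.0.1:5000/v2.0/tokens",
  "tenant"    -> "test",
  "username"  -> "tester",
  "password"  -> "testing",
  "http.port" -> "8080",
  "region"    -> "RegionOne",
  "public"    -> "true"))
{% endhighlight %}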
For example, assume `PROVIDER=SparkTest` and Keystone contains user `tester` with password
`testing` defined for tenant `test`. Then `core-site.xml` should include:

{% highlight xml %}
<configuration>
  <property>
    <name>fs.swift.impl</name>
    <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.auth.url</name>
    <value>http://127.0.0.1:5000/v2.0/tokens</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.auth.endpoint.prefix</name>
    <value>endpoints</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.http.port</name>
    <value>8080</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.region</name>
    <value>RegionOne</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.public</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.tenant</name>
    <value>test</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.username</name>
    <value>tester</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.password</name>
    <value>testing</value>
  </property>
</configuration>
{% endhighlight %}

Notice that
`fs.swift.service.PROVIDER.tenant`,
`fs.swift.service.PROVIDER.username`, and
`fs.swift.service.PROVIDER.password` contain sensitive information, so keeping them in
`core-site.xml` is not always a good approach.
We suggest keeping those parameters in `core-site.xml` only for testing purposes when running Spark
via `spark-shell`.
For job submissions they should be provided via `sparkContext.hadoopConfiguration`.

# Usage examples

Assume Keystone's authentication URL is `http://127.0.0.1:5000/v2.0/tokens`, and that Keystone contains tenant `test` and user `tester` with password `testing`. In our example we define `PROVIDER=SparkTest`. Assume that Swift contains a container `logs` with an object `data.log`. To access `data.log` from Spark, the `swift://` scheme should be used.


## Running Spark via spark-shell

Make sure that `core-site.xml` contains `fs.swift.service.SparkTest.tenant`, `fs.swift.service.SparkTest.username`, and
`fs.swift.service.SparkTest.password`. Run Spark via `spark-shell` and access Swift via the `swift://` scheme.

{% highlight scala %}
val sfdata = sc.textFile("swift://logs.SparkTest/data.log")
sfdata.count()
{% endhighlight %}


## Sample Application

In this case `core-site.xml` need not contain `fs.swift.service.SparkTest.tenant`, `fs.swift.service.SparkTest.username`, or
`fs.swift.service.SparkTest.password`, since they are set programmatically. Example of Java usage:

{% highlight java %}
/* SimpleApp.java */
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;

public class SimpleApp {
  public static void main(String[] args) {
    String logFile = "swift://logs.SparkTest/data.log";
    SparkConf conf = new SparkConf().setAppName("Simple Application");
    JavaSparkContext sc = new JavaSparkContext(conf);
    // Provide the Keystone credentials programmatically instead of via core-site.xml.
    sc.hadoopConfiguration().set("fs.swift.service.SparkTest.tenant", "test");
    sc.hadoopConfiguration().set("fs.swift.service.SparkTest.username", "tester");
    sc.hadoopConfiguration().set("fs.swift.service.SparkTest.password", "testing");

    JavaRDD<String> logData = sc.textFile(logFile).cache();
    long num = logData.count();

    System.out.println("Total number of lines: " + num);
  }
}
{% endhighlight %}

The directory structure is:
{% highlight bash %}
./src
./src/main
./src/main/java
./src/main/java/SimpleApp.java
{% endhighlight %}

Maven `pom.xml` should contain:
{% highlight xml %}
<project>
  <groupId>edu.berkeley</groupId>
  <artifactId>simple-project</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>Simple Project</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <repositories>
    <repository>
      <id>Akka repository</id>
      <url>http://repo.akka.io/releases</url>
    </repository>
  </repositories>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>2.3</version>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.0.0</version>
    </dependency>
  </dependencies>
</project>
{% endhighlight %}

Compile and execute:
{% highlight bash %}
mvn package
$SPARK_HOME/bin/spark-submit --class SimpleApp --master local[4] target/simple-project-1.0.jar
{% endhighlight %}
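
Writing results back to Swift uses the same URI scheme, since the driver behaves as an
ordinary Hadoop filesystem. A minimal sketch for `spark-shell`, assuming the container
`logs` is writable and the output path `errors` does not yet exist (both names are examples):

{% highlight scala %}
// Filter a Swift-resident log and save the matching lines back to Swift.
val data = sc.textFile("swift://logs.SparkTest/data.log")
val errors = data.filter(line => line.contains("error"))
errors.saveAsTextFile("swift://logs.SparkTest/errors")
{% endhighlight %}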