
Commit 43e6619

holdenk authored and shivaram committed
[SPARK-8506] Add pakages to R context created through init.
Author: Holden Karau <[email protected]>

Closes apache#6928 from holdenk/SPARK-8506-sparkr-does-not-provide-an-easy-way-to-depend-on-spark-packages-when-performing-init-from-inside-of-r and squashes the following commits:

b60dd63 [Holden Karau] Add an example with the spark-csv package
fa8bc92 [Holden Karau] typo: sparm -> spark
865a90c [Holden Karau] strip spaces for comparision
c7a4471 [Holden Karau] Add some documentation
c1a9233 [Holden Karau] refactor for testing
c818556 [Holden Karau] Add pakages to R
1 parent 1173483 commit 43e6619

File tree: 4 files changed (+69 lines, -13 lines)


R/pkg/R/client.R

Lines changed: 19 additions & 7 deletions
@@ -34,24 +34,36 @@ connectBackend <- function(hostname, port, timeout = 6000) {
   con
 }
 
-launchBackend <- function(args, sparkHome, jars, sparkSubmitOpts) {
+determineSparkSubmitBin <- function() {
   if (.Platform$OS.type == "unix") {
     sparkSubmitBinName = "spark-submit"
   } else {
     sparkSubmitBinName = "spark-submit.cmd"
   }
+  sparkSubmitBinName
+}
+
+generateSparkSubmitArgs <- function(args, sparkHome, jars, sparkSubmitOpts, packages) {
+  if (jars != "") {
+    jars <- paste("--jars", jars)
+  }
+
+  if (packages != "") {
+    packages <- paste("--packages", packages)
+  }
 
+  combinedArgs <- paste(jars, packages, sparkSubmitOpts, args, sep = " ")
+  combinedArgs
+}
+
+launchBackend <- function(args, sparkHome, jars, sparkSubmitOpts, packages) {
+  sparkSubmitBinName <- determineSparkSubmitBin()
   if (sparkHome != "") {
     sparkSubmitBin <- file.path(sparkHome, "bin", sparkSubmitBinName)
   } else {
     sparkSubmitBin <- sparkSubmitBinName
   }
-
-  if (jars != "") {
-    jars <- paste("--jars", jars)
-  }
-
-  combinedArgs <- paste(jars, sparkSubmitOpts, args, sep = " ")
+  combinedArgs <- generateSparkSubmitArgs(args, sparkHome, jars, sparkSubmitOpts, packages)
   cat("Launching java with spark-submit command", sparkSubmitBin, combinedArgs, "\n")
   invisible(system2(sparkSubmitBin, combinedArgs, wait = F))
 }
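To make the split concrete, here is a small illustration, not part of the commit, of what generateSparkSubmitArgs returns; the jar path and package coordinate are made-up example values, and "sparkr-shell" is the default submit option used by sparkR.init:

# Hypothetical inputs: one local jar, one Spark Packages coordinate, and the
# default "sparkr-shell" opts; all values below are examples only.
args <- generateSparkSubmitArgs(args = "sparkr-backend-args",
                                sparkHome = "",
                                jars = "/tmp/extra.jar",
                                sparkSubmitOpts = "sparkr-shell",
                                packages = "com.databricks:spark-csv_2.11:1.0.3")
# args is now the single string:
# "--jars /tmp/extra.jar --packages com.databricks:spark-csv_2.11:1.0.3 sparkr-shell sparkr-backend-args"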

R/pkg/R/sparkR.R

Lines changed: 5 additions & 2 deletions
@@ -81,6 +81,7 @@ sparkR.stop <- function() {
 #' @param sparkExecutorEnv Named list of environment variables to be used when launching executors.
 #' @param sparkJars Character string vector of jar files to pass to the worker nodes.
 #' @param sparkRLibDir The path where R is installed on the worker nodes.
+#' @param sparkPackages Character string vector of packages from spark-packages.org
 #' @export
 #' @examples
 #'\dontrun{
@@ -100,7 +101,8 @@ sparkR.init <- function(
   sparkEnvir = list(),
   sparkExecutorEnv = list(),
   sparkJars = "",
-  sparkRLibDir = "") {
+  sparkRLibDir = "",
+  sparkPackages = "") {
 
   if (exists(".sparkRjsc", envir = .sparkREnv)) {
     cat("Re-using existing Spark Context. Please stop SparkR with sparkR.stop() or restart R to create a new Spark Context\n")
@@ -129,7 +131,8 @@ sparkR.init <- function(
       args = path,
       sparkHome = sparkHome,
       jars = jars,
-      sparkSubmitOpts = Sys.getenv("SPARKR_SUBMIT_ARGS", "sparkr-shell"))
+      sparkSubmitOpts = Sys.getenv("SPARKR_SUBMIT_ARGS", "sparkr-shell"),
+      packages = sparkPackages)
   # wait atmost 100 seconds for JVM to launch
   wait <- 0.1
   for (i in 1:25) {
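With sparkPackages threaded through sparkR.init and into launchBackend, packages can be requested when the context is created from R. A minimal sketch; the master, app name, and package coordinate are example values only:

library(SparkR)
# sparkPackages is the argument added by this change; its value is forwarded to
# spark-submit as a --packages flag via generateSparkSubmitArgs.
sc <- sparkR.init(master = "local[2]",
                  appName = "packages-example",
                  sparkPackages = "com.databricks:spark-csv_2.11:1.0.3")
sparkR.stop()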

R/pkg/inst/tests/test_client.R

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+context("functions in client.R")
+
+test_that("adding spark-testing-base as a package works", {
+  args <- generateSparkSubmitArgs("", "", "", "",
+                                  "holdenk:spark-testing-base:1.3.0_0.0.5")
+  expect_equal(gsub("[[:space:]]", "", args),
+               gsub("[[:space:]]", "",
+                    "--packages holdenk:spark-testing-base:1.3.0_0.0.5"))
+})
+
+test_that("no package specified doesn't add packages flag", {
+  args <- generateSparkSubmitArgs("", "", "", "", "")
+  expect_equal(gsub("[[:space:]]", "", args),
+               "")
+})
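Since generateSparkSubmitArgs is an unexported helper, these tests assume they run inside the SparkR namespace. A hedged sketch of invoking them with testthat from an R session (the project may use its own test runner instead):

library(testthat)
library(SparkR)
# test_package() runs the files shipped under the installed package's tests
# directory in the package namespace, so unexported helpers such as
# generateSparkSubmitArgs are visible to the test code.
test_package("SparkR")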

docs/sparkr.md

Lines changed: 13 additions & 4 deletions
@@ -27,9 +27,9 @@ All of the examples on this page use sample data included in R or the Spark dist
 <div data-lang="r" markdown="1">
 The entry point into SparkR is the `SparkContext` which connects your R program to a Spark cluster.
 You can create a `SparkContext` using `sparkR.init` and pass in options such as the application name
-etc. Further, to work with DataFrames we will need a `SQLContext`, which can be created from the
-SparkContext. If you are working from the SparkR shell, the `SQLContext` and `SparkContext` should
-already be created for you.
+, any spark packages depended on, etc. Further, to work with DataFrames we will need a `SQLContext`,
+which can be created from the SparkContext. If you are working from the SparkR shell, the
+`SQLContext` and `SparkContext` should already be created for you.
 
 {% highlight r %}
 sc <- sparkR.init()
@@ -62,7 +62,16 @@ head(df)
 
 SparkR supports operating on a variety of data sources through the `DataFrame` interface. This section describes the general methods for loading and saving data using Data Sources. You can check the Spark SQL programming guide for more [specific options](sql-programming-guide.html#manually-specifying-options) that are available for the built-in data sources.
 
-The general method for creating DataFrames from data sources is `read.df`. This method takes in the `SQLContext`, the path for the file to load and the type of data source. SparkR supports reading JSON and Parquet files natively and through [Spark Packages](http://spark-packages.org/) you can find data source connectors for popular file formats like [CSV](http://spark-packages.org/package/databricks/spark-csv) and [Avro](http://spark-packages.org/package/databricks/spark-avro).
+The general method for creating DataFrames from data sources is `read.df`. This method takes in the `SQLContext`, the path for the file to load and the type of data source. SparkR supports reading JSON and Parquet files natively and through [Spark Packages](http://spark-packages.org/) you can find data source connectors for popular file formats like [CSV](http://spark-packages.org/package/databricks/spark-csv) and [Avro](http://spark-packages.org/package/databricks/spark-avro). These packages can either be added by
+specifying `--packages` with `spark-submit` or `sparkR` commands, or if creating context through `init`
+you can specify the packages with the `packages` argument.
+
+<div data-lang="r" markdown="1">
+{% highlight r %}
+sc <- sparkR.init(packages="com.databricks:spark-csv_2.11:1.0.3")
+sqlContext <- sparkRSQL.init(sc)
+{% endhighlight %}
+</div>
 
 We can see how to use data sources using an example JSON input file. Note that the file that is used here is _not_ a typical JSON file. Each line in the file must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
 
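Building on the documentation snippet, a sketch of loading a CSV file with read.df once spark-csv is available; the file path is hypothetical, "com.databricks.spark.csv" is the source name documented by the spark-csv package, and the argument name sparkPackages follows the sparkR.init signature added in this commit:

sc <- sparkR.init(sparkPackages = "com.databricks:spark-csv_2.11:1.0.3")
sqlContext <- sparkRSQL.init(sc)
# "people.csv" is a made-up local path; header = "true" is a spark-csv option.
people <- read.df(sqlContext, "people.csv",
                  source = "com.databricks.spark.csv", header = "true")
head(people)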
