4 changes: 4 additions & 0 deletions R/pkg/R/sparkR.R
@@ -362,6 +362,10 @@ sparkR.session <- function(
enableHiveSupport = TRUE,
...) {

if (length(sparkConfig[["spark.sql.warehouse.dir"]]) == 0) {
sparkConfig[["spark.sql.warehouse.dir"]] <- tempdir()
Contributor

As I mentioned in the JIRA, I think this change is not enough. If the user has a hive-site.xml with the warehouse dir set inside it, this change will override that [1]. We might need to change the Scala code and/or add a new option to specify a new default for SparkR.

[1]

val hiveWarehouseDir = sparkContext.hadoopConfiguration.get("hive.metastore.warehouse.dir")

Member

Plus, this property spark.sql.warehouse.dir could also be set in spark-defaults.conf, which isn't visible to this method at this point. Setting it here would override any other possible values from the Spark config or the Hive site file.
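As an illustration, one way to see which value actually took effect is to read the runtime conf after the session is up (a minimal sketch, assuming sparkR.conf() is available in the SparkR version in use):

# Sketch only: inspect the effective warehouse dir once a session exists,
# whichever of sparkConfig, spark-defaults.conf or hive-site.xml supplied it.
sparkR.session(enableHiveSupport = FALSE)
sparkR.conf("spark.sql.warehouse.dir")
sparkR.session.stop()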

Contributor Author
@bdwyer2 Dec 12, 2016

How about an argument named sparkWorkingDirectory that defaults to tempdir()?
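Just to illustrate, that suggestion might look roughly like the sketch below; sparkWorkingDirectory is a hypothetical argument name and this is not the real sparkR.session() definition.

# Hypothetical sketch of the proposed signature; the argument name is made up.
sparkR.session <- function(master = "", appName = "SparkR",
                           sparkConfig = list(),
                           enableHiveSupport = TRUE,
                           sparkWorkingDirectory = tempdir(),
                           ...) {
  # only applied when the user did not set the property explicitly
  if (length(sparkConfig[["spark.sql.warehouse.dir"]]) == 0) {
    sparkConfig[["spark.sql.warehouse.dir"]] <- sparkWorkingDirectory
  }
  # ... rest of the session setup ...
}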

Member
@felixcheung Dec 13, 2016

I think Shivaram is talking about a Spark property, not a parameter (perhaps your camel casing is an indication).
Basically, we don't want to change spark.sql.warehouse.dir here because it could already have been set at an earlier point (just not accessible here).


There is the problem of a circular dependency for getting the Hive warehouse dir.

I think currently in SparkR, to create the SparkSession we need all those parameters, and once the Spark session is there we can call

sparkContext <- SparkR:::callJMethod(sc, "sparkContext")
hadoopConf <- SparkR:::callJMethod(sparkContext, "hadoopConfiguration")
hiveWarehouseDir <- SparkR:::callJMethod(hadoopConf, "get", "hive.metastore.warehouse.dir")

but the above logic requires that we have a SparkSession available, and there is no Spark session yet at this line of code. So there must be a way to get the Hive conf directory from the environment.
We have to decide whether it is OK to read the environment from the system or not, something like the following:

require(XML)
data <- xmlParse(file.path(Sys.getenv("HADOOP_HOME"), "conf", "hive-site.xml"))
xml_data <- xmlToList(data)

It would make SparkR dependent on an XML parser. Just some thoughts :)
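If SparkR did take that route, pulling the specific property out of the parsed file might look roughly like the sketch below; it assumes the usual property/name/value layout of hive-site.xml and is untested.

# Sketch only: extract hive.metastore.warehouse.dir from the list produced
# by xmlToList() above, assuming the standard hive-site.xml structure.
props <- xml_data[names(xml_data) == "property"]
match <- Filter(function(p) identical(p$name, "hive.metastore.warehouse.dir"), props)
hiveWarehouseDir <- if (length(match) > 0) match[[1]]$value else NULL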

Contributor Author

@felixcheung I'm confused. By "spark property" do you mean something passed to sparkR.session() via the sparkConfig argument?
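For context, the two things being conflated would look roughly like this (illustrative only):

# 1. A Spark property passed through the sparkConfig named list:
sparkR.session(sparkConfig = list("spark.sql.warehouse.dir" = tempdir()))

# 2. The same property set outside of R, e.g. in conf/spark-defaults.conf or
#    resolved from hive-site.xml, which this R-side check cannot see.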

}

sparkConfigMap <- convertNamedListToEnv(sparkConfig)
namedParams <- list(...)
if (length(namedParams) > 0) {
14 changes: 14 additions & 0 deletions R/pkg/inst/tests/testthat/test_sparkR.R
@@ -44,3 +44,17 @@ test_that("sparkCheckInstall", {
deployMode <- "client"
expect_error(sparkCheckInstall(sparkHome, master, deployMode))
})

test_that("sparkR.session", {
# nothing should be written outside tempdir() without explicit user permission
initial_working_directory_files <- list.files()
sparkR.session(enableHiveSupport = FALSE)
df <- data.frame("col1" = c(1, 2, 3, 4, 5, 6),
"col2" = c(1, 0, 0, 1, 1, 0),
"col3" = c(1, 0, 0, 2, 6, 2))
df <- as.DataFrame(df)
createOrReplaceTempView(df, "table")
result <- sql("SELECT * FROM `table`")
sparkR.session.stop()
expect_equal(initial_working_directory_files, list.files())
})