[SPARK-18817][SparkR] set default spark-warehouse path to tempdir() #16247
Conversation
|
I may have missed something, but isn't tempdir() per R session? Maybe I should have tried it first, but I guess it wouldn't be accessible after the R session is ended and another one is started. Is this behaviour expected? |
|
Jenkins, ok to test |
|
@HyukjinKwon Yes, but we are restricted by CRAN policies.
|
|
@HyukjinKwon That is a good point. As @bdwyer2 says, the question is one of defaults. If the user does specify a more permanent location we will use that. I wonder if we should print a warning or info message as well? |
|
I see. Thank you both for your comments. (BTW, this should have a JIRA I think.) |
|
Should I open a JIRA under SPARK-15799 myself or leave that to one of the admins? |
|
Test build #69974 has finished for PR 16247 at commit
|
Force-pushed from 21bf50b to 79caf1c
|
@bdwyer2 Feel free to open a JIRA as a sub-task under SPARK-15799 |
...) {
  if (length(sparkConfig[["spark.sql.warehouse.dir"]]) == 0) {
    sparkConfig[["spark.sql.warehouse.dir"]] <- tempdir()
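In effect, an unset warehouse path now falls back to the R session's temporary directory; a minimal illustration of the intended behaviour (the calls are illustrative, assuming the patched sparkR.session()):

sparkR.session()
# no spark.sql.warehouse.dir supplied: the warehouse resolves under tempdir()
sparkR.session(sparkConfig = list("spark.sql.warehouse.dir" = "/data/warehouse"))
# an explicit user-supplied location is left untouched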
As I mentioned in the JIRA, I think this change is not enough. If the user has a hive-site.xml with the warehouse dir set inside it, this change will override that [1]. We might need to change the Scala code and/or add a new option to specify a new default for SparkR.
[1]
val hiveWarehouseDir = sparkContext.hadoopConfiguration.get("hive.metastore.warehouse.dir")
Plus, this property spark.sql.warehouse.dir could also be set in spark-defaults.conf, which isn't known to this method at this point. Setting it here would override any other possible values from the Spark config or the Hive site config.
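A sketch of the failure mode being described (paths and values are illustrative):

# spark-defaults.conf, set by the admin:
#   spark.sql.warehouse.dir  /data/warehouse
sparkR.session()
# with this patch, sparkConfig now carries spark.sql.warehouse.dir = tempdir(),
# which wins over spark-defaults.conf and hive-site.xml, silently discarding
# the admin's setting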
How about an argument named sparkWorkingDirectory that defaults to tempdir()?
I think Shivaram is talking about a Spark property, not a parameter (if your camel casing is perhaps an indication).
Basically, we don't want to change spark.sql.warehouse.dir here because it could already be set at an earlier point (just not accessible here).
There is the problem of a circular dependency for getting the Hive warehouse dir.
Currently in SparkR, to create a SparkSession we need all those parameters, and once the session exists we can call

sparkContext <- SparkR:::callJMethod(sc, "sparkContext")
hadoopConf <- SparkR:::callJMethod(sparkContext, "hadoopConfiguration")
hiveWarehouseDir <- SparkR:::callJMethod(hadoopConf, "get", "hive.metastore.warehouse.dir")

but this logic requires an existing SparkSession, and there is no SparkSession yet at this line of code. So there must be a way to get the Hive conf directory from the environment.
We have to decide whether it is OK to read the environment from the system or not, something like the following:

require(XML)
data <- xmlParse(file.path(Sys.getenv("HADOOP_HOME"), "conf", "hive-site.xml"))
xml_data <- xmlToList(data)

It would make SparkR dependent on an XML parser. Just some thoughts :)
@felixcheung I'm confused. By "spark property" do you mean something passed to sparkR.session() via the sparkConfig argument?
|
@bdwyer2 One more thing: Is there a good way to test this? |
|
Test build #69975 has finished for PR 16247 at commit
|
|
I think we should only change spark.sql.warehouse.dir when we are loading SparkR as a package. This should minimize changes in the case where we are running in cluster mode and so on. For that purpose, let's check for R interactive(). Conceptually, we might want to pass the R tempdir() along as a property and then check again after the config and Hive config are applied; if it is still not set, take that property and set it to spark.sql.warehouse.dir. |
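A rough sketch of that flow on the R side (the carrier property name spark.r.default.warehouse.dir is hypothetical):

if (interactive() && length(sparkConfig[["spark.sql.warehouse.dir"]]) == 0) {
  # don't set spark.sql.warehouse.dir itself; pass the candidate along
  sparkConfig[["spark.r.default.warehouse.dir"]] <- tempdir()
}
# the JVM side would promote this to spark.sql.warehouse.dir only if that
# property is still unset after spark-defaults.conf and hive-site.xml apply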
|
Or we could introduce a new property, say spark.sql.default.warehouse, and set that to tempdir().
|
|
... that's actually what I meant with "we might want to pass the R tempdir() along as a property" <-- this would be a new property and not spark.sql.warehouse.dir itself. |
|
@shivaram I can create a test to verify the output of list.files() before and after calling sparkR.session(). |
|
@bdwyer2 The test case idea sounds good! Regarding the conf naming for the warehouse dir, let's also check with contributors who are more familiar with SQL. |
|
Test build #70032 has finished for PR 16247 at commit
|
test_that("sparkR.session", {
  # nothing should be written outside the tempdir() without explicit user permission
  initial_working_directory_files <- list.files()
  sparkR.session()
We might want to explicitly run some SQL query here, then call sparkR.stop(), and then check the files.
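A hedged sketch of that test flow (the query and assertion are illustrative):

before <- list.files()
sparkR.session()
sql("CREATE TABLE people (name STRING) USING parquet")  # materializes the warehouse dir
sparkR.stop()
expect_equal(list.files(), before)  # nothing new in the working directory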
|
Test build #70035 has finished for PR 16247 at commit
|
|
I don't see how my last commit could have caused this. |
|
jenkins, retest this please |
|
Test build #70040 has finished for PR 16247 at commit
|
|
Test build #70047 has finished for PR 16247 at commit
|
|
a new property for default warehouse LGTM |
|
Re: test failure: it might be related to this change. The call stack is hidden; it should be trying to call into https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionState.scala and failing. Is it possible that tempdir() is not writable on Jenkins? |
|
This test failure seems related to this PR. It seems to be because the previous Hive-enabled Spark session is not closed properly between tests; there was only a single instance of a Hive-enabled Spark session in the suite before. |
|
Possibly. SPARK-16027 was just a hack; the root issue remains, I think. |
|
Yeah, disabling Hive for the test is fine. @bdwyer2 Can you add the new config flag as well? We can do one final pass of review after that. |
|
Test build #70093 has finished for PR 16247 at commit
|
|
I meant it as something we set on the SparkContext or the SparkSession, not a parameter of sparkR.session().
|
|
But yes, other Spark config properties would be set by the user via the sparkConfig parameter of the sparkR.session method. We would just add to that, without adding another parameter to sparkR.session(). |
|
How would we access that value on the Scala side? I'm currently unable to compile Spark, which makes experimenting with Scala difficult. |
|
yes, something like what's being used here https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L49 |
|
@bdwyer2 Let us know if you have problems setting up the environment; if so, @felixcheung or I can open a new PR that includes your changes (we can still assign the JIRA as your contribution). The reason I ask is that it would be good to get this in before the RC3 cut, as this helps the SparkR CRAN release etc. |
|
@shivaram @felixcheung I'll close this PR so that one of you can take over in order to have it done in time for the RC. |
What changes were proposed in this pull request?
Set the default location of spark.sql.warehouse.dir to be compliant with the CRAN policy (https://cran.r-project.org/web/packages/policies.html) regarding writing files outside of the tmp directory. Previously, a folder named spark-warehouse was created in the working directory when sparkR.session() was called. See SPARK-15799 for discussion.
cc @shivaram
How was this patch tested?
Added a new test and manually verified nothing was created in my working directory after running the following code:
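The snippet itself was not included in the description; a plausible reconstruction of the manual check:

library(SparkR)
before <- list.files(all.files = TRUE)  # snapshot the working directory
sparkR.session()                        # used to create ./spark-warehouse here
df <- createDataFrame(faithful)         # run a small workload
head(df)
sparkR.session.stop()
setdiff(list.files(all.files = TRUE), before)  # expect character(0)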