
Conversation

@bdwyer2
Contributor

@bdwyer2 bdwyer2 commented Dec 10, 2016

What changes were proposed in this pull request?

Set the default location of spark.sql.warehouse.dir to the R session's temporary directory so that SparkR complies with the CRAN policy (https://cran.r-project.org/web/packages/policies.html) against writing files outside of the temp directory. Previously, a folder named spark-warehouse was created in the working directory when sparkR.session() was called.

See SPARK-15799 for discussion.
cc @shivaram

How was this patch tested?

Added a new test and manually verified that nothing was created in my working directory after running the following code:

sparkR.session(master = "local[*]",
               sparkConfig = list(spark.driver.memory = "2g"),
               enableHiveSupport = FALSE)

@HyukjinKwon
Member

I may have missed something, but isn't tempdir() per R session? Maybe I should have tried it first, but I guess the warehouse wouldn't be accessible after the R session is ended and another one is started. Is this behaviour expected?
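(For context, a minimal R illustration of the per-session behaviour being asked about; the paths shown are made up.)

# tempdir() points at a fresh, per-process directory that R removes when the
# session exits, so a warehouse placed there does not survive a restart.
tempdir()
# "/tmp/RtmpAbC123"    <- a different path in every new R session
file.path(tempdir(), "spark-warehouse")
# "/tmp/RtmpAbC123/spark-warehouse"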

@shivaram
Contributor

Jenkins, ok to test

@bdwyer2
Contributor Author

bdwyer2 commented Dec 10, 2016

@HyukjinKwon yes, but we are restricted by the CRAN policy:

Packages should not write in the users’ home filespace, nor anywhere else on the file system apart from the R session’s temporary directory

@shivaram
Contributor

@HyukjinKwon That is a good point. As @bdwyer2 says, the question is one of defaults. If the user specifies a more permanent location, we will use that. I wonder if we should print a warning or info message as well?

@HyukjinKwon
Member

I see. Thank you both for your comments. (BTW, this should have a JIRA I think.)

@bdwyer2
Contributor Author

bdwyer2 commented Dec 10, 2016

Should I open a JIRA under SPARK-15799 myself or leave that to one of the admins?

@SparkQA

SparkQA commented Dec 10, 2016

Test build #69974 has finished for PR 16247 at commit c855c2c.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@bdwyer2 bdwyer2 force-pushed the default_sparkr_spark_warehouse_fix branch from 21bf50b to 79caf1c on December 10, 2016 at 21:59
@shivaram
Contributor

@bdwyer2 Feel free to open a JIRA as a sub-task under SPARK-15799

@bdwyer2 bdwyer2 changed the title from "[MINOR][SparkR] set default spark-warehouse path to tempdir()" to "[SPARK-18817][SparkR] set default spark-warehouse path to tempdir()" Dec 10, 2016
...) {

  if (length(sparkConfig[["spark.sql.warehouse.dir"]]) == 0) {
    sparkConfig[["spark.sql.warehouse.dir"]] <- tempdir()
Contributor

As I mentioned in the JIRA, I think this change is not enough. If the user has a hive-site.xml with the warehouse dir set inside it, this change will override that [1]. We might need to change the Scala code and/or add a new option to specify a new default for SparkR.

[1]

val hiveWarehouseDir = sparkContext.hadoopConfiguration.get("hive.metastore.warehouse.dir")

Member

Plus, this property spark.sql.warehouse.dir could also be set in spark-defaults.conf, which isn't visible to this method at this point. Setting it here would override any other possible values from the Spark config or the Hive site config.
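(For illustration, a hypothetical spark-defaults.conf entry of that kind; the R code has no visibility into this file at the point where it would set the value.)

# conf/spark-defaults.conf (hypothetical entry)
spark.sql.warehouse.dir    /data/spark-warehouse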

Contributor Author

@bdwyer2 bdwyer2 Dec 12, 2016

How about an argument named sparkWorkingDirectory that defaults to tempdir()?

Member

@felixcheung felixcheung Dec 13, 2016

I think Shivaram is talking about a Spark property, not a parameter (if your camel casing is any indication).
Basically, we don't want to change spark.sql.warehouse.dir here because it could already have been set at an earlier point (just not accessible here).


There is the problem of a circular dependency for getting the Hive warehouse dir.

Currently in SparkR, to create the SparkSession we need all of those parameters, and once the Spark session exists we can call

sparkContext <- SparkR:::callJMethod(sc, "sparkContext")
hadoopConf <- SparkR:::callJMethod(sparkContext, "hadoopConfiguration")
hiveWarehouseDir <- SparkR:::callJMethod(hadoopConf, "get", "hive.metastore.warehouse.dir")

But that logic requires an existing Spark session, and there is no Spark session yet at this line of code. So there would have to be a way to get the Hive conf directory from the environment.
We have to decide whether it is OK to read it from the system environment or not, something like the following:

require(XML)
data <- xmlParse(file.path(Sys.getenv("HADOOP_HOME"), "conf", "hive-site.xml"))
xml_data <- xmlToList(data)

It would make SparkR dependent on an XML parser. Just some thoughts :)

Contributor Author

@felixcheung I'm confused. By "spark property" do you mean something passed to sparkR.session() via the sparkConfig argument?

@shivaram
Contributor

@bdwyer2 One more thing: is there a good way to test this?

@SparkQA

SparkQA commented Dec 10, 2016

Test build #69975 has finished for PR 16247 at commit 79caf1c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

I think we should only change spark.sql.warehouse.dir when we are loading SparkR as a package. This should minimize changes in cases where we are running in cluster mode and so on.

For that purpose, let's check for R's interactive().

Conceptually, we might want to pass the R tempdir() along as a property and then check again after the Spark config and Hive config are applied; if it is still not set, take that property and set it as spark.sql.warehouse.dir (a rough sketch follows below).
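(A rough sketch of that idea, assuming the interactive() guard above and a hypothetical default property name; not the final implementation.)

# Sketch only: inside sparkR.session(), before the JVM-side session is created.
if (interactive() && is.null(sparkConfig[["spark.sql.warehouse.dir"]])) {
  # Pass the R session's temp dir along as a default; the JVM side would use it
  # only if spark-defaults.conf / hive-site.xml did not already set a warehouse dir.
  sparkConfig[["spark.sql.warehouse.default.dir"]] <- tempdir()
}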

@shivaram
Contributor

shivaram commented Dec 11, 2016 via email

@felixcheung
Member

felixcheung commented Dec 11, 2016

... that's actually what I meant by "we might want to pass the R tempdir() along as a property" <-- this would be a new property, not spark.sql.warehouse.dir

@bdwyer2
Contributor Author

bdwyer2 commented Dec 12, 2016

@shivaram I can create a test that verifies the output of list.files() is the same before and after running sparkR.session().

@shivaram
Contributor

@bdwyer2 The test case idea sounds good!

Regarding the conf naming for the warehouse dir, let's also check with contributors who are more familiar with SQL.
cc @yhuai @cloud-fan

@SparkQA

SparkQA commented Dec 12, 2016

Test build #70032 has finished for PR 16247 at commit cee3944.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

test_that("sparkR.session", {
# nothing should be written outside the tempdir() without explicit user premission
inital_working_directory_files <- list.files()
sparkR.session()
Contributor

We might want to explicitly run some SQL query here, then call sparkR.stop(), and then check the files (see the sketch below).
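(A sketch of what the expanded test might look like, following that suggestion and assuming the SparkR/testthat test harness is loaded; the query and assertion are illustrative only.)

test_that("sparkR.session writes nothing outside tempdir()", {
  # Illustrative sketch combining the suggestions above.
  initial_files <- list.files()
  sparkR.session(enableHiveSupport = FALSE)
  # Run an actual query so a warehouse directory would be materialized if one were used.
  df <- createDataFrame(data.frame(x = 1:3))
  collect(df)
  sparkR.stop()
  expect_equal(list.files(), initial_files)
})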

@SparkQA

SparkQA commented Dec 12, 2016

Test build #70035 has finished for PR 16247 at commit 17b7bff.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@bdwyer2
Contributor Author

bdwyer2 commented Dec 12, 2016

I don't see how my last commit could have caused this:

functions in sparkR.R: .....
SparkSQL functions: Spark package found in SPARK_HOME: /home/jenkins/workspace/SparkPullRequestBuilder
Error in handleErrors(returnStatus, conn) : 
  java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':
	at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
	at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
	at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
	at org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:67)
	at org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:66)
	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at scala.collection.TraversableLike$WithF
Calls: test_package ... sparkR.session -> callJStatic -> invokeJava -> handleErrors

@jodersky
Member

jenkins, retest this please

@SparkQA

SparkQA commented Dec 12, 2016

Test build #70040 has finished for PR 16247 at commit 17b7bff.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 13, 2016

Test build #70047 has finished for PR 16247 at commit 84a110e.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

A new property for the default warehouse LGTM.

@felixcheung
Member

felixcheung commented Dec 13, 2016

Re: the test failure, it might be related to this change? The call stack is hidden; it should be trying to call into https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionState.scala and failing.

Is it possible that tempdir() is not writable on Jenkins?

@HyukjinKwon
Member

HyukjinKwon commented Dec 13, 2016

This test failure seems related to this PR. It seems to be because the previous Hive-enabled Spark session is not closed properly between test_sparkSQL.R and test_sparkR.R here; in particular, I suspect the lock in Derby held via the Hive client (not sure). It seems apparently related to SPARK-16027?

There was only a single instance of a Hive-enabled Spark session in test_sparkSQL.R, but the test here introduces another one. I manually tested each one after removing the other, and each works fine on its own.

@felixcheung
Member

Possibly. SPARK-16027 was just a hack; the root issue remains, I think.

@shivaram
Contributor

Yeah, disabling Hive for the test is fine. @bdwyer2 Can you add the new config flag as well? We can do one final pass of review after that.

@SparkQA

SparkQA commented Dec 13, 2016

Test build #70093 has finished for PR 16247 at commit 9142397.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

felixcheung commented Dec 14, 2016 via email

@felixcheung
Member

But yes, other Spark config properties would be set by the user in the sparkConfig parameter of the sparkR.session() method. We would just add to that, like

sparkConfig[["spark.sql.warehouse.default.dir"]] <- tempdir()

without adding another parameter to sparkR.session()

@bdwyer2
Contributor Author

bdwyer2 commented Dec 14, 2016

How would we access that value on the Scala side? Would sparkContext.hadoopConfiguration.get("spark.sql.warehouse.default.dir") work?

I'm currently unable to compile Spark, which makes experimenting with Scala difficult.
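(For illustration, a sketch of how the proposed property would surface, checked from the R side: a value passed via sparkConfig lands in the SparkConf / runtime SQL conf, not in the Hadoop configuration, so on the JVM side it would be read from the Spark conf rather than via hadoopConfiguration.get(...). The property name is the one proposed above and does not exist yet.)

# Hypothetical check once a session exists:
sparkR.session(sparkConfig = list(spark.sql.warehouse.default.dir = tempdir()))
sparkR.conf("spark.sql.warehouse.default.dir")
# readable from the Spark runtime conf; the Hadoop configuration would not
# contain it unless it were explicitly copied there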

@felixcheung
Member

@shivaram
Contributor

@bdwyer2 Let us know if you have problems setting up the environment. If so, @felixcheung or I can open a new PR that includes your changes (we can still assign the JIRA as your contribution).

The reason I ask is that it would be good to get this in before the RC3 cut, as this helps the SparkR CRAN release etc.

@bdwyer2
Contributor Author

bdwyer2 commented Dec 14, 2016

@shivaram @felixcheung I'll close this PR so that one of you can take over in order to have it done in time for the RC.
