[SPARK-18817][SparkR] set default spark-warehouse path to tempdir() #16247
Conversation
|
I may have missed something, but isn't tempdir() per R session? Maybe I should have tried it first, but I guess it wouldn't be accessible after the R session is ended and another one is started. Is this behaviour expected? |
|
Jenkins, ok to test |
|
@HyukjinKwon Yes, but we are restricted by CRAN policies.
|
|
@HyukjinKwon That is a good point. As @bdwyer2 says, the question is one of defaults. If the user does specify a more permanent location we will use that. I wonder if we should print a warning or info message as well? |
|
I see. Thank you both for your comments. (BTW, this should have a JIRA I think.) |
|
Should I open a JIRA under SPARK-15799 myself or leave that to one of the admins? |
|
Test build #69974 has finished for PR 16247 at commit
|
Force-pushed from 21bf50b to 79caf1c
|
@bdwyer2 Feel free to open a JIRA as a sub-task under SPARK-15799 |
...) {
  if (length(sparkConfig[["spark.sql.warehouse.dir"]]) == 0) {
    sparkConfig[["spark.sql.warehouse.dir"]] <- tempdir()
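In effect, an unset warehouse path now falls back to the R session's temporary directory; a minimal illustration of the intended behaviour (the calls are illustrative, assuming the patched sparkR.session()):

sparkR.session()
# no spark.sql.warehouse.dir supplied: the warehouse resolves under tempdir()
sparkR.session(sparkConfig = list("spark.sql.warehouse.dir" = "/data/warehouse"))
# an explicit user-supplied location is left untouched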
As I mentioned in the JIRA, I think this change is not enough. If the user has a hive-site.xml with the warehouse dir set inside it, this change will override that [1]. We might need to change the Scala code and/or add a new option to specify a new default for SparkR.
[1]
val hiveWarehouseDir = sparkContext.hadoopConfiguration.get("hive.metastore.warehouse.dir")
Plus, this property spark.sql.warehouse.dir could also be set in spark-defaults.conf, which isn't known to this method at this point. Setting it here would override any other possible values from the Spark config or the Hive site config.
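A sketch of the failure mode being described (paths and values are illustrative):

# spark-defaults.conf, set by the admin:
#   spark.sql.warehouse.dir  /data/warehouse
sparkR.session()
# with this patch, sparkConfig now carries spark.sql.warehouse.dir = tempdir(),
# which wins over spark-defaults.conf and hive-site.xml, silently discarding
# the admin's setting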
How about an argument named sparkWorkingDirectory that defaults to tempdir()?
I think Shivaram is talking about a Spark property, not a parameter (if your camel casing is perhaps an indication).
Basically, we don't want to change spark.sql.warehouse.dir here because it could already be set at an earlier point (just not accessible here).
There is the problem of a circular dependency for getting the Hive warehouse dir.
Currently in SparkR, to create a SparkSession we need all those parameters, and once the session exists we can call

sparkContext <- SparkR:::callJMethod(sc, "sparkContext")
hadoopConf <- SparkR:::callJMethod(sparkContext, "hadoopConfiguration")
hiveWarehouseDir <- SparkR:::callJMethod(hadoopConf, "get", "hive.metastore.warehouse.dir")

but this logic requires an existing SparkSession, and there is no SparkSession yet at this line of code. So there must be a way to get the Hive conf directory from the environment.
We have to decide whether it is OK to read the environment from the system or not, something like the following:

require(XML)
data <- xmlParse(file.path(Sys.getenv("HADOOP_HOME"), "conf", "hive-site.xml"))
xml_data <- xmlToList(data)

It would make SparkR dependent on an XML parser. Just some thoughts :)
@felixcheung I'm confused. By "spark property" do you mean something passed to sparkR.session() via the sparkConfig argument?
|
@bdwyer2 One more thing: Is there a good way to test this? |
|
Test build #69975 has finished for PR 16247 at commit
|
|
I think we should only change spark.sql.warehouse.dir when we are loading SparkR as a package. This should minimize changes in the case where we are running in cluster mode and so on. For that purpose, let's check for R interactive(). Conceptually, we might want to pass the R tempdir() along as a property and then check again after the config and Hive config are applied; if it is still not set, take that property and set it to spark.sql.warehouse.dir. |
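A rough sketch of that flow on the R side (the carrier property name spark.r.default.warehouse.dir is hypothetical):

if (interactive() && length(sparkConfig[["spark.sql.warehouse.dir"]]) == 0) {
  # don't set spark.sql.warehouse.dir itself; pass the candidate along
  sparkConfig[["spark.r.default.warehouse.dir"]] <- tempdir()
}
# the JVM side would promote this to spark.sql.warehouse.dir only if that
# property is still unset after spark-defaults.conf and hive-site.xml apply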
|
Or we could introduce a new property, say spark.sql.default.warehouse, and set that to tempdir().
|
|
... that's actually what I meant with "we might want to pass the R tempdir() along as a property" <-- this would be a new property and not spark.sql.warehouse.dir itself. |
|
@shivaram I can create a test to verify the output of list.files() before and after calling sparkR.session(). |
|
@bdwyer2 The test case idea sounds good! Regarding the conf naming for the warehouse dir, let's also check with contributors who are more familiar with SQL. |
|
Test build #70032 has finished for PR 16247 at commit
|
test_that("sparkR.session", {
  # nothing should be written outside the tempdir() without explicit user permission
  initial_working_directory_files <- list.files()
  sparkR.session()
We might want to explicitly run some SQL query here, then call sparkR.stop(), and then check the files.
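A hedged sketch of that test flow (the query and assertion are illustrative):

before <- list.files()
sparkR.session()
sql("CREATE TABLE people (name STRING) USING parquet")  # materializes the warehouse dir
sparkR.stop()
expect_equal(list.files(), before)  # nothing new in the working directory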
|
Test build #70035 has finished for PR 16247 at commit
|
|
I don't see how my last commit could have caused this. |
|
jenkins, retest this please |
|
Test build #70040 has finished for PR 16247 at commit
|
|
Test build #70047 has finished for PR 16247 at commit
|
|
a new property for default warehouse LGTM |
|
Re: test failure: it might be related to this change. The call stack is hidden; it should be trying to call into https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionState.scala and failing. Is it possible that tempdir() is not writable on Jenkins? |
|
This test failure seems related to this PR. It seems to be because the previous Hive-enabled Spark session is not closed properly between tests; there was only a single instance of a Hive-enabled Spark session in the suite before. |
|
Possibly. SPARK-16027 was just a hack; the root issue remains, I think. |
|
Yeah, disabling Hive for the test is fine. @bdwyer2 Can you add the new config flag as well? We can do one final pass of review after that. |
|
Test build #70093 has finished for PR 16247 at commit
|
|
I meant it as something we set on the SparkContext or the SparkSession, not a parameter of sparkR.session().
|
|
But yes, other Spark config properties would be set by the user via the sparkConfig parameter of the sparkR.session method. We would just add to that, without adding another parameter to sparkR.session(). |
|
How would we access that value on the Scala side? I'm currently unable to compile Spark, which makes experimenting with Scala difficult. |
|
yes, something like what's being used here https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L49 |
|
@bdwyer2 Let us know if you have problems setting up the environment; if so, @felixcheung or I can open a new PR that includes your changes (we can still assign the JIRA as your contribution). The reason I ask is that it would be good to get this in before the RC3 cut, as this helps the SparkR CRAN release etc. |
|
@shivaram @felixcheung I'll close this PR so that one of you can take over in order to have it done in time for the RC. |
What changes were proposed in this pull request?
Set the default location of spark.sql.warehouse.dir to be compliant with the CRAN policy (https://cran.r-project.org/web/packages/policies.html) regarding writing files outside of the tmp directory. Previously, a folder named spark-warehouse was created in the working directory when sparkR.session() was called. See SPARK-15799 for discussion.
cc @shivaram
How was this patch tested?
Added a new test and manually verified nothing was created in my working directory after running the following code:
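The snippet itself was not included in the description; a plausible reconstruction of the manual check:

library(SparkR)
before <- list.files(all.files = TRUE)  # snapshot the working directory
sparkR.session()                        # used to create ./spark-warehouse here
df <- createDataFrame(faithful)         # run a small workload
head(df)
sparkR.session.stop()
setdiff(list.files(all.files = TRUE), before)  # expect character(0)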