[SPARK-12224][SPARKR] R support for JDBC source #10480
Conversation
Test build #48335 has finished for PR 10480 at commit
This is not needed?
As explained above, this is needed to properly handle java.util.Properties. It is useful in two ways:
- R code can set or get values on a java.util.Properties object directly (see the sketch below)
- It lets callJMethod match parameter types properly
As above, we could instead have a Scala helper that takes in all the parameters; in that case it would be better to have all the logic in Scala, and it would perhaps be easier to test.
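A minimal sketch of that direct access from R, using SparkR's internal JVM bridge (newJObject/callJMethod); it assumes the serde can hand Properties back to R as a jobj, which is what this change adds:

```r
# Hypothetical sketch: manipulate a java.util.Properties handle (jobj) from R.
props <- newJObject("java.util.Properties")         # construct Properties on the JVM
callJMethod(props, "setProperty", "user", "admin")  # set a key/value pair
callJMethod(props, "getProperty", "user")           # returns "admin" back in R
```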
My preference is to do more in R. If you feel strongly about having a helper in Scala instead of handling Properties, then we could move most of the code into a Scala helper.
I get it now: java.util.Properties implements the Map interface.
Yeah, it still feels awkward to special-case the Properties object like this.
@felixcheung Do you have an idea of what part of the code would move to Scala if we did this on the Scala side? We typically handle such conversions on the Scala side, so that's the main reason I'm asking. Is it just the varargsToJProperties function?
@shivaram As you can see, we call three different overloads of read().jdbc() in Scala, four if you count write().jdbc(). I see four approaches to handling read().jdbc():
- Have three JVM helper functions
- Have one helper function and figure out on the JVM side which overload to route to
- Have one helper function and move the parameter processing (e.g., checking numPartitions/defaultParallelism), plus the overload checks, entirely into the JVM, leaving R as a thin shim
- Serialize Properties as a jobj and work with it on the R side
I feel option 4 gives us the least overhead (less code) and more flexibility, since logic like the default value for numPartitions exists only on the R/Python side and not on the Scala side (see the sketch below).
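For concreteness, option 4 essentially comes down to a small R helper along these lines. varargsToJProperties is the name mentioned in this thread, but the body here is an assumed reconstruction, not necessarily the PR's exact code:

```r
# Build a java.util.Properties jobj on the R side from name/value varargs,
# so the R wrapper can pass it straight to DataFrameReader.jdbc().
varargsToJProperties <- function(...) {
  pairs <- list(...)
  props <- newJObject("java.util.Properties")
  if (length(pairs) > 0) {
    for (k in names(pairs)) {
      callJMethod(props, "setProperty", as.character(k), as.character(pairs[[k]]))
    }
  }
  props
}
```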
Besides option 4, option 2 is also acceptable to me.
Personally I don't think that special-casing the Properties object here is a major problem -- java.util.Properties is a very commonly used class, and it would make sense for the RPC layer of SparkR to handle Properties alongside other common types like Map and String. But it makes sense to defer to Shivaram on this point. I would vote for option (2) above.
Note that, as far as I can see, the code here that passes a Properties object back to R is only exercised by the test cases in this PR. The actual code for invoking read.jdbc() only writes to Properties objects.
Thanks @frreiss. I agree with supporting serde for Properties.
Thanks @felixcheung for summarizing the options. I was trying to judge how frequently we use java.util.Properties in the Spark DataFrame codebase, and it looks like JDBC support is the only use case. That said, if having support in the SerDe makes the integration much easier, I think we can go this route. As @frreiss said, java.util.Properties is a pretty common data structure, so this could be useful in the future.
Overall, the current approach is fine by me.
For testing JDBC, could we add a helper function on the Scala side that reuses code from JDBCSuite to start an in-memory JDBC server?
State that the predicates parameter is mutually exclusive with partitionColumn/lowerBound/upperBound/numPartitions (illustrated below).
It's on line 564 above.
OK
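For context, the two mutually exclusive calling styles of the new reader might look like this; the URL, table names, and credentials are placeholders, not values from the PR:

```r
# Partitioned read: partitionColumn, lowerBound, upperBound, and numPartitions
# are used together to split the table into ranges.
jdbcUrl <- "jdbc:mysql://localhost:3306/databasename"
df <- read.jdbc(jdbcUrl, "table", partitionColumn = "id",
                lowerBound = 0, upperBound = 10000, numPartitions = 4,
                user = "username", password = "password")

# Predicate read: one partition per WHERE-clause expression; this style is
# not combined with the partition-column parameters above.
df2 <- read.jdbc(jdbcUrl, "table2", predicates = list("id <= 123"),
                 user = "username", password = "password")
```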
rxin davies shivaram Took the save mode from my PR #10480, and moved everything to writer methods. This is related to PR #10559.
- [x] it seems jsonRDD() is broken, need to investigate - this is not a public API though; will look into it some more tonight. (fixed)
Author: felixcheung <[email protected]>
Closes #10584 from felixcheung/rremovedeprecated.
@sun-rui Are there any more comments on this PR?
Force-pushed de635b1 to 991a9b7.
Rebased and updated. Thanks.
Test build #49082 has finished for PR 10480 at commit
Seems to be SPARK-4628.
Test build #49087 has finished for PR 10480 at commit
Force-pushed 991a9b7 to 8c64ac7.
Test build #49514 has finished for PR 10480 at commit
Test build #49523 has finished for PR 10480 at commit
@shivaram this is ready, thanks!
Force-pushed 5471b14 to fccc761.
Jenkins, retest this please.
Test build #49739 has finished for PR 10480 at commit
Is "\cr" intended for roxygen formatting?
Yes, it forces a new line in the generated doc; otherwise roxygen2 removes the newline.
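For illustration, a minimal roxygen2 comment using \cr; the text and function are placeholders, not the PR's actual docs:

```r
#' Read a table from a JDBC source. \cr
#' The trailing \cr above forces a line break in the generated Rd docs;
#' without it, roxygen2 joins the two lines into one paragraph.
example_fn <- function() NULL  # placeholder function for the roxygen block
```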
LGTM
@shivaram any suggestion on how to proceed?
Sorry for the delay @felixcheung -- I'll get back on this today.
@felixcheung Could you bring this PR up to date? The code changes look fine to me, and we can merge once this goes through Jenkins.
…ns, add generic, fix bugs
Test build #56244 has finished for PR 10480 at commit
hmm
Jenkins, retest this please
Let's give it one more shot, I guess.
Test build #56265 has finished for PR 10480 at commit
Merging this to master
Add an R API for read.jdbc and write.jdbc.
Tested this quite a bit manually with different combinations of parameters. It's not clear whether we could have automated tests in R for this - the Scala JDBCSuite depends on the Java H2 in-memory database. Refactored some code into utils so it could be tested.
Core's R SerDe code needs to be updated to allow access to java.util.Properties as a jobj handle, which is required by DataFrameReader/Writer's jdbc method. It would be possible, though more code, to add a sql/r/SQLUtils helper function instead.
Tested:
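Separately from the PR's own test list (elided above), a hedged usage sketch of the added writer API; the URL, table name, and credentials are placeholders:

```r
# Hypothetical example: append an existing SparkDataFrame `df` to a JDBC table.
jdbcUrl <- "jdbc:mysql://localhost:3306/databasename"
write.jdbc(df, jdbcUrl, "table", mode = "append",
           user = "username", password = "password")
```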