
Conversation

@felixcheung (Member)

Add R API for read.jdbc, write.jdbc.

Tested this quite a bit manually with different combinations of parameters. It's not clear whether we could have automated tests in R for this, since the Scala JDBCSuite depends on the Java H2 in-memory database.

Refactored some code into utils so it could be tested.

Core's R SerDe code needs to be updated to allow access to java.util.Properties as a jobj handle, which is required by the DataFrameReader/Writer jdbc methods. It would also be possible, though more code, to add a helper function to sql/r/SQLUtils instead.
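As a rough illustration of what this enables (a sketch only, using SparkR's internal JVM-call helpers; not the exact code in this patch), the R side can build a java.util.Properties jobj and hand it to DataFrameReader's jdbc overloads directly:

# sketch: construct a Properties jobj and pass it to DataFrameReader.jdbc
props <- SparkR:::newJObject("java.util.Properties")
invisible(SparkR:::callJMethod(props, "setProperty", "user", "user"))
invisible(SparkR:::callJMethod(props, "setProperty", "password", "12345"))
reader <- SparkR:::callJMethod(sqlContext, "read")
sdf <- SparkR:::callJMethod(reader, "jdbc", "jdbc:postgresql://localhost/db", "films2", props)
df <- SparkR:::dataFrame(sdf)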

Tested:

# with postgresql
../bin/sparkR --driver-class-path /usr/share/java/postgresql-9.4.1207.jre7.jar

# read.jdbc
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", user = "user", password = "12345")
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", user = "user", password = 12345)

# partitionColumn and numPartitions test
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", partitionColumn = "did", lowerBound = 0, upperBound = 200, numPartitions = 4, user = "user", password = 12345)
a <- SparkR:::toRDD(df)
SparkR:::getNumPartitions(a)
[1] 4
SparkR:::collectPartition(a, 2L)

# defaultParallelism test
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", partitionColumn = "did", lowerBound = 0, upperBound = 200, user = "user", password = 12345)
a <- SparkR:::toRDD(df)
SparkR:::getNumPartitions(a)
[1] 2

# predicates test
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", predicates = list("did<=105"), user = "user", password = 12345)
count(df) == 1

# write.jdbc, default save mode "error"
irisDf <- as.DataFrame(sqlContext, iris)
write.jdbc(irisDf, "jdbc:postgresql://localhost/db", "films2", user = "user", password = "12345")
"error, already exists"

write.jdbc(irisDf, "jdbc:postgresql://localhost/db", "iris", user = "user", password = "12345")
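
# For comparison, a sketch of overriding the default save mode (the mode value
# here is assumed to follow Scala's SaveMode names; not part of the runs above):
write.jdbc(irisDf, "jdbc:postgresql://localhost/db", "films2", mode = "overwrite", user = "user", password = "12345")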

@felixcheung (Member, Author)

@shivaram @sun-rui

@SparkQA commented Dec 26, 2015

Test build #48335 has finished for PR 10480 at commit de635b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor:

This is not needed?

Member Author:

As explained above, this is needed to properly handle java.util.Properties. It is useful for two things:

  1. R code could set or get values from java.util.Properties directly
  2. For callJMethod to match parameter types properly

As noted above, we could instead have a Scala helper that takes in all the parameters; in that case it would be better to have all the logic in Scala, and it would perhaps be easier to test.
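
A small hedged example of point 1 (illustrative only, reusing a Properties jobj named props such as the one sketched in the PR description): once Properties round-trips as a jobj, R can read values back from it directly.

SparkR:::callJMethod(props, "getProperty", "user")   # returns "user"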

Member Author:

My preference is to do more in R. If you feel strongly about having a helper in Scala instead of handling Properties, then we could move most of the code into a Scala helper.

Contributor:

I got it; java.util.Properties implements the Map interface.

Contributor:

Yeah, it still feels awkward to special-case the Properties object like this.

@felixcheung Do you have an idea of what part of the code would move to Scala if we want to do it on the Scala side? Typically we do handle such conversions on the Scala side, so that's the main reason I'm asking. Is it just the varargsToJProperties function?

Member Author:

@shivaram as you can see, we are calling 3 different overloads of read().jdbc() in Scala (4 if counting write().jdbc()). I think there are 4 possible approaches to handling read().jdbc():

  1. Have 3 JVM helper functions
  2. Have 1 helper function and figure out on the JVM side which overload to route to
  3. Have 1 helper function and put the parameter processing (e.g. checking numPartitions/defaultParallelism) and overload checks all within the JVM, leaving R as a thin shim
  4. Serialize Properties as a jobj and work with it on the R side

I feel #4 gives us the least overhead (less code) and more flexibility, since logic such as the default value for numPartitions exists only on the R/Python side and not on the Scala side (see the sketch below).
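
To illustrate that flexibility with a hedged sketch (names are illustrative, not the exact patch), the R-side defaulting that #4 keeps out of Scala looks roughly like:

# fall back to the scheduler's default parallelism when numPartitions is not supplied
if (is.null(numPartitions)) {
  sc <- SparkR:::callJMethod(sqlContext, "sparkContext")
  numPartitions <- SparkR:::callJMethod(sc, "defaultParallelism")
}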

Contributor:

I think option 2 is also acceptable to me, besides option 4.

Contributor:

Personally I don't think that special-casing the Properties object here is a major problem -- java.util.Properties is a very commonly used class, and it would make sense for the RPC layer of SparkR to handle Properties alongside other common types like Map and String. But it makes sense to defer to Shivaram on this point. I would vote for option (2) above.

Note that, as far as I can see, the code here to pass a Properties object back to R is only triggered by the test cases in this PR. The actual code for invoking read.jdbc() only writes to Properties objects.

Contributor:

Thanks @frreiss. I agree with adding SerDe support for Properties.

Contributor:

Thanks @felixcheung for summarizing the options. I was trying to judge how frequently we use java.util.Properties in the Spark DataFrame codebase, and it looks like JDBC support is the only use case. That said, if having support in SerDe makes the integration much easier, I think we can go with this route. As @frreiss said, java.util.Properties is a pretty common data structure, so this could be useful in the future.

Overall I think the current option is fine by me

@sun-rui (Contributor) commented Dec 28, 2015

For testing JDBC, could we add a helper function on the Scala side that reuses code in JDBCSuite to start an in-memory JDBC server?
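
If such a helper were available, an automated R test could then look roughly like this (hedged sketch: the H2 URL, table name, and credentials are placeholders, and the H2 jar would need to be on the driver classpath):

df <- read.jdbc(sqlContext, "jdbc:h2:mem:testdb;DB_CLOSE_DELAY=-1", "TEST.PEOPLE", user = "testUser", password = "testPass")
expect_true(count(df) > 0)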

Contributor:

State that the predicates parameter is mutually exclusive with partitionColumn/lowerBound/upperBound/numPartitions.

Member Author:

It's in line 564 above

Contributor:

OK

asfgit pushed a commit that referenced this pull request Jan 5, 2016
rxin davies shivaram
Took the save mode from my PR #10480, and moved everything to writer methods. This is related to PR #10559.

- [x] it seems jsonRDD() is broken, need to investigate - this is not a public API though; will look into it some more tonight. (fixed)

Author: felixcheung <[email protected]>

Closes #10584 from felixcheung/rremovedeprecated.
@shivaram (Contributor) commented Jan 9, 2016

@sun-rui Are there any more comments on this PR?
@felixcheung Could you bring this up to date with master?

@felixcheung (Member, Author)

Rebased and updated, thanks.

@SparkQA commented Jan 11, 2016

Test build #49082 has finished for PR 10480 at commit 991a9b7.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member, Author)

spark-mllib: found 0 potential binary incompatibilities (filtered 10)
sbt.ResolveException: unresolved dependency: org.eclipse.paho#org.eclipse.paho.client.mqttv3;1.0.1: not found
[error] (streaming-mqtt/*:mimaPreviousClassfiles) sbt.ResolveException: unresolved dependency: org.eclipse.paho#org.eclipse.paho.client.mqttv3;1.0.1: not found

seems to be SPARK-4628

@SparkQA commented Jan 11, 2016

Test build #49087 has finished for PR 10480 at commit 991a9b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 16, 2016

Test build #49514 has finished for PR 10480 at commit 8c64ac7.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 16, 2016

Test build #49523 has finished for PR 10480 at commit 7cb5121.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member, Author)

@shivaram this is ready, thanks!

@felixcheung force-pushed the rreadjdbc branch 2 times, most recently from 5471b14 to fccc761 on January 20, 2016 00:51
@felixcheung (Member, Author)

jenkins, retest this please

@SparkQA commented Jan 20, 2016

Test build #49739 has finished for PR 10480 at commit fccc761.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor:

is "\cr" intended for Roxygen format?

Member Author:

Yes, it forces a new line in the generated doc; otherwise roxygen2 removes the new line.
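
A minimal illustration (placeholder text, not the actual documentation in this patch):

#' Additional JDBC connection properties can be set as named arguments,\cr
#' e.g. user = "username", password = "password".
# Without the trailing \cr on the first line, roxygen2 would reflow both lines
# into a single paragraph line in the generated Rd documentation.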

@sun-rui (Contributor) commented Jan 20, 2016

LGTM

@felixcheung (Member, Author)

@shivaram any suggestion on how to proceed?

@felixcheung (Member, Author)

@shivaram could you please check on this question when you have a chance?
Users are running into issues with the jdbc source in R, and we've discovered there is no simple workaround.

@shivaram (Contributor)

Sorry for the delay @felixcheung -- I'll get back on this today

@felixcheung (Member, Author)

@shivaram could you please check on this question when you have a chance?
Users are running into issues with the jdbc source in R, and we've discovered there is no simple workaround. I think it would be great if we could get this in before the Spark 2.0 code freeze in a week or two.

@shivaram (Contributor)

@felixcheung Could you bring this PR up to date? I think the code changes look fine to me, and we can merge after this goes through Jenkins.

@SparkQA commented Apr 19, 2016

Test build #56244 has finished for PR 10480 at commit 26cd5f1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member, Author)

hmm

[error] (docker-integration-tests/test:test) sbt.TestsFailedException: Tests unsuccessful
[error] (streaming/test:test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 6116 s, completed Apr 19, 2016 1:03:29 PM

@shivaram (Contributor)

Jenkins, retest this please

@shivaram (Contributor)

Let's give it one more shot, I guess.

@SparkQA commented Apr 19, 2016

Test build #56265 has finished for PR 10480 at commit 26cd5f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram (Contributor)

Merging this to master
