
Conversation

@felixcheung (Member)

Add R API for read.jdbc, write.jdbc.

Tested this quite a bit manually with different combinations of parameters. It's not clear whether we could have automated tests in R for this, since the Scala JDBCSuite depends on the Java H2 in-memory database.

Refactored some code into utils so it could be tested.

Core's R SerDe code needs to be updated to allow access to java.util.Properties as a jobj handle, which is required by the DataFrameReader/Writer jdbc methods. It would also be possible, though more code, to add a helper function to sql/r/SQLUtils instead.
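As a rough illustration of what this enables (a sketch only, using SparkR's internal JVM-call helpers; not the exact code in this patch), the R side can build a java.util.Properties jobj and hand it to DataFrameReader's jdbc overloads directly:

# sketch: construct a Properties jobj and pass it to DataFrameReader.jdbc
props <- SparkR:::newJObject("java.util.Properties")
invisible(SparkR:::callJMethod(props, "setProperty", "user", "user"))
invisible(SparkR:::callJMethod(props, "setProperty", "password", "12345"))
reader <- SparkR:::callJMethod(sqlContext, "read")
sdf <- SparkR:::callJMethod(reader, "jdbc", "jdbc:postgresql://localhost/db", "films2", props)
df <- SparkR:::dataFrame(sdf)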

Tested:

# with postgresql
../bin/sparkR --driver-class-path /usr/share/java/postgresql-9.4.1207.jre7.jar

# read.jdbc
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", user = "user", password = "12345")
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", user = "user", password = 12345)

# partitionColumn and numPartitions test
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", partitionColumn = "did", lowerBound = 0, upperBound = 200, numPartitions = 4, user = "user", password = 12345)
a <- SparkR:::toRDD(df)
SparkR:::getNumPartitions(a)
[1] 4
SparkR:::collectPartition(a, 2L)

# defaultParallelism test
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", partitionColumn = "did", lowerBound = 0, upperBound = 200, user = "user", password = 12345)
a <- SparkR:::toRDD(df)
SparkR:::getNumPartitions(a)
[1] 2

# predicates test
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", predicates = list("did<=105"), user = "user", password = 12345)
count(df) == 1

# write.jdbc, default save mode "error"
irisDf <- as.DataFrame(sqlContext, iris)
write.jdbc(irisDf, "jdbc:postgresql://localhost/db", "films2", user = "user", password = "12345")
"error, already exists"

write.jdbc(irisDf, "jdbc:postgresql://localhost/db", "iris", user = "user", password = "12345")
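
# For comparison, a sketch of overriding the default save mode (the mode value
# here is assumed to follow Scala's SaveMode names; not part of the runs above):
write.jdbc(irisDf, "jdbc:postgresql://localhost/db", "films2", mode = "overwrite", user = "user", password = "12345")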

@felixcheung (Member, Author)

@shivaram @sun-rui

@SparkQA commented Dec 26, 2015

Test build #48335 has finished for PR 10480 at commit de635b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor:

This is not needed?

Member Author:

As explained above, this is needed to properly handle java.util.Properties. It is useful for two things:

  1. R code could set or get values from java.util.Properties directly
  2. For callJMethod to match parameter types properly

As noted above, we could instead have a Scala helper that takes in all the parameters; in that case it would be better to have all the logic in Scala, and it would perhaps be easier to test.
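
A small hedged example of point 1 (illustrative only, reusing a Properties jobj named props such as the one sketched in the PR description): once Properties round-trips as a jobj, R can read values back from it directly.

SparkR:::callJMethod(props, "getProperty", "user")   # returns "user"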

Member Author:

My preference is to do more in R. If you feel strongly about having a helper in Scala instead of handling Properties, then we could move most of the code into a Scala helper.

Contributor:

I got it; java.util.Properties implements the Map interface.

Contributor:

Yeah, it still feels awkward to special-case the Properties object like this.

@felixcheung Do you have an idea of what part of the code would move to Scala if we want to do it on the Scala side? Typically we do handle such conversions on the Scala side, so that's the main reason I'm asking. Is it just the varargsToJProperties function?

Member Author:

@shivaram as you can see, we are calling 3 different overloads of read().jdbc() in Scala (4 if counting write().jdbc()). I think there are 4 possible approaches to handling read().jdbc():

  1. Have 3 JVM helper functions
  2. Have 1 helper function and figure out on the JVM side which overload to route to
  3. Have 1 helper function and put the parameter processing (e.g. checking numPartitions/defaultParallelism) and overload checks all within the JVM, leaving R as a thin shim
  4. Serialize Properties as a jobj and work with it on the R side

I feel #4 gives us the least overhead (less code) and more flexibility, since logic such as the default value for numPartitions exists only on the R/Python side and not on the Scala side (see the sketch below).
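
To illustrate that flexibility with a hedged sketch (names are illustrative, not the exact patch), the R-side defaulting that #4 keeps out of Scala looks roughly like:

# fall back to the scheduler's default parallelism when numPartitions is not supplied
if (is.null(numPartitions)) {
  sc <- SparkR:::callJMethod(sqlContext, "sparkContext")
  numPartitions <- SparkR:::callJMethod(sc, "defaultParallelism")
}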

Contributor:

I think option 2 is also acceptable to me, besides option 4.

Contributor:

Personally I don't think that special-casing the Properties object here is a major problem -- java.util.Properties is a very commonly used class, and it would make sense for the RPC layer of SparkR to handle Properties alongside other common types like Map and String. But it makes sense to defer to Shivaram on this point. I would vote for option (2) above.

Note that, as far as I can see, the code here to pass a Properties object back to R is only triggered by the test cases in this PR. The actual code for invoking read.jdbc() only writes to Properties objects.

Contributor:

Thanks @frreiss. I agree with adding SerDe support for Properties.

Contributor:

Thanks @felixcheung for summarizing the options. I was trying to judge how frequently we use java.util.Properties in the Spark DataFrame codebase, and it looks like JDBC support is the only use case. That said, if having support in SerDe makes the integration much easier, I think we can go with this route. As @frreiss said, java.util.Properties is a pretty common data structure, so this could be useful in the future.

Overall I think the current option is fine by me

@sun-rui (Contributor) commented Dec 28, 2015

For testing JDBC, could we add a helper function on the Scala side that reuses code in JDBCSuite to start an in-memory JDBC server?
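
If such a helper were available, an automated R test could then look roughly like this (hedged sketch: the H2 URL, table name, and credentials are placeholders, and the H2 jar would need to be on the driver classpath):

df <- read.jdbc(sqlContext, "jdbc:h2:mem:testdb;DB_CLOSE_DELAY=-1", "TEST.PEOPLE", user = "testUser", password = "testPass")
expect_true(count(df) > 0)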

Contributor:

State that the predicates parameter is mutually exclusive with partitionColumn/lowerBound/upperBound/numPartitions.

Member Author:

It's in line 564 above

Contributor:

OK

asfgit pushed a commit that referenced this pull request Jan 5, 2016
rxin davies shivaram
Took the save mode from my PR #10480, and moved everything to writer methods. This is related to PR #10559.

- [x] it seems jsonRDD() is broken, need to investigate - this is not a public API though; will look into it some more tonight. (fixed)

Author: felixcheung <[email protected]>

Closes #10584 from felixcheung/rremovedeprecated.
@shivaram (Contributor) commented Jan 9, 2016

@sun-rui Are there any more comments on this PR?
@felixcheung Could you bring this up to date with master?

@felixcheung (Member, Author)

Rebased and updated, thanks.

@SparkQA commented Jan 11, 2016

Test build #49082 has finished for PR 10480 at commit 991a9b7.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member, Author)

spark-mllib: found 0 potential binary incompatibilities (filtered 10)
sbt.ResolveException: unresolved dependency: org.eclipse.paho#org.eclipse.paho.client.mqttv3;1.0.1: not found
[error] (streaming-mqtt/*:mimaPreviousClassfiles) sbt.ResolveException: unresolved dependency: org.eclipse.paho#org.eclipse.paho.client.mqttv3;1.0.1: not found

seems to be SPARK-4628

@SparkQA commented Jan 11, 2016

Test build #49087 has finished for PR 10480 at commit 991a9b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 16, 2016

Test build #49514 has finished for PR 10480 at commit 8c64ac7.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 16, 2016

Test build #49523 has finished for PR 10480 at commit 7cb5121.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member, Author)

@shivaram this is ready, thanks!

@felixcheung force-pushed the rreadjdbc branch 2 times, most recently from 5471b14 to fccc761 on January 20, 2016 00:51
@felixcheung (Member, Author)

jenkins, retest this please

@SparkQA commented Jan 20, 2016

Test build #49739 has finished for PR 10480 at commit fccc761.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor:

is "\cr" intended for Roxygen format?

Member Author:

Yes, it forces a new line in the generated doc; otherwise roxygen2 removes the new line.
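
A minimal illustration (placeholder text, not the actual documentation in this patch):

#' Additional JDBC connection properties can be set as named arguments,\cr
#' e.g. user = "username", password = "password".
# Without the trailing \cr on the first line, roxygen2 would reflow both lines
# into a single paragraph line in the generated Rd documentation.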

@sun-rui (Contributor) commented Jan 20, 2016

LGTM

@felixcheung (Member, Author)

@shivaram any suggestion on how to proceed?

@felixcheung (Member, Author)

@shivaram could you please check on this question when you have a chance?
Users are running into issues with the jdbc source in R, and we've discovered there is no simple workaround.

@shivaram (Contributor)

Sorry for the delay @felixcheung -- I'll get back on this today

@felixcheung (Member, Author)

@shivaram could you please check on this question when you have a chance?
Users are running into issues with the jdbc source in R, and we've discovered there is no simple workaround. I think it would be great if we could get this in before the Spark 2.0 code freeze in a week or two.

@shivaram (Contributor)

@felixcheung Could you bring this PR up to date? I think the code changes look fine to me, and we can merge after this goes through Jenkins.

@SparkQA commented Apr 19, 2016

Test build #56244 has finished for PR 10480 at commit 26cd5f1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member, Author)

hmm

[error] (docker-integration-tests/test:test) sbt.TestsFailedException: Tests unsuccessful
[error] (streaming/test:test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 6116 s, completed Apr 19, 2016 1:03:29 PM

@shivaram (Contributor)

Jenkins, retest this please

@shivaram (Contributor)

Let's give it one more shot, I guess.

@SparkQA commented Apr 19, 2016

Test build #56265 has finished for PR 10480 at commit 26cd5f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram (Contributor)

Merging this to master
