[SPARK-8840][SparkR] Add float coercion on SparkR #7280
Changes from all commits: 0dcc992, 8db3244, 6f9159d, 52b5294, 30c2a40, 733015a, dbf0c1b, c86dc0e
```r
@@ -108,6 +108,32 @@ test_that("create DataFrame from RDD", {
  expect_equal(count(df), 10)
  expect_equal(columns(df), c("a", "b"))
  expect_equal(dtypes(df), list(c("a", "int"), c("b", "string")))

  df <- jsonFile(sqlContext, jsonPathNa)
  hiveCtx <- tryCatch({
    newJObject("org.apache.spark.sql.hive.test.TestHiveContext", ssc)
  }, error = function(err) {
    skip("Hive is not built with SparkSQL, skipped")
  })
  sql(hiveCtx, "CREATE TABLE people (name string, age double, height float)")
  insertInto(df, "people")
  expect_equal(sql(hiveCtx, "SELECT age from people WHERE name = 'Bob'"), c(16))
  expect_equal(sql(hiveCtx, "SELECT height from people WHERE name = 'Bob'"), c(176.5))

  schema <- structType(structField("name", "string"), structField("age", "integer"),
                       structField("height", "float"))
  df2 <- createDataFrame(sqlContext, df.toRDD, schema)
  expect_equal(columns(df2), c("name", "age", "height"))
  expect_equal(dtypes(df2), list(c("name", "string"), c("age", "int"), c("height", "float")))
  expect_equal(collect(where(df2, df2$name == "Bob")), c("Bob", 16, 176.5))

  localDF <- data.frame(name = c("John", "Smith", "Sarah"),
                        age = c(19, 23, 18),
                        height = c(164.10, 181.4, 173.7))
  df <- createDataFrame(sqlContext, localDF, schema)
```
**Contributor**
Thanks for adding this, but I am not sure this is testing the same bug that was reported? After constructing the DF, if I do
**Member (Author)**
I checked this. The column is still
Although I solved that in #7311, I just found that with a user-defined schema it is possible to cause a problem when collecting data from the DataFrame. That is because we serialize
@shivaram What do you think? Do we need to fix #7311? Or do you think it is up to users to define a correct schema?
**Contributor**
@davies, is there any reason to allow users to pass a schema to createDataFrame(), given that we can infer the types (R objects have runtime type information)? Even if a user-specified schema is needed in some cases, I think only those DataTypes that map to native R types should be supported; for long and float it is not natural. For external sources that have float types, which are loaded as java.lang.Float on the JVM side, we can support converting them to double on the R side.
**Member (Author)**
If the value is loaded on the JVM side, I think there is no problem. We already have serialization/deserialization for values between R and Java.
**Contributor**
I think the main reason for supporting a user-defined schema was to support column names that differ from the ones in the local R data frame. We could of course switch to only picking up the names from the given schema rather than the types -- but I also think specifying a schema is an advanced option, so expecting users to make it match their data types is fine. As a follow-up, we could file a new JIRA issue to warn or print an error if we find that the specified schema doesn't match the types of the values being serialized.
**Member (Author)**
OK, that is good. As #7311 is merged now, I should update this test case, or it will fail due to this issue.
**Contributor**
For a user-specified schema in createDataFrame, my point is that we may not want to support DataTypes such as byte, long, and float, which are not natural to R users. Alternatively, for API parity with Scala, we could support these types but internally convert them to natural R types, like:
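One way to read the suggestion above is a coercion table applied on the JVM side before values are shipped to R. The sketch below is purely illustrative of that idea, not SparkR's actual implementation; the class and method names are hypothetical. R's only numeric types are 32-bit integer and double, so narrow integral types widen to int, while long and float widen to double.

```java
// Hypothetical sketch of the "convert to R-natural types" idea discussed above.
// The name coerceToRNatural is illustrative, not SparkR's real API.
public class RTypeCoercion {
    // R has only integer (32-bit) and double numerics, so byte/short
    // widen to Integer, and long/float widen to Double.
    static Object coerceToRNatural(Object value) {
        if (value instanceof Byte || value instanceof Short) {
            return ((Number) value).intValue();      // byte/short -> R integer
        } else if (value instanceof Long || value instanceof Float) {
            return ((Number) value).doubleValue();   // long/float -> R double
        }
        return value;                                // int, double, string, ... pass through
    }

    public static void main(String[] args) {
        System.out.println(coerceToRNatural((byte) 7));  // prints 7
        System.out.println(coerceToRNatural(176.5f));    // prints 176.5
        System.out.println(coerceToRNatural(16L));       // prints 16.0
    }
}
```

Note that widening long to double can lose precision above 2^53, which is one reason the thread treats long as "not natural" for R rather than silently convertible.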
```r
  expect_is(df, "DataFrame")
  expect_equal(count(df), 3)
  expect_equal(columns(df), c("name", "age", "height"))
  expect_equal(dtypes(df), list(c("name", "string"), c("age", "int"), c("height", "float")))
  expect_equal(collect(where(df, df$name == "John")), c("John", 19, 164.10))
})

test_that("convert NAs to null type in DataFrames", {
```
**Contributor**
Could you add a test for creating a DataFrame with `float` type? It may crash now.

**Member (Author)**
OK, I added it.
It is OK to create a DataFrame with `float` type, and inserting data from an RDD into that DataFrame is no problem either. But if you try to insert local data from R into the DataFrame, it will crash, because we serialize `double` in R to `Double` in the JVM.
**Contributor**
That's the thing I worry about: creating a DataFrame from local data is the most important use case right now. I think we should either not support FloatType or make it really work.
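The crash described above comes down to a boxed-type mismatch on the JVM side: a value deserialized from an R double arrives as `java.lang.Double`, while a column declared `float` expects `java.lang.Float`, and boxed types cannot be cast to one another. A minimal standalone Java illustration of the failure and of the explicit coercion a fix would need (this is a sketch of the mechanism, not Spark's code):

```java
public class FloatCoercionDemo {
    public static void main(String[] args) {
        Object fromR = Double.valueOf(176.5);  // R doubles arrive boxed as java.lang.Double

        // A direct cast to Float fails: boxed numeric types don't convert to each other.
        try {
            Float bad = (Float) fromR;
            System.out.println(bad);
        } catch (ClassCastException e) {
            System.out.println("ClassCastException: Double is not a Float");
        }

        // Explicit numeric coercion is required; this is what a float-coercion
        // step for FloatType columns would do before handing the value to Spark.
        Float good = ((Number) fromR).floatValue();
        System.out.println(good);  // prints 176.5
    }
}
```

The same pattern applies in the other direction: a `java.lang.Float` coming out of an external source must be widened to double before it can be used as an R numeric.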