[SPARK-8840][SparkR] Add float coercion on SparkR #7280
Conversation
Thanks @viirya for the PR. Did you check if this fixes the bug reported in the JIRA? Also it might be cool if we can add a test case for this.
@shivaram I will check it later. For the test case, I don't see one for other types. Where do you suggest I add the test case? Directly in deserialize.R?
Test build #36750 has finished for PR 7280 at commit
@viirya The nice thing would be to add a test case based on, say, a JSON or Parquet input file. We can check the file into https://github.com/apache/spark/tree/master/R/pkg/inst/test_support and use it in a test case in https://github.com/apache/spark/blob/master/R/pkg/inst/tests/test_sparkSQL.R
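For context, a hedged Scala sketch (not from this PR) of one way such a fixture could be produced: a tiny Parquet file whose schema genuinely contains FloatType, which a SparkR test can then read and collect. The file path and values are illustrative; the column names mirror the schema from the original bug report.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{FloatType, StructField, StructType}

def writeFloatFixture(sqlContext: SQLContext): Unit = {
  // Schema with real 32-bit float columns, as in the JIRA report.
  val schema = StructType(Seq(
    StructField("offset", FloatType, nullable = false),
    StructField("percentage", FloatType, nullable = false)))
  val rows: RDD[Row] = sqlContext.sparkContext.parallelize(Seq(Row(0.1f, 23.4f)))
  // Write a single-row file small enough to check into test_support/.
  sqlContext.createDataFrame(rows, schema)
    .write.parquet("R/pkg/inst/test_support/float_fixture.parquet")
}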
Test build #36777 has finished for PR 7280 at commit
retest this please.
Test build #36791 has finished for PR 7280 at commit
R/pkg/inst/tests/test_sparkSQL.R
Thanks for adding this, but I am not sure this is testing the same bug reported? After constructing the DF, if I do show(df) then I see the column as double, while in the original bug report the columns were marked as float:
show(result)
DataFrame[offset:float, percentage:float]
I checked this. The column is still double due to another problem I just submitted in #7311: in createDataFrame, the given schema is overwritten.
Although I solved that in #7311, I just found that with a user-defined schema it is possible to cause a problem when collecting data from the DataFrame.
That is because we serialize a double in R to a Double in Java. If we define a column as float in R and create a DataFrame based on that schema, the serialized and deserialized Double will be stored in the float column. Then when we collect the data from it, it will throw an error.
@shivaram What do you think? Do we need to fix #7311? Or do you think it is up to users to define a correct schema?
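A minimal, self-contained Scala sketch (not from this PR) of the mismatch being described: the value arriving from R is a boxed java.lang.Double, so treating it as the java.lang.Float a FloatType column is declared to hold fails at runtime.

object FloatMismatchDemo {
  def main(args: Array[String]): Unit = {
    // What SerDe produces for an R numeric: a boxed java.lang.Double.
    val fromR: Any = java.lang.Double.valueOf(3.14)
    try {
      // What a FloatType column expects to hold.
      val f = fromR.asInstanceOf[java.lang.Float]
      println(f)
    } catch {
      case e: ClassCastException =>
        println(s"collect-time failure as described: $e")
    }
  }
}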
@davies, is there any reason to allow users to pass in a schema to createDataFrame(), since we can infer types (R objects have runtime type information)? Even if a user-specified schema is needed in some cases, I think only those DataTypes that map to native R types should be supported; for long and float, it is not natural to support them.
For external sources that have float types, which will be loaded as java.lang.Float on the JVM side, we can support transferring them to double type on the R side.
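A hedged sketch of that direction of the transfer: a JVM java.lang.Float, which R cannot represent, is widened and written on the wire as a double. The helper name is made up for illustration; the PR's actual handling lives in SerDe and may differ in detail.

import java.io.{ByteArrayOutputStream, DataOutputStream}

object FloatOverWireSketch {
  // Hypothetical helper: widen JVM floats to doubles before shipping to R,
  // since R's numeric is a C double and R has no 32-bit float type.
  def writeNumericForR(dos: DataOutputStream, value: Any): Unit = value match {
    case f: java.lang.Float  => dos.writeDouble(f.doubleValue())
    case d: java.lang.Double => dos.writeDouble(d.doubleValue())
    case other => throw new IllegalArgumentException(s"unsupported value: $other")
  }

  def main(args: Array[String]): Unit = {
    val bos = new ByteArrayOutputStream()
    writeNumericForR(new DataOutputStream(bos), java.lang.Float.valueOf(1.5f))
    println(s"wrote ${bos.size} bytes") // 8: one IEEE-754 double on the wire
  }
}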
If the data is loaded on the JVM side, I think there is no problem. We already have serialization/deserialization for values going between R and Java.
I think the main reason for supporting a user-defined schema was to support column names that are different from the ones given in the local R data frame. We could of course switch to only picking up names from the given schema rather than the types -- but I also think specifying a schema is an advanced option, so expecting users to get it to match their data types is fine.
As a follow-up, we could file a new JIRA issue to warn or print an error if we find that the specified schema doesn't match the types of the values being serialized.
OK, that sounds good. As #7311 is merged now, I should update this test case or it will fail due to that issue.
For a user-specified schema for createDataFrame, my point is that we may not support some DataTypes like byte, long, and float, which are not natural to R users. Or alternatively, from the viewpoint of API parity with Scala, we could support these types but internally convert them to R-natural types, like:
byte -> integer
long -> double
float -> double
and print a warning message about the conversion.
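A hedged Scala sketch of the alternative described above: map non-native schema types to R-natural ones and warn. The function name and message wording are illustrative only, not SparkR code.

object RNaturalTypeSketch {
  // Proposed downcast mapping for schema types R cannot represent natively.
  def toRNaturalType(typeName: String): String = typeName match {
    case "byte"  => warn(typeName, "integer"); "integer"
    case "long"  => warn(typeName, "double");  "double"
    case "float" => warn(typeName, "double");  "double"
    case other   => other // already R-natural: integer, double, string, ...
  }

  private def warn(from: String, to: String): Unit =
    Console.err.println(s"warning: column type $from is not native to R; using $to")

  def main(args: Array[String]): Unit =
    Seq("byte", "long", "float", "string").foreach(t => println(toRNaturalType(t)))
}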
Test build #36881 has finished for PR 7280 at commit
Jenkins, retest this please
Test build #36884 has finished for PR 7280 at commit
Test build #36950 has finished for PR 7280 at commit
Could you add a test for creating a DataFrame with float type? It may crash now.
OK, I added it.
It is OK to create a DataFrame with float type, and inserting data from an RDD into the DataFrame is no problem either. But if you want to insert local data from R into the DataFrame, it will crash, because we serialize a double in R to a Double in the JVM.
That's the thing I worry about: creating a DataFrame from local data is the most important use case right now. I think we should either not support FloatType or make it really work.
I left a comment above.
Test build #37008 has finished for PR 7280 at commit
@davies Is this good to merge?
@shivaram I still have some concerns about it. We should support getting FloatType back as a Double, but not support createDataFrame from FloatType (or we should do the casting from Double to Float in the JVM).
OK, I see the problem -- I guess there are two solutions.
I don't mind either of them (the first might be simpler / cheaper to implement) as I don't think using
Either of them sounds good to me too.
Test build #37199 has finished for PR 7280 at commit
It's better to use DataType.
Currently, SerDe is clean and doesn't include any imports from sql, and I think it shouldn't, because it is in core. So I just use a string here.
@shivaram, one problem is we don't have val dataType = readObjectType(dis) in bytesToRow.
Hmm - do we need it though? In bytesToRow you can just do the conversion if the type name is float? Something like:
val obj = SerDe.readObject(dis)
if (schemaTypeName == "Float") obj.asInstanceOf[Double].floatValue() else obj
Is it guaranteed that obj is always a Double if schemaTypeName is Float here?
Yeah, I think it's a fair assumption that the SerDe will return a Double in case the schema type is a float. If it's not a Double, it means something went wrong somewhere down the line. If we want to be really careful, we could add a check with isInstanceOf[Double] and throw an exception saying Unexpected type: expected Double, got <>.
BTW, the reason I'm trying to move this out of SerDe is that the readObject code path is used by everything else, while the float/double issue only comes up in the case where we create a DataFrame from a local R object.
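A hedged sketch of what that could look like, assuming SerDe.readInt and SerDe.readObject are visible (they are private[spark], so this would sit inside Spark's own source tree); the merged change may differ in detail:

import java.io.{ByteArrayInputStream, DataInputStream}
import org.apache.spark.api.r.SerDe
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{FloatType, StructType}

def bytesToRow(bytes: Array[Byte], schema: StructType): Row = {
  val dis = new DataInputStream(new ByteArrayInputStream(bytes))
  val num = SerDe.readInt(dis) // number of columns serialized by the R side
  Row.fromSeq((0 until num).map { i =>
    val obj = SerDe.readObject(dis)
    schema.fields(i).dataType match {
      // R doubles arrive as java.lang.Double; narrow them only for float columns.
      case FloatType => obj.asInstanceOf[Double].floatValue()
      case _ => obj
    }
  })
}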
OK, I think you are right. I will update this later.
LGTM, we can figure out a better way to do the type conversion later.
Test build #37312 has finished for PR 7280 at commit
An unrelated failure.
please retest this.
Jenkins, retest this please
Test build #37362 has finished for PR 7280 at commit
Merging this into master, thanks!
JIRA: https://issues.apache.org/jira/browse/SPARK-8840
Currently the type coercion rules don't include float type. This PR simply adds it.