[SPARK-8840][SparkR] Add float coercion on SparkR #7280
Conversation
Thanks @viirya for the PR. Did you check if this fixes the bug reported in the JIRA? Also it might be cool if we can add a test case for this.
@shivaram I will check it later. For the test case, I don't see one for other types. Where do you suggest I add the test case? Directly in deserialize.R?
Test build #36750 has finished for PR 7280 at commit
@viirya The nice thing would be to add a test case based on, say, a JSON or Parquet input file. We can check the file into https://github.com/apache/spark/tree/master/R/pkg/inst/test_support and use it in a test case in https://github.com/apache/spark/blob/master/R/pkg/inst/tests/test_sparkSQL.R
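For context, a hedged Scala sketch (not from this PR) of one way such a fixture could be produced: a tiny Parquet file whose schema genuinely contains FloatType, which a SparkR test can then read and collect. The file path and values are illustrative; the column names mirror the schema from the original bug report.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{FloatType, StructField, StructType}

def writeFloatFixture(sqlContext: SQLContext): Unit = {
  // Schema with real 32-bit float columns, as in the JIRA report.
  val schema = StructType(Seq(
    StructField("offset", FloatType, nullable = false),
    StructField("percentage", FloatType, nullable = false)))
  val rows: RDD[Row] = sqlContext.sparkContext.parallelize(Seq(Row(0.1f, 23.4f)))
  // Write a single-row file small enough to check into test_support/.
  sqlContext.createDataFrame(rows, schema)
    .write.parquet("R/pkg/inst/test_support/float_fixture.parquet")
}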
Test build #36777 has finished for PR 7280 at commit
retest this please.
Test build #36791 has finished for PR 7280 at commit
R/pkg/inst/tests/test_sparkSQL.R
Thanks for adding this, but I am not sure this is testing the same bug reported? After constructing the DF, if I do show(df) then I see the column as double, while in the original bug report the columns were marked as float:
show(result)
DataFrame[offset:float, percentage:float]
I checked this. The column is still double due to another problem I just submitted in #7311: in createDataFrame, the given schema is overwritten.
Although I solved that in #7311, I just found that with a user-defined schema it is possible to cause a problem when collecting data from the DataFrame.
That is because we serialize a double in R to a Double in Java. If we define a column as float in R and create a DataFrame based on that schema, the serialized and deserialized Double will be stored in the float column. Then when we collect the data from it, it will throw an error.
@shivaram What do you think? Do we need to fix #7311? Or do you think it is up to users to define a correct schema?
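A minimal, self-contained Scala sketch (not from this PR) of the mismatch being described: the value arriving from R is a boxed java.lang.Double, so treating it as the java.lang.Float a FloatType column is declared to hold fails at runtime.

object FloatMismatchDemo {
  def main(args: Array[String]): Unit = {
    // What SerDe produces for an R numeric: a boxed java.lang.Double.
    val fromR: Any = java.lang.Double.valueOf(3.14)
    try {
      // What a FloatType column expects to hold.
      val f = fromR.asInstanceOf[java.lang.Float]
      println(f)
    } catch {
      case e: ClassCastException =>
        println(s"collect-time failure as described: $e")
    }
  }
}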
@davies, is there any reason to allow users to pass in a schema to createDataFrame(), since we can infer types (R objects have runtime type information)? Even if a user-specified schema is needed in some cases, I think only those DataTypes that map to native R types should be supported; for long and float, it is not natural to support them.
For external sources that have float types, which will be loaded as java.lang.Float on the JVM side, we can support transferring them to double type on the R side.
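A hedged sketch of that direction of the transfer: a JVM java.lang.Float, which R cannot represent, is widened and written on the wire as a double. The helper name is made up for illustration; the PR's actual handling lives in SerDe and may differ in detail.

import java.io.{ByteArrayOutputStream, DataOutputStream}

object FloatOverWireSketch {
  // Hypothetical helper: widen JVM floats to doubles before shipping to R,
  // since R's numeric is a C double and R has no 32-bit float type.
  def writeNumericForR(dos: DataOutputStream, value: Any): Unit = value match {
    case f: java.lang.Float  => dos.writeDouble(f.doubleValue())
    case d: java.lang.Double => dos.writeDouble(d.doubleValue())
    case other => throw new IllegalArgumentException(s"unsupported value: $other")
  }

  def main(args: Array[String]): Unit = {
    val bos = new ByteArrayOutputStream()
    writeNumericForR(new DataOutputStream(bos), java.lang.Float.valueOf(1.5f))
    println(s"wrote ${bos.size} bytes") // 8: one IEEE-754 double on the wire
  }
}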
If the data is loaded on the JVM side, I think there is no problem. We already have serialization/deserialization for values going between R and Java.
I think the main reason for supporting a user-defined schema was to support column names that are different from the ones given in the local R data frame. We could of course switch to only picking up names from the given schema rather than the types -- but I also think specifying a schema is an advanced option, so expecting users to get it to match their data types is fine.
As a follow-up, we could file a new JIRA issue to warn or print an error if we find that the specified schema doesn't match the types of the values being serialized.
OK, that sounds good. As #7311 is merged now, I should update this test case or it will fail due to that issue.
For a user-specified schema for createDataFrame, my point is that we may not support some DataTypes like byte, long, and float, which are not natural to R users. Or alternatively, from the viewpoint of API parity with Scala, we could support these types but internally convert them to R-natural types, like:
byte -> integer
long -> double
float -> double
and print a warning message about the conversion.
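A hedged Scala sketch of the alternative described above: map non-native schema types to R-natural ones and warn. The function name and message wording are illustrative only, not SparkR code.

object RNaturalTypeSketch {
  // Proposed downcast mapping for schema types R cannot represent natively.
  def toRNaturalType(typeName: String): String = typeName match {
    case "byte"  => warn(typeName, "integer"); "integer"
    case "long"  => warn(typeName, "double");  "double"
    case "float" => warn(typeName, "double");  "double"
    case other   => other // already R-natural: integer, double, string, ...
  }

  private def warn(from: String, to: String): Unit =
    Console.err.println(s"warning: column type $from is not native to R; using $to")

  def main(args: Array[String]): Unit =
    Seq("byte", "long", "float", "string").foreach(t => println(toRNaturalType(t)))
}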
Test build #36881 has finished for PR 7280 at commit
Jenkins, retest this please
Test build #36884 has finished for PR 7280 at commit
Test build #36950 has finished for PR 7280 at commit
Could you add a test for creating a DataFrame with float type? It may crash now.
OK, I added it.
It is OK to create a DataFrame with float type, and inserting data from an RDD into the DataFrame is no problem either. But if you want to insert local data from R into the DataFrame, it will crash, because we serialize a double in R to a Double in the JVM.
That's the thing I worry about: creating a DataFrame from local data is the most important use case right now. I think we should either not support FloatType or make it really work.
I left a comment above.
Test build #37008 has finished for PR 7280 at commit
@davies Is this good to merge?
@shivaram I still have some concerns about it. We should support getting FloatType back as a Double, but not support createDataFrame from FloatType (or we should do the casting from Double to Float in the JVM).
OK, I see the problem -- I guess there are two solutions.
I don't mind either of them (the first might be simpler / cheaper to implement) as I don't think using
Either of them sounds good to me too.
Test build #37199 has finished for PR 7280 at commit
It's better to use DataType.
Currently, SerDe is clean and doesn't include any imports from sql, and I think it shouldn't, because it is in core. So I just use a string here.
@shivaram, one problem is we don't have val dataType = readObjectType(dis) in bytesToRow.
Hmm - do we need it though? In bytesToRow you can just do the conversion if the type name is float? Something like:
val obj = SerDe.readObject(dis)
if (schemaTypeName == "Float") obj.asInstanceOf[Double].floatValue() else obj
Is it guaranteed that obj is always a Double if schemaTypeName is Float here?
Yeah, I think it's a fair assumption that the SerDe will return a Double in case the schema type is a float. If it's not a Double, it means something went wrong somewhere down the line. If we want to be really careful, we could add a check with isInstanceOf[Double] and throw an exception saying Unexpected type: expected Double, got <>.
BTW, the reason I'm trying to move this out of SerDe is that the readObject code path is used by everything else, while the float/double issue only comes up in the case where we create a DataFrame from a local R object.
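A hedged sketch of what that could look like, assuming SerDe.readInt and SerDe.readObject are visible (they are private[spark], so this would sit inside Spark's own source tree); the merged change may differ in detail:

import java.io.{ByteArrayInputStream, DataInputStream}
import org.apache.spark.api.r.SerDe
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{FloatType, StructType}

def bytesToRow(bytes: Array[Byte], schema: StructType): Row = {
  val dis = new DataInputStream(new ByteArrayInputStream(bytes))
  val num = SerDe.readInt(dis) // number of columns serialized by the R side
  Row.fromSeq((0 until num).map { i =>
    val obj = SerDe.readObject(dis)
    schema.fields(i).dataType match {
      // R doubles arrive as java.lang.Double; narrow them only for float columns.
      case FloatType => obj.asInstanceOf[Double].floatValue()
      case _ => obj
    }
  })
}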
OK, I think you are right. I will update this later.
LGTM, we can figure out a better way to do the type conversion later.
Test build #37312 has finished for PR 7280 at commit
An unrelated failure.
please retest this.
Jenkins, retest this please
Test build #37362 has finished for PR 7280 at commit
Merging this into master, thanks!
JIRA: https://issues.apache.org/jira/browse/SPARK-8840
Currently the type coercion rules don't include float type. This PR simply adds it.