
Conversation

@viirya
Member

@viirya viirya commented Jul 8, 2015

JIRA: https://issues.apache.org/jira/browse/SPARK-8840

Currently the type coercion rules don't include float type. This PR simply adds it.

@shivaram
Contributor

shivaram commented Jul 8, 2015

Thanks @viirya for the PR. Did you check whether this fixes the bug reported in the JIRA? Also, it would be nice if we could add a test case for this.

@viirya
Member Author

viirya commented Jul 8, 2015

@shivaram I will check it later. As for the test case, I don't see one for the other types. Where do you suggest adding it? Directly in deserialize.R?

@SparkQA

SparkQA commented Jul 8, 2015

Test build #36750 has finished for PR 7280 at commit 0dcc992.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

shivaram commented Jul 8, 2015

@viirya The nice thing would be to add a test case based on, say, a JSON or Parquet input file. We can check the file into https://github.com/apache/spark/tree/master/R/pkg/inst/test_support and use it in a test case in https://github.com/apache/spark/blob/master/R/pkg/inst/tests/test_sparkSQL.R

@SparkQA

SparkQA commented Jul 8, 2015

Test build #36777 has finished for PR 7280 at commit 8db3244.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jul 8, 2015

retest this please.

@SparkQA

SparkQA commented Jul 8, 2015

Test build #36791 has finished for PR 7280 at commit 8db3244.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Thanks for adding this, but I am not sure this is testing the same bug that was reported. After constructing the DF, if I do show(df) I see the column as double, while in the original bug report the columns were marked as float:

show(result)
DataFrame[offset:float, percentage:float]

Member Author

I checked this. The column is still double due to another problem I just reported in #7311: in createDataFrame, the given schema gets overwritten.

Although I solved that in #7311, I found that with a user-defined schema it is still possible to hit a problem when collecting data from the DataFrame.

That is because we serialize a double in R to a Double in Java. If we define a column as float in R and create a DataFrame based on that schema, the serialized and deserialized Double will be stored in the float column, and when we collect the data it will throw an error.
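
For illustration, a minimal Scala sketch of the mismatch (the value and names below are made up, not the actual SerDe code):

val deserialized: Any = java.lang.Double.valueOf(0.1)     // an R double always arrives in the JVM as a Double
// a FloatType column expects a Float, so reading this value back as a Float fails:
// deserialized.asInstanceOf[java.lang.Float]             // ClassCastException
val narrowed = deserialized.asInstanceOf[Double].toFloat  // explicit narrowing based on the schema avoids the crash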

@shivaram What do you think? Do we need to fix #7311, or is it up to users to define a correct schema?

Contributor

@davies, is there any reason to let the user pass in a schema for createDataFrame(), given that we can infer the types (R objects have runtime type information)? Even if a user-specified schema is needed in some cases, I think only DataTypes that map to native R types should be supported; types like long and float are not natural to support.

For external sources that have float types, which will be loaded as java.lang.Float on the JVM side, we can support converting them to double type on the R side.
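
For illustration, a rough Scala sketch of that direction (a hypothetical helper, not the actual SerDe code): widen any java.lang.Float coming from the JVM to a Double before shipping it to R, since R only has double-precision numerics.

def toRCompatible(value: Any): Any = value match {
  case f: java.lang.Float => f.doubleValue()  // widen Float -> Double for the R side
  case other              => other            // other values already have an R mapping
}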

Member Author

If it is loaded on the JVM side, I think there is no problem. We already have serialization/deserialization for values between R and Java.

Contributor

I think the main reason for supporting a user-defined schema was to allow column names that are different from the ones given in the local R data frame. We could of course switch to only picking up names from the given schema rather than the types -- but I also think specifying a schema is an advanced option, so expecting users to make it match their data types is fine.

As a follow-up JIRA, we could file a new issue to warn or print an error if we find that the specified schema doesn't match the types of the values being serialized.

Member Author

OK, that sounds good. Since #7311 is merged now, I should update this test case, or it will fail due to that issue.

Contributor

For a user-specified schema in createDataFrame, my point is that we may not want to support DataTypes like byte, long, and float, which are not natural to R users. Alternatively, from the viewpoint of API parity with Scala, we could support these types but internally convert them to R's natural types, like:
byte -> integer
long -> double
float -> double
and print a warning message about the conversion.
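
For illustration, a rough Scala sketch of that alternative (the helper name is hypothetical, not in the codebase):

def toRNativeType(typeName: String): String = typeName match {
  case "byte" =>
    Console.err.println("Warning: byte has no native R type; converting to integer")
    "integer"
  case "long" =>
    Console.err.println("Warning: long has no native R type; converting to double")
    "double"
  case "float" =>
    Console.err.println("Warning: float has no native R type; converting to double")
    "double"
  case other => other  // already a natural R type
}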

@SparkQA

SparkQA commented Jul 9, 2015

Test build #36881 has finished for PR 7280 at commit 6f9159d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

shivaram commented Jul 9, 2015

Jenkins, retest this please

@SparkQA

SparkQA commented Jul 9, 2015

Test build #36884 has finished for PR 7280 at commit 6f9159d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 9, 2015

Test build #36950 has finished for PR 7280 at commit 30c2a40.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

shivaram commented Jul 9, 2015

Cool -- this is a good test case. Thanks @viirya, LGTM. @sun-rui any other comments?

Contributor

Could you add a test for creating a DataFrame with a float type? It may crash now.

Member Author

OK, I added it.

It is OK to create a DataFrame with a float type, and inserting data from an RDD into the DataFrame is no problem either. But if you want to insert local data from R into the DataFrame, it will crash because we serialize a double in R to a Double in the JVM.

Contributor

That's the thing I worry about: creating a DataFrame from local data is the most important use case right now. I think we should either not support FloatType or make it really work.

@sun-rui
Contributor

sun-rui commented Jul 10, 2015

I left a comment above.

@SparkQA

SparkQA commented Jul 10, 2015

Test build #37008 has finished for PR 7280 at commit 733015a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

@davies Is this good to merge?

@davies
Contributor

davies commented Jul 13, 2015

@shivaram I still have some concerns about it. We should support getting FloatType back as Double, but not support createDataFrame from FloatType (or else do the casting from Double to Float in the JVM).

@shivaram
Contributor

OK, I see the problem -- I guess there are two solutions.

  1. In createDataFrame we throw an error if somebody has float in their schema and ask them to use double instead.
  2. We auto-convert double to float based on the schema on the Scala side.

I don't mind either of them (the first might be simpler / cheaper to implement) as I don't think using float from a local data frame is a major use case.
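
For illustration, a minimal Scala sketch of option 1 (a hypothetical helper, not the actual implementation), assuming we can inspect the user-supplied StructType before building the DataFrame:

import org.apache.spark.sql.types.{FloatType, StructType}

def rejectFloatFields(schema: StructType): Unit =
  schema.fields.find(_.dataType == FloatType).foreach { f =>
    throw new IllegalArgumentException(
      s"Column '${f.name}' is declared as float; please use double in the schema instead")
  }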

@davies
Contributor

davies commented Jul 13, 2015

Either of them sounds good to me too.

@viirya
Member Author

viirya commented Jul 14, 2015

@shivaram @davies I updated this to implicitly convert double to float based on schema.

@SparkQA

SparkQA commented Jul 14, 2015

Test build #37199 has finished for PR 7280 at commit dbf0c1b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

It's better to use DataType.

Member Author

Currently, SerDe is clean and doesn't include any imports from sql, and I think it shouldn't, because it is in core. So I just use a string here.

Member Author

@shivaram, one problem is that we don't have val dataType = readObjectType(dis) in bytesToRow.

Contributor

Hmm -- do we need it though? In bytesToRow you can just do the conversion if the type name is float? Something like:

val obj = SerDe.readObject(dis)
if (schemaTypeName == "Float") obj.asInstanceOf[Double].floatValue() else obj

Member Author

Is it guaranteed that obj is always a Double if schemaTypeName is Float here?

Contributor

Yeah, I think it's a fair assumption that the SerDe will return a double when the schema type is a float. If it's not a double, it means something went wrong somewhere down the line. If we want to be really careful, we could add a check with isInstanceOf[Double] and throw an exception saying Unexpected type: Expected Double got <>.

BTW, the reason I'm trying to move this out of SerDe is that the readObject code path is used by everything else, while the float/double issue only comes up when we create a DataFrame from a local R object.
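
Putting those two points together, a hedged Scala sketch of the conversion living outside SerDe (reusing obj, dis, and schemaTypeName from the snippet above; those names are assumptions about the surrounding code):

val obj = SerDe.readObject(dis)
val converted = if (schemaTypeName == "Float") {
  obj match {
    case d: java.lang.Double => d.floatValue()  // narrow the R double to the declared Float column
    case other => throw new IllegalArgumentException(
      s"Unexpected type: expected Double, got ${other.getClass.getName}")
  }
} else {
  obj
}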

Member Author

Ok. I think you are right. I will update this later.

@davies
Contributor

davies commented Jul 14, 2015

LGTM, we can figure out a better way to do the type conversion later.

@SparkQA

SparkQA commented Jul 15, 2015

Test build #37312 has finished for PR 7280 at commit c86dc0e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jul 15, 2015

An unrelated failure.

@viirya
Member Author

viirya commented Jul 15, 2015

please retest this.

@shivaram
Contributor

Jenkins, retest this please

@SparkQA

SparkQA commented Jul 15, 2015

Test build #37362 has finished for PR 7280 at commit c86dc0e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor

davies commented Jul 15, 2015

merging this into master, thanks!

@asfgit asfgit closed this in 6f69025 Jul 15, 2015
@viirya viirya deleted the add_r_float_coercion branch December 27, 2023 18:32