
Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

Currently we convert a Spark DataFrame to a Pandas DataFrame via pd.DataFrame.from_records. It infers the data types from the data and doesn't respect the Spark DataFrame schema. This PR fixes it.
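
For illustration, a minimal sketch of the inference problem (assumes a running SparkSession named spark; the column name and values are made up):

import pandas as pd
from pyspark.sql.types import StructType, IntegerType

schema = StructType().add("a", IntegerType())  # 32-bit in the Spark schema
rows = spark.createDataFrame([(1,), (2,)], schema).collect()

# from_records infers the dtype from the data, ignoring the Spark schema
pd.DataFrame.from_records(rows, columns=["a"]).dtypes  # a    int64, not int32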

How was this patch tested?

a new regression test

@cloud-fan
Contributor Author

cc @ueshin @BryanCutler

@SparkQA

SparkQA commented Jun 21, 2017

Test build #78392 has finished for PR 18378 at commit 8a033fb.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member

ueshin commented Jun 21, 2017

LGTM, pending Jenkins.

@SparkQA

SparkQA commented Jun 21, 2017

Test build #78395 has finished for PR 18378 at commit e352817.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 21, 2017

Test build #78398 has finished for PR 18378 at commit afa74ab.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

This is probably the easiest way to assign the types, but the data is still loaded and inferred first; astype will then cast it, and I'm not sure if that makes another pass over the data or does it lazily. A more ideal way would be to not use from_records, but then I think the data would need to be broken up into columns.

Contributor Author

Converting rows to columnar format is also costly, I think, so if users want performance here they should enable the Arrow optimization :)
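
(For reference, a sketch of opting in; spark.sql.execution.arrow.enabled is the config name the Arrow work shipped with in Spark 2.3+, and df stands for any Spark DataFrame:)

# Transfers data to pandas in columnar Arrow batches instead of row records.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
pdf = df.toPandas()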

Member

I think it might cause problems to have all non-primitive types as object. Things like timestamps will be inferred from a datetime object, for example:

In [10]: pdf = pd.DataFrame.from_records([(1.0, 1, "a", datetime.datetime.now())])

In [11]: pdf.dtypes
Out[11]: 
0           float64
1             int64
2            object
3    datetime64[ns]
dtype: object

In [12]: pdf.astype({0: "float32", 1: "int32", 2: "object", 3: "object"}).dtypes
Out[12]: 
0    float32
1      int32
2     object
3     object
dtype: object

Member

@BryanCutler BryanCutler left a comment

This seems ok for primitive types, but I think it causes problems for other types. I'm also not sure if there will be more performance issues from using from_records together with astype; from_records is already too slow ;)

@ueshin
Member

ueshin commented Jun 21, 2017

How about applying astype only for primitive types?
I guess the problem here is the up-conversion from Byte/Short/IntegerType to int64 and from FloatType to float64.
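
(A sketch of what that could look like; the helper name is hypothetical, but the NumPy targets match the up-conversions described above:)

import numpy as np
from pyspark.sql.types import ByteType, ShortType, IntegerType, FloatType

def _primitive_pandas_type(dt):  # hypothetical helper name
    # Correct only the primitive types pandas is known to up-convert;
    # None means "keep whatever pandas inferred".
    if type(dt) == ByteType:
        return np.int8
    elif type(dt) == ShortType:
        return np.int16
    elif type(dt) == IntegerType:
        return np.int32
    elif type(dt) == FloatType:
        return np.float32
    else:
        return None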

@BryanCutler
Member

How about applying astype only for primitive types?

Yeah, that might work: since astype takes a dict, you probably don't need to specify all the columns. It does seem to make a deep copy of the data being cast, though, so it still might have an impact on performance.
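
(For example, a dict covering only some columns casts just those; a sketch with made-up values. Note the dict form of astype needs pandas >= 0.19, which comes up later in this thread:)

import pandas as pd

pdf = pd.DataFrame.from_records([(1, 1.0, "a"), (2, 2.0, "b")])
# Only columns 0 and 1 are cast; column 2 keeps its inferred object dtype.
pdf = pdf.astype({0: "int32", 1: "float32"})
pdf.dtypes  # 0: int32, 1: float32, 2: object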

Member

Could we check and skip if pandas can't be imported? (NumPy is a Pandas dependency, so checking Pandas alone should be fine.)

_have_pandas = False  # set the default before the try, in case the import fails
try:
    import pandas
    _have_pandas = True
except:
    # No Pandas, but that's okay, we'll skip those tests
    pass
...

    @unittest.skipIf(not _have_pandas, "Pandas not installed")
    def test_to_pandas(self):
        ...

At least I see the doctest is being skipped: >>> df.toPandas() # doctest: +SKIP.

Member

Yeah, the Jenkins worker might not have Pandas installed and it's not a hard dependency for pyspark. To be sure the test gets run, it could be added to dev/run-pip-tests similar to #15821 for now.

@cloud-fan cloud-fan force-pushed the to_pandas branch 2 times, most recently from 31eec1f to 36f9cb6 on June 22, 2017 05:00
@SparkQA

SparkQA commented Jun 22, 2017

Test build #78426 has finished for PR 18378 at commit 36f9cb6.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 22, 2017

Test build #78427 has finished for PR 18378 at commit 36dc5e7.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

It sounds ok to me, except for the missing _have_pandas = False above the try:.

if (pandas_type):
    dtype[field.name] = pandas_type

return pd.DataFrame.from_records(self.collect(), columns=self.columns).astype(dtype)
Member

@viirya viirya Jun 22, 2017

The copy param of astype is True by default. It seems to me we don't need to copy the data? Not copying the data should benefit performance.

@SparkQA

SparkQA commented Jun 22, 2017

Test build #78429 has finished for PR 18378 at commit 1e98c49.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Hm.. actually, this failure looks legitimate. I can reproduce it locally too.

@cloud-fan
Contributor Author

cloud-fan commented Jun 22, 2017

@HyukjinKwon can you give me a hand with this? I can't reproduce it locally... thanks!

@HyukjinKwon
Member

My pleasure. I will give it a shot.

@SparkQA

SparkQA commented Jun 22, 2017

Test build #78432 has finished for PR 18378 at commit dfaa392.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

It looks like astype with a dict was added in 0.19.0 - pandas-dev/pandas@63a1e5c#diff-fb14ed747473b618d0c021fdef7ee85b. Mine was lower than that, and I assume the Jenkins one is the same case too.

@HyukjinKwon
Member

(I will try to find a workaround ...)

@HyukjinKwon
Member

I sent a PR to your branch - cloud-fan#7 @cloud-fan. I will double check as well.

HyukjinKwon and others added 2 commits June 22, 2017 15:58
Work around astype with columns in Pandas < 0.19.0
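
(The idea of the workaround, quoted in full further down: cast column by column, since Series.astype is much older than the dict form of DataFrame.astype. Here dtype stands for the field-name-to-NumPy-type dict built from the schema:)

for name, t in dtype.items():
    pdf[name] = pdf[name].astype(t)
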
@SparkQA

SparkQA commented Jun 22, 2017

Test build #78443 has finished for PR 18378 at commit 357a798.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def _to_corrected_pandas_type(dt):
    """
    When converting Spark SQL records to Pandas DataFrame, the inferred data type may be wrong.
    This method gets the correted data type for Pandas if that type may be inferred uncorrectly.
Member

nit: typo correted.

@viirya
Member

viirya commented Jun 22, 2017

LGTM

@HyukjinKwon
Member

LGTM except for the nit ^.

@cloud-fan
Contributor Author

The last commit just fixes a typo in a comment, and the Python style check passed locally. I'm going to merge this PR to unblock #15821.

@cloud-fan
Contributor Author

merged, thanks for your review!

@asfgit asfgit closed this in 67c7502 Jun 22, 2017
pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)

for f, t in dtype.items():
    pdf[f] = pdf[f].astype(t, copy=False)
Member

@HyukjinKwon HyukjinKwon Jun 22, 2017

Just in case someone blames this in the future, as a little side note: it looks like copy was introduced in 0.11.0 here. So, Pandas 0.10.0 does not work with it (see here).

from pyspark.sql.types import *

schema = StructType().add("a", IntegerType()).add("b", StringType())\
                     .add("c", BooleanType()).add("d", FloatType())
data = [
    (1, "foo", True, 3.0,), (2, "foo", True, 5.0),
    (3, "bar", False, -1.0), (4, "bar", False, 6.0),
]
spark.createDataFrame(data, schema).toPandas().dtypes

Pandas 0.10.0:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 1734, in toPandas
    pdf[f] = pdf[f].astype(t, copy=False)
TypeError: astype() got an unexpected keyword argument 'copy'

However, I guess it is really fine because:

  • 0.10.0 was released in 2012, when Spark was 0.6.x and Java was 6 & 7.

    I guess this is really fine. It was 5 years ago.

  • In 0.10.0, it does work without copy, but the types are not properly set as proposed here:

    spark.createDataFrame(data, schema).toPandas().dtypes
    a      int64  # <- this should be 'int32'
    b     object
    c       bool
    d    float64  # <- this should be 'float32'
    

I am writing this comment only because, to my knowledge, we didn't specify a Pandas version requirement:

'sql': ['pandas']

Contributor Author

Thanks for the investigation! Maybe we should specify a version requirement for Pandas.

@SparkQA

SparkQA commented Jun 22, 2017

Test build #78448 has finished for PR 18378 at commit d8ba545.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member

Looks good, I'll update #15821 with this

robert3005 pushed a commit to palantir/spark that referenced this pull request Jun 29, 2017

Author: hyukjinkwon <[email protected]>
Author: Wenchen Fan <[email protected]>
Author: Wenchen Fan <[email protected]>

Closes apache#18378 from cloud-fan/to_pandas.
    return np.int32
elif type(dt) == FloatType:
    return np.float32
else:
@edlee123 edlee123 Apr 14, 2018

Had a question: in Spark 2.2.1, if I do a .toPandas on a Spark DataFrame with an integer-typed column, the dtype in pandas is int64, whereas in Spark 2.3.0 the ints are converted to int32. I ran the below in Spark 2.2.1 and 2.3.0:

import pyspark.sql.functions as sf

df = spark.sparkContext.parallelize([(i, ) for i in [1, 2, 3]]).toDF(["a"]).select(sf.col('a').cast('int')).toPandas()
df.dtypes

Is this intended? We ran into this because we have unit tests in a project that passed in Spark 2.2.1 but fail in Spark 2.3.0 when we looked into upgrading.

Member

Yup, it was unfortunate, but it was a bug that we should fix. Does that cause an actual break or just a unit test failure?

@edlee123

As far as I can tell, so far it's just some of our unit tests where we assert some expected pandas dataframes. I think float may also be affected... Should I create a ticket in JIRA?

Member

I think the current change is actually more correct. Such changes usually have to be avoided, but there are strong reasons for this one, and I would classify the previous behavior as a bug. I would discourage creating a JIRA unless it breaks a scenario in a way that makes a strong case.

@edlee123

edlee123 commented Apr 15, 2018

Ok, I see. Part of the rationale is performance (from the astype discussion above) and consistency with Arrow: https://arrow.apache.org/docs/python/pandas.html

I guess, without knowing much about the Arrow work, I was expecting it to be consistent with how pandas converts Python types, e.g. in Spark 2.2.

What happens with Double and DateType?

@cloud-fan
Contributor Author

It's pretty natural to convert integer type to int32. Although Spark tries its best to avoid behavior changes, it's allowed to fix some wrong behaviors in new releases, and I believe it's well documented in the Spark 2.3 release notes.

@edlee123
Copy link

I see the rationale now, thank you everyone

@BryanCutler
Member

@edlee123 a Spark DoubleType will produce a float64 dtype in Pandas and FloatType will be float32. DateType will be Python datetime.date objects. Also keep in mind that if you have integer data with null values, then Pandas will treat it as floats and represent the null values as NaNs. In this case, Spark will not change the dtype.
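
(For illustration, a sketch of those mappings; assumes a running SparkSession named spark, with expected dtypes following the behavior described above:)

import datetime
from pyspark.sql.types import StructType, DoubleType, FloatType, DateType, IntegerType

schema = StructType().add("d", DoubleType()).add("f", FloatType())\
                     .add("dt", DateType()).add("i", IntegerType())
data = [(1.0, 2.0, datetime.date(2018, 1, 1), 1),
        (3.0, 4.0, datetime.date(2018, 1, 2), None)]  # null in the integer column

spark.createDataFrame(data, schema).toPandas().dtypes
# d     float64   <- DoubleType
# f     float32   <- FloatType
# dt     object   <- DateType (datetime.date objects)
# i     float64   <- nullable integer data becomes float64 with NaN for null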
