[SPARK-21163][SQL] DataFrame.toPandas should respect the data type #18378
Changes from all commits: 1e98c49, dfaa392, e903cd2, 6702ad1, 357a798, d8ba545
python/pyspark/sql/dataframe.py

```diff
@@ -1721,7 +1721,18 @@ def toPandas(self):
         1    5  Bob
         """
         import pandas as pd
-        return pd.DataFrame.from_records(self.collect(), columns=self.columns)
+
+        dtype = {}
+        for field in self.schema:
+            pandas_type = _to_corrected_pandas_type(field.dataType)
+            if pandas_type is not None:
+                dtype[field.name] = pandas_type
+
+        pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
+
+        for f, t in dtype.items():
+            pdf[f] = pdf[f].astype(t, copy=False)
+        return pdf

 ##########################################################################################
 # Pandas compatibility
```
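The patched `toPandas` first collects corrected dtypes from the Spark schema, then downcasts the collected frame with `astype`. The same pattern can be sketched with plain pandas and no Spark session; the `schema` list below is a hypothetical stand-in for `self.schema`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Spark schema: (column name, corrected numpy
# type or None). None means "keep whatever pandas inferred" (strings, etc.).
schema = [("a", np.int32), ("b", None), ("d", np.float32)]
records = [(1, "foo", 3.0), (2, "bar", 5.0)]

# Mirror of the patch: build the dtype mapping first ...
dtype = {name: t for name, t in schema if t is not None}

# ... materialize the frame (pandas infers int64/float64 here) ...
pdf = pd.DataFrame.from_records(records, columns=[name for name, _ in schema])

# ... then downcast; copy=False avoids an extra buffer where possible.
for f, t in dtype.items():
    pdf[f] = pdf[f].astype(t, copy=False)
```

Without the `astype` pass, column `a` would stay `int64`, which is exactly the mismatch SPARK-21163 fixes.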
```diff
@@ -1750,6 +1761,24 @@ def _to_scala_map(sc, jm):
     return sc._jvm.PythonUtils.toScalaMap(jm)


+def _to_corrected_pandas_type(dt):
+    """
+    When converting Spark SQL records to a Pandas DataFrame, the inferred data type
+    may be wrong. This method gets the corrected data type for Pandas if that type
+    may be inferred incorrectly.
+    """
+    import numpy as np
+    if type(dt) == ByteType:
+        return np.int8
+    elif type(dt) == ShortType:
+        return np.int16
+    elif type(dt) == IntegerType:
+        return np.int32
+    elif type(dt) == FloatType:
+        return np.float32
+    else:
+        return None
+
+
 class DataFrameNaFunctions(object):
     """Functionality for working with missing data in :class:`DataFrame`.
```

Review comment (on the integer downcast):

> Had a question: in Spark 2.2.1, if I do a `.toPandas` on a Spark DataFrame with an integer column, the dtype in pandas is int64, whereas in Spark 2.3.0 the ints are converted to int32. I ran the below in Spark 2.2.1 and 2.3.0: Is this intended? We ran into this because we have unit tests in a project that pass in Spark 2.2.1 but fail in Spark 2.3.0 when we looked into upgrading.

Member:

> Yup, it was unfortunate, but it was a bug that we should fix. Does that cause an actual break, or simply a unit test failure?

Reply:

> As far as I can see, so far just some of our unit tests where we assert some expected pandas dataframes. I think float is also affected... Should I create a ticket in JIRA?

Member:

> I think the current change is actually more correct. Such changes usually have to be avoided, but there are strong reasons for it, and I would classify this case as a bug. I would discourage creating a JIRA unless it breaks a scenario in a way that makes strong sense.
python/pyspark/sql/tests.py

```diff
@@ -46,6 +46,14 @@
 else:
     import unittest

+_have_pandas = False
+try:
+    import pandas
+    _have_pandas = True
+except:
+    # No Pandas, but that's okay, we'll skip those tests
+    pass
+
 from pyspark import SparkContext
 from pyspark.sql import SparkSession, SQLContext, HiveContext, Column, Row
 from pyspark.sql.types import *

@@ -2274,6 +2282,22 @@ def count_bucketed_cols(names, table="pyspark_bucket"):
                 .mode("overwrite").saveAsTable("pyspark_bucket"))
         self.assertSetEqual(set(data), set(self.spark.table("pyspark_bucket").collect()))

+    @unittest.skipIf(not _have_pandas, "Pandas not installed")
+    def test_to_pandas(self):
+        import numpy as np
+        schema = StructType().add("a", IntegerType()).add("b", StringType())\
+            .add("c", BooleanType()).add("d", FloatType())
+        data = [
+            (1, "foo", True, 3.0), (2, "foo", True, 5.0),
+            (3, "bar", False, -1.0), (4, "bar", False, 6.0),
+        ]
+        df = self.spark.createDataFrame(data, schema)
+        types = df.toPandas().dtypes
+        self.assertEquals(types[0], np.int32)
+        self.assertEquals(types[1], np.object)
+        self.assertEquals(types[2], np.bool)
+        self.assertEquals(types[3], np.float32)
+

 class HiveSparkSubmitTests(SparkSubmitTests):
```
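The dtypes this regression test asserts can be reproduced in plain pandas, assuming `toPandas` applies the int32/float32 corrections from the patch (column names match the test's schema):

```python
import numpy as np
import pandas as pd

# The frame toPandas() is expected to produce for the test's data, with the
# corrected dtypes applied (int32 for IntegerType, float32 for FloatType).
pdf = pd.DataFrame({
    "a": pd.Series([1, 2, 3, 4], dtype=np.int32),
    "b": ["foo", "foo", "bar", "bar"],
    "c": [True, True, False, False],
    "d": pd.Series([3.0, 5.0, -1.0, 6.0], dtype=np.float32),
})
types = pdf.dtypes  # one dtype per column, indexed by column name
```

Note that the test above indexes `types` positionally and uses `np.object`/`np.bool`, spellings that modern numpy has removed; `object` and `np.bool_` are the portable equivalents.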
Review comment:

> Just in case someone blames this in the future, as a little side note: it looks like `copy` was introduced in 0.11.0, so Pandas 0.10.0 does not work with it. In 0.10.0, it does work without `copy`, but the types are not properly set as proposed here. However, I guess it is really fine, because 0.10.0 was released in 2012, when Spark was 0.6.x and Java was 6 & 7; that was 5 years ago. I am writing this comment only because, to my knowledge, we didn't specify a Pandas version requirement - spark/python/setup.py, line 202 in 314cf51.
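As a small illustration of the keyword under discussion (this assumes a pandas recent enough to have it, i.e. 0.11.0+):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])            # pandas infers int64
t = s.astype(np.int32, copy=False)  # `copy=` keyword: added in pandas 0.11.0
u = s.astype(np.int32)              # works on pre-0.11.0 pandas too, always copies
```

In both cases the dtype actually changes, so a new buffer is allocated anyway; `copy=False` only skips the copy when the requested dtype already matches.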
Reply:

> Thanks for the investigation! Maybe we should specify the version requirement for pandas.