[SPARK-31441][PYSPARK][SQL][2.4] Support duplicated column names for toPandas with arrow execution. #28221

ueshin · 2020-04-15T06:18:16Z

What changes were proposed in this pull request?

This backports #28210 to branch-2.4.

This PR adds support for duplicated column names in toPandas with Arrow execution.

Why are the changes needed?

When we execute toPandas() with Arrow execution, it fails if the column names have duplicates.

>>> spark.sql("select 1 v, 1 v").toPandas()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/to/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 2132, in toPandas
    pdf = table.to_pandas()
  File "pyarrow/array.pxi", line 441, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 1367, in pyarrow.lib.Table._to_pandas
  File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 653, in table_to_blockmanager
    columns = _deserialize_column_index(table, all_columns, column_indexes)
  File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 704, in _deserialize_column_index
    columns = _flatten_single_level_multiindex(columns)
  File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 937, in _flatten_single_level_multiindex
    raise ValueError('Found non-unique column index')
ValueError: Found non-unique column index

Does this PR introduce any user-facing change?

Yes. Previously, the query above raised an error; after this PR, it returns the result:

>>> spark.sql("select 1 v, 1 v").toPandas()
   v  v
0  1  1
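
One way to picture the behavior is: run the conversion under unique positional names, then put the original (possibly duplicated) names back on the result. The sketch below is only an illustration of that idea, not the code in this patch. The helper names (`duplicated_names`, `positional_names`, `convert_keeping_duplicates`) and the `convert` callable are hypothetical.

```python
from collections import Counter


def duplicated_names(columns):
    """Return the column names that occur more than once, in first-seen order."""
    counts = Counter(columns)
    return [name for name, n in counts.items() if n > 1]


def positional_names(columns):
    """Build unique placeholder names (col_0, col_1, ...), one per column."""
    return ["col_{}".format(i) for i in range(len(columns))]


def convert_keeping_duplicates(columns, convert):
    """Run `convert` under unique names, then restore the original names.

    `convert` is a hypothetical callable: it takes a list of column names and
    returns an object with an assignable `columns` attribute (for example a
    pandas DataFrame).
    """
    columns = list(columns)
    if not duplicated_names(columns):
        # Nothing to work around: convert with the real names directly.
        return convert(columns)
    result = convert(positional_names(columns))
    # Restore the original labels, duplicates included.
    result.columns = columns
    return result
```

With PySpark, `convert` could be something like `lambda names: df.toDF(*names).toPandas()`, since `DataFrame.toDF` renames columns positionally and pandas accepts duplicated column labels on assignment.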

How was this patch tested?

Added and modified related tests.

SparkQA · 2020-04-15T07:05:02Z

Test build #121299 has finished for PR 28221 at commit 79ede2d.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2020-04-15T07:40:20Z

Jenkins, retest this please.

SparkQA · 2020-04-15T08:48:08Z

Test build #121310 has finished for PR 28221 at commit 79ede2d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2020-04-15T18:40:18Z

Thanks! Merging to branch-2.4.

…toPandas with arrow execution

Closes #28221 from ueshin/issues/SPARK-31441/2.4/to_pandas.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Takuya UESHIN <[email protected]>

Backport SPARK-31441.

79ede2d

ueshin requested review from HyukjinKwon and viirya April 15, 2020 06:18

probot-autolabeler bot added PYTHON SQL labels Apr 15, 2020

viirya approved these changes Apr 15, 2020

HyukjinKwon approved these changes Apr 15, 2020

ueshin closed this Apr 15, 2020
