Skip to content

Conversation

@ueshin
Copy link
Member

@ueshin ueshin commented Apr 15, 2020

What changes were proposed in this pull request?

This is to backport #28210.

This PR is adding support duplicated column names for toPandas with Arrow execution.

Why are the changes needed?

When we execute toPandas() with Arrow execution, it fails if the column names have duplicates.

>>> spark.sql("select 1 v, 1 v").toPandas()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/to/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 2132, in toPandas
    pdf = table.to_pandas()
  File "pyarrow/array.pxi", line 441, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 1367, in pyarrow.lib.Table._to_pandas
  File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 653, in table_to_blockmanager
    columns = _deserialize_column_index(table, all_columns, column_indexes)
  File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 704, in _deserialize_column_index
    columns = _flatten_single_level_multiindex(columns)
  File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 937, in _flatten_single_level_multiindex
    raise ValueError('Found non-unique column index')
ValueError: Found non-unique column index

Does this PR introduce any user-facing change?

Yes, previously we will face an error above, but after this PR, we will see the result:

>>> spark.sql("select 1 v, 1 v").toPandas()
   v  v
0  1  1

How was this patch tested?

Added and modified related tests.

@SparkQA
Copy link

SparkQA commented Apr 15, 2020

Test build #121299 has finished for PR 28221 at commit 79ede2d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Copy link
Member Author

ueshin commented Apr 15, 2020

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Apr 15, 2020

Test build #121310 has finished for PR 28221 at commit 79ede2d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Copy link
Member Author

ueshin commented Apr 15, 2020

Thanks! merging to branch-2.4.

ueshin added a commit that referenced this pull request Apr 15, 2020
…toPandas with arrow execution

### What changes were proposed in this pull request?

This is to backport #28210.

This PR is adding support duplicated column names for `toPandas` with Arrow execution.

### Why are the changes needed?

When we execute `toPandas()` with Arrow execution, it fails if the column names have duplicates.

```py
>>> spark.sql("select 1 v, 1 v").toPandas()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/to/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 2132, in toPandas
    pdf = table.to_pandas()
  File "pyarrow/array.pxi", line 441, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 1367, in pyarrow.lib.Table._to_pandas
  File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 653, in table_to_blockmanager
    columns = _deserialize_column_index(table, all_columns, column_indexes)
  File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 704, in _deserialize_column_index
    columns = _flatten_single_level_multiindex(columns)
  File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 937, in _flatten_single_level_multiindex
    raise ValueError('Found non-unique column index')
ValueError: Found non-unique column index
```

### Does this PR introduce any user-facing change?

Yes, previously we will face an error above, but after this PR, we will see the result:

```py
>>> spark.sql("select 1 v, 1 v").toPandas()
   v  v
0  1  1
```

### How was this patch tested?

Added and modified related tests.

Closes #28221 from ueshin/issues/SPARK-31441/2.4/to_pandas.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Takuya UESHIN <[email protected]>
@ueshin ueshin closed this Apr 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants