[SPARK-31186][PySpark][SQL][2.4] toPandas should not fail on duplicate column names #28219
What changes were proposed in this pull request?
When toPandas is called on a DataFrame with duplicate column names, such as one produced by a join, it fails with an error. This patch fixes that error in toPandas.
This is a backport of the original patch to branch-2.4.
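As a rough illustration of why duplicate column names are awkward on the pandas side, the sketch below uses plain pandas only. It is not the code from this patch, and the rows and column names are hypothetical; it only shows that selecting by label becomes ambiguous once two columns share a name, while selecting by position does not.

```python
import pandas as pd

# Hypothetical data, shaped like the result of a join in which both
# sides contribute a column called "id". Not taken from the patch.
columns = ["id", "name", "id"]
rows = [(1, "a", 1), (2, "b", 2)]

pdf = pd.DataFrame(rows, columns=columns)

# Selecting by label returns every column with that name, so the
# result is a two-column DataFrame rather than a single Series.
by_label = pdf["id"]

# Selecting by position always yields exactly one column (a Series).
by_position = [pdf.iloc[:, i] for i in range(len(pdf.columns))]
```

Any per-column processing that assumes a label lookup returns a single Series can break on such a frame, which is one reason positional access is the safer choice when names may repeat.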
Why are the changes needed?
To make toPandas work on DataFrames with duplicate column names.
Does this PR introduce any user-facing change?
Yes. Previously, calling toPandas on a DataFrame with duplicate column names failed. After this patch, it produces the correct result.
How was this patch tested?
Unit test.