[SPARK-12104][SPARKR] collect() does not handle multiple columns with same name. #10118
Conversation
cc @falaki
Test build #47114 has finished for PR 10118 at commit
looks good
This is slightly different from 1.5: we will get the exact same column names in the local data.frame, whereas in Spark 1.5 subsequent instances of the same name are appended with numbers. I am not sure which one is better; in fact, I slightly prefer your suggested behavior. But just in case others want to chime in: cc @shivaram
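For context, base R itself supports both behaviors, depending on `check.names`; a small illustration (not from this PR, values are arbitrary):

```r
# Base R deduplicates repeated column names by default, which is what the
# Spark 1.5-style numbering looks like:
data.frame(a = 1, a = 2)
#>   a a.1
#> 1 1   2

# With check.names = FALSE the duplicates are preserved, matching the
# behavior this PR proposes for collect():
data.frame(a = 1, a = 2, check.names = FALSE)
#>   a a
#> 1 1 2
```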
I think the current behavior in 1.6 is actually an unintentional side effect of a recent change to the collect() code. Matching the 1.5.x behavior seems to make sense.
I tested with Spark 1.4.1 and 1.5.1; both just keep the same names instead of making the duplicated names unique, so this PR's behavior is backward compatible.
```r
> df <- createDataFrame(sqlContext, list(list(1, 2)), schema = c("a", "a"))
> collect(df)
  a a
1 1 2
```
Actually, it is very easy to make the column names unique, for example:

```r
names(df) <- make.names(names(df), unique = TRUE)
```

But we need to discuss: is this the preferred behavior?
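For reference, `make.names(..., unique = TRUE)` appends numeric suffixes to repeats; a quick illustration with arbitrary names:

```r
# Repeated names get ".1", ".2", ... suffixes:
make.names(c("a", "a", "a"), unique = TRUE)
#> [1] "a"   "a.1" "a.2"
```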
Yeah, let's keep the local DF names consistent with the schema in SQL (i.e., duplicated `name`, `name` is fine). If this is a breaking change we can add a note in the release notes.
@falaki Just curious: what is the query you used to create the numbered columns?
I was using a left outer join and then collecting it.
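A minimal SparkR sketch of that scenario (table and column names are illustrative, not taken from the PR):

```r
# Two DataFrames that share column names:
left  <- createDataFrame(sqlContext, data.frame(key = c(1, 2), value = c("x", "y")))
right <- createDataFrame(sqlContext, data.frame(key = 1, value = "z"))

# A left outer join keeps both "key" and both "value" columns, so the
# result has duplicate column names when collected:
joined <- join(left, right, left$key == right$key, "left_outer")
collect(joined)
```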
@sun-rui Can we add a test with a left outer join and then collect?
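A minimal sketch of such a test, assuming the usual SparkR testthat setup with a `sqlContext` in scope (names and values are illustrative):

```r
test_that("collect() preserves duplicate column names after a left outer join", {
  df1 <- createDataFrame(sqlContext, data.frame(key = c(1, 2), v = c("a", "b")))
  df2 <- createDataFrame(sqlContext, data.frame(key = 1, v = "c"))
  joined <- join(df1, df2, df1$key == df2$key, "left_outer")
  ldf <- collect(joined)
  # Duplicate names should come back exactly as they appear in the SQL schema.
  expect_equal(names(ldf), c("key", "v", "key", "v"))
})
```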
BTW: as I said, I like this behavior.
Yeah this behavior is fine. I just want to make sure that example doesn't trigger some other code path etc.
Thanks!
@falaki, I can't reproduce your result. For example, in Spark 1.5.1:
@shivaram, no, done with 1.5.1
OK. Change LGTM. Merging this. We can discuss the left_join issue later if required.
Test build #47178 has finished for PR 10118 at commit
[SPARK-12104][SPARKR] collect() does not handle multiple columns with same name.

Author: Sun Rui <[email protected]>

Closes #10118 from sun-rui/SPARK-12104.

(cherry picked from commit 5011f26)
Signed-off-by: Shivaram Venkataraman <[email protected]>
No description provided.