Conversation

@sun-rui (Contributor) commented Dec 3, 2015

No description provided.

@shivaram (Contributor) commented Dec 3, 2015

cc @falaki

@SparkQA commented Dec 3, 2015

Test build #47114 has finished for PR 10118 at commit b3c654f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member) commented:

looks good

Review comment (Contributor):

This is slightly different from 1.5: we get the exact same column names in the local data.frame, whereas in Spark 1.5 subsequent instances of the same name have numbers appended. I am not sure which one is better; in fact, I slightly prefer your suggested behavior. But just in case others want to chime in: cc @shivaram

Review comment (Member):

I think the current behavior in 1.6 is actually an unintentional result of a recent change in the collect() code. Matching the 1.5.x behavior seems to make sense.

Review comment (Contributor, Author):

I tested with Spark 1.4.1 and 1.5.1; both keep the duplicated names as-is instead of making them unique, so this PR's behavior is backward-compatible.

> df <- createDataFrame(sqlContext, list(list(1, 2)), schema = c("a", "a"))
> collect(df)
  a a
1 1 2

Actually, it is easy to make the column names unique, for example:

names(df) <- make.names(names(df), unique = TRUE)

But this needs discussion: is that the preferred behavior?
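
For reference, a minimal sketch of what make.names(unique = TRUE) does to duplicated names. This is plain base R behavior applied to a local vector of names, not something this PR changes:

> nms <- c("a", "a", "value", "a")
> make.names(nms, unique = TRUE)
[1] "a"     "a.1"   "value" "a.2"

Each duplicate after the first gets a ".1", ".2", ... suffix, i.e. the numbered-column style described above.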

Review comment (Contributor):

Yeah, let's keep the local DF names consistent with the schema in SQL (i.e., duplicated names such as "name, name" are fine). If this is a breaking change we can add a note in the release notes.

Review comment (Contributor):

@falaki Just curious: what is the query you used to create the numbered columns?

Review comment (Contributor):

I was using a left outer join and then collecting it.

Review comment (Contributor):

@sun-rui Can we add a test with a left outer join followed by collect()?
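
A minimal sketch of what such a test could look like, in the testthat style SparkR's SQL tests use; the expected names are illustrative (they match the duplicated-name behavior this PR settles on), not necessarily the test that was actually added:

test_that("collect() keeps duplicated column names after a left outer join", {
  df <- createDataFrame(sqlContext, list(list(1, "1")), schema = c("key", "value"))
  # Self-join: the result schema contains "key" and "value" twice.
  joined <- join(df, df, df$key == df$key, "left_outer")
  expect_equal(names(collect(joined)), c("key", "value", "key", "value"))
})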

Review comment (Contributor):

BTW: as I said, I like this behavior.

Review comment (Contributor):

Yeah, this behavior is fine. I just want to make sure that example doesn't trigger some other code path, etc.

@falaki (Contributor) commented Dec 3, 2015

Thanks!

@sun-rui (Contributor, Author) commented Dec 4, 2015

@falaki, I can't reproduce your result. For example, in Spark 1.5.1:

> df <- createDataFrame(sqlContext, list(list(1, "1")), schema = c("key", "value"))
> df1 <- join(df, df)
> collect(df1)
  key value key value
1   1     1   1     1
> df1 <- join(df, df, df$key == df$key, "left_outer")
> collect(df1)
  key value key value
1   1     1   1     1
> names(collect(df1))
[1] "key"   "value" "key"   "value"

@shivaram (Contributor) commented Dec 4, 2015

@sun-rui Was your test done with the patch? I think @falaki's example was on 1.5.1.

@sun-rui (Contributor, Author) commented Dec 4, 2015

@shivaram, no, it was done with 1.5.1.

@shivaram (Contributor) commented Dec 4, 2015

OK. Change LGTM. Merging this. We can discuss the left outer join issue later if required.

@SparkQA commented Dec 4, 2015

Test build #47178 has finished for PR 10118 at commit 3760410.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request on Dec 4, 2015:

… same name.

Author: Sun Rui <[email protected]>

Closes #10118 from sun-rui/SPARK-12104.

(cherry picked from commit 5011f26)
Signed-off-by: Shivaram Venkataraman <[email protected]>
asfgit closed this in 5011f26 on Dec 4, 2015.