8 changes: 4 additions & 4 deletions R/pkg/R/DataFrame.R
@@ -822,21 +822,21 @@ setMethod("collect",
            # Get a column of complex type returns a list.
            # Get a cell from a column of complex type returns a list instead of a vector.
            col <- listCols[[colIndex]]
-           colName <- dtypes[[colIndex]][[1]]
            if (length(col) <= 0) {
-             df[[colName]] <- col
+             df[[colIndex]] <- col
            } else {
              colType <- dtypes[[colIndex]][[2]]
              # Note that "binary" columns behave like complex types.
              if (!is.null(PRIMITIVE_TYPES[[colType]]) && colType != "binary") {
                vec <- do.call(c, col)
                stopifnot(class(vec) != "list")
-               df[[colName]] <- vec
+               df[[colIndex]] <- vec
              } else {
-               df[[colName]] <- col
+               df[[colIndex]] <- col
              }
            }
          }
+         names(df) <- names(x)
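The switch from name-based to index-based assignment matters because base R's `[[<-` only ever writes to the first column matching a given name. A minimal base-R illustration of the failure mode (plain data.frame, no Spark needed; column contents are made up):

```r
# With duplicated column names, assignment by name always hits the
# FIRST matching column, so the second "a" column is never filled.
df <- data.frame(X1 = NA, X2 = NA)
names(df) <- c("a", "a")

df[["a"]] <- 1
df[["a"]] <- 2   # overwrites column 1 again; column 2 stays NA

# Index-based assignment reaches every column:
df[[1]] <- 1
df[[2]] <- 2
```

This is why collect() now fills columns by position and assigns the names from the schema once at the end.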
Contributor:

This is slightly different from 1.5. We will get exactly the same column names in the local data.frame, whereas in Spark 1.5 subsequent instances of the same name are appended with numbers. I am not sure which one is better; in fact I slightly prefer your suggested behavior. But just in case others want to chime in: cc @shivaram
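For comparison, the numbered-suffix behavior described above is also what base R's data.frame() does by default: with check.names = TRUE (the default) the names are passed through make.names(unique = TRUE). A small illustration:

```r
# Default: duplicates get numeric suffixes, as in Spark 1.5
d <- data.frame(a = 1, a = 2)
names(d)    # "a" "a.1"

# check.names = FALSE keeps the duplicates, as this PR does
d2 <- data.frame(a = 1, a = 2, check.names = FALSE)
names(d2)   # "a" "a"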

Member:
I think the current behavior in 1.6 is actually an unintentional change introduced by a recent change to the collect() code. Matching the 1.5.x behavior seems to make sense.

Contributor (Author):
I tested with Spark 1.4.1 and 1.5.1; both just keep the same names instead of making the duplicated names unique. So this PR's behavior is backward-compatible.

> df <- createDataFrame(sqlContext, list(list(1, 2)), schema = c("a", "a"))
> collect(df)
  a a
1 1 2

Actually, it is very easy to make the column names unique, for example:

names(df) <- make.names(names(x), unique = TRUE)

But we need to discuss whether this is the preferred behavior.
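For reference, make.names(unique = TRUE) appends .1, .2, … to repeated names:

```r
make.names(c("a", "a", "b", "a"), unique = TRUE)
# "a" "a.1" "b" "a.2"
```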

Contributor:
Yeah, let's keep the local data.frame names consistent with the schema in SQL (i.e., duplicated names are fine). If this is a breaking change, we can add a note in the release notes.

Contributor:
@falaki Just curious: what is the query you used to create the numbered columns?

Contributor:
I was using a left outer join and then collecting it.
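A left outer join reproduces the duplicate-name case because both inputs keep their own join column. A rough SparkR sketch (assumes a running SparkR session with a sqlContext; the data and column names are illustrative):

```r
df1 <- createDataFrame(sqlContext, data.frame(name = c("x", "y"), age = c(30, 40)))
df2 <- createDataFrame(sqlContext, data.frame(name = "x", test = TRUE))

joined <- join(df1, df2, df1$name == df2$name, "left_outer")
# Both sides contribute a "name" column:
#   names(joined) is "name" "age" "name" "test"
# With this PR, names(collect(joined)) should match the schema exactly
# instead of getting numbered suffixes.
```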

Contributor:
@sun-rui Can we add a test with a left outer join followed by collect()?

Contributor:
BTW: as I said, I like this behavior.

Contributor:
Yeah, this behavior is fine. I just want to make sure that example doesn't trigger some other code path, etc.

df
}
})
6 changes: 6 additions & 0 deletions R/pkg/inst/tests/test_sparkSQL.R
@@ -530,6 +530,11 @@ test_that("collect() returns a data.frame", {
expect_equal(names(rdf)[1], "age")
expect_equal(nrow(rdf), 0)
expect_equal(ncol(rdf), 2)

+  # collect() correctly handles multiple columns with same name
+  df <- createDataFrame(sqlContext, list(list(1, 2)), schema = c("name", "name"))
+  ldf <- collect(df)
+  expect_equal(names(ldf), c("name", "name"))
})

test_that("limit() returns DataFrame with the correct number of rows", {
@@ -1197,6 +1202,7 @@ test_that("join() and merge() on a DataFrame", {
joined <- join(df, df2)
expect_equal(names(joined), c("age", "name", "name", "test"))
expect_equal(count(joined), 12)
+  expect_equal(names(collect(joined)), c("age", "name", "name", "test"))

joined2 <- join(df, df2, df$name == df2$name)
expect_equal(names(joined2), c("age", "name", "name", "test"))