Commit 07fd68a

[SPARK-21897][PYTHON][R] Add unionByName API to DataFrame in Python and R
## What changes were proposed in this pull request?

This PR proposes to add a wrapper for the `unionByName` API to R and Python as well.

**Python**

```python
df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])
df1.unionByName(df2).show()
```

```
+----+----+----+
|col0|col1|col2|
+----+----+----+
|   1|   2|   3|
|   6|   4|   5|
+----+----+----+
```

**R**

```R
df1 <- select(createDataFrame(mtcars), "carb", "am", "gear")
df2 <- select(createDataFrame(mtcars), "am", "gear", "carb")
head(unionByName(limit(df1, 2), limit(df2, 2)))
```

```
  carb am gear
1    4  1    4
2    4  1    4
3    4  1    4
4    4  1    4
```

## How was this patch tested?

Doctests for Python and a unit test added in `test_sparkSQL.R` for R.

Author: hyukjinkwon <[email protected]>

Closes #19105 from HyukjinKwon/unionByName-r-python.
1 parent acb7fed commit 07fd68a
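The by-name resolution shown in the examples above can be sketched outside Spark. The following is not Spark's implementation, just a minimal plain-Python illustration of the semantics (`union_by_name` is a hypothetical helper): each right-hand row is reordered to match the left-hand column names before a positional union.

```python
def union_by_name(cols1, rows1, cols2, rows2):
    """Positional union after reordering rows2 to match cols1 by column name."""
    if set(cols1) != set(cols2):
        raise ValueError("both inputs must have the same set of column names")
    # For each left-hand column name, find its position in the right-hand schema.
    idx = [cols2.index(c) for c in cols1]
    reordered = [[row[i] for i in idx] for row in rows2]
    return cols1, rows1 + reordered

cols, rows = union_by_name(
    ["col0", "col1", "col2"], [[1, 2, 3]],
    ["col1", "col2", "col0"], [[4, 5, 6]],
)
print(cols)  # ['col0', 'col1', 'col2']
print(rows)  # [[1, 2, 3], [6, 4, 5]]
```

This reproduces the `| 6| 4| 5|` row from the commit message: `col0` of `df2` holds `6`, so it lands in the first output column.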

File tree: 5 files changed, +74 −6 lines
R/pkg/NAMESPACE

Lines changed: 1 addition & 0 deletions
```diff
@@ -169,6 +169,7 @@ exportMethods("arrange",
               "transform",
               "union",
               "unionAll",
+              "unionByName",
               "unique",
               "unpersist",
               "where",
```

R/pkg/R/DataFrame.R

Lines changed: 36 additions & 2 deletions
```diff
@@ -2683,7 +2683,7 @@ generateAliasesForIntersectedCols <- function (x, intersectedColNames, suffix) {
 #' @rdname union
 #' @name union
 #' @aliases union,SparkDataFrame,SparkDataFrame-method
-#' @seealso \link{rbind}
+#' @seealso \link{rbind} \link{unionByName}
 #' @export
 #' @examples
 #'\dontrun{
@@ -2714,6 +2714,40 @@ setMethod("unionAll",
             union(x, y)
           })

+#' Return a new SparkDataFrame containing the union of rows, matched by column names
+#'
+#' Return a new SparkDataFrame containing the union of rows in this SparkDataFrame
+#' and another SparkDataFrame. This is different from the \code{union} function, and from
+#' both \code{UNION ALL} and \code{UNION DISTINCT} in SQL, as column positions are not
+#' taken into account. Input SparkDataFrames can have different data types in the schema.
+#'
+#' Note: This does not remove duplicate rows across the two SparkDataFrames.
+#' This function resolves columns by name (not by position).
+#'
+#' @param x A SparkDataFrame
+#' @param y A SparkDataFrame
+#' @return A SparkDataFrame containing the result of the union.
+#' @family SparkDataFrame functions
+#' @rdname unionByName
+#' @name unionByName
+#' @aliases unionByName,SparkDataFrame,SparkDataFrame-method
+#' @seealso \link{rbind} \link{union}
+#' @export
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' df1 <- select(createDataFrame(mtcars), "carb", "am", "gear")
+#' df2 <- select(createDataFrame(mtcars), "am", "gear", "carb")
+#' head(unionByName(df1, df2))
+#' }
+#' @note unionByName since 2.3.0
+setMethod("unionByName",
+          signature(x = "SparkDataFrame", y = "SparkDataFrame"),
+          function(x, y) {
+            unioned <- callJMethod(x@sdf, "unionByName", y@sdf)
+            dataFrame(unioned)
+          })
+
 #' Union two or more SparkDataFrames
 #'
 #' Union two or more SparkDataFrames by row. As in R's \code{rbind}, this method
@@ -2730,7 +2764,7 @@ setMethod("unionAll",
 #' @aliases rbind,SparkDataFrame-method
 #' @rdname rbind
 #' @name rbind
-#' @seealso \link{union}
+#' @seealso \link{union} \link{unionByName}
 #' @export
 #' @examples
 #'\dontrun{
```

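The Roxygen note above stresses that `unionByName`, like `union`, keeps `UNION ALL` semantics and does not deduplicate; the docs point callers at a follow-up `distinct` for `UNION DISTINCT` behavior. A plain-Python sketch of that distinction (`union_all` and `distinct` are hypothetical helpers, not SparkR code):

```python
def union_all(rows1, rows2):
    # UNION ALL semantics: concatenate, keeping duplicate rows.
    return rows1 + rows2

def distinct(rows):
    # UNION DISTINCT semantics: drop duplicate rows, preserving first-seen order.
    seen, out = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

combined = union_all([[1, 2]], [[1, 2], [3, 4]])
print(combined)            # [[1, 2], [1, 2], [3, 4]]
print(distinct(combined))  # [[1, 2], [3, 4]]
```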
R/pkg/R/generics.R

Lines changed: 4 additions & 0 deletions
```diff
@@ -769,6 +769,10 @@ setGeneric("union", function(x, y) { standardGeneric("union") })
 #' @export
 setGeneric("unionAll", function(x, y) { standardGeneric("unionAll") })

+#' @rdname unionByName
+#' @export
+setGeneric("unionByName", function(x, y) { standardGeneric("unionByName") })
+
 #' @rdname unpersist
 #' @export
 setGeneric("unpersist", function(x, ...) { standardGeneric("unpersist") })
```

R/pkg/tests/fulltests/test_sparkSQL.R

Lines changed: 8 additions & 1 deletion
```diff
@@ -2255,7 +2255,7 @@ test_that("isLocal()", {
   expect_false(isLocal(df))
 })

-test_that("union(), rbind(), except(), and intersect() on a DataFrame", {
+test_that("union(), unionByName(), rbind(), except(), and intersect() on a DataFrame", {
   df <- read.json(jsonPath)

   lines <- c("{\"name\":\"Bob\", \"age\":24}",
@@ -2271,6 +2271,13 @@ test_that("union(), rbind(), except(), and intersect() on a DataFrame", {
   expect_equal(first(unioned)$name, "Michael")
   expect_equal(count(arrange(suppressWarnings(unionAll(df, df2)), df$age)), 6)

+  df1 <- select(df2, "age", "name")
+  unioned1 <- arrange(unionByName(df1, df), df1$age)
+  expect_is(unioned1, "SparkDataFrame")
+  expect_equal(count(unioned1), 6)
+  # Here, we test if 'Michael' in df is correctly mapped to the same name.
+  expect_equal(first(unioned1)$name, "Michael")
+
   unioned2 <- arrange(rbind(unioned, df, df2), df$age)
   expect_is(unioned2, "SparkDataFrame")
   expect_equal(count(unioned2), 12)
```

python/pyspark/sql/dataframe.py

Lines changed: 25 additions & 3 deletions
```diff
@@ -1290,7 +1290,7 @@ def union(self, other):
         """ Return a new :class:`DataFrame` containing union of rows in this and another frame.

         This is equivalent to `UNION ALL` in SQL. To do a SQL-style set union
-        (that does deduplication of elements), use this function followed by a distinct.
+        (that does deduplication of elements), use this function followed by :func:`distinct`.

         Also as standard in SQL, this function resolves columns by position (not by name).
         """
@@ -1301,14 +1301,36 @@ def unionAll(self, other):
         """ Return a new :class:`DataFrame` containing union of rows in this and another frame.

         This is equivalent to `UNION ALL` in SQL. To do a SQL-style set union
-        (that does deduplication of elements), use this function followed by a distinct.
+        (that does deduplication of elements), use this function followed by :func:`distinct`.

         Also as standard in SQL, this function resolves columns by position (not by name).

-        .. note:: Deprecated in 2.0, use union instead.
+        .. note:: Deprecated in 2.0, use :func:`union` instead.
         """
         return self.union(other)

+    @since(2.3)
+    def unionByName(self, other):
+        """ Returns a new :class:`DataFrame` containing union of rows in this and another frame.
+
+        This is different from both `UNION ALL` and `UNION DISTINCT` in SQL. To do a SQL-style set
+        union (that does deduplication of elements), use this function followed by :func:`distinct`.
+
+        The difference between this function and :func:`union` is that this function
+        resolves columns by name (not by position):
+
+        >>> df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
+        >>> df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])
+        >>> df1.unionByName(df2).show()
+        +----+----+----+
+        |col0|col1|col2|
+        +----+----+----+
+        |   1|   2|   3|
+        |   6|   4|   5|
+        +----+----+----+
+        """
+        return DataFrame(self._jdf.unionByName(other._jdf), self.sql_ctx)
+
     @since(1.3)
     def intersect(self, other):
         """ Return a new :class:`DataFrame` containing rows only in
```

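The docstrings above draw the key contrast: `union` resolves columns by position, `unionByName` by name. A minimal plain-Python sketch of the two behaviors (`positional_union` and `by_name_union` are hypothetical helpers, not the PySpark API):

```python
def positional_union(cols1, rows1, cols2, rows2):
    # union() semantics: right-hand column names are ignored; rows append as-is.
    return cols1, rows1 + rows2

def by_name_union(cols1, rows1, cols2, rows2):
    # unionByName() semantics: right-hand values are realigned to the left schema.
    idx = [cols2.index(c) for c in cols1]
    return cols1, rows1 + [[r[i] for i in idx] for r in rows2]

cols1, rows1 = ["col0", "col1", "col2"], [[1, 2, 3]]
cols2, rows2 = ["col1", "col2", "col0"], [[4, 5, 6]]
print(positional_union(cols1, rows1, cols2, rows2)[1])  # [[1, 2, 3], [4, 5, 6]]
print(by_name_union(cols1, rows1, cols2, rows2)[1])     # [[1, 2, 3], [6, 4, 5]]
```

With positional resolution the value `4` (from `df2`'s `col1`) silently lands under `col0`; by-name resolution is what makes the doctest's second row come out as `6, 4, 5`.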