Conversation

@redsanket commented May 21, 2020

What changes were proposed in this pull request?

sc.union over pair RDDs raises a Py4JError, because PySpark always builds a JavaRDD[] array on the JVM side. The fix is to check the instance type of the underlying Java RDD before proceeding.

Why are the changes needed?

Changes are needed so that users do not hit errors when calling sc.union on RDDs whose underlying Java RDD is any type other than JavaRDD (e.g. JavaPairRDD or JavaDoubleRDD).

Does this PR introduce any user-facing change?

Yes

Before:

SparkSession available as 'spark'.
>>> rdd1 = sc.parallelize([1,2,3,4,5])
>>> rdd2 = sc.parallelize([6,7,8,9,10])
>>> pairRDD1 = rdd1.zip(rdd2)
>>> unionRDD1 = sc.union([pairRDD1, pairRDD1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union
    jrdds[i] = rdds[i]._jrdd
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in __setitem__
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:
py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD
	at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166)
	at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144)
	at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

After:

>>> rdd2 = sc.parallelize([6,7,8,9,10])
>>> pairRDD1 = rdd1.zip(rdd2)
>>> unionRDD1 = sc.union([pairRDD1, pairRDD1])
>>> unionRDD1.collect()
[(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (1, 6), (2, 7), (3, 8), (4, 9), (5, 10)]

How was this patch tested?

Tested manually with the reproduction snippet above.
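The shape of the fix can be sketched in plain Python. Below, JavaRDD, JavaPairRDD, JavaDoubleRDD, and pick_jrdd_class are hypothetical stand-ins for illustration, not the real py4j proxy classes or any actual PySpark function:

```python
# Stand-in classes for the JVM-side RDD wrapper types.
class JavaRDD: pass
class JavaPairRDD: pass
class JavaDoubleRDD: pass

def pick_jrdd_class(first_jrdd):
    # Mirror of the if/elif chain in the fix: test each supported class
    # in turn and fail loudly on anything else, instead of unconditionally
    # assuming JavaRDD.
    for cls in (JavaRDD, JavaPairRDD, JavaDoubleRDD):
        if isinstance(first_jrdd, cls):
            return cls
    raise TypeError("Unsupported Java RDD class %s" % type(first_jrdd).__name__)
```

With this dispatch, a pair RDD selects JavaPairRDD rather than being forced into a JavaRDD slot.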

@HyukjinKwon (Member) commented May 22, 2020

Shall we also add a unit test? Also, please describe the behavior before and after this fix under "Does this PR introduce any user-facing change?". Technically, I think this IS a user-facing behaviour change, from an error to a working case.

@redsanket (Author)

ok

@HyukjinKwon (Member)

ok to test

@HyukjinKwon HyukjinKwon changed the title [SPARK-31788][CORE] Fix UnionRDD of PairRDDs [SPARK-31788][CORE][PYTHON] Fix UnionRDD of PairRDDs May 23, 2020
@SparkQA commented May 23, 2020

Test build #123023 has finished for PR 28603 at commit c65be0d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

Merged to master and branch-3.0.

HyukjinKwon pushed a commit that referenced this pull request May 25, 2020

Closes #28603 from redsanket/SPARK-31788.

Authored-by: schintap <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit a61911c)
Signed-off-by: HyukjinKwon <[email protected]>
@leewyang (Contributor)

@redsanket @HyukjinKwon I pulled the latest branch-3.0 today (which includes this patch), but I'm now seeing the following weird behavior:

>>> rdd1 = sc.parallelize([1,2,3,4,5])
>>> rdd2 = sc.parallelize([6,7,8,9,10])
>>> pairRDD1 = rdd1.zip(rdd2)
>>> unionRDD1 = sc.union([pairRDD1, pairRDD1])
>>> unionRDD1.collect()
[((1, 6), (2, 7)), ((3, 8), (4, 9)), ((5, 10), (1, 6)), ((2, 7), (3, 8)), ((4, 9), (5, 10))] 
>>> unionRDD1.count()
0 

... where Spark 2.4.5 produces:

>>> unionRDD1.collect()
[(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (1, 6), (2, 7), (3, 8), (4, 9), (5, 10)]
>>> unionRDD1.count()
10

So, the output is incorrect/unexpected and the count is zero.
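One way to see what may have gone wrong (an illustration only, not a claim about the actual Spark internals): the broken collect() output is exactly what results from re-reading the expected flat stream of ten pairs two items at a time, which is consistent with a serializer mismatch after the union rather than with the union producing wrong data:

```python
# The union result Spark 2.4.5 correctly returns: ten (k, v) pairs.
flat = [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10)] * 2

# Consume the same stream two elements at a time, as a mismatched
# pair deserializer would.
regrouped = list(zip(flat[0::2], flat[1::2]))
print(regrouped)
# [((1, 6), (2, 7)), ((3, 8), (4, 9)), ((5, 10), (1, 6)), ((2, 7), (3, 8)), ((4, 9), (5, 10))]
```

The regrouped list matches the broken branch-3.0 output above item for item.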

@redsanket (Author) commented May 26, 2020

@leewyang you are right; the review changes suggested by @HyukjinKwon in 1d8d308 caused the change in behavior. It was working as expected prior to that. @HyukjinKwon, are we sure the mapping of JavaPairRDD to JavaRDD is the right approach here?

Original PR code snippet, for reference:

jrdd_cls = jvm.org.apache.spark.api.java.JavaRDD
pair_jrdd_cls = jvm.org.apache.spark.api.java.JavaPairRDD
double_jrdd_cls = jvm.org.apache.spark.api.java.JavaDoubleRDD
if is_instance_of(gw, rdds[0]._jrdd, jrdd_cls):
    cls = jrdd_cls
elif is_instance_of(gw, rdds[0]._jrdd, pair_jrdd_cls):
    cls = pair_jrdd_cls
elif is_instance_of(gw, rdds[0]._jrdd, double_jrdd_cls):
    cls = double_jrdd_cls
else:
    raise TypeError("Unsupported Java RDD class %s" % rdds[0]._jrdd)
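The failure mode this snippet guards against can be simulated in plain Python. The stand-in classes and the toy TypedArray below are hypothetical illustrations (not the real py4j API): a JVM array created for one class rejects elements of another class, so a JavaRDD[] cannot hold a JavaPairRDD, while an array built from the checked class can:

```python
# Stand-ins for the JVM wrapper classes.
class JavaRDD: pass
class JavaPairRDD: pass

class TypedArray:
    """Toy model of a typed JVM array: assignment checks the element type."""
    def __init__(self, cls, n):
        self.cls, self.items = cls, [None] * n
    def __setitem__(self, i, value):
        if not isinstance(value, self.cls):
            raise TypeError("Cannot convert %s to %s"
                            % (type(value).__name__, self.cls.__name__))
        self.items[i] = value

pair = JavaPairRDD()

bad = TypedArray(JavaRDD, 1)       # pre-fix behavior: always JavaRDD[]
try:
    bad[0] = pair                  # rejected, like the Py4JError above
except TypeError as e:
    print(e)                       # Cannot convert JavaPairRDD to JavaRDD

good = TypedArray(type(pair), 1)   # post-fix: array of the checked class
good[0] = pair                     # fits
```

Checking the instance type first means the array element type always matches the RDDs being unioned.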

@HyukjinKwon (Member) commented May 27, 2020

Okay, I just noticed f83fedc caused this problem, and this is a regression. I am going to revert this PR.

@HyukjinKwon (Member)

Yes, I think I rushed when reading it. Let me make another PR that uses your fix. I think we should also fix the streaming side, which was changed in f83fedc, together.
