# [SPARK-31788][CORE][PYTHON] Fix UnionRDD of PairRDDs #28603
Conversation
Shall we also add a unit test? Also, please describe before/after this fix in "Does this PR introduce any user-facing change?". Technically I think this IS a user-facing behaviour change, from an error to a working case.
ok
ok to test
Test build #123023 has finished for PR 28603 at commit
Merged to master and branch-3.0.
### What changes were proposed in this pull request?

UnionRDD of PairRDDs was causing a bug. The fix is to check the instance type before proceeding.

### Why are the changes needed?

Changes are needed to avoid users running into issues with the union RDD operation when given any type other than `JavaRDD`.

### Does this PR introduce _any_ user-facing change?

Yes.

Before:

```python
SparkSession available as 'spark'.
>>> rdd1 = sc.parallelize([1,2,3,4,5])
>>> rdd2 = sc.parallelize([6,7,8,9,10])
>>> pairRDD1 = rdd1.zip(rdd2)
>>> unionRDD1 = sc.union([pairRDD1, pairRDD1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union
    jrdds[i] = rdds[i]._jrdd
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in __setitem__
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:
py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD
	at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166)
	at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144)
	at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
```

After:

```python
>>> rdd2 = sc.parallelize([6,7,8,9,10])
>>> pairRDD1 = rdd1.zip(rdd2)
>>> unionRDD1 = sc.union([pairRDD1, pairRDD1])
>>> unionRDD1.collect()
[(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (1, 6), (2, 7), (3, 8), (4, 9), (5, 10)]
```

### How was this patch tested?

Tested manually with the reproduction snippet above.

Closes #28603 from redsanket/SPARK-31788.
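To illustrate why the typed py4j array rejects pair RDDs and what "check for instance type before proceeding" means, here is a minimal plain-Python sketch (no Spark required). All names in it — `FakeRDD`, `FakePairRDD`, `to_plain`, `union` — are hypothetical stand-ins, not Spark's or py4j's actual API: the idea is simply to normalize pair-typed elements to the plain type before they are placed in a homogeneous array.

```python
# Hypothetical sketch of the "check instance type before union" idea.
# FakeRDD/FakePairRDD stand in for the JavaRDD/JavaPairRDD wrappers;
# none of these names are real Spark API.

class FakeRDD:
    """Stands in for a plain JavaRDD wrapper."""
    def __init__(self, data):
        self.data = list(data)

class FakePairRDD(FakeRDD):
    """Stands in for a JavaPairRDD wrapper of (key, value) pairs."""
    def to_plain(self):
        # Analogous to mapping a JavaPairRDD down to a JavaRDD of tuples.
        return FakeRDD(self.data)

def union(rdds):
    # The fix in spirit: inspect each element's type and normalize
    # pair RDDs first, instead of letting a typed-array assignment
    # (jrdds[i] = ...) fail later inside py4j.
    normalized = [r.to_plain() if isinstance(r, FakePairRDD) else r
                  for r in rdds]
    out = []
    for r in normalized:
        out.extend(r.data)
    return FakeRDD(out)

pair = FakePairRDD(zip([1, 2, 3], [6, 7, 8]))
result = union([pair, pair])
print(result.data)  # [(1, 6), (2, 7), (3, 8), (1, 6), (2, 7), (3, 8)]
```

Without the `isinstance` check, the pair-typed element would reach the homogeneous array untouched, which is the analogue of py4j's `Cannot convert JavaPairRDD to JavaRDD` failure shown in the traceback above.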
Authored-by: schintap <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit a61911c)
Signed-off-by: HyukjinKwon <[email protected]>
@redsanket @HyukjinKwon I pulled the latest ... where Spark 2.4.5 produces: So, the output is incorrect/unexpected and the count is zero.
@leewyang you are right: the review changes suggested by @HyukjinKwon in 1d8d308 caused the change in behavior. It was working as expected prior to that. @HyukjinKwon, are we sure the mapping of JavaPairRDD to JavaRDD is the right approach here? Original PR code snippet reference
Okay, I just noticed that f83fedc caused this problem, and this is a regression. I am going to revert this PR.
Yes, I think I rushed reading it. Let me make another PR to use your fix. I think we should fix the streaming side together; fixed in f83fedc