-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-1460] Returning SchemaRDD instead of normal RDD on Set operations... #448
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
First try, pls comment. Not very comfortable with methods that take other RDDs, like intersect, subtract and union, since caller has to make sure they of the same schema. |
|
Can one of the admins verify this patch? |
|
ok to test |
|
Merged build triggered. |
|
Merged build started. |
|
Thanks for doing this! I think we are actually okay for Union is a little more troublesome. We could check the schema and throw an error if they don't match, but that is kinda changing the semantics relative the the standard So, I'd propose we leave union out, as users that want SQL semantics here can already call unionAll. @mateiz might have thoughts here too. A few other methods we can add that also don't change the schema:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is already a sample method that returns SchemaRDD above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I noticed it. It has a different signature and doesn't override the base sample() method. I thought this was done on purpose - you wanted to keep both implementations, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see the execution eventually calls the base method, so it is equivalent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you are going to end up getting the same thing. I'd say we drop this one and leave the other. Right now it probably doesn't matter, but the other one is lazy and gives the optimizer a chance to possibly improve things before actually executing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the base impl lazy also (till compute() is called)? It's kind of hard to tell the difference for users. What's your thinking in keeping the base implementation around (not overriding it)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, you are right. The base impl is probably lazy too. The distinction I was trying to make is that while normal RDD operations are lazy, they are not holistically optimized before execution. Where as if we create a logical operator and defer the creation of RDDs, there may be some extra chances for optimization (at some point in the future). We definitely want to override the base impl, but we don't need to have multiple redundant methods for creating samples.
Also note that you might need to sync with the changes being made in #462 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the heads-up. So #462 is already doing it, I'll skip it (I meant overriding it with the query one too).
|
Also, we should make the same changes to the Java and Python API if possible. |
|
Merged build finished. All automated tests passed. |
|
All automated tests passed. |
|
I agree with leaving union out and adding repartition, coalesce and the other version of distinct. Also these should definitely be added to Java and Python too. |
|
Thanks for your suggestions. I'll update. Btw, I don't see PythonSchemaRDD in the code base yet, can I leave out Python for now? |
|
@marmbrus you are right, I can't override randomSplit() due to invariance of Array. How about cache(), persist(), unpersist()? |
Good catch!
It is just called |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't forget to remove this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
Oh, I see. Thx. |
|
Build triggered. |
|
Build started. |
|
Hey, just pushed an update on Scala and Java API. Wanted to get some feedback before I move on to Python. Pls pay attention to signatures of filter, intersection and subtract on the Java API. Thx. |
|
Build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14414/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should have mentioned this before, but we could consider using this.type instead in the Base RDD class for these methods. I'm not sure if that is breaking API or too much scala magic though. @mateiz ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be okay trying it, my questions then are what it looks like in Scaladoc and what it looks like in Java. We should also double-check that Scala expects binary compatibility for this kind of return type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing it out, will verify this.
|
Looking pretty good. Thanks again for working on this! Jenkins, test this please. |
|
Build triggered. |
|
Build started. |
|
@marmbrus @mateiz Here's an update on using this.type as return type. See following shell output. The result type for setName() method has changed from org.apache.spark.rdd.RDD[Record] to rdd.type. Similar things happened for SchemaRDD and JavaSchemaRDD (not a subclass). The benefit is we don't have to reimplement those methods in subclasses. |
|
PS. Subclasses that override those methods may have to be updated and recompiled (like what I did in EdgeRDD, VertexRDD). Better ideas? |
|
Build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14611/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why were these removed ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mridulm It's not necessary to overload these anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thanks
|
Merged build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14731/ |
|
Jenkins, retest this please. |
|
Merged build triggered. |
|
Merged build started. |
|
Merged build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14738/ |
|
@kanzhang hey you'll need to silence some of the binary compatibility checks in project/MimaBuild.scala: |
|
@pwendell thanks for the heads-up. made those changes, let's see how it goes. |
|
Merged build triggered. |
|
Merged build started. |
|
Merged build finished. |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14746/ |
|
Jenkins, retest this please. |
|
@pwendell the build didn't seem to start? |
|
Jenkins, retest this please. |
|
Merged build triggered. |
|
Merged build started. |
|
Merged build finished. All automated tests passed. |
|
All automated tests passed. |
|
Thanks for updating this. I'm merging it. |
…ons... ... that do not change schema Author: Kan Zhang <[email protected]> Closes #448 from kanzhang/SPARK-1460 and squashes the following commits: 111e388 [Kan Zhang] silence MiMa errors in EdgeRDD and VertexRDD 91dc787 [Kan Zhang] Taking into account newly added Ordering param 79ed52a [Kan Zhang] [SPARK-1460] Returning SchemaRDD on Set operations that do not change schema (cherry picked from commit 967635a) Signed-off-by: Patrick Wendell <[email protected]>
…ons... ... that do not change schema Author: Kan Zhang <[email protected]> Closes apache#448 from kanzhang/SPARK-1460 and squashes the following commits: 111e388 [Kan Zhang] silence MiMa errors in EdgeRDD and VertexRDD 91dc787 [Kan Zhang] Taking into account newly added Ordering param 79ed52a [Kan Zhang] [SPARK-1460] Returning SchemaRDD on Set operations that do not change schema
... that do not change schema