[SPARK-12935][SQL] DataFrame API for Count-Min Sketch #10911
Weird, I didn't make these empty comment line changes. Reverting them.
why is this public?
Test build #50055 has finished for PR 10911 at commit
It'd be good to refactor this so we don't need to assign the variables. One way is to take the serialization/deserialization code out of `readFrom` into a function.
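One possible reading of that suggestion, as a minimal sketch: the deserialization logic sits in a static function that reads everything into locals and passes them to the constructor, so no field is assigned after construction and all fields can be `final`. The class `TinySketch`, its byte layout, and the `toBytes`/`fromBytes` helpers are hypothetical and are not the PR's code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a sketch class; not the PR's actual implementation.
final class TinySketch {
    private final int depth;
    private final int width;
    private final long[][] table;

    TinySketch(int depth, int width, long[][] table) {
        this.depth = depth;
        this.width = width;
        this.table = table;
    }

    int depth() { return depth; }
    int width() { return width; }
    long cell(int row, int col) { return table[row][col]; }

    // Serialization: dimensions first, then the counters row by row.
    void writeTo(OutputStream out) throws IOException {
        DataOutputStream dos = new DataOutputStream(out);
        dos.writeInt(depth);
        dos.writeInt(width);
        for (long[] row : table) {
            for (long value : row) {
                dos.writeLong(value);
            }
        }
        dos.flush();
    }

    // Deserialization reads into locals and calls the constructor once,
    // so the instance is never mutated field by field.
    static TinySketch readFrom(InputStream in) throws IOException {
        DataInputStream dis = new DataInputStream(in);
        int depth = dis.readInt();
        int width = dis.readInt();
        long[][] table = new long[depth][width];
        for (int i = 0; i < depth; i++) {
            for (int j = 0; j < width; j++) {
                table[i][j] = dis.readLong();
            }
        }
        return new TinySketch(depth, width, table);
    }

    // Convenience wrappers for in-memory round trips (illustration only).
    byte[] toBytes() {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            writeTo(out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static TinySketch fromBytes(byte[] bytes) {
        try {
            return readFrom(new ByteArrayInputStream(bytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A round trip such as `TinySketch.fromBytes(sketch.toBytes())` should then yield a sketch with the same dimensions and counters.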
Test build #50061 has finished for PR 10911 at commit
cc @JoshRosen: are the Python tests broken?
How about `colType == StringType || colType.isInstanceOf[IntegralType]`?
Good point
Actually after thinking about it - let's avoid doing that and list the explicit types. It is plausible in the future we introduce an int96 or int128 data type, and I bet we won't remember this is one place we need to update it.
This comment has been moved to CountMinSketch.Version as @rxin suggested in #10920 (comment)
Thanks.
Test build #50117 has finished for PR 10911 at commit
Josh is looking into the PySpark test failure.
use scala.binary.version?
Actually this is always hard coded as _2.10 to make publishing easier.
@rxin told me this. I'm not quite sure about the details though :)
this name is quite weird...
This is actually a common naming style in Java: the private version of a method is named xxx0.
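For readers unfamiliar with the convention, a minimal sketch of what it tends to look like: a public method validates its arguments and delegates to a private worker with the same name plus a trailing `0`. The class `HitCounter` and its methods are hypothetical and only illustrate the naming pattern.

```java
// Hypothetical example of the "xxx0" naming style; not taken from the PR.
final class HitCounter {
    private long total;

    // Public entry point: checks the argument, then delegates.
    public void add(long delta) {
        if (delta < 0) {
            throw new IllegalArgumentException("delta must be non-negative");
        }
        add0(delta);
    }

    // Private worker, named after the public method with a trailing 0.
    private void add0(long delta) {
        total += delta;
    }

    public long total() {
        return total;
    }
}
```

The JDK follows the same pattern in places, for example the private native `start0` behind `Thread.start()`.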
I just realized that this is now in a Javadoc block. Should reformat this using HTML tags. Same thing applies to the bloom filter format description.
Test build #50126 has finished for PR 10911 at commit
Test build #50146 has finished for PR 10911 at commit
I'm going to merge this. Thanks.
…n Sketch
This PR is a follow-up of #10911. It adds specialized update methods for `CountMinSketch` so that we can avoid doing internal/external row format conversion in `DataFrame.countMinSketch()`.
Author: Cheng Lian <[email protected]>
Closes #10968 from liancheng/cms-specialized.
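To illustrate why type-specific update methods can help, here is a minimal sketch under stated assumptions. The class `TypedSketch`, its hashing, and its dimensions are hypothetical simplifications, not the spark-sketch implementation. The generic `add(Object)` has to inspect the runtime type on every call (and primitives arrive boxed), while `addLong` and `addString` take the value in the form the caller already holds.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical, simplified Count-Min Sketch; hashing is illustrative only.
final class TypedSketch {
    private final int depth;
    private final int width;
    private final long[][] table;

    TypedSketch(int depth, int width) {
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
    }

    // Row-specific hash: mix the item hash with the row index, reduce to a column.
    private int column(long hash, int row) {
        long h = (hash + 1) * 0x9E3779B97F4A7C15L + row * 0xC2B2AE3D27D4EB4FL;
        h ^= (h >>> 32);
        return (int) Math.floorMod(h, (long) width);
    }

    private static long hashString(String item) {
        long h = 1125899906842597L;
        for (byte b : item.getBytes(StandardCharsets.UTF_8)) {
            h = 31 * h + b;
        }
        return h;
    }

    // Generic dispatch: one runtime type check per call.
    private static long hashOf(Object item) {
        if (item instanceof String) return hashString((String) item);
        if (item instanceof Long) return (Long) item;
        if (item instanceof Integer) return ((Integer) item).longValue();
        throw new IllegalArgumentException("Unsupported type: " + item.getClass());
    }

    private void addHash(long hash) {
        for (int r = 0; r < depth; r++) {
            table[r][column(hash, r)]++;
        }
    }

    // Specialized paths: no boxing and no type inspection.
    void addLong(long item) { addHash(item); }
    void addString(String item) { addHash(hashString(item)); }

    // Generic path: accepts any supported type, at the cost of the dispatch above.
    void add(Object item) { addHash(hashOf(item)); }

    // Minimum over rows; never below the true count of the item.
    long estimateCount(Object item) {
        long hash = hashOf(item);
        long min = Long.MAX_VALUE;
        for (int r = 0; r < depth; r++) {
            min = Math.min(min, table[r][column(hash, r)]);
        }
        return min;
    }
}
```

Both paths update the same counters, so `addLong(7L)` and `add(7L)` are interchangeable from the point of view of `estimateCount`.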
This PR integrates Count-Min Sketch from spark-sketch into DataFrame. This version resorts to `RDD.aggregate` for building the sketch. A more performant UDAF version can be built in future follow-up PRs.
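The aggregate-style construction described above can be sketched without Spark: each partition is folded into its own sketch starting from an empty one (the sequence step), and the partial sketches are then summed cell by cell (the combine step). Everything below is a hypothetical, simplified illustration; `MiniCountMinSketch`, its hash function, and the list-of-lists stand-in for partitions are assumptions, not the PR's code.

```java
import java.util.List;

// Hypothetical, simplified Count-Min Sketch over long items.
final class MiniCountMinSketch {
    private final int depth;
    private final int width;
    private final long[][] table;

    MiniCountMinSketch(int depth, int width) {
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
    }

    // Row-specific hash: mix the item with the row index, reduce to a column.
    private int column(long item, int row) {
        long h = (item + 1) * 0x9E3779B97F4A7C15L + row * 0xC2B2AE3D27D4EB4FL;
        h ^= (h >>> 32);
        return (int) Math.floorMod(h, (long) width);
    }

    void add(long item) {
        for (int r = 0; r < depth; r++) {
            table[r][column(item, r)]++;
        }
    }

    // Minimum over rows; never below the true count of the item.
    long estimateCount(long item) {
        long min = Long.MAX_VALUE;
        for (int r = 0; r < depth; r++) {
            min = Math.min(min, table[r][column(item, r)]);
        }
        return min;
    }

    // Cell-wise sum; only valid for sketches with equal dimensions and hashes.
    MiniCountMinSketch mergeInPlace(MiniCountMinSketch other) {
        if (other.depth != depth || other.width != width) {
            throw new IllegalArgumentException("Incompatible sketch dimensions");
        }
        for (int r = 0; r < depth; r++) {
            for (int c = 0; c < width; c++) {
                table[r][c] += other.table[r][c];
            }
        }
        return this;
    }

    // Analogue of aggregate(zero)(seqOp, combOp) over in-memory "partitions".
    static MiniCountMinSketch aggregate(List<List<Long>> partitions, int depth, int width) {
        MiniCountMinSketch result = new MiniCountMinSketch(depth, width);
        for (List<Long> partition : partitions) {
            MiniCountMinSketch partial = new MiniCountMinSketch(depth, width); // zero value
            for (long item : partition) {
                partial.add(item);                                             // seqOp
            }
            result.mergeInPlace(partial);                                      // combOp
        }
        return result;
    }
}
```

Because merging is a plain cell-wise sum, the aggregated sketch holds the same counters as one built by adding every item to a single sketch, regardless of how the items are split across partitions.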