
Conversation

@liancheng
Contributor

This PR integrates Count-Min Sketch from spark-sketch into DataFrame. This version relies on `RDD.aggregate` for building the sketch; a more performant UDAF-based version can be built in follow-up PRs.
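The `RDD.aggregate` pattern described above can be sketched in plain Java, with a toy Count-Min Sketch standing in for the real spark-sketch implementation and sub-lists standing in for RDD partitions. All names below are illustrative, not the PR's actual code:

```java
import java.util.ArrayList;
import java.util.List;

// Toy Count-Min Sketch for illustration only; the real implementation lives
// in the spark-sketch module (CountMinSketch / CountMinSketchImpl).
class ToyCMS {
    final int depth, width;
    final long[][] table;

    ToyCMS(int depth, int width) {
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
    }

    // One hash per row: mix the item's hash with the row index.
    private int bucket(Object item, int row) {
        int h = item.hashCode() * 31 + row * 0x9E3779B9;
        h ^= (h >>> 16);
        return Math.floorMod(h, width);
    }

    void add(Object item) {
        for (int r = 0; r < depth; r++) table[r][bucket(item, r)]++;
    }

    // A CMS never undercounts: take the minimum over rows to limit the
    // effect of hash collisions.
    long estimate(Object item) {
        long min = Long.MAX_VALUE;
        for (int r = 0; r < depth; r++) min = Math.min(min, table[r][bucket(item, r)]);
        return min;
    }

    // Cell-wise sum of two sketches with identical dimensions.
    ToyCMS mergeInPlace(ToyCMS other) {
        for (int r = 0; r < depth; r++)
            for (int c = 0; c < width; c++) table[r][c] += other.table[r][c];
        return this;
    }
}

public class CmsAggregateDemo {
    // Mimic RDD.aggregate: fold each partition into a local sketch (the
    // seqOp), then merge the per-partition sketches (the combOp).
    static ToyCMS buildSketch() {
        List<String> data = new ArrayList<>();
        for (int i = 0; i < 100; i++) data.add("a");
        for (int i = 0; i < 10; i++) data.add("b");

        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < data.size(); i += 25)
            partitions.add(data.subList(i, Math.min(i + 25, data.size())));

        ToyCMS merged = new ToyCMS(5, 256); // the "zero value"
        for (List<String> part : partitions) {
            ToyCMS local = new ToyCMS(5, 256);
            for (String v : part) local.add(v); // seqOp
            merged.mergeInPlace(local);         // combOp
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println("estimate(a) = " + buildSketch().estimate("a"));
    }
}
```

Because merging is just a cell-wise sum, the per-partition fold and the final merge fit `RDD.aggregate`'s `seqOp`/`combOp` contract directly, which is why the PR can get away without a UDAF for now.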

@liancheng
Contributor Author

cc @cloud-fan @rxin @yhuai

Contributor Author

Weird, I didn't make these empty-comment-line changes. Reverting them.

Contributor

Why is this public?

@SparkQA

SparkQA commented Jan 26, 2016

Test build #50055 has finished for PR 10911 at commit 4e5d1af.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class CountMinSketchImpl extends CountMinSketch implements Externalizable

Contributor

It'd be good to refactor this so we don't need to assign the variables. One way is to move the serialization/deserialization code out of `readFrom` into a separate function.
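The suggested refactoring might look roughly like this. It is a self-contained sketch with invented names (`SimpleSketch`, `readFields`), not the PR's actual `CountMinSketchImpl` code: all stream decoding moves into one helper, so `readFrom` no longer assigns fields one by one.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class SimpleSketch {
    final int depth;
    final int width;
    final long[] counts;

    SimpleSketch(int depth, int width, long[] counts) {
        this.depth = depth;
        this.width = width;
        this.counts = counts;
    }

    // All the stream decoding lives in one helper that constructs the
    // object directly from what it reads...
    private static SimpleSketch readFields(DataInputStream in) throws IOException {
        int depth = in.readInt();
        int width = in.readInt();
        long[] counts = new long[depth * width];
        for (int i = 0; i < counts.length; i++) counts[i] = in.readLong();
        return new SimpleSketch(depth, width, counts);
    }

    // ...so readFrom just delegates, with no field-by-field assignment.
    public static SimpleSketch readFrom(InputStream in) throws IOException {
        return readFields(new DataInputStream(in));
    }

    public byte[] writeTo() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(depth);
        out.writeInt(width);
        for (long c : counts) out.writeLong(c);
        out.flush();
        return bos.toByteArray();
    }
}
```

The payoff is that deserialization produces a fully constructed object in one place, instead of a partially initialized one that mutable assignments finish later.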

@SparkQA

SparkQA commented Jan 26, 2016

Test build #50061 has finished for PR 10911 at commit 32a9860.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jan 26, 2016

cc @JoshRosen, are the Python tests broken?

Running PySpark tests. Output is in /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log
Error: unrecognized module 'root'. Supported modules: pyspark-mllib, pyspark-core, pyspark-ml, pyspark-sql, pyspark-streaming
[error] running /home/jenkins/workspace/SparkPullRequestBuilder/python/run-tests --modules=pyspark-mllib,pyspark-ml,pyspark-sql,root --parallelism=4 ; received return code 255

Contributor

How about `colType == StringType || colType.isInstanceOf[IntegralType]`?

Contributor Author

Good point.

Contributor

Actually, after thinking about it, let's avoid doing that and list the explicit types instead. It's plausible that in the future we'll introduce an int96 or int128 data type, and I bet we won't remember that this is one place we'd need to update.
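The trade-off being discussed can be contrasted with a self-contained stand-in for the SQL type hierarchy. These classes only mimic `org.apache.spark.sql.types`; they are not the real Spark classes, and the two check functions are illustrative names:

```java
// Stand-ins for the Spark SQL type hierarchy (illustration only).
class DataType {}
class StringType extends DataType {}
class IntegralType extends DataType {}
class ByteType extends IntegralType {}
class ShortType extends IntegralType {}
class IntegerType extends IntegralType {}
class LongType extends IntegralType {}
class DoubleType extends DataType {}

public class TypeCheckDemo {
    // Open-ended check: silently accepts ANY future IntegralType subtype,
    // including ones this code was never reviewed against.
    static boolean supportedOpen(DataType t) {
        return t instanceof StringType || t instanceof IntegralType;
    }

    // Explicit allow-list, as suggested above: a hypothetical future
    // Int128Type stays unsupported until someone consciously adds it here.
    static boolean supportedExplicit(DataType t) {
        return t instanceof StringType
            || t instanceof ByteType
            || t instanceof ShortType
            || t instanceof IntegerType
            || t instanceof LongType;
    }
}
```

Both checks agree on today's types; they only diverge for types added later, which is exactly the failure mode the explicit list guards against.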

Contributor

This comment has been moved to CountMinSketch.Version as @rxin suggested in #10920 (comment)

Contributor Author

Thanks.

@SparkQA

SparkQA commented Jan 26, 2016

Test build #50117 has finished for PR 10911 at commit fb23a24.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

Josh is looking into the PySpark test failure.

Contributor

Use `scala.binary.version`?

Contributor Author

Actually, this is always hard-coded as `_2.10` to make publishing easier.

Contributor Author

@rxin told me this. I'm not quite sure about the details though :)

Contributor

This name is quite weird...

Contributor

This is actually a common naming convention in Java: the private implementation of a method `xxx` is named `xxx0`.
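A minimal illustration of the convention (invented names, not the code under review): the public method validates its arguments, while the private `xxx0` holds the actual implementation.

```java
public class Counter {
    private long total = 0L;

    // Public entry point: performs argument checking, then delegates.
    public void add(Object item, long count) {
        if (count < 0) {
            throw new IllegalArgumentException("Negative count: " + count);
        }
        add0(item, count);
    }

    // Private workhorse, named add0 per the convention; internal callers
    // that have already validated their arguments can invoke it directly
    // and skip the re-check.
    private void add0(Object item, long count) {
        total += count;
    }

    public long sum() {
        return total;
    }
}
```

The split keeps validation in exactly one place while letting trusted internal call sites avoid paying for it twice.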

Contributor Author

I just realized that this is now in a Javadoc block, so it should be reformatted using HTML tags. The same applies to the Bloom filter format description.

@SparkQA

SparkQA commented Jan 27, 2016

Test build #50126 has finished for PR 10911 at commit 4a40802.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 27, 2016

Test build #50146 has finished for PR 10911 at commit 3ff902a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jan 27, 2016

I'm going to merge this. Thanks.

@asfgit asfgit closed this in ce38a35 Jan 27, 2016
@liancheng liancheng deleted the cms-df-api branch January 27, 2016 18:40
asfgit pushed a commit that referenced this pull request Jan 28, 2016
…n Sketch

This PR is a follow-up of #10911. It adds specialized update methods for `CountMinSketch` so that we can avoid doing internal/external row format conversion in `DataFrame.countMinSketch()`.

Author: Cheng Lian <[email protected]>

Closes #10968 from liancheng/cms-specialized.