
Conversation

@liancheng
Contributor

This PR integrates Count-Min Sketch from spark-sketch into DataFrame. This version relies on `RDD.aggregate` for building the sketch; a more performant UDAF-based version can be built in follow-up PRs.
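The `RDD.aggregate` pattern described above can be sketched in plain Java, with a toy Count-Min Sketch standing in for the real spark-sketch implementation and sub-lists standing in for RDD partitions. All names below are illustrative, not the PR's actual code:

```java
import java.util.ArrayList;
import java.util.List;

// Toy Count-Min Sketch for illustration only; the real implementation lives
// in the spark-sketch module (CountMinSketch / CountMinSketchImpl).
class ToyCMS {
    final int depth, width;
    final long[][] table;

    ToyCMS(int depth, int width) {
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
    }

    // One hash per row: mix the item's hash with the row index.
    private int bucket(Object item, int row) {
        int h = item.hashCode() * 31 + row * 0x9E3779B9;
        h ^= (h >>> 16);
        return Math.floorMod(h, width);
    }

    void add(Object item) {
        for (int r = 0; r < depth; r++) table[r][bucket(item, r)]++;
    }

    // A CMS never undercounts: take the minimum over rows to limit the
    // effect of hash collisions.
    long estimate(Object item) {
        long min = Long.MAX_VALUE;
        for (int r = 0; r < depth; r++) min = Math.min(min, table[r][bucket(item, r)]);
        return min;
    }

    // Cell-wise sum of two sketches with identical dimensions.
    ToyCMS mergeInPlace(ToyCMS other) {
        for (int r = 0; r < depth; r++)
            for (int c = 0; c < width; c++) table[r][c] += other.table[r][c];
        return this;
    }
}

public class CmsAggregateDemo {
    // Mimic RDD.aggregate: fold each partition into a local sketch (the
    // seqOp), then merge the per-partition sketches (the combOp).
    static ToyCMS buildSketch() {
        List<String> data = new ArrayList<>();
        for (int i = 0; i < 100; i++) data.add("a");
        for (int i = 0; i < 10; i++) data.add("b");

        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < data.size(); i += 25)
            partitions.add(data.subList(i, Math.min(i + 25, data.size())));

        ToyCMS merged = new ToyCMS(5, 256); // the "zero value"
        for (List<String> part : partitions) {
            ToyCMS local = new ToyCMS(5, 256);
            for (String v : part) local.add(v); // seqOp
            merged.mergeInPlace(local);         // combOp
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println("estimate(a) = " + buildSketch().estimate("a"));
    }
}
```

Because merging is just a cell-wise sum, the per-partition fold and the final merge fit `RDD.aggregate`'s `seqOp`/`combOp` contract directly, which is why the PR can get away without a UDAF for now.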

@liancheng
Contributor Author

cc @cloud-fan @rxin @yhuai

Contributor Author

Weird, I didn't make these empty-comment-line changes. Reverting them.

Contributor

Why is this public?

@SparkQA

SparkQA commented Jan 26, 2016

Test build #50055 has finished for PR 10911 at commit 4e5d1af.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class CountMinSketchImpl extends CountMinSketch implements Externalizable

Contributor

It'd be good to refactor this so we don't need to assign the variables. One way is to move the serialization/deserialization code out of `readFrom` into a separate function.
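The suggested refactoring might look roughly like this. It is a self-contained sketch with invented names (`SimpleSketch`, `readFields`), not the PR's actual `CountMinSketchImpl` code: all stream decoding moves into one helper, so `readFrom` no longer assigns fields one by one.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class SimpleSketch {
    final int depth;
    final int width;
    final long[] counts;

    SimpleSketch(int depth, int width, long[] counts) {
        this.depth = depth;
        this.width = width;
        this.counts = counts;
    }

    // All the stream decoding lives in one helper that constructs the
    // object directly from what it reads...
    private static SimpleSketch readFields(DataInputStream in) throws IOException {
        int depth = in.readInt();
        int width = in.readInt();
        long[] counts = new long[depth * width];
        for (int i = 0; i < counts.length; i++) counts[i] = in.readLong();
        return new SimpleSketch(depth, width, counts);
    }

    // ...so readFrom just delegates, with no field-by-field assignment.
    public static SimpleSketch readFrom(InputStream in) throws IOException {
        return readFields(new DataInputStream(in));
    }

    public byte[] writeTo() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(depth);
        out.writeInt(width);
        for (long c : counts) out.writeLong(c);
        out.flush();
        return bos.toByteArray();
    }
}
```

The payoff is that deserialization produces a fully constructed object in one place, instead of a partially initialized one that mutable assignments finish later.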

@SparkQA

SparkQA commented Jan 26, 2016

Test build #50061 has finished for PR 10911 at commit 32a9860.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jan 26, 2016

cc @JoshRosen, are the Python tests broken?

Running PySpark tests. Output is in /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log
Error: unrecognized module 'root'. Supported modules: pyspark-mllib, pyspark-core, pyspark-ml, pyspark-sql, pyspark-streaming
[error] running /home/jenkins/workspace/SparkPullRequestBuilder/python/run-tests --modules=pyspark-mllib,pyspark-ml,pyspark-sql,root --parallelism=4 ; received return code 255

Contributor

How about `colType == StringType || colType.isInstanceOf[IntegralType]`?

Contributor Author

Good point.

Contributor

Actually, after thinking about it, let's avoid doing that and list the explicit types instead. It's plausible that in the future we'll introduce an int96 or int128 data type, and I bet we won't remember that this is one place we'd need to update.
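The trade-off being discussed can be contrasted with a self-contained stand-in for the SQL type hierarchy. These classes only mimic `org.apache.spark.sql.types`; they are not the real Spark classes, and the two check functions are illustrative names:

```java
// Stand-ins for the Spark SQL type hierarchy (illustration only).
class DataType {}
class StringType extends DataType {}
class IntegralType extends DataType {}
class ByteType extends IntegralType {}
class ShortType extends IntegralType {}
class IntegerType extends IntegralType {}
class LongType extends IntegralType {}
class DoubleType extends DataType {}

public class TypeCheckDemo {
    // Open-ended check: silently accepts ANY future IntegralType subtype,
    // including ones this code was never reviewed against.
    static boolean supportedOpen(DataType t) {
        return t instanceof StringType || t instanceof IntegralType;
    }

    // Explicit allow-list, as suggested above: a hypothetical future
    // Int128Type stays unsupported until someone consciously adds it here.
    static boolean supportedExplicit(DataType t) {
        return t instanceof StringType
            || t instanceof ByteType
            || t instanceof ShortType
            || t instanceof IntegerType
            || t instanceof LongType;
    }
}
```

Both checks agree on today's types; they only diverge for types added later, which is exactly the failure mode the explicit list guards against.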

Contributor

This comment has been moved to CountMinSketch.Version as @rxin suggested in #10920 (comment)

Contributor Author

Thanks.

@SparkQA

SparkQA commented Jan 26, 2016

Test build #50117 has finished for PR 10911 at commit fb23a24.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

Josh is looking into the PySpark test failure.

Contributor

Use `scala.binary.version`?

Contributor Author

Actually, this is always hard-coded as `_2.10` to make publishing easier.

Contributor Author

@rxin told me this. I'm not quite sure about the details though :)

Contributor

This name is quite weird...

Contributor

This is actually a common naming convention in Java: the private implementation of a method `xxx` is named `xxx0`.
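A minimal illustration of the convention (invented names, not the code under review): the public method validates its arguments, while the private `xxx0` holds the actual implementation.

```java
public class Counter {
    private long total = 0L;

    // Public entry point: performs argument checking, then delegates.
    public void add(Object item, long count) {
        if (count < 0) {
            throw new IllegalArgumentException("Negative count: " + count);
        }
        add0(item, count);
    }

    // Private workhorse, named add0 per the convention; internal callers
    // that have already validated their arguments can invoke it directly
    // and skip the re-check.
    private void add0(Object item, long count) {
        total += count;
    }

    public long sum() {
        return total;
    }
}
```

The split keeps validation in exactly one place while letting trusted internal call sites avoid paying for it twice.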

Contributor Author

I just realized that this is now in a Javadoc block, so it should be reformatted using HTML tags. The same applies to the Bloom filter format description.

@SparkQA

SparkQA commented Jan 27, 2016

Test build #50126 has finished for PR 10911 at commit 4a40802.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 27, 2016

Test build #50146 has finished for PR 10911 at commit 3ff902a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jan 27, 2016

I'm going to merge this. Thanks.

@asfgit asfgit closed this in ce38a35 Jan 27, 2016
@liancheng liancheng deleted the cms-df-api branch January 27, 2016 18:40
asfgit pushed a commit that referenced this pull request Jan 28, 2016
…n Sketch

This PR is a follow-up of #10911. It adds specialized update methods for `CountMinSketch` so that we can avoid doing internal/external row format conversion in `DataFrame.countMinSketch()`.

Author: Cheng Lian <[email protected]>

Closes #10968 from liancheng/cms-specialized.