[SPARK-7242][SQL][MLLIB] Frequent items for DataFrames #5799

brkyvz · 2015-04-30T05:33:13Z

Finds frequent items, possibly with false positives, using the algorithm described in http://www.cs.umd.edu/~samir/498/karp.pdf.
The public API is:

df.stat.freqItems(cols: Array[String], support: Double = 0.001): DataFrame

The output is a local DataFrame whose columns are the input column names with -freqItems appended. This is a single-pass algorithm that may return false positives, but no false negatives.
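For readers unfamiliar with the cited paper, here is a minimal sketch of the counter-based technique it describes, in plain Scala without Spark. It is not the code in this PR: the object name `FreqItemsSketch`, the choice of `ceil(1 / support)` counters, and the decrement-by-one step are assumptions made for illustration.

```scala
import scala.collection.mutable

// Hypothetical sketch, not the implementation in this PR.
object FreqItemsSketch {
  /**
   * Single pass over `items`. Returns a superset of the items that occur in at
   * least `support * N` of the N positions: false positives are possible,
   * false negatives are not.
   */
  def freqItems[T](items: Iterator[T], support: Double): Set[T] = {
    require(support > 0.0 && support <= 1.0, "support must be in (0, 1]")
    // Assumption: track at most ceil(1 / support) candidates at a time.
    val capacity = math.ceil(1.0 / support).toInt
    val counts = mutable.Map.empty[T, Long]
    items.foreach { item =>
      if (counts.contains(item)) {
        counts(item) += 1L
      } else if (counts.size < capacity) {
        counts(item) = 1L
      } else {
        // No free counter: drop the incoming item and decrement every tracked
        // counter, removing the ones that reach zero.
        counts.keys.toList.foreach { k =>
          val c = counts(k) - 1L
          if (c == 0L) counts.remove(k) else counts(k) = c
        }
      }
    }
    counts.keySet.toSet
  }
}
```

Calling `FreqItemsSketch.freqItems(Iterator("a", "b", "a", "c", "a", "d", "a"), 0.5)` is guaranteed to return a set containing `"a"`, which makes up more than half of the stream. Any other element in the result is a false positive that a second, exact counting pass could filter out.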

cc @mengxr @rxin

Let's get the implementation in; I can add the Python API in a follow-up PR.

implemented frequent items

rxin · 2015-04-30T05:36:24Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala

does this work in java?

I think from Java it looks like df.stat$.MODULE$.freqItems(). I don't know how else we can make it df.stat.freqItems in Scala.

take a look at how we implemented na.

aha! I like it
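The pattern this thread refers to can be sketched as follows: expose `stat` as a plain method that returns a helper instance, rather than as a nested `object`. The classes `MiniFrame` and `StatFunctions` and the method `rowCount` are hypothetical, simplified stand-ins and not Spark's actual classes.

```scala
// Hypothetical, simplified stand-ins; not Spark's actual classes.
object StatPatternSketch {
  // Helper class that groups the statistics functions for one frame.
  final class StatFunctions(df: MiniFrame) {
    def rowCount: Int = df.rows.size
  }

  final class MiniFrame(val rows: Seq[Seq[Any]]) {
    // A plain method returning a helper instance. Scala callers write
    // `df.stat.rowCount`; Java callers write `df.stat().rowCount()`,
    // with no `$.MODULE$` in sight.
    def stat: StatFunctions = new StatFunctions(this)
  }
}
```

Because `stat` is an ordinary method, it compiles to an ordinary JVM method, which is what makes the Java call site readable.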

rxin · 2015-04-30T05:37:45Z

sql/core/src/main/scala/org/apache/spark/sql/ml/FrequentItems.scala

let's put this in execution.stat?

It's annoying to add a top-level package because we have rules that specifically exclude existing packages.

rxin · 2015-04-30T05:41:46Z

I'm going to let @mengxr comment on the actual algorithm implementation.

mengxr · 2015-04-30T05:56:34Z

sql/core/src/main/scala/org/apache/spark/sql/ml/FrequentItems.scala

If multiple columns are provided, shall we search the combination of them instead of each individually? For example, if I call

freqItems(Array("gender", "title"), 0.01)

I'm expecting the frequent combinations instead of each of them. The current implementation is more flexible because users can create a struct from multiple columns, and it allows finding frequent items on multiple columns in parallel. But I'm a little worried about what users expect when they call freqItems(Array("gender", "title")) @rxin

rxin · 2015-04-30T06:42:12Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala

don't forget to add java.util.List ones

also make sure you add a test to the JavaDataFrameSuite
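One way to read this request is a `java.util.List` overload that converts its argument and delegates to the Scala-collection version. The sketch below shows only the overloading pattern. The object and the method `outputColumnNames` are hypothetical, and its body is a placeholder based on the naming rule in the PR description rather than the real computation.

```scala
import scala.collection.JavaConverters._

// Hypothetical sketch of a Java-friendly overload; not the PR's actual code.
object JavaFriendlyOverloads {
  // Scala-facing version. Placeholder body: returns the output column names.
  def outputColumnNames(cols: Seq[String]): Seq[String] =
    cols.map(_ + "-freqItems")

  // Java-facing overload: convert the Java list and delegate.
  def outputColumnNames(cols: java.util.List[String]): Seq[String] =
    outputColumnNames(cols.asScala.toSeq)
}
```

A Java test can then pass `Arrays.asList("a", "b")` directly, without building a Scala collection by hand.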

SparkQA · 2015-04-30T07:19:59Z

Test build #31386 has finished for PR 5799 at commit 8279d4d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

rxin · 2015-04-30T07:57:42Z

I think it's better to just have freqItems operate on a per-column basis, and then I can add a struct expression to DataFrame so users can easily create composite columns to run freqItems on.
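To make the two semantics concrete, here is a small plain-Scala sketch that contrasts per-column frequent values with frequent combinations obtained from a composite key. It uses exact counts instead of the approximate algorithm for brevity, and the sample rows, the `frequent` helper, and the thresholds are all made up for illustration.

```scala
// Hypothetical illustration; exact counting, not the PR's approximate algorithm.
object PerColumnVsCombined {
  // Made-up sample rows: (gender, title).
  val rows: Seq[(String, String)] = Seq(
    ("F", "Dr"), ("F", "Ms"), ("M", "Dr"), ("M", "Mr"), ("F", "Dr")
  )

  /** Values that occur in at least `support * values.size` positions. */
  def frequent[T](values: Seq[T], support: Double): Set[T] = {
    val threshold = support * values.size
    values.groupBy(identity).collect {
      case (v, occurrences) if occurrences.size >= threshold => v
    }.toSet
  }

  // Per-column semantics: each column is analysed on its own.
  val genders: Set[String] = frequent(rows.map(_._1), 0.5)
  val titles: Set[String] = frequent(rows.map(_._2), 0.5)

  // Combination semantics: treat the whole tuple as one composite value,
  // which is the role a struct column would play.
  val pairs: Set[(String, String)] = frequent(rows, 0.3)
}
```

The per-column results say nothing about which gender and title occur together; only the composite key answers that question, which is why a struct expression covers the combination use case.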

SparkQA · 2015-04-30T08:42:37Z

Test build #31392 has finished for PR 5799 at commit 482e741.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch adds the following new dependencies:
- jaxb-api-2.2.7.jar
- jaxb-core-2.2.7.jar
- jaxb-impl-2.2.7.jar
- pmml-agent-1.1.15.jar
- pmml-model-1.1.15.jar
- pmml-schema-1.1.15.jar
This patch removes the following dependencies:
- activation-1.1.jar
- jaxb-api-2.2.2.jar
- jaxb-impl-2.2.3-1.jar

SparkQA · 2015-04-30T10:27:48Z

Test build #31404 has finished for PR 5799 at commit 3a5c177.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch adds the following new dependencies:
- jaxb-api-2.2.7.jar
- jaxb-core-2.2.7.jar
- jaxb-impl-2.2.7.jar
- pmml-agent-1.1.15.jar
- pmml-model-1.1.15.jar
- pmml-schema-1.1.15.jar
This patch removes the following dependencies:
- activation-1.1.jar
- jaxb-api-2.2.2.jar
- jaxb-impl-2.2.3-1.jar

SparkQA · 2015-04-30T16:21:07Z

Test build #31423 has finished for PR 5799 at commit 0915e23.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

SparkQA · 2015-04-30T16:50:35Z

Test build #31426 has finished for PR 5799 at commit 39b1bba.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

mengxr · 2015-04-30T17:13:53Z

sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala

space before {

mengxr · 2015-04-30T17:16:41Z

LGTM except minor inline comments.

mengxr · 2015-04-30T17:17:00Z

sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala

organize imports

rxin · 2015-04-30T21:08:19Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala

I think we can just use Seq here, since Python has helper functions that can convert List into Seq.

SparkQA · 2015-04-30T22:45:10Z

Test build #31453 has finished for PR 5799 at commit a6ec82c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

brkyvz · 2015-04-30T23:39:14Z

@rxin, can you merge this in, please? I can follow up on the comments in the PR where I'll add Python support. This is blocking df.stat.cov. Thanks!

rxin · 2015-04-30T23:40:26Z

Alright - merging in master.

Finding frequent items with possibly false positives, using the algorithm described in `http://www.cs.umd.edu/~samir/498/karp.pdf`.
public API under:

df.stat.freqItems(cols: Array[String], support: Double = 0.001): DataFrame

The output is a local DataFrame having the input column names with `-freqItems` appended to it. This is a single pass algorithm that may return false positives, but no false negatives.

cc mengxr rxin

Let's get the implementations in, I can add python API in a follow up PR.

Author: Burak Yavuz

Closes apache#5799 from brkyvz/freq-items and squashes the following commits:

a6ec82c [Burak Yavuz] addressed comments v?
39b1bba [Burak Yavuz] removed toSeq
0915e23 [Burak Yavuz] addressed comments v2.1
3a5c177 [Burak Yavuz] addressed comments v2.0
482e741 [Burak Yavuz] removed old import
38e784d [Burak Yavuz] addressed comments v1.0
8279d4d [Burak Yavuz] added default value for support
3d82168 [Burak Yavuz] made base implementation

asfgit closed this in 149b3ee Apr 30, 2015
