@dbtsai
Member

@dbtsai dbtsai commented Jun 3, 2014

This PR moves the private ColumnStatisticsAggregator class out of RowMatrix and exposes it as a publicly available DeveloperApi, with documentation and unit tests.

Changes:

  1. Moved the private implementation from org.apache.spark.mllib.linalg.ColumnStatisticsAggregator to org.apache.spark.mllib.stat.MultivariateOnlineSummarizer.
  2. When creating an OnlineSummarizer object, the number of columns is no longer needed in the constructor. It is determined when the user adds the first sample.
  3. Added API documentation for MultivariateOnlineSummarizer.
  4. Added unit tests for MultivariateOnlineSummarizer.
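
For readers unfamiliar with online summarization, the following is a minimal, dependency-free sketch of the idea described above: the column count is fixed by the first sample, and mean, variance, min, and max are updated one sample at a time. It uses Welford's update and plain arrays. The class name OnlineSummarizerSketch and its members are hypothetical and are not the code from this PR.

```scala
// Illustrative sketch only; NOT the Spark MLlib implementation.
// Assumption: dense samples as Array[Double]; all names are hypothetical.
final class OnlineSummarizerSketch {
  private var n = 0                          // number of columns, unknown until the first sample
  private var cnt = 0L                       // number of samples added so far
  private var mu = Array.empty[Double]       // running mean per column
  private var m2 = Array.empty[Double]       // running sum of squared deviations per column
  private var lo = Array.empty[Double]       // running minimum per column
  private var hi = Array.empty[Double]       // running maximum per column

  def add(sample: Array[Double]): this.type = {
    if (cnt == 0L) {
      // The first sample determines the dimension, so the constructor needs no size.
      n = sample.length
      mu = Array.fill(n)(0.0)
      m2 = Array.fill(n)(0.0)
      lo = Array.fill(n)(Double.PositiveInfinity)
      hi = Array.fill(n)(Double.NegativeInfinity)
    }
    require(sample.length == n, s"expected $n columns but got ${sample.length}")
    cnt += 1
    var i = 0
    while (i < n) {
      val x = sample(i)
      val delta = x - mu(i)
      mu(i) += delta / cnt                   // Welford's mean update
      m2(i) += delta * (x - mu(i))           // uses the already-updated mean
      if (x < lo(i)) lo(i) = x
      if (x > hi(i)) hi(i) = x
      i += 1
    }
    this
  }

  def count: Long = cnt
  def mean: Array[Double] = mu.clone()
  // Unbiased sample variance; zero when fewer than two samples were added.
  def variance: Array[Double] = m2.map(v => if (cnt > 1) v / (cnt - 1) else 0.0)
  def min: Array[Double] = lo.clone()
  def max: Array[Double] = hi.clone()
}
```

Because add returns the summarizer itself, calls can be chained, for example new OnlineSummarizerSketch().add(Array(1.0, 10.0)).add(Array(2.0, 20.0)).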

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@mateiz
Contributor

mateiz commented Jun 3, 2014

MultivariateStatisticalSummary is a public API -- we can't rename it arbitrarily. Why does it need to be renamed?

@dbtsai
Member Author

dbtsai commented Jun 3, 2014

Since "Statistical" in MultivariateStatisticalSummary is already implied by the package name "stat", I think it is worth having a more concise name. Also, most people abbreviate statistics as "stats", so I changed the package name from "stat" to "stats".

Since it's already a public API, I have no problem changing it back.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15407/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15405/

@dbtsai
Member Author

dbtsai commented Jun 3, 2014

I don't know why Jenkins is unhappy with removing "private class ColumnStatisticsAggregator(private val n: Int)". After all, it's a private class.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15408/

@mengxr
Contributor

mengxr commented Jun 4, 2014

This may be a MiMa (Migration Manager) problem. I found this (from @pwendell):

https://groups.google.com/forum/#!topic/migration-manager-user/5aQ0xxsL2lU

@dbtsai
Member Author

dbtsai commented Jun 4, 2014

@mengxr Got it. It's a false-positive error. Do you have any comments or feedback on moving it out as a public API? I'm building a feature scaling API in MLUtils that depends on this. Thanks.

@mengxr
Contributor

mengxr commented Jun 6, 2014

@dbtsai The current workaround is to exclude it in project/MimaExcludes.scala. Please check the examples there. At the very least, we need to make Jenkins happy.
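
As a rough illustration of the workaround mentioned above, a MiMa exclusion for a removed class could look like the fragment below. This is a sketch under assumptions: the object name SummarizerMimaExcludesSketch is hypothetical, the actual layout of project/MimaExcludes.scala is not shown in this thread, and the class name is copied from the PR description. It only compiles inside an sbt build that has the MiMa plugin on its classpath.

```scala
// Hypothetical build-definition fragment; the real project/MimaExcludes.scala may be organized differently.
import com.typesafe.tools.mima.core._

object SummarizerMimaExcludesSketch {
  // The removed class was private, so reporting it as missing is a false positive.
  val excludes = Seq(
    ProblemFilters.exclude[MissingClassProblem](
      "org.apache.spark.mllib.linalg.ColumnStatisticsAggregator")
  )
}
```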

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@dbtsai
Member Author

dbtsai commented Jun 6, 2014

OK. It would be better to have MiMa exclude private classes automatically, or we could have an annotation for private classes.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15492/

@AmplabJenkins

Build triggered.

@AmplabJenkins

Build started.

@AmplabJenkins

Build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15504/

@mengxr
Contributor

mengxr commented Jul 10, 2014

@dbtsai About the package name: stat, rather than stats, is the standard abbreviation for statistics. Check out the URLs returned by Google:

https://www.google.com/#q=statistics+department

Contributor

merge or aggregate may be better than overloading add here.
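
To make the naming suggestion concrete, here is a minimal sketch in which add folds in a single sample while a separately named merge combines two partial summaries, for example from two partitions. It uses the standard pairwise update for count, mean, and sum of squared deviations. The class MomentsSketch and its members are hypothetical and are not the API from this PR.

```scala
// Illustrative sketch only; names are hypothetical and this is not the code under review.
final class MomentsSketch(val count: Long, val mean: Array[Double], val m2: Array[Double]) {
  // Empty summary: no samples seen and no dimension fixed yet.
  def this() = this(0L, Array.empty[Double], Array.empty[Double])

  // Folds ONE sample into the summary (Welford's update) and returns a new summary.
  def add(sample: Array[Double]): MomentsSketch = {
    require(count == 0L || sample.length == mean.length, "dimension mismatch")
    val d = sample.length
    val oldMean = if (count == 0L) Array.fill(d)(0.0) else mean
    val oldM2 = if (count == 0L) Array.fill(d)(0.0) else m2
    val n = count + 1
    val newMean = Array.tabulate(d)(i => oldMean(i) + (sample(i) - oldMean(i)) / n)
    val newM2 = Array.tabulate(d)(i => oldM2(i) + (sample(i) - oldMean(i)) * (sample(i) - newMean(i)))
    new MomentsSketch(n, newMean, newM2)
  }

  // Combines TWO partial summaries; a distinct name avoids overloading add.
  def merge(other: MomentsSketch): MomentsSketch =
    if (count == 0L) other
    else if (other.count == 0L) this
    else {
      require(mean.length == other.mean.length, "dimension mismatch")
      val d = mean.length
      val n = count + other.count
      val delta = Array.tabulate(d)(i => other.mean(i) - mean(i))
      val newMean = Array.tabulate(d)(i => mean(i) + delta(i) * other.count / n)
      val newM2 = Array.tabulate(d)(i =>
        m2(i) + other.m2(i) + delta(i) * delta(i) * count * other.count / n)
      new MomentsSketch(n, newMean, newM2)
    }

  // Unbiased sample variance; zero when fewer than two samples were added.
  def variance: Array[Double] = m2.map(v => if (count > 1) v / (count - 1) else 0.0)
}
```

With this split, a per-partition fold would call add for each row, and a cross-partition combine would call merge on the resulting summaries.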

Contributor

streaming has a special meaning in Spark. Change it to online?

@SparkQA

SparkQA commented Jul 11, 2014

QA tests have started for PR 955. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16558/consoleFull

@SparkQA

SparkQA commented Jul 11, 2014

QA tests have started for PR 955. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16560/consoleFull

@SparkQA

SparkQA commented Jul 11, 2014

QA tests have started for PR 955. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16561/consoleFull

@dbtsai dbtsai changed the title from "[SPARK-1969][MLlib] Public available online summarizer for mean, variance, min, and max" to "[SPARK-1969][MLlib] Online summarizer APIs for mean, variance, min, and max" Jul 11, 2014
@SparkQA

SparkQA commented Jul 11, 2014

QA results for PR 955:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class MultivariateOnlineSummarizer extends MultivariateStatisticalSummary with Serializable {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16558/consoleFull

@SparkQA

SparkQA commented Jul 11, 2014

QA results for PR 955:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class MultivariateOnlineSummarizer extends MultivariateStatisticalSummary with Serializable {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16560/consoleFull

@SparkQA

SparkQA commented Jul 11, 2014

QA results for PR 955:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class MultivariateOnlineSummarizer extends MultivariateStatisticalSummary with Serializable {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16561/consoleFull

@SparkQA

SparkQA commented Jul 11, 2014

QA tests have started for PR 955. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16576/consoleFull

@SparkQA

SparkQA commented Jul 11, 2014

QA results for PR 955:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class MultivariateOnlineSummarizer extends MultivariateStatisticalSummary with Serializable {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16576/consoleFull

@mengxr
Contributor

mengxr commented Jul 12, 2014

Merged. Thanks!

@asfgit asfgit closed this in 5596086 Jul 12, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
…nd max

This PR moves the private ColumnStatisticsAggregator class out of RowMatrix and exposes it as a publicly available DeveloperApi, with documentation and unit tests.

Changes:
1) Moved the private implementation from org.apache.spark.mllib.linalg.ColumnStatisticsAggregator to org.apache.spark.mllib.stat.MultivariateOnlineSummarizer.
2) When creating an OnlineSummarizer object, the number of columns is no longer needed in the constructor. It is determined when the user adds the first sample.
3) Added API documentation for MultivariateOnlineSummarizer.
4) Added unit tests for MultivariateOnlineSummarizer.

Author: DB Tsai <[email protected]>

Closes apache#955 from dbtsai/dbtsai-summarizer and squashes the following commits:

b13ac90 [DB Tsai] dbtsai-summarizer
agirish pushed a commit to HPEEzmeral/apache-spark that referenced this pull request May 5, 2022
udaynpusa pushed a commit to mapr/spark that referenced this pull request Jan 30, 2024
mapr-devops pushed a commit to mapr/spark that referenced this pull request May 8, 2025