@dbtsai
Member

@dbtsai dbtsai commented Jun 25, 2014

Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is generally performed during the data preprocessing step.

In this work, a trait called VectorTransformer is defined for generic transformations on a vector. It contains one method to be implemented, transform, which applies the transformation to a vector.

There are currently two implementations of VectorTransformer, and both can easily be extended with PMML transformation support.

  1. StandardScaler - Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.

  2. Normalizer - Normalizes each sample individually to unit L^n norm.
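
As a rough illustration of the design described above, here is a minimal sketch in plain Scala. It is not the code from this PR: it uses `Array[Double]` instead of MLlib vectors, the enclosing object name `FeatureScalingSketch` is hypothetical, and passing the training samples to the `StandardScaler` constructor is a simplification made for the example. The use of the unbiased (n - 1) variance estimate and the mapping of constant columns to 0.0 are also assumptions.

```scala
object FeatureScalingSketch {
  // Hypothetical stand-in for the trait described above; plain arrays replace MLlib vectors.
  trait VectorTransformer extends java.io.Serializable {
    def transform(vector: Array[Double]): Array[Double]
  }

  // Standardizes each column using the mean and standard deviation of the training samples.
  // Assumption: samples are passed to the constructor to keep the sketch short.
  class StandardScaler(withMean: Boolean, withStd: Boolean, samples: Seq[Array[Double]])
      extends VectorTransformer {
    require(samples.nonEmpty, "need at least one training sample")
    private val n = samples.length
    private val dim = samples.head.length
    private val mean: Array[Double] =
      Array.tabulate(dim)(j => samples.map(s => s(j)).sum / n)
    // Assumption: unbiased sample standard deviation per column (0.0 when n == 1).
    private val std: Array[Double] = Array.tabulate(dim) { j =>
      if (n > 1) {
        math.sqrt(samples.map(s => (s(j) - mean(j)) * (s(j) - mean(j))).sum / (n - 1))
      } else 0.0
    }

    override def transform(vector: Array[Double]): Array[Double] =
      Array.tabulate(dim) { j =>
        val centered = if (withMean) vector(j) - mean(j) else vector(j)
        // Assumption: constant columns (std == 0) map to 0.0 instead of dividing by zero.
        if (withStd) { if (std(j) != 0.0) centered / std(j) else 0.0 } else centered
      }
  }

  // Scales a single sample so that its L^n norm becomes 1 (finite integer n >= 1).
  class Normalizer(n: Int) extends VectorTransformer {
    require(n >= 1, s"n must be >= 1, got $n")

    override def transform(vector: Array[Double]): Array[Double] = {
      val norm = math.pow(vector.map(v => math.pow(math.abs(v), n)).sum, 1.0 / n)
      // An all-zero sample has no direction, so it is returned unchanged.
      if (norm != 0.0) vector.map(v => v / norm) else vector.clone()
    }
  }
}
```

The two transformers differ in what they need to know: `StandardScaler` depends on column statistics gathered from a training set, while `Normalizer` looks only at the sample it is given.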

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16102/

@mengxr
Contributor

mengxr commented Jul 10, 2014

Is there a reference implementation that you followed, or is this all new? Does the PMML standard define something similar?

@SparkQA

SparkQA commented Aug 3, 2014

QA tests have started for PR 1207. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17802/consoleFull

@SparkQA

SparkQA commented Aug 3, 2014

QA results for PR 1207:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Normalizer(n: Int) extends VectorTransformer with Serializable {
class StandardScaler(withMean: Boolean, withStd: Boolean)
trait VectorTransformer {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17802/consoleFull

@SparkQA

SparkQA commented Aug 3, 2014

QA tests have started for PR 1207. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17803/consoleFull

@SparkQA

SparkQA commented Aug 3, 2014

QA results for PR 1207:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Normalizer(n: Int) extends VectorTransformer with Serializable {
class StandardScaler(withMean: Boolean, withStd: Boolean)
trait VectorTransformer {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17803/consoleFull

@SparkQA

SparkQA commented Aug 3, 2014

QA tests have started for PR 1207. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17804/consoleFull

Contributor

n -> p, which is commonly used for norms.

@SparkQA

SparkQA commented Aug 3, 2014

QA results for PR 1207:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Normalizer(n: Int) extends VectorTransformer with Serializable {
class StandardScaler(withMean: Boolean, withStd: Boolean)
trait VectorTransformer {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17804/consoleFull

@dbtsai
Member Author

dbtsai commented Aug 3, 2014

TODO

  1. Allow p = Double.PositiveInfinity; support the 1, 2, and inf norms.
  2. Add withStd back.
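
The first TODO item can be illustrated with a small sketch, which is an assumption about the intent and not the code that was merged. Typing `p` as a `Double` lets `Double.PositiveInfinity` select the max norm, and the common cases 1 and 2 can bypass the general power formula. The names `NormSketch`, `pNorm`, and `normalize` are hypothetical.

```scala
object NormSketch {
  // Computes the L^p norm of a sample, with dedicated branches for p = 1, 2, and infinity.
  def pNorm(values: Array[Double], p: Double): Double = {
    require(p >= 1.0, s"p must be >= 1.0, got $p")
    if (p == 1.0) {
      values.map(v => math.abs(v)).sum
    } else if (p == 2.0) {
      math.sqrt(values.map(v => v * v).sum)
    } else if (p == Double.PositiveInfinity) {
      // The infinity norm is the largest absolute component.
      values.foldLeft(0.0)((acc, v) => math.max(acc, math.abs(v)))
    } else {
      math.pow(values.map(v => math.pow(math.abs(v), p)).sum, 1.0 / p)
    }
  }

  // Divides a sample by its L^p norm; an all-zero sample is returned unchanged.
  def normalize(values: Array[Double], p: Double): Array[Double] = {
    val norm = pNorm(values, p)
    if (norm != 0.0) values.map(v => v / norm) else values.clone()
  }
}
```

An `Int` parameter cannot represent infinity, which is one plausible reason for switching the parameter type to `Double`.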

@SparkQA

SparkQA commented Aug 3, 2014

QA tests have started for PR 1207. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17813/consoleFull

@SparkQA

SparkQA commented Aug 3, 2014

QA results for PR 1207:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Normalizer(p: Int) extends VectorTransformer with Serializable {
class StandardScaler(withMean: Boolean)
trait VectorTransformer extends Serializable {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17813/consoleFull

@SparkQA

SparkQA commented Aug 3, 2014

QA tests have started for PR 1207. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17833/consoleFull

@SparkQA

SparkQA commented Aug 4, 2014

QA tests have started for PR 1207. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17834/consoleFull

@SparkQA

SparkQA commented Aug 4, 2014

QA results for PR 1207:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Normalizer(p: Double) extends VectorTransformer {
class StandardScaler(withMean: Boolean, withStd: Boolean) extends VectorTransformer {
trait VectorTransformer extends Serializable {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17833/consoleFull

Contributor

Do you mind throwing a warning message if both withMean and withStd are false?
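
One way the requested check could look, as a sketch only: the helper below is hypothetical (`ScalerFlagCheck`, `noOpWarning`, and the message text are not taken from the PR), and it prints to standard error so the example stays free of Spark dependencies. Inside Spark, the message would more likely go through the project's own logging facilities.

```scala
object ScalerFlagCheck {
  // Returns a warning message when the flag combination makes the scaler a no-op.
  def noOpWarning(withMean: Boolean, withStd: Boolean): Option[String] =
    if (!withMean && !withStd) {
      Some("Both withMean and withStd are false. The scaler will leave vectors unchanged.")
    } else {
      None
    }

  def main(args: Array[String]): Unit = {
    // Emits the warning once, at construction time in a real implementation.
    noOpWarning(withMean = false, withStd = false)
      .foreach(msg => Console.err.println(s"WARN: $msg"))
  }
}
```

Returning an `Option` keeps the decision separate from the logging call, which makes the condition easy to test.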

@SparkQA

SparkQA commented Aug 4, 2014

QA tests have started for PR 1207. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17839/consoleFull

Contributor

^p^ (fun to read ^o^)

Member Author

lol...

@SparkQA

SparkQA commented Aug 4, 2014

QA tests have started for PR 1207. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17841/consoleFull

@SparkQA

SparkQA commented Aug 4, 2014

QA results for PR 1207:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Normalizer(p: Double) extends VectorTransformer {
class StandardScaler(withMean: Boolean, withStd: Boolean) extends VectorTransformer {
trait VectorTransformer extends Serializable {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17839/consoleFull

@mengxr
Contributor

mengxr commented Aug 4, 2014

LGTM. Merged into both master and branch-1.1. Thanks!!

@asfgit asfgit closed this in ae58aea Aug 4, 2014
asfgit pushed a commit that referenced this pull request Aug 4, 2014
…dependent variables or features of data

Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is generally performed during the data preprocessing step.

In this work, a trait called `VectorTransformer` is defined for generic transformations on a vector. It contains one method to be implemented, `transform`, which applies the transformation to a vector.

There are currently two implementations of `VectorTransformer`, and both can easily be extended with PMML transformation support.

1) `StandardScaler` - Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.

2) `Normalizer` - Normalizes samples individually to unit L^n norm

Author: DB Tsai <[email protected]>

Closes #1207 from dbtsai/dbtsai-feature-scaling and squashes the following commits:

78c15d3 [DB Tsai] Alpine Data Labs

(cherry picked from commit ae58aea)
Signed-off-by: Xiangrui Meng <[email protected]>
@SparkQA

SparkQA commented Aug 4, 2014

QA results for PR 1207:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class Normalizer(p: Double) extends VectorTransformer {
class StandardScaler(withMean: Boolean, withStd: Boolean) extends VectorTransformer {
trait VectorTransformer extends Serializable {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17841/consoleFull

xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
…dependent variables or features of data
wangyum pushed a commit that referenced this pull request May 26, 2023
mapr-devops pushed a commit to mapr/spark that referenced this pull request May 8, 2025