Skip to content

Conversation

@yanboliang
Copy link
Contributor

@yanboliang yanboliang commented Nov 30, 2016

What changes were proposed in this pull request?

Update ML programming and migration guide for 2.1 release.

How was this patch tested?

Doc change, no test.

@yanboliang yanboliang changed the title [SPARK-18324][ML][DOC] ML programming guide update and migration guide for 2.1 release [SPARK-18324][ML][DOC] Update ML programming and migration guide for 2.1 release Nov 30, 2016
@SparkQA
Copy link

SparkQA commented Nov 30, 2016

Test build #69385 has finished for PR 16076 at commit f3f1fc3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor Author

cc @jkbradley @MLnick @sethah

Copy link
Member

@jkbradley jkbradley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few comments.

One more: We need to note [https://issues.apache.org/jira/browse/SPARK-18291] or revert that change in order to avoid breaking APIs. CC @mengxr

docs/ml-guide.md Outdated
In `spark.ml.util.MLReader` and `spark.ml.util.MLWriter`, the `context` method has been deprecated in favor of `session`.
* In `spark.ml.feature.ChiSqSelectorModel`, the `setLabelCol` method has been deprecated since it was not used by `ChiSqSelectorModel`.
* [SPARK-18592](https://issues.apache.org/jira/browse/SPARK-18592):
Deprecate all setter methods for `DecisionTreeClassificationModel`, `GBTClassificationModel`, `RandomForestClassificationModel`, `DecisionTreeRegressionModel`, `GBTRegressionModel` and `RandomForestRegressionModel`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"all setter methods" --> "all Param setter methods except for input/output column Params"

* `numTrees` in `classification.RandomForestClassificationModel` (This now refers to the Param called `numTrees`)
* `numTrees` in `regression.RandomForestRegressionModel` (This now refers to the Param called `numTrees`)
* `model` in `regression.LinearRegressionSummary`
* `validateParams` in `PipelineStage`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also: validateParams in Evaluator

docs/ml-guide.md Outdated
`KMeans` returns potentially fewer than k cluster centers in cases where k distinct centroids aren't available or aren't selected.
* [SPARK-17389](https://issues.apache.org/jira/browse/SPARK-17389):
`KMeans` reduces the default number of steps from 5 to 2 for the k-means|| initialization mode.
* [SPARK-18481](https://issues.apache.org/jira/browse/SPARK-18481):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should go under breaking changes. I'm starting to wonder if we should revert this one change. It doesn't gain us much.

Copy link
Contributor Author

@yanboliang yanboliang Dec 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved it to breaking changes session. I think the Params setter methods are rarely used, since it takes no effect and should not be exposed for model classes (This is also why we deprecate them from 2.1). Even they are used, it's no breaking changes for Scala users. And for Java users, they always should make forced type conversion, such as:

RandomForestClassificationModel model = (RandomForestClassificationModel)rf.fit(dataFrame).setFeatureSubsetStrategy("sqrt");

So I think it has little affect and actually no breaking happens, so I think it's OK. But I'm open to hear your thoughts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you're right. I was thinking the "final" change was breaking for Scala, but it's not since the model constructor is private. I'm OK with moving those back to changes of behavior, or just removing these items completely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I remove these completely. Thanks.

docs/ml-guide.md Outdated
`QuantileDiscretizer` now uses `spark.sql.DataFrameStatFunctions.approxQuantile` to find splits (previously used custom sampling logic).
The output buckets will differ for same input data and params.
* [SPARK-17870](https://issues.apache.org/jira/browse/SPARK-17870):
Fix a bug of `ChiSqSelector` which will likely change its result.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the more specific text from the JIRA: "ChiSquareSelector use pValue rather than raw statistic for SelectKBest features."

@SparkQA
Copy link

SparkQA commented Dec 1, 2016

Test build #69456 has finished for PR 16076 at commit e784901.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor Author

@jkbradley I addressed other comments except for SPARK-18291, since I think it's SparkR related issue and should be listed at SparkR session of migration guide. I'm preparing a PR to update SparkR migration guide. Thanks.

@jkbradley
Copy link
Member

Sounds good about SPARK-18291.

I responded inline above about SPARK-18481. Apart from this update, this looks ready to me. Thank you!

@SparkQA
Copy link

SparkQA commented Dec 2, 2016

Test build #69559 has finished for PR 16076 at commit f583653.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

LGTM
I'll merge this with master and branch-2.1
Thank you!

asfgit pushed a commit that referenced this pull request Dec 3, 2016
…2.1 release

## What changes were proposed in this pull request?
Update ML programming and migration guide for 2.1 release.

## How was this patch tested?
Doc change, no test.

Author: Yanbo Liang <[email protected]>

Closes #16076 from yanboliang/spark-18324.

(cherry picked from commit 2dc0d7e)
Signed-off-by: Joseph K. Bradley <[email protected]>
@asfgit asfgit closed this in 2dc0d7e Dec 3, 2016
@yanboliang yanboliang deleted the spark-18324 branch December 3, 2016 05:15
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
…2.1 release

## What changes were proposed in this pull request?
Update ML programming and migration guide for 2.1 release.

## How was this patch tested?
Doc change, no test.

Author: Yanbo Liang <[email protected]>

Closes apache#16076 from yanboliang/spark-18324.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…2.1 release

## What changes were proposed in this pull request?
Update ML programming and migration guide for 2.1 release.

## How was this patch tested?
Doc change, no test.

Author: Yanbo Liang <[email protected]>

Closes apache#16076 from yanboliang/spark-18324.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants