-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-18324][ML][DOC] Update ML programming and migration guide for 2.1 release #16076
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #69385 has finished for PR 16076 at commit
|
jkbradley
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few comments.
One more: We need to note [https://issues.apache.org/jira/browse/SPARK-18291] or revert that change in order to avoid breaking APIs. CC @mengxr
docs/ml-guide.md
Outdated
| In `spark.ml.util.MLReader` and `spark.ml.util.MLWriter`, the `context` method has been deprecated in favor of `session`. | ||
| * In `spark.ml.feature.ChiSqSelectorModel`, the `setLabelCol` method has been deprecated since it was not used by `ChiSqSelectorModel`. | ||
| * [SPARK-18592](https://issues.apache.org/jira/browse/SPARK-18592): | ||
| Deprecate all setter methods for `DecisionTreeClassificationModel`, `GBTClassificationModel`, `RandomForestClassificationModel`, `DecisionTreeRegressionModel`, `GBTRegressionModel` and `RandomForestRegressionModel` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"all setter methods" --> "all Param setter methods except for input/output column Params"
| * `numTrees` in `classification.RandomForestClassificationModel` (This now refers to the Param called `numTrees`) | ||
| * `numTrees` in `regression.RandomForestRegressionModel` (This now refers to the Param called `numTrees`) | ||
| * `model` in `regression.LinearRegressionSummary` | ||
| * `validateParams` in `PipelineStage` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also: validateParams in Evaluator
docs/ml-guide.md
Outdated
| `KMeans` returns potentially fewer than k cluster centers in cases where k distinct centroids aren't available or aren't selected. | ||
| * [SPARK-17389](https://issues.apache.org/jira/browse/SPARK-17389): | ||
| `KMeans` reduces the default number of steps from 5 to 2 for the k-means|| initialization mode. | ||
| * [SPARK-18481](https://issues.apache.org/jira/browse/SPARK-18481): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should go under breaking changes. I'm starting to wonder if we should revert this one change. It doesn't gain us much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved it to breaking changes session. I think the Params setter methods are rarely used, since it takes no effect and should not be exposed for model classes (This is also why we deprecate them from 2.1). Even they are used, it's no breaking changes for Scala users. And for Java users, they always should make forced type conversion, such as:
RandomForestClassificationModel model = (RandomForestClassificationModel)rf.fit(dataFrame).setFeatureSubsetStrategy("sqrt");
So I think it has little affect and actually no breaking happens, so I think it's OK. But I'm open to hear your thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you're right. I was thinking the "final" change was breaking for Scala, but it's not since the model constructor is private. I'm OK with moving those back to changes of behavior, or just removing these items completely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I remove these completely. Thanks.
docs/ml-guide.md
Outdated
| `QuantileDiscretizer` now uses `spark.sql.DataFrameStatFunctions.approxQuantile` to find splits (previously used custom sampling logic). | ||
| The output buckets will differ for same input data and params. | ||
| * [SPARK-17870](https://issues.apache.org/jira/browse/SPARK-17870): | ||
| Fix a bug of `ChiSqSelector` which will likely change its result. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add the more specific text from the JIRA: "ChiSquareSelector use pValue rather than raw statistic for SelectKBest features."
|
Test build #69456 has finished for PR 16076 at commit
|
|
@jkbradley I addressed other comments except for SPARK-18291, since I think it's SparkR related issue and should be listed at SparkR session of migration guide. I'm preparing a PR to update SparkR migration guide. Thanks. |
|
Sounds good about SPARK-18291. I responded inline above about SPARK-18481. Apart from this update, this looks ready to me. Thank you! |
|
Test build #69559 has finished for PR 16076 at commit
|
|
LGTM |
…2.1 release ## What changes were proposed in this pull request? Update ML programming and migration guide for 2.1 release. ## How was this patch tested? Doc change, no test. Author: Yanbo Liang <[email protected]> Closes #16076 from yanboliang/spark-18324. (cherry picked from commit 2dc0d7e) Signed-off-by: Joseph K. Bradley <[email protected]>
…2.1 release ## What changes were proposed in this pull request? Update ML programming and migration guide for 2.1 release. ## How was this patch tested? Doc change, no test. Author: Yanbo Liang <[email protected]> Closes apache#16076 from yanboliang/spark-18324.
…2.1 release ## What changes were proposed in this pull request? Update ML programming and migration guide for 2.1 release. ## How was this patch tested? Doc change, no test. Author: Yanbo Liang <[email protected]> Closes apache#16076 from yanboliang/spark-18324.
What changes were proposed in this pull request?
Update ML programming and migration guide for 2.1 release.
How was this patch tested?
Doc change, no test.