-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-12212][ML][DOC] Clarifies the difference between spark.ml, spark.mllib and mllib in the documentation. #10234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Changes from all commits
Commits
Show all changes
9 commits
Select commit
Hold shift + click to select a range
edddc09
removed old files
thunterdb 4b328e5
removed old files
thunterdb 62dc368
last changes
thunterdb 42bf337
missing conversions
thunterdb cb47dd1
Frames
thunterdb 6d66645
added stubs and spark.mlxxx everywhere
thunterdb c75a5ca
put the intro again into the ml-guide
thunterdb 8432ac9
links in the side bar
thunterdb 68f7e53
changes
thunterdb File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,8 +1,8 @@ | ||
| <div class="left-menu-wrapper"> | ||
| <div class="left-menu"> | ||
| <h3>spark.ml package</h3> | ||
| <h3><a href="ml-guide.html">spark.ml package</a></h3> | ||
| {% include nav-left.html nav=include.nav-ml %} | ||
| <h3>spark.mllib package</h3> | ||
| <h3><a href="mllib-guide.html">spark.mllib package</a></h3> | ||
| {% include nav-left.html nav=include.nav-mllib %} | ||
| </div> | ||
| </div> |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| --- | ||
| layout: global | ||
| title: Multilayer perceptron classifier - spark.ml | ||
| displayTitle: Multilayer perceptron classifier - spark.ml | ||
| --- | ||
|
|
||
| > This section has been moved into the | ||
| [classification and regression section](ml-classification-regression.html#multilayer-perceptron-classifier). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,171 +1,8 @@ | ||
| --- | ||
| layout: global | ||
| title: Decision Trees - SparkML | ||
| displayTitle: <a href="ml-guide.html">ML</a> - Decision Trees | ||
| title: Decision trees - spark.ml | ||
| displayTitle: Decision trees - spark.ml | ||
| --- | ||
|
|
||
| **Table of Contents** | ||
|
|
||
| * This will become a table of contents (this text will be scraped). | ||
| {:toc} | ||
|
|
||
|
|
||
| # Overview | ||
|
|
||
| [Decision trees](http://en.wikipedia.org/wiki/Decision_tree_learning) | ||
| and their ensembles are popular methods for the machine learning tasks of | ||
| classification and regression. Decision trees are widely used since they are easy to interpret, | ||
| handle categorical features, extend to the multiclass classification setting, do not require | ||
| feature scaling, and are able to capture non-linearities and feature interactions. Tree ensemble | ||
| algorithms such as random forests and boosting are among the top performers for classification and | ||
| regression tasks. | ||
|
|
||
| MLlib supports decision trees for binary and multiclass classification and for regression, | ||
| using both continuous and categorical features. The implementation partitions data by rows, | ||
| allowing distributed training with millions or even billions of instances. | ||
|
|
||
| Users can find more information about the decision tree algorithm in the [MLlib Decision Tree guide](mllib-decision-tree.html). In this section, we demonstrate the Pipelines API for Decision Trees. | ||
|
|
||
| The Pipelines API for Decision Trees offers a bit more functionality than the original API. In particular, for classification, users can get the predicted probability of each class (a.k.a. class conditional probabilities). | ||
|
|
||
| Ensembles of trees (Random Forests and Gradient-Boosted Trees) are described in the [Ensembles guide](ml-ensembles.html). | ||
|
|
||
| # Inputs and Outputs | ||
|
|
||
| We list the input and output (prediction) column types here. | ||
| All output columns are optional; to exclude an output column, set its corresponding Param to an empty string. | ||
|
|
||
| ## Input Columns | ||
|
|
||
| <table class="table"> | ||
| <thead> | ||
| <tr> | ||
| <th align="left">Param name</th> | ||
| <th align="left">Type(s)</th> | ||
| <th align="left">Default</th> | ||
| <th align="left">Description</th> | ||
| </tr> | ||
| </thead> | ||
| <tbody> | ||
| <tr> | ||
| <td>labelCol</td> | ||
| <td>Double</td> | ||
| <td>"label"</td> | ||
| <td>Label to predict</td> | ||
| </tr> | ||
| <tr> | ||
| <td>featuresCol</td> | ||
| <td>Vector</td> | ||
| <td>"features"</td> | ||
| <td>Feature vector</td> | ||
| </tr> | ||
| </tbody> | ||
| </table> | ||
|
|
||
| ## Output Columns | ||
|
|
||
| <table class="table"> | ||
| <thead> | ||
| <tr> | ||
| <th align="left">Param name</th> | ||
| <th align="left">Type(s)</th> | ||
| <th align="left">Default</th> | ||
| <th align="left">Description</th> | ||
| <th align="left">Notes</th> | ||
| </tr> | ||
| </thead> | ||
| <tbody> | ||
| <tr> | ||
| <td>predictionCol</td> | ||
| <td>Double</td> | ||
| <td>"prediction"</td> | ||
| <td>Predicted label</td> | ||
| <td></td> | ||
| </tr> | ||
| <tr> | ||
| <td>rawPredictionCol</td> | ||
| <td>Vector</td> | ||
| <td>"rawPrediction"</td> | ||
| <td>Vector of length # classes, with the counts of training instance labels at the tree node which makes the prediction</td> | ||
| <td>Classification only</td> | ||
| </tr> | ||
| <tr> | ||
| <td>probabilityCol</td> | ||
| <td>Vector</td> | ||
| <td>"probability"</td> | ||
| <td>Vector of length # classes equal to rawPrediction normalized to a multinomial distribution</td> | ||
| <td>Classification only</td> | ||
| </tr> | ||
| </tbody> | ||
| </table> | ||
|
|
||
| # Examples | ||
|
|
||
| The below examples demonstrate the Pipelines API for Decision Trees. The main differences between this API and the [original MLlib Decision Tree API](mllib-decision-tree.html) are: | ||
|
|
||
| * support for ML Pipelines | ||
| * separation of Decision Trees for classification vs. regression | ||
| * use of DataFrame metadata to distinguish continuous and categorical features | ||
|
|
||
|
|
||
| ## Classification | ||
|
|
||
| The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set. | ||
| We use two feature transformers to prepare the data; these help index categories for the label and categorical features, adding metadata to the `DataFrame` which the Decision Tree algorithm can recognize. | ||
|
|
||
| <div class="codetabs"> | ||
| <div data-lang="scala" markdown="1"> | ||
|
|
||
| More details on parameters can be found in the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.classification.DecisionTreeClassifier). | ||
|
|
||
| {% include_example scala/org/apache/spark/examples/ml/DecisionTreeClassificationExample.scala %} | ||
|
|
||
| </div> | ||
|
|
||
| <div data-lang="java" markdown="1"> | ||
|
|
||
| More details on parameters can be found in the [Java API documentation](api/java/org/apache/spark/ml/classification/DecisionTreeClassifier.html). | ||
|
|
||
| {% include_example java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java %} | ||
|
|
||
| </div> | ||
|
|
||
| <div data-lang="python" markdown="1"> | ||
|
|
||
| More details on parameters can be found in the [Python API documentation](api/python/pyspark.ml.html#pyspark.ml.classification.DecisionTreeClassifier). | ||
|
|
||
| {% include_example python/ml/decision_tree_classification_example.py %} | ||
|
|
||
| </div> | ||
|
|
||
| </div> | ||
|
|
||
|
|
||
| ## Regression | ||
|
|
||
| The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set. | ||
| We use a feature transformer to index categorical features, adding metadata to the `DataFrame` which the Decision Tree algorithm can recognize. | ||
|
|
||
| <div class="codetabs"> | ||
| <div data-lang="scala" markdown="1"> | ||
|
|
||
| More details on parameters can be found in the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.regression.DecisionTreeRegressor). | ||
|
|
||
| {% include_example scala/org/apache/spark/examples/ml/DecisionTreeRegressionExample.scala %} | ||
| </div> | ||
|
|
||
| <div data-lang="java" markdown="1"> | ||
|
|
||
| More details on parameters can be found in the [Java API documentation](api/java/org/apache/spark/ml/regression/DecisionTreeRegressor.html). | ||
|
|
||
| {% include_example java/org/apache/spark/examples/ml/JavaDecisionTreeRegressionExample.java %} | ||
| </div> | ||
|
|
||
| <div data-lang="python" markdown="1"> | ||
|
|
||
| More details on parameters can be found in the [Python API documentation](api/python/pyspark.ml.html#pyspark.ml.regression.DecisionTreeRegressor). | ||
|
|
||
| {% include_example python/ml/decision_tree_regression_example.py %} | ||
| </div> | ||
|
|
||
| </div> | ||
| > This section has been moved into the | ||
| [classification and regression section](ml-classification-regression.html#decision-trees). |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spark.mllib right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not this file; the "ml-" prefix ones are for spark.ml. (It's true the functionality is almost the same currently, but it's a bit different and will diverge more.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the purpose now. It was the old MLlib text, but a lot of it still applies. The distinction is removed.