30 changes: 15 additions & 15 deletions docs/ml-classification-regression.md
@@ -46,7 +46,7 @@ parameter to select between these two algorithms, or leave it unset and Spark wi

For more background and more details about the implementation of binomial logistic regression, refer to the documentation of [logistic regression in `spark.mllib`](mllib-linear-methods.html#logistic-regression).

**Example**
**Examples**

The following example shows how to train binomial and multinomial logistic regression
models for binary classification with elastic net regularization. `elasticNetParam` corresponds to
@@ -137,7 +137,7 @@ We minimize the weighted negative log-likelihood, using a multinomial response m

For a detailed derivation please see [here](https://en.wikipedia.org/wiki/Multinomial_logistic_regression#As_a_log-linear_model).

**Example**
**Examples**

The following example shows how to train a multiclass logistic regression
model with elastic net regularization.
@@ -164,7 +164,7 @@ model with elastic net regularization.
Decision trees are a popular family of classification and regression methods.
More information about the `spark.ml` implementation can be found further in the [section on decision trees](#decision-trees).

**Example**
**Examples**

The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
We use two feature transformers to prepare the data; these help index categories for the label and categorical features, adding metadata to the `DataFrame` which the Decision Tree algorithm can recognize.
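
A minimal Scala sketch of that flow, assuming an existing `SparkSession` named `spark` and the `data/mllib/sample_libsvm_data.txt` file bundled with Spark; column names are illustrative, not part of this change:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{StringIndexer, VectorIndexer}

// Load the data and hold out 30% for testing.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val Array(training, test) = data.randomSplit(Array(0.7, 0.3))

// Index the label column and the categorical features so the tree can recognize them.
val labelIndexer = new StringIndexer()
  .setInputCol("label").setOutputCol("indexedLabel").fit(data)
val featureIndexer = new VectorIndexer()
  .setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(4).fit(data)

val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures")

val model = new Pipeline().setStages(Array(labelIndexer, featureIndexer, dt)).fit(training)

// Evaluate accuracy on the held-out test set.
val predictions = model.transform(test)
val accuracy = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel").setMetricName("accuracy").evaluate(predictions)
println(s"Test error = ${1.0 - accuracy}")
```
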
@@ -201,7 +201,7 @@ More details on parameters can be found in the [Python API documentation](api/py
Random forests are a popular family of classification and regression methods.
More information about the `spark.ml` implementation can be found further in the [section on random forests](#random-forests).

**Example**
**Examples**

The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
We use two feature transformers to prepare the data; these help index categories for the label and categorical features, adding metadata to the `DataFrame` which the tree-based algorithms can recognize.
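
The pipeline is the same as in the decision-tree sketch above; only the estimator stage changes. A hedged fragment with illustrative parameter values:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

// Same label/feature indexing stages as before; the estimator is now a forest of 10 trees.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)
```
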
@@ -234,7 +234,7 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat
Gradient-boosted trees (GBTs) are a popular classification and regression method using ensembles of decision trees.
More information about the `spark.ml` implementation can be found further in the [section on GBTs](#gradient-boosted-trees-gbts).

**Example**
**Examples**

The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
We use two feature transformers to prepare the data; these help index categories for the label and categorical features, adding metadata to the `DataFrame` which the tree-based algorithms can recognize.
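
Again only the estimator stage differs from the decision-tree sketch; a hedged fragment with illustrative parameters:

```scala
import org.apache.spark.ml.classification.GBTClassifier

// Gradient-boosted trees (binary labels in spark.ml 2.x), boosting for at most 10 iterations
// over the indexed label and feature columns produced by the same pipeline as above.
val gbt = new GBTClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setMaxIter(10)
```
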
@@ -284,7 +284,7 @@ The number of nodes `$N$` in the output layer corresponds to the number of class

MLPC employs backpropagation for learning the model. We use the logistic loss function for optimization and L-BFGS as an optimization routine.

**Example**
**Examples**
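
A minimal Scala sketch, assuming a `SparkSession` named `spark` and the bundled `data/mllib/sample_multiclass_classification_data.txt` file; the layer sizes correspond to that file's 4 features and 3 classes and are otherwise illustrative:

```scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val data = spark.read.format("libsvm")
  .load("data/mllib/sample_multiclass_classification_data.txt")
val Array(train, test) = data.randomSplit(Array(0.6, 0.4), seed = 1234L)

// 4 input features, two hidden layers of sizes 5 and 4, and 3 output classes.
val layers = Array[Int](4, 5, 4, 3)
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers).setBlockSize(128).setSeed(1234L).setMaxIter(100)

val model = trainer.fit(train)
val result = model.transform(test)
val accuracy = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy").evaluate(result.select("prediction", "label"))
println(s"Test accuracy = $accuracy")
```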

<div class="codetabs">

@@ -311,7 +311,7 @@ MLPC employs backpropagation for learning the model. We use the logistic loss fu

Predictions are made by evaluating each binary classifier, and the index of the most confident classifier is output as the label.

**Example**
**Examples**

The example below demonstrates how to load the
[Iris dataset](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/iris.scale), parse it as a DataFrame and perform multiclass classification using `OneVsRest`. The test error is calculated to measure the algorithm accuracy.
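
A minimal Scala sketch of the same idea, using the multiclass sample file bundled with Spark rather than the Iris file linked above (assumes a `SparkSession` named `spark`):

```scala
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Base binary classifier; OneVsRest trains one copy of it per class.
val classifier = new LogisticRegression().setMaxIter(10).setTol(1e-6).setFitIntercept(true)
val ovr = new OneVsRest().setClassifier(classifier)

val data = spark.read.format("libsvm")
  .load("data/mllib/sample_multiclass_classification_data.txt")
val Array(train, test) = data.randomSplit(Array(0.8, 0.2))

val ovrModel = ovr.fit(train)
val predictions = ovrModel.transform(test)
val accuracy = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy").evaluate(predictions)
println(s"Test error = ${1 - accuracy}")
```
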
@@ -348,7 +348,7 @@ naive Bayes](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-c
and [Bernoulli naive Bayes](http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html).
More information can be found in the section on [Naive Bayes in MLlib](mllib-naive-bayes.html#naive-bayes-sparkmllib).

**Example**
**Examples**

<div class="codetabs">
<div data-lang="scala" markdown="1">
@@ -383,7 +383,7 @@ summaries is similar to the logistic regression case.

> When fitting a LinearRegressionModel without an intercept on a dataset with a constant nonzero column using the "l-bfgs" solver, Spark MLlib outputs zero coefficients for constant nonzero columns. This behavior is the same as R glmnet but different from LIBSVM.

**Example**
**Examples**

The following
example demonstrates training an elastic net regularized linear
@@ -511,7 +511,7 @@ others.
</tbody>
</table>

**Example**
**Examples**

The following example demonstrates training a GLM with a Gaussian response and identity link
function and extracting model summary statistics.
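
A minimal Scala sketch, assuming a `SparkSession` named `spark` and the bundled `data/mllib/sample_linear_regression_data.txt` file; regularization settings are illustrative:

```scala
import org.apache.spark.ml.regression.GeneralizedLinearRegression

val dataset = spark.read.format("libsvm").load("data/mllib/sample_linear_regression_data.txt")

val glr = new GeneralizedLinearRegression()
  .setFamily("gaussian")
  .setLink("identity")
  .setMaxIter(10)
  .setRegParam(0.3)

val model = glr.fit(dataset)

// Summary statistics: coefficient standard errors, deviances, etc.
val summary = model.summary
println(s"Coefficients: ${model.coefficients}, intercept: ${model.intercept}")
println(s"Coefficient standard errors: ${summary.coefficientStandardErrors.mkString(",")}")
println(s"Null deviance: ${summary.nullDeviance}, residual deviance: ${summary.deviance}")
```
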
@@ -544,7 +544,7 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.regression.
Decision trees are a popular family of classification and regression methods.
More information about the `spark.ml` implementation can be found further in the [section on decision trees](#decision-trees).

**Example**
**Examples**

The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
We use a feature transformer to index categorical features, adding metadata to the `DataFrame` which the Decision Tree algorithm can recognize.
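
A minimal Scala sketch of that flow, assuming a `SparkSession` named `spark` and the bundled sample LibSVM file; column names are illustrative:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.regression.DecisionTreeRegressor

val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val Array(training, test) = data.randomSplit(Array(0.7, 0.3))

// Index categorical features (those with <= 4 distinct values), adding the required metadata.
val featureIndexer = new VectorIndexer()
  .setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(4).fit(data)
val dt = new DecisionTreeRegressor().setLabelCol("label").setFeaturesCol("indexedFeatures")

val model = new Pipeline().setStages(Array(featureIndexer, dt)).fit(training)
val rmse = new RegressionEvaluator().setMetricName("rmse").evaluate(model.transform(test))
println(s"RMSE on test data = $rmse")
```
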
@@ -579,7 +579,7 @@ More details on parameters can be found in the [Python API documentation](api/py
Random forests are a popular family of classification and regression methods.
More information about the `spark.ml` implementation can be found further in the [section on random forests](#random-forests).

**Example**
**Examples**

The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
We use a feature transformer to index categorical features, adding metadata to the `DataFrame` which the tree-based algorithms can recognize.
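
As above, only the estimator stage changes relative to the decision-tree regression sketch; a hedged fragment:

```scala
import org.apache.spark.ml.regression.RandomForestRegressor

// Drop-in replacement for the decision-tree stage in the regression pipeline above.
val rf = new RandomForestRegressor()
  .setLabelCol("label")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)
```
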
@@ -612,7 +612,7 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.regression.
Gradient-boosted trees (GBTs) are a popular regression method using ensembles of decision trees.
More information about the `spark.ml` implementation can be found further in the [section on GBTs](#gradient-boosted-trees-gbts).

**Example**
**Examples**

Note: For this example dataset, `GBTRegressor` actually only needs 1 iteration, but that will not
be true in general.
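
A hedged fragment showing the estimator stage only; the surrounding indexing pipeline is the same as in the regression sketches above:

```scala
import org.apache.spark.ml.regression.GBTRegressor

// As noted above, one iteration is enough on the sample data; more are usually needed.
val gbt = new GBTRegressor()
  .setLabelCol("label")
  .setFeaturesCol("indexedFeatures")
  .setMaxIter(10)
```
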
@@ -700,7 +700,7 @@ The implementation matches the result from R's survival function

> When fitting an AFTSurvivalRegressionModel without an intercept on a dataset with a constant nonzero column, Spark MLlib outputs zero coefficients for constant nonzero columns. This behavior differs from R survival::survreg.

**Example**
**Examples**
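
A minimal Scala sketch over a tiny hand-made dataset (the values and column names are illustrative, not part of the diff); it assumes a `SparkSession` named `spark`:

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.AFTSurvivalRegression

// (label, censor, features); censor == 0.0 marks a censored observation.
val training = spark.createDataFrame(Seq(
  (1.218, 1.0, Vectors.dense(1.560, -0.605)),
  (2.949, 0.0, Vectors.dense(0.346, 2.158)),
  (3.627, 0.0, Vectors.dense(1.380, 0.231)),
  (0.273, 1.0, Vectors.dense(0.520, 1.151)),
  (4.199, 0.0, Vectors.dense(0.795, -0.226))
)).toDF("label", "censor", "features")

val aft = new AFTSurvivalRegression()
  .setQuantileProbabilities(Array(0.3, 0.6))
  .setQuantilesCol("quantiles")

val model = aft.fit(training)
model.transform(training).show(false)
```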

<div class="codetabs">

@@ -765,7 +765,7 @@ is treated as piecewise linear function. The rules for prediction therefore are:
predictions of the two closest features. In case there are multiple values
with the same feature then the same rules as in previous point are used.

### Examples
**Examples**

<div class="codetabs">
<div data-lang="scala" markdown="1">
8 changes: 5 additions & 3 deletions docs/ml-clustering.md
@@ -65,7 +65,7 @@ called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
</tbody>
</table>

### Example
**Examples**

<div class="codetabs">

@@ -94,6 +94,8 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.
and generates a `LDAModel` as the base model. Expert users may cast a `LDAModel` generated by
`EMLDAOptimizer` to a `DistributedLDAModel` if needed.

**Examples**
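
A minimal Scala sketch, assuming a `SparkSession` named `spark` and the `data/mllib/sample_lda_libsvm_data.txt` file bundled with Spark; `k` and the iteration count are illustrative:

```scala
import org.apache.spark.ml.clustering.LDA

val dataset = spark.read.format("libsvm").load("data/mllib/sample_lda_libsvm_data.txt")

// Fit a topic model with 10 topics.
val lda = new LDA().setK(10).setMaxIter(10)
val model = lda.fit(dataset)

println(s"Log likelihood: ${model.logLikelihood(dataset)}")
println(s"Log perplexity: ${model.logPerplexity(dataset)}")

// Top-weighted terms per topic, then the per-document topic mixture.
model.describeTopics(3).show(false)
model.transform(dataset).show(false)
```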

<div class="codetabs">

<div data-lang="scala" markdown="1">
@@ -128,7 +130,7 @@ Bisecting K-means can often be much faster than regular K-means, but it will gen

`BisectingKMeans` is implemented as an `Estimator` and generates a `BisectingKMeansModel` as the base model.

### Example
**Examples**
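
A minimal Scala sketch, assuming a `SparkSession` named `spark` and the bundled `data/mllib/sample_kmeans_data.txt` file:

```scala
import org.apache.spark.ml.clustering.BisectingKMeans

val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

val bkm = new BisectingKMeans().setK(2).setSeed(1)
val model = bkm.fit(dataset)

// Within-set sum of squared errors and the resulting cluster centers.
println(s"Within set sum of squared errors = ${model.computeCost(dataset)}")
model.clusterCenters.foreach(println)
```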

<div class="codetabs">

@@ -210,7 +212,7 @@ model.
</tbody>
</table>

### Example
**Examples**

<div class="codetabs">

2 changes: 1 addition & 1 deletion docs/ml-collaborative-filtering.md
@@ -59,7 +59,7 @@ This approach is named "ALS-WR" and discussed in the paper
It makes `regParam` less dependent on the scale of the dataset, so we can apply the
best parameter learned from a sampled subset to the full dataset and expect similar performance.

## Examples
**Examples**
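
A minimal Scala sketch, assuming a ratings `DataFrame` named `ratings` with hypothetical `userId`, `movieId` and `rating` columns (none of these names come from the diff):

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.ALS

val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))

val als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
val model = als.fit(training)

// Evaluate by the RMSE of predicted vs. actual ratings on the test split.
// Note: users/items unseen in training can yield NaN predictions in older Spark versions.
val predictions = model.transform(test)
val rmse = new RegressionEvaluator()
  .setMetricName("rmse").setLabelCol("rating").setPredictionCol("prediction")
  .evaluate(predictions)
println(s"Root-mean-square error = $rmse")
```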

<div class="codetabs">
<div data-lang="scala" markdown="1">
30 changes: 30 additions & 0 deletions docs/ml-features.md
@@ -112,6 +112,8 @@ can then be used as features for prediction, document similarity calculations, e
Please refer to the [MLlib user guide on Word2Vec](mllib-feature-extraction.html#word2vec) for more
details.

**Examples**

In the following code segment, we start with a set of documents, each of which is represented as a sequence of words. For each document, we transform it into a feature vector. This feature vector could then be passed to a learning algorithm.
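
A minimal Scala sketch of that segment (assumes a `SparkSession` named `spark`; the toy sentences and the vector size are illustrative):

```scala
import org.apache.spark.ml.feature.Word2Vec

// Each row holds a document as a sequence of words.
val documentDF = spark.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

// Map each document to a 3-dimensional vector by averaging its word vectors.
val word2Vec = new Word2Vec()
  .setInputCol("text").setOutputCol("result").setVectorSize(3).setMinCount(0)
val model = word2Vec.fit(documentDF)
model.transform(documentDF).select("result").show(false)
```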

<div class="codetabs">
@@ -220,6 +222,8 @@ for more details on the API.
Alternatively, users can set the parameter "gaps" to false, indicating that the regex "pattern" denotes
"tokens" rather than splitting gaps, and find all matching occurrences as the tokenization result.

**Examples**
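
A minimal Scala sketch (assumes a `SparkSession` named `spark`; the sample sentences are illustrative):

```scala
import org.apache.spark.ml.feature.RegexTokenizer

val sentenceDF = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (1, "The,quick,brown,fox")
)).toDF("id", "sentence")

// With gaps=false, the pattern matches the tokens themselves rather than the gaps between them.
val regexTokenizer = new RegexTokenizer()
  .setInputCol("sentence").setOutputCol("words")
  .setPattern("\\w+").setGaps(false)

regexTokenizer.transform(sentenceDF).select("sentence", "words").show(false)
```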

<div class="codetabs">
<div data-lang="scala" markdown="1">

@@ -321,6 +325,8 @@ An [n-gram](https://en.wikipedia.org/wiki/N-gram) is a sequence of $n$ tokens (t

`NGram` takes as input a sequence of strings (e.g. the output of a [Tokenizer](ml-features.html#tokenizer)). The parameter `n` is used to determine the number of terms in each $n$-gram. The output will consist of a sequence of $n$-grams where each $n$-gram is represented by a space-delimited string of $n$ consecutive words. If the input sequence contains fewer than `n` strings, no output is produced.

**Examples**
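
A minimal Scala sketch (assumes a `SparkSession` named `spark`; the toy word sequences are illustrative):

```scala
import org.apache.spark.ml.feature.NGram

val wordDF = spark.createDataFrame(Seq(
  (0, Array("Hi", "I", "heard", "about", "Spark")),
  (1, Array("Logistic", "regression", "models", "are", "neat"))
)).toDF("id", "words")

// Each output row is the list of space-delimited bigrams for that document.
val ngram = new NGram().setN(2).setInputCol("words").setOutputCol("ngrams")
ngram.transform(wordDF).select("ngrams").show(false)
```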

<div class="codetabs">

<div data-lang="scala" markdown="1">
@@ -358,6 +364,8 @@ for binarization. Feature values greater than the threshold are binarized to 1.0
to or less than the threshold are binarized to 0.0. Both Vector and Double types are supported
for `inputCol`.

**Examples**
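
A minimal Scala sketch (assumes a `SparkSession` named `spark`; threshold and values are illustrative):

```scala
import org.apache.spark.ml.feature.Binarizer

val continuousDF = spark.createDataFrame(
  Seq((0, 0.1), (1, 0.8), (2, 0.2))
).toDF("id", "feature")

// Values > 0.5 become 1.0; values <= 0.5 become 0.0.
val binarizer = new Binarizer()
  .setInputCol("feature").setOutputCol("binarized_feature").setThreshold(0.5)

binarizer.transform(continuousDF).show()
```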

<div class="codetabs">
<div data-lang="scala" markdown="1">

@@ -388,6 +396,8 @@ for more details on the API.

[PCA](http://en.wikipedia.org/wiki/Principal_component_analysis) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. A [PCA](api/scala/index.html#org.apache.spark.ml.feature.PCA) class trains a model to project vectors to a low-dimensional space using PCA. The example below shows how to project 5-dimensional feature vectors into 3-dimensional principal components.

**Examples**
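
A minimal Scala sketch of that projection (assumes a `SparkSession` named `spark`; the input vectors are illustrative):

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors

val data = Seq(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

// Project the 5-dimensional vectors onto the top 3 principal components.
val pca = new PCA().setInputCol("features").setOutputCol("pcaFeatures").setK(3).fit(df)
pca.transform(df).select("pcaFeatures").show(false)
```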

<div class="codetabs">
<div data-lang="scala" markdown="1">

@@ -418,6 +428,8 @@ for more details on the API.

[Polynomial expansion](http://en.wikipedia.org/wiki/Polynomial_expansion) is the process of expanding your features into a polynomial space, which is formulated by an n-degree combination of original dimensions. A [PolynomialExpansion](api/scala/index.html#org.apache.spark.ml.feature.PolynomialExpansion) class provides this functionality. The example below shows how to expand your features into a 3-degree polynomial space.

**Examples**
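
A minimal Scala sketch (assumes a `SparkSession` named `spark`; the 2-dimensional inputs and degree are illustrative):

```scala
import org.apache.spark.ml.feature.PolynomialExpansion
import org.apache.spark.ml.linalg.Vectors

val df = spark.createDataFrame(Seq(
  Vectors.dense(2.0, 1.0),
  Vectors.dense(0.0, 0.0),
  Vectors.dense(3.0, -1.0)
).map(Tuple1.apply)).toDF("features")

// Expand (x, y) into all monomials of degree <= 3: x, x^2, x^3, y, xy, x^2y, ...
val polyExpansion = new PolynomialExpansion()
  .setInputCol("features").setOutputCol("polyFeatures").setDegree(3)
polyExpansion.transform(df).show(false)
```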

<div class="codetabs">
<div data-lang="scala" markdown="1">

@@ -458,6 +470,8 @@ for the transform is unitary. No shift is applied to the transformed
sequence (e.g. the $0$th element of the transformed sequence is the
$0$th DCT coefficient and _not_ the $N/2$th).

**Examples**
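
A minimal Scala sketch (assumes a `SparkSession` named `spark`; the input vectors are illustrative):

```scala
import org.apache.spark.ml.feature.DCT
import org.apache.spark.ml.linalg.Vectors

val df = spark.createDataFrame(Seq(
  Vectors.dense(0.0, 1.0, -2.0, 3.0),
  Vectors.dense(-1.0, 2.0, 4.0, -7.0),
  Vectors.dense(14.0, -2.0, -5.0, 1.0)
).map(Tuple1.apply)).toDF("features")

// Forward DCT-II of each row vector; setInverse(true) would apply the inverse (DCT-III).
val dct = new DCT().setInputCol("features").setOutputCol("featuresDCT").setInverse(false)
dct.transform(df).select("featuresDCT").show(false)
```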

<div class="codetabs">
<div data-lang="scala" markdown="1">

@@ -663,6 +677,8 @@ for more details on the API.

[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.

**Examples**
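
A minimal Scala sketch for the Spark 2.x API covered by this guide (assumes a `SparkSession` named `spark`; the category values are illustrative):

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val df = spark.createDataFrame(Seq(
  (0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")
)).toDF("id", "category")

// First map the string column to label indices, then one-hot encode those indices.
val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex").fit(df)
val indexed = indexer.transform(df)

val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
encoder.transform(indexed).show()
```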

<div class="codetabs">
<div data-lang="scala" markdown="1">

@@ -701,6 +717,8 @@ It can both automatically decide which features are categorical and convert orig

Indexing categorical features allows algorithms such as Decision Trees and Tree Ensembles to treat categorical features appropriately, improving performance.

**Examples**

In the example below, we read in a dataset of labeled points and then use `VectorIndexer` to decide which features should be treated as categorical. We transform the categorical feature values to their indices. This transformed data could then be passed to algorithms such as `DecisionTreeRegressor` that handle categorical features.
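
A minimal Scala sketch of that flow, assuming a `SparkSession` named `spark` and the bundled sample LibSVM file; `maxCategories` is illustrative:

```scala
import org.apache.spark.ml.feature.VectorIndexer

val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Features with at most 10 distinct values are treated as categorical and re-indexed.
val indexer = new VectorIndexer()
  .setInputCol("features").setOutputCol("indexed").setMaxCategories(10)
val indexerModel = indexer.fit(data)
println(s"Chose ${indexerModel.categoryMaps.size} categorical features")

indexerModel.transform(data).show()
```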

<div class="codetabs">
@@ -734,6 +752,8 @@ for more details on the API.

`Normalizer` is a `Transformer` which transforms a dataset of `Vector` rows, normalizing each `Vector` to have unit norm. It takes parameter `p`, which specifies the [p-norm](http://en.wikipedia.org/wiki/Norm_%28mathematics%29#p-norm) used for normalization. ($p = 2$ by default.) This normalization can help standardize your input data and improve the behavior of learning algorithms.

**Examples**

The following example demonstrates how to load a dataset in libsvm format and then normalize each row to have unit $L^1$ norm and unit $L^\infty$ norm.
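
A minimal Scala sketch (assumes a `SparkSession` named `spark` and the bundled sample LibSVM file):

```scala
import org.apache.spark.ml.feature.Normalizer

val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Normalize each row to unit L^1 norm.
val normalizer = new Normalizer().setInputCol("features").setOutputCol("normFeatures").setP(1.0)
val l1NormData = normalizer.transform(data)

// The same transformer can be re-parameterized per call, e.g. to use the L^inf norm instead.
val lInfNormData = normalizer.transform(data, normalizer.p -> Double.PositiveInfinity)
```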

<div class="codetabs">
@@ -774,6 +794,8 @@ for more details on the API.

Note that if the standard deviation of a feature is zero, it will return the default `0.0` value in the `Vector` for that feature.

**Examples**

The following example demonstrates how to load a dataset in libsvm format and then normalize each feature to have unit standard deviation.
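
A minimal Scala sketch (assumes a `SparkSession` named `spark` and the bundled sample LibSVM file):

```scala
import org.apache.spark.ml.feature.StandardScaler

val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Scale to unit standard deviation; centering is left off to preserve sparsity.
val scaler = new StandardScaler()
  .setInputCol("features").setOutputCol("scaledFeatures")
  .setWithStd(true).setWithMean(false)

val scalerModel = scaler.fit(data)
scalerModel.transform(data).show()
```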

<div class="codetabs">
@@ -819,6 +841,8 @@ For the case `$E_{max} == E_{min}$`, `$Rescaled(e_i) = 0.5 * (max + min)$`

Note that since zero values will probably be transformed to non-zero values, the output of the transformer will be a `DenseVector`, even for sparse input.

**Examples**

The following example demonstrates how to load a dataset in libsvm format and then rescale each feature to [0, 1].
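
A minimal Scala sketch (assumes a `SparkSession` named `spark` and the bundled sample LibSVM file):

```scala
import org.apache.spark.ml.feature.MinMaxScaler

val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Rescale each feature to the default [min, max] = [0, 1] range.
val scaler = new MinMaxScaler().setInputCol("features").setOutputCol("scaledFeatures")
val scalerModel = scaler.fit(data)
scalerModel.transform(data).show()
```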

<div class="codetabs">
@@ -860,6 +884,8 @@ data, and thus does not destroy any sparsity.
`MaxAbsScaler` computes summary statistics on a data set and produces a `MaxAbsScalerModel`. The
model can then transform each feature individually to range [-1, 1].

**Examples**

The following example demonstrates how to load a dataset in libsvm format and then rescale each feature to [-1, 1].
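
A minimal Scala sketch (assumes a `SparkSession` named `spark` and the bundled sample LibSVM file):

```scala
import org.apache.spark.ml.feature.MaxAbsScaler

val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Divide each feature by its maximum absolute value, mapping it into [-1, 1].
val scaler = new MaxAbsScaler().setInputCol("features").setOutputCol("scaledFeatures")
val scalerModel = scaler.fit(data)
scalerModel.transform(data).show()
```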

<div class="codetabs">
@@ -903,6 +929,8 @@ Note also that the splits that you provided have to be in strictly increasing or

More details can be found in the API docs for [Bucketizer](api/scala/index.html#org.apache.spark.ml.feature.Bucketizer).

**Examples**

The following example demonstrates how to bucketize a column of `Double`s into a column of bucket indices.
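
A minimal Scala sketch (assumes a `SparkSession` named `spark`; the split points and values are illustrative):

```scala
import org.apache.spark.ml.feature.Bucketizer

// Splits must be strictly increasing; -inf/+inf cover values outside the known range.
val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity)

val df = spark.createDataFrame(
  Seq(-0.5, -0.3, 0.0, 0.2).map(Tuple1.apply)
).toDF("features")

val bucketizer = new Bucketizer()
  .setInputCol("features").setOutputCol("bucketedFeatures").setSplits(splits)
bucketizer.transform(df).show()
```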

<div class="codetabs">
@@ -951,6 +979,8 @@ v_N
\end{pmatrix}
\]`

**Examples**

The example below demonstrates how to transform vectors using a transforming vector value.
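
A minimal Scala sketch (assumes a `SparkSession` named `spark`; the input vectors and the scaling vector are illustrative):

```scala
import org.apache.spark.ml.feature.ElementwiseProduct
import org.apache.spark.ml.linalg.Vectors

val df = spark.createDataFrame(Seq(
  ("a", Vectors.dense(1.0, 2.0, 3.0)),
  ("b", Vectors.dense(4.0, 5.0, 6.0))
)).toDF("id", "vector")

// Multiply each input vector element-wise by the fixed "weight" vector (0, 1, 2).
val transformer = new ElementwiseProduct()
  .setScalingVec(Vectors.dense(0.0, 1.0, 2.0))
  .setInputCol("vector").setOutputCol("transformedVector")

transformer.transform(df).show()
```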

<div class="codetabs">
4 changes: 2 additions & 2 deletions docs/ml-tuning.md
@@ -62,7 +62,7 @@ To help construct the parameter grid, users can use the [`ParamGridBuilder`](api

After identifying the best `ParamMap`, `CrossValidator` finally re-fits the `Estimator` using the best `ParamMap` and the entire dataset.

## Example: model selection via cross-validation
**Examples: model selection via cross-validation**

The following example demonstrates using `CrossValidator` to select from a grid of parameters.
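
A minimal Scala sketch, assuming a labeled `DataFrame` named `training` with `label` and `features` columns (the grid values are illustrative):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression().setMaxIter(10)

// A grid of 3 x 2 = 6 parameter settings to search over.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01, 0.001))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5))
  .build()

// 3-fold cross-validation; the best ParamMap is then re-fit on all of `training`.
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(training)
```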

@@ -102,7 +102,7 @@ It splits the dataset into these two parts using the `trainRatio` parameter. For

Like `CrossValidator`, `TrainValidationSplit` finally fits the `Estimator` using the best `ParamMap` and the entire dataset.

## Example: model selection via train validation split
**Examples: model selection via train validation split**
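
A minimal Scala sketch, assuming a `SparkSession` named `spark` and the bundled `data/mllib/sample_linear_regression_data.txt` file; the grid values are illustrative:

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

val data = spark.read.format("libsvm").load("data/mllib/sample_linear_regression_data.txt")
val Array(training, test) = data.randomSplit(Array(0.9, 0.1), seed = 12345)

val lr = new LinearRegression()
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

// 80% of `training` is used for fitting, 20% for validation;
// the best model is then re-fit on all of `training`.
val tvs = new TrainValidationSplit()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setTrainRatio(0.8)

val model = tvs.fit(training)
model.transform(test).select("features", "label", "prediction").show()
```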

<div class="codetabs">
