src/Microsoft.ML.FastTree/doc.xml
+58 lines changed: 58 additions & 0 deletions
@@ -73,6 +73,64 @@
       <para><a href='http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aos/1013203451'>Greedy function approximation: A gradient boosting machine</a></para>
     </remarks>
   </member>
+
+  <member name="TreeEnsembleFeaturizerTransform">
+    <summary>
+      Trains a tree ensemble, or loads it from a file, then maps a numeric feature vector
+      to three outputs:
+      <list>
+        <item>
+          <description>A vector containing the individual tree outputs of the tree ensemble.</description>
+        </item>
+        <item>
+          <description>A vector indicating the leaves that the feature vector falls on in the tree ensemble.</description>
+        </item>
+        <item>
+          <description>A vector indicating the paths that the feature vector falls on in the tree ensemble.</description>
+        </item>
+      </list>
+      If both a model file and a trainer are specified, the model file is used. If neither is specified,
+      a default FastTree model is trained.
+      The transform can handle key labels by training a regression model towards their optionally permuted indices.
+    </summary>
+    <remarks>
+      In machine learning, it is a common and powerful approach to use an already trained model when defining features.
+      <para>The most obvious example is to use the model's scores as features for downstream models. For example, we might run clustering on the original features,
+      and use the cluster distances as the new feature set.
+      Instead of consuming the model's output, we could go deeper and extract the 'intermediate outputs' that are used to produce the final score.</para>
+      There are a number of well-known examples of this technique:
+      <list>
+        <item>
+          <description>A deep neural net trained on the ImageNet dataset, with the last layer removed, is commonly used to compute the 'projection' of an image into the 'semantic feature space'.
+          It is observed that the Euclidean distance in this space often correlates with 'semantic similarity': that is, all pictures of pizza are located close together,
+          and far away from pictures of kittens.</description>
+        </item>
+        <item>
+          <description>A matrix factorization and/or LDA model is also often used to extract the 'latent topics' or 'latent features' associated with users and items.</description>
+        </item>
+        <item>
+          <description>The weights of a linear model are often used as a crude indicator of 'feature importance'. At the very minimum, the 0-weight features are not needed by the model,
+          and there is no reason to compute them.</description>
+        </item>
+      </list>
+      <para>The tree featurizer uses decision tree ensembles for feature engineering in the same fashion as described above.</para>
+      <para>Let's assume that we've built a tree ensemble of 100 trees with 100 leaves each (it doesn't matter whether or not boosting was used in training).
+      If we associate each leaf of each tree with a sequential integer, we can, for every incoming example x,
+      produce an indicator vector L(x), where Li(x) = 1 if the example x 'falls' into leaf #i, and 0 otherwise.</para>
+      <para>Thus, for every example x, we produce a 10000-valued vector L, with exactly 100 ones and the rest zeros.
+      This 'leaf indicator' vector can be considered the ensemble-induced 'footprint' of the example.</para>
+      <para>The 'distance' between two examples in the L-space is actually a Hamming distance, equal to twice the number of trees that send the two examples to different leaves.</para>
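Writing leaf_t(x) for the leaf of tree t into which example x falls (notation introduced here only to spell out the arithmetic for the 100-tree, 100-leaf ensemble above), the Hamming distance between two leaf-indicator vectors is

$$
d_H\bigl(L(x), L(y)\bigr) \;=\; \sum_{i=1}^{10000} \bigl|L_i(x) - L_i(y)\bigr| \;=\; 2 \cdot \#\{\, t : \mathrm{leaf}_t(x) \neq \mathrm{leaf}_t(y) \,\},
$$

since a tree that separates the two examples flips exactly two coordinates (one for each example's leaf), and a tree that keeps them together flips none.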
+      <para>We could repeat the same thought process for the non-leaf, or internal, nodes of the trees (we know that each tree has exactly 99 of them in our 100-leaf example),
+      and produce another indicator vector, N (size 9900), for each example, indicating the 'trajectory' of each example through each of the trees.</para>
+      <para>The distance in the combined 19900-dimensional LN-space grows with the number of 'decisions' in all trees that 'disagree' on the given pair of examples.</para>
+      <para>The TreeLeafFeaturizer also produces a third vector, T, defined as Ti(x) = the output of tree #i on example x.</para>
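To make the three outputs concrete, here is a minimal, self-contained C# sketch that computes the per-tree output vector T, the leaf-indicator vector L, and the path-indicator vector N for a single example. The `ToyTree` type and every member name below are illustrative assumptions for this sketch only; they are not FastTree or ML.NET APIs.

```csharp
using System;
using System.Collections.Generic;

// Toy binary decision tree: internal node i tests x[Feature[i]] <= Threshold[i];
// a negative child index encodes a leaf as ~leafIndex. Purely illustrative.
class ToyTree
{
    public int[] Feature;        // per internal node: feature index to test
    public double[] Threshold;   // per internal node: split threshold
    public int[] Left, Right;    // child node index, or ~leafIndex if a leaf
    public double[] LeafValue;   // per leaf: the tree's output

    public int NumInternalNodes => Feature.Length;
    public int NumLeaves => LeafValue.Length;

    // Returns the leaf index and records which internal nodes were visited.
    public int GetLeaf(double[] x, List<int> visitedInternalNodes)
    {
        int node = 0;
        while (true)
        {
            visitedInternalNodes.Add(node);
            int next = x[Feature[node]] <= Threshold[node] ? Left[node] : Right[node];
            if (next < 0) return ~next;   // reached a leaf
            node = next;
        }
    }
}

static class TreeFeaturizerSketch
{
    // Produces (T, L, N): per-tree outputs, leaf indicators, and path ("trajectory") indicators.
    public static (double[] T, double[] L, double[] N) Featurize(IReadOnlyList<ToyTree> ensemble, double[] x)
    {
        int totalLeaves = 0, totalInternal = 0;
        foreach (var t in ensemble) { totalLeaves += t.NumLeaves; totalInternal += t.NumInternalNodes; }

        var T = new double[ensemble.Count];
        var L = new double[totalLeaves];     // one slot per leaf of every tree
        var N = new double[totalInternal];   // one slot per internal node of every tree

        int leafOffset = 0, nodeOffset = 0;
        for (int i = 0; i < ensemble.Count; i++)
        {
            var visited = new List<int>();
            int leaf = ensemble[i].GetLeaf(x, visited);

            T[i] = ensemble[i].LeafValue[leaf];        // tree output
            L[leafOffset + leaf] = 1;                  // leaf indicator
            foreach (int node in visited)
                N[nodeOffset + node] = 1;              // path indicator

            leafOffset += ensemble[i].NumLeaves;
            nodeOffset += ensemble[i].NumInternalNodes;
        }
        return (T, L, N);
    }

    static void Main()
    {
        // A single depth-1 tree (1 internal node, 2 leaves) just to exercise the code.
        var stump = new ToyTree
        {
            Feature = new[] { 0 },
            Threshold = new[] { 0.5 },
            Left = new[] { ~0 },      // leaf 0
            Right = new[] { ~1 },     // leaf 1
            LeafValue = new[] { -1.0, 1.0 }
        };
        var (T, L, N) = Featurize(new[] { stump }, new[] { 0.7 });
        Console.WriteLine($"T = [{string.Join(", ", T)}], L = [{string.Join(", ", L)}], N = [{string.Join(", ", N)}]");
    }
}
```

For the 100-tree, 100-leaf ensemble described above, L would have length 10000 with exactly 100 ones, N would have length 9900, and T would have length 100.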

   internal const string Summary = "Trains a multiclass Naive Bayes predictor that supports binary feature values.";
-  internal const string Remarks = @"<remarks>
-  <a href='https://en.wikipedia.org/wiki/Naive_Bayes_classifier'>Naive Bayes</a> is a probabilistic classifier that can be used for multiclass problems.
-  Using Bayes' theorem, the conditional probability for a sample belonging to a class can be calculated based on the sample count for each feature combination groups.
-  However, Naive Bayes Classifier is feasible only if the number of features and the values each feature can take is relatively small.
-  It also assumes that the features are strictly independent.

+  Trains a multiclass Naive Bayes predictor that supports binary feature values.
+  </summary>
+  <remarks>
+  <a href='https://en.wikipedia.org/wiki/Naive_Bayes_classifier'>Naive Bayes</a> is a probabilistic classifier that can be used for multiclass problems.
+  Using Bayes' theorem, the conditional probability of a sample belonging to a class can be calculated based on the sample count for each feature combination group.
+  However, a Naive Bayes classifier is feasible only if the number of features and the number of values each feature can take are relatively small.
+  It assumes independence among the presence of features in a class even though they may be dependent on each other.
+  This multi-class trainer accepts binary feature values of type float, i.e., feature values are either true or false.
+  Specifically, a feature value greater than zero is treated as true.
+  This learner will request normalization from the data pipeline if the
+  classifier indicates it would benefit from it. Note that even if the
+  classifier indicates that it does not need caching, OVA will always
+  request caching, as it will be performing multiple passes over the data set.
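To illustrate the counting scheme described in the remarks above, here is a minimal binary-feature Naive Bayes sketch in C#. The class name, the add-one smoothing, and the log-probability scoring are assumptions made for this sketch; they are not claimed to match the ML.NET trainer's implementation.

```csharp
using System;
using System.Linq;

// Minimal binary-feature Naive Bayes: any feature value > 0 is treated as "true".
class BinaryNaiveBayesSketch
{
    readonly int _numClasses, _numFeatures;
    readonly int[] _classCounts;        // examples seen per class
    readonly int[,] _featureOnCounts;   // [class, feature] -> count of "true" occurrences

    public BinaryNaiveBayesSketch(int numClasses, int numFeatures)
    {
        _numClasses = numClasses;
        _numFeatures = numFeatures;
        _classCounts = new int[numClasses];
        _featureOnCounts = new int[numClasses, numFeatures];
    }

    // Training is just counting: per class, how often each feature is "on".
    public void Add(float[] features, int label)
    {
        _classCounts[label]++;
        for (int f = 0; f < _numFeatures; f++)
            if (features[f] > 0) _featureOnCounts[label, f]++;
    }

    // Picks argmax_c of log P(c) + sum_f log P(x_f | c), with add-one smoothing.
    public int Predict(float[] features)
    {
        int total = _classCounts.Sum();
        int best = -1;
        double bestScore = double.NegativeInfinity;
        for (int c = 0; c < _numClasses; c++)
        {
            double score = Math.Log((_classCounts[c] + 1.0) / (total + _numClasses));
            for (int f = 0; f < _numFeatures; f++)
            {
                double pOn = (_featureOnCounts[c, f] + 1.0) / (_classCounts[c] + 2.0);
                score += Math.Log(features[f] > 0 ? pOn : 1.0 - pOn);
            }
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;
    }

    static void Main()
    {
        var nb = new BinaryNaiveBayesSketch(numClasses: 2, numFeatures: 2);
        nb.Add(new float[] { 1, 0 }, label: 0);
        nb.Add(new float[] { 1, 0 }, label: 0);
        nb.Add(new float[] { 0, 1 }, label: 1);
        nb.Add(new float[] { 0, 1 }, label: 1);
        Console.WriteLine(nb.Predict(new float[] { 1, 0 })); // expected: 0
    }
}
```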

+  In this strategy, a binary classification algorithm is used to train one classifier for each class, which distinguishes that class from all other classes.
+  Prediction is then performed by running these binary classifiers and choosing the prediction with the highest confidence score.
+  </summary>
+  <remarks>
+  <para>This algorithm can be treated as a wrapper for all the binary classifiers in ML.NET.
+  A few binary classifiers already have implementations for multi-class problems,
+  so users can choose either one depending on the context.
+  </para>
+  <para>
+  The OVA version of a binary classifier, such as wrapping a LightGbmBinaryClassifier,
+  can be different from LightGbmClassifier, which develops a multi-class classifier directly.
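A minimal sketch of the one-versus-all strategy itself, assuming a hypothetical `IBinaryScorer` abstraction in place of any concrete ML.NET binary trainer (all type names below are illustrative): one binary scorer is trained per class on relabeled data, and prediction takes the class whose scorer reports the highest confidence.

```csharp
using System;
using System.Linq;

// Hypothetical abstraction for any binary classifier that outputs a confidence score.
interface IBinaryScorer
{
    void Train(float[][] features, bool[] labels);
    double Score(float[] features);   // larger means "more likely positive"
}

class OneVersusAllSketch
{
    readonly Func<IBinaryScorer> _factory;
    IBinaryScorer[] _scorers;

    public OneVersusAllSketch(Func<IBinaryScorer> binaryScorerFactory)
        => _factory = binaryScorerFactory;

    // Trains one binary classifier per class: "this class" vs. "every other class".
    public void Train(float[][] features, int[] labels, int numClasses)
    {
        _scorers = new IBinaryScorer[numClasses];
        for (int c = 0; c < numClasses; c++)
        {
            bool[] binaryLabels = labels.Select(y => y == c).ToArray();
            _scorers[c] = _factory();
            _scorers[c].Train(features, binaryLabels);
        }
    }

    // Runs all binary classifiers and returns the class with the highest confidence.
    public int Predict(float[] features)
    {
        int best = 0;
        double bestScore = double.NegativeInfinity;
        for (int c = 0; c < _scorers.Length; c++)
        {
            double score = _scorers[c].Score(features);
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;
    }
}

// Trivial concrete scorer used only so the sketch runs end to end:
// scores by negative Euclidean distance to the mean of the positive examples.
class CentroidScorer : IBinaryScorer
{
    float[] _centroid;

    public void Train(float[][] features, bool[] labels)
    {
        var pos = features.Where((_, i) => labels[i]).ToArray();
        _centroid = new float[features[0].Length];
        foreach (var x in pos)
            for (int j = 0; j < x.Length; j++) _centroid[j] += x[j] / pos.Length;
    }

    public double Score(float[] x)
        => -Math.Sqrt(x.Zip(_centroid, (a, b) => (a - b) * (a - b)).Sum());
}

class Program
{
    static void Main()
    {
        var ova = new OneVersusAllSketch(() => new CentroidScorer());
        var X = new[] { new float[] { 0, 0 }, new float[] { 0, 1 }, new float[] { 5, 5 }, new float[] { 5, 6 } };
        var y = new[] { 0, 0, 1, 1 };
        ova.Train(X, y, numClasses: 2);
        Console.WriteLine(ova.Predict(new float[] { 4.5f, 5.5f })); // expected: 1
    }
}
```

In practice the factory would construct whichever binary learner is being wrapped; the `CentroidScorer` stand-in exists only to make the example executable.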