|
95 | 95 | <para>Generally, ensemble models provide better coverage and accuracy than single decision trees. |
96 | 96 | Each tree in a decision forest outputs a Gaussian distribution as its prediction, and these per-tree distributions are aggregated over the ensemble.</para>
97 | 97 | <para>For more information, see:</para>
98 | | - <list> |
| 98 | + <list type='bullet'> |
99 | 99 | <item><description><a href='http://en.wikipedia.org/wiki/Random_forest'>Wikipedia: Random forest</a></description></item> |
100 | 100 | <item><description><a href='http://jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf'>Quantile regression forest</a></description></item> |
101 | 101 | <item><description><a href='https://blogs.technet.microsoft.com/machinelearning/2014/09/10/from-stumps-to-trees-to-forests/'>From Stumps to Trees to Forests</a></description></item> |
|
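To make the claim about per-tree outputs concrete, here is a minimal sketch using scikit-learn's RandomForestRegressor as a stand-in for the ML.NET trainer; the library, dataset, and parameters are illustrative assumptions, not the ML.NET API.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Toy data; dataset and parameters are made up for illustration.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# An ensemble of 100 independently grown trees.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Per-tree predictions for one example: the spread across trees gives a
# crude distributional view of the ensemble's prediction.
x = X[:1]
per_tree = np.array([tree.predict(x)[0] for tree in forest.estimators_])
print("ensemble mean:", per_tree.mean(), "spread across trees:", per_tree.std())
```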
146 | 146 | <summary> |
147 | 147 | Trains a tree ensemble, or loads it from a file, then maps a numeric feature vector |
148 | 148 | to three outputs: |
149 | | - <list> |
| 149 | + <list type='number'> |
150 | 150 | <item><description>A vector containing the individual tree outputs of the tree ensemble.</description></item> |
151 | 151 | <item><description>A vector indicating the leaves that the feature vector falls on in the tree ensemble.</description></item> |
152 | 152 | <item><description>A vector indicating the paths that the feature vector falls on in the tree ensemble.</description></item> |
|
157 | 157 | </summary> |
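The three outputs listed in the summary map naturally onto operations most tree-ensemble libraries expose. A minimal sketch with scikit-learn, shown purely for illustration (this is not the ML.NET API; data and parameters are assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
forest = RandomForestRegressor(n_estimators=100, max_leaf_nodes=100,
                               random_state=0).fit(X, y)
x = X[:1]  # one numeric feature vector

# 1. Individual tree outputs (the T vector discussed in the remarks).
tree_outputs = np.array([t.predict(x)[0] for t in forest.estimators_])

# 2. The leaf each tree routes the example to (basis of the leaf vector L).
leaf_ids = forest.apply(x)                     # shape (1, n_trees)

# 3. The nodes the example visits in every tree (basis of the path vector N).
node_indicator, _ = forest.decision_path(x)    # sparse, (1, total node count)

print(tree_outputs.shape, leaf_ids.shape, node_indicator.shape)
```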
158 | 158 | <remarks> |
159 | 159 | In machine learning, a common and powerful approach is to use an already trained model when defining new features.
160 | | - <para>One such example would be the use of model's scores as features to downstream models. For example, we might run clustering on the original features, |
| 160 | + <para>One such example is the use of a model's scores as features for downstream models. For instance, we might run clustering on the original features,
161 | 161 | and use the cluster distances as the new feature set. |
162 | | - Instead of consuming the model's output, we could go deeper, and extract the 'intermediate outputs' that are used to produce the final score. </para> |
| 162 | + Instead of consuming the model's final output, we could go deeper and extract the 'intermediate outputs' that are used to produce the final score.</para>
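The clustering example just mentioned could look like the following sketch; scikit-learn, the dataset, and the downstream learner are all illustrative assumptions rather than anything prescribed by this documentation.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Cluster the original features; the distances to the 10 cluster centers
# become a new 10-dimensional feature set.
kmeans = KMeans(n_clusters=10, random_state=0).fit(X)
cluster_distances = kmeans.transform(X)        # shape (n_samples, 10)

# The downstream model consumes the cluster distances instead of raw features.
downstream = LogisticRegression(max_iter=1000).fit(cluster_distances, y)
print("training accuracy:", downstream.score(cluster_distances, y))
```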
163 | 163 | There are a number of famous or popular examples of this technique: |
164 | | - <list> |
165 | | - <item><description>A deep neural net trained on the ImageNet dataset, with the last layer removed, is commonly used to compute the 'projection' of the image into the 'semantic feature space'. |
166 | | - It is observed that the Euclidean distance in this space often correlates with the 'semantic similarity': that is, all pictures of pizza are located close together, |
| 164 | + <list type='bullet'> |
| 165 | + <item><description>A deep neural net trained on the ImageNet dataset, with the last layer removed, is commonly used to compute the 'projection' of the image into the 'semantic feature space'. |
| 166 | + It is observed that the Euclidean distance in this space often correlates with the 'semantic similarity': that is, all pictures of pizza are located close together, |
167 | 167 | and far away from pictures of kittens. </description></item> |
168 | | - <item><description>A matrix factorization and/or LDA model is also often used to extract the 'latent topics' or 'latent features' associated with users and items.</description></item> |
169 | | - <item><description>The weights of the linear model are often used as a crude indicator of 'feature importance'. At the very minimum, the 0-weight features are not needed by the model, |
170 | | - and there's no reason to compute them. </description></item> |
| 168 | + <item><description>A matrix factorization and/or LDA model is also often used to extract the 'latent topics' or 'latent features' associated with users and items.</description></item> |
| 169 | + <item><description>The weights of the linear model are often used as a crude indicator of 'feature importance'. At the very minimum, the 0-weight features are not needed by the model, |
| 170 | + and there's no reason to compute them. </description></item> |
171 | 171 | </list> |
172 | 172 | <para>The tree featurizer uses decision tree ensembles for feature engineering in the same fashion as above.</para>
173 | | - <para>Let's assume that we've built a tree ensemble of 100 trees with 100 leaves each (it doesn't matter whether boosting was used or not in training). |
| 173 | + <para>Let's assume that we've built a tree ensemble of 100 trees with 100 leaves each (it doesn't matter whether boosting was used or not in training). |
174 | 174 | If we associate each leaf of each tree with a sequential integer, we can, for every incoming example x, |
175 | | - produce an indicator vector L(x), where Li(x) = 1 if the example x 'falls' into the leaf #i, and 0 otherwise.</para> |
| 175 | + produce an indicator vector L(x), where Li(x) = 1 if the example x 'falls' into the leaf #i, and 0 otherwise.</para> |
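A sketch of building such a leaf-indicator vector, again using scikit-learn as an illustrative stand-in (not the ML.NET featurizer itself; dataset and parameters are assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder

X, y = make_regression(n_samples=2000, n_features=10, random_state=0)
forest = RandomForestRegressor(n_estimators=100, max_leaf_nodes=100,
                               random_state=0).fit(X, y)

# Leaf id reached in every tree, for every example: shape (n_samples, 100).
leaf_ids = forest.apply(X)

# One-hot encoding per tree numbers every leaf of every tree and yields the
# indicator vector L(x) described above (up to 100 * 100 = 10000 columns).
encoder = OneHotEncoder(handle_unknown="ignore")
L = encoder.fit_transform(leaf_ids)

# Exactly one leaf per tree fires, so every row contains exactly 100 ones.
assert L[0].sum() == forest.n_estimators
```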
176 | 176 | <para>Thus, for every example x, we produce a 10000-dimensional vector L, with exactly 100 1s and the rest zeroes.
177 | | - This 'leaf indicator' vector can be considered the ensemble-induced 'footprint' of the example.</para> |
178 | | - <para>The 'distance' between two examples in the L-space is actually a Hamming distance, and is equal to the number of trees that do not distinguish the two examples.</para> |
| 177 | + This 'leaf indicator' vector can be considered the ensemble-induced 'footprint' of the example.</para> |
| 178 | + <para>The 'distance' between two examples in the L-space is actually a Hamming distance, equal to twice the number of trees that place the two examples in different leaves (each such tree contributes two mismatched positions).</para>
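The relationship between the Hamming distance of two footprints and the number of trees that distinguish the two examples can be checked directly; this sketch continues the scikit-learn stand-in above and is self-contained, with made-up data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder

X, y = make_regression(n_samples=2000, n_features=10, random_state=0)
forest = RandomForestRegressor(n_estimators=100, max_leaf_nodes=100,
                               random_state=0).fit(X, y)

leaf_ids = forest.apply(X)                          # (n_samples, 100)
L = OneHotEncoder().fit_transform(leaf_ids)         # leaf-indicator footprints

# Hamming distance between the footprints of the first two examples.
l0, l1 = L[0].toarray().ravel(), L[1].toarray().ravel()
hamming = int((l0 != l1).sum())

# Every tree that sends the two examples to different leaves contributes two
# mismatched positions, so the distance is twice the number of such trees.
trees_that_distinguish = int((leaf_ids[0] != leaf_ids[1]).sum())
assert hamming == 2 * trees_that_distinguish
```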
179 | 179 | <para>We could repeat the same thought process for the non-leaf, or internal, nodes of the trees (we know that each tree has exactly 99 of them in our 100-leaf example), |
180 | | - and produce another indicator vector, N (size 9900), for each example, indicating the 'trajectory' of each example through each of the trees.</para> |
181 | | - <para>The distance in the combined 19900-dimensional LN-space will be equal to the number of 'decisions' in all trees that 'agree' on the given pair of examples.</para> |
| 180 | + and produce another indicator vector, N (size 9900), for each example, indicating the 'trajectory' of each example through each of the trees.</para> |
| 181 | + <para>The distance in the combined 19900-dimensional LN-space will be equal to the number of node visits, across all trees, on which the two examples' paths 'disagree'.</para>
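A sketch of the combined representation, again with scikit-learn as an assumed stand-in; note that scikit-learn's decision_path marks leaf nodes as well as internal ones, so this only approximates the internal-nodes-only N vector described above.

```python
import scipy.sparse as sp
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder

X, y = make_regression(n_samples=2000, n_features=10, random_state=0)
forest = RandomForestRegressor(n_estimators=100, max_leaf_nodes=100,
                               random_state=0).fit(X, y)

# L: leaf-indicator block, as before.
L = OneHotEncoder().fit_transform(forest.apply(X))

# Node indicator: the nodes each example passes through in every tree
# (includes leaf nodes, unlike the N vector defined in the text).
N, _ = forest.decision_path(X)

# Concatenate into a combined LN-style representation and compare the paths
# of the first two examples.
LN = sp.hstack([L, N]).tocsr()
row0, row1 = LN[0].toarray().ravel(), LN[1].toarray().ravel()
print("LN-space Hamming distance:", int((row0 != row1).sum()))
```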
182 | 182 | <para>The TreeLeafFeaturizer also produces a third vector, T, defined as Ti(x) = the output of tree #i on example x.</para>
183 | 183 | </remarks> |
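Putting the pieces of the remarks together, the T and L blocks (and optionally N) can be concatenated and fed to a downstream learner, which is the essence of tree-based featurization. The sketch below uses scikit-learn and a Ridge model purely as illustrative assumptions, not as the ML.NET transform.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.preprocessing import OneHotEncoder

X, y = make_regression(n_samples=2000, n_features=10, random_state=0)
forest = RandomForestRegressor(n_estimators=100, max_leaf_nodes=100,
                               random_state=0).fit(X, y)

# T: one column per tree, holding that tree's output for every example.
T = np.column_stack([t.predict(X) for t in forest.estimators_])

# L: leaf-indicator block.
L = OneHotEncoder().fit_transform(forest.apply(X))

# A downstream linear model trained on the tree-derived features.
features = sp.hstack([sp.csr_matrix(T), L]).tocsr()
print("R^2 on training data:", Ridge().fit(features, y).score(features, y))
```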
184 | 184 | <example> |
|