Commit ecc6857

Adding documentation about entry points, and entry points graphs: EntryPoints.md and GraphRunner.md (#295)
* Adding EntryPoints.md and GraphRunner.md * addressing PR feedback * Updating the title of the GraphRunner.md file * adressing Tom's feedback * adressing feedback * code formatting for class names * Addressing Gal's comments * Adding an example of an entry point. Fixing casing on ML.NET * fixing link
1 parent 496d3b9 commit ecc6857

2 files changed

+359
-0
lines changed

docs/code/EntryPoints.md

Lines changed: 231 additions & 0 deletions
@@ -0,0 +1,231 @@
# Entry Points And Helper Classes

## Overview

Entry points are a way to interface with ML.NET components by specifying an execution graph of connected inputs and outputs of those components.
Both the manifest describing available components and their inputs/outputs, and an "experiment" graph description, are expressed in JSON.
The recommended way of interacting with ML.NET from non-.NET programming languages is to compose and exchange pipelines or experiment graphs.

Throughout the documentation, we also refer to entry points as 'entry point nodes', because they are the nodes of the graph representing the experiment.
The graph 'variables', which appear as values of the experiment graph's JSON properties, describe the relationships between the entry point nodes.
The 'variables' are therefore the edges of the DAG (Directed Acyclic Graph).

All ML.NET entry points are described by their manifest. The manifest is another JSON object that documents and describes the structure of an entry point.
Consult the manifest to understand what an entry point does and how it should be constructed in a graph.

This document briefly describes the structure of entry points and of an entry point manifest, and mentions the ML.NET classes that help construct an entry point graph.

## EntryPoint manifest - the definition of an entry point

The components manifest is built by scanning the ML.NET assemblies through reflection and searching for types that have the `SignatureEntryPointModule` signature in their `LoadableClass` assembly attribute definition.
An example of an entry point manifest object, specifically for the `ColumnTypeConverter` transform, is:

```javascript
{
  "Name": "Transforms.ColumnTypeConverter",
  "Desc": "Converts a column to a different type, using standard conversions.",
  "FriendlyName": "Convert Transform",
  "ShortName": "Convert",
  "Inputs": [
    {
      "Name": "Column",
      "Type": {
        "Kind": "Array",
        "ItemType": {
          "Kind": "Struct",
          "Fields": [
            {
              "Name": "ResultType",
              "Type": {
                "Kind": "Enum",
                "Values": [ "I1","I2","U2","I4","U4","I8","U8","R4","Num","R8","TX","Text","TXT","BL","Bool","TimeSpan","TS","DT","DateTime","DZ","DateTimeZone","UG","U16" ]
              },
              "Desc": "The result type",
              "Aliases": [ "type" ],
              "Required": false,
              "SortOrder": 150,
              "IsNullable": true,
              "Default": null
            },
            {
              "Name": "Range",
              "Type": "String",
              "Desc": "For a key column, this defines the range of values",
              "Aliases": [ "key" ],
              "Required": false,
              "SortOrder": 150,
              "IsNullable": false,
              "Default": null
            },
            {
              "Name": "Name",
              "Type": "String",
              "Desc": "Name of the new column",
              "Aliases": [ "name" ],
              "Required": false,
              "SortOrder": 150,
              "IsNullable": false,
              "Default": null
            },
            {
              "Name": "Source",
              "Type": "String",
              "Desc": "Name of the source column",
              "Aliases": [ "src" ],
              "Required": false,
              "SortOrder": 150,
              "IsNullable": false,
              "Default": null
            }
          ]
        }
      },
      "Desc": "New column definition(s) (optional form: name:type:src)",
      "Aliases": [ "col" ],
      "Required": true,
      "SortOrder": 1,
      "IsNullable": false
    },
    {
      "Name": "Data",
      "Type": "DataView",
      "Desc": "Input dataset",
      "Required": true,
      "SortOrder": 2,
      "IsNullable": false
    },
    {
      "Name": "ResultType",
      "Type": {
        "Kind": "Enum",
        "Values": [ "I1","I2","U2","I4","U4","I8","U8","R4","Num","R8","TX","Text","TXT","BL","Bool","TimeSpan","TS","DT","DateTime","DZ","DateTimeZone","UG","U16" ]
      },
      "Desc": "The result type",
      "Aliases": [ "type" ],
      "Required": false,
      "SortOrder": 2,
      "IsNullable": true,
      "Default": null
    },
    {
      "Name": "Range",
      "Type": "String",
      "Desc": "For a key column, this defines the range of values",
      "Aliases": [ "key" ],
      "Required": false,
      "SortOrder": 150,
      "IsNullable": false,
      "Default": null
    }
  ],
  "Outputs": [
    {
      "Name": "OutputData",
      "Type": "DataView",
      "Desc": "Transformed dataset"
    },
    {
      "Name": "Model",
      "Type": "TransformModel",
      "Desc": "Transform model"
    }
  ],
  "InputKind": [ "ITransformInput" ],
  "OutputKind": [ "ITransformOutput" ]
}
```

The corresponding entry point node, constructed based on this manifest, would be:

```javascript
{
  "Name": "Transforms.ColumnTypeConverter",
  "Inputs": {
    "Column": [{
      "Name": "Features",
      "Source": "Features"
    }],
    "Data": "$data0",
    "ResultType": "R4"
  },
  "Outputs": {
    "OutputData": "$Convert_Output",
    "Model": "$Convert_TransformModel"
  }
}
```
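
As a rough illustration of how a manifest can guide the construction of a node, the sketch below checks that a node supplies every input the manifest marks as `"Required": true` and uses no input names the manifest does not list. The helper `checkNodeAgainstManifest` is hypothetical and not part of ML.NET, the manifest is abbreviated to the fields the check needs, and aliases such as `col` are ignored.

```javascript
// Hypothetical helper (not an ML.NET API): compares a graph node with the
// manifest entry of the same name and reports missing or unknown inputs.
function checkNodeAgainstManifest(manifest, node) {
  const problems = [];
  if (manifest.Name !== node.Name) {
    problems.push(`node is '${node.Name}' but manifest is '${manifest.Name}'`);
  }
  const supplied = node.Inputs || {};
  const known = new Set(manifest.Inputs.map((input) => input.Name));

  // Every required manifest input must be present on the node.
  for (const input of manifest.Inputs) {
    if (input.Required && !(input.Name in supplied)) {
      problems.push(`missing required input '${input.Name}'`);
    }
  }
  // Every input on the node must be declared by the manifest.
  for (const name of Object.keys(supplied)) {
    if (!known.has(name)) {
      problems.push(`unknown input '${name}'`);
    }
  }
  return problems;
}

// Abbreviated manifest: only the fields used by the check.
const converterManifest = {
  Name: "Transforms.ColumnTypeConverter",
  Inputs: [
    { Name: "Column", Required: true },
    { Name: "Data", Required: true },
    { Name: "ResultType", Required: false },
    { Name: "Range", Required: false },
  ],
};

const converterNode = {
  Name: "Transforms.ColumnTypeConverter",
  Inputs: {
    Column: [{ Name: "Features", Source: "Features" }],
    Data: "$data0",
    ResultType: "R4",
  },
  Outputs: { OutputData: "$Convert_Output", Model: "$Convert_TransformModel" },
};

console.log(checkNodeAgainstManifest(converterManifest, converterNode).length); // 0
```

A check like this is only a convenience for tools that generate graphs from other languages; ML.NET performs its own validation when the graph is loaded.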

## `EntryPointGraph`

This class encapsulates the list of nodes (`EntryPointNode`) and edges
(`EntryPointVariable` inside a `RunContext`) of the graph.

## `EntryPointNode`

This class represents a node in the graph, and wraps an entry point call. It
has methods for creating and running entry points. It also has a reference to
the `RunContext` to allow it to get and set values from `EntryPointVariable`s.

To express the inputs that are set through variables, a set of dictionaries
is used. The `InputBindingMap` maps an input parameter name to a list of
`ParameterBinding`s. The `InputMap` maps a `ParameterBinding` to a
`VariableBinding`. For example, if the JSON looks like this:

```javascript
"foo": "$bar"
```

the `InputBindingMap` will have one entry that maps the string "foo" to a list
that has only one element, a `SimpleParameterBinding` with the name "foo", and
the `InputMap` will map the `SimpleParameterBinding` to a
`SimpleVariableBinding` with the name "bar". For a more complicated example,
given this JSON:

```javascript
"foo": [ "$bar[3]", "$baz" ]
```

the `InputBindingMap` will have one entry that maps the string "foo" to a list
that has two elements, an `ArrayIndexParameterBinding` with the name "foo" and
index 0 and another one with index 1. The `InputMap` will map the first
`ArrayIndexParameterBinding` to an `ArrayIndexVariableBinding` with name "bar"
and index 3 and the second `ArrayIndexParameterBinding` to a
`SimpleVariableBinding` with the name "baz".

For outputs, a node assumes that an output is mapped to a variable, so the
`OutputMap` is a simple dictionary from string to string.
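
The two maps described above can be pictured with a small sketch. It uses plain JavaScript objects in place of the ML.NET binding classes, so the functions `parseVariableBinding` and `buildInputMaps`, the `kind` tags, and the accepted variable-name pattern are all assumptions made for illustration; the actual C# implementation may differ.

```javascript
// Illustrative sketch only: plain objects stand in for the ML.NET binding
// classes. Parses "$bar", "$bar[3]" or "$bar[key]" into a variable binding.
function parseVariableBinding(text) {
  const match = /^\$([A-Za-z_][A-Za-z0-9_]*)(?:\[([^\]]+)\])?$/.exec(text);
  if (!match) {
    throw new Error(`'${text}' is not a variable reference`);
  }
  const [, name, selector] = match;
  if (selector === undefined) {
    return { kind: "SimpleVariableBinding", name };
  }
  if (/^\d+$/.test(selector)) {
    return { kind: "ArrayIndexVariableBinding", name, index: Number(selector) };
  }
  // Accept both $var[foo] and $var['foo'] by stripping optional quotes.
  const key = selector.replace(/^'(.*)'$/, "$1");
  return { kind: "DictionaryKeyVariableBinding", name, key };
}

// Builds both maps for the inputs of one node. Only inputs that refer to
// variables (strings starting with "$") produce bindings.
function buildInputMaps(inputs) {
  const inputBindingMap = new Map(); // parameter name -> list of parameter bindings
  const inputMap = new Map(); // parameter binding -> variable binding

  const bind = (paramName, paramBinding, value) => {
    if (typeof value !== "string" || !value.startsWith("$")) return;
    if (!inputBindingMap.has(paramName)) inputBindingMap.set(paramName, []);
    inputBindingMap.get(paramName).push(paramBinding);
    inputMap.set(paramBinding, parseVariableBinding(value));
  };

  for (const [name, value] of Object.entries(inputs)) {
    if (Array.isArray(value)) {
      value.forEach((item, index) =>
        bind(name, { kind: "ArrayIndexParameterBinding", name, index }, item));
    } else {
      bind(name, { kind: "SimpleParameterBinding", name }, value);
    }
  }
  return { inputBindingMap, inputMap };
}

// The second example from the text: 'foo' bound to two variables.
const { inputBindingMap, inputMap } = buildInputMaps({ foo: ["$bar[3]", "$baz"] });
for (const binding of inputBindingMap.get("foo")) {
  console.log(binding, "->", inputMap.get(binding));
}
```

Keying `inputMap` by the binding object itself mirrors the description of a map from `ParameterBinding` to `VariableBinding`; a real implementation would need value-based equality for those keys.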

## `EntryPointVariable`

This class represents an edge in the entry point graph. It has a name, a type
and a value. Variables can be simple, arrays and/or dictionaries. Currently,
only data views, file handles, predictor models and transform models are
allowed as element types for a variable.

## `RunContext`

This class is just a container for all the variables in a graph.

## `VariableBinding` and Derived Classes

The abstract base class represents a "pointer to a (part of a) variable". It
is used in conjunction with `ParameterBinding`s to specify inputs to an entry
point node. The `SimpleVariableBinding` is a pointer to an entire variable,
the `ArrayIndexVariableBinding` is a pointer to a specific index in an array
variable, and the `DictionaryKeyVariableBinding` is a pointer to a specific
key in a dictionary variable.

## `ParameterBinding` and Derived Classes

The abstract base class represents a "pointer to a (part of a) parameter". It
parallels the `VariableBinding` hierarchy and it is used to specify the inputs
to an entry point node. The `SimpleParameterBinding` is a pointer to a
non-array, non-dictionary parameter, the `ArrayIndexParameterBinding` is a
pointer to a specific index of an array parameter and the
`DictionaryKeyParameterBinding` is a pointer to a specific key of a dictionary
parameter.

## How to create an entry point for an existing ML.NET component

To create an entry point for an existing ML.NET component:
1. Add the `SignatureEntryPointModule` signature to the `LoadableClass` assembly attribute.
2. Create a public static method that:
    a. Takes as input, among others, an object representing the arguments of the component you want to expose.
    b. Initializes and runs the component, returning one of the nested classes of `Microsoft.ML.Runtime.EntryPoints.CommonOutputs`.
    c. Is annotated with the `TlcModule.EntryPoint` attribute.

Depending on the type of entry point being created, there are further conventions for the name of the method. For example, trainer entry points are typically called 'TrainMultiClass', 'TrainBinary', etc., based on the task.
Look at [OnlineGradientDescent](../../src/Microsoft.ML.StandardLearners/Standard/Online/OnlineGradientDescent.cs) for an example of a component and its entry point.

docs/code/GraphRunner.md

Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
# Entry Point JSON Graph format

The entry point graph in ML.NET is an array of _nodes_. More information about the definition of entry points and the classes that help construct entry point graphs
can be found in the [EntryPoints.md document](./EntryPoints.md).

Each node is an object with the following fields:

- _name_: string. Required. Name of the entry point.
- _inputs_: object. Optional. Specifies non-default inputs to the entry point.
  Note that if the entry point has required inputs (which is very common), the _inputs_ field is required.
- _outputs_: object. Optional. Specifies the variables that will hold the node's outputs.

## Input and output types
The following types are supported in JSON graphs:

- `string`. Represented as a JSON string, maps to a C# string.
- `float`. Represented as a JSON number, maps to a C# float or double.
- `bool`. Represented as a JSON boolean (`true` or `false`), maps to a C# bool.
- `enum`. Represented as a JSON string, maps to a C# enum. The allowed values are those of the C# enum (they are also listed in the manifest).
- `int`. Represented as a JSON number with an integer value, maps to a C# int or long.
- `array` of the above. Represented as a JSON array, maps to a C# array.
- `dictionary`. Currently not implemented. Represented as a JSON object, maps to a C# `Dictionary<string,T>`.
- `component`. Represented as a JSON object with 2 fields: _name_:string and _settings_:object.
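
To make the mapping above concrete, here is a minimal sketch of a checker that tests whether a JSON literal fits a manifest-style type. The function `matchesType` is hypothetical. It matches scalar type names case-insensitively against the list above, which is an assumption about how manifests spell them, and it covers only `string`, `float`, `bool`, `int`, `enum` and `array`.

```javascript
// Illustrative only: checks a JSON literal against a manifest-style type.
// A type is either a name such as "String" or an object such as
// { "Kind": "Enum", "Values": [...] } or { "Kind": "Array", "ItemType": ... }.
function matchesType(type, value) {
  if (typeof type === "string") {
    switch (type.toLowerCase()) {
      case "string": return typeof value === "string";
      case "bool": return typeof value === "boolean";
      case "int": return Number.isInteger(value);
      case "float": return typeof value === "number" && Number.isFinite(value);
      default: return false; // e.g. DataView, which must be passed as a variable
    }
  }
  if (type === null || typeof type !== "object") return false;

  // Accept both "Kind" and "kind", since the manifests in these documents
  // use both spellings.
  const kind = String(type.Kind || type.kind || "").toLowerCase();
  if (kind === "enum") {
    const values = type.Values || type.values || [];
    return typeof value === "string" && values.includes(value);
  }
  if (kind === "array") {
    const itemType = type.ItemType || type.itemType;
    return Array.isArray(value) && value.every((item) => matchesType(itemType, item));
  }
  return false; // struct, dictionary and component are out of scope here
}

console.log(matchesType("Int", 2)); // true
console.log(matchesType("Int", 2.5)); // false
console.log(matchesType({ Kind: "Enum", Values: ["R4", "R8"] }, "R4")); // true
console.log(matchesType({ kind: "Array", itemType: "String" }, ["a", 1])); // false
```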

## Variables
The following input/output types cannot be represented as a JSON value:
- `IDataView`
- `IFileHandle`
- `ITransformModel`
- `IPredictorModel`

These must be passed as _variables_. A variable is represented as a JSON string that begins with `$`.
Note the following rules:

- A variable can appear in the _outputs_ only once per graph. That is, the variable can be 'assigned' only once.
- If the variable is present in the _inputs_ of one node and in the _outputs_ of another node, this signifies a graph 'edge'.
  The same variable can participate in many edges.
- If the variable is present only in _inputs_, but never in _outputs_, it is a _graph input_. All graph inputs must be provided before
  a graph can be run.
- The variable has a type, which is the type of the inputs (and, optionally, the output) that it appears in. If the type of the variable is
  ambiguous, ML.NET throws an exception.
- Circular references. The experiment graph is expected to be a DAG. If a circular dependency is detected, ML.NET throws an exception.
  _Currently, this is done lazily: if we couldn't ever run a node because it's waiting for inputs, we throw._
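
The rules above can be sketched in a few lines. This is not the ML.NET runner: `analyzeGraph` and `variableNames` are hypothetical names, type checking is omitted, and a reference such as `$var[5]` is reduced to its variable name. The sketch enforces single assignment, collects graph inputs, and detects cycles lazily by scheduling nodes until no further progress is possible.

```javascript
// Illustrative sketch, not the ML.NET runner. Extracts the variable names
// referenced by one input or output value (a string or an array of strings).
function variableNames(value) {
  const values = Array.isArray(value) ? value : [value];
  return values
    .filter((v) => typeof v === "string" && v.startsWith("$"))
    .map((v) => v.slice(1).replace(/\[.*\]$/, ""));
}

function analyzeGraph(nodes) {
  // Rule: a variable may be assigned (appear in outputs) only once.
  const producedBy = new Map(); // variable name -> index of the producing node
  nodes.forEach((node, index) => {
    for (const value of Object.values(node.outputs || {})) {
      for (const name of variableNames(value)) {
        if (producedBy.has(name)) {
          throw new Error(`variable '$${name}' is assigned more than once`);
        }
        producedBy.set(name, index);
      }
    }
  });

  // Rule: variables that are consumed but never produced are graph inputs.
  const consumed = nodes.map((node) =>
    Object.values(node.inputs || {}).flatMap(variableNames));
  const graphInputs = [...new Set(consumed.flat())].filter((n) => !producedBy.has(n));

  // Lazy cycle detection: run every node whose inputs are available, and fail
  // if some nodes are still waiting once no further progress is possible.
  const available = new Set(graphInputs);
  const pending = new Set(nodes.keys());
  const order = [];
  let progressed = true;
  while (pending.size > 0 && progressed) {
    progressed = false;
    for (const index of [...pending]) {
      if (consumed[index].every((name) => available.has(name))) {
        for (const [name, producer] of producedBy) {
          if (producer === index) available.add(name);
        }
        pending.delete(index);
        order.push(nodes[index].name);
        progressed = true;
      }
    }
  }
  if (pending.size > 0) {
    throw new Error("some nodes can never run; the graph probably has a cycle");
  }
  return { graphInputs, order };
}

// Nodes are listed out of order on purpose; scheduling sorts them out.
const result = analyzeGraph([
  { name: "CVSplit.Split", inputs: { Data: "$Convert_Output" }, outputs: { TrainData: "$cv" } },
  {
    name: "Transforms.ColumnTypeConverter",
    inputs: { Column: [{ Name: "Features", Source: "Features" }], Data: "$data0", ResultType: "R4" },
    outputs: { OutputData: "$Convert_Output" },
  },
]);
console.log(result.graphInputs.join(", ")); // data0
console.log(result.order.join(" -> ")); // Transforms.ColumnTypeConverter -> CVSplit.Split
```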

### Variables for arrays and dictionaries
It is allowed to define variables for arrays and dictionaries, as long as the item types are valid variable types (the four types listed above).
They are treated the same way as regular 'scalar' variables.

If we want to reference an item of the collection, we can use the `[]` syntax:
- `$var[5]` denotes the element at index 5 of an array variable.
- `$var[foo]` and `$var['foo']` both denote the element with key 'foo' of a dictionary variable.
  _This is not yet implemented._

Conversely, if we want to build a collection (array or dictionary) of variables, we can do it using JSON arrays and objects:
- `["$v1", "$v2", "$v3"]` denotes an array containing 3 variables.
- `{"foo": "$v1", "bar": "$v2"}` denotes a dictionary containing 2 key-value pairs.
  _This is also not yet implemented._

## Example of a JSON entry point manifest object, and the respective entry point graph node
Let's consider the following manifest snippet, describing an entry point _'CVSplit.Split'_:

```javascript
{
  "name": "CVSplit.Split",
  "desc": "Split the dataset into the specified number of cross-validation folds (train and test sets)",
  "inputs": [
    {
      "name": "Data",
      "type": "DataView",
      "desc": "Input dataset",
      "required": true
    },
    {
      "name": "NumFolds",
      "type": "Int",
      "desc": "Number of folds to split into",
      "required": false,
      "default": 2
    },
    {
      "name": "StratificationColumn",
      "type": "String",
      "desc": "Stratification column",
      "aliases": [
        "strat"
      ],
      "required": false,
      "default": null
    }
  ],
  "outputs": [
    {
      "name": "TrainData",
      "type": {
        "kind": "Array",
        "itemType": "DataView"
      },
      "desc": "Training data (one dataset per fold)"
    },
    {
      "name": "TestData",
      "type": {
        "kind": "Array",
        "itemType": "DataView"
      },
      "desc": "Testing data (one dataset per fold)"
    }
  ]
}
```
111+
112+
As we can see, the entry point has 3 inputs (one of them required), and 2 outputs.
113+
The following is a correct graph containing call to this entry point:
114+
115+
```javascript
116+
{
117+
"nodes": [
118+
{
119+
"name": "CVSplit.Split",
120+
"inputs": {
121+
"Data": "$data1"
122+
},
123+
"outputs": {
124+
"TrainData": "$cv"
125+
}
126+
}]
127+
}
128+
```
