Adding documentation about entry points, and entry points graphs: EntryPoints.md and GraphRunner.md (#295)

sfilipi · web-flow · commit ecc6857410f5 · 2018-06-21T10:55:55.000-07:00
* Adding EntryPoints.md and GraphRunner.md

* addressing PR feedback

* Updating the title of the GraphRunner.md file

* addressing Tom's feedback

* addressing feedback

* code formatting for class names

* Addressing Gal's comments

* Adding an example of an entry point. Fixing casing on ML.NET

* fixing link
diff --git a/docs/code/EntryPoints.md b/docs/code/EntryPoints.md
@@ -0,0 +1,231 @@
+# Entry Points And Helper Classes
+
+## Overview
+
+Entry points are a way to interface with ML.NET components by specifying an execution graph of connected inputs and outputs of those components.
+Both the manifest describing available components and their inputs/outputs, and an "experiment" graph description, are expressed in JSON.
+The recommended way to interact with ML.NET from non-.NET programming languages is to compose and exchange pipelines or experiment graphs.
+
+Throughout the documentation, we also refer to entry points as 'entry point nodes', because they are the nodes of the graph representing the experiment.
+The graph 'variables', the various values of the experiment graph JSON properties, serve to describe the relationship between the entry point nodes. 
+The 'variables' are therefore the edges of the DAG (Directed Acyclic Graph). 
+
+Every ML.NET entry point is described by its manifest. The manifest is another JSON object that documents the structure of an entry point.
+Consult the manifest to understand what an entry point does and how it should be constructed in a graph.
+
+This document briefly describes the structure of the entry points, the structure of an entry point manifest, and mentions the ML.NET classes that help construct an entry point graph.
+
+## EntryPoint manifest - the definition of an entry point
+
+The components manifest is built by scanning the ML.NET assemblies through reflection and searching for types that have the `SignatureEntryPointModule` signature in their `LoadableClass` assembly attribute definition.
+An example of an entry point manifest object, specifically for the `ColumnTypeConverter` transform, is:
+
+```javascript
+{
+    "Name": "Transforms.ColumnTypeConverter",
+    "Desc": "Converts a column to a different type, using standard conversions.",
+    "FriendlyName": "Convert Transform",
+    "ShortName": "Convert",
+    "Inputs": [
+        {   "Name": "Column",
+            "Type": {
+                "Kind": "Array",
+                "ItemType": {
+                    "Kind": "Struct",
+                    "Fields": [
+                        {
+                            "Name": "ResultType",
+                            "Type": {
+                                "Kind": "Enum",
+                                "Values": [ "I1","I2","U2","I4","U4","I8","U8","R4","Num","R8","TX","Text","TXT","BL","Bool","TimeSpan","TS","DT","DateTime","DZ","DateTimeZone","UG","U16" ]
+                            },
+                            "Desc": "The result type",
+                            "Aliases": [ "type" ],
+                            "Required": false,
+                            "SortOrder": 150,
+                            "IsNullable": true,
+                            "Default": null
+                        },
+                        {   "Name": "Range",
+                            "Type": "String",
+                            "Desc": "For a key column, this defines the range of values",
+                            "Aliases": [ "key" ],
+                            "Required": false,
+                            "SortOrder": 150,
+                            "IsNullable": false,
+                            "Default": null
+                        },
+                        {   "Name": "Name",
+                            "Type": "String",
+                            "Desc": "Name of the new column",
+                            "Aliases": [ "name" ],
+                            "Required": false,
+                            "SortOrder": 150,
+                            "IsNullable": false,
+                            "Default": null
+                        },
+                        {   "Name": "Source",
+                            "Type": "String",
+                            "Desc": "Name of the source column",
+                            "Aliases": [ "src" ],
+                            "Required": false,
+                            "SortOrder": 150,
+                            "IsNullable": false,
+                            "Default": null
+                        }
+                    ]
+                }
+            },
+            "Desc": "New column definition(s) (optional form: name:type:src)",
+            "Aliases": [ "col" ],
+            "Required": true,
+            "SortOrder": 1,
+            "IsNullable": false
+        },
+        {   "Name": "Data",
+            "Type": "DataView",
+            "Desc": "Input dataset",
+            "Required": true,
+            "SortOrder": 2,
+            "IsNullable": false
+        },
+        {   "Name": "ResultType",
+            "Type": {
+                "Kind": "Enum",
+                "Values": [ "I1","I2","U2","I4","U4","I8","U8","R4","Num","R8","TX","Text","TXT","BL","Bool","TimeSpan","TS","DT","DateTime","DZ","DateTimeZone","UG","U16" ]
+            },
+            "Desc": "The result type",
+            "Aliases": [ "type" ],
+            "Required": false,
+            "SortOrder": 2,
+            "IsNullable": true,
+            "Default": null
+        },
+        {   "Name": "Range",
+            "Type": "String",
+            "Desc": "For a key column, this defines the range of values",
+            "Aliases": [ "key" ],
+            "Required": false,
+            "SortOrder": 150,
+            "IsNullable": false,
+            "Default": null
+        }
+    ],
+    "Outputs": [
+        {
+            "Name": "OutputData",
+            "Type": "DataView",
+            "Desc": "Transformed dataset" 
+        },
+        {
+            "Name": "Model",
+            "Type": "TransformModel",
+            "Desc": "Transform model"
+        }
+    ],
+    "InputKind": [ "ITransformInput" ],
+    "OutputKind": [ "ITransformOutput" ]
+}
+```
+
+The corresponding entry point node, constructed from this manifest, would be:
+
+```javascript
+    {
+        "Name": "Transforms.ColumnTypeConverter",
+        "Inputs": {
+            "Column": [{
+                    "Name": "Features",
+                    "Source": "Features"
+                }],
+            "Data": "$data0",
+            "ResultType": "R4"
+        },
+        "Outputs": {
+            "OutputData": "$Convert_Output",
+            "Model": "$Convert_TransformModel"
+        }
+    }
+```
+
+## `EntryPointGraph`
+
+This class encapsulates the list of nodes (`EntryPointNode`) and edges
+(`EntryPointVariable` inside a `RunContext`) of the graph.
+
+## `EntryPointNode`
+
+This class represents a node in the graph, and wraps an entry point call. It
+has methods for creating and running entry points. It also has a reference to
+the `RunContext` to allow it to get and set values from `EntryPointVariable`s.
+
+To express the inputs that are set through variables, a set of dictionaries
+is used. The `InputBindingMap` maps an input parameter name to a list of
+`ParameterBinding`s. The `InputMap` maps a `ParameterBinding` to a
+`VariableBinding`. For example, if the JSON looks like this:
+
+```javascript
+"foo": "$bar"
+```
+
+the `InputBindingMap` will have one entry that maps the string "foo" to a list
+that has only one element, a `SimpleParameterBinding` with the name "foo", and
+the `InputMap` will map that `SimpleParameterBinding` to a
+`SimpleVariableBinding` with the name "bar". For a more complicated example,
+let's say we have this JSON:
+
+```javascript
+"foo": [ "$bar[3]", "$baz" ]
+```
+
+the `InputBindingMap` will have one entry that maps the string "foo" to a list
+that has two elements, an `ArrayIndexParameterBinding` with the name "foo" and
+index 0 and another one with index 1. The `InputMap` will map the first
+`ArrayIndexParameterBinding` to an `ArrayIndexVariableBinding` with name "bar"
+and index 3 and the second `ArrayIndexParameterBinding` to a
+`SimpleVariableBinding` with the name "baz".
+
+For outputs, a node assumes that an output is mapped to a variable, so the
+`OutputMap` is a simple dictionary from string to string.
+
+## `EntryPointVariable`
+
+This class represents an edge in the entry point graph. It has a name, a type
+and a value. Variables can be simple, arrays and/or dictionaries. Currently,
+only data views, file handles, predictor models and transform models are
+allowed as element types for a variable.
+
+## `RunContext`
+
+This class is just a container for all the variables in a graph.
+
+## `VariableBinding` and Derived Classes
+
+The abstract base class represents a "pointer to a (part of a) variable". It
+is used in conjunction with `ParameterBinding`s to specify inputs to an entry
+point node. The `SimpleVariableBinding` is a pointer to an entire variable,
+the `ArrayIndexVariableBinding` is a pointer to a specific index in an array
+variable, and the `DictionaryKeyVariableBinding` is a pointer to a specific
+key in a dictionary variable.
+
+## `ParameterBinding` and Derived Classes
+
+The abstract base class represents a "pointer to a (part of a) parameter". It
+parallels the `VariableBinding` hierarchy and it is used to specify the inputs
+to an entry point node. The `SimpleParameterBinding` is a pointer to a
+non-array, non-dictionary parameter, the `ArrayIndexParameterBinding` is a
+pointer to a specific index of an array parameter and the
+`DictionaryKeyParameterBinding` is a pointer to a specific key of a dictionary
+parameter.
+
+## How to create an entry point for an existing ML.NET component
+
+The steps to create an entry point for an existing ML.NET component are:
+1. Add the `SignatureEntryPointModule` signature to the `LoadableClass` assembly attribute.  
+2. Create a public static method that:
+    a. Takes as input, among others, an object representing the arguments of the component you want to expose. 
+    b. Initializes and runs the component, returning one of the nested classes of `Microsoft.ML.Runtime.EntryPoints.CommonOutputs`.
+    c. Is annotated with the `TlcModule.EntryPoint` attribute.
+
+Depending on the type of entry point being created, there are further conventions for the method name. For example, trainer entry points are typically named 'TrainMultiClass', 'TrainBinary', etc., based on the task.
+Look at [OnlineGradientDescent](../../src/Microsoft.ML.StandardLearners/Standard/Online/OnlineGradientDescent.cs) for an example of a component and its entry point. 
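
The `"foo"` binding examples in EntryPoints.md can be made concrete with a small JavaScript sketch. It is not ML.NET code: `parseVariableBinding` and `bindInput` are hypothetical helpers, and the plain objects only borrow the class names from the document as `kind` labels. Zero-based indices and a list of pairs standing in for `InputMap` are assumptions made for illustration.

```javascript
// Hypothetical helper: parses "$bar", "$bar[3]" or "$bar['key']" into a plain
// object that mirrors the VariableBinding hierarchy. Not part of ML.NET.
function parseVariableBinding(text) {
  const match = /^\$([A-Za-z_][A-Za-z0-9_]*)(?:\[(.+)\])?$/.exec(text);
  if (!match) {
    throw new Error("Not a variable reference: " + text);
  }
  const name = match[1];
  const selector = match[2];
  if (selector === undefined) {
    return { kind: "SimpleVariableBinding", name };
  }
  if (/^\d+$/.test(selector)) {
    return { kind: "ArrayIndexVariableBinding", name, index: Number(selector) };
  }
  // Strip optional single quotes around a dictionary key.
  const key = selector.replace(/^'(.*)'$/, "$1");
  return { kind: "DictionaryKeyVariableBinding", name, key };
}

// Hypothetical helper: builds both maps for a single input of a node.
function bindInput(paramName, jsonValue) {
  const inputBindingMap = {}; // parameter name -> list of parameter bindings
  const inputMap = [];        // [parameter binding, variable binding] pairs
  if (Array.isArray(jsonValue)) {
    inputBindingMap[paramName] = jsonValue.map((value, index) => {
      const p = { kind: "ArrayIndexParameterBinding", name: paramName, index };
      inputMap.push([p, parseVariableBinding(value)]);
      return p;
    });
  } else {
    const p = { kind: "SimpleParameterBinding", name: paramName };
    inputBindingMap[paramName] = [p];
    inputMap.push([p, parseVariableBinding(jsonValue)]);
  }
  return { inputBindingMap, inputMap };
}
```

Calling `bindInput("foo", ["$bar[3]", "$baz"])` yields two `ArrayIndexParameterBinding` entries for "foo" (index 0 and 1), paired with an `ArrayIndexVariableBinding` for "bar" at index 3 and a `SimpleVariableBinding` for "baz", matching the second example in the `EntryPointNode` section.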
diff --git a/docs/code/GraphRunner.md b/docs/code/GraphRunner.md
@@ -0,0 +1,128 @@
+# Entry Point JSON Graph format
+
+The entry point graph in ML.NET is an array of _nodes_. More information about the definition of entry points and the classes that help construct entry point graphs
+can be found in the [EntryPoints.md document](./EntryPoints.md).
+ 
+Each node is an object with the following fields:
+
+- _name_: string. Required. Name of the entry point.
+- _inputs_: object. Optional. Specifies non-default inputs to the entry point. 
+Note that if the entry point has required inputs (which is very common), the _inputs_ field is required.
+- _outputs_: object. Optional. Specifies the variables that will hold the node's outputs.
+
+## Input and output types
+The following types are supported in JSON graphs:
+
+- `string`. Represented as a JSON string, maps to a C# string.
+- `float`. Represented as a JSON float, maps to a C# float or double.
+- `bool`. Represented as a JSON bool, maps to a C# bool.
+- `enum`. Represented as a JSON string, maps to a C# enum. The allowed values are those of the C# enum (they are also listed in the manifest).
+- `int`.  Represented as a JSON integer, maps to a C# int or long.
+- `array` of the above. Represented as a JSON array, maps to a C# array.
+- `dictionary`. Currently not implemented. Represented as a JSON object, maps to a C# `Dictionary<string,T>`.
+- `component`. Represented as a JSON object with 2 fields: _name_:string and _settings_:object.
+
+## Variables
+The following input/output types cannot be represented as a JSON value:
+- `IDataView`
+- `IFileHandle`
+- `ITransformModel`
+- `IPredictorModel`
+
+These must be passed as _variables_. The variable is represented as a JSON string that begins with `$`. 
+Note the following rules:
+
+- A variable can appear in the _outputs_ only once per graph. That is, the variable can be 'assigned' only once. 
+- If the variable is present in _inputs_ of one node and in the _outputs_ of another node, this signifies a graph 'edge'. 
+The same variable can participate in many edges.
+- If the variable is present only in _inputs_, but never in _outputs_, it is a _graph input_. All graph inputs must be provided before
+a graph can be run.
+- The variable has a type, which is the type of the inputs (and, optionally, the output) that it appears in. If the type of the variable is
+ambiguous, ML.NET throws an exception.
+- Circular references. The experiment graph is expected to be a DAG. If a circular dependency is detected, ML.NET throws an exception.
+_Currently, this is done lazily: if a node can never run because it is still waiting for inputs, ML.NET throws._
+
+### Variables for arrays and dictionaries
+It is allowed to define variables for arrays and dictionaries, as long as the item types are valid variable types (the four types listed above).
+They are treated the same way as regular 'scalar' variables.
+
+If we want to reference an item of the collection, we can use the `[]` syntax:
+- `$var[5]` denotes the element at index 5 of an array variable.
+- `$var[foo]` and `$var['foo']` both denote the element with key 'foo' of a dictionary variable.
+_This is not yet implemented._
+
+Conversely, if we want to build a collection (array or dictionary) of variables, we can do it using JSON arrays and objects:
+- `["$v1", "$v2", "$v3"]` denotes an array containing 3 variables.
+- `{"foo": "$v1", "bar": "$v2"}` denotes a dictionary containing 2 key-value pairs.
+_This is also not yet implemented._
+
+## Example of a JSON entry point manifest object, and the respective entry point graph node
+Let's consider the following manifest snippet, describing an entry point _'CVSplit.Split'_:
+
+```javascript
+    {
+      "name": "CVSplit.Split",
+      "desc": "Split the dataset into the specified number of cross-validation folds (train and test sets)",
+      "inputs": [
+        {
+          "name": "Data",
+          "type": "DataView",
+          "desc": "Input dataset",
+          "required": true
+        },
+        {
+          "name": "NumFolds",
+          "type": "Int",
+          "desc": "Number of folds to split into",
+          "required": false,
+          "default": 2
+        },
+        {
+          "name": "StratificationColumn",
+          "type": "String",
+          "desc": "Stratification column",
+          "aliases": [
+            "strat"
+          ],
+          "required": false,
+          "default": null
+        }
+      ],
+      "outputs": [
+        {
+          "name": "TrainData",
+          "type": {
+            "kind": "Array",
+            "itemType": "DataView"
+          },
+          "desc": "Training data (one dataset per fold)"
+        },
+        {
+          "name": "TestData",
+          "type": {
+            "kind": "Array",
+            "itemType": "DataView"
+          },
+          "desc": "Testing data (one dataset per fold)"
+        }
+      ]
+    }
+```
+
+As we can see, the entry point has 3 inputs (one of them required), and 2 outputs.
+The following is a correct graph containing a call to this entry point:
+
+```javascript
+{
+  "nodes": [
+    {
+      "name": "CVSplit.Split",
+      "inputs": {
+        "Data": "$data1"
+      },
+      "outputs": {
+        "TrainData": "$cv"
+      }
+    }]
+}
+```
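
The variable rules in GraphRunner.md (single assignment, graph inputs, lazy cycle detection) can be sketched in JavaScript as follows. This is an illustration under stated assumptions, not how ML.NET implements its checks: `collectVariables` and `analyzeGraph` are hypothetical names, the graph uses the lowercase `name`/`inputs`/`outputs` fields shown above, and type checking of variables is left out.

```javascript
// Hypothetical helper: collects every "$name" string inside a JSON value,
// including array elements and object values. Not part of ML.NET.
function collectVariables(value, found = []) {
  if (typeof value === "string") {
    if (value.startsWith("$")) {
      // Drop a trailing [index] or [key] selector to get the variable name.
      found.push(value.slice(1).replace(/\[.*\]$/, ""));
    }
  } else if (Array.isArray(value)) {
    value.forEach((v) => collectVariables(v, found));
  } else if (value !== null && typeof value === "object") {
    Object.values(value).forEach((v) => collectVariables(v, found));
  }
  return found;
}

// Hypothetical helper: checks the variable rules and returns the graph
// inputs together with one valid execution order of the nodes.
function analyzeGraph(graph) {
  const nodes = graph.nodes.map((node) => ({
    name: node.name,
    reads: collectVariables(node.inputs || {}),
    writes: collectVariables(node.outputs || {}),
  }));

  // Rule: a variable can appear in outputs only once per graph.
  const assigned = new Set();
  for (const node of nodes) {
    for (const v of node.writes) {
      if (assigned.has(v)) {
        throw new Error("Variable $" + v + " is assigned more than once");
      }
      assigned.add(v);
    }
  }

  // Rule: variables that are read but never assigned are graph inputs.
  const graphInputs = [...new Set(
    nodes.flatMap((n) => n.reads).filter((v) => !assigned.has(v))
  )];

  // Lazy cycle detection: keep running nodes whose inputs are available;
  // if nodes remain and none can run, the graph is not a DAG.
  const available = new Set(graphInputs);
  const order = [];
  let pending = nodes;
  while (pending.length > 0) {
    const ready = pending.filter((n) => n.reads.every((v) => available.has(v)));
    if (ready.length === 0) {
      throw new Error("Circular dependency: some nodes can never run");
    }
    for (const n of ready) {
      n.writes.forEach((v) => available.add(v));
      order.push(n.name);
    }
    pending = pending.filter((n) => !ready.includes(n));
  }
  return { graphInputs, order };
}
```

For the `CVSplit.Split` graph above, `analyzeGraph` reports `data1` as the only graph input and a single-node execution order. A graph in which two nodes consume each other's outputs fails with the circular dependency error.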