diff --git a/docs/code/EntryPoints.md b/docs/code/EntryPoints.md
new file mode 100644
index 0000000000..dbcc4e6bc9
--- /dev/null
+++ b/docs/code/EntryPoints.md
@@ -0,0 +1,231 @@
+# Entry Points And Helper Classes
+
+## Overview
+
+Entry points are a way to interface with ML.NET components, by specifying an execution graph of connected inputs and outputs of those components.
+Both the manifest describing available components and their inputs/outputs, and an "experiment" graph description, are expressed in JSON.
+The recommended way of interacting with ML.NET from non-.NET programming languages is by composing and exchanging pipelines or experiment graphs.
+
+Throughout the documentation, we also refer to entry points as 'entry point nodes', because they are the nodes of the graph representing the experiment.
+The graph 'variables', the various values of the experiment graph JSON properties, serve to describe the relationship between the entry point nodes.
+The 'variables' are therefore the edges of the DAG (Directed Acyclic Graph).
+
+All ML.NET entry points are described by their manifest. The manifest is another JSON object that documents and describes the structure of an entry point.
+Manifests are referenced to understand what an entry point does and how it should be constructed in a graph.
+
+This document briefly describes the structure of the entry points, the structure of an entry point manifest, and mentions the ML.NET classes that help construct an entry point graph.
+
+## EntryPoint manifest - the definition of an entry point
+
+The components manifest is built by scanning the ML.NET assemblies through reflection and searching for types that have the `SignatureEntryPointModule` signature in their `LoadableClass` assembly attribute definition.
+An example of an entry point manifest object, specifically for the `ColumnTypeConverter` transform, is:
+
+```javascript
+{
+  "Name": "Transforms.ColumnTypeConverter",
+  "Desc": "Converts a column to a different type, using standard conversions.",
+  "FriendlyName": "Convert Transform",
+  "ShortName": "Convert",
+  "Inputs": [
+    { "Name": "Column",
+      "Type": {
+        "Kind": "Array",
+        "ItemType": {
+          "Kind": "Struct",
+          "Fields": [
+            {
+              "Name": "ResultType",
+              "Type": {
+                "Kind": "Enum",
+                "Values": [ "I1","I2","U2","I4","U4","I8","U8","R4","Num","R8","TX","Text","TXT","BL","Bool","TimeSpan","TS","DT","DateTime","DZ","DateTimeZone","UG","U16" ]
+              },
+              "Desc": "The result type",
+              "Aliases": [ "type" ],
+              "Required": false,
+              "SortOrder": 150,
+              "IsNullable": true,
+              "Default": null
+            },
+            { "Name": "Range",
+              "Type": "String",
+              "Desc": "For a key column, this defines the range of values",
+              "Aliases": [ "key" ],
+              "Required": false,
+              "SortOrder": 150,
+              "IsNullable": false,
+              "Default": null
+            },
+            { "Name": "Name",
+              "Type": "String",
+              "Desc": "Name of the new column",
+              "Aliases": [ "name" ],
+              "Required": false,
+              "SortOrder": 150,
+              "IsNullable": false,
+              "Default": null
+            },
+            { "Name": "Source",
+              "Type": "String",
+              "Desc": "Name of the source column",
+              "Aliases": [ "src" ],
+              "Required": false,
+              "SortOrder": 150,
+              "IsNullable": false,
+              "Default": null
+            }
+          ]
+        }
+      },
+      "Desc": "New column definition(s) (optional form: name:type:src)",
+      "Aliases": [ "col" ],
+      "Required": true,
+      "SortOrder": 1,
+      "IsNullable": false
+    },
+    { "Name": "Data",
+      "Type": "DataView",
+      "Desc": "Input dataset",
+      "Required": true,
+      "SortOrder": 2,
+      "IsNullable": false
+    },
+    { "Name": "ResultType",
+      "Type": {
+        "Kind": "Enum",
+        "Values": [ "I1","I2","U2","I4","U4","I8","U8","R4","Num","R8","TX","Text","TXT","BL","Bool","TimeSpan","TS","DT","DateTime","DZ","DateTimeZone","UG","U16" ]
+      },
+      "Desc": "The result type",
+      "Aliases": [ "type" ],
+      "Required": false,
+      "SortOrder": 2,
+      "IsNullable": true,
+      "Default": null
+    },
+    { "Name": "Range",
+      "Type": "String",
+      "Desc": "For a key column, this defines the range of values",
+      "Aliases": [ "key" ],
+      "Required": false,
+      "SortOrder": 150,
+      "IsNullable": false,
+      "Default": null
+    }
+  ],
+  "Outputs": [
+    {
+      "Name": "OutputData",
+      "Type": "DataView",
+      "Desc": "Transformed dataset"
+    },
+    {
+      "Name": "Model",
+      "Type": "TransformModel",
+      "Desc": "Transform model"
+    }
+  ],
+  "InputKind": ["ITransformInput" ],
+  "OutputKind": [ "ITransformOutput" ]
+}
+```
+
+The corresponding entry point node, constructed from this manifest, would be:
+
+```javascript
+{
+  "Name": "Transforms.ColumnTypeConverter",
+  "Inputs": {
+    "Column": [{
+      "Name": "Features",
+      "Source": "Features"
+    }],
+    "Data": "$data0",
+    "ResultType": "R4"
+  },
+  "Outputs": {
+    "OutputData": "$Convert_Output",
+    "Model": "$Convert_TransformModel"
+  }
+}
+```
+
+## `EntryPointGraph`
+
+This class encapsulates the list of nodes (`EntryPointNode`) and edges
+(`EntryPointVariable` inside a `RunContext`) of the graph.
+
+## `EntryPointNode`
+
+This class represents a node in the graph, and wraps an entry point call. It
+has methods for creating and running entry points. It also has a reference to
+the `RunContext` to allow it to get and set values from `EntryPointVariable`s.
+
+To express the inputs that are set through variables, a set of dictionaries
+is used. The `InputBindingMap` maps an input parameter name to a list of
+`ParameterBinding`s. The `InputMap` maps a `ParameterBinding` to a
+`VariableBinding`. For example, if the JSON looks like this:
+
+```javascript
+'foo': '$bar'
+```
+
+the `InputBindingMap` will have one entry that maps the string "foo" to a list
+that has only one element, a `SimpleParameterBinding` with the name "foo" and
+the `InputMap` will map the `SimpleParameterBinding` to a
+`SimpleVariableBinding` with the name "bar".
+For a more complicated example, let's say we have this JSON:
+
+```javascript
+'foo': [ '$bar[3]', '$baz']
+```
+
+the `InputBindingMap` will have one entry that maps the string "foo" to a list
+that has two elements, an `ArrayIndexParameterBinding` with the name "foo" and
+index 0 and another one with index 1. The `InputMap` will map the first
+`ArrayIndexParameterBinding` to an `ArrayIndexVariableBinding` with name "bar"
+and index 3 and the second `ArrayIndexParameterBinding` to a
+`SimpleVariableBinding` with the name "baz".
+
+For outputs, a node assumes that an output is mapped to a variable, so the
+`OutputMap` is a simple dictionary from string to string.
+
+## `EntryPointVariable`
+
+This class represents an edge in the entry point graph. It has a name, a type
+and a value. Variables can be simple, arrays and/or dictionaries. Currently,
+only data views, file handles, predictor models and transform models are
+allowed as element types for a variable.
+
+## `RunContext`
+
+This class is just a container for all the variables in a graph.
+
+## `VariableBinding` and Derived Classes
+
+The abstract base class represents a "pointer to a (part of a) variable". It
+is used in conjunction with `ParameterBinding`s to specify inputs to an entry
+point node. The `SimpleVariableBinding` is a pointer to an entire variable,
+the `ArrayIndexVariableBinding` is a pointer to a specific index in an array
+variable, and the `DictionaryKeyVariableBinding` is a pointer to a specific
+key in a dictionary variable.
+
+## `ParameterBinding` and Derived Classes
+
+The abstract base class represents a "pointer to a (part of a) parameter". It
+parallels the `VariableBinding` hierarchy and it is used to specify the inputs
+to an entry point node.
+The `SimpleParameterBinding` is a pointer to a non-array, non-dictionary
+parameter, the `ArrayIndexParameterBinding` is a pointer to a specific index
+of an array parameter, and the `DictionaryKeyParameterBinding` is a pointer
+to a specific key of a dictionary parameter.
+
+## How to create an entry point for an existing ML.NET component
+
+The steps to create an entry point for an existing ML.NET component are:
+1. Add the `SignatureEntryPointModule` signature to the `LoadableClass` assembly attribute.
+2. Create a public static method that:
+    a. Takes as input, among others, an object representing the arguments of the component you want to expose.
+    b. Initializes and runs the component, returning one of the nested classes of `Microsoft.ML.Runtime.EntryPoints.CommonOutputs`.
+    c. Is annotated with the `TlcModule.EntryPoint` attribute.
+
+Depending on the type of entry point being created, there are further conventions for the name of the method. For example, trainer entry points are typically called 'TrainMultiClass', 'TrainBinary', etc., based on the task.
+Look at [OnlineGradientDescent](../../src/Microsoft.ML.StandardLearners/Standard/Online/OnlineGradientDescent.cs) for an example of a component and its entry point.
\ No newline at end of file
diff --git a/docs/code/GraphRunner.md b/docs/code/GraphRunner.md
new file mode 100644
index 0000000000..b7fddc9476
--- /dev/null
+++ b/docs/code/GraphRunner.md
@@ -0,0 +1,128 @@
+# Entry Point JSON Graph format
+
+The entry point graph in ML.NET is an array of _nodes_. More information about the definition of entry points and classes that help construct entry point graphs
+can be found in the [EntryPoints.md document](./EntryPoints.md).
+
+Each node is an object with the following fields:
+
+- _name_: string. Required. Name of the entry point.
+- _inputs_: object. Optional. Specifies non-default inputs to the entry point.
+Note that if the entry point has required inputs (which is very common), the _inputs_ field is required.
+- _outputs_: object. Optional. Specifies the variables that will hold the node's outputs.
+
+## Input and output types
+The following types are supported in JSON graphs:
+
+- `string`. Represented as a JSON string, maps to a C# string.
+- `float`. Represented as a JSON float, maps to a C# float or double.
+- `bool`. Represented as a JSON bool, maps to a C# bool.
+- `enum`. Represented as a JSON string, maps to a C# enum. The allowed values are those of the C# enum (they are also listed in the manifest).
+- `int`. Represented as a JSON integer, maps to a C# int or long.
+- `array` of the above. Represented as a JSON array, maps to a C# array.
+- `dictionary`. Currently not implemented. Represented as a JSON object, maps to a C# `Dictionary`.
+- `component`. Represented as a JSON object with 2 fields: _name_:string and _settings_:object.
+
+## Variables
+The following input/output types cannot be represented as a JSON value:
+- `IDataView`
+- `IFileHandle`
+- `ITransformModel`
+- `IPredictorModel`
+
+These must be passed as _variables_. A variable is represented as a JSON string that begins with `$`.
+Note the following rules:
+
+- A variable can appear in the _outputs_ only once per graph. That is, the variable can be 'assigned' only once.
+- If the variable is present in the _inputs_ of one node and in the _outputs_ of another node, this signifies a graph 'edge'.
+The same variable can participate in many edges.
+- If the variable is present only in _inputs_, but never in _outputs_, it is a _graph input_. All graph inputs must be provided before
+a graph can be run.
+- The variable has a type, which is the type of the inputs (and, optionally, the output) that it appears in. If the type of the variable is
+ambiguous, ML.NET throws an exception.
+- Circular references. The experiment graph is expected to be a DAG.
+If a circular dependency is detected, ML.NET throws an exception. _Currently, this is done lazily: if we couldn't ever run a node because it's waiting for inputs, we throw._
+
+### Variables for arrays and dictionaries
+Variables can be defined for arrays and dictionaries, as long as the item types are valid variable types (the four types listed above).
+They are treated the same way as regular 'scalar' variables.
+
+If we want to reference an item of the collection, we can use the `[]` syntax:
+- `$var[5]` denotes the element at index 5 of an array variable.
+- `$var[foo]` and `$var['foo']` both denote the element with key 'foo' of a dictionary variable.
+_This is not yet implemented._
+
+Conversely, if we want to build a collection (array or dictionary) of variables, we can do it using JSON arrays and objects:
+- `["$v1", "$v2", "$v3"]` denotes an array containing 3 variables.
+- `{"foo": "$v1", "bar": "$v2"}` denotes a dictionary containing 2 key-value pairs.
+_This is also not yet implemented._
+
+## Example of a JSON entry point manifest object, and the respective entry point graph node
+Let's consider the following manifest snippet, describing an entry point _'CVSplit.Split'_:
+
+```javascript
+{
+  "name": "CVSplit.Split",
+  "desc": "Split the dataset into the specified number of cross-validation folds (train and test sets)",
+  "inputs": [
+    {
+      "name": "Data",
+      "type": "DataView",
+      "desc": "Input dataset",
+      "required": true
+    },
+    {
+      "name": "NumFolds",
+      "type": "Int",
+      "desc": "Number of folds to split into",
+      "required": false,
+      "default": 2
+    },
+    {
+      "name": "StratificationColumn",
+      "type": "String",
+      "desc": "Stratification column",
+      "aliases": [
+        "strat"
+      ],
+      "required": false,
+      "default": null
+    }
+  ],
+  "outputs": [
+    {
+      "name": "TrainData",
+      "type": {
+        "kind": "Array",
+        "itemType": "DataView"
+      },
+      "desc": "Training data (one dataset per fold)"
+    },
+    {
+      "name": "TestData",
+      "type": {
+        "kind": "Array",
+        "itemType":
+          "DataView"
+      },
+      "desc": "Testing data (one dataset per fold)"
+    }
+  ]
+}
+```
+
+As we can see, the entry point has 3 inputs (one of them required), and 2 outputs. The following is a correct graph containing a call to this entry point:
+
+```javascript
+{
+  "nodes": [
+    {
+      "name": "CVSplit.Split",
+      "inputs": {
+        "Data": "$data1"
+      },
+      "outputs": {
+        "TrainData": "$cv"
+      }
+    }]
+}
+```
\ No newline at end of file
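
The variable rules listed in the "Variables" section of GraphRunner.md can be sketched as a small static checker. This is only an illustration under stated assumptions: `collectVariables` and `analyzeGraph` are hypothetical helper names and are not part of ML.NET, the graph is assumed to have the `{ "nodes": [...] }` shape shown in the example above, and the cycle check here runs up front rather than lazily as the document says ML.NET currently does.

```javascript
// Hypothetical helpers (not ML.NET APIs) that check the variable rules on a
// graph shaped like { nodes: [ { name, inputs, outputs } ] }.

// Collects every "$variable" referenced inside a JSON value. An item
// reference such as "$var[5]" is reduced to its base variable "$var".
function collectVariables(value, found = []) {
  if (typeof value === "string") {
    const match = /^(\$[^\[\]]+)(\[.*\])?$/.exec(value);
    if (match) found.push(match[1]);
  } else if (Array.isArray(value)) {
    value.forEach((item) => collectVariables(item, found));
  } else if (value !== null && typeof value === "object") {
    Object.values(value).forEach((item) => collectVariables(item, found));
  }
  return found;
}

function analyzeGraph(graph) {
  // Rule: a variable may be 'assigned' (appear in outputs) only once.
  const producers = new Map(); // variable name -> index of the assigning node
  graph.nodes.forEach((node, index) => {
    for (const variable of collectVariables(node.outputs || {})) {
      if (producers.has(variable)) {
        throw new Error(`Variable ${variable} is assigned more than once`);
      }
      producers.set(variable, index);
    }
  });

  // Rule: a variable used as an input but never assigned is a graph input;
  // an input assigned by another node forms an edge (a dependency).
  const graphInputs = new Set();
  const dependencies = graph.nodes.map((node) => {
    const deps = new Set();
    for (const variable of collectVariables(node.inputs || {})) {
      if (producers.has(variable)) deps.add(producers.get(variable));
      else graphInputs.add(variable);
    }
    return deps;
  });

  // Rule: the graph must be a DAG. Repeatedly mark nodes whose producers
  // are all done; any node left unmarked is waiting on a circular reference.
  const done = new Set();
  let progressed = true;
  while (progressed) {
    progressed = false;
    dependencies.forEach((deps, index) => {
      if (!done.has(index) && [...deps].every((dep) => done.has(dep))) {
        done.add(index);
        progressed = true;
      }
    });
  }
  if (done.size !== graph.nodes.length) {
    throw new Error("Circular reference detected in the graph");
  }
  return { graphInputs: [...graphInputs], order: [...done] };
}

// Usage with the CVSplit.Split graph from the example above.
const cvGraph = {
  nodes: [
    {
      name: "CVSplit.Split",
      inputs: { Data: "$data1" },
      outputs: { TrainData: "$cv" },
    },
  ],
};
const cvResult = analyzeGraph(cvGraph);
console.log(cvResult.graphInputs); // the only graph input is "$data1"
```

For the single-node graph above, `$data1` is reported as a graph input because no node assigns it, so a caller would have to supply it before running the graph. The checker does not attempt the type rule, since variable types come from the manifest rather than from the graph JSON.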