Skip to content

Conversation

@michaelgsharp
Copy link
Contributor

This is the initial featurizers PR to split apart PR #4157 and get all the common code into master.
Once this code goes in, I will create separate PR's for each of the 5 featurizers.

This is the common code for the featurizers. Its mostly changes to project files and solution files. Common.cs is for all the shared code for the featurizers. The change in Utils.cs is to allow Marshal.Invoke with multiple type parameters. The RowToRowMapperTransform.cs change has it create a new mapper when possible. This helps with thread safety/local caching of the mappers when run in a multi threaded approach.

@michaelgsharp michaelgsharp requested a review from a team October 29, 2019 23:31
@michaelgsharp michaelgsharp self-assigned this Oct 29, 2019
// Not all these types are currently supported. This is so the ordering will allign with the native code.
internal enum TypeId : uint
{
String = 1, SByte, Short, Int, Long, Byte, UShort,
Copy link
Member

@eerhardt eerhardt Oct 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typical C# code has these be on separate lines. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know. Put them all on their own line.


In reply to: 340734891 [](ancestors = 340734891)

Copy link
Contributor

@justinormont justinormont Nov 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a int next to each? This will assist with long term compatibility by allowing for inserting new values, and in case someone decides to sort the enum differently later (more important if serialized in a model). #Resolved

Copy link
Contributor Author

@michaelgsharp michaelgsharp Nov 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you are correct. I have added an int next to each one now. #Resolved

Copy link
Contributor

@justinormont justinormont Nov 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks #Resolved

var meth = MarshalActionInvokeCheckAndCreate(func, genArgs);
if (meth.ReturnType != typeof(TRet))
throw Contracts.ExceptParam(nameof(func), "Cannot be generic on return type");
return meth;
Copy link
Contributor

@justinormont justinormont Nov 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think method is a reserved keyword in C#. Recommend the longer name.

Suggested change
return meth;
return method;
``` #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept it as meth because that is consistent with all the other methods here that do this same thing. I think if we want it to be method then we need to change them all.


In reply to: 347619466 [](ancestors = 347619466)


<PropertyGroup>
<TargetFramework>netstandard2.0</TargetFramework>
<PackageDescription>Additional ML.NET featurizers</PackageDescription>
Copy link
Member

@codemzs codemzs Nov 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional ML.NET featurizers [](start = 24, length = 29)

ML .NET featurizers with native code implementation. #Resolved

return meth;
}

private static MethodInfo MarshalInvokeCheckAndCreate<TRet>(Delegate func, Type[] genArgs)
Copy link
Member

@codemzs codemzs Nov 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delegate func, Type[] genArgs) [](start = 68, length = 30)

lets keep the order consistent types then delegate #Resolved

return meth;
}

private static MethodInfo MarshalActionInvokeCheckAndCreate(Delegate func, params Type[] typeArguments)
Copy link
Member

@codemzs codemzs Nov 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Delegate func, params Type[] typeArguments) [](start = 67, length = 44)

same here, lets keep the order consistent, types then delegate #Resolved


internal enum FitResult : byte
{
Complete = 1, Continue, ResetAndContinue
Copy link
Member

@codemzs codemzs Nov 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Complete = 1, Continue, ResetAndContinue [](start = 8, length = 40)

put each on its own line and explicitly define values. #Resolved

Tabular = 0x1001 | LastStaticValue + 3,
Nullable = 0x1001 | LastStaticValue + 4,
Vector = 0x1001 | LastStaticValue + 5,
MapId = 0x1002 | LastStaticValue + 6
Copy link
Member

@codemzs codemzs Nov 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MapId = 0x1002 | LastStaticValue + 6 [](start = 8, length = 36)

please add comments for this #Resolved

Copy link
Contributor

@justinormont justinormont Nov 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this add LastStaticValue? In the future, this will shift all of the uint values by 1, as a new type is inserted to the top section.

Why the math? Seems more simple to directly map to the values.

Why is the base of the last one 0x1002? That skips a value. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these are taken directly as is from the Native code base. They are here just so we have a correct mapping on our side.

LastStaticValue is supposed to represent the last stand alone type. All the types that follow after it are compound types. There is probably a more appropriate name, but since I want to keep this identical to the Native code I left it as is.

The math is there just to keep it identical to the native code. I could directly map the values, but then I would have to map the values every time I need to check if it has changed in the native code. Leaving it like this makes comparisons much easier.

The number at the end of 0x100N represents how many types come after it that are associated to it. So Tabular, Nullable, and Vector, end in 1 meaning only 1 type follows it. The Map ends in 2 because we need types for the Keys and for the Values.


In reply to: 348115118 [](ancestors = 348115118)

return Encoding.UTF8.GetString(buffer);
}
}
}
Copy link
Member

@codemzs codemzs Nov 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

} [](start = 8, length = 1)

new line #Resolved

return TypeId.Int;
else if (type == typeof(long))
return TypeId.Long;
else if (type == typeof(byte))
Copy link
Member

@codemzs codemzs Nov 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(byte)) [](start = 35, length = 7)

sbyte #Resolved

else if (type == typeof(long))
return TypeId.Long;
else if (type == typeof(byte))
return TypeId.Byte;
Copy link
Member

@codemzs codemzs Nov 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Byte [](start = 30, length = 4)

SByte #Resolved

@codemzs
Copy link
Member

codemzs commented Nov 18, 2019

Can you please add comments for these functions? and how they are used? #Resolved


Refers to: src/Microsoft.ML.Featurizers/Common.cs:205 in 49f3798. [](commit_id = 49f3798, deletion_comment = False)

Copy link
Member

@codemzs codemzs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

: base(provider, input)
{
var pred = parent.GetActiveOutputColumns(active);
_getters = parent._mapper.CreateGetters(input, pred, out _disposer);
Copy link
Member

@codemzs codemzs Nov 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_getters = parent._mapper.CreateGetters(input, pred, out _disposer); [](start = 16, length = 68)

lets revert this and use a thread safe dictionary for now. I want to think a little more on this. #Resolved

@codecov
Copy link

codecov bot commented Nov 19, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@1fa6cb5). Click here to learn what that means.
The diff coverage is n/a.

@@            Coverage Diff            @@
##             master    #4413   +/-   ##
=========================================
  Coverage          ?   74.86%           
=========================================
  Files             ?      908           
  Lines             ?   159862           
  Branches          ?    17214           
=========================================
  Hits              ?   119688           
  Misses            ?    35354           
  Partials          ?     4820
Flag Coverage Δ
#Debug 74.86% <ø> (?)
#production 70.22% <ø> (?)
#test 90.19% <ø> (?)

@codemzs
Copy link
Member

codemzs commented Nov 22, 2019

@michaelgsharp Can you please sync to master?

ToStringTransformer is done.

CatagoryImputer is done.

TimeSeriesImputer is done.

RobustScaler is done.

Adding in samples and documentation. General code cleanup. Made the RowToRowMapperTransform create a new mapper if possible for each cursor.
michaelgsharp and others added 2 commits November 22, 2019 14:14
Added line in Directory.Build.Props that was removed during the rebase process. The line already exists in master and shouldn't be removed.
@michaelgsharp michaelgsharp merged commit 8d20cdd into dotnet:master Dec 3, 2019
@michaelgsharp michaelgsharp deleted the featurizers-project branch December 3, 2019 22:38
@ghost ghost locked as resolved and limited conversation to collaborators Mar 20, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants