Skip to content

Conversation

@sfilipi
Copy link
Member

@sfilipi sfilipi commented Mar 21, 2019

Towards #1209

Adding a sample for the Hash extension.
Disabling the compiler check for whether the members of a class are in use.

@sfilipi sfilipi self-assigned this Mar 21, 2019
@sfilipi sfilipi added the documentation Related to documentation of ML.NET label Mar 21, 2019
@sfilipi sfilipi added this to the 0319 milestone Mar 21, 2019
@codecov
Copy link

codecov bot commented Mar 21, 2019

Codecov Report

Merging #3042 into master will increase coverage by 0.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3042      +/-   ##
==========================================
+ Coverage   72.49%    72.5%   +0.01%     
==========================================
  Files         804      804              
  Lines      144078   144150      +72     
  Branches    16179    16179              
==========================================
+ Hits       104446   104519      +73     
- Misses      35219    35220       +1     
+ Partials     4413     4411       -2
Flag Coverage Δ
#Debug 72.5% <ø> (+0.01%) ⬆️
#production 68.13% <ø> (ø) ⬆️
#test 88.72% <ø> (+0.02%) ⬆️
Impacted Files Coverage Δ
...ML.Data/Transforms/ConversionsExtensionsCatalog.cs 50% <ø> (ø) ⬆️
src/Microsoft.ML.Maml/MAML.cs 24.75% <0%> (-1.46%) ⬇️
...osoft.ML.Tests/Transformers/TextFeaturizerTests.cs 99.78% <0%> (+0.03%) ⬆️
...oft.ML.Transforms/Text/TextFeaturizingEstimator.cs 91.97% <0%> (+1%) ⬆️

// The original value of the 6012 category is MLS
// The original value of the 19014 category is NFL
// The original value of the 36205 category is MLB

Copy link
Member

@wschin wschin Mar 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra empty line? #Resolved

namespace Microsoft.ML.Samples.Dynamic
{
public static class Hash
{
Copy link
Member

@wschin wschin Mar 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add xml string --- This example demonstrates hashing of string and integer types. #Resolved


var data = mlContext.Data.LoadFromEnumerable(rawData);

// Construct the pipeline that would hash the two columns and store the results in new columns.
Copy link
Member

@wschin wschin Mar 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Construct the pipeline that would hash the two columns and store the results in new columns.
// Construct the pipeline that would hash the two columns and store the results in new columns.
// The first transform hashes the string column and the second column hashes the integer column.
``` #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1


In reply to: 267874410 [](ancestors = 267874410)


}

private class InputData
Copy link
Member

@wschin wschin Mar 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private class InputData
private class DataPoint
``` #Resolved

public uint Age;
}

private class TransformedData : InputData
Copy link
Member

@wschin wschin Mar 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TransformedDataPoint or Result? #Resolved


// Output
//
// Category: MLB - CategoryHashed: 36206.Age: 18 - AgeHashed 127
Copy link
Member

@wschin wschin Mar 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spaces are missed.

Suggested change
// Category: MLB - CategoryHashed: 36206.Age: 18 - AgeHashed 127
// Category: MLB - CategoryHashed: 36206. Age: 18 - AgeHashed 127
``` #Resolved

{
public static class Hash
{
public static void Example()
Copy link
Member

@wschin wschin Mar 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to at least execute this function in a test? #Pending

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a workitem on me to convert all those samples to baseline tests. Can't wait to get to it, for fear of breaking the output through other changes.


In reply to: 267876000 [](ancestors = 267876000)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#2954


In reply to: 267958858 [](ancestors = 267958858,267876000)

<PropertyGroup>
<TargetFramework>netcoreapp2.1</TargetFramework>
<OutputType>Exe</OutputType>
<WarningsNotAsErrors>649</WarningsNotAsErrors>
Copy link
Contributor

@rogancarr rogancarr Mar 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

649 [](start = 4, length = 46)

If you're getting errors because of the data classes you create, you can solve this by doing a public xyz Xyz {get; set;} in your classes. #Pending

Copy link
Member Author

@sfilipi sfilipi Mar 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Rogan. Leave this here, so people don't keep bumping into this?
It disables just that check.


In reply to: 267927422 [](ancestors = 267927422)

var transformer = pipeline.Fit(data);
var transformedData = transformer.Transform(data);

// Store the post transformation from the IDataView format, to an IEnumerable<TransformedData> for easy consumption
Copy link
Contributor

@rogancarr rogancarr Mar 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Store [](start = 15, length = 5)

Convert #Resolved

var transformer = pipeline.Fit(data);
var transformedData = transformer.Transform(data);

// Store the post transformation from the IDataView format, to an IEnumerable<TransformedData> for easy consumption
Copy link
Contributor

@rogancarr rogancarr Mar 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

, [](start = 70, length = 1)

no comma #Resolved

var transformer = pipeline.Fit(data);
var transformedData = transformer.Transform(data);

// Store the post transformation from the IDataView format, to an IEnumerable<TransformedData> for easy consumption
Copy link
Contributor

@rogancarr rogancarr Mar 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consumption [](start = 116, length = 11)

Periods at the end of all comments. (Hi Gleb!) #Resolved

Console.WriteLine($"Category: {item.Category} - CategoryHashed: {item.CategoryHashed}. Age: {item.Age} - AgeHashed {item.AgeHashed}");
}

// Output
Copy link
Contributor

@rogancarr rogancarr Mar 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Output [](start = 15, length = 6)

Expected? Do we have a standard we're using? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, we don't. we might want to do a pass over all of them and just nit at things like this one.


In reply to: 267928171 [](ancestors = 267928171)

// Output
//
// Category: MLB - CategoryHashed: 36206.Age: 18 - AgeHashed 127
// Category: NFL - CategoryHashed: 19015.Age: 14 - AgeHashed 62
Copy link
Contributor

@rogancarr rogancarr Mar 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

19015.Age [](start = 47, length = 9)

Space between #Resolved

// Output
//
// Category: MLB - CategoryHashed: 36206.Age: 18 - AgeHashed 127
// Category: NFL - CategoryHashed: 19015.Age: 14 - AgeHashed 62
Copy link
Contributor

@rogancarr rogancarr Mar 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L - C [](start = 27, length = 5)

These spaces are a copy/paste residue off the command line. #Resolved

// Category: MLS - CategoryHashed: 6013.Age: 14 - AgeHashed 62

// for the Category column, where we set the maximumNumberOfInvertsparameter, the names of the original categories,
// and their correspondance with the generated hash values is preserved.
Copy link
Contributor

@rogancarr rogancarr Mar 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unclear what you mean here.

Also, MaximumNumberOfInverts means that at most N names are kept. See

public void InspectSlotNamesForReversibleHash()
(You'll want to use -1 to get them all.) #Resolved

transformedData.Schema["CategoryHashed"].Annotations.GetValue("KeyValues", ref slotNames);

var indices = slotNames.GetIndices();
var categoryNames = slotNames.GetValues();
Copy link
Contributor

@rogancarr rogancarr Mar 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find that mucking around with annotations is a bit confusing, so maybe add a few more "what I am doing on this line" comments? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might have added too much explanations, let me know if it feels ridiculous.
Some of the details will also be in the page where the example gets embedded, but based on how i use the API docs (just scroll down to the example, and ignore the rest :P ) it won't hurt.


In reply to: 267929818 [](ancestors = 267929818)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code is complicated, which is the fault of our API, but inevitably users would want to do this, so we might as well keep it.


In reply to: 267964622 [](ancestors = 267964622,267929818)

// Construct the pipeline that would hash the two columns and store the results in new columns.
// The first transform hashes the string column and the second column hashes the integer column.
//
// Hashing is not a reversible operation, so there is not way to retrive the original value from the hashed value.
Copy link

@shmoradims shmoradims Mar 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not [](start = 66, length = 3)

not -> no #Resolved

/// <example>
/// <format type="text/markdown">
/// <![CDATA[
/// [!code-csharp[ConvertType](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/Conversion/Hash.cs)]
Copy link
Contributor

@zeahmed zeahmed Mar 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ConvertType [](start = 27, length = 11)

Hashing??? #Resolved

var data = mlContext.Data.LoadFromEnumerable(rawData);

// Construct the pipeline that would hash the two columns and store the results in new columns.
// The first transform hashes the string column and the second column hashes the integer column.
Copy link
Contributor

@zeahmed zeahmed Mar 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

column [](start = 75, length = 6)

transform??? #Resolved


// Get a small dataset as an IEnumerable.
var rawData = new[] {
new DataPoint() { Category = "MLB" , Age = 18 },
Copy link

@shmoradims shmoradims Mar 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Age [](start = 53, length = 3)

I don't think it makes sense to apply hash to a numeric feature, because the value can be used as is and be more useful. It makes sense to hash an integer column, when the integer is in fact categorical, like "CategoryId", "StoreId" things like that. So I suggest changing Age to CategoryId or CategoryCode or something like that. #Pending

Copy link
Contributor

@zeahmed zeahmed Mar 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends.

For linear models, usually age is treated as the categorical feature because this way linear model can capture the non-monotonic relationship between the feature and label. Hashing is actually not needed for Age feature as one-hot encoding is sufficient because of low cardinality.

For trees, Age can be used as-is.

I am not sure if this sample on small data will convey the importance of using hashing. Shall we add some comments about where hashing will be useful?


In reply to: 268000748 [](ancestors = 268000748)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do that on the transform documentation, and not the example.


In reply to: 268004737 [](ancestors = 268004737,268000748)

Copy link

@shmoradims shmoradims left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link
Contributor

@zeahmed zeahmed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@sfilipi sfilipi merged commit 49403ab into dotnet:master Mar 22, 2019
@sfilipi sfilipi deleted the hashSample branch March 22, 2019 17:41
@ghost ghost locked as resolved and limited conversation to collaborators Mar 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

documentation Related to documentation of ML.NET

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants