-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Hash sample #3042
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hash sample #3042
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3042 +/- ##
==========================================
+ Coverage 72.49% 72.5% +0.01%
==========================================
Files 804 804
Lines 144078 144150 +72
Branches 16179 16179
==========================================
+ Hits 104446 104519 +73
- Misses 35219 35220 +1
+ Partials 4413 4411 -2
|
| // The original value of the 6012 category is MLS | ||
| // The original value of the 19014 category is NFL | ||
| // The original value of the 36205 category is MLB | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra empty line? #Resolved
| namespace Microsoft.ML.Samples.Dynamic | ||
| { | ||
| public static class Hash | ||
| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add xml string --- This example demonstrates hashing of string and integer types. #Resolved
|
|
||
| var data = mlContext.Data.LoadFromEnumerable(rawData); | ||
|
|
||
| // Construct the pipeline that would hash the two columns and store the results in new columns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // Construct the pipeline that would hash the two columns and store the results in new columns. | |
| // Construct the pipeline that would hash the two columns and store the results in new columns. | |
| // The first transform hashes the string column and the second column hashes the integer column. | |
| ``` #Resolved |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
|
||
| } | ||
|
|
||
| private class InputData |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| private class InputData | |
| private class DataPoint | |
| ``` #Resolved |
| public uint Age; | ||
| } | ||
|
|
||
| private class TransformedData : InputData |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TransformedDataPoint or Result? #Resolved
|
|
||
| // Output | ||
| // | ||
| // Category: MLB - CategoryHashed: 36206.Age: 18 - AgeHashed 127 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spaces are missed.
| // Category: MLB - CategoryHashed: 36206.Age: 18 - AgeHashed 127 | |
| // Category: MLB - CategoryHashed: 36206. Age: 18 - AgeHashed 127 | |
| ``` #Resolved |
| { | ||
| public static class Hash | ||
| { | ||
| public static void Example() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to at least execute this function in a test? #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a workitem on me to convert all those samples to baseline tests. Can't wait to get to it, for fear of breaking the output through other changes.
In reply to: 267876000 [](ancestors = 267876000)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| <PropertyGroup> | ||
| <TargetFramework>netcoreapp2.1</TargetFramework> | ||
| <OutputType>Exe</OutputType> | ||
| <WarningsNotAsErrors>649</WarningsNotAsErrors> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
649 [](start = 4, length = 46)
If you're getting errors because of the data classes you create, you can solve this by doing a public xyz Xyz {get; set;} in your classes. #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Rogan. Leave this here, so people don't keep bumping into this?
It disables just that check.
In reply to: 267927422 [](ancestors = 267927422)
| var transformer = pipeline.Fit(data); | ||
| var transformedData = transformer.Transform(data); | ||
|
|
||
| // Store the post transformation from the IDataView format, to an IEnumerable<TransformedData> for easy consumption |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Store [](start = 15, length = 5)
Convert #Resolved
| var transformer = pipeline.Fit(data); | ||
| var transformedData = transformer.Transform(data); | ||
|
|
||
| // Store the post transformation from the IDataView format, to an IEnumerable<TransformedData> for easy consumption |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
, [](start = 70, length = 1)
no comma #Resolved
| var transformer = pipeline.Fit(data); | ||
| var transformedData = transformer.Transform(data); | ||
|
|
||
| // Store the post transformation from the IDataView format, to an IEnumerable<TransformedData> for easy consumption |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consumption [](start = 116, length = 11)
Periods at the end of all comments. (Hi Gleb!) #Resolved
| Console.WriteLine($"Category: {item.Category} - CategoryHashed: {item.CategoryHashed}. Age: {item.Age} - AgeHashed {item.AgeHashed}"); | ||
| } | ||
|
|
||
| // Output |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Output [](start = 15, length = 6)
Expected? Do we have a standard we're using? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, we don't. we might want to do a pass over all of them and just nit at things like this one.
In reply to: 267928171 [](ancestors = 267928171)
| // Output | ||
| // | ||
| // Category: MLB - CategoryHashed: 36206.Age: 18 - AgeHashed 127 | ||
| // Category: NFL - CategoryHashed: 19015.Age: 14 - AgeHashed 62 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
19015.Age [](start = 47, length = 9)
Space between #Resolved
| // Output | ||
| // | ||
| // Category: MLB - CategoryHashed: 36206.Age: 18 - AgeHashed 127 | ||
| // Category: NFL - CategoryHashed: 19015.Age: 14 - AgeHashed 62 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
L - C [](start = 27, length = 5)
These spaces are a copy/paste residue off the command line. #Resolved
| // Category: MLS - CategoryHashed: 6013.Age: 14 - AgeHashed 62 | ||
|
|
||
| // for the Category column, where we set the maximumNumberOfInvertsparameter, the names of the original categories, | ||
| // and their correspondance with the generated hash values is preserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unclear what you mean here.
Also, MaximumNumberOfInverts means that at most N names are kept. See
| public void InspectSlotNamesForReversibleHash() |
| transformedData.Schema["CategoryHashed"].Annotations.GetValue("KeyValues", ref slotNames); | ||
|
|
||
| var indices = slotNames.GetIndices(); | ||
| var categoryNames = slotNames.GetValues(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find that mucking around with annotations is a bit confusing, so maybe add a few more "what I am doing on this line" comments? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might have added too much explanations, let me know if it feels ridiculous.
Some of the details will also be in the page where the example gets embedded, but based on how i use the API docs (just scroll down to the example, and ignore the rest :P ) it won't hurt.
In reply to: 267929818 [](ancestors = 267929818)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the code is complicated, which is the fault of our API, but inevitably users would want to do this, so we might as well keep it.
In reply to: 267964622 [](ancestors = 267964622,267929818)
| // Construct the pipeline that would hash the two columns and store the results in new columns. | ||
| // The first transform hashes the string column and the second column hashes the integer column. | ||
| // | ||
| // Hashing is not a reversible operation, so there is not way to retrive the original value from the hashed value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not [](start = 66, length = 3)
not -> no #Resolved
| /// <example> | ||
| /// <format type="text/markdown"> | ||
| /// <] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ConvertType [](start = 27, length = 11)
Hashing??? #Resolved
| var data = mlContext.Data.LoadFromEnumerable(rawData); | ||
|
|
||
| // Construct the pipeline that would hash the two columns and store the results in new columns. | ||
| // The first transform hashes the string column and the second column hashes the integer column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
column [](start = 75, length = 6)
transform??? #Resolved
|
|
||
| // Get a small dataset as an IEnumerable. | ||
| var rawData = new[] { | ||
| new DataPoint() { Category = "MLB" , Age = 18 }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Age [](start = 53, length = 3)
I don't think it makes sense to apply hash to a numeric feature, because the value can be used as is and be more useful. It makes sense to hash an integer column, when the integer is in fact categorical, like "CategoryId", "StoreId" things like that. So I suggest changing Age to CategoryId or CategoryCode or something like that. #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends.
For linear models, usually age is treated as the categorical feature because this way linear model can capture the non-monotonic relationship between the feature and label. Hashing is actually not needed for Age feature as one-hot encoding is sufficient because of low cardinality.
For trees, Age can be used as-is.
I am not sure if this sample on small data will convey the importance of using hashing. Shall we add some comments about where hashing will be useful?
In reply to: 268000748 [](ancestors = 268000748)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do that on the transform documentation, and not the example.
In reply to: 268004737 [](ancestors = 268004737,268000748)
shmoradims
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
![]()
zeahmed
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
![]()
Towards #1209
Adding a sample for the Hash extension.
Disabling the compiler check for whether the members of a class are in use.