Hash sample #3042
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,95 @@ | ||||||||||
| using System; | ||||||||||
| using Microsoft.ML.Data; | ||||||||||
|
|
||||||||||
| namespace Microsoft.ML.Samples.Dynamic | ||||||||||
| { | ||||||||||
| // This example demonstrates hashing of categorical string and integer data types. | ||||||||||
| public static class Hash | ||||||||||
| { | ||||||||||
| public static void Example() | ||||||||||
|
Member
Do we need to at least execute this function in a test? #Pending
Member
Author
There is a work item assigned to me to convert all these samples to baseline tests. I can't wait to get to it, for fear of breaking the output through other changes. In reply to: 267876000 |
||||||||||
| { | ||||||||||
| // Create a new ML context, for ML.NET operations. It can be used for exception tracking and logging, | ||||||||||
| // as well as the source of randomness. | ||||||||||
| var mlContext = new MLContext(seed: 1); | ||||||||||
|
|
||||||||||
| // Get a small dataset as an IEnumerable. | ||||||||||
| var rawData = new[] { | ||||||||||
| new DataPoint() { Category = "MLB" , Age = 18 }, | ||||||||||
|
I don't think it makes sense to apply a hash to a numeric feature, because the value can be used as is and is more useful that way. It makes sense to hash an integer column when the integer is in fact categorical, such as "CategoryId" or "StoreId". So I suggest changing Age to CategoryId, CategoryCode, or something similar. #Pending
Contributor
It depends. For linear models, age is usually treated as a categorical feature, because that way the linear model can capture a non-monotonic relationship between the feature and the label. Hashing is not actually needed for the Age feature, since one-hot encoding is sufficient at such low cardinality. For trees, Age can be used as-is. I am not sure this sample, on such a small dataset, will convey the importance of hashing. Shall we add some comments about where hashing is useful? In reply to: 268000748
Member
Author
Let's do that in the transform documentation, not in the example. In reply to: 268004737 |
||||||||||
| new DataPoint() { Category = "NFL" , Age = 14 }, | ||||||||||
| new DataPoint() { Category = "NFL" , Age = 15 }, | ||||||||||
| new DataPoint() { Category = "MLB" , Age = 18 }, | ||||||||||
| new DataPoint() { Category = "MLS" , Age = 14 }, | ||||||||||
| }; | ||||||||||
|
|
||||||||||
| var data = mlContext.Data.LoadFromEnumerable(rawData); | ||||||||||
|
|
||||||||||
| // Construct the pipeline that would hash the two columns and store the results in new columns. | ||||||||||
| // The first transform hashes the string column and the second transform hashes the integer column. | ||||||||||
| // | ||||||||||
| // Hashing is not a reversible operation, so there is no way to retrieve the original value from the hashed value. | ||||||||||
| // Sometimes, for debugging, or model explainability, users will need to know what values in the original columns generated | ||||||||||
| // the values in the hashed columns, since the algorithms will mostly use the hashed values for further computations. | ||||||||||
| // The Hash method will preserve the mapping from the original values to the hashed values in the Annotations of the | ||||||||||
| // newly created column (column populated with the hashed values). | ||||||||||
| // | ||||||||||
| // Setting the maximumNumberOfInverts parameter to -1 will preserve the full map. | ||||||||||
| // If that parameter is left to the default 0 value, the mapping is not preserved. | ||||||||||
| var pipeline = mlContext.Transforms.Conversion.Hash("CategoryHashed", "Category", numberOfBits: 16, maximumNumberOfInverts: -1) | ||||||||||
| .Append(mlContext.Transforms.Conversion.Hash("AgeHashed", "Age", numberOfBits: 8)); | ||||||||||
|
|
||||||||||
| // Let's fit our pipeline, and then apply it to the same data. | ||||||||||
| var transformer = pipeline.Fit(data); | ||||||||||
| var transformedData = transformer.Transform(data); | ||||||||||
|
|
||||||||||
| // Convert the transformed data from the IDataView format to an IEnumerable<TransformedDataPoint> for easy consumption. | ||||||||||
| var convertedData = mlContext.Data.CreateEnumerable<TransformedDataPoint>(transformedData, true); | ||||||||||
|
|
||||||||||
| Console.WriteLine("Category CategoryHashed\t Age\t AgeHashed"); | ||||||||||
| foreach (var item in convertedData) | ||||||||||
| Console.WriteLine($"{item.Category}\t {item.CategoryHashed}\t\t {item.Age}\t {item.AgeHashed}"); | ||||||||||
|
|
||||||||||
| // Expected data after the transformation. | ||||||||||
| // | ||||||||||
| // Category CategoryHashed Age AgeHashed | ||||||||||
| // MLB 36206 18 127 | ||||||||||
| // NFL 19015 14 62 | ||||||||||
| // NFL 19015 15 43 | ||||||||||
| // MLB 36206 18 127 | ||||||||||
| // MLS 6013 14 62 | ||||||||||
|
|
||||||||||
| // For the Category column, where we set the maximumNumberOfInverts parameter, the names of the original categories | ||||||||||
| // and their correspondence with the generated hash values are preserved in the Annotations, in the format of indices and values. | ||||||||||
| // The indices array will have the hashed values, and the corresponding element, position-wise, in the values array will | ||||||||||
| // contain the original value. | ||||||||||
| // | ||||||||||
| // See below for an example on how to retrieve the mapping. | ||||||||||
| var slotNames = new VBuffer<ReadOnlyMemory<char>>(); | ||||||||||
| transformedData.Schema["CategoryHashed"].Annotations.GetValue("KeyValues", ref slotNames); | ||||||||||
|
|
||||||||||
| var indices = slotNames.GetIndices(); | ||||||||||
| var categoryNames = slotNames.GetValues(); | ||||||||||
|
Contributor
I find that mucking around with annotations is a bit confusing, so maybe add a few more "what I am doing on this line" comments? #Resolved
Member
Author
I might have added too many explanations; let me know if it feels ridiculous. In reply to: 267929818
The code is complicated, which is the fault of our API, but users will inevitably want to do this, so we might as well keep it. In reply to: 267964622 |
||||||||||
|
|
||||||||||
| for (int i = 0; i < indices.Length; i++) | ||||||||||
| Console.WriteLine($"The original value of the {indices[i]} category is {categoryNames[i]}"); | ||||||||||
|
|
||||||||||
| // Output Data | ||||||||||
| // | ||||||||||
| // The original value of the 6012 category is MLS | ||||||||||
| // The original value of the 19014 category is NFL | ||||||||||
| // The original value of the 36205 category is MLB | ||||||||||
| } | ||||||||||
|
|
||||||||||
| private class DataPoint | ||||||||||
| { | ||||||||||
| public string Category; | ||||||||||
| public uint Age; | ||||||||||
| } | ||||||||||
|
|
||||||||||
| private class TransformedDataPoint : DataPoint | ||||||||||
| { | ||||||||||
| public uint CategoryHashed; | ||||||||||
| public uint AgeHashed; | ||||||||||
| } | ||||||||||
|
|
||||||||||
| } | ||||||||||
| } | ||||||||||
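A note on the two expected-output blocks in the sample above: the `CategoryHashed` column prints 36206 for MLB, while the `KeyValues` annotation lists MLB at index 36205. The sketch below illustrates one reading of that off-by-one, assuming that ML.NET key types reserve the value 0 for "missing", so that key value k corresponds to zero-based annotation slot k - 1. The `KeyIndexSketch` class and `ToAnnotationIndex` helper are hypothetical names used only for illustration and are not part of ML.NET; the numbers are copied from the sample's own comments.

```csharp
using System;
using System.Collections.Generic;

public static class KeyIndexSketch
{
    // Hypothetical helper, not an ML.NET API.
    // Assumption: key value 0 means "missing", and key value k maps to
    // zero-based annotation slot k - 1.
    public static int? ToAnnotationIndex(uint keyValue)
    {
        return keyValue == 0 ? (int?)null : (int)(keyValue - 1);
    }

    public static void Main()
    {
        // Zero-based slots and category names, copied from the sample's
        // "Output Data" comment block.
        var keyValues = new Dictionary<int, string>
        {
            [6012] = "MLS",
            [19014] = "NFL",
            [36205] = "MLB",
        };

        // Hashed values as printed in the CategoryHashed column of the
        // sample's "Expected data" comment block.
        uint[] hashed = { 36206, 19015, 19015, 36206, 6013 };

        foreach (uint key in hashed)
        {
            int? slot = ToAnnotationIndex(key);
            string name = slot.HasValue && keyValues.TryGetValue(slot.Value, out var found)
                ? found
                : "<unknown>";
            Console.WriteLine($"{key} -> slot {slot} -> {name}");
        }
    }
}
```

Under that assumption, every value in the `CategoryHashed` column resolves to a category name by subtracting one before looking it up in the annotation.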
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -3,6 +3,7 @@ | ||
| <PropertyGroup> | ||
| <TargetFramework>netcoreapp2.1</TargetFramework> | ||
| <OutputType>Exe</OutputType> | ||
| <WarningsNotAsErrors>649</WarningsNotAsErrors> | ||
|
Contributor
If you're getting errors because of the data classes you create, you can solve this by doing a
Member
Author
Thanks Rogan. Leave this here, so people don't keep bumping into this? In reply to: 267927422 |
||
| </PropertyGroup> | ||
|
|
||
| <ItemGroup> | ||
|
|
||
Please add an XML string: "This example demonstrates hashing of string and integer types." #Resolved