Merged
Original file line number Diff line number Diff line change
@@ -0,0 +1,95 @@
using System;
using Microsoft.ML.Data;

namespace Microsoft.ML.Samples.Dynamic
{
// This example demonstrates hashing of categorical string and integer data types.
public static class Hash
{
Member

@wschin wschin Mar 21, 2019

Please add xml string --- This example demonstrates hashing of string and integer types. #Resolved

public static void Example()
Member

@wschin wschin Mar 21, 2019

Do we need to at least execute this function in a test? #Pending

Member Author

There is a work item assigned to me to convert all those samples to baseline tests. I can't wait to get to it, for fear of breaking the output through other changes.


In reply to: 267876000 [](ancestors = 267876000)

#2954


In reply to: 267958858 [](ancestors = 267958858,267876000)

{
// Create a new ML context, for ML.NET operations. It can be used for exception tracking and logging,
// as well as the source of randomness.
var mlContext = new MLContext(seed: 1);

// Get a small dataset as an IEnumerable.
var rawData = new[] {
new DataPoint() { Category = "MLB" , Age = 18 },

@shmoradims shmoradims Mar 22, 2019

Age [](start = 53, length = 3)

I don't think it makes sense to apply a hash to a numeric feature, because the value can be used as is and is more useful that way. It makes sense to hash an integer column when the integer is in fact categorical, like "CategoryId" or "StoreId". So I suggest changing Age to CategoryId, CategoryCode, or something like that. #Pending

Contributor

@zeahmed zeahmed Mar 22, 2019

It depends.

For linear models, age is usually treated as a categorical feature, because that way the linear model can capture a non-monotonic relationship between the feature and the label. Hashing is actually not needed for the Age feature: one-hot encoding is sufficient because of its low cardinality.

For trees, Age can be used as-is.

I am not sure if this sample on small data will convey the importance of using hashing. Shall we add some comments about where hashing will be useful?


In reply to: 268000748 [](ancestors = 268000748)

Member Author

Let's do that on the transform documentation, and not the example.


In reply to: 268004737 [](ancestors = 268004737,268000748)

new DataPoint() { Category = "NFL" , Age = 14 },
new DataPoint() { Category = "NFL" , Age = 15 },
new DataPoint() { Category = "MLB" , Age = 18 },
new DataPoint() { Category = "MLS" , Age = 14 },
};

var data = mlContext.Data.LoadFromEnumerable(rawData);

// Construct the pipeline that would hash the two columns and store the results in new columns.
Member

@wschin wschin Mar 21, 2019

Suggested change
- // Construct the pipeline that would hash the two columns and store the results in new columns.
+ // Construct the pipeline that would hash the two columns and store the results in new columns.
+ // The first transform hashes the string column and the second column hashes the integer column.
#Resolved

Contributor

+1


In reply to: 267874410 [](ancestors = 267874410)

// The first transform hashes the string column and the second transform hashes the integer column.
//
// Hashing is not a reversible operation, so there is no way to retrieve the original value from the hashed value.
// Sometimes, for debugging, or model explainability, users will need to know what values in the original columns generated
// the values in the hashed columns, since the algorithms will mostly use the hashed values for further computations.
// The Hash method will preserve the mapping from the original values to the hashed values in the Annotations of the
// newly created column (column populated with the hashed values).
//
// Setting the maximumNumberOfInverts parameter to -1 will preserve the full map.
// If that parameter is left to the default 0 value, the mapping is not preserved.
var pipeline = mlContext.Transforms.Conversion.Hash("CategoryHashed", "Category", numberOfBits: 16, maximumNumberOfInverts: -1)
.Append(mlContext.Transforms.Conversion.Hash("AgeHashed", "Age", numberOfBits: 8));

// Let's fit our pipeline, and then apply it to the same data.
var transformer = pipeline.Fit(data);
var transformedData = transformer.Transform(data);

// Convert the transformed data from the IDataView format to an IEnumerable<TransformedDataPoint> for easy consumption.
var convertedData = mlContext.Data.CreateEnumerable<TransformedDataPoint>(transformedData, true);

Console.WriteLine("Category CategoryHashed\t Age\t AgeHashed");
foreach (var item in convertedData)
Console.WriteLine($"{item.Category}\t {item.CategoryHashed}\t\t {item.Age}\t {item.AgeHashed}");

// Expected data after the transformation.
//
// Category CategoryHashed Age AgeHashed
// MLB 36206 18 127
// NFL 19015 14 62
// NFL 19015 15 43
// MLB 36206 18 127
// MLS 6013 14 62

// For the Category column, where we set the maximumNumberOfInverts parameter, the names of the original categories
// and their correspondence with the generated hash values are preserved in the Annotations, in the format of indices and values.
// The indices array will have the hashed values, and the corresponding element, position-wise, in the values array will
// contain the original value.
//
// See below for an example on how to retrieve the mapping.
var slotNames = new VBuffer<ReadOnlyMemory<char>>();
transformedData.Schema["CategoryHashed"].Annotations.GetValue("KeyValues", ref slotNames);

var indices = slotNames.GetIndices();
var categoryNames = slotNames.GetValues();
Contributor

@rogancarr rogancarr Mar 21, 2019

I find that mucking around with annotations is a bit confusing, so maybe add a few more "what I am doing on this line" comments? #Resolved

Member Author

I might have added too many explanations; let me know if it feels ridiculous.
Some of the details will also be on the page where the example gets embedded, but based on how I use the API docs (just scroll down to the example, and ignore the rest :P ) it won't hurt.


In reply to: 267929818 [](ancestors = 267929818)

The code is complicated, which is the fault of our API, but inevitably users will want to do this, so we might as well keep it.


In reply to: 267964622 [](ancestors = 267964622,267929818)


for (int i = 0; i < indices.Length; i++)
Console.WriteLine($"The original value of the {indices[i]} category is {categoryNames[i]}");

// Output Data
//
// The original value of the 6012 category is MLS
// The original value of the 19014 category is NFL
// The original value of the 36205 category is MLB
}

private class DataPoint
{
public string Category;
public uint Age;
}

private class TransformedDataPoint : DataPoint
{
public uint CategoryHashed;
public uint AgeHashed;
}

}
}
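
A note on the review question above about where hashing is useful: the sketch below is a rough illustration, not ML.NET's implementation. The `HashBucketSketch` class and the FNV-1a hash are assumptions chosen for brevity. It shows that the number of output slots is fixed by `numberOfBits` regardless of how many distinct input values appear, at the cost of possible collisions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; ML.NET's Hash transform uses its own hash function.
public static class HashBucketSketch
{
    // Maps a string to one of 2^numberOfBits slots using 32-bit FNV-1a.
    public static uint Bucket(string value, int numberOfBits)
    {
        if (numberOfBits < 1 || numberOfBits > 31)
            throw new ArgumentOutOfRangeException(nameof(numberOfBits));

        uint hash = 2166136261;
        unchecked
        {
            foreach (char c in value)
            {
                hash ^= c;
                hash *= 16777619;
            }
        }
        // Keep only the low numberOfBits bits.
        return hash & ((1u << numberOfBits) - 1);
    }

    // Counts how many distinct slots a set of values occupies.
    public static int DistinctSlots(IEnumerable<string> values, int numberOfBits)
    {
        var slots = new HashSet<uint>();
        foreach (var v in values)
            slots.Add(Bucket(v, numberOfBits));
        return slots.Count;
    }
}
```

With a thousand distinct IDs and `numberOfBits: 4`, `DistinctSlots` can never exceed 16, which is the trade-off the thread discusses: bounded output size for high-cardinality columns, with several inputs sharing a slot.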
@@ -3,6 +3,7 @@
<PropertyGroup>
<TargetFramework>netcoreapp2.1</TargetFramework>
<OutputType>Exe</OutputType>
<WarningsNotAsErrors>649</WarningsNotAsErrors>
Contributor

@rogancarr rogancarr Mar 21, 2019

649 [](start = 4, length = 46)

If you're getting errors because of the data classes you create, you can solve this by declaring the members as properties, e.g. public xyz Xyz {get; set;}, in your classes. #Pending

Member Author

@sfilipi sfilipi Mar 21, 2019

Thanks Rogan. Leave this here, so people don't keep bumping into this?
It disables just that check.


In reply to: 267927422 [](ancestors = 267927422)
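
For reference, a minimal sketch of the property-based alternative suggested in this thread, assuming the warning being suppressed is CS0649 (a field that is never assigned in code). The class and member names mirror the sample's data classes; whether the samples should switch to properties is left to the authors.

```csharp
// Auto-properties have compiler-generated backing fields, so the
// "field is never assigned" warning (CS0649) does not apply to them.
public class DataPoint
{
    public string Category { get; set; }
    public uint Age { get; set; }
}

public class TransformedDataPoint : DataPoint
{
    // Populated by the framework rather than by user code in the sample.
    public uint CategoryHashed { get; set; }
    public uint AgeHashed { get; set; }
}
```

The trade-off is between a per-class change like this and a project-wide `WarningsNotAsErrors` entry, which keeps the field-based style used by the existing samples.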

</PropertyGroup>

<ItemGroup>
@@ -27,6 +27,13 @@ public static class ConversionsExtensionsCatalog
/// Text representation of original values are stored in the slot names of the metadata for the new column. Hashing, as such, can map many initial values to one.
/// <paramref name="maximumNumberOfInverts"/>Specifies the upper bound of the number of distinct input values mapping to a hash that should be retained.
/// <value>0</value> does not retain any input values. <value>-1</value> retains all input values mapping to each hash.</param>
/// <example>
/// <format type="text/markdown">
/// <![CDATA[
/// [!code-csharp[Hash](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/Conversion/Hash.cs)]
/// ]]></format>
/// </example>

public static HashingEstimator Hash(this TransformsCatalog.ConversionTransforms catalog, string outputColumnName, string inputColumnName = null,
int numberOfBits = HashDefaults.NumberOfBits, int maximumNumberOfInverts = HashDefaults.MaximumNumberOfInverts)
=> new HashingEstimator(CatalogUtils.GetEnvironment(catalog), outputColumnName, inputColumnName, numberOfBits, maximumNumberOfInverts);
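
To make the `maximumNumberOfInverts` description above concrete, here is a small sketch of the idea, written as plain C# with no ML.NET dependency. `InvertMapSketch` and its `Build` method are hypothetical names, and this is not how `HashingEstimator` is implemented; it only mirrors the documented semantics (0 keeps nothing, -1 keeps everything, a positive value caps the distinct inputs kept per hash).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration; not the HashingEstimator implementation.
public static class InvertMapSketch
{
    // Builds hash -> original values, keeping at most maximumNumberOfInverts
    // distinct inputs per hash (0 keeps none, -1 keeps all).
    public static Dictionary<uint, List<string>> Build(
        IEnumerable<string> values, Func<string, uint> hash, int maximumNumberOfInverts)
    {
        var map = new Dictionary<uint, List<string>>();
        if (maximumNumberOfInverts == 0)
            return map;

        foreach (var value in values)
        {
            uint key = hash(value);
            if (!map.TryGetValue(key, out var originals))
                map[key] = originals = new List<string>();

            bool hasRoom = maximumNumberOfInverts < 0 || originals.Count < maximumNumberOfInverts;
            if (hasRoom && !originals.Contains(value))
                originals.Add(value);
        }
        return map;
    }
}
```

Passing a deliberately weak hash such as `s => (uint)s.Length % 4u` makes "MLB", "NFL" and "MLS" collide, which shows why more than one original value may need to be retained per hash.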