Created samples for 'ProduceNgrams' and 'ProduceHashedNgrams' APIs. #3177
Conversation
foreach (var item in featureRow.Items())
    Console.Write($"{slots[item.Key]} ");
Console.WriteLine();
}
This is the concern for me right now. There is no way to get this metadata through the Transformer or through the prediction engine. The only way is through the IDataView obtained from the .Transform call.
If you have to transform the data anyway, why not just print off the transformed IDV instead of off a prediction? You can limit it to one row with a TakeRows filter.
In reply to: 271476465
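A minimal sketch of that suggestion, assuming the sample's usings (Microsoft.ML, Microsoft.ML.Data) and the 'mlContext', 'transformedDataView' and 'NgramFeatures' names that appear in the sample; the rest is illustrative:

// Limit the transformed data to a single row with TakeRows, then read the
// slot (ngram) names and the nonzero feature values straight from the IDataView.
var singleRow = mlContext.Data.TakeRows(transformedDataView, 1);
VBuffer<ReadOnlyMemory<char>> slotNames = default;
singleRow.Schema["NgramFeatures"].GetSlotNames(ref slotNames);
var slots = slotNames.GetValues();
foreach (var featureRow in singleRow.GetColumn<VBuffer<float>>(singleRow.Schema["NgramFeatures"]))
{
    foreach (var item in featureRow.Items())
        Console.Write($"{slots[item.Key]} ");
    Console.WriteLine();
}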
| new TextData(){ Text = "The value at each position corresponds to," }, | ||
| new TextData(){ Text = "the number of times Ngram occured in the data (Tf), or" }, | ||
| new TextData(){ Text = "the inverse of the number of documents that contain the Ngram (Idf), or." }, | ||
| new TextData(){ Text = "or compute both and multipy together (Tf-Idf)." }, |
love it! #Closed
var textPipeline = mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Text")
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
    .Append(mlContext.Transforms.Text.ProduceNgrams("NgramFeatures", "Tokens",
        ngramLength: 3, useAllLengths: false, weighting: NgramExtractingEstimator.WeightingCriteria.Tf));
ngramLength: 3, useAllLengths: false, weighting: NgramExtractingEstimator.WeightingCriteria.Tf
One parameter per line might be more presentable. #Resolved
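For reference, a sketch of the formatting being suggested here (the same call as in the diff above, just one parameter per line):

.Append(mlContext.Transforms.Text.ProduceNgrams(
    "NgramFeatures",
    "Tokens",
    ngramLength: 3,
    useAllLengths: false,
    weighting: NgramExtractingEstimator.WeightingCriteria.Tf));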
// This is achieved by calling 'TokenizeIntoWords' first followed by 'ProduceNgrams'.
// Please note that the length of the output feature vector depends on the Ngram settings.
var textPipeline = mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Text")
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
.Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
Add a one-line comment on why this is here. #Resolved
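A sketch of what such a comment could look like; the later revision of the sample, quoted further down in this conversation, uses similar wording:

// 'ProduceNgrams' requires a key-typed input, so the tokens are first converted with 'MapValueToKey'.
.Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))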
}

// Print the first 10 feature values.
Console.Write("Features: ");
Console.Write("Features: "); [](start = 11, length = 29)
i'd remove unnecessary printings
Actually, I'd like to keep it because it's clearer to read each line with this prefix on the console.
In reply to: 271816037
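For context, a sketch of the printing being discussed, assuming the sample's 'prediction' object with its 'NgramFeatures' float vector:

// Print the first 10 feature values, prefixed so the line is easy to spot in the console output.
Console.Write("Features: ");
for (int i = 0; i < 10; i++)
    Console.Write($"{prediction.NgramFeatures[i]:F4} ");
Console.WriteLine();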
transformedDataView.Schema["NgramFeatures"].GetSlotNames(ref slotNames);
var NgramFeaturesColumn = transformedDataView.GetColumn<VBuffer<float>>(transformedDataView.Schema["NgramFeatures"]);
var slots = slotNames.GetValues();
Console.Write("Ngrams: ");
Console.Write("Ngrams: "); [](start = 10, length = 28)
i'd remove this too.
Actually, I'd like to keep it because it's clearer to read each line with this prefix on the console.
In reply to: 271816588
// Print the length of the feature vector.
Console.WriteLine($"Number of Features: {prediction.NgramFeatures.Length}");

// Preview of the produced .
// Preview of the produced .
Add a comment about the slot names here. #Resolved
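A sketch of the slot-name preview such a comment could describe, reusing the 'transformedDataView' and 'NgramFeatures' names from the sample:

// Preview of the produced ngrams: the slot names of the 'NgramFeatures' column are the ngrams themselves.
VBuffer<ReadOnlyMemory<char>> slotNames = default;
transformedDataView.Schema["NgramFeatures"].GetSlotNames(ref slotNames);
foreach (var ngram in slotNames.DenseValues())
    Console.Write($"{ngram} ");
Console.WriteLine();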
var prediction = predictionEngine.Predict(samples[0]);

// Print the length of the feature vector.
Console.WriteLine($"Number of Features: {prediction.NgramFeatures.Length}");
is this necessary? #Resolved
var prediction = predictionEngine.Predict(samples[0]);

// Print the length of the feature vector.
Console.WriteLine($"Number of Features: {prediction.NgramFeatures.Length}");
Similar comment to the other file: is this needed? #Resolved
Codecov Report
@@ Coverage Diff @@
## master #3177 +/- ##
==========================================
+ Coverage 72.54% 72.58% +0.04%
==========================================
Files 807 807
Lines 144774 144956 +182
Branches 16208 16212 +4
==========================================
+ Hits 105022 105215 +193
+ Misses 35338 35325 -13
- Partials 4414 4416 +2
new TextData(){ Text = "Each position in the vector corresponds to a particular Ngram." },
new TextData(){ Text = "The value at each position corresponds to," },
new TextData(){ Text = "the number of times Ngram occurred in the data (Tf), or" },
new TextData(){ Text = "the inverse of the number of documents that contain the Ngram (Idf), or." },
or.
Omit this one. #Resolved
| new TextData(){ Text = "This is an example to compute Ngrams using hashing." }, | ||
| new TextData(){ Text = "Ngram is a sequence of 'N' consecutive words/tokens." }, | ||
| new TextData(){ Text = "ML.NET's ProduceHashedNgrams API produces count of Ngrams and hashes it as an index into a vector of given bit length." }, | ||
| new TextData(){ Text = "The hashing schem reduces the size of the output feature vector" }, |
schem
process? #Resolved
| Console.Write($"{prediction.NgramFeatures[i]:F4} "); | ||
|
|
||
| // Expected output: | ||
| // Number of Features: 256 |
// Number of Features: 256
If you use maximumNumberOfInverts you can also show the slot names, as in the plain ngrams sample. #Resolved
I tried that, but it's a bit complex to represent on the console since many ngrams can fall into one slot.
In reply to: 271903797
Well, you can control with that parameter how many you actually want to remember, right? So if you put 1, you will have only one of them.
In reply to: 271929298
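A sketch of what that could look like in the hashed sample; numberOfBits: 8 is an assumption chosen to match the 256-slot vector in the expected output above, and maximumNumberOfInverts: 1 keeps at most one original ngram per hash slot so slot names can be inspected:

var textPipeline = mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Text")
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
    .Append(mlContext.Transforms.Text.ProduceHashedNgrams(
        "NgramFeatures",
        "Tokens",
        numberOfBits: 8,             // 2^8 = 256 feature slots
        ngramLength: 3,
        useAllLengths: false,
        maximumNumberOfInverts: 1)); // remember at most one original ngram per slot for the slot names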
Ivanidzo4ka left a comment
// Please note that the length of the output feature vector depends on the n-gram settings.
var textPipeline = mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Text")
    // 'ProduceNgrams' takes key type as input. Converting the tokens into key type using 'MapValueToKey'.
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
.Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
This seems like a holdover from the internal codebase. I wonder if we should consider doing a breaking change to move this keytype conversion into the ProduceNGrams operation. The question is what other use cases do we expect to see?
Approved with comments.
Related to #1209.