Skip to content

Conversation

@zeahmed
Copy link
Contributor

@zeahmed zeahmed commented Apr 2, 2019

Related to #1209.

foreach (var item in featureRow.Items())
Console.Write($"{slots[item.Key]} ");
Console.WriteLine();
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the concern for me right now. There is no way to get this meta data through the Transformer or through the prediction engine. The only way is through IDataView obtained from .Transform call.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have to transform the data anyway, why not just print off the transformed IDV instead of off a prediction? You can limit it to one row with a TakeRows filter.


In reply to: 271476465 [](ancestors = 271476465)

new TextData(){ Text = "The value at each position corresponds to," },
new TextData(){ Text = "the number of times Ngram occured in the data (Tf), or" },
new TextData(){ Text = "the inverse of the number of documents that contain the Ngram (Idf), or." },
new TextData(){ Text = "or compute both and multipy together (Tf-Idf)." },
Copy link
Member

@sfilipi sfilipi Apr 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

love it! #Closed

var textPipeline = mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Text")
.Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
.Append(mlContext.Transforms.Text.ProduceNgrams("NgramFeatures", "Tokens",
ngramLength: 3, useAllLengths: false, weighting: NgramExtractingEstimator.WeightingCriteria.Tf));
Copy link
Member

@sfilipi sfilipi Apr 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ngramLength: 3, useAllLengths: false, weighting: NgramExtractingEstimator.WeightingCriteria.Tf [](start = 16, length = 94)

one parameter in one line might be more presentable. #Resolved

// This is acheived by calling 'TokenizeIntoWords' first followed by 'ProduceNgrams'.
// Please note that the length of the output feature vector depends on the Ngram settings.
var textPipeline = mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Text")
.Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
Copy link
Member

@sfilipi sfilipi Apr 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens")) [](start = 14, length = 66)

add one line of comment on why this is here. #Resolved

}

// Print the first 10 feature values.
Console.Write("Features: ");
Copy link
Member

@sfilipi sfilipi Apr 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Console.Write("Features: "); [](start = 11, length = 29)

i'd remove unnecessary printings

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I'd like to keep it because its more clear to read each line with this prefix on console.


In reply to: 271816037 [](ancestors = 271816037)

transformedDataView.Schema["NgramFeatures"].GetSlotNames(ref slotNames);
var NgramFeaturesColumn = transformedDataView.GetColumn<VBuffer<float>>(transformedDataView.Schema["NgramFeatures"]);
var slots = slotNames.GetValues();
Console.Write("Ngrams: ");
Copy link
Member

@sfilipi sfilipi Apr 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Console.Write("Ngrams: "); [](start = 10, length = 28)

i'd remove this too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I'd like to keep it because its more clear to read each line with this prefix on console.


In reply to: 271816588 [](ancestors = 271816588)

// Print the length of the feature vector.
Console.WriteLine($"Number of Features: {prediction.NgramFeatures.Length}");

// Preview of the produced .
Copy link
Member

@sfilipi sfilipi Apr 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Preview of the produced . [](start = 11, length = 29)

comment about slot names #Resolved

var prediction = predictionEngine.Predict(samples[0]);

// Print the length of the feature vector.
Console.WriteLine($"Number of Features: {prediction.NgramFeatures.Length}");
Copy link
Member

@sfilipi sfilipi Apr 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this necessary? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes.


In reply to: 271817449 [](ancestors = 271817449)

var prediction = predictionEngine.Predict(samples[0]);

// Print the length of the feature vector.
Console.WriteLine($"Number of Features: {prediction.NgramFeatures.Length}");
Copy link
Member

@sfilipi sfilipi Apr 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar comment to the other file, is this needed? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes.


In reply to: 271819189 [](ancestors = 271819189)

@codecov
Copy link

codecov bot commented Apr 3, 2019

Codecov Report

Merging #3177 into master will increase coverage by 0.04%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3177      +/-   ##
==========================================
+ Coverage   72.54%   72.58%   +0.04%     
==========================================
  Files         807      807              
  Lines      144774   144956     +182     
  Branches    16208    16212       +4     
==========================================
+ Hits       105022   105215     +193     
+ Misses      35338    35325      -13     
- Partials     4414     4416       +2
Flag Coverage Δ
#Debug 72.58% <ø> (+0.04%) ⬆️
#production 68.14% <ø> (+0.01%) ⬆️
#test 88.88% <ø> (+0.04%) ⬆️
Impacted Files Coverage Δ
src/Microsoft.ML.Transforms/Text/TextCatalog.cs 41.66% <ø> (ø) ⬆️
src/Microsoft.ML.DataView/KeyDataViewType.cs 74.57% <0%> (-3.76%) ⬇️
src/Microsoft.ML.Transforms/Text/LdaTransform.cs 89.26% <0%> (-0.63%) ⬇️
...soft.ML.TestFramework/DataPipe/TestDataPipeBase.cs 73.7% <0%> (-0.34%) ⬇️
test/Microsoft.ML.Tests/ImagesTests.cs 98.69% <0%> (-0.13%) ⬇️
...Microsoft.ML.Tests/Transformers/NormalizerTests.cs 100% <0%> (ø) ⬆️
...ML.Data/Transforms/ConversionsExtensionsCatalog.cs 44.87% <0%> (ø) ⬆️
src/Microsoft.ML.Maml/MAML.cs 26.21% <0%> (+1.45%) ⬆️
...rosoft.ML.ImageAnalytics/VectorToImageTransform.cs 76.77% <0%> (+4.53%) ⬆️
... and 3 more

new TextData(){ Text = "Each position in the vector corresponds to a particular Ngram." },
new TextData(){ Text = "The value at each position corresponds to," },
new TextData(){ Text = "the number of times Ngram occured in the data (Tf), or" },
new TextData(){ Text = "the inverse of the number of documents that contain the Ngram (Idf), or." },
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or. [](start = 109, length = 3)

omit this one. #Resolved

new TextData(){ Text = "This is an example to compute Ngrams using hashing." },
new TextData(){ Text = "Ngram is a sequence of 'N' consecutive words/tokens." },
new TextData(){ Text = "ML.NET's ProduceHashedNgrams API produces count of Ngrams and hashes it as an index into a vector of given bit length." },
new TextData(){ Text = "The hashing schem reduces the size of the output feature vector" },
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

schem [](start = 52, length = 5)

process? #Resolved

Console.Write($"{prediction.NgramFeatures[i]:F4} ");

// Expected output:
// Number of Features: 256
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Number of Features: 256 [](start = 12, length = 28)

if you use maximumNumberOfInverts you can also show up slot names as in just ngrams. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that but its bit complex to represent it on console as many ngram can fall into one slot.


In reply to: 271903797 [](ancestors = 271903797)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well you can control with that parameter how many you actually want to remember, right? So if you put 1, you will have only one of them


In reply to: 271929298 [](ancestors = 271929298,271903797)

Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

// Please note that the length of the output feature vector depends on the n-gram settings.
var textPipeline = mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Text")
// 'ProduceNgrams' takes key type as input. Converting the tokens into key type using 'MapValueToKey'.
.Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens")) [](start = 16, length = 64)

This seems like a holdover from the internal codebase. I wonder if we should consider doing a breaking change to move this keytype conversion into the ProduceNGrams operation. The question is what other use cases do we expect to see?

Copy link
Contributor

@rogancarr rogancarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved with comments.

@zeahmed zeahmed merged commit 854f154 into dotnet:master Apr 5, 2019
zeahmed added a commit to zeahmed/machinelearning that referenced this pull request Apr 8, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants