Closed
Description
When ProduceWordBags is called with the weighting parameter specified, the value is lost and the results always use WeightingCriteria.Tf.
The simplest repro steps I have (based on the LdaTransform sample):
// Assumed setup, as in the sample: ml is the MLContext.
var ml = new MLContext();

// Get a small dataset as an IEnumerable and then read it as a ML.NET data set.
IEnumerable<SamplesUtils.DatasetUtils.SampleTopicsData> data = SamplesUtils.DatasetUtils.GetTopicsData();
var trainData = ml.Data.LoadFromEnumerable(data);
string review = nameof(SamplesUtils.DatasetUtils.SampleTopicsData.Review);

// A pipeline for featurizing the "Review" column, requesting TF-IDF weighting.
var pipeline = ml.Transforms.Text.ProduceWordBags("bags", review, ngramLength: 1, weighting: Transforms.Text.NgramExtractingEstimator.WeightingCriteria.TfIdf);

// The transformed data
var transformer = pipeline.Fit(trainData);
var transformed_data = transformer.Transform(trainData);
var preview = transformed_data.Preview();

// Print the explicitly stored values of each row's "bags" vector.
var bagsColumn = transformed_data.GetColumn<VBuffer<float>>("bags");
foreach (var featureRow in bagsColumn)
{
    foreach (var value in featureRow.GetValues())
        Console.Write($"{value} ");
    Console.WriteLine("");
}

Expected output:
1.386294 0.6931472 0.6931472 1.386294 0.6931472 0.2876821 0 0 0 0 0 0 0
0 0.6931472 0.6931472 0 0.6931472 0.2876821 1.386294 1.386294 0 0 0 0 0
0.6931472 0.6931472 0.6931472 0.6931472 0.6931472
0 0 0 0 0 0.2876821 0 0 0.6931472 0.6931472 0.6931472 0.6931472 0.6931472
Actual output:
1 1 1 1 1 1 0 0 0 0 0 0 0
0 1 1 0 1 1 1 1 0 0 0 0 0
1 1 1 1 1
0 0 0 0 0 1 0 0 1 1 1 1 1
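For reference, the expected values above are consistent with weighting each term frequency by the natural log of the inverse document frequency over the N = 4 rows. This is inferred from the numbers in this issue, not taken from the ML.NET source, so treat the formula as an assumption. The symbols tf, df, and N are my own notation. Every term frequency in the actual output is 0 or 1, so each nonzero expected value is just the IDF factor:

```latex
\[
w_{t,d} = \mathrm{tf}_{t,d} \cdot \ln\frac{N}{\mathrm{df}_t}, \qquad N = 4
\]
\[
\mathrm{df}_t = 1:\ \ln 4 \approx 1.386294, \qquad
\mathrm{df}_t = 2:\ \ln 2 \approx 0.6931472, \qquad
\mathrm{df}_t = 3:\ \ln\tfrac{4}{3} \approx 0.2876821
\]
```

For example, the sixth n-gram is present in three rows (rows 1, 2, and 4 of the actual output), which matches the 0.2876821 shown in that position of the expected output.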
I'll send a PR with a fix shortly.