Enabling Ranking Cross Validation #5263

Conversation
}

[LightGBMFact]
public void AutoFitRankingCVTest()
This is the way experiments are used within codegen.
Review: Should I add cross validation tests to all other experiments?
Should I add cross validation tests to all other experiments?

If I recall correctly, if your dataset has fewer than 15000 rows of data, AutoML will run cross validation automatically; if you have more than 15000 rows, it will use a train-test split instead. So the rest of the tests in AutoFitTests should all be CV runs, considering that the dataset they use is really small. (@justinormont correct me if I'm wrong)

Tests starting with AutoFit should test the AutoML ranking experiment API, so you shouldn't have to create your pipeline from scratch in this test. If you just want to test the Ranking.CrossValidation command, consider renaming it to something more specific.
It's the other way around. If it has less than 15000 rows, it runs a train-test split automatically on one of the Execute overloads; if it has more, it runs CV. This only happens on one overload, but I believe Keren isn't using that overload in her tests.
I have added CV testing for ranking only. I think it would be good to add testing for the other tasks as well in the future.
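For reference, such a ranking CV test might look roughly like the following. This is a minimal sketch, not the test added in this PR: the dataset path, schema, time budget, fold count, and assertion are placeholder assumptions (usings for Microsoft.ML, Microsoft.ML.AutoML, Microsoft.ML.Data, System.Linq, and Xunit are assumed).

[LightGBMFact]
public void AutoFitRankingCVTest()
{
    // Sketch only: assumes a small ranking dataset with Label, GroupId, and feature columns.
    var context = new MLContext(seed: 0);
    var data = context.Data.LoadFromTextFile("ranking-sample.tsv", new[]
    {
        new TextLoader.Column("Label", DataKind.Single, 0),
        new TextLoader.Column("GroupId", DataKind.UInt32, 1),
        new TextLoader.Column("Features", DataKind.Single, 2, 9),
    }, separatorChar: '\t', hasHeader: true);

    var experiment = context.Auto().CreateRankingExperiment(maxExperimentTimeInSeconds: 5);
    var result = experiment.Execute(data, numberOfCVFolds: 5, new ColumnInformation
    {
        LabelColumnName = "Label",
        GroupIdColumnName = "GroupId" // also used as the sampling key when splitting folds
    });

    // Every fold of the best run should yield valid NDCG metrics.
    Assert.All(result.BestRun.Results,
        r => Assert.True(r.ValidationMetrics.NormalizedDiscountedCumulativeGains.Last() > 0));
}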
Codecov Report

@@            Coverage Diff             @@
##           master    #5263      +/-   ##
==========================================
+ Coverage   73.68%   73.70%    +0.01%
==========================================
  Files        1022     1022
  Lines      190320   190490      +170
  Branches    20470    20484       +14
==========================================
+ Hits       140238   140393      +155
- Misses      44553    44560        +7
- Partials     5529     5537        +8
      IProgress<CrossValidationRunDetail<TMetrics>> progressHandler = null)
  {
      UserInputValidationUtil.ValidateNumberOfCVFoldsArg(numberOfCVFolds);
-     var splitResult = SplitUtil.CrossValSplit(Context, trainData, numberOfCVFolds, columnInformation?.SamplingKeyColumnName);
+     UserInputValidationUtil.ValidateSamplingKey(columnInformation?.SamplingKeyColumnName, columnInformation?.GroupIdColumnName, _task);
+     var splitResult = SplitUtil.CrossValSplit(Context, trainData, numberOfCVFolds, columnInformation?.GroupIdColumnName);
Potential Bug: Should we add the new ValidateSamplingKey() method also to the other Execute overloads that use columnInformation.SamplingKeyColumnName to split data? For example, this overload:

machinelearning/src/Microsoft.ML.AutoML/API/ExperimentBase.cs, lines 96 to 116 in e8fa731:
public ExperimentResult<TMetrics> Execute(IDataView trainData, ColumnInformation columnInformation,
    IEstimator<ITransformer> preFeaturizer = null, IProgress<RunDetail<TMetrics>> progressHandler = null)
{
    // Cross val threshold for # of dataset rows --
    // If dataset has < threshold # of rows, use cross val.
    // Else, run experiment using train-validate split.
    const int crossValRowCountThreshold = 15000;
    var rowCount = DatasetDimensionsUtil.CountRows(trainData, crossValRowCountThreshold);
    if (rowCount < crossValRowCountThreshold)
    {
        const int numCrossValFolds = 10;
        var splitResult = SplitUtil.CrossValSplit(Context, trainData, numCrossValFolds, columnInformation?.SamplingKeyColumnName);
        return ExecuteCrossValSummary(splitResult.trainDatasets, columnInformation, splitResult.validationDatasets, preFeaturizer, progressHandler);
    }
    else
    {
        var splitResult = SplitUtil.TrainValidateSplit(Context, trainData, columnInformation?.SamplingKeyColumnName);
        return ExecuteTrainValidate(splitResult.trainData, columnInformation, splitResult.validationData, preFeaturizer, progressHandler);
    }
}
That overload uses CrossValSplit, with SamplingKeyColumnName, if the data is over 15000 rows. But notice that the overload also splits the data using TrainValidateSplit when it is under that number of rows, and TrainValidateSplit also uses the samplingKeyColumn. The samplingKeyColumn for the train-validate split should also be the GroupId column when it's done for ranking, for the same reasons we've already discussed offline. I.e., in both cases we need SamplingKeyColumnName to be the same as GroupIdColumnName, thus needing to call ValidateSamplingKey() there too.
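For readers following the thread: the new helper presumably enforces something like the following. This is a minimal sketch inferred from the call site in the diff above, not the PR's actual implementation; the TaskKind check and the message text are assumptions.

// Sketch: for ranking, a user-supplied sampling key must name the same column
// as the group id, so splits never scatter one query group across folds.
internal static void ValidateSamplingKey(string samplingKeyColumnName, string groupIdColumnName, TaskKind task)
{
    if (task != TaskKind.Ranking || samplingKeyColumnName == null)
        return; // Other tasks may use any sampling key; null falls back to the group id column.

    var effectiveGroupId = groupIdColumnName ?? DefaultColumnNames.GroupId;
    if (samplingKeyColumnName != effectiveGroupId)
        throw new ArgumentException(
            $"For ranking, SamplingKeyColumnName ('{samplingKeyColumnName}') must match " +
            $"GroupIdColumnName ('{effectiveGroupId}').");
}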
(Unresolving.) I see you've updated the other Execute overloads to address this comment, but I think you missed one overload, the one at:
https://github.com/antoniovs1029/machinelearning/blob/bb13d629000c218136e741b643767cf45ae12fc4/src/Microsoft.ML.AutoML/API/ExperimentBase.cs#L221-L232
BTW, just curious: do you happen to know which Execute method ModelBuilder calls? /cc: @LittleLittleCloud
src/Microsoft.ML.AutoML/Experiment/MetricsAgents/IMetricsAgent.cs (outdated; resolved)
columnInformation = new ColumnInformation()
{
    LabelColumnName = labelColumnName,
    GroupIdColumnName = samplingKeyColumn ?? DefaultColumnNames.GroupId
};
Suggestion: As per the feedback we got from @justinormont today, I think it would be better to set both GroupIdColumnName and SamplingKeyColumnName here. Something like:
columnInformation = new ColumnInformation()
{
    LabelColumnName = labelColumnName,
    SamplingKeyColumnName = samplingKeyColumn ?? DefaultColumnNames.GroupId,
    GroupIdColumnName = samplingKeyColumn ?? DefaultColumnNames.GroupId // For ranking, we want to enforce having the same column as samplingKeyColumn and GroupIdColumn
};

With your current implementation it won't make any difference to do this, but I do think this might be clearer for future AutoML.NET developers.
A similar change would need to take place in the other overload that receives a samplingKeyColumnName but no columnInformation.
I would lean towards deferring this to the next update. I will take a quick look, but having one column mapped to two pieces of column information seems to be causing issues.
It's just a two-line change (adding the line here and in the other overload), and it's just to make it clear in the columnInformation object that we'll be using the samplingKeyColumn provided by the user both as SamplingKeyColumnName and GroupIdColumnName (which is actually what we're doing). So I think it's clearer this way. But whatever you decide is fine 😉
Just to be clear, mapping a groupId column to both SamplingKeyColumnName and GroupIdColumnName doesn't work with the current implementation. The current implementation uses GroupIdColumnName as the SamplingKeyColumnName, so if the user provides a SamplingKeyColumnName, we throw an error (unless they are both the same).
Yeah, I know the current implementation throws if they're not the same. That's why I suggested using the samplingKeyColumn to set both SamplingKeyColumnName and GroupIdColumnName.

In general, if the user provides a ColumnInformation object containing SamplingKeyColumnName and GroupIdColumnName, we should accept it when both are the same (and in the current implementation this is doable). So I'm just not sure what the problem is here.
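To illustrate the rule being described (hypothetical column names; behavior as stated above):

// Accepted: sampling key and group id refer to the same column.
var accepted = new ColumnInformation
{
    LabelColumnName = "Label",
    GroupIdColumnName = "GroupId",
    SamplingKeyColumnName = "GroupId"
};

// Rejected for ranking: a sampling key that differs from the group id
// makes validation throw.
var rejected = new ColumnInformation
{
    LabelColumnName = "Label",
    GroupIdColumnName = "GroupId",
    SamplingKeyColumnName = "UserId"
};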
Issue: Cross Validation is needed in order to integrate the AutoML Ranking Experiment with ModelBuilder.
Resolves: #2685
User Scenarios:
- If the user doesn't provide a GroupIdColumnName, the default becomes "GroupId". This is used as the SamplingKeyColumn used to split the data.
- If the user provides both a GroupIdColumnName and a SamplingKeyColumnName, both must be the same or an exception is thrown.
- Review: If the user only provides a SamplingKeyColumnName, should we throw an error? Since we use the groupId to split the CV data, the user should not be populating the SamplingKeyColumnName. As of right now, an error is only thrown when the GroupIdColumnName and the SamplingKeyColumnName differ.
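As a hedged end-to-end sketch of the scenarios above (file name, schema, time budget, and fold count are placeholders, not code from this PR):

var mlContext = new MLContext();
var trainData = mlContext.Data.LoadFromTextFile("ranking-train.tsv", new[]
{
    new TextLoader.Column("Label", DataKind.Single, 0),
    new TextLoader.Column("GroupId", DataKind.UInt32, 1),
    new TextLoader.Column("Features", DataKind.Single, 2, 9),
}, separatorChar: '\t', hasHeader: true);

// Omitting GroupIdColumnName defaults it to "GroupId", which then doubles as
// the sampling key; providing a different SamplingKeyColumnName would throw.
var experiment = mlContext.Auto().CreateRankingExperiment(maxExperimentTimeInSeconds: 60);
var cvResult = experiment.Execute(trainData, numberOfCVFolds: 5, new ColumnInformation
{
    LabelColumnName = "Label",
    GroupIdColumnName = "GroupId"
});

Console.WriteLine($"Best trainer: {cvResult.BestRun.TrainerName}");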