
Conversation

Contributor

@kere-nel kere-nel commented Jun 26, 2020

  • Enabled Ranking Cross Validation
  • Fixed some syntax errors in the ranking CodeGen generated code

Issue: Cross Validation is needed in order to integrate the AutoML Ranking Experiment with ModelBuilder.
Resolves: #2685

User Scenarios:

  • If the user doesn't provide a GroupIdColumnName, the default becomes "GroupId". This is used as the SamplingKeyColumn used to split the data.

  • If the user provides both a GroupIdColumnName and a SamplingKeyColumnName, both must be the same or an exception is thrown.

Review: If the user only provides a SamplingKeyColumnName, should we throw an error? Since we use the GroupId to split the CV data, the user should not be populating the SamplingKeyColumnName. As of right now, an error is only thrown when the GroupIdColumnName and the SamplingKeyColumnName differ.
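
For reference, a minimal sketch of the rule described above. This is a hypothetical helper, not the actual ML.NET code; the real check added in this PR is UserInputValidationUtil.ValidateSamplingKey():

// Hypothetical sketch only (not the actual ML.NET implementation): for a ranking
// experiment, the group id column is used as the sampling key when splitting the
// data, so an explicit SamplingKeyColumnName is only accepted when it matches it.
private static void EnsureSamplingKeyMatchesGroupId(string samplingKeyColumnName, string groupIdColumnName)
{
    // Default the group id column name when the user did not provide one.
    groupIdColumnName = groupIdColumnName ?? "GroupId";

    if (samplingKeyColumnName != null && samplingKeyColumnName != groupIdColumnName)
    {
        throw new System.ArgumentException(
            "For ranking, SamplingKeyColumnName must match GroupIdColumnName, " +
            "because the group id column is used to split the data.");
    }
}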

@kere-nel kere-nel requested review from a team as code owners June 26, 2020 22:34
}

[LightGBMFact]
public void AutoFitRankingCVTest()
Contributor Author

@kere-nel kere-nel Jun 26, 2020

This is the way experiments are used within codegen.
Review: Should I add cross validation tests to all other experiments?

Member

> Should I add cross validation tests to all other experiments?

If I recall correctly, if your dataset has less than 15000 rows of data, AutoML will run cross validation automatically; if you have more than 15000 rows, it will use a train-test split instead. So the rest of the tests in AutoFitTests should all be CV runs, considering that the dataset they use is really small. (@justinormont correct me if I'm wrong)

Tests that start with AutoFit should test the AutoML ranking experiment API, so you shouldn't have to create your pipeline from scratch in this test. If you just want to test the Ranking.CrossValidation command, consider renaming it more specifically.

Contributor

@antoniovs1029 antoniovs1029 Jul 10, 2020

It's the other way around. If it has less than 15000 it runs a train-test split automatically on one of the Execute overloads; if it has more, it runs CV. This only happens in one overload, but I believe Keren isn't using that overload in her tests.


Contributor Author

I have added CV testing for ranking only. I think it would be good to add testing for the other tasks as well in the future.

codecov bot commented Jun 26, 2020

Codecov Report

Merging #5263 into master will increase coverage by 0.01%.
The diff coverage is 89.07%.

@@            Coverage Diff             @@
##           master    #5263      +/-   ##
==========================================
+ Coverage   73.68%   73.70%   +0.01%     
==========================================
  Files        1022     1022              
  Lines      190320   190490     +170     
  Branches    20470    20484      +14     
==========================================
+ Hits       140238   140393     +155     
- Misses      44553    44560       +7     
- Partials     5529     5537       +8     
| Flag | Coverage Δ |
|------|------------|
| #Debug | 73.70% <89.07%> (+0.01%) ⬆️ |
| #production | 69.43% <72.97%> (+<0.01%) ⬆️ |
| #test | 87.64% <100.00%> (+0.02%) ⬆️ |

| Impacted Files | Coverage Δ |
|----------------|------------|
| src/Microsoft.ML.AutoML/API/ColumnInference.cs | 100.00% <ø> (ø) |
| ...oML/Experiment/MetricsAgents/BinaryMetricsAgent.cs | 74.35% <ø> (ø) |
| ...toML/Experiment/MetricsAgents/MultiMetricsAgent.cs | 69.69% <ø> (ø) |
| ...Experiment/MetricsAgents/RegressionMetricsAgent.cs | 66.66% <ø> (ø) |
| src/Microsoft.ML.AutoML/Utils/BestResultUtil.cs | 53.84% <0.00%> (ø) |
| ...soft.ML.Data/DataLoadSave/DataOperationsCatalog.cs | 85.50% <ø> (ø) |
| ...c/Microsoft.ML.Data/Evaluators/RankingEvaluator.cs | 78.55% <ø> (ø) |
| src/Microsoft.ML.AutoML/API/RankingExperiment.cs | 67.74% <33.33%> (+7.13%) ⬆️ |
| ...crosoft.ML.AutoML/Utils/UserInputValidationUtil.cs | 90.50% <40.00%> (-1.46%) ⬇️ |
| src/Microsoft.ML.AutoML/API/ExperimentBase.cs | 76.02% <65.85%> (+9.36%) ⬆️ |
| ... and 56 more | |

Comment on lines 194 to 198
      IProgress<CrossValidationRunDetail<TMetrics>> progressHandler = null)
  {
      UserInputValidationUtil.ValidateNumberOfCVFoldsArg(numberOfCVFolds);
-     var splitResult = SplitUtil.CrossValSplit(Context, trainData, numberOfCVFolds, columnInformation?.SamplingKeyColumnName);
+     UserInputValidationUtil.ValidateSamplingKey(columnInformation?.SamplingKeyColumnName, columnInformation?.GroupIdColumnName, _task);
+     var splitResult = SplitUtil.CrossValSplit(Context, trainData, numberOfCVFolds, columnInformation?.GroupIdColumnName);
Contributor

@antoniovs1029 antoniovs1029 Jul 7, 2020

Potential Bug: Should we add the new ValidateSamplingKey() method also to the other Execute overloads that use columnInformation.SamplingKeyColumn to split data?

For example this overload:

public ExperimentResult<TMetrics> Execute(IDataView trainData, ColumnInformation columnInformation,
    IEstimator<ITransformer> preFeaturizer = null, IProgress<RunDetail<TMetrics>> progressHandler = null)
{
    // Cross val threshold for # of dataset rows --
    // If dataset has < threshold # of rows, use cross val.
    // Else, run experiment using train-validate split.
    const int crossValRowCountThreshold = 15000;
    var rowCount = DatasetDimensionsUtil.CountRows(trainData, crossValRowCountThreshold);
    if (rowCount < crossValRowCountThreshold)
    {
        const int numCrossValFolds = 10;
        var splitResult = SplitUtil.CrossValSplit(Context, trainData, numCrossValFolds, columnInformation?.SamplingKeyColumnName);
        return ExecuteCrossValSummary(splitResult.trainDatasets, columnInformation, splitResult.validationDatasets, preFeaturizer, progressHandler);
    }
    else
    {
        var splitResult = SplitUtil.TrainValidateSplit(Context, trainData, columnInformation?.SamplingKeyColumnName);
        return ExecuteTrainValidate(splitResult.trainData, columnInformation, splitResult.validationData, preFeaturizer, progressHandler);
    }
}

That overload uses CVSplit if the data is over 15000 rows, using SamplingKeyColumnName. But notice that overload also splits the data using TrainTestSplit if it is under that number of rows, and TrainTestSplit also uses samplingKeyColumn; the samplingKeyColumn for the train-test split should also be the GroupId column when it's done for ranking, for the same reasons we've already discussed offline. I.e., in both cases we need SamplingKeyColumnName to be the same as GroupIdColumnName, and thus we need to call ValidateSamplingKey() there too.
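
For illustration, a rough sketch of what that overload could look like with the validation added and the group id used as the sampling key. This is only a sketch of the suggestion, not the merged code; in particular, falling back to SamplingKeyColumnName for non-ranking tasks (where GroupIdColumnName is null) is an assumption:

public ExperimentResult<TMetrics> Execute(IDataView trainData, ColumnInformation columnInformation,
    IEstimator<ITransformer> preFeaturizer = null, IProgress<RunDetail<TMetrics>> progressHandler = null)
{
    // Reject a user-supplied sampling key that differs from the group id column (relevant for ranking).
    UserInputValidationUtil.ValidateSamplingKey(columnInformation?.SamplingKeyColumnName, columnInformation?.GroupIdColumnName, _task);

    // Assumption in this sketch: for non-ranking tasks GroupIdColumnName is null,
    // so the plain sampling key is used as before.
    var samplingKey = columnInformation?.GroupIdColumnName ?? columnInformation?.SamplingKeyColumnName;

    const int crossValRowCountThreshold = 15000;
    var rowCount = DatasetDimensionsUtil.CountRows(trainData, crossValRowCountThreshold);
    if (rowCount < crossValRowCountThreshold)
    {
        const int numCrossValFolds = 10;
        var splitResult = SplitUtil.CrossValSplit(Context, trainData, numCrossValFolds, samplingKey);
        return ExecuteCrossValSummary(splitResult.trainDatasets, columnInformation, splitResult.validationDatasets, preFeaturizer, progressHandler);
    }
    else
    {
        var splitResult = SplitUtil.TrainValidateSplit(Context, trainData, samplingKey);
        return ExecuteTrainValidate(splitResult.trainData, columnInformation, splitResult.validationData, preFeaturizer, progressHandler);
    }
}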

Contributor

(unresolving) I see you've updated other Execute overloads addressing this comment, but I think you missed one overload? The one at:
https://github.com/antoniovs1029/machinelearning/blob/bb13d629000c218136e741b643767cf45ae12fc4/src/Microsoft.ML.AutoML/API/ExperimentBase.cs#L221-L232


Contributor

BTW, just curious, do you happen to know which Execute method ModelBuilder calls? /cc: @LittleLittleCloud


Comment on lines 73 to 77
columnInformation = new ColumnInformation()
{
    LabelColumnName = labelColumnName,
    GroupIdColumnName = samplingKeyColumn ?? DefaultColumnNames.GroupId
};
Contributor

Suggestion: As per the feedback we got from @justinormont today, I think it would be better to set both GroupIdColumnName and SamplingKeyColumnName in here. Something like:

columnInformation = new ColumnInformation()
{
    LabelColumnName = labelColumnName,
    SamplingKeyColumnName = samplingKeyColumn ?? DefaultColumnNames.GroupId,
    GroupIdColumnName = samplingKeyColumn ?? DefaultColumnNames.GroupId // For ranking, we want to enforce having the same column as SamplingKeyColumn and GroupIdColumn
};

With your current implementation it won't make any difference to do this, but I do think this might be clearer for future AutoML.NET developers.

A similar change would need to take place in the other overload that receives a samplingKeyColumnName but no columnInformation.
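
For context, a hedged user-level sketch of what making both assignments explicit looks like from the caller's side. The API names used here (CreateRankingExperiment, ColumnInformation.GroupIdColumnName) come from the AutoML.NET ranking experiment work this PR belongs to; the RankingData class, column layout, and file name are made up for illustration:

using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

// Hypothetical input schema, for illustration only.
public class RankingData
{
    [LoadColumn(0)] public float Label { get; set; }
    [LoadColumn(1)] public string GroupId { get; set; }
    [LoadColumn(2)] public float Feature { get; set; }
}

public static class RankingAutoMLSample
{
    public static void Run()
    {
        var mlContext = new MLContext();
        IDataView trainData = mlContext.Data.LoadFromTextFile<RankingData>("train.tsv", hasHeader: true);

        // Set both names explicitly so the ColumnInformation object records in one place
        // that, for ranking, the group id column is also the sampling key used to split the data.
        var columnInformation = new ColumnInformation
        {
            LabelColumnName = "Label",
            GroupIdColumnName = "GroupId",
            SamplingKeyColumnName = "GroupId"
        };

        var experiment = mlContext.Auto().CreateRankingExperiment(60);
        var result = experiment.Execute(trainData, columnInformation);
    }
}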

Contributor Author

I would lean towards deferring this to the next update. I will take a quick look, but assigning one column to two properties in the column information seems to be causing issues.

Contributor

It's just a two-line change (adding the line here and in the other overload), and it's just to make it clear in the columnInformation object that we'll be using the samplingKeyColumn provided by the user both as SamplingKeyColumnName and GroupIdColumnName (which is actually what we're doing). So I think it's clearer this way. But whatever you decide is fine 😉


Contributor Author

Just to be clear, mapping a groupId column to both SamplingKeyColumnName and GroupIdColumnName doesn't work with the current implementation. The current implementation uses GroupIdColumnName as the SamplingKeyColumnName, so if the user provides a SamplingKeyColumnName, we throw an error (unless they are both the same).

Contributor

Yeah, I know the current implementation throws if they're not the same. That's why I suggested using the samplingKeyColumn to set both SamplingKeyColumnName and GroupIdColumnName.

In general, if the user provides a ColumnInformation object containing SamplingKeyColumnName and GroupIdColumnName, then we should accept it if both are the same (and in the current implementation this is doable). So I'm just not sure what the problem is here.


@kere-nel kere-nel merged commit f87a3bb into dotnet:master Jul 10, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Mar 18, 2022
