Enabling Ranking Cross Validation #5263

Conversation
}

[LightGBMFact]
public void AutoFitRankingCVTest()
This is the way experiments are used within codegen.
Review: Should I add cross validation tests to all other experiments?
Should I add cross validation tests to all other experiments?

If I recall correctly, if your dataset has fewer than 15000 rows of data, AutoML will run cross validation automatically; if you have more than 15000 rows, it will use a train-test split instead. So the rest of the tests in AutoFitTests should all be CV runs, considering that the dataset they use is really small. (@justinormont correct me if I'm wrong)

Tests starting with AutoFit should test the AutoML ranking experiment API, so you shouldn't have to create your pipeline from scratch in this test. If you just want to test the Ranking.CrossValidation command, consider renaming it to something more specific.
It's the other way around. If it has less than 15000 rows, it runs a train-test split automatically on one of the Execute overloads; if it has more, it runs CV. This only happens on one overload, but I believe Keren isn't using that overload in her tests.
I have added CV testing for ranking only. I think it would be good to add testing for the other tasks as well in the future.
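For reference, such a ranking CV test might look roughly like the following. This is a minimal sketch, not the test added in this PR: the dataset path, schema, time budget, fold count, and assertion are placeholder assumptions (usings for Microsoft.ML, Microsoft.ML.AutoML, Microsoft.ML.Data, System.Linq, and Xunit are assumed).

[LightGBMFact]
public void AutoFitRankingCVTest()
{
    // Sketch only: assumes a small ranking dataset with Label, GroupId, and feature columns.
    var context = new MLContext(seed: 0);
    var data = context.Data.LoadFromTextFile("ranking-sample.tsv", new[]
    {
        new TextLoader.Column("Label", DataKind.Single, 0),
        new TextLoader.Column("GroupId", DataKind.UInt32, 1),
        new TextLoader.Column("Features", DataKind.Single, 2, 9),
    }, separatorChar: '\t', hasHeader: true);

    var experiment = context.Auto().CreateRankingExperiment(maxExperimentTimeInSeconds: 5);
    var result = experiment.Execute(data, numberOfCVFolds: 5, new ColumnInformation
    {
        LabelColumnName = "Label",
        GroupIdColumnName = "GroupId" // also used as the sampling key when splitting folds
    });

    // Every fold of the best run should yield valid NDCG metrics.
    Assert.All(result.BestRun.Results,
        r => Assert.True(r.ValidationMetrics.NormalizedDiscountedCumulativeGains.Last() > 0));
}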
Codecov Report

@@            Coverage Diff             @@
##           master    #5263      +/-   ##
==========================================
+ Coverage   73.68%   73.70%    +0.01%
==========================================
  Files        1022     1022
  Lines      190320   190490      +170
  Branches    20470    20484       +14
==========================================
+ Hits       140238   140393      +155
- Misses      44553    44560        +7
- Partials     5529     5537        +8
      IProgress<CrossValidationRunDetail<TMetrics>> progressHandler = null)
  {
      UserInputValidationUtil.ValidateNumberOfCVFoldsArg(numberOfCVFolds);
-     var splitResult = SplitUtil.CrossValSplit(Context, trainData, numberOfCVFolds, columnInformation?.SamplingKeyColumnName);
+     UserInputValidationUtil.ValidateSamplingKey(columnInformation?.SamplingKeyColumnName, columnInformation?.GroupIdColumnName, _task);
+     var splitResult = SplitUtil.CrossValSplit(Context, trainData, numberOfCVFolds, columnInformation?.GroupIdColumnName);
Potential Bug: Should we add the new ValidateSamplingKey() method also to the other Execute overloads that use columnInformation.SamplingKeyColumnName to split data? For example, this overload:

machinelearning/src/Microsoft.ML.AutoML/API/ExperimentBase.cs, lines 96 to 116 in e8fa731:
public ExperimentResult<TMetrics> Execute(IDataView trainData, ColumnInformation columnInformation,
    IEstimator<ITransformer> preFeaturizer = null, IProgress<RunDetail<TMetrics>> progressHandler = null)
{
    // Cross val threshold for # of dataset rows --
    // If dataset has < threshold # of rows, use cross val.
    // Else, run experiment using train-validate split.
    const int crossValRowCountThreshold = 15000;
    var rowCount = DatasetDimensionsUtil.CountRows(trainData, crossValRowCountThreshold);
    if (rowCount < crossValRowCountThreshold)
    {
        const int numCrossValFolds = 10;
        var splitResult = SplitUtil.CrossValSplit(Context, trainData, numCrossValFolds, columnInformation?.SamplingKeyColumnName);
        return ExecuteCrossValSummary(splitResult.trainDatasets, columnInformation, splitResult.validationDatasets, preFeaturizer, progressHandler);
    }
    else
    {
        var splitResult = SplitUtil.TrainValidateSplit(Context, trainData, columnInformation?.SamplingKeyColumnName);
        return ExecuteTrainValidate(splitResult.trainData, columnInformation, splitResult.validationData, preFeaturizer, progressHandler);
    }
}
That overload uses CrossValSplit, with SamplingKeyColumnName, if the data is over 15000 rows. But notice that the overload also splits the data using TrainValidateSplit when it is under that number of rows, and TrainValidateSplit also uses the samplingKeyColumn. The samplingKeyColumn for the train-validate split should also be the GroupId column when it's done for ranking, for the same reasons we've already discussed offline. I.e., in both cases we need SamplingKeyColumnName to be the same as GroupIdColumnName, thus needing to call ValidateSamplingKey() there too.
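For readers following the thread: the new helper presumably enforces something like the following. This is a minimal sketch inferred from the call site in the diff above, not the PR's actual implementation; the TaskKind check and the message text are assumptions.

// Sketch: for ranking, a user-supplied sampling key must name the same column
// as the group id, so splits never scatter one query group across folds.
internal static void ValidateSamplingKey(string samplingKeyColumnName, string groupIdColumnName, TaskKind task)
{
    if (task != TaskKind.Ranking || samplingKeyColumnName == null)
        return; // Other tasks may use any sampling key; null falls back to the group id column.

    var effectiveGroupId = groupIdColumnName ?? DefaultColumnNames.GroupId;
    if (samplingKeyColumnName != effectiveGroupId)
        throw new ArgumentException(
            $"For ranking, SamplingKeyColumnName ('{samplingKeyColumnName}') must match " +
            $"GroupIdColumnName ('{effectiveGroupId}').");
}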
(Unresolving.) I see you've updated the other Execute overloads to address this comment, but I think you missed one overload, the one at:
https://github.com/antoniovs1029/machinelearning/blob/bb13d629000c218136e741b643767cf45ae12fc4/src/Microsoft.ML.AutoML/API/ExperimentBase.cs#L221-L232
BTW, just curious: do you happen to know which Execute method ModelBuilder calls? /cc: @LittleLittleCloud
src/Microsoft.ML.AutoML/Experiment/MetricsAgents/IMetricsAgent.cs (outdated; resolved)
columnInformation = new ColumnInformation()
{
    LabelColumnName = labelColumnName,
    GroupIdColumnName = samplingKeyColumn ?? DefaultColumnNames.GroupId
};
Suggestion: As per the feedback we got from @justinormont today, I think it would be better to set both GroupIdColumnName and SamplingKeyColumnName here. Something like:
columnInformation = new ColumnInformation()
{
    LabelColumnName = labelColumnName,
    SamplingKeyColumnName = samplingKeyColumn ?? DefaultColumnNames.GroupId,
    GroupIdColumnName = samplingKeyColumn ?? DefaultColumnNames.GroupId // For ranking, we want to enforce having the same column as samplingKeyColumn and GroupIdColumn
};

With your current implementation it won't make any difference to do this, but I do think this might be clearer for future AutoML.NET developers.
A similar change would need to take place in the other overload that receives a samplingKeyColumnName but no columnInformation.
I would lean towards deferring this to the next update. I will take a quick look, but having one column mapped to two pieces of column information seems to be causing issues.
It's just a two-line change (adding the line here and in the other overload), and it's just to make it clear in the columnInformation object that we'll be using the samplingKeyColumn provided by the user both as SamplingKeyColumnName and GroupIdColumnName (which is actually what we're doing). So I think it's clearer this way. But whatever you decide is fine 😉
Just to be clear, mapping a groupId column to both SamplingKeyColumnName and GroupIdColumnName doesn't work with the current implementation. The current implementation uses GroupIdColumnName as the SamplingKeyColumnName, so if the user provides a SamplingKeyColumnName, we throw an error (unless they are both the same).
Yeah, I know the current implementation throws if they're not the same. That's why I suggested using the samplingKeyColumn to set both SamplingKeyColumnName and GroupIdColumnName.

In general, if the user provides a ColumnInformation object containing SamplingKeyColumnName and GroupIdColumnName, we should accept it when both are the same (and in the current implementation this is doable). So I'm just not sure what the problem is here.
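To illustrate the rule being described (hypothetical column names; behavior as stated above):

// Accepted: sampling key and group id refer to the same column.
var accepted = new ColumnInformation
{
    LabelColumnName = "Label",
    GroupIdColumnName = "GroupId",
    SamplingKeyColumnName = "GroupId"
};

// Rejected for ranking: a sampling key that differs from the group id
// makes validation throw.
var rejected = new ColumnInformation
{
    LabelColumnName = "Label",
    GroupIdColumnName = "GroupId",
    SamplingKeyColumnName = "UserId"
};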
Issue: Cross Validation is needed in order to integrate the AutoML Ranking Experiment with ModelBuilder.
Resolves: #2685
User Scenarios:
- If the user doesn't provide a GroupIdColumnName, the default becomes "GroupId". This is used as the SamplingKeyColumn used to split the data.
- If the user provides both a GroupIdColumnName and a SamplingKeyColumnName, both must be the same or an exception is thrown.
- Review: If the user only provides a SamplingKeyColumnName, should we throw an error? Since we use the groupId to split the CV data, the user should not be populating the SamplingKeyColumnName. As of right now, an error is only thrown when the GroupIdColumnName and the SamplingKeyColumnName differ.
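As a hedged end-to-end sketch of the scenarios above (file name, schema, time budget, and fold count are placeholders, not code from this PR):

var mlContext = new MLContext();
var trainData = mlContext.Data.LoadFromTextFile("ranking-train.tsv", new[]
{
    new TextLoader.Column("Label", DataKind.Single, 0),
    new TextLoader.Column("GroupId", DataKind.UInt32, 1),
    new TextLoader.Column("Features", DataKind.Single, 2, 9),
}, separatorChar: '\t', hasHeader: true);

// Omitting GroupIdColumnName defaults it to "GroupId", which then doubles as
// the sampling key; providing a different SamplingKeyColumnName would throw.
var experiment = mlContext.Auto().CreateRankingExperiment(maxExperimentTimeInSeconds: 60);
var cvResult = experiment.Execute(trainData, numberOfCVFolds: 5, new ColumnInformation
{
    LabelColumnName = "Label",
    GroupIdColumnName = "GroupId"
});

Console.WriteLine($"Best trainer: {cvResult.BestRun.TrainerName}");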