Open · Labels: P2 (priority: needs to be fixed at some point), enhancement (new feature or request)
Description
From experimenting, it seems that calculating binary classification metrics does not scale to huge datasets. Taking a heap dump to examine the high memory usage (before the program runs out of memory), I see a list of floats held by UnweightedAucAggregator. It appears that, to calculate AUC, every prediction is kept in memory.

There is already substantial logic to handle this scenario: predictions can be reservoir sampled, and AUC is then calculated on the sample. However, the internal parameter MaxAucExamples that controls the size of this reservoir is always set to -1 and is not exposed to the end user:
`public int MaxAucExamples = -1;`
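For context, here is a minimal sketch of why capping the reservoir bounds memory: with reservoir sampling (Algorithm R), at most a fixed number of scores are kept no matter how many predictions stream in, and each incoming score replaces a random slot with probability k/n so the sample stays uniform. This is only an illustration of the idea behind MaxAucExamples, not ML.NET's actual aggregator code; the class and member names are made up.

```csharp
using System;
using System.Collections.Generic;

// Sketch of reservoir sampling: keep at most maxExamples scores in memory,
// regardless of how many predictions are streamed in.
class ScoreReservoir
{
    private readonly int _maxExamples;
    private readonly List<float> _sample = new List<float>();
    private readonly Random _rng = new Random(42);
    private long _seen;

    public ScoreReservoir(int maxExamples) => _maxExamples = maxExamples;

    public void Add(float score)
    {
        _seen++;
        if (_sample.Count < _maxExamples)
        {
            _sample.Add(score);              // reservoir not full yet: always keep
        }
        else
        {
            long j = (long)(_rng.NextDouble() * _seen);
            if (j < _maxExamples)
                _sample[(int)j] = score;     // replace a random slot with prob. maxExamples/seen
        }
    }

    public IReadOnlyList<float> Sample => _sample;
}
```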
Perhaps we should expose this parameter to enable binary metric calculation on huge datasets, or set it to some reasonable default.
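If the reservoir were capped at a reasonable default, AUC could still be computed from the sampled (score, label) pairs. A minimal sketch of that computation using the rank-sum (Mann-Whitney U) form is below; it is not ML.NET's evaluator code, and ties are handled only approximately for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class AucOnSample
{
    // AUC via the rank-sum (Mann-Whitney U) statistic over a bounded sample of
    // (score, label) pairs, e.g. the contents of a reservoir like the one above.
    // Ties get arbitrary ranks here; average ranks would make this exact.
    public static double Compute(IReadOnlyList<(float Score, bool Label)> sample)
    {
        var ordered = sample.OrderBy(p => p.Score).ToArray();
        long nPos = 0, nNeg = 0;
        double rankSumPos = 0;
        for (int i = 0; i < ordered.Length; i++)
        {
            if (ordered[i].Label) { nPos++; rankSumPos += i + 1; } // 1-based rank
            else nNeg++;
        }
        if (nPos == 0 || nNeg == 0) return double.NaN;
        return (rankSumPos - nPos * (nPos + 1) / 2.0) / ((double)nPos * nNeg);
    }
}
```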
@justinormont, @vinodshanbhag