As laid out in #2498, we need scenarios covering the Evaluation functionality we want fully supported in V1.
- I can evaluate a model trained for any of my tasks on test data. The evaluation outputs metrics relevant to the task (e.g. AUC, accuracy, precision/recall, and F1 for binary classification).
- I can get the raw data (precision/recall pairs per threshold) that allows me to plot PR curves.
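For reference, the two scenarios above can be sketched with scikit-learn; this is only an illustration of the expected outputs (the labels, scores, and metric names here are hypothetical, not part of any existing API):

```python
import numpy as np
from sklearn.metrics import (
    roc_auc_score, accuracy_score, precision_score,
    recall_score, f1_score, precision_recall_curve,
)

# Hypothetical ground-truth labels and model scores for a binary task.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3])
y_pred = (y_score >= 0.5).astype(int)  # threshold scores into hard labels

# Scenario 1: task-relevant metrics for binary classification.
metrics = {
    "auc": roc_auc_score(y_true, y_score),
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}

# Scenario 2: raw precision/recall points for plotting a PR curve.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
```

The key design point is that the PR-curve scenario returns the underlying arrays rather than a rendered plot, so the caller can chart them however they like.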