diff --git a/notebooks/overfit_wrap_up_ex_00.ipynb b/notebooks/overfit_wrap_up_ex_00.ipynb new file mode 100644 index 000000000..9ba2347e5 --- /dev/null +++ b/notebooks/overfit_wrap_up_ex_00.ipynb @@ -0,0 +1,308 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# \ud83c\udfc1 Wrap-up quiz 2\n", + "\n", + "This notebook contains the guided project to answer the hands-on questions\n", + "corresponding to the module \"Selecting the best model\" of the Associate\n", + "Practitioner Course. In this test **we do not have access to your code**. Only\n", + "its output is evaluated by using the multiple choice questions, to be\n", + "answered in the dedicated User Interface.\n", + "\n", + "First run the following cell to initialize jupyterlite. Notice that only basic\n", + "libraries are available, such as pandas, matplotlib, seaborn and numpy.\n", + "Remember that the initial import of libraries can take longer than usual, it\n", + "may take around 10-20 seconds for the following cell to run. Please be\n", + "patient." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install seaborn==0.13.2\n", + "import matplotlib\n", + "import numpy\n", + "import pandas\n", + "import seaborn\n", + "import sklearn" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Load the `blood_transfusion.csv` dataset with the following cell of code. The\n", + "column \"Class\" contains the target variable." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "blood_transfusion = pd.read_csv(\"../datasets/blood_transfusion.csv\")\n", + "target_name = \"Class\"\n", + "data = blood_transfusion.drop(columns=target_name)\n", + "target = blood_transfusion[target_name]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Select the correct answers from the following proposals.\n", + "\n", + "- a) The problem to be solved is a regression problem\n", + "- b) The problem to be solved is a binary classification problem (exactly 2\n", + " possible classes)\n", + "- c) The problem to be solved is a multiclass classification problem (more\n", + " than 2 possible classes)\n", + "- d) The proportions of the class counts are imbalanced: some classes have\n", + " more than twice as many rows as others\n", + "\n", + "_Select all answers that apply_\n", + "\n", + "Hint: `target.unique()` and `target.value_counts()` are helpful methods to\n", + "answer this question." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code here." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Using a\n", + "[`sklearn.dummy.DummyClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html)\n", + "and the strategy `\"most_frequent\"`, what is the average of the accuracy scores\n", + "obtained by performing a 10-fold cross-validation?\n", + "\n", + "- a) ~25%\n", + "- b) ~50%\n", + "- c) ~75%\n", + "\n", + "_Select a single answer_\n", + "\n", + "Hint: You can check the documentation of\n", + "[`sklearn.model_selection.cross_val_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)\n", + "and\n", + "[`sklearn.model_selection.cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code here." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Repeat the previous experiment but compute the balanced accuracy instead of\n", + "the accuracy score. Pass `scoring=\"balanced_accuracy\"` when calling\n", + "`cross_validate` or `cross_val_score` functions, the mean score is:\n", + "\n", + "- a) ~25%\n", + "- b) ~50%\n", + "- c) ~75%\n", + "\n", + "_Select a single answer_" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code here." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will use a `sklearn.neighbors.KNeighborsClassifier` for the remainder of this quiz.\n", + "\n", + "Why is it relevant to add a preprocessing step to scale the data using a\n", + "`StandardScaler` when working with a `KNeighborsClassifier`?\n", + "\n", + "- a) faster to compute the list of neighbors on scaled data\n", + "- b) k-nearest neighbors is based on computing some distances. 
Features need\n", + " to be normalized to contribute approximately equally to the distance\n", + " computation.\n", + "- c) This is irrelevant. One could use k-nearest neighbors without normalizing\n", + " the dataset and get a very similar cross-validation score.\n", + "\n", + "_Select a single answer_" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create a scikit-learn pipeline (using\n", + "`sklearn.pipeline.make_pipeline`) where a `StandardScaler` will be used to scale\n", + "the data followed by a `KNeighborsClassifier`. Use the default hyperparameters.\n", + "\n", + "Inspect the parameters of the created pipeline. What is the value of K, the\n", + "number of neighbors considered when predicting with the k-nearest neighbors?\n", + "\n", + "- a) 1\n", + "- b) 3\n", + "- c) 5\n", + "- d) 8\n", + "- e) 10\n", + "\n", + "_Select a single answer_\n", + "\n", + "Hint: You can use `model.get_params()` to get the parameters of a scikit-learn\n", + "estimator." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code here." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Set `n_neighbors=1` in the previous model and evaluate it using a 10-fold\n", + "cross-validation. Use the balanced accuracy as a score. What can you say about\n", + "this model? Compare the average of the train and test scores to support your\n", + "answer.\n", + "\n", + "- a) The model underfits\n", + "- b) The model generalizes\n", + "- c) The model overfits\n", + "\n", + "_Select a single answer_\n", + "\n", + "Hint: compute the average test score and the average train score and compare\n", + "them. Make sure to pass `return_train_score=True` to the `cross_validate`\n", + "function to also compute the train score." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code here." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We now study the effect of the parameter `n_neighbors` on the train and test\n", + "scores using a validation curve. You can use the following parameter range:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "\n", + "param_range = np.array([1, 2, 5, 10, 20, 50, 100, 200, 500])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Also, use a 5-fold cross-validation and compute the balanced accuracy score\n", + "instead of the default accuracy score (check the `scoring` parameter). Finally,\n", + "plot the average train and test scores for the different values of the\n", + "hyperparameter. Remember that the name of the parameter can be found using\n", + "`model.get_params()`.\n", + "\n", + "Select the true affirmation among those stated below:\n", + "\n", + "- a) The model underfits for a range of `n_neighbors` values between 1 and 10\n", + "- b) The model underfits for a range of `n_neighbors` values between 10 and 100\n", + "- c) The model underfits for a range of `n_neighbors` values between 100 and 500\n", + "\n", + "_Select a single answer_" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code here." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Select the most correct of the affirmations stated below:\n", + "\n", + "- a) The model overfits for a range of `n_neighbors` values between 1 and 10\n", + "- b) The model overfits for a range of `n_neighbors` values between 10 and 100\n", + "- c) The model overfits for a range of `n_neighbors` values between 100 and 500\n", + "\n", + "_Select a single answer_" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Select the most correct of the affirmations stated below:\n", + "#\n", + "# - a) The model best generalizes for a range of `n_neighbors` values between 1 and 10\n", + "# - b) The model best generalizes for a range of `n_neighbors` values between 10 and 100\n", + "# - c) The model best generalizes for a range of `n_neighbors` values between 100 and 500\n", + "#\n", + "# _Select a single answer_" + ] + } + ], + "metadata": { + "jupytext": { + "main_language": "python" + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/notebooks/pipeline_wrap_up_ex_00.ipynb b/notebooks/pipeline_wrap_up_ex_00.ipynb new file mode 100644 index 000000000..ae2b129e2 --- /dev/null +++ b/notebooks/pipeline_wrap_up_ex_00.ipynb @@ -0,0 +1,282 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# \ud83c\udfc1 Wrap-up quiz 1\n", + "\n", + "This notebook contains the guided project to answer the hands-on questions\n", + "corresponding to the module \"The predictive modeling pipeline\" of the\n", + "Associate Practitioner Course. In this test **we do not have access to your\n", + "code**. Only its output is evaluated by using the multiple choice questions,\n", + "to be answered in the dedicated User Interface.\n", + "\n", + "First run the following cell to initialize jupyterlite. 
Notice that only basic\n", + "libraries are available, such as pandas, matplotlib, seaborn and numpy.\n", + "Remember that the initial import of libraries can take longer than usual, it\n", + "may take around 10-20 seconds for the following cell to run. Please be\n", + "patient." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install seaborn==0.13.2\n", + "import matplotlib\n", + "import numpy\n", + "import pandas\n", + "import seaborn\n", + "import sklearn" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Load the `ames_housing_no_missing.csv` dataset with the following cell of code.\n", + "\n", + "The target is the \"SalePrice\" column. As we have not encountered any\n", + "regression problem yet, we convert the regression target into a classification\n", + "target, where the goal is to predict whether or not the sale price of a house\n", + "is greater than $200,000." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "ames_housing = pd.read_csv(\"../datasets/ames_housing_no_missing.csv\")\n", + "\n", + "target_name = \"SalePrice\"\n", + "data, target = ames_housing.drop(columns=target_name), ames_housing[target_name]\n", + "target = (target > 200_000).astype(int)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Use the `data.info()` and `data.head()` commands to examine the columns of\n", + "the dataframe. The dataset contains:\n", + "\n", + "- a) only numerical features\n", + "- b) only categorical features\n", + "- c) both numerical and categorical features\n", + "\n", + "_Select a single answer_" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code here." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "How many features are available to predict whether or not a house is\n", + "expensive?\n", + "\n", + "- a) 79\n", + "- b) 80\n", + "- c) 81\n", + "\n", + "_Select a single answer_" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code here." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "How many features are represented with numbers?\n", + "\n", + "- a) 0\n", + "- b) 36\n", + "- c) 42\n", + "- d) 79\n", + "\n", + "_Select a single answer_\n", + "\n", + "Hint: you can use the method\n", + "[`df.select_dtypes`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html)\n", + "or the function\n", + "[`sklearn.compose.make_column_selector`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html)\n", + "as shown in a previous notebook." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code here." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Refer to the [dataset description](https://www.openml.org/d/42165) regarding\n", + "the meaning of the features.\n", + "\n", + "Among the following features, which of them express a quantitative numerical\n", + "value (excluding ordinal categories)?\n", + "\n", + "- a) \"LotFrontage\"\n", + "- b) \"LotArea\"\n", + "- c) \"OverallQual\"\n", + "- d) \"OverallCond\"\n", + "- e) \"YearBuilt\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We consider the following numerical columns:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "numerical_features = [\n", + " \"LotFrontage\", \"LotArea\", \"MasVnrArea\", \"BsmtFinSF1\", \"BsmtFinSF2\",\n", + " \"BsmtUnfSF\", \"TotalBsmtSF\", \"1stFlrSF\", \"2ndFlrSF\", \"LowQualFinSF\",\n", + " \"GrLivArea\", \"BedroomAbvGr\", \"KitchenAbvGr\", \"TotRmsAbvGrd\", \"Fireplaces\",\n", + " \"GarageCars\", \"GarageArea\", \"WoodDeckSF\", \"OpenPorchSF\", \"EnclosedPorch\",\n", + " \"3SsnPorch\", \"ScreenPorch\", \"PoolArea\", \"MiscVal\",\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now create a predictive model that uses these numerical columns as input data.\n", + "Your predictive model should be a pipeline composed of a\n", + "[`sklearn.preprocessing.StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)\n", + "to scale these numerical data and a\n", + "[`sklearn.linear_model.LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).\n", + "\n", + "What is the accuracy score obtained by 10-fold cross-validation (you can set\n", + "the parameter `cv=10` when calling `cross_validate`) of this pipeline?\n", + "\n", + "- a) ~0.5\n", + "- b) ~0.7\n", + "- c) ~0.9\n", + "\n", + "_Select a single answer_" + ] + }, + { + 
"cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code here." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Instead of solely using the numerical columns, let us build a pipeline that\n", + "can process both the numerical and categorical features together as follows:\n", + "- the `numerical_features` (as defined above) should be processed as previously\n", + " done with a `StandardScaler`;\n", + "- the left-out columns should be treated as categorical variables using a\n", + " [`sklearn.preprocessing.OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).\n", + "\n", + "To avoid any issue with rare categories that could only be present during the\n", + "prediction, you can pass the parameter `handle_unknown=\"ignore\"` to the\n", + "`OneHotEncoder`.\n", + "\n", + "What is the accuracy score obtained by 10-fold cross-validation of the\n", + "pipeline using both the numerical and categorical features?\n", + "\n", + "- a) ~0.7\n", + "- b) ~0.9\n", + "- c) ~1.0\n", + "\n", + "_Select a single answer_" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code here." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One way to compare two models is by comparing their means, but small\n", + "differences in performance measures might easily turn out to be merely by\n", + "chance (e.g. when using random resampling during cross-validation), and not\n", + "because one model predicts systematically better than the other.\n", + "\n", + "Another way is to compare cross-validation test scores of both models\n", + "fold-to-fold, i.e. counting the number of folds where one model has a better\n", + "test score than the other. 
This provides some extra information: are some\n", + "partitions of the data making the classification task particularly easy or\n", + "hard for both models?\n", + "\n", + "Let's visualize the second approach:\n", + "\n", + "![Fold-to-fold comparison](../figures/numerical_pipeline_wrap_up_quiz_comparison.png)\n", + "\n", + "Select the true statement.\n", + "\n", + "The number of folds where the model using all features performs better than the\n", + "model using only numerical features lies in the range:\n", + "\n", + "- a) [0, 3]: the model using all features is consistently worse\n", + "- b) [4, 6]: both models are almost equivalent\n", + "- c) [7, 10]: the model using all features is consistently better\n", + "\n", + "_Select a single answer_" + ] + } + ], + "metadata": { + "jupytext": { + "main_language": "python" + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/notebooks/tuning_wrap_up_ex_00.ipynb b/notebooks/tuning_wrap_up_ex_00.ipynb new file mode 100644 index 000000000..426cff3b4 --- /dev/null +++ b/notebooks/tuning_wrap_up_ex_00.ipynb @@ -0,0 +1,375 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# \ud83c\udfc1 Wrap-up quiz 3\n", + "\n", + "This notebook contains the guided project to answer the hands-on questions\n", + "corresponding to the module \"Hyperparameter tuning\" of the Associate\n", + "Practitioner Course. In this test **we do not have access to your code**. Only\n", + "its output is evaluated by using the multiple choice questions, to be\n", + "answered in the dedicated User Interface.\n", + "\n", + "First run the following cell to initialize jupyterlite. 
Notice that only basic\n", + "libraries are available, such as pandas, matplotlib, seaborn and numpy.\n", + "Remember that the initial import of libraries can take longer than usual, it\n", + "may take around 10-20 seconds for the following cell to run. Please be\n", + "patient." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install seaborn==0.13.2\n", + "import matplotlib\n", + "import numpy\n", + "import pandas\n", + "import seaborn\n", + "import sklearn" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Load the `penguins.csv` dataset with the following cell of code. The column\n", + "\"Species\" contains the target variable. We extract the numerical columns that\n", + "quantify some attributes of such animals and our goal is to predict their\n", + "species based on those attributes stored in the dataframe named data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "penguins = pd.read_csv(\"../datasets/penguins.csv\")\n", + "\n", + "columns = [\"Body Mass (g)\", \"Flipper Length (mm)\", \"Culmen Length (mm)\"]\n", + "target_name = \"Species\"\n", + "\n", + "# Remove lines with missing values for the columns of interest\n", + "penguins_non_missing = penguins[columns + [target_name]].dropna()\n", + "\n", + "data = penguins_non_missing[columns]\n", + "target = penguins_non_missing[target_name]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Inspect the loaded data to select the correct assertions:\n", + "\n", + "Inspect the target variable and select the correct assertions from the\n", + "following proposals.\n", + "\n", + "- a) The problem to be solved is a regression problem\n", + "- b) The problem to be solved is a binary classification problem\n", + " (exactly 2 possible classes)\n", + "- c) The problem to be solved is a multiclass 
classification problem\n", + " (more than 2 possible classes)\n", + "\n", + "_Select a single answer_\n", + "\n", + "Hint: `target.nunique()` is a helpful method to answer this question." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code here." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Inspect the statistics of the target and individual features to select the\n", + "correct statements.\n", + "\n", + "- a) The proportions of the class counts are balanced: there are approximately\n", + " the same number of rows for each class\n", + "- b) The proportions of the class counts are imbalanced: some classes have\n", + " more than twice as many rows as others\n", + "- c) The input features have similar scales (ranges of values)\n", + "\n", + "_Select all answers that apply_\n", + "\n", + "Hint: `data.describe()` and `target.value_counts()` are methods that are\n", + "helpful to answer this question." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code here." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's now consider the following pipeline:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.neighbors import KNeighborsClassifier\n", + "from sklearn.preprocessing import StandardScaler\n", + "from sklearn.pipeline import Pipeline\n", + "\n", + "\n", + "model = Pipeline(steps=[\n", + " (\"preprocessor\", StandardScaler()),\n", + " (\"classifier\", KNeighborsClassifier(n_neighbors=5)),\n", + "])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Evaluate the pipeline using stratified 10-fold cross-validation with the\n", + "`balanced_accuracy` scoring metric to choose the correct statement in the list\n", + "below.\n", + "\n", + "You can:\n", + "\n", + "- use [`sklearn.model_selection.cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html)\n", + " to perform the cross-validation routine;\n", + "- provide an integer `10` to the parameter `cv` of `cross_validate` to use the\n", + " cross-validation with 10 folds;\n", + "- provide the string `\"balanced_accuracy\"` to the parameter `scoring` of\n", + " `cross_validate`.\n", + "\n", + "- a) The average cross-validated test balanced accuracy of the above pipeline\n", + " is between 0.9 and 1.0\n", + "- b) The average cross-validated test balanced accuracy of the above pipeline\n", + " is between 0.8 and 0.9\n", + "- c) The average cross-validated test balanced accuracy of the above pipeline\n", + " is between 0.5 and 0.8\n", + "\n", + "_Select a single answer_" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code here." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Repeat the evaluation by setting the parameters in order to select the correct\n", + "statements in the list below. 
We recall that you can use `model.get_params()`\n", + "to list the parameters of the pipeline and use\n", + "`model.set_params(param_name=param_value)` to update them.\n", + "\n", + "Remember that one way to compare two models is comparing the cross-validation\n", + "test scores of both models fold-to-fold, i.e. counting the number of folds\n", + "where one model has a better test score than the other.\n", + "\n", + "Looking at the individual cross-validation scores:\n", + "\n", + "- a) Using a model with `n_neighbors=5` is substantially better (at least 7 of\n", + " the cross-validations scores are strictly better) than a model with\n", + " `n_neighbors=51`\n", + "- b) Using a model with `n_neighbors=5` is substantially better (at least 7 of\n", + " the cross-validations scores are strictly better) than a model with\n", + " `n_neighbors=101`\n", + "- c) A 5 nearest neighbors using a `StandardScaler` is substantially better\n", + " (at least 7 of the cross-validations scores are strictly better) than a 5\n", + " nearest neighbors using the raw features (without scaling).\n", + "\n", + "_Select all answers that apply_" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code here." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will now study the impact of different preprocessors defined in the list below:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.preprocessing import MinMaxScaler\n", + "from sklearn.preprocessing import QuantileTransformer\n", + "from sklearn.preprocessing import PowerTransformer\n", + "\n", + "\n", + "all_preprocessors = [\n", + " None,\n", + " StandardScaler(),\n", + " MinMaxScaler(),\n", + " QuantileTransformer(n_quantiles=100),\n", + " PowerTransformer(method=\"box-cox\"),\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
**Note**\n", + "\n", + "The Box-Cox method is a common preprocessing strategy for positive values.\n", + "The other preprocessors work for any kind of numerical features. If you are\n", + "curious to read more about those methods, feel free to visit the\n", + "preprocessing section of the user\n", + "guide, although\n", + "that is not necessary to answer the following questions.\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Use `sklearn.model_selection.GridSearchCV` to study the impact of the choice\n", + "of the preprocessor and the number of neighbors on the stratified 10-fold\n", + "cross-validated `balanced_accuracy` metric. We want to study the `n_neighbors`\n", + "in the range `[5, 51, 101]` and `preprocessor` in the range\n", + "`all_preprocessors`. Although we wouldn't do this in a real setting (and\n", + "would prefer using nested cross-validation), for this question, do the\n", + "cross-validation on the entire dataset.\n", + "\n", + "Which of the following statements hold?\n", + "\n", + "- a) Looking at the individual cross-validation scores, the best ranked model\n", + " using a `StandardScaler` is substantially better (at least 7 of the\n", + " cross-validations scores are strictly better) than using any other\n", + " preprocessor\n", + "- b) Using any of the preprocessors always has a better ranking than using no\n", + " preprocessor, irrespective of the value of `n_neighbors`\n", + "- c) Looking at the individual cross-validation scores, the model with\n", + " `n_neighbors=5` and `StandardScaler` is substantially better (at least 7 of\n", + " the cross-validations scores are strictly better) than the model with\n", + " `n_neighbors=51` and `StandardScaler`\n", + "- d) Looking at the individual cross-validation scores, the model with\n", + " `n_neighbors=51` and `StandardScaler` is substantially better (at least 7 of\n", + " the cross-validations scores are strictly better) than the model with\n", + " `n_neighbors=101` and `StandardScaler`\n", + "\n", + "_Select all answers that apply_\n", + "\n", + "Hint: pass `{\"preprocessor\": all_preprocessors, \"classifier__n_neighbors\": [5,\n", + "51, 101]}` for the `param_grid` argument to the `GridSearchCV` class." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code here." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Evaluate the generalization performance of the best models found in each fold\n", + "using nested cross-validation. Set `return_estimator=True` and `cv=10` for the\n", + "outer loop. The scoring metric must be `balanced_accuracy`. The mean\n", + "generalization performance is:\n", + "\n", + "- a) better than 0.97\n", + "- b) between 0.92 and 0.97\n", + "- c) below 0.92\n", + "\n", + "_Select a single answer_" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code here." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Explore the set of best parameters that the different grid search models found\n", + "in each fold of the outer cross-validation. Remember that you can access them\n", + "with the `best_params_` attribute of the estimator. Select all the statements\n", + "that are true.\n", + "\n", + "- a) The tuned number of nearest neighbors is stable across folds\n", + "- b) The tuned number of nearest neighbors changes often across folds\n", + "- c) The optimal scaler is stable across folds\n", + "- d) The optimal scaler changes often across folds\n", + "\n", + "_Select all answers that apply_\n", + "\n", + "Hint: it is important to pass `return_estimator=True` to the `cross_validate`\n", + "function to be able to inspect the trained models saved in the `\"estimator\"`\n", + "field of the CV results. If you forgot to do so for the previous question, please\n", + "re-run the cross-validation with that option enabled." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Write your code here." + ] + } + ], + "metadata": { + "jupytext": { + "main_language": "python" + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file