probabl-ai · ArturoAmorQ · Jun 18, 2025 · Jun 18, 2025
diff --git a/notebooks/overfit_wrap_up_ex_00.ipynb b/notebooks/overfit_wrap_up_ex_00.ipynb
@@ -0,0 +1,308 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# \ud83c\udfc1 Wrap-up quiz 2\n",
+    "\n",
+    "This notebook contains the guided project to answer the hands-on questions\n",
+    "corresponding to the module \"Selecting the best model\" of the Associate\n",
+    "Practitioner Course. In this test **we do not have access to your code**. Only\n",
+    "it's output is evaluated by using the multiple choice questions, to be\n",
+    "answered in the dedicated User Interface.\n",
+    "\n",
+    "First run the following cell to initialize jupyterlite. Notice that only basic\n",
+    "libraries are available, such as pandas, matplotlib, seaborn and numpy.\n",
+    "Remember that the initial import of libraries can take longer than usual, it\n",
+    "may take around 10-20 seconds for the following cell to run. Please be\n",
+    "patient."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install seaborn==0.13.2\n",
+    "import matplotlib\n",
+    "import numpy\n",
+    "import pandas\n",
+    "import seaborn\n",
+    "import sklearn"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Load the `blood_transfusion.csv` dataset with the following cell of code. The\n",
+    "column \"Class\" contains the target variable."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "blood_transfusion = pd.read_csv(\"../datasets/blood_transfusion.csv\")\n",
+    "target_name = \"Class\"\n",
+    "data = blood_transfusion.drop(columns=target_name)\n",
+    "target = blood_transfusion[target_name]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Select the correct answers from the following proposals.\n",
+    "\n",
+    "- a) The problem to be solved is a regression problem\n",
+    "- b) The problem to be solved is a binary classification problem (exactly 2\n",
+    "  possible classes)\n",
+    "- c) The problem to be solved is a multiclass classification problem (more\n",
+    "  than 2 possible classes)\n",
+    "- d) The proportions of the class counts are imbalanced: some classes have\n",
+    "  more than twice as many rows than others\n",
+    "\n",
+    "_Select all answers that apply_\n",
+    "\n",
+    "Hint: `target.unique()` and `target.value_counts()` are helpful methods to\n",
+    "answer this question."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Write your code here."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Using a\n",
+    "[`sklearn.dummy.DummyClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html)\n",
+    "and the strategy `\"most_frequent\"`, what is the average of the accuracy scores\n",
+    "obtained by performing a 10-fold cross-validation?\n",
+    "\n",
+    "- a) ~25%\n",
+    "- b) ~50%\n",
+    "- c) ~75%\n",
+    "\n",
+    "_Select a single answer_\n",
+    "\n",
+    "Hint: You can check the documentation of\n",
+    "[`sklearn.model_selection.cross_val_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)\n",
+    "and\n",
+    "[`sklearn.model_selection.cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Write your code here."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Repeat the previous experiment but compute the balanced accuracy instead of\n",
+    "the accuracy score. Pass `scoring=\"balanced_accuracy\"` when calling\n",
+    "`cross_validate` or `cross_val_score` functions, the mean score is:\n",
+    "\n",
+    "- a) ~25%\n",
+    "- b) ~50%\n",
+    "- c) ~75%\n",
+    "\n",
+    "_Select a single answer_"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Write your code here."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We will use a `sklearn.neighbors.KNeighborsClassifier` for the remainder of this quiz.\n",
+    "\n",
+    "Why is it relevant to add a preprocessing step to scale the data using a\n",
+    "`StandardScaler` when working with a `KNeighborsClassifier`?\n",
+    "\n",
+    "- a) faster to compute the list of neighbors on scaled data\n",
+    "- b) k-nearest neighbors is based on computing some distances. Features need\n",
+    "  to be normalized to contribute approximately equally to the distance\n",
+    "  computation.\n",
+    "- c) This is irrelevant. One could use k-nearest neighbors without normalizing\n",
+    "  the dataset and get a very similar cross-validation score.\n",
+    "\n",
+    "_Select a single answer_"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Create a scikit-learn pipeline (using\n",
+    "`sklearn.pipeline.make_pipeline`) where a StandardScaler will be used to scale\n",
+    "the data followed by a KNeighborsClassifier. Use the default hyperparameters.\n",
+    "\n",
+    "Inspect the parameters of the created pipeline. What is the value of K, the\n",
+    "number of neighbors considered when predicting with the k-nearest neighbors.\n",
+    "\n",
+    "- a) 1\n",
+    "- b) 3\n",
+    "- c) 5\n",
+    "- d) 8\n",
+    "- e) 10\n",
+    "\n",
+    "_Select a single answer_\n",
+    "\n",
+    "Hint: You can use `model.get_params()` to get the parameters of a scikit-learn\n",
+    "estimator."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Write your code here."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Set `n_neighbors=1` in the previous model and evaluate it using a 10-fold\n",
+    "cross-validation. Use the balanced accuracy as a score. What can you say about\n",
+    "this model? Compare the average of the train and test scores to argument your\n",
+    "answer.\n",
+    "\n",
+    "- a) The model underfits\n",
+    "- b) The model generalizes\n",
+    "- c) The model overfits\n",
+    "\n",
+    "_Select a single answer_\n",
+    "\n",
+    "Hint: compute the average test score and the average train score and compare\n",
+    "them. Make sure to pass `return_train_score=True` to the `cross_validate`\n",
+    "function to also compute the train score."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Write your code here."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We now study the effect of the parameter n_neighbors on the train and test\n",
+    "score using a validation curve. You can use the following parameter range:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "\n",
+    "param_range = np.array([1, 2, 5, 10, 20, 50, 100, 200, 500])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Also, use a 5-fold cross-validation and compute the balanced accuracy score\n",
+    "instead of the default accuracy score (check the scoring parameter). Finally,\n",
+    "plot the average train and test scores for the different value of the\n",
+    "hyperparameter. Remember that the name of the parameter can be found using\n",
+    "`model.get_params()`.\n",
+    "\n",
+    "Select the true affirmations stated below:\n",
+    "\n",
+    "- a) The model underfits for a range of `n_neighbors` values between 1 to 10\n",
+    "- b) The model underfits for a range of `n_neighbors` values between 10 to 100\n",
+    "- c) The model underfits for a range of `n_neighbors` values between 100 to 500\n",
+    "\n",
+    "_Select a single answer_"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Write your code here."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Select the most correct of the affirmations stated below:\n",
+    "\n",
+    "- a) The model overfits for a range of `n_neighbors` values between 1 to 10\n",
+    "- b) The model overfits for a range of `n_neighbors` values between 10 to 100\n",
+    "- c) The model overfits for a range of `n_neighbors` values between 100 to 500\n",
+    "\n",
+    "_Select a single answer_"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Select the most correct of the affirmations stated below:\n",
+    "#\n",
+    "# - a) The model best generalizes for a range of `n_neighbors` values between 1 to 10\n",
+    "# - b) The model best generalizes for a range of `n_neighbors` values between 10 to 100\n",
+    "# - c) The model best generalizes for a range of `n_neighbors` values between 100 to 500\n",
+    "#\n",
+    "# _Select a single answer_"
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "main_language": "python"
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "name": "python3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}