Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
308 changes: 308 additions & 0 deletions notebooks/overfit_wrap_up_ex_00.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,308 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# \ud83c\udfc1 Wrap-up quiz 2\n",
"\n",
"This notebook contains the guided project to answer the hands-on questions\n",
"corresponding to the module \"Selecting the best model\" of the Associate\n",
"Practitioner Course. In this test **we do not have access to your code**. Only\n",
"it's output is evaluated by using the multiple choice questions, to be\n",
"answered in the dedicated User Interface.\n",
"\n",
"First run the following cell to initialize jupyterlite. Notice that only basic\n",
"libraries are available, such as pandas, matplotlib, seaborn and numpy.\n",
"Remember that the initial import of libraries can take longer than usual, it\n",
"may take around 10-20 seconds for the following cell to run. Please be\n",
"patient."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install seaborn==0.13.2\n",
"import matplotlib\n",
"import numpy\n",
"import pandas\n",
"import seaborn\n",
"import sklearn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the `blood_transfusion.csv` dataset with the following cell of code. The\n",
"column \"Class\" contains the target variable."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"blood_transfusion = pd.read_csv(\"../datasets/blood_transfusion.csv\")\n",
"target_name = \"Class\"\n",
"data = blood_transfusion.drop(columns=target_name)\n",
"target = blood_transfusion[target_name]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Select the correct answers from the following proposals.\n",
"\n",
"- a) The problem to be solved is a regression problem\n",
"- b) The problem to be solved is a binary classification problem (exactly 2\n",
" possible classes)\n",
"- c) The problem to be solved is a multiclass classification problem (more\n",
" than 2 possible classes)\n",
"- d) The proportions of the class counts are imbalanced: some classes have\n",
" more than twice as many rows than others\n",
"\n",
"_Select all answers that apply_\n",
"\n",
"Hint: `target.unique()` and `target.value_counts()` are helpful methods to\n",
"answer this question."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write your code here."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using a\n",
"[`sklearn.dummy.DummyClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html)\n",
"and the strategy `\"most_frequent\"`, what is the average of the accuracy scores\n",
"obtained by performing a 10-fold cross-validation?\n",
"\n",
"- a) ~25%\n",
"- b) ~50%\n",
"- c) ~75%\n",
"\n",
"_Select a single answer_\n",
"\n",
"Hint: You can check the documentation of\n",
"[`sklearn.model_selection.cross_val_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)\n",
"and\n",
"[`sklearn.model_selection.cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write your code here."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Repeat the previous experiment but compute the balanced accuracy instead of\n",
"the accuracy score. Pass `scoring=\"balanced_accuracy\"` when calling\n",
"`cross_validate` or `cross_val_score` functions, the mean score is:\n",
"\n",
"- a) ~25%\n",
"- b) ~50%\n",
"- c) ~75%\n",
"\n",
"_Select a single answer_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write your code here."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will use a `sklearn.neighbors.KNeighborsClassifier` for the remainder of this quiz.\n",
"\n",
"Why is it relevant to add a preprocessing step to scale the data using a\n",
"`StandardScaler` when working with a `KNeighborsClassifier`?\n",
"\n",
"- a) faster to compute the list of neighbors on scaled data\n",
"- b) k-nearest neighbors is based on computing some distances. Features need\n",
" to be normalized to contribute approximately equally to the distance\n",
" computation.\n",
"- c) This is irrelevant. One could use k-nearest neighbors without normalizing\n",
" the dataset and get a very similar cross-validation score.\n",
"\n",
"_Select a single answer_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Create a scikit-learn pipeline (using\n",
"`sklearn.pipeline.make_pipeline`) where a StandardScaler will be used to scale\n",
"the data followed by a KNeighborsClassifier. Use the default hyperparameters.\n",
"\n",
"Inspect the parameters of the created pipeline. What is the value of K, the\n",
"number of neighbors considered when predicting with the k-nearest neighbors.\n",
"\n",
"- a) 1\n",
"- b) 3\n",
"- c) 5\n",
"- d) 8\n",
"- e) 10\n",
"\n",
"_Select a single answer_\n",
"\n",
"Hint: You can use `model.get_params()` to get the parameters of a scikit-learn\n",
"estimator."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write your code here."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Set `n_neighbors=1` in the previous model and evaluate it using a 10-fold\n",
"cross-validation. Use the balanced accuracy as a score. What can you say about\n",
"this model? Compare the average of the train and test scores to argument your\n",
"answer.\n",
"\n",
"- a) The model underfits\n",
"- b) The model generalizes\n",
"- c) The model overfits\n",
"\n",
"_Select a single answer_\n",
"\n",
"Hint: compute the average test score and the average train score and compare\n",
"them. Make sure to pass `return_train_score=True` to the `cross_validate`\n",
"function to also compute the train score."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write your code here."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now study the effect of the parameter n_neighbors on the train and test\n",
"score using a validation curve. You can use the following parameter range:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"param_range = np.array([1, 2, 5, 10, 20, 50, 100, 200, 500])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Also, use a 5-fold cross-validation and compute the balanced accuracy score\n",
"instead of the default accuracy score (check the scoring parameter). Finally,\n",
"plot the average train and test scores for the different value of the\n",
"hyperparameter. Remember that the name of the parameter can be found using\n",
"`model.get_params()`.\n",
"\n",
"Select the true affirmations stated below:\n",
"\n",
"- a) The model underfits for a range of `n_neighbors` values between 1 to 10\n",
"- b) The model underfits for a range of `n_neighbors` values between 10 to 100\n",
"- c) The model underfits for a range of `n_neighbors` values between 100 to 500\n",
"\n",
"_Select a single answer_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write your code here."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Select the most correct of the affirmations stated below:\n",
"\n",
"- a) The model overfits for a range of `n_neighbors` values between 1 to 10\n",
"- b) The model overfits for a range of `n_neighbors` values between 10 to 100\n",
"- c) The model overfits for a range of `n_neighbors` values between 100 to 500\n",
"\n",
"_Select a single answer_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Select the most correct of the affirmations stated below:\n",
"#\n",
"# - a) The model best generalizes for a range of `n_neighbors` values between 1 to 10\n",
"# - b) The model best generalizes for a range of `n_neighbors` values between 10 to 100\n",
"# - c) The model best generalizes for a range of `n_neighbors` values between 100 to 500\n",
"#\n",
"# _Select a single answer_"
]
}
],
"metadata": {
"jupytext": {
"main_language": "python"
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Loading