Commit 5b723b0

Add python notebooks for Associate wrap-up quizzes
2 parents 666534b + 5bc5dd2 commit 5b723b0

3 files changed

+965
-0
lines changed

Lines changed: 308 additions & 0 deletions
@@ -0,0 +1,308 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# \ud83c\udfc1 Wrap-up quiz 2\n",
8+
"\n",
9+
"This notebook contains the guided project to answer the hands-on questions\n",
10+
"corresponding to the module \"Selecting the best model\" of the Associate\n",
11+
"Practitioner Course. In this test **we do not have access to your code**. Only\n",
12+
"its output is evaluated through the multiple-choice questions, to be\n",
13+
"answered in the dedicated User Interface.\n",
14+
"\n",
15+
"First run the following cell to initialize JupyterLite. Note that only basic\n",
16+
"libraries are available, such as pandas, matplotlib, seaborn and numpy.\n",
17+
"Remember that the initial import of libraries can take longer than usual: it\n",
18+
"may take around 10-20 seconds for the following cell to run. Please be\n",
19+
"patient."
20+
]
21+
},
22+
{
23+
"cell_type": "code",
24+
"execution_count": null,
25+
"metadata": {},
26+
"outputs": [],
27+
"source": [
28+
"%pip install seaborn==0.13.2\n",
29+
"import matplotlib\n",
30+
"import numpy\n",
31+
"import pandas\n",
32+
"import seaborn\n",
33+
"import sklearn"
34+
]
35+
},
36+
{
37+
"cell_type": "markdown",
38+
"metadata": {},
39+
"source": [
40+
"Load the `blood_transfusion.csv` dataset with the following code cell. The\n",
41+
"column \"Class\" contains the target variable."
42+
]
43+
},
44+
{
45+
"cell_type": "code",
46+
"execution_count": null,
47+
"metadata": {},
48+
"outputs": [],
49+
"source": [
50+
"import pandas as pd\n",
51+
"\n",
52+
"blood_transfusion = pd.read_csv(\"../datasets/blood_transfusion.csv\")\n",
53+
"target_name = \"Class\"\n",
54+
"data = blood_transfusion.drop(columns=target_name)\n",
55+
"target = blood_transfusion[target_name]"
56+
]
57+
},
58+
{
59+
"cell_type": "markdown",
60+
"metadata": {},
61+
"source": [
62+
"Select the correct answers from the following proposals.\n",
63+
"\n",
64+
"- a) The problem to be solved is a regression problem\n",
65+
"- b) The problem to be solved is a binary classification problem (exactly 2\n",
66+
" possible classes)\n",
67+
"- c) The problem to be solved is a multiclass classification problem (more\n",
68+
" than 2 possible classes)\n",
69+
"- d) The class counts are imbalanced: some classes have\n",
70+
" more than twice as many rows as others\n",
71+
"\n",
72+
"_Select all answers that apply_\n",
73+
"\n",
74+
"Hint: `target.unique()` and `target.value_counts()` are helpful methods to\n",
75+
"answer this question."
76+
]
77+
},
78+
{
79+
"cell_type": "code",
80+
"execution_count": null,
81+
"metadata": {},
82+
"outputs": [],
83+
"source": [
84+
"# Write your code here."
85+
]
86+
},
87+
{
88+
"cell_type": "markdown",
89+
"metadata": {},
90+
"source": [
91+
"Using a\n",
92+
"[`sklearn.dummy.DummyClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html)\n",
93+
"and the strategy `\"most_frequent\"`, what is the average of the accuracy scores\n",
94+
"obtained by performing a 10-fold cross-validation?\n",
95+
"\n",
96+
"- a) ~25%\n",
97+
"- b) ~50%\n",
98+
"- c) ~75%\n",
99+
"\n",
100+
"_Select a single answer_\n",
101+
"\n",
102+
"Hint: You can check the documentation of\n",
103+
"[`sklearn.model_selection.cross_val_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)\n",
104+
"and\n",
105+
"[`sklearn.model_selection.cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html)."
106+
]
107+
},
108+
{
109+
"cell_type": "code",
110+
"execution_count": null,
111+
"metadata": {},
112+
"outputs": [],
113+
"source": [
114+
"# Write your code here."
115+
]
116+
},
117+
{
118+
"cell_type": "markdown",
119+
"metadata": {},
120+
"source": [
121+
"Repeat the previous experiment but compute the balanced accuracy instead of\n",
122+
"the accuracy score. Pass `scoring=\"balanced_accuracy\"` when calling\n",
123+
"`cross_validate` or `cross_val_score`. The mean score is:\n",
124+
"\n",
125+
"- a) ~25%\n",
126+
"- b) ~50%\n",
127+
"- c) ~75%\n",
128+
"\n",
129+
"_Select a single answer_"
130+
]
131+
},
132+
{
133+
"cell_type": "code",
134+
"execution_count": null,
135+
"metadata": {},
136+
"outputs": [],
137+
"source": [
138+
"# Write your code here."
139+
]
140+
},
141+
{
142+
"cell_type": "markdown",
143+
"metadata": {},
144+
"source": [
145+
"We will use a `sklearn.neighbors.KNeighborsClassifier` for the remainder of this quiz.\n",
146+
"\n",
147+
"Why is it relevant to add a preprocessing step to scale the data using a\n",
148+
"`StandardScaler` when working with a `KNeighborsClassifier`?\n",
149+
"\n",
150+
"- a) It is faster to compute the list of neighbors on scaled data\n",
151+
"- b) k-nearest neighbors is based on computing some distances. Features need\n",
152+
" to be normalized to contribute approximately equally to the distance\n",
153+
" computation.\n",
154+
"- c) This is irrelevant. One could use k-nearest neighbors without normalizing\n",
155+
" the dataset and get a very similar cross-validation score.\n",
156+
"\n",
157+
"_Select a single answer_"
158+
]
159+
},
160+
{
161+
"cell_type": "markdown",
162+
"metadata": {},
163+
"source": [
164+
"Create a scikit-learn pipeline (using\n",
165+
"`sklearn.pipeline.make_pipeline`) where a `StandardScaler` will be used to scale\n",
166+
"the data followed by a `KNeighborsClassifier`. Use the default hyperparameters.\n",
167+
"\n",
168+
"Inspect the parameters of the created pipeline. What is the value of K, the\n",
169+
"number of neighbors considered when predicting with k-nearest neighbors?\n",
170+
"\n",
171+
"- a) 1\n",
172+
"- b) 3\n",
173+
"- c) 5\n",
174+
"- d) 8\n",
175+
"- e) 10\n",
176+
"\n",
177+
"_Select a single answer_\n",
178+
"\n",
179+
"Hint: You can use `model.get_params()` to get the parameters of a scikit-learn\n",
180+
"estimator."
181+
]
182+
},
183+
{
184+
"cell_type": "code",
185+
"execution_count": null,
186+
"metadata": {},
187+
"outputs": [],
188+
"source": [
189+
"# Write your code here."
190+
]
191+
},
192+
{
193+
"cell_type": "markdown",
194+
"metadata": {},
195+
"source": [
196+
"Set `n_neighbors=1` in the previous model and evaluate it using a 10-fold\n",
197+
"cross-validation. Use the balanced accuracy as a score. What can you say about\n",
198+
"this model? Compare the average of the train and test scores to support your\n",
199+
"answer.\n",
200+
"\n",
201+
"- a) The model underfits\n",
202+
"- b) The model generalizes\n",
203+
"- c) The model overfits\n",
204+
"\n",
205+
"_Select a single answer_\n",
206+
"\n",
207+
"Hint: compute the average test score and the average train score and compare\n",
208+
"them. Make sure to pass `return_train_score=True` to the `cross_validate`\n",
209+
"function to also compute the train score."
210+
]
211+
},
212+
{
213+
"cell_type": "code",
214+
"execution_count": null,
215+
"metadata": {},
216+
"outputs": [],
217+
"source": [
218+
"# Write your code here."
219+
]
220+
},
221+
{
222+
"cell_type": "markdown",
223+
"metadata": {},
224+
"source": [
225+
"We now study the effect of the parameter `n_neighbors` on the train and test\n",
226+
"scores using a validation curve. You can use the following parameter range:"
227+
]
228+
},
229+
{
230+
"cell_type": "code",
231+
"execution_count": null,
232+
"metadata": {},
233+
"outputs": [],
234+
"source": [
235+
"import numpy as np\n",
236+
"\n",
237+
"param_range = np.array([1, 2, 5, 10, 20, 50, 100, 200, 500])"
238+
]
239+
},
240+
{
241+
"cell_type": "markdown",
242+
"metadata": {},
243+
"source": [
244+
"Also, use a 5-fold cross-validation and compute the balanced accuracy score\n",
245+
"instead of the default accuracy score (check the `scoring` parameter). Finally,\n",
246+
"plot the average train and test scores for the different values of the\n",
247+
"hyperparameter. Remember that the name of the parameter can be found using\n",
248+
"`model.get_params()`.\n",
249+
"\n",
250+
"Select the true statement below:\n",
251+
"\n",
252+
"- a) The model underfits for a range of `n_neighbors` values between 1 and 10\n",
253+
"- b) The model underfits for a range of `n_neighbors` values between 10 and 100\n",
254+
"- c) The model underfits for a range of `n_neighbors` values between 100 and 500\n",
255+
"\n",
256+
"_Select a single answer_"
257+
]
258+
},
259+
{
260+
"cell_type": "code",
261+
"execution_count": null,
262+
"metadata": {},
263+
"outputs": [],
264+
"source": [
265+
"# Write your code here."
266+
]
267+
},
268+
{
269+
"cell_type": "markdown",
270+
"metadata": {},
271+
"source": [
272+
"Select the most accurate statement below:\n",
273+
"\n",
274+
"- a) The model overfits for a range of `n_neighbors` values between 1 and 10\n",
275+
"- b) The model overfits for a range of `n_neighbors` values between 10 and 100\n",
276+
"- c) The model overfits for a range of `n_neighbors` values between 100 and 500\n",
277+
"\n",
278+
"_Select a single answer_"
279+
]
280+
},
281+
{
282+
"cell_type": "markdown",
284+
"metadata": {},
286+
"source": [
287+
"Select the most accurate statement below:\n",
288+
"\n",
289+
"- a) The model generalizes best for a range of `n_neighbors` values between 1 and 10\n",
290+
"- b) The model generalizes best for a range of `n_neighbors` values between 10 and 100\n",
291+
"- c) The model generalizes best for a range of `n_neighbors` values between 100 and 500\n",
292+
"\n",
293+
"_Select a single answer_"
294+
]
295+
}
296+
],
297+
"metadata": {
298+
"jupytext": {
299+
"main_language": "python"
300+
},
301+
"kernelspec": {
302+
"display_name": "Python 3",
303+
"name": "python3"
304+
}
305+
},
306+
"nbformat": 4,
307+
"nbformat_minor": 5
308+
}
