{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# \ud83c\udfc1 Wrap-up quiz 2\n",
    "\n",
    "This notebook contains the guided project to answer the hands-on questions\n",
    "corresponding to the module \"Selecting the best model\" of the Associate\n",
    "Practitioner Course. In this test **we do not have access to your code**. Only\n",
    "its output is evaluated through the multiple choice questions, to be\n",
    "answered in the dedicated User Interface.\n",
    "\n",
    "First run the following cell to initialize jupyterlite. Notice that only basic\n",
    "libraries are available, such as pandas, matplotlib, seaborn and numpy.\n",
    "Remember that the initial import of libraries can take longer than usual; it\n",
    "may take around 10-20 seconds for the following cell to run. Please be\n",
    "patient."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install seaborn==0.13.2\n",
    "import matplotlib\n",
    "import numpy\n",
    "import pandas\n",
    "import seaborn\n",
    "import sklearn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load the `blood_transfusion.csv` dataset with the following cell of code. The\n",
    "column \"Class\" contains the target variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "blood_transfusion = pd.read_csv(\"../datasets/blood_transfusion.csv\")\n",
    "target_name = \"Class\"\n",
    "data = blood_transfusion.drop(columns=target_name)\n",
    "target = blood_transfusion[target_name]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Select the correct answers from the following proposals.\n",
    "\n",
    "- a) The problem to be solved is a regression problem\n",
    "- b) The problem to be solved is a binary classification problem (exactly 2\n",
    "  possible classes)\n",
    "- c) The problem to be solved is a multiclass classification problem (more\n",
    "  than 2 possible classes)\n",
    "- d) The proportions of the class counts are imbalanced: some classes have\n",
    "  more than twice as many rows as others\n",
    "\n",
    "_Select all answers that apply_\n",
    "\n",
    "Hint: `target.unique()` and `target.value_counts()` are helpful methods to\n",
    "answer this question."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here.\n",
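    "\n",
    "# A minimal sketch of one possible approach (not the only valid one): the\n",
    "# unique labels and their counts are enough to tell whether the problem is\n",
    "# binary or multiclass and whether the classes are imbalanced.\n",
    "print(target.unique())\n",
    "print(target.value_counts())"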
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using a\n",
    "[`sklearn.dummy.DummyClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html)\n",
    "and the strategy `\"most_frequent\"`, what is the average of the accuracy scores\n",
    "obtained by performing a 10-fold cross-validation?\n",
    "\n",
    "- a) ~25%\n",
    "- b) ~50%\n",
    "- c) ~75%\n",
    "\n",
    "_Select a single answer_\n",
    "\n",
    "Hint: You can check the documentation of\n",
    "[`sklearn.model_selection.cross_val_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)\n",
    "and\n",
    "[`sklearn.model_selection.cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here.\n",
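    "\n",
    "# A minimal sketch, assuming the `data` and `target` variables defined above:\n",
    "# a DummyClassifier with the \"most_frequent\" strategy, evaluated with a\n",
    "# 10-fold cross-validation and the default accuracy score.\n",
    "from sklearn.dummy import DummyClassifier\n",
    "from sklearn.model_selection import cross_val_score\n",
    "\n",
    "dummy = DummyClassifier(strategy=\"most_frequent\")\n",
    "scores = cross_val_score(dummy, data, target, cv=10)\n",
    "print(scores.mean())"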
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Repeat the previous experiment but compute the balanced accuracy instead of\n",
    "the accuracy score. Pass `scoring=\"balanced_accuracy\"` when calling the\n",
    "`cross_validate` or `cross_val_score` function. The mean score is:\n",
    "\n",
    "- a) ~25%\n",
    "- b) ~50%\n",
    "- c) ~75%\n",
    "\n",
    "_Select a single answer_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here.\n",
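    "\n",
    "# A minimal sketch, reusing the dummy classifier from the previous question\n",
    "# but scoring with the balanced accuracy instead of the default accuracy.\n",
    "from sklearn.dummy import DummyClassifier\n",
    "from sklearn.model_selection import cross_val_score\n",
    "\n",
    "dummy = DummyClassifier(strategy=\"most_frequent\")\n",
    "scores = cross_val_score(\n",
    "    dummy, data, target, cv=10, scoring=\"balanced_accuracy\"\n",
    ")\n",
    "print(scores.mean())"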
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use a `sklearn.neighbors.KNeighborsClassifier` for the remainder of\n",
    "this quiz.\n",
    "\n",
    "Why is it relevant to add a preprocessing step to scale the data using a\n",
    "`StandardScaler` when working with a `KNeighborsClassifier`?\n",
    "\n",
    "- a) faster to compute the list of neighbors on scaled data\n",
    "- b) k-nearest neighbors is based on computing some distances. Features need\n",
    "  to be normalized to contribute approximately equally to the distance\n",
    "  computation.\n",
    "- c) This is irrelevant. One could use k-nearest neighbors without normalizing\n",
    "  the dataset and get a very similar cross-validation score.\n",
    "\n",
    "_Select a single answer_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a scikit-learn pipeline (using `sklearn.pipeline.make_pipeline`) where\n",
    "a `StandardScaler` is used to scale the data, followed by a\n",
    "`KNeighborsClassifier`. Use the default hyperparameters.\n",
    "\n",
    "Inspect the parameters of the created pipeline. What is the value of K, the\n",
    "number of neighbors considered when predicting with the k-nearest neighbors?\n",
    "\n",
    "- a) 1\n",
    "- b) 3\n",
    "- c) 5\n",
    "- d) 8\n",
    "- e) 10\n",
    "\n",
    "_Select a single answer_\n",
    "\n",
    "Hint: You can use `model.get_params()` to get the parameters of a scikit-learn\n",
    "estimator."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here.\n",
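    "\n",
    "# A minimal sketch: build the pipeline with default hyperparameters and look\n",
    "# up the number of neighbors in its parameters. With `make_pipeline`, the\n",
    "# step name is the lowercased class name, hence the parameter name below.\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "\n",
    "model = make_pipeline(StandardScaler(), KNeighborsClassifier())\n",
    "print(model.get_params()[\"kneighborsclassifier__n_neighbors\"])"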
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set `n_neighbors=1` in the previous model and evaluate it using a 10-fold\n",
    "cross-validation. Use the balanced accuracy as a score. What can you say about\n",
    "this model? Compare the average of the train and test scores to support your\n",
    "answer.\n",
    "\n",
    "- a) The model underfits\n",
    "- b) The model generalizes\n",
    "- c) The model overfits\n",
    "\n",
    "_Select a single answer_\n",
    "\n",
    "Hint: compute the average test score and the average train score and compare\n",
    "them. Make sure to pass `return_train_score=True` to the `cross_validate`\n",
    "function to also compute the train score."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here.\n",
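    "\n",
    "# A minimal sketch: rebuild the pipeline with n_neighbors=1 and compare the\n",
    "# average train and test balanced accuracy over a 10-fold cross-validation.\n",
    "from sklearn.model_selection import cross_validate\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "\n",
    "model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))\n",
    "cv_results = cross_validate(\n",
    "    model, data, target, cv=10, scoring=\"balanced_accuracy\",\n",
    "    return_train_score=True,\n",
    ")\n",
    "print(cv_results[\"train_score\"].mean(), cv_results[\"test_score\"].mean())"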
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now study the effect of the parameter `n_neighbors` on the train and test\n",
    "scores using a validation curve. You can use the following parameter range:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "param_range = np.array([1, 2, 5, 10, 20, 50, 100, 200, 500])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Also, use a 5-fold cross-validation and compute the balanced accuracy score\n",
    "instead of the default accuracy score (check the scoring parameter). Finally,\n",
    "plot the average train and test scores for the different values of the\n",
    "hyperparameter. Remember that the name of the parameter can be found using\n",
    "`model.get_params()`.\n",
    "\n",
    "Select the true affirmation among those stated below:\n",
    "\n",
    "- a) The model underfits for a range of `n_neighbors` values between 1 and 10\n",
    "- b) The model underfits for a range of `n_neighbors` values between 10 and 100\n",
    "- c) The model underfits for a range of `n_neighbors` values between 100 and 500\n",
    "\n",
    "_Select a single answer_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here.\n",
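    "\n",
    "# A minimal sketch, assuming the `param_range` array defined above: compute a\n",
    "# validation curve over n_neighbors with a 5-fold cross-validation and the\n",
    "# balanced accuracy, then plot the average train and test scores.\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.model_selection import validation_curve\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "\n",
    "model = make_pipeline(StandardScaler(), KNeighborsClassifier())\n",
    "train_scores, test_scores = validation_curve(\n",
    "    model, data, target,\n",
    "    param_name=\"kneighborsclassifier__n_neighbors\",\n",
    "    param_range=param_range, cv=5, scoring=\"balanced_accuracy\",\n",
    ")\n",
    "plt.plot(param_range, train_scores.mean(axis=1), label=\"train score\")\n",
    "plt.plot(param_range, test_scores.mean(axis=1), label=\"test score\")\n",
    "plt.xscale(\"log\")\n",
    "plt.xlabel(\"n_neighbors\")\n",
    "plt.ylabel(\"balanced accuracy\")\n",
    "_ = plt.legend()"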
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Select the most correct of the affirmations stated below:\n",
    "\n",
    "- a) The model overfits for a range of `n_neighbors` values between 1 and 10\n",
    "- b) The model overfits for a range of `n_neighbors` values between 10 and 100\n",
    "- c) The model overfits for a range of `n_neighbors` values between 100 and 500\n",
    "\n",
    "_Select a single answer_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Select the most correct of the affirmations stated below:\n",
    "\n",
    "- a) The model best generalizes for a range of `n_neighbors` values between 1 and 10\n",
    "- b) The model best generalizes for a range of `n_neighbors` values between 10 and 100\n",
    "- c) The model best generalizes for a range of `n_neighbors` values between 100 and 500\n",
    "\n",
    "_Select a single answer_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your code here."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}