1+ {
2+ "cells" : [
3+ {
4+ "cell_type" : "markdown" ,
5+ "metadata" : {},
6+ "source" : [
7+ " # \ud83c\udfc1 Wrap-up quiz 2\n " ,
8+ " \n " ,
9+ " This notebook contains the guided project to answer the hands-on questions\n " ,
10+ " corresponding to the module \"Selecting the best model\" of the Associate\n " ,
11+ " Practitioner Course. In this test **we do not have access to your code**. Only\n " ,
12+ " its output is evaluated, through the multiple-choice questions to be\n " ,
13+ " answered in the dedicated User Interface.\n " ,
14+ " \n " ,
15+ " First run the following cell to initialize JupyterLite. Notice that only basic\n " ,
16+ " libraries are available, such as pandas, matplotlib, seaborn and numpy.\n " ,
17+ " Remember that the initial import of libraries can take longer than usual: it\n " ,
18+ " may take around 10-20 seconds for the following cell to run. Please be\n " ,
19+ " patient."
20+ ]
21+ },
22+ {
23+ "cell_type" : "code" ,
24+ "execution_count" : null ,
25+ "metadata" : {},
26+ "outputs" : [],
27+ "source" : [
28+ " %pip install seaborn==0.13.2\n " ,
29+ " import matplotlib\n " ,
30+ " import numpy\n " ,
31+ " import pandas\n " ,
32+ " import seaborn\n " ,
33+ " import sklearn"
34+ ]
35+ },
36+ {
37+ "cell_type" : "markdown" ,
38+ "metadata" : {},
39+ "source" : [
40+ " Load the `blood_transfusion.csv` dataset with the following code cell. The\n " ,
41+ " column \"Class\" contains the target variable."
42+ ]
43+ },
44+ {
45+ "cell_type" : "code" ,
46+ "execution_count" : null ,
47+ "metadata" : {},
48+ "outputs" : [],
49+ "source" : [
50+ " import pandas as pd\n " ,
51+ " \n " ,
52+ " blood_transfusion = pd.read_csv(\"../datasets/blood_transfusion.csv\")\n " ,
53+ " target_name = \"Class\"\n " ,
54+ " data = blood_transfusion.drop(columns=target_name)\n " ,
55+ " target = blood_transfusion[target_name]"
56+ ]
57+ },
58+ {
59+ "cell_type" : "markdown" ,
60+ "metadata" : {},
61+ "source" : [
62+ " Select the correct answers from the following proposals.\n " ,
63+ " \n " ,
64+ " - a) The problem to be solved is a regression problem\n " ,
65+ " - b) The problem to be solved is a binary classification problem (exactly 2\n " ,
66+ " possible classes)\n " ,
67+ " - c) The problem to be solved is a multiclass classification problem (more\n " ,
68+ " than 2 possible classes)\n " ,
69+ " - d) The proportions of the class counts are imbalanced: some classes have\n " ,
70+ " more than twice as many rows as others\n " ,
71+ " \n " ,
72+ " _Select all answers that apply_\n " ,
73+ " \n " ,
74+ " Hint: `target.unique()` and `target.value_counts()` are helpful methods to\n " ,
75+ " answer this question."
76+ ]
77+ },
78+ {
79+ "cell_type" : "code" ,
80+ "execution_count" : null ,
81+ "metadata" : {},
82+ "outputs" : [],
83+ "source" : [
84+ " # Write your code here."
85+ ]
86+ },
87+ {
88+ "cell_type" : "markdown" ,
89+ "metadata" : {},
90+ "source" : [
91+ " Using a\n " ,
92+ " [`sklearn.dummy.DummyClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html)\n " ,
93+ " and the strategy `\"most_frequent\"`, what is the average of the accuracy scores\n " ,
94+ " obtained by performing a 10-fold cross-validation?\n " ,
95+ " \n " ,
96+ " - a) ~25%\n " ,
97+ " - b) ~50%\n " ,
98+ " - c) ~75%\n " ,
99+ " \n " ,
100+ " _Select a single answer_\n " ,
101+ " \n " ,
102+ " Hint: You can check the documentation of\n " ,
103+ " [`sklearn.model_selection.cross_val_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)\n " ,
104+ " and\n " ,
105+ " [`sklearn.model_selection.cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html)."
106+ ]
107+ },
108+ {
109+ "cell_type" : "code" ,
110+ "execution_count" : null ,
111+ "metadata" : {},
112+ "outputs" : [],
113+ "source" : [
114+ " # Write your code here."
115+ ]
116+ },
117+ {
118+ "cell_type" : "markdown" ,
119+ "metadata" : {},
120+ "source" : [
121+ " Repeat the previous experiment but compute the balanced accuracy instead of\n " ,
122+ " the accuracy score, by passing `scoring=\"balanced_accuracy\"` when calling\n " ,
123+ " `cross_validate` or `cross_val_score`. The mean score is:\n " ,
124+ " \n " ,
125+ " - a) ~25%\n " ,
126+ " - b) ~50%\n " ,
127+ " - c) ~75%\n " ,
128+ " \n " ,
129+ " _Select a single answer_"
130+ ]
131+ },
132+ {
133+ "cell_type" : "code" ,
134+ "execution_count" : null ,
135+ "metadata" : {},
136+ "outputs" : [],
137+ "source" : [
138+ " # Write your code here."
139+ ]
140+ },
141+ {
142+ "cell_type" : "markdown" ,
143+ "metadata" : {},
144+ "source" : [
145+ " We will use a `sklearn.neighbors.KNeighborsClassifier` for the remainder of this quiz.\n " ,
146+ " \n " ,
147+ " Why is it relevant to add a preprocessing step to scale the data using a\n " ,
148+ " `StandardScaler` when working with a `KNeighborsClassifier`?\n " ,
149+ " \n " ,
150+ " - a) It is faster to compute the list of neighbors on scaled data\n " ,
151+ " - b) k-nearest neighbors is based on computing some distances. Features need\n " ,
152+ " to be normalized to contribute approximately equally to the distance\n " ,
153+ " computation.\n " ,
154+ " - c) This is irrelevant. One could use k-nearest neighbors without normalizing\n " ,
155+ " the dataset and get a very similar cross-validation score.\n " ,
156+ " \n " ,
157+ " _Select a single answer_"
158+ ]
159+ },
160+ {
161+ "cell_type" : "markdown" ,
162+ "metadata" : {},
163+ "source" : [
164+ " Create a scikit-learn pipeline (using\n " ,
165+ " `sklearn.pipeline.make_pipeline`) where a `StandardScaler` is used to scale\n " ,
166+ " the data, followed by a `KNeighborsClassifier`. Use the default hyperparameters.\n " ,
167+ " \n " ,
168+ " Inspect the parameters of the created pipeline. What is the value of K, the\n " ,
169+ " number of neighbors considered when predicting with k-nearest neighbors?\n " ,
170+ " \n " ,
171+ " - a) 1\n " ,
172+ " - b) 3\n " ,
173+ " - c) 5\n " ,
174+ " - d) 8\n " ,
175+ " - e) 10\n " ,
176+ " \n " ,
177+ " _Select a single answer_\n " ,
178+ " \n " ,
179+ " Hint: You can use `model.get_params()` to get the parameters of a scikit-learn\n " ,
180+ " estimator."
181+ ]
182+ },
183+ {
184+ "cell_type" : "code" ,
185+ "execution_count" : null ,
186+ "metadata" : {},
187+ "outputs" : [],
188+ "source" : [
189+ " # Write your code here."
190+ ]
191+ },
192+ {
193+ "cell_type" : "markdown" ,
194+ "metadata" : {},
195+ "source" : [
196+ " Set `n_neighbors=1` in the previous model and evaluate it using a 10-fold\n " ,
197+ " cross-validation. Use the balanced accuracy as a score. What can you say about\n " ,
198+ " this model? Compare the average of the train and test scores to support your\n " ,
199+ " answer.\n " ,
200+ " \n " ,
201+ " - a) The model underfits\n " ,
202+ " - b) The model generalizes\n " ,
203+ " - c) The model overfits\n " ,
204+ " \n " ,
205+ " _Select a single answer_\n " ,
206+ " \n " ,
207+ " Hint: compute the average test score and the average train score and compare\n " ,
208+ " them. Make sure to pass `return_train_score=True` to the `cross_validate`\n " ,
209+ " function to also compute the train score."
210+ ]
211+ },
212+ {
213+ "cell_type" : "code" ,
214+ "execution_count" : null ,
215+ "metadata" : {},
216+ "outputs" : [],
217+ "source" : [
218+ " # Write your code here."
219+ ]
220+ },
221+ {
222+ "cell_type" : "markdown" ,
223+ "metadata" : {},
224+ "source" : [
225+ " We now study the effect of the parameter `n_neighbors` on the train and test\n " ,
226+ " scores using a validation curve. You can use the following parameter range:"
227+ ]
228+ },
229+ {
230+ "cell_type" : "code" ,
231+ "execution_count" : null ,
232+ "metadata" : {},
233+ "outputs" : [],
234+ "source" : [
235+ " import numpy as np\n " ,
236+ " \n " ,
237+ " param_range = np.array([1, 2, 5, 10, 20, 50, 100, 200, 500])"
238+ ]
239+ },
240+ {
241+ "cell_type" : "markdown" ,
242+ "metadata" : {},
243+ "source" : [
244+ " Also, use a 5-fold cross-validation and compute the balanced accuracy score\n " ,
245+ " instead of the default accuracy score (check the `scoring` parameter). Finally,\n " ,
246+ " plot the average train and test scores for the different values of the\n " ,
247+ " hyperparameter. Remember that the name of the parameter can be found using\n " ,
248+ " `model.get_params()`.\n " ,
249+ " \n " ,
250+ " Select the most correct of the affirmations stated below:\n " ,
251+ " \n " ,
252+ " - a) The model underfits for a range of `n_neighbors` values between 1 and 10\n " ,
253+ " - b) The model underfits for a range of `n_neighbors` values between 10 and 100\n " ,
254+ " - c) The model underfits for a range of `n_neighbors` values between 100 and 500\n " ,
255+ " \n " ,
256+ " _Select a single answer_"
257+ ]
258+ },
259+ {
260+ "cell_type" : "code" ,
261+ "execution_count" : null ,
262+ "metadata" : {},
263+ "outputs" : [],
264+ "source" : [
265+ " # Write your code here."
266+ ]
267+ },
268+ {
269+ "cell_type" : "markdown" ,
270+ "metadata" : {},
271+ "source" : [
272+ " Select the most correct of the affirmations stated below:\n " ,
273+ " \n " ,
274+ " - a) The model overfits for a range of `n_neighbors` values between 1 and 10\n " ,
275+ " - b) The model overfits for a range of `n_neighbors` values between 10 and 100\n " ,
276+ " - c) The model overfits for a range of `n_neighbors` values between 100 and 500\n " ,
277+ " \n " ,
278+ " _Select a single answer_"
279+ ]
280+ },
281+ {
282+ "cell_type" : "code" ,
283+ "execution_count" : null ,
284+ "metadata" : {},
285+ "outputs" : [],
286+ "source" : [
287+ " # Select the most correct of the affirmations stated below:\n " ,
288+ " #\n " ,
289+ " # - a) The model best generalizes for a range of `n_neighbors` values between 1 and 10\n " ,
290+ " # - b) The model best generalizes for a range of `n_neighbors` values between 10 and 100\n " ,
291+ " # - c) The model best generalizes for a range of `n_neighbors` values between 100 and 500\n " ,
292+ " #\n " ,
293+ " # _Select a single answer_"
294+ ]
295+ }
296+ ],
297+ "metadata" : {
298+ "jupytext" : {
299+ "main_language" : "python"
300+ },
301+ "kernelspec" : {
302+ "display_name" : "Python 3" ,
303+ "name" : "python3"
304+ }
305+ },
306+ "nbformat" : 4 ,
307+ "nbformat_minor" : 5
308+ }
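
The notebook asks for `scoring="balanced_accuracy"` without saying what the metric measures. scikit-learn documents balanced accuracy as the average of the recall obtained on each class. Below is a minimal pure-Python sketch of that definition, not scikit-learn's implementation. The helper names `accuracy` and `balanced_accuracy` and the toy labels are made up for illustration and are unrelated to the quiz dataset.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of the recall obtained on each class."""
    hits = defaultdict(int)    # correctly predicted samples per true class
    counts = defaultdict(int)  # total number of samples per true class
    for t, p in zip(y_true, y_pred):
        counts[t] += 1
        hits[t] += int(t == p)
    recalls = [hits[label] / counts[label] for label in counts]
    return sum(recalls) / len(recalls)


# Made-up labels (NOT taken from blood_transfusion.csv):
# 8 samples of class "a" and 2 samples of class "b".
y_true = ["a"] * 8 + ["b"] * 2
# Hypothetical predictions: 6 of the 8 "a" samples and 1 of the 2 "b"
# samples are predicted correctly.
y_pred = ["a"] * 6 + ["b"] * 2 + ["b"] * 1 + ["a"] * 1

print(accuracy(y_true, y_pred))           # 7 correct out of 10 samples
print(balanced_accuracy(y_true, y_pred))  # mean of recalls 6/8 and 1/2
```

On these toy labels the two metrics differ because the minority class weighs as much as the majority class in the balanced score, which is why the metric is often preferred when class counts are imbalanced.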