Dimensionality reduction techniques
[draft version]
[I need to learn how to be more clear and concise when explaining]
When working with large, high-dimensional datasets (that is, datasets with many features), one frequently encounters the Curse of Dimensionality: as the number of dimensions grows, the data points become increasingly sparse. This is a problem because when there are too few data points relative to the number of dimensions, models tend to overfit the data.
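To get a concrete sense of this sparsity, here is a minimal sketch (assuming NumPy; not part of the original draft) that draws a fixed number of random points in the unit cube and measures how the average distance to the nearest neighbour grows as the number of dimensions increases:

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 200  # keep the sample size fixed while the dimension grows

for d in [2, 10, 100, 1000]:
    # n_points random points in the d-dimensional unit cube
    X = rng.random((n_points, d))

    # squared pairwise distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    np.fill_diagonal(d2, np.inf)  # ignore each point's distance to itself

    mean_nn = np.sqrt(np.maximum(d2, 0).min(axis=1)).mean()
    print(f"d = {d:4d}   mean nearest-neighbour distance = {mean_nn:.2f}")
```

With the same number of points, the nearest neighbour drifts farther and farther away as dimensions are added, which is exactly the sparsity the Curse of Dimensionality refers to.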
The solution is to reduce the number of dimensions in our data. There are many different methods for doing this. First, we will discuss Principal Component Analysis, more commonly known as PCA:
From the original set of features, PCA extracts a smaller set of features called the Principal Components of the data. Each principal component is a weighted combination of the original features. The PCs aim to capture as much of the variance in the data as possible while remaining orthogonal to each other. To get a sense of what this all means, watch the following animation (taken from StackExchange):
In this example, we have a dataset with two features, and suppose we want to reduce the data to just one feature. We need to find the combination of the two features that maximizes the variance of the data. Each combination can be represented as a vector, and the gif cycles through different combinations by rotating that vector. The combination with the most variance is also the one that minimizes the mean squared error between the line and the data points; it occurs when the rotating line aligns with the two short lines at the left and right fringes of the point cloud.
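The animation itself is not reproduced here, but the equivalence it illustrates is easy to check numerically. The sketch below (an illustration added for this draft, assuming NumPy) sweeps a unit vector over a grid of angles and, for each direction, records both the variance of the projected data and the mean squared distance from the points to the line; the angle that maximizes the former also minimizes the latter, and both match the direction of the leading eigenvector of the covariance matrix, i.e. the first principal component.

```python
import numpy as np

rng = np.random.default_rng(1)

# Correlated 2-D data, centred so the candidate lines pass through the origin
X = rng.multivariate_normal(mean=[0, 0], cov=[[3, 2], [2, 2]], size=500)
X = X - X.mean(axis=0)

angles = np.linspace(0.0, np.pi, 1800, endpoint=False)
variances, mses = [], []
for theta in angles:
    v = np.array([np.cos(theta), np.sin(theta)])  # unit vector defining the line
    proj = X @ v                                  # projection of each point onto the line
    variances.append(proj.var())
    # squared distance from a point to the line = ||x||^2 - (projection)^2
    mses.append(((X ** 2).sum(axis=1) - proj ** 2).mean())

print(f"angle of maximum variance: {np.degrees(angles[np.argmax(variances)]):6.1f} deg")
print(f"angle of minimum MSE:      {np.degrees(angles[np.argmin(mses)]):6.1f} deg")

# Both should coincide with the direction of the leading eigenvector
# of the covariance matrix -- the first principal component.
eigvals, eigvecs = np.linalg.eigh(np.cov(X.T))
pc1 = eigvecs[:, np.argmax(eigvals)]
pc1_angle = np.degrees(np.arctan2(pc1[1], pc1[0])) % 180
print(f"first principal component:  {pc1_angle:6.1f} deg")
```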
Now, let's write up an example in a Jupyter Notebook:
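The original draft ends before the notebook code, so what follows is a minimal sketch of what such a cell might contain, assuming scikit-learn and its bundled Iris dataset (the draft does not say which dataset it used):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load a small 4-feature dataset and standardize it so that no single
# feature dominates the variance purely because of its scale
X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Reduce the 4 original features to 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("original shape:", X.shape)      # (150, 4)
print("reduced shape: ", X_pca.shape)  # (150, 2)
print("variance explained by each PC:", pca.explained_variance_ratio_)
print("PC weights (one row per component):")
print(pca.components_)
```

Each row of `components_` holds the weights that combine the original features into one principal component, and `explained_variance_ratio_` shows how much of the total variance each component captures.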
