Many machine learning problems involve hundreds or even thousands of features. Having such a large number of features poses certain problems.
These problems are often referred to as the Curse of Dimensionality. Dimensionality reduction (or dimension reduction) is the process of reducing the number of random variables under consideration by obtaining a set of principal variables.
In other words, the goal is to take something that is very high dimensional and get it down to something that is easier to work with, without losing much of the information.
Why Dimensionality Reduction?
- Devices are increasingly connected and carry more sensors and measuring technologies, so the number of features we have to analyze keeps growing and the data becomes harder to interpret.
- These techniques reduce the amount of data we need to store, which can lower storage costs considerably.
- High-dimensional data is difficult to train on: it requires more computational power and time.
- Most datasets contain a large amount of repeated data, columns that hold a single value, or columns whose variance is so small that they carry little information for the model to learn from. Dimensionality reduction helps us filter out this unnecessary information.
- Human perception is one of the most important considerations. We do not have the same capabilities as a machine, so the data has to be adapted to something we can perceive. Dimensionality reduction makes it easier to plot the distribution of our data in two or three dimensions.
- Multicollinearity. Detecting redundant information is important so that it can be removed. The same quantity often appears as several variables in different units of measure (for example, m and cm). Such strongly correlated variables add nothing to model learning and hurt efficiency.
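The filtering described in the last bullets can be sketched in a few lines of NumPy. This is a minimal illustration, not a recommended pipeline: the column names, the synthetic data, and the 0.95 correlation threshold are all assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical dataset: one quantity in two units, one unrelated
# feature, and one column that holds a single value.
height_m = rng.normal(1.7, 0.1, size=n)
height_cm = height_m * 100.0
weight_kg = rng.normal(70.0, 10.0, size=n)
constant = np.full(n, 3.0)

X = np.column_stack([height_m, height_cm, weight_kg, constant])
names = ["height_m", "height_cm", "weight_kg", "constant"]

# 1) Drop columns whose variance is (numerically) zero.
keep = X.var(axis=0) > 1e-12
X = X[:, keep]
names = [name for name, k in zip(names, keep) if k]

# 2) For every pair of strongly correlated columns, drop the second one.
corr = np.corrcoef(X, rowvar=False)
threshold = 0.95  # illustrative value, tune for real data
drop = set()
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[1]):
        if i not in drop and j not in drop and abs(corr[i, j]) > threshold:
            drop.add(j)

kept_names = [name for idx, name in enumerate(names) if idx not in drop]
print(kept_names)
```

With this synthetic data the constant column and the duplicate height column are removed, leaving height_m and weight_kg.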
Real Dimension vs Apparent Dimension
- The real dimension of the data is generally not equal to the apparent dimension of the dataset.
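- Degrees of freedom and restrictions: the real dimension is the number of degrees of freedom that remain once the restrictions among the features are taken into account.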
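As a rough illustration of the gap between the two, the sketch below builds a dataset with 5 columns that is generated from only 2 underlying variables. The mixing matrix `A` is hypothetical, and the matrix rank only detects linear restrictions, so treat this as a simplified picture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two true degrees of freedom per sample.
latent = rng.normal(size=(100, 2))

# Hypothetical linear map from 2 latent variables to 5 observed features.
A = np.array([[1.0, 0.0, 2.0, -1.0, 0.5],
              [0.0, 1.0, -1.0, 3.0, 1.5]])
X = latent @ A  # shape (100, 5)

apparent_dim = X.shape[1]
# Rank of the centered data = number of linearly independent directions.
real_dim = np.linalg.matrix_rank(X - X.mean(axis=0))

print(apparent_dim, real_dim)
```

Here the dataset looks 5-dimensional, but three linear restrictions tie the columns together, so only 2 dimensions are actually used.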
Projection vs Manifold Learning
Projection: This technique projects every data point from the high-dimensional space onto a suitable lower-dimensional subspace in a way that approximately preserves the distances between the points.
For instance, in the figure below, the points in 3D are projected onto a 2D plane. This plane is a lower-dimensional (2D) subspace of the high-dimensional (3D) space, and the axes correspond to new features z1 and z2 (the coordinates of the projections on the plane).
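The 3D-to-2D case can be sketched numerically as follows. The plane, the noise level, and the data are assumptions chosen for the example; the columns of `Z` play the role of the new features z1 and z2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3D points that lie close to a plane through the origin.
plane_vectors = np.array([[1.0, 0.0, 0.5],
                          [0.0, 1.0, 0.5]])
X = rng.normal(size=(50, 2)) @ plane_vectors + 0.01 * rng.normal(size=(50, 3))

# Orthonormal basis of the plane (columns of Q), obtained with QR.
Q, _ = np.linalg.qr(plane_vectors.T)  # shape (3, 2)

Z = X @ Q          # coordinates (z1, z2) of the projections on the plane
X_proj = Z @ Q.T   # the projected points, expressed back in 3D

# Distances between points are approximately preserved.
d_high = np.linalg.norm(X[0] - X[1])
d_low = np.linalg.norm(Z[0] - Z[1])
print(d_high, d_low)
```

An orthogonal projection can only shrink distances, and because these points sit almost on the plane, the shrinkage is tiny.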
Manifold Learning: Manifold learning is an approach to non-linear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many datasets is only artificially high.
Linear vs Nonlinear
Linear subspaces may represent some datasets inefficiently. If the data lies on a manifold, we should capture its structure by unfolding it.
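A toy example of why unfolding matters: points on a spiral have 2 apparent dimensions but only 1 intrinsic coordinate. The sketch below compares a linear 1D projection with an "unrolled" coordinate. Note the assumption: the unrolling uses the known order of the points along the curve, which a real manifold learning algorithm would have to infer from the data.

```python
import numpy as np

# Points on a spiral, ordered by the intrinsic coordinate t.
t = np.linspace(0.5, 4 * np.pi, 400)
X = np.column_stack([t * np.cos(t), t * np.sin(t)])

# Linear reduction to 1D: project onto the direction of largest variance.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
z_linear = Xc @ Vt[0]

# Unfolding: position along the curve, measured as cumulative arc length.
steps = np.linalg.norm(np.diff(X, axis=0), axis=1)
z_unrolled = np.concatenate([[0.0], np.cumsum(steps)])

# Does the 1D coordinate keep the ordering of the points along the spiral?
diff_linear = np.diff(z_linear)
linear_is_monotonic = bool(np.all(diff_linear > 0) or np.all(diff_linear < 0))
unrolled_is_monotonic = bool(np.all(np.diff(z_unrolled) > 0))
print(linear_is_monotonic, unrolled_is_monotonic)
```

The linear projection folds distant parts of the spiral on top of each other, while the unrolled coordinate keeps neighbors on the curve next to each other.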
PCA – Principal Component Analysis
The idea behind PCA is very simple:
- Identify the hyperplane that lies closest to the data.
- Project the data onto that hyperplane.
Variance
Variance visualization
Variance maximization
PCA is a variance maximizer. It projects the original data onto the directions where variance is maximum.
In this technique, the variables are transformed into a new set of variables that are linear combinations of the original variables. This new set of variables is known as the principal components.
They are obtained in such a way that the first principal component accounts for as much of the variation in the original data as possible, and each succeeding component has the highest possible variance while being orthogonal to the preceding ones.
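To make the variance-maximization claim concrete, the sketch below computes the variance captured by each component and checks that no randomly chosen direction captures more variance than the first one. The data and the number of random directions are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated 2D data.
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [1.0, 1.0]])
Xc = X - X.mean(axis=0)

_, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Variance of the data along each principal component.
explained_variance = S ** 2 / (len(Xc) - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Variance of the data along 1000 random unit directions.
angles = rng.uniform(0.0, np.pi, size=1000)
directions = np.column_stack([np.cos(angles), np.sin(angles)])
random_variances = (Xc @ directions.T).var(axis=0, ddof=1)

print(explained_variance_ratio)
print(random_variances.max() <= explained_variance[0] + 1e-9)
```

The ratios sum to 1, and the first entry shows the share of the total variance that is kept when the data is reduced to a single component.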
Principal Component
The axis that explains the maximum amount of variance in the training set is called the first principal component.
The axis orthogonal to it that explains the largest amount of the remaining variance is called the second principal component.
Thus in 2D, there will be 2 principal components. However, for higher dimensions, PCA would find a third component orthogonal to the other two components and so on.
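A short check of the orthogonality property in 3D, using synthetic data. It also compares the SVD result with the eigenvectors of the covariance matrix, which is another common way to obtain the same axes (up to sign); the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3)) * np.array([4.0, 2.0, 1.0])
Xc = X - X.mean(axis=0)

# Principal components as the rows of Vt, ordered by decreasing variance.
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1, pc2, pc3 = Vt

# All pairwise dot products: identity matrix means mutually orthogonal unit axes.
gram = Vt @ Vt.T

# Same axes from the covariance matrix (eigh returns ascending eigenvalues).
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
eigvals = eigvals[::-1]
eigvecs = eigvecs[:, ::-1].T

# Each SVD axis matches the corresponding eigenvector up to sign.
alignment = np.abs(np.sum(Vt * eigvecs, axis=1))
print(np.round(gram, 6))
print(alignment)
```

Each additional dimension in the data simply adds one more axis that is orthogonal to all the previous ones.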