Dimensionality reduction
Used to:
- accelerate the learning process
- visualize the data (in <= 3 dimensions)
- avoid overfitting
Methods (pros/cons)
SVD:
Advantages:
It’s very efficient (via the Lanczos algorithm or similar methods it can be applied to really big matrices)
The basis is hierarchical, ordered by relevance
It tends to perform quite well for most data sets
Disadvantages:
If the data is strongly non-linear it may not work so well
Results are not always the best for visualization
The resulting components are difficult to interpret
Strongly focused on variance; since variance doesn’t always track predictive power, it can discard useful information
Conclusion: The de facto method for dimensionality reduction on generic datasets (a minimal sketch follows below).
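A minimal sketch of the SVD as a reducer, assuming scikit-learn is available; the data here is random and purely illustrative:

```python
# Minimal sketch: truncated SVD for dimensionality reduction (scikit-learn assumed).
import numpy as np
from sklearn.decomposition import TruncatedSVD

X = np.random.rand(1000, 200)       # toy data: 1000 samples, 200 features

svd = TruncatedSVD(n_components=20, random_state=0)
X_reduced = svd.fit_transform(X)    # shape (1000, 20)

# The basis is hierarchical: components come out ordered by explained variance.
print(svd.explained_variance_ratio_)
```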
PCA:
Advantages:
Same as SVD
Disadvantages:
Same as the SVD, plus it is not as efficient to compute
Conclusion: It’s essentially the SVD applied to mean-centered data, but less efficient to compute. Never use it (see the sketch below).
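Since the claim is that PCA and the SVD are “the same”, here is a minimal sketch of the equivalence (scikit-learn assumed, toy data): PCA projections match the truncated SVD of the mean-centered data, up to the sign of each component.

```python
# Minimal sketch: PCA is just the SVD of the mean-centered data (scikit-learn assumed).
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

X = np.random.rand(500, 50)                      # toy data

pca_proj = PCA(n_components=5, svd_solver="full").fit_transform(X)
svd_proj = TruncatedSVD(n_components=5, algorithm="arpack").fit_transform(X - X.mean(axis=0))

# Projections agree up to the sign of each component; this typically prints True.
print(np.allclose(np.abs(pca_proj), np.abs(svd_proj), atol=1e-6))
```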
t-SNE, ISOMAP, Laplacian Eigenmaps, Hessian Eigenmaps
Advantages:
Can work well when the data is strongly non-linear
Can work very well for visualization
Disadvantages:
Can be inefficient for large data
Certainly not a good idea unless the data is strongly non-linear
Sometimes they only work well for visualization, not for general dimensionality reduction
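A minimal t-SNE sketch for visualization, assuming scikit-learn; the digits dataset and the parameters are illustrative, not tuned:

```python
# Minimal sketch: t-SNE down to 2 dimensions for plotting (scikit-learn assumed).
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)              # 1797 samples, 64 features

X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)                                # (1797, 2): scatter-plot it, colored by y
```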
NMF (Nonnegative Matrix Factorization)
Advantages:
Results are easier to interpret than those of the SVD
Provides an additive basis to represent the data (sometimes this is good)
Disadvantages:
Can overfit; frequently millions of solutions are possible, and it’s not clear which one is right
There’s no hierarchy in the basis (sometimes this is bad)
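A minimal NMF sketch, assuming scikit-learn; the nonnegative toy matrix and the number of components are illustrative:

```python
# Minimal sketch: NMF on a nonnegative matrix (scikit-learn assumed).
import numpy as np
from sklearn.decomposition import NMF

X = np.random.rand(100, 300)                     # toy nonnegative data

nmf = NMF(n_components=10, init="random", random_state=0, max_iter=500)
W = nmf.fit_transform(X)                         # (100, 10): additive weights per sample
H = nmf.components_                              # (10, 300): additive basis vectors
# X is approximated by W @ H; a different init/seed can give a different factorization.
```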
Feature Hashing / Hash Kernels / The Hashing Trick
Advantages:
Preserves (approximately) the inner product between vectors in the original space, so distance and similarity can be preserved
Works great for sparse data; it can create sparse representations
Extremely fast and simple
Can filter some noise
Disadvantages:
Limited to the information already in the original features (it doesn’t learn anything new)
Not suitable for data visualization
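A minimal sketch of the hashing trick, assuming scikit-learn’s FeatureHasher; the feature names are made up:

```python
# Minimal sketch: the hashing trick on dict-style sparse features (scikit-learn assumed).
from sklearn.feature_extraction import FeatureHasher

docs = [
    {"word_cat": 2, "word_dog": 1},
    {"word_dog": 3, "word_fish": 1},
]

hasher = FeatureHasher(n_features=1024, input_type="dict")
X = hasher.transform(docs)                       # scipy sparse matrix, shape (2, 1024)
print(X.shape, X.nnz)                            # stays sparse: only a few non-zeros per row
```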
K-Means Based Methods for Dimensionality Reduction
Advantages:
Quite efficient
Can work well with non-linear data
The learned basis is useful to represent the data (compression)
In some cases it can work as well as deep learning methods
Disadvantages:
Not very popular
Can create different representations based on different initializations
Might need a little tuning to get it working
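A minimal sketch of one K-Means-based approach (distances to the learned centroids as the new features), assuming scikit-learn; k and the toy data are illustrative:

```python
# Minimal sketch: distances to K-Means centroids as a reduced representation (scikit-learn assumed).
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(1000, 100)                    # toy data

km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X)
X_reduced = km.transform(X)                      # (1000, 20): distance to each centroid
# Different initializations (different seeds) can yield different representations.
```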
Autoencoders and Deep Autoencoders
Advantages:
Can learn features at different levels of abstraction
Probably the state of the art for representing data at different levels
Can be trained to denoise data or to generate data
Disadvantages:
Can overfit big time
Very nice in theory but not too many practical applications
Can be inefficient for massive data
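A minimal autoencoder sketch in PyTorch (assumed available); the layer sizes, toy data, and training loop are illustrative, not a tuned recipe:

```python
# Minimal sketch: a small fully connected autoencoder (PyTorch assumed).
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_inputs=784, n_code=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_inputs, 128), nn.ReLU(),
            nn.Linear(128, n_code),              # the low-dimensional code
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_code, 128), nn.ReLU(),
            nn.Linear(128, n_inputs),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

X = torch.rand(256, 784)                         # toy batch standing in for real data

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(X), X)                  # reconstruction loss
    loss.backward()
    optimizer.step()

codes = model.encoder(X)                         # (256, 32): the reduced representation
```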
Which method to choose
If your goal is data visualization, then t-SNE is quite standard; if t-SNE doesn’t work well, then try ISOMAP, Laplacian Eigenmaps, etc.
If you are dealing with generic data and you are not sure, the SVD is your Swiss Army knife.
There are many NMF algorithms; chances are one of them can work very well for your data.
If you have sparse data in a very high-dimensional space, then feature hashing is probably your celestial solution.
If you are working with images, sound, speech, or music, then some variation of an autoencoder is probably your best solution.
In all of the above cases the K-Means-based methods can also work, so they can’t be discarded.
PCA is the same as the SVD, so nobody should really use it.