Dimensionality Reduction
There are two main methods of dimensionality reduction:
1. Feature Selection
2. Feature Extraction
Feature Selection
To be clear, some supervised algorithms already have built-in feature selection, such
as Regularized Regression and Random Forests. As a stand-alone task, feature selection
can be unsupervised (e.g. Variance Thresholds) or supervised (e.g. Genetic Algorithms).
You can also combine multiple methods if needed.
Variance Thresholds
Variance thresholds remove features whose values don't change much from
observation to observation (i.e. their variance falls below a threshold). These
features provide little value. Because variance is dependent on scale, you should
always normalize your features first.
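For example, here's a minimal sketch using scikit-learn's VarianceThreshold; the tiny feature matrix and the 0.05 cutoff are made up purely for illustration:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix: the third column never changes.
X = np.array([[0.0, 10.0, 5.0],
              [1.0, 20.0, 5.0],
              [2.0, 30.0, 5.0],
              [3.0, 40.0, 5.0]])

# Normalize first so variances are comparable across features.
X_scaled = MinMaxScaler().fit_transform(X)

# Drop features whose (scaled) variance falls below the threshold.
selector = VarianceThreshold(threshold=0.05)
X_reduced = selector.fit_transform(X_scaled)

print(selector.get_support())  # mask of the features that were kept
print(X_reduced.shape)         # the constant column is gone
```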
Correlation Thresholds
Correlation thresholds remove features that are highly correlated with others
(i.e. their values change very similarly to another feature's). These features provide
redundant information.
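scikit-learn doesn't ship a correlation-threshold transformer, but a rough pandas sketch looks like this; the synthetic DataFrame and the 0.95 cutoff are arbitrary choices for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical data: height_cm and height_in carry the same information.
rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, size=200)
df = pd.DataFrame({
    "height_cm": height_cm,
    "height_in": height_cm / 2.54,          # perfectly correlated duplicate
    "weight_kg": rng.normal(70, 8, size=200),
})

# Absolute pairwise correlations, upper triangle only so each pair is counted once.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one feature from every pair whose correlation exceeds the cutoff.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)

print(to_drop)                         # ['height_in']
print(df_reduced.columns.tolist())
```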
Genetic Algorithms
Genetic algorithms (GA) are a broad class of algorithms that can be adapted to
different purposes. They are search algorithms that are inspired by evolutionary
biology and natural selection, combining mutation and cross-over to efficiently
traverse large solution spaces.
In machine learning, GAs have two main uses. The first is for optimization, such
as finding the best weights for a neural network.
The second is for supervised feature selection. In this use case, "genes" represent
individual features and the "organism" represents a candidate set of features.
Each organism in the "population" is graded on a fitness score such as model
performance on a hold-out set. The fittest organisms survive and reproduce,
repeating until the population converges on a solution some generations later.
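As a rough illustration of that loop, here's a minimal GA-style feature-selection sketch; the population size, mutation rate, and the choice of logistic regression as the fitness model are all arbitrary assumptions, not a prescribed recipe:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Toy dataset; only a few of the 20 features are actually informative.
X, y = make_classification(n_samples=400, n_features=20, n_informative=4,
                           random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

def fitness(mask):
    """Score a candidate feature set by accuracy on a hold-out set."""
    if not mask.any():
        return 0.0
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[:, mask], y_train)
    return model.score(X_val[:, mask], y_val)

# Each "organism" is a boolean mask over the features ("genes").
pop_size, n_generations, mutation_rate = 20, 15, 0.05
population = rng.random((pop_size, X.shape[1])) < 0.5

for generation in range(n_generations):
    scores = np.array([fitness(ind) for ind in population])
    # Selection: keep the fitter half of the population.
    survivors = population[np.argsort(scores)[::-1][: pop_size // 2]]
    children = []
    while len(children) < pop_size - len(survivors):
        # Cross-over: splice two random survivors at a random point.
        a, b = survivors[rng.integers(len(survivors), size=2)]
        point = rng.integers(1, X.shape[1])
        child = np.concatenate([a[:point], b[point:]])
        # Mutation: flip each gene with a small probability.
        flip = rng.random(X.shape[1]) < mutation_rate
        children.append(np.where(flip, ~child, child))
    population = np.vstack([survivors, children])

best = population[np.argmax([fitness(ind) for ind in population])]
print("selected features:", np.flatnonzero(best))
```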
Stepwise Search
Stepwise search is a supervised feature selection method that comes in two flavors:
forward and backward. For forward stepwise search, you start without any features.
Then, you'd train a 1-feature model using each of your candidate features and keep
the version with the best performance. You'd continue adding features, one at a time,
until your performance improvements stall.
Backward stepwise search is the same process, just reversed: start with all
features in your model and then remove one at a time until performance starts
to drop substantially.
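scikit-learn's SequentialFeatureSelector covers both directions; note that it selects a fixed number of features rather than stopping automatically when improvements stall, and the dataset and estimator below are just illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
estimator = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Forward search: start with zero features and add the best one at each step.
forward = SequentialFeatureSelector(estimator, n_features_to_select=5,
                                    direction="forward", cv=5)
forward.fit(X, y)
print("forward picks:", forward.get_support(indices=True))

# Backward search: start with all features and drop the least useful one each step.
backward = SequentialFeatureSelector(estimator, n_features_to_select=5,
                                     direction="backward", cv=5)
backward.fit(X, y)
print("backward picks:", backward.get_support(indices=True))
```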
Feature Extraction
As with feature selection, some algorithms already have built-in feature extraction.
The best example is Deep Learning, which extracts increasingly useful representations
of the raw input data through each hidden neural layer.
As a stand-alone task, feature extraction can be unsupervised (e.g. PCA) or supervised
(e.g. LDA).
Principal Component Analysis (PCA)
Principal component analysis (PCA) is an unsupervised method that creates new features
(principal components) as linear combinations of your original features, ordered by how
much of the data's variance they explain. You should always normalize your dataset
before performing PCA because the transformation is dependent on scale. If you don't,
the features that are on the largest scale will dominate your new principal components.
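A minimal sketch, assuming scikit-learn and a target of 95% explained variance (both the dataset and that 0.95 figure are illustrative):

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)

# Standardize so no single large-scale feature dominates the components.
X_scaled = StandardScaler().fit_transform(X)

# Keep the smallest number of components that explain 95% of the variance.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

print(X.shape, "->", X_pca.shape)
print(pca.explained_variance_ratio_)
```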
Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) also creates linear combinations of your original
features, but unlike PCA it maximizes the separability between known classes. Because
it needs class labels, LDA is a supervised method that can only be used with labeled data.
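For comparison, a minimal LDA sketch on the same kind of labeled data (the dataset and the choice of two components are illustrative):

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# LDA needs the labels y; with 3 classes it can produce at most 2 components.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)

print(X.shape, "->", X_lda.shape)
```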
So which is better: LDA or PCA? Well, results will vary from problem to
problem, and the same "No Free Lunch" theorem applies (more on that below).
Autoencoders
Autoencoders are neural networks that are trained to reconstruct their original
inputs. For example, image autoencoders are trained to reproduce the original
images instead of classifying them as "dog" or "cat".
So how is this helpful? Well, the key is to structure the hidden layer to have fewer
neurons than the input/output layers. Thus, that hidden layer will learn to
produce a smaller representation of the original image.
Because you use the input image as the target output, autoencoders are
considered unsupervised. They can be used directly (e.g. image compression) or
stacked in sequence (e.g. deep learning).
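A minimal Keras sketch of the idea; the layer sizes and the random stand-in "images" are placeholders, and a real image autoencoder would typically be convolutional:

```python
import numpy as np
from tensorflow import keras

# Placeholder "images": 1000 flattened 28x28 inputs with values in [0, 1].
X = np.random.rand(1000, 784).astype("float32")

# Bottleneck hidden layer with far fewer neurons than the 784 inputs/outputs.
autoencoder = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(32, activation="relu"),     # compressed representation
    keras.layers.Dense(784, activation="sigmoid"), # reconstruction
])

# The input itself is the target output, so no separate labels are needed.
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, batch_size=64, verbose=0)
```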
In machine learning, there’s something called the “No Free Lunch” theorem. In a
nutshell, it states that no one algorithm works best for every problem, and it’s especially
relevant for supervised learning (i.e. predictive modelling).
For example, you can’t say that neural networks are always better than decision trees or
vice-versa. There are many factors at play, such as the size and structure of your dataset.
As a result, you should try many different algorithms for your problem, while using
a hold-out “test set” of data to evaluate performance and select the winner.
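For instance, a sketch of that workflow with a hold-out test set; the candidate models and dataset are arbitrary choices for the example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

candidates = {
    "logistic regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}

# Train each candidate, then compare all of them on the same hold-out test set.
for name, model in candidates.items():
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 3))
```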
Of course, the algorithms you try must be appropriate for your problem, which is where
picking the right machine learning task comes in. As an analogy, if you need to clean
your house, you might use a vacuum, a broom, or a mop, but you wouldn't bust out
a shovel and start digging.