
Curse of Dimensionality and its Reduction

The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data
in high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-
dimensional settings. The expression was coined by Richard E. Bellman when considering problems
in dynamic optimization.

Cursed phenomena occur in domains such as numerical analysis, sampling, combinatorics, machine
learning, data mining and databases. The common theme of these problems is that when the
dimensionality increases, the volume of the space increases so fast that the available data become sparse.
This sparsity is problematic for any method that requires statistical significance. In order to obtain a
statistically sound and reliable result, the amount of data needed to support the result often grows
exponentially with the dimensionality. Also, organizing and searching data often relies on detecting areas
where objects form groups with similar properties; in high dimensional data, however, all objects appear
to be sparse and dissimilar in many ways, which prevents common data organization strategies from being
efficient.

In machine learning, “dimensionality” simply refers to the number of features (i.e. input variables) in your
dataset. When the number of features is very large relative to the number of observations in your dataset,
certain algorithms struggle to train effective models. This is especially relevant for clustering algorithms
that rely on distance calculations.

Figure 1. As the dimensionality increases, the classifier’s performance increases until the optimal number of features is reached.
Further increasing the dimensionality without increasing the number of training samples results in a decrease in classifier performance.

A Quora user has provided an excellent analogy for the Curse of Dimensionality, which we'll borrow here:

“Let's say you have a straight line 100 yards long and you dropped a penny somewhere on it. It wouldn't be too
hard to find. You walk along the line and it takes two minutes.
Now let's say you have a square 100 yards on each side and you dropped a penny somewhere on it. It would be
pretty hard, like searching across two football fields stuck together. It could take days.
Now a cube 100 yards across. That's like searching a 30-story building the size of a football stadium. Ugh.
The difficulty of searching through the space gets a lot harder as you have more dimensions.”
If we keep adding features, the dimensionality of the feature space grows and the training data becomes
sparser and sparser. Due to this sparsity, it becomes much easier to find a separable hyperplane, because
the likelihood that a training sample lies on the wrong side of the best hyperplane becomes infinitely small
as the number of features becomes infinitely large. However, if we project the high-dimensional
classification result back to a lower-dimensional space, a serious problem with this approach becomes
evident.
Figure 2. Using too many features results in overfitting. The classifier starts learning exceptions that are specific to the training
data and do not generalize well when new data is encountered.

As a result, the classifier learns the appearance of specific instances and exceptions in our training
dataset. Because of this, the resulting classifier would fail on real-world data, which consists of an
effectively infinite amount of unseen data that often does not adhere to these exceptions. This concept is
called overfitting and is a direct result of the curse of dimensionality.

Although the simple classifier with linear decision boundaries seems to perform worse than the non-linear
classifier, this simple classifier generalizes much better to unseen data because it did not learn specific
exceptions that were only in our training data by coincidence. In other words, by using fewer features, the
curse of dimensionality could be avoided such that the classifier did not overfit the training data.

Figure 3. Although the training data is not classified perfectly, this classifier achieves better results on unseen data.
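
To make this effect concrete, the short sketch below (a hypothetical experiment, not from the original article) pads a real dataset with purely random noise features while keeping the number of training samples fixed; the gap between training and test accuracy tends to widen as the dimensionality grows. The dataset, model, and feature counts are illustrative assumptions.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)        # 569 samples, 30 real features
rng = np.random.default_rng(0)

for n_noise in (0, 100, 1000):
    # Pad the data with purely random, uninformative features.
    noise = rng.normal(size=(X.shape[0], n_noise))
    X_wide = np.hstack([X, noise])
    X_tr, X_te, y_tr, y_te = train_test_split(X_wide, y, random_state=0)
    scaler = StandardScaler().fit(X_tr)
    clf = LogisticRegression(max_iter=2000).fit(scaler.transform(X_tr), y_tr)
    print(f"{30 + n_noise:4d} features | "
          f"train acc {clf.score(scaler.transform(X_tr), y_tr):.2f} | "
          f"test acc {clf.score(scaler.transform(X_te), y_te):.2f}")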

Dimensionality reduction is an important technique for overcoming the curse of dimensionality in data
science and machine learning. As the number of predictors (or dimensions, or features) in the dataset
increases, it becomes computationally more expensive (i.e. increased storage space, longer computation
time) and exponentially more difficult to produce accurate predictions in classification or regression
models.

Here are some widely used methods of dimensionality reduction explained briefly:

1. Feature Selection
2. Feature Extraction

1. FEATURE SELECTION
Feature selection is for filtering irrelevant or redundant features from your dataset. The key difference
between feature selection and extraction is that feature selection keeps a subset of the original features
while feature extraction creates brand new ones.

As a stand-alone task, feature selection can be unsupervised (e.g. Variance Thresholds) or supervised (e.g.
Genetic Algorithms). You can also combine multiple methods if needed.
1.1 Variance Thresholds
Variance thresholds remove features whose values don't change much from observation to observation
(i.e. their variance falls below a threshold). These features provide little value. Because variance is
dependent on scale, you should always normalize your features first.

Strengths: Applying variance thresholds is based on solid intuition: features that don't change much also
don't add much information. This is an easy and relatively safe way to reduce dimensionality at the start
of your modeling process.

Weaknesses: If your problem does require dimensionality reduction, applying variance thresholds is
rarely sufficient. Furthermore, you must manually set or tune a variance threshold, which could be
tricky. We recommend starting with a conservative (i.e. lower) threshold.
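
As a minimal sketch (assuming a numeric feature matrix and scikit-learn; the 0.01 cutoff and the toy data are illustrative, conservative choices rather than recommendations), a variance threshold can be applied like this:

import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

X = np.random.rand(100, 20)
X[:, 0] = 1.0                                  # a constant feature that carries no information

X_scaled = MinMaxScaler().fit_transform(X)     # normalize first: variance depends on scale
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X_scaled)
print("kept features:", selector.get_support(indices=True))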

1.2 Correlation Thresholds


Correlation thresholds remove features that are highly correlated with others (i.e. their values change very
similarly to another feature's). These features provide redundant information.

Strengths: Applying correlation thresholds is also based on solid intuition: similar features provide
redundant information. Some algorithms are not robust to correlated features, so removing them can
boost performance.

Weaknesses: Again, you must manually set or tune a correlation threshold, which can be tricky to do.
Plus, if you set your threshold too low, you risk dropping useful information. Whenever possible, we
prefer algorithms with built-in feature selection over correlation thresholds. Even for algorithms
without built-in feature selection, Principal Component Analysis (PCA) is often a better alternative.
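
A minimal sketch of the idea with pandas, assuming a DataFrame of numeric features; the 0.95 cutoff and the toy data are illustrative:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
df["a_copy"] = df["a"] * 2 + rng.normal(scale=0.01, size=200)   # nearly a duplicate of "a"

corr = df.corr().abs()
# Keep only the upper triangle so each pair of features is checked once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)
print("dropped:", to_drop)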

1.3 Genetic Algorithms (GA)


Genetic algorithms (GA) are a broad class of algorithms that can be
adapted to different purposes. They are search algorithms inspired by
evolutionary biology and natural selection, combining mutation and
cross-over to efficiently traverse large solution spaces.

In machine learning, GA's have two main uses:

The first is for optimization, such as finding the best weights for a
neural network.

The second is for supervised feature selection. In this use case, "genes" represent individual features and
the "organism" represents a candidate set of features. Each organism in the "population" is graded on a
fitness score such as model performance on a hold-out set. The fittest organisms survive and reproduce,
repeating until the population converges on a solution some generations later.
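
A compact sketch of this scheme, assuming scikit-learn and using the breast-cancer dataset with a k-nearest-neighbors model as stand-ins; population size, mutation rate, and the number of generations are illustrative choices:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
n_features = X.shape[1]
rng = np.random.default_rng(0)

def fitness(mask):
    # Grade an "organism" (a binary feature mask) by cross-validated accuracy.
    if mask.sum() == 0:                                # an empty feature set cannot be scored
        return 0.0
    return cross_val_score(KNeighborsClassifier(), X[:, mask.astype(bool)], y, cv=3).mean()

# Initial population of random feature subsets ("genes" are individual features).
pop = rng.integers(0, 2, size=(20, n_features))

for generation in range(15):
    scores = np.array([fitness(ind) for ind in pop])
    survivors = pop[np.argsort(scores)[-10:]]          # selection: fittest half survives
    children = []
    while len(children) < 10:
        p1, p2 = survivors[rng.integers(10)], survivors[rng.integers(10)]
        cut = rng.integers(1, n_features)              # cross-over at a random cut point
        child = np.concatenate([p1[:cut], p2[cut:]])
        flip = rng.random(n_features) < 0.05           # mutation: flip genes with low probability
        child[flip] = 1 - child[flip]
        children.append(child)
    pop = np.vstack([survivors, children])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("selected features:", np.flatnonzero(best))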

Strengths: Genetic algorithms can efficiently select features from very high dimensional datasets, where
exhaustive search is unfeasible. When you need to preprocess data for an algorithm that doesn't have
built-in feature selection (e.g. nearest neighbors) and when you must preserve the original features (i.e.
no PCA allowed), GA's are likely your best bet. These situations can arise in business/client settings that
require a transparent and interpretable solution.
Weaknesses: GA's add a higher level of complexity to your implementation, and they aren't worth the
hassle in most cases. If possible, it's faster and simpler to use PCA or to directly use an algorithm with
built-in feature selection.

2. FEATURE EXTRACTION
Feature extraction is for creating a new, smaller set of features that still captures most of the useful
information. Again, feature selection keeps a subset of the original features while feature extraction
creates new ones.

As with feature selection, some algorithms already have built-in feature extraction. The best example is
deep learning, which extracts increasingly useful representations of the raw input data through each
hidden neural layer. As a stand-alone task, feature extraction can be unsupervised (e.g. PCA) or supervised
(e.g. LDA).

2.1 Principal Component Analysis (PCA)


Principal component analysis (PCA) is an unsupervised algorithm that creates linear combinations of the
original features. The new features are orthogonal, which means that they are uncorrelated. Furthermore,
they are ranked in order of their "explained variance." The first principal component (PC1) explains the
most variance in your dataset, PC2 explains the second-most variance, and so on.

Therefore, you can reduce dimensionality by limiting the number of principal components to keep based
on cumulative explained variance. For example, you might decide to keep only as many principal
components as needed to reach a cumulative explained variance of 90%.

You should always normalize your dataset before performing PCA because the transformation is
dependent on scale. If you don't, the features that are on the largest scale would dominate your new
principal components.
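
A minimal sketch with scikit-learn (the wine dataset is an illustrative stand-in; passing a float as n_components asks PCA to keep enough components for that fraction of cumulative explained variance):

from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)    # normalize first: PCA is scale-dependent

pca = PCA(n_components=0.90)                    # keep components up to 90% cumulative explained variance
X_reduced = pca.fit_transform(X_scaled)
print(X.shape[1], "->", X_reduced.shape[1], "components")
print("cumulative explained variance:", round(pca.explained_variance_ratio_.sum(), 3))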

Strengths: PCA is a versatile technique that works well in practice. It's fast and simple to implement, which
means you can easily test algorithms with and without PCA to compare performance. In addition, PCA
offers several variations and extensions (e.g. kernel PCA, sparse PCA) to tackle specific roadblocks.

Weaknesses: The new principal components are not interpretable, which may be a deal-breaker in some
settings. In addition, you must still manually set or tune a threshold for cumulative explained variance.

2.2 Linear Discriminant Analysis (LDA)


Linear discriminant analysis (LDA) - not to be confused with latent Dirichlet allocation - also creates linear
combinations of your original features. However, unlike PCA, LDA doesn't maximize explained variance.
Instead, it maximizes the separability between classes.

Therefore, LDA is a supervised method that can only be used with labeled data. The LDA transformation
is also dependent on scale, so you should normalize your dataset first.
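
A minimal sketch, again with scikit-learn and the iris dataset as an illustrative stand-in (LDA can keep at most one fewer component than the number of classes, hence two components here):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)    # normalize first: LDA is scale-dependent

lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X_scaled, y)      # labels are required: LDA is supervised
print(X.shape, "->", X_reduced.shape)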

Strengths: LDA is supervised, which can (but doesn't always) improve the predictive performance of the
extracted features. Furthermore, LDA offers variations (e.g. quadratic LDA) to tackle specific roadblocks.
Weaknesses: As with PCA, the new features are not easily interpretable, and you must still manually set
or tune the number of components to keep. LDA also requires labeled data, which makes it more
situational.

2.3 Autoencoders
Autoencoders are neural networks that are trained to reconstruct their original inputs. For example, an
image autoencoder is trained to reproduce the original image rather than to classify it as a dog or a cat.
The key is to structure the hidden layer to have fewer neurons than the
input/output layers. Thus, that hidden layer will learn to produce a smaller
representation of the original image.
Because you use the input image as the target output, autoencoders are
considered unsupervised. They can be used directly (e.g. image compression) or stacked in sequence (e.g.
deep learning).
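
A minimal sketch of such an undercomplete autoencoder in Keras, assuming flattened 28x28 grayscale images scaled to [0, 1]; the 32-unit bottleneck, the random stand-in data, and the training settings are illustrative:

import numpy as np
from tensorflow import keras

x_train = np.random.rand(1000, 784).astype("float32")    # stand-in for real flattened image data

autoencoder = keras.Sequential([
    keras.layers.Input(shape=(784,)),
    keras.layers.Dense(32, activation="relu"),            # bottleneck: the compressed representation
    keras.layers.Dense(784, activation="sigmoid"),        # reconstruction of the input
])
autoencoder.compile(optimizer="adam", loss="mse")

# The input is also the target, which is why autoencoders count as unsupervised.
autoencoder.fit(x_train, x_train, epochs=5, batch_size=64, verbose=0)
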
Strengths: Autoencoders are neural networks, which means they perform well for certain types of data,
such as image and audio data.
Weaknesses: Autoencoders are neural networks, which means they require more data to train. They are
not used as general-purpose dimensionality reduction algorithms.

BIBLIOGRAPHY:
1. (n.d.). Dimensionality Reduction Algorithms: Strengths and Weaknesses. Retrieved January 6,
2019, from https://elitedatascience.com/dimensionality-reduction-algorithms
2. Spruyt, V. (n.d.). The Curse of Dimensionality in Classification. Retrieved January 6, 2019, from
http://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/
3. (n.d.). Curse of dimensionality - Wikipedia. Retrieved January 24, 2019, from
https://en.wikipedia.org/wiki/Curse_of_dimensionality
4. Gleeson, P. (n.d.). The Curse of Dimensionality – freeCodeCamp.org. Retrieved from
https://medium.freecodecamp.org/the-curse-of-dimensionality-how-we-can-save-big-data-
from-itself-d9fa0f872335

ADDITIONAL RESOURCES:
1. Shetty, B. (n.d.). Curse of Dimensionality – Towards Data Science. Retrieved from
https://towardsdatascience.com/curse-of-dimensionality-2092410f3d27
2. WIN, K. (n.d.). How to Overcome the Curse of Dimensionality. Byte Academy. Retrieved from
https://byteacademy.co/blog/overcome-dimensionailty-machinelearning
3. Berge, A. (n.d.). INF 3300 Lecture 11 handout, University of Oslo. Retrieved from
https://www.uio.no/studier/emner/matnat/ifi/nedlagte-
emner/INF3300/h05/undervisningsmateriale/handout-inf3300-2005-10-6pp.pdf
4. Pavlenko, T. (2003). On feature selection, curse-of-dimensionality and error probability in
discriminant analysis. Journal of Statistical Planning and Inference, 115, 565–584. doi:10.1016/S0378-3758(02)00166-0
5. (n.d.). What is the curse of dimensionality? Quora. Retrieved January 24, 2019, from
https://www.quora.com/What-is-the-curse-of-dimensionality
