Dimensionality reduction is a method of reducing the number of variables in a training dataset used to develop machine learning models. This article explains the core principles of dimensionality reduction and its key techniques with examples.
Dimensionality reduction refers to the process of reducing the number of variables in a training dataset used to develop machine learning models. It keeps the dimensionality of the data in check by projecting high-dimensional data onto a lower-dimensional space that encapsulates the 'core essence' of the data.
Machine learning requires substantial resources and computation to analyze data with millions of features, and it often involves considerable manual effort. Dimensionality reduction makes this complex task much easier by converting a high-dimensional dataset into a lower-dimensional one without losing the key properties of the original data. It is one of the data pre-processing steps undertaken before the training cycle of a machine learning model begins.
Let's say you train a model to forecast the next day's weather from current climatic variables such as the amount of sunlight, rainfall, temperature, humidity, and several other environmental factors. Analyzing all these variables is a complex and challenging task. To accomplish it with a limited set of features, you can target specific features that show a strong correlation and can be combined into one.
For instance, we can combine the humidity and temperature variables into a single derived feature, as they tend to be strongly correlated. Through such combinations, dimensionality reduction compresses complex data into a simpler form and ensures that the end objective is achieved without losing the crux of the data. Today, businesses and firms such as DataToBiz are leveraging data analytics solutions such as data visualization, data mining, and predictive modeling that employ dimensionality reduction to maximize their business ROIs.
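As a minimal sketch of this combining idea (the temperature and humidity readings below are made-up illustrative values, not from any real weather station), the two correlated variables are standardized and averaged into a single feature:

```python
import numpy as np

# Hypothetical readings: temperature and humidity tend to move together.
temperature = np.array([30.0, 32.0, 25.0, 28.0, 35.0])  # degrees Celsius
humidity = np.array([70.0, 75.0, 55.0, 62.0, 82.0])     # percent

def zscore(x):
    """Standardize a variable to zero mean and unit variance."""
    return (x - x.mean()) / x.std()

# Average the standardized variables into one combined feature.
combined = (zscore(temperature) + zscore(humidity)) / 2
print(combined.shape)  # one feature instead of two: (5,)
```

Standardizing first puts both variables on the same scale, so neither dominates the combined feature purely because of its units.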
With the growth of online and social media platforms, the number of internet users has risen exponentially. According to a September 2022 report by Statista Research Department, there are over five billion internet users worldwide. Such a solid user base generates a tremendous amount of data daily.
A recent report by Finances Online predicts that by the end of 2022, we will produce and consume around 94 zettabytes of data. This may include data collected by Facebook (likes, shares, and comments), Amazon (customers' buying patterns, clicks, and views), smartphone apps (users' personal information), IoT devices (daily activity and health data of users), and even casinos (which track every move of the customer).
Such a variety of data is fed to machine learning and deep learning models to learn more about the trends and fluctuations in data patterns. As this data has several features and is generated in vast amounts, it often gives rise to the 'curse of dimensionality.'
Additionally, large datasets are accompanied by an inevitable sparsity factor. Sparsity here denotes 'no-value' features that can be ignored while training a model. Such features also occur redundantly in the dataset and pose issues when clustering similar features.
To address this curse of dimensionality, practitioners turn to dimensionality reduction. Its advantages include:
- Eliminating redundant data leaves less room for spurious assumptions, raising the overall accuracy of the machine learning model.
- It gives significant control over the use of computational resources, saving both time and budget.
- Some machine learning and deep learning techniques perform poorly on high-dimensional data; reducing the number of dimensions takes care of this.
- Non-sparse (clean) data is crucial for deriving statistical results, as it ensures more accurate and easier clustering than sparse data.
Dimensionality reduction techniques can be broadly divided into two categories:
- Feature selection: This refers to retaining the relevant (optimal) features and discarding the irrelevant ones to ensure the high accuracy of the model. Feature selection methods such as filter, wrapper, and embedded methods are popularly used.
- Feature extraction: This process is also termed feature projection, wherein a multidimensional space is converted into a space with fewer dimensions. Some well-known feature extraction methods include principal component analysis (PCA), linear discriminant analysis (LDA), kernel PCA (K-PCA), and quadratic discriminant analysis (QDA).
Although one can perform dimensionality reduction with several techniques, the following are the most commonly used ones:
1. Principal component analysis (PCA)
Principal component analysis performs orthogonal transformations to convert a set of observations of correlated variables into a set of linearly uncorrelated features. The newly transformed features are termed 'principal components'. This statistical method is a key data analysis and predictive modeling technique.
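A short sketch of PCA using scikit-learn (the toy data and the choice to keep two components are illustrative assumptions, not from the article):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy data: the second column is strongly correlated with the first.
x = rng.normal(size=(200, 1))
X = np.hstack([
    x,
    0.9 * x + rng.normal(scale=0.1, size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Project onto two orthogonal, linearly uncorrelated principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print(X_pca.shape)                    # (200, 2)
print(pca.explained_variance_ratio_)  # share of variance each component keeps
```

Because the first two original columns are nearly duplicates, a single principal component captures most of their shared variance.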
2. Missing value ratio
When a dataset contains several missing values, such variables are eliminated, as they fail to provide relevant or reliable information. The elimination is accomplished by defining a threshold: a variable with a higher proportion of missing values than the threshold is immediately dropped. Note that the higher the threshold, the more missing values a variable may contain and still be retained, so a generous threshold makes the filter less effective.
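A sketch of this filter with pandas (the column names and the 50% threshold are arbitrary assumptions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],        # 25% missing
    "b": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "c": [5.0, 6.0, 7.0, 8.0],           # complete
})

threshold = 0.5                   # drop columns with more than 50% missing
missing_ratio = df.isna().mean()  # per-column fraction of missing values
kept = df.loc[:, missing_ratio <= threshold]
print(list(kept.columns))         # ['a', 'c']
```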
3. Backward feature elimination
This approach is typically used while developing a linear regression or logistic regression model. In this technique, you specify the number of features essential for the ML algorithm based on the estimated model performance and the tolerated error rate.
The process begins by training the ML model on all 'n' variables provided in the dataset and evaluating its performance. Then, features are removed one at a time: the model is trained on 'n-1' features, n times, dropping a different feature on each run, and its performance is re-evaluated at each step.
The variable whose removal makes the least (or no) difference to the model's performance is then eliminated, leaving you with 'n-1' features. The process repeats until no further feature can be dropped from the dataset.
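scikit-learn's `SequentialFeatureSelector` automates this loop; the sketch below (keeping 5 of the diabetes dataset's 10 features is an arbitrary choice for illustration) drops features backward one at a time:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)  # 442 samples, 10 features

# Start from all 10 features and repeatedly drop the one whose removal
# hurts cross-validated performance the least.
sfs_back = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="backward"
)
X_selected = sfs_back.fit_transform(X, y)
print(X_selected.shape)  # (442, 5)
```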
4. Forward feature selection
Forward feature selection is the opposite of the backward elimination technique. Rather than deleting features, we determine the features that yield the largest gain in the model's performance.
In this approach, we start with a single feature and progressively add features one at a time. The model is first trained on each feature independently, the feature yielding the highest performance is selected, and the model is then iteratively extended from it. The process repeats until adding features no longer improves the model's performance.
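The same scikit-learn selector runs this greedy loop in the forward direction (the iris dataset and the target of two features are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# Start from zero features and greedily add the best feature at each step.
sfs_fwd = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000), n_features_to_select=2, direction="forward"
)
sfs_fwd.fit(X, y)
print(sfs_fwd.get_support())  # boolean mask over the four original features
```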
5. Random forest
Random forest is a feature selection approach with a built-in feature importance measure, which eliminates the need to program one separately.
In this approach, multiple decision trees are constructed against the target feature, and usage statistics for each attribute identify the most informative subset of variables. Moreover, as random forest accepts only numeric input, a one-hot encoding step is essential to convert other types of input data into numeric form.
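A sketch with scikit-learn's random forest (the iris data and the "keep features above the mean importance" rule are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances come built in; they sum to 1 across features.
importances = forest.feature_importances_
print(importances)

# One possible selection rule: keep only features above the mean importance.
keep_mask = importances > importances.mean()
print(keep_mask)
```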
6. Factor analysis
The factor analysis method determines the relationships among a group of variables and then decides whether to retain each variable based on the strength of its correlations. Variables within a group may be strongly correlated with each other yet weakly correlated with variables in other groups. Each variable is retained or dropped based on this correlation structure.
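A minimal sketch with scikit-learn's `FactorAnalysis` (the synthetic data, with two hidden factors driving six observed variables, is an illustrative assumption):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# Two hidden factors generate six observed, correlated variables (toy data).
latent = rng.normal(size=(300, 2))
loadings = rng.normal(size=(2, 6))
X = latent @ loadings + rng.normal(scale=0.1, size=(300, 6))

fa = FactorAnalysis(n_components=2, random_state=0)
X_factors = fa.fit_transform(X)  # estimated factor scores per sample
print(X_factors.shape)           # (300, 2)
```

The six correlated columns are summarized by two factor scores, mirroring how groups of strongly correlated variables collapse to a shared factor.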
7. Independent component analysis (ICA)
Independent component analysis, which underlies the well-known 'blind source separation' task and the 'cocktail party problem,' is a linear dimensionality reduction method that aims to identify the independent components in a given dataset. It is important to note that 'independence' differs from 'correlation,' as explained below.
Here's an example.
Suppose you have two random variables, a1 and b1, with distribution functions P(a1) and P(b1), respectively. Now say you receive additional information about variable a1, yet this does not affect your knowledge about variable b1. This implies that a1 and b1 are independent variables.
Correlation measures dependence between variables, but it essentially captures only linear dependence. When two variables are independent, neither linear nor non-linear dependence exists between them. However, the absence of linear dependence, as indicated by zero correlation, does not necessarily equate to independence, since the variables can still have a non-linear relationship.
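A sketch of the cocktail party setting using scikit-learn's `FastICA` (the sine and sawtooth sources and the mixing matrix are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
# Two independent sources, as in the cocktail party problem.
S = np.column_stack([np.sin(2 * t), np.mod(t, 1.0)])

# Two "microphones" each record a different linear mixture of the sources.
A = np.array([[1.0, 0.5], [0.4, 1.0]])
X = S @ A.T

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)  # recovered sources (up to scale and order)
print(S_est.shape)            # (2000, 2)
```

ICA can only recover the sources up to scaling and permutation, which is why the comment above hedges on scale and order.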
8. Low variance filter
Data columns whose values barely change across the dataset tend to provide little information, a problem analogous to the one addressed by the missing value ratio approach. It is therefore useful to compute each variable's variance and define a threshold: if a data column's variance falls below the threshold, the column is eliminated, as its nearly constant values cannot meaningfully influence the target variable.
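scikit-learn ships this filter as `VarianceThreshold`; a sketch (the toy matrix and the 0.01 cutoff are arbitrary illustrations):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [0.0, 2.0, 0.5],
    [0.0, 1.0, 0.8],
    [0.0, 3.0, 0.2],
    [0.0, 2.5, 0.5],
])  # the first column never changes, so it carries no information

vt = VarianceThreshold(threshold=0.01)  # drop columns with variance <= 0.01
X_filtered = vt.fit_transform(X)
print(X_filtered.shape)  # (4, 2): the constant column is gone
```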
9. High correlation filter
If two variables convey near-identical information, they are said to have a high correlation. Such redundancy affects the model's performance negatively. Hence, a correlation coefficient threshold is defined: if the correlation coefficient between two variables exceeds the threshold, one of them can be eliminated from the dataset. The aim is to keep features strongly associated with the target variable while discarding redundant ones.
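A sketch of the filter with pandas (the synthetic columns and the 0.9 threshold are illustrative assumptions; column "b" is deliberately built as a noisy copy of "a"):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=100)})
df["b"] = 0.95 * df["a"] + rng.normal(scale=0.05, size=100)  # near-duplicate of "a"
df["c"] = rng.normal(size=100)                               # unrelated feature

threshold = 0.9
corr = df.corr().abs()
# Scan the upper triangle and drop one column from each highly correlated pair.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = df.drop(columns=to_drop)
print(to_drop)  # ['b']
```

Only the upper triangle is scanned so that each correlated pair loses exactly one member rather than both.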
10. Uniform manifold approximation and projection (UMAP)
T-distributed stochastic neighbor embedding (T-SNE) is a dimensionality reduction technique applied to large datasets. However, it has certain disadvantages, such as loss of large-scale information, long computational times, and difficulty representing very large datasets.
UMAP, on the other hand, offers quicker runtimes while preserving local and global data structure, much like T-SNE. The technique has an edge over T-SNE in that it handles big, high-dimensional datasets well. It unlocks the power of visualization, one of the key benefits of reducing data dimensionality.
The method relies on the 'k-nearest neighbor' concept and uses 'stochastic gradient descent' to fine-tune the results. The first step is to calculate the distances between data points in the high-dimensional space. Next, these points are projected onto a low-dimensional space, where the distances between them are computed again. Lastly, stochastic gradient descent is applied to minimize the difference between the distances in the two spaces. As a result, data dimensionality is considerably reduced.
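UMAP itself ships in the third-party `umap-learn` package, which may not be installed; as a runnable stand-in with the same `fit_transform` call pattern, here is T-SNE, the technique UMAP is compared against, from scikit-learn (the digits dataset and the 500-sample slice are illustrative choices):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)
X = X[:500]  # a small slice keeps the runtime short

# Embed the 64-dimensional digit images into 2-D for visualization.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(embedding.shape)  # (500, 2)
```

With umap-learn installed, swapping in `umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(X)` yields the UMAP embedding instead (those parameter values are the library's defaults).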
Dimensionality reduction methods are key to several real-life applications, including text categorization, image retrieval, face recognition, intrusion detection, neuroscience, gene expression analysis, email categorization, etc.
Let’s look at some examples in detail.
1. Text categorization
The internet holds massive amounts of digital data, such as digital libraries, social media content, emails, and ecommerce data. Classifying these text files is challenging because text is represented in a very high-dimensional feature space. Hence, whenever a new text document is added to the web, one task where dimensionality reduction plays a key role is automatically classifying the newly added document under predefined categories.
This method cuts down the feature space (word- or phrase-based features) without hampering categorization accuracy. It employs multiple metrics, such as document frequency, information gain, and term length, to segregate text files automatically.
2. Image retrieval
With the growth in online media and IoT devices, image collections from scientific quarters, military departments, and social media platforms have increased significantly. Without indexing these images, one may not be able to retrieve them when required. That's where dimensionality reduction comes into the picture. Images are indexed based on visual content, which includes color, texture, or shape.
Traditionally, images were indexed by using textual descriptions (keywords and captions). However, with the rise in high-dimensional data, indexing based on text content wasn’t sufficient. This led to the indexing of images based on visual content. Various deep learning methods such as object recognition, face recognition, and others are also integral to this image retrieval task.
3. Gene expression analysis
Dimensionality reduction has made gene expression analysis faster and easier, since modern technology enables the expression levels of several thousand genes to be measured simultaneously in a single experiment.
For example, sample classification of leukemia data is performed using feature ranking methods based on the linear correlation between relevant gene features. This technique has not only sped up gene expression analysis but has also shown good accuracy.
4. Intrusion detection
In today’s digital world, network-based computer systems are essential to modern society. However, all such networking systems are inevitably exposed to external cyber threats. Hence, to ensure secure and smooth network operation, protecting these vital computer systems from such intrusions is crucial.
Intrusion detection via data mining is a critical area where dimensionality reduction techniques are extensively employed. With the help of data mining algorithms, user activity patterns can be obtained by regularly auditing the relevant data; dimensionality reduction then determines the optimal features that act as checkpoints for suspicious activity. Moreover, classifiers can be designed over the selected features to mark observed activity as 'legitimate' or 'intrusive.'
A similar strategy can be employed for email classification problems where the task is to categorize emails as spam or legitimate. One can consider several features such as email title, email content, whether the email uses a template, and so on to categorize emails.
5. Neuroscience
Dimensionality reduction is widely used in the field of neuroscience. A technique known as 'maximally informative dimensions' is employed to perform statistical analysis of neural responses. The method projects a neural stimulus onto a lower-dimensional space such that all the relevant information about the stimulus is retained in the neural response.
Moreover, independent component analysis (ICA) techniques find application in neuroimaging, fMRI, and EEG analysis where normal and abnormal signals are segregated.
Today, an unprecedented amount of data is generated every second, much of it high-dimensional data that requires preprocessing before use. It is therefore crucial to find ways to handle such data. Dimensionality reduction provides a precise and efficient way to pre-process it and is considered a go-to approach by many data scientists, since it helps analyze humongous datasets with optimal computing resources while achieving accurate results.
Did this article help you understand dimensionality reduction and its key examples in the real world? Comment below or let us know on Facebook, Twitter, or LinkedIn. We'd love to hear from you!