Dimensionality reduction is a technique used in machine learning and data analysis to reduce the number of features (dimensions) in a dataset while preserving as much relevant information as possible. High-dimensional data can pose challenges in terms of computational complexity, visualization, and model performance, often referred to as the "curse of dimensionality." Dimensionality reduction aims to overcome these challenges by transforming the data into a lower-dimensional space.
Reducing dimensionality may result in some loss of information, and in some cases, important patterns or details might be discarded. Therefore, it is crucial to carefully analyze the trade-offs before applying dimensionality reduction techniques.
You can watch the video-based tutorial with a step-by-step explanation down below.
Import the data
from keras.datasets import mnist
Keras provides a module called keras.datasets that includes several popular datasets for machine learning and deep learning tasks. Here we are importing the MNIST dataset.
Next we will unpack the training data and discard the test split.
(X, y), (_,_) = mnist.load_data()
print(X.shape, y.shape)
(60000, 28, 28) (60000,)
In this code snippet, X will contain the training images, and y will contain the corresponding labels for the images. The MNIST dataset contains 60,000 training images, each represented as a 28x28 grayscale image, so X.shape will be (60000, 28, 28). The corresponding labels are stored in y, and y.shape will be (60000,), representing a 1-dimensional array of 60,000 labels.
By using the dummy variables (_, _) for the testing data, the code indicates that you are not interested in using the testing images and labels in this particular context. The convention of using _ is a common way to signal that certain values are intentionally discarded.
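If you do want to keep the test split as well, the same call can unpack it into explicitly named variables (the variable names below are just illustrative):
# alternative: keep both splits with explicit names (names are illustrative)
(X_train, y_train), (X_test, y_test) = mnist.load_data()
print(X_test.shape, y_test.shape)  # (10000, 28, 28) (10000,)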
Next let us reshape the data.
X = X.reshape(len(X), -1)
X.shape
(60000, 784)
Before reshaping, X had a shape of (60000, 28, 28), representing 60,000 images with dimensions 28x28 pixels.
After the reshaping, the data will have a shape of (60000, 784), where each row now represents a flattened version of an individual image with 784 features (28x28 = 784).
Flattening the images like this is a common preprocessing step when using certain machine learning algorithms that expect 2D input rather than 3D input.
By doing so, the 2D array X can be directly used in algorithms like logistic regression, support vector machines (SVMs), or artificial neural networks.
Each row in X now represents a single data point (image), and each column represents a feature of that data point (pixel intensity).
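As a quick, purely illustrative sanity check, you can confirm that each row is a 784-dimensional vector of pixel intensities and that the original image can be recovered by reshaping it back:
# each row is one flattened image of 784 pixel intensities in the 0-255 range
print(X[0].shape)              # (784,)
print(X[0].min(), X[0].max())  # values lie between 0 and 255
first_image = X[0].reshape(28, 28)  # reshape a row back into a 28x28 image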
Import Modules
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from umap import UMAP
import matplotlib.pyplot as plt
sklearn.decomposition - module from the scikit-learn library (also known as sklearn) that provides various methods for performing dimensionality reduction and matrix decomposition in Python.
sklearn.manifold - module from the scikit-learn library (sklearn) that provides various algorithms for manifold learning. Manifold learning is a set of techniques used for nonlinear dimensionality reduction and visualization of high-dimensional data in a lower-dimensional space.
sklearn.discriminant_analysis - module from the scikit-learn library (sklearn) that provides implementations of various linear and quadratic discriminant analysis (LDA and QDA) algorithms.
umap - UMAP (Uniform Manifold Approximation and Projection) is a powerful non-linear dimensionality reduction technique used for embedding high-dimensional data into a lower-dimensional space.
Different Dimensionality Reduction Techniques
1) PCA
PCA (Principal Component Analysis) is a widely used linear dimensionality reduction technique that aims to transform high-dimensional data into a lower-dimensional space while preserving as much of the data's variance as possible. The principal components represent the directions of maximum variance in the data, and by choosing a subset of these components, PCA can reduce the number of features in the data.
Let us see how the PCA technique performs.
x_pca = PCA(n_components=2).fit_transform(X)
x_pca will contain the transformed data, where each row represents a sample projected into the two-dimensional space defined by the first two principal components.
The fit_transform() method combines the process of fitting the PCA model to the data (fit()) and transforming the data to the lower-dimensional space (transform()).
Using n_components=2 indicates that the transformed data will have two columns, representing the two most important principal components that capture the maximum variance in the data. The remaining components, which account for less variance, are discarded in this case.
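If you want to see how much variance those two components actually capture, scikit-learn exposes the explained_variance_ratio_ attribute; the sketch below fits a separate PCA object just for inspection (the component count of 50 is an arbitrary choice for illustration):
# fit PCA separately to inspect how much variance each component explains
pca = PCA(n_components=50).fit(X)
print(pca.explained_variance_ratio_[:2])           # variance ratio of the first two components
print(pca.explained_variance_ratio_.cumsum()[-1])  # total variance captured by 50 components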
Let us see the shape of the reduced data after the transformation.
x_pca.shape
(60000, 2)
The shape of the variable x_pca depends on the original shape of the dataset X and the number of components chosen for PCA.
X is the original dataset, and you specified n_components=2, which means PCA will reduce the dimensionality of the data to two components. Therefore, x_pca will have the shape (n_samples, 2).
Next let us visualize the reduced data.
plt.figure(figsize=(10,10))
sc = plt.scatter(x_pca[:, 0], x_pca[:, 1], c=y)
plt.legend(handles=sc.legend_elements()[0], labels=list(range(10)))
plt.show()
plt.figure(figsize=(10, 10)) creates a new figure with a size of 10x10 inches.
plt.scatter(x_pca[:, 0], x_pca[:, 1], c=y) creates a scatter plot using the first two components of x_pca (accessible using x_pca[:, 0] and x_pca[:, 1]) as the x and y coordinates. The c=y parameter assigns colors to the points based on the corresponding labels y.
plt.legend(handles=sc.legend_elements()[0], labels=list(range(10))) creates a legend for the plot using the unique labels in y. It uses the handles returned by sc.legend_elements()[0] to obtain the legend markers, and list(range(10)) provides the labels for each class (assuming there are 10 classes).
Finally, plt.show() displays the plot on the screen.
In the plot, each color represents a particular digit.
In this 2-D representation we can clearly see that most of the digits overlap and only some of them form distinct groups.
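Since the same scatter plot is drawn for every technique in this tutorial, you could optionally wrap it in a small helper; plot_embedding below is a hypothetical convenience function, not part of any library:
def plot_embedding(embedding, labels, title=None):
    # scatter-plot a 2-D embedding, colouring each point by its digit label
    plt.figure(figsize=(10, 10))
    sc = plt.scatter(embedding[:, 0], embedding[:, 1], c=labels)
    plt.legend(handles=sc.legend_elements()[0], labels=list(range(10)))
    if title:
        plt.title(title)
    plt.show()

# usage: plot_embedding(x_pca, y, title="PCA")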
2) LDA
LDA (Linear Discriminant Analysis) is a linear dimensionality reduction technique that is commonly used for supervised classification tasks. Similar to PCA, LDA projects data from a higher-dimensional space to a lower-dimensional space, but it aims to maximize the separation between classes while minimizing the variance within each class.
Let us see how the LDA technique performs.
x_lda = LDA(n_components=2).fit_transform(X, y)
The fit_transform() method is used to fit the LDA model to the data and then transform the data into the lower-dimensional space defined by the first two discriminant components.
The variable x_lda will have a shape of (60000, 2), where each row represents a sample projected into the reduced feature space.
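Note that LDA can produce at most n_classes - 1 components (9 for the 10 digit classes), so requesting two components is well within that limit. As an optional, illustrative check, the fitted scikit-learn model also reports how much between-class variance each discriminant captures:
# fit LDA separately just to inspect its attributes (illustrative only)
lda = LDA(n_components=2).fit(X, y)
print(len(lda.classes_) - 1)          # upper bound on the number of LDA components (9 here)
print(lda.explained_variance_ratio_)  # share of between-class variance per discriminant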
Next let us visualize the reduced data.
plt.figure(figsize=(10,10))
sc = plt.scatter(x_lda[:, 0], x_lda[:, 1], c=y)
plt.legend(handles=sc.legend_elements()[0], labels=list(range(10)))
plt.show()
This code snippet will produce a scatter plot where data points from different classes are colored differently, and you will see how the data has been distributed in the two-dimensional space after applying Linear Discriminant Analysis. The legend will help identify which color corresponds to which class label.
We can see that LDA has produced a different plot compared to PCA. There is still some overlapping and grouping of digits, but the separation is better than with PCA.
3) t-SNE
t-SNE (t-Distributed Stochastic Neighbor Embedding) is a non-linear dimensionality reduction technique commonly used for visualizing high-dimensional data in a lower-dimensional space. It is particularly effective at preserving the local structure and relationships between data points, making it useful for visualizing complex datasets where linear methods like PCA might not be sufficient.
The t-SNE algorithm works by minimizing the divergence between two probability distributions: a Gaussian distribution that represents pairwise similarities between data points in the high-dimensional space and a Student's t-distribution that represents pairwise similarities in the low-dimensional space.
By minimizing the divergence, t-SNE tries to preserve the neighborhood relationships of data points, making similar points close together and dissimilar points far apart.
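Concretely, the cost function that t-SNE minimizes is the Kullback-Leibler divergence between the pairwise similarity distribution P in the original space and the distribution Q in the embedding, shown here in its standard form:
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i} \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}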
Let us see how the t-SNE technique performs.
# taking only 10k samples for quick results
x_tsne = TSNE(n_jobs=-1).fit_transform(X[:10000])
X: Represents the original dataset, a 2D array with samples as rows and features as columns.
X[:10000]: Slices the first 10,000 samples from the original dataset X. This subset of data will be used for the t-SNE computation.
TSNE(n_jobs=-1): Instantiates t-SNE with the desired number of components (default is 2 for 2D visualization). The n_jobs=-1 parameter enables parallel computation using all available CPU cores, which can speed up the t-SNE process significantly.
fit_transform(X[:10000]): Fits the t-SNE model to the first 10,000 samples of the data and transforms it into the lower-dimensional space. The transformed data is stored in the variable x_tsne, which will have a shape of (10000, n_components) based on the number of components specified in the t-SNE instantiation (default is 2).
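If you want more control over the embedding, the scikit-learn implementation exposes a few key parameters; the values below are illustrative starting points rather than tuned settings:
# illustrative parameter choices for scikit-learn's TSNE
x_tsne = TSNE(
    n_components=2,   # dimensionality of the embedding
    perplexity=30,    # roughly the number of effective neighbours per point
    n_jobs=-1,        # use all available CPU cores
    random_state=42,  # fix the seed so the stochastic result is reproducible
).fit_transform(X[:10000])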
Next let us visualize the reduced data.
plt.figure(figsize=(10,10))
sc = plt.scatter(x_tsne[:, 0], x_tsne[:, 1], c=y[:10000])
plt.legend(handles=sc.legend_elements()[0], labels=list(range(10)))
plt.show()
This code snippet will produce a scatter plot where data points from different classes are colored differently based on their class labels. The legend will help identify which color corresponds to which class label. Since t-SNE preserves the local structure of the data, you should be able to observe clusters of data points that have similar characteristics. However, keep in mind that t-SNE is a stochastic algorithm, so the resulting plot may vary slightly with each run.
We can see the difference in the cluster grouping: there is less overlap compared to the previous techniques.
The clusters are mostly well separated, so this reduced representation could be used with a machine learning algorithm to classify handwritten digits with less overlap between classes.
We could also draw decision boundaries based on these two dimensions.
4) UMAP
UMAP is a powerful non-linear dimensionality reduction technique, similar to t-SNE, but with better scalability and preservation of global structure. The reduced data obtained after UMAP can be used for visualization and understanding complex high-dimensional datasets.
Let us see how UMAP performs.
x_umap = UMAP(n_neighbors=10, min_dist=0.1, metric='correlation').fit_transform(X)
UMAP(n_neighbors=10, min_dist=0.1, metric='correlation'): Creates an instance of the UMAP model with the following parameters:
n_neighbors=10: Specifies the number of nearest neighbors to consider during the transformation. It affects the local structure preservation.
min_dist=0.1: Controls how tightly UMAP groups points together in the reduced space. Smaller values result in more compact clusters.
metric='correlation': Specifies the distance metric used to compute distances between points in the original space. In this case, it's the correlation distance.
.fit_transform(X): Fits the UMAP model to the data X and transforms it into the lower-dimensional space. The transformed data is stored in the variable x_umap.
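These parameters are worth experimenting with; for example, a plain Euclidean metric with a larger neighbourhood size shifts the emphasis towards global structure. The setting below is only an illustrative alternative, not a tuned configuration:
# an alternative, illustrative parameter setting for comparison
x_umap_alt = UMAP(n_neighbors=30, min_dist=0.3, metric='euclidean').fit_transform(X)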
Next let us visualize the reduced data.
plt.figure(figsize=(10,10))
sc = plt.scatter(x_umap[:, 0], x_umap[:, 1], c=y)
plt.legend(handles=sc.legend_elements()[0], labels=list(range(10)))
plt.show()
This code snippet will produce a scatter plot where data points from different classes are colored differently based on their class labels. The legend will help identify which color corresponds to which class label. UMAP's ability to preserve both local and global structure can provide insights into the underlying patterns in high-dimensional data, and the visualization can be useful for exploratory data analysis and understanding the relationships between data points.
Here we can see that the digits are grouped well, and we could draw decision boundaries with this technique as well.
The full dataset is processed in just a few seconds, whereas t-SNE had to be restricted to 10,000 samples to finish quickly.
If you want a CPU-friendly approach to feature reduction, UMAP is the best choice among these techniques.
Final Thoughts
As the number of features increases, the data becomes more sparse, and the distance between points becomes less meaningful. Dimensionality reduction techniques help mitigate the curse of dimensionality by projecting data into a lower-dimensional space.
Dimensionality reduction can be used for data visualization and exploratory data analysis, especially when the data has many dimensions and cannot be easily visualized in its original form.
Dimensionality reduction can serve as a preprocessing step before applying machine learning algorithms. By reducing the number of features, it can lead to improved model performance and reduced overfitting, especially when the original data is high-dimensional.
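As a minimal sketch of that workflow (the classifier and component count below are illustrative choices, not recommendations), PCA can be chained with a classifier in a scikit-learn pipeline:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

# reduce to 50 components, then train a simple classifier on the reduced features
clf = make_pipeline(PCA(n_components=50), LogisticRegression(max_iter=1000))
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy, just to confirm the pipeline runs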
Linear techniques like PCA are effective when the data exhibits linear relationships, while non-linear techniques like t-SNE and UMAP are better suited for capturing complex non-linear structures in the data.
PCA, t-SNE, and UMAP (as used here) are unsupervised techniques that do not consider class labels, while LDA takes class information into account, making it well suited to supervised tasks and the visualization of labeled data.
Dimensionality reduction involves a trade-off between simplifying the data and preserving relevant information. Some loss of information is inevitable, but the goal is to retain the most important patterns and relationships.
Many dimensionality reduction techniques have hyperparameters that can be tuned to influence the outcome of the transformation. Proper parameter selection can significantly impact the quality of the reduced representation.
In conclusion, dimensionality reduction is a valuable tool for handling high-dimensional data, understanding complex patterns, and preparing data for further analysis or machine learning tasks. The choice of technique depends on the specific problem and the underlying characteristics of the data. As with any data analysis or machine learning task, it is essential to consider the context, evaluate the performance, and interpret the results in light of the domain knowledge and objectives.
Get the project notebook from here
Thanks for reading the article!!!
Check out more project videos from the YouTube channel Hackers Realm