Step by Step Approach to Principal Component Analysis using Python
Principal Component Analysis (PCA) is used for dimensionality reduction of large data sets. In my previous post, A Complete Guide to Principal Component Analysis – PCA in Machine Learning, I explained what PCA is and the concept behind the technique. This post continues from that one. If you already have a basic understanding of how PCA works, you may continue; otherwise, I highly recommend reading the previous post first.
In this post I will walk through the following steps using the ‘MNIST original’ data set.
- Download and load the ‘MNIST original’ Data set.
- Split data into Train and Test sets
- Standardize the data
- Compute Principal Components and transform the data
- Plot explained variance as a function of number of Components
- Apply Logistic Regression to the Transformed Data
- Measure model performance on the PCA-transformed data and on the original data (without PCA)
- Compare the time taken to fit a Logistic Regression model on the PCA-transformed data and on the original data
1. Download and Load the ‘MNIST original’ Data Set
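# Note: fetch_mldata was removed in scikit-learn 0.22. On newer versions, use
# fetch_openml('mnist_784', version=1, as_frame=False) from sklearn.datasets instead.
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')
mnist

# This is an image data set; summarize the pixel values and the labels.
import pandas as pd
mnist_df = pd.DataFrame(mnist.data).describe()
mnist_df_label = pd.DataFrame(mnist.target).describe()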
2. Split Data into Training and Test Sets
from sklearn.model_selection import train_test_split
train_img, test_img, train_lbl, test_lbl = train_test_split(
    mnist.data, mnist.target, test_size=1/7.0, random_state=0)
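As a quick check of what test_size=1/7.0 means in practice, here is a minimal sketch. It assumes the data set has 70,000 samples (the size of MNIST) and uses a placeholder index array instead of the real images.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder indices standing in for 70,000 images (an assumption for illustration).
indices = np.arange(70000)

train_idx, test_idx = train_test_split(indices, test_size=1/7.0, random_state=0)
print(len(train_idx), len(test_idx))

# The same random_state reproduces the same split.
train_again, test_again = train_test_split(indices, test_size=1/7.0, random_state=0)
same_split = np.array_equal(test_idx, test_again)
```

Fixing random_state makes the split repeatable, which matters when comparing models trained with and without PCA.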
3. Standardize the Data
- Standardization involves re-scaling the features such that they have the properties of a standard normal distribution with a mean of zero and a standard deviation of one.
- In PCA we are interested in the components that maximize the variance. If one component (e.g. human height) varies less than another (e.g. weight) because of their respective scales (meters vs. kilos), PCA might determine that the direction of maximal variance more closely corresponds with the ‘weight’ axis, if those features are not scaled.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(train_img)  # fit only on the training set

# Apply the transformation to both the training and the test set
train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)
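To see why scaling matters, the following sketch builds two independent, made-up features on very different scales and compares the first principal direction before and after standardization. The feature names, means, and spreads are assumptions chosen for illustration, and the PCA step is done by hand with NumPy rather than with scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two independent hypothetical features on very different scales.
height_m = rng.normal(1.7, 0.1, n)     # metres, spread of about 0.1
weight_kg = rng.normal(70.0, 10.0, n)  # kilograms, spread of about 10
X = np.column_stack([height_m, weight_kg])

def first_component(data):
    """Return the unit vector along the direction of maximal variance."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return eigvecs[:, -1]

raw_pc1 = first_component(X)

# Standardize: zero mean and unit standard deviation per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_pc1 = first_component(X_std)

print("PC1 on raw data (height, weight):", np.abs(raw_pc1))
print("PC1 on standardized data (height, weight):", np.abs(std_pc1))
```

On the raw data the first component lines up almost entirely with the weight axis, simply because weight has the larger numeric spread. After standardization both features contribute equally.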
4. Compute Principal Components
I want to retain 95% of the variance, so I pass .95 as the n_components argument of PCA. scikit-learn then keeps the smallest number of components whose combined explained variance reaches 95%.
from sklearn.decomposition import PCA
pca = PCA(.95)
pca.fit(train_img)
transformed_train_img = pca.transform(train_img)
Now we can find out how many components are needed to retain 95% of the variance using pca.n_components_ as follows:
print(pca.n_components_)  # outputs 330 components
We can check the variance explained by each principal component using pca.explained_variance_, and each component's share of the total variance using pca.explained_variance_ratio_:
# Variance explained by each component; run the code to see the complete output.
print(pca.explained_variance_)
# Fraction of the total variance explained by each component.
print(pca.explained_variance_ratio_)
# Cumulative fraction of variance explained by the first k components.
cum = pca.explained_variance_ratio_.cumsum()
print(cum)
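The link between the cumulative variance ratio and a fractional n_components can be checked by hand. The sketch below is only an illustration: it uses scikit-learn's small bundled digits data set (8x8 images) as a stand-in for MNIST so that it runs without a download, and the variable names are my own.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data  # 1797 samples, 64 pixel features (stand-in for MNIST)

# Fit with all components kept so the full variance spectrum is available.
full_pca = PCA().fit(X)
cum = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 0.95.
k = int(np.searchsorted(cum, 0.95, side="right")) + 1

# Letting scikit-learn choose via a fractional n_components should agree.
auto_pca = PCA(0.95).fit(X)
print(k, auto_pca.n_components_)
```

Counting components from the cumulative sum gives the same number that PCA(0.95) selects on its own.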
5. Plot explained variance as a function of number of Components
import numpy as np
import matplotlib.pyplot as plt

# Jupyter magic; omit this line when running outside a notebook
%matplotlib notebook
plt.figure(figsize=(8, 4))
plt.plot(np.cumsum(pca.explained_variance_ratio_ * 100))
plt.xlim(0, 400)
plt.ylim(0, 100)
plt.title('Cumulative Explained Variance as a Function of the Number of Components')
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
# Plot the explained-variance thresholds superimposed on the curve
plt.axhline(y=95, color='k', linestyle='--', label='95% Explained Variance')
plt.axhline(y=90, color='c', linestyle='--', label='90% Explained Variance')
plt.axhline(y=85, color='r', linestyle='--', label='85% Explained Variance')
plt.legend(loc='best')
plt.show()
The idea behind going from 784 components to 330 is to reduce the running time of a supervised learning algorithm (in this case logistic regression), as the next steps show.
6. Apply Logistic Regression to the Transformed Data
Step 1: Import the model you want to use
from sklearn.linear_model import LogisticRegression
Step 2: Make an instance of the Model
# Parameters that are not specified are set to their defaults
# The default solver is incredibly slow, which is why I am using 'lbfgs'
# (limited-memory Broyden–Fletcher–Goldfarb–Shanno)
logisticRegr = LogisticRegression(solver='lbfgs')
logisticRegr_pca = LogisticRegression(solver='lbfgs')
In numerical optimization, the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is an iterative method for solving unconstrained nonlinear optimization problems.
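To make this concrete, here is a minimal sketch of the limited-memory variant minimizing a simple convex function with SciPy. The loss function and its target point are made up for illustration; to my knowledge scikit-learn's 'lbfgs' solver relies on the same SciPy routine, but this is not how you would train a classifier yourself.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical minimum of the toy loss below.
target = np.array([3.0, -2.0])

def loss(w):
    """Squared distance to the target; convex with a single minimum."""
    return np.sum((w - target) ** 2)

def grad(w):
    """Gradient of the loss; the optimizer uses it to update its curvature estimate."""
    return 2.0 * (w - target)

result = minimize(loss, x0=np.zeros(2), jac=grad, method="L-BFGS-B")
print(result.x, result.nit, result.success)
```

Starting from the origin, the optimizer iterates until the gradient is close to zero, which for this loss happens at the target point.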
Step 3: Training the model on the data, storing the information learned from the data
# Fit logistic regression on the original data
logisticRegr.fit(train_img, train_lbl)
# Fit logistic regression on the PCA-transformed data
logisticRegr_pca.fit(transformed_train_img, train_lbl)
Step 4: Predict the labels of new data (new images)
This step uses the information the model learned during training.
The code below predicts for one observation:
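A minimal, self-contained sketch of predicting a single observation is given here. It is an illustration under assumptions rather than the original code: it uses scikit-learn's small bundled digits data set instead of MNIST, and max_iter=1000 is my own choice to avoid convergence warnings. Note that a test image must pass through the same PCA transform before the PCA model can predict on it.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

digits = load_digits()  # stand-in for MNIST (assumption)
train_img, test_img, train_lbl, test_lbl = train_test_split(
    digits.data, digits.target, test_size=1/7.0, random_state=0)

scaler = StandardScaler().fit(train_img)
train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)

pca = PCA(.95).fit(train_img)
transformed_train_img = pca.transform(train_img)
transformed_test_img = pca.transform(test_img)

# max_iter=1000 is an assumption, not a value from the post.
logisticRegr = LogisticRegression(solver='lbfgs', max_iter=1000)
logisticRegr.fit(train_img, train_lbl)
logisticRegr_pca = LogisticRegression(solver='lbfgs', max_iter=1000)
logisticRegr_pca.fit(transformed_train_img, train_lbl)

# predict expects a 2-D array, so one image is reshaped to (1, n_features).
single_pred = logisticRegr.predict(test_img[0].reshape(1, -1))
single_pred_pca = logisticRegr_pca.predict(transformed_test_img[0].reshape(1, -1))
print(single_pred, single_pred_pca, test_lbl[0])
```

Predicting several observations at once works the same way, for example logisticRegr.predict(test_img[0:10]).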
7. Measuring Model Performance
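One way to compare the two models is to measure accuracy on the test set with score and to time each fit. The sketch below is an assumption-laden illustration, not the original measurement: it runs on scikit-learn's bundled digits data set instead of MNIST, uses max_iter=1000 as my own setting, and relies on a hypothetical helper named fit_and_score. Timings on such a small data set will not reflect the speed-up seen on MNIST.

```python
import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

digits = load_digits()  # stand-in for MNIST (assumption)
train_img, test_img, train_lbl, test_lbl = train_test_split(
    digits.data, digits.target, test_size=1/7.0, random_state=0)

scaler = StandardScaler().fit(train_img)
train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)

pca = PCA(.95).fit(train_img)
transformed_train_img = pca.transform(train_img)
transformed_test_img = pca.transform(test_img)

def fit_and_score(train_x, test_x):
    """Hypothetical helper: fit a model, return (fit time in seconds, test accuracy)."""
    model = LogisticRegression(solver='lbfgs', max_iter=1000)
    start = time.perf_counter()
    model.fit(train_x, train_lbl)
    elapsed = time.perf_counter() - start
    return elapsed, model.score(test_x, test_lbl)

time_orig, acc_orig = fit_and_score(train_img, test_img)
time_pca, acc_pca = fit_and_score(transformed_train_img, transformed_test_img)

print("original: %d features, fit %.3fs, accuracy %.3f"
      % (train_img.shape[1], time_orig, acc_orig))
print("with PCA: %d features, fit %.3fs, accuracy %.3f"
      % (pca.n_components_, time_pca, acc_pca))
```

The score method returns the fraction of correctly classified test images, so the two accuracy values can be compared directly.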
8. Image reconstruction
One of the cool things about PCA is that we can go from a compressed representation (the retained principal components) back to an approximation of the original high-dimensional data (784 components).
import matplotlib.pyplot as plt

# Note: this refits PCA on the raw (unscaled) pixels, so the number of components
# kept for 95% of the variance can differ from the 330 found on the scaled data.
lower_dimensional_data = pca.fit_transform(mnist.data)
approximation = pca.inverse_transform(lower_dimensional_data)

plt.figure(figsize=(8, 4))

# Original image (first sample)
plt.subplot(1, 2, 1)
plt.imshow(mnist.data[0].reshape(28, 28), cmap=plt.cm.gray, interpolation='nearest', clim=(0, 255))
plt.xlabel('784 components', fontsize=14)
plt.title('Original Image', fontsize=20)

# Reconstruction from the retained principal components
plt.subplot(1, 2, 2)
plt.imshow(approximation[0].reshape(28, 28), cmap=plt.cm.gray, interpolation='nearest', clim=(0, 255))
plt.xlabel('%d components' % pca.n_components_, fontsize=14)
plt.title('95% of Explained Variance', fontsize=20)
plt.show()
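How close the approximation is depends on how much variance is retained. The following sketch measures the mean squared reconstruction error for a few variance fractions. It is an illustration on scikit-learn's bundled digits data set rather than MNIST, and the fractions are arbitrary choices of mine.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data  # stand-in for MNIST (assumption)

errors = {}
for fraction in (0.80, 0.90, 0.95):
    pca = PCA(fraction).fit(X)
    # Compress, then map back to the original pixel space.
    approximation = pca.inverse_transform(pca.transform(X))
    mse = float(np.mean((X - approximation) ** 2))
    errors[fraction] = (int(pca.n_components_), mse)
    print("%.0f%% variance: %d components, reconstruction MSE %.3f"
          % (fraction * 100, pca.n_components_, mse))
```

Retaining more variance keeps more components and lowers the reconstruction error, at the cost of a larger compressed representation.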
I hope this has given you a good understanding of how to use PCA to speed up machine learning algorithms.
Recommended Articles on Machine Learning:
- What is Linear Regression? Part: 1
- What is Linear Regression? Part: 2
- Co-variance and Correlation
- What is the Coefficient of Determination | R Square
- Feature Selection Techniques in Regression Model
- What is “stepAIC” in R?
- What is Multicollinearity in R?
- A Complete Guide to Principal Component Analysis – PCA in Machine Learning