Step by Step Approach to Principal Component Analysis using Python

Principal Component Analysis, or PCA, is used for dimensionality reduction of large data sets. In my previous post, A Complete Guide to Principal Component Analysis – PCA in Machine Learning, I explained what PCA is and the complete concept behind the technique. This post is a continuation of that one: if you already have a basic understanding of how PCA works, you may continue; otherwise it is highly recommended that you go through the post mentioned above first.

In this post I will walk through the following steps on the ‘MNIST original’ data set.

Principal Component Analysis
  • Download and load the ‘MNIST original’ data set.
  • Split the data into train and test sets
  • Standardize the data
  • Compute the principal components and transform the data
  • Plot explained variance as a function of the number of components
  • Apply Logistic Regression to the transformed data
  • Measure model performance on the PCA-transformed data and on the original data (without PCA)
  • Compare the time taken to fit the Logistic Regression model on PCA-transformed data versus the original data

1. Download and Load the ‘MNIST original’ Data Set

from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')
mnist
#This is an image data set.
import pandas as pd
mnist_df = pd.DataFrame(mnist.data).describe()
mnist_df_label = pd.DataFrame(mnist.target).describe()

2. Split Data into Training and Test Sets

from sklearn.model_selection import train_test_split
#1/7 of the 70,000 MNIST images (10,000) go to the test set
train_img, test_img, train_lbl, test_lbl = train_test_split(mnist.data, mnist.target, test_size=1/7.0, random_state=0)

3. Standardize the Data

  • Standardization involves re-scaling the features such that they have the properties of a standard normal distribution with a mean of zero and a standard deviation of one.
  • In PCA we are interested in the components that maximize the variance. If one component (e.g. human height) varies less than another (e.g. weight) because of their respective scales (meters vs. kilos), PCA might determine that the direction of maximal variance more closely corresponds with the ‘weight’ axis, if those features are not scaled.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(train_img) #fit only on training set
#Apply transformation on both training and test set
train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)

4. Compute Principal Components

I will retain 95% of the variance, so I pass .95 as the n_components parameter to the PCA function. This tells PCA to keep the minimum number of components whose cumulative explained variance reaches 95%.

from sklearn.decomposition import PCA
pca = PCA(.95)
pca.fit(train_img)
transformed_train_img = pca.transform(train_img)

Now we can find out how many components were kept to reach 95% of the variance using pca.n_components_ as follows:

print(pca.n_components_)
#outputs 330, i.e. 330 components were retained

We can check the variance explained by each of the principal components using pca.explained_variance_ and pca.explained_variance_ratio_:

print(pca.explained_variance_)
#variance explained by each retained component; run the code to see the full output
print(pca.explained_variance_ratio_)
#fraction of the total variance explained by each component
cum = pca.explained_variance_ratio_.cumsum()
print(cum)
#cumulative explained variance ratio; the last value is approximately 0.95

5. Plot explained variance as a function of number of Components

%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt

plt.figure(figsize=(8,4))
plt.plot(np.cumsum(pca.explained_variance_ratio_*100))

plt.xlim(0, 400)
plt.ylim(0, 100)

plt.title('Cumulative Explained Variance as a Function of the Number of Components')
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance (%)');
#superimpose the explained-variance thresholds on the curve
plt.axhline(y = 95, color='k', linestyle='--', label = '95% Explained Variance')
plt.axhline(y = 90, color='c', linestyle='--', label = '90% Explained Variance')
plt.axhline(y = 85, color='r', linestyle='--', label = '85% Explained Variance')
plt.legend(loc='best')
plt.show()

The idea of going from 784 components to 330 is to reduce the running time of a supervised learning algorithm (in this case logistic regression), as we will see in the next steps. One of the cool things about PCA is that we can go from the compressed representation (330 components) back to an approximation of the original high-dimensional data (784 components).

6. Apply Logistic Regression to the Transformed Data

Step 1: Import the model you want to use


from sklearn.linear_model import LogisticRegression

Step 2: Make an instance of the Model


#Parameters which are not specified are set to their defaults 
#Default solver is incredibly slow which is why I am using 'lbfgs'
logisticRegr = LogisticRegression(solver = 'lbfgs')#Limited Memory Broyden–Fletcher–Goldfarb–Shanno
logisticRegr_pca=LogisticRegression(solver = 'lbfgs')

In numerical optimization, the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is an iterative method for solving unconstrained nonlinear optimization problems.

Step 3: Training the model on the data, storing the information learned from the data


#fitting logistic regression for original data
logisticRegr.fit(train_img, train_lbl)

#fitting logistic regression for pca_transformed data
logisticRegr_pca.fit(transformed_train_img, train_lbl)

Step 4: Predict the labels of new data (new images)

This uses the information the model learned during the training process. The code below predicts the label for one observation:
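As a minimal sketch (assuming the fitted models and the scaled test_img from the earlier steps), we first transform the test images with the PCA fitted on the training set, then predict the first test image with both models:

#transform the test images with the PCA fitted on the training set
transformed_test_img = pca.transform(test_img)

#predict the label of the first test image with both models
print(logisticRegr.predict(test_img[0].reshape(1, -1)))                  #original 784 features
print(logisticRegr_pca.predict(transformed_test_img[0].reshape(1, -1)))  #PCA-transformed features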

7. Measuring Model Performance
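Below is a minimal sketch of how this can be done, assuming the variables from the previous steps (including transformed_test_img from the prediction sketch above): score() returns the mean accuracy on the test set, and Python's time module can be used to compare the fitting time on the original versus the PCA-transformed data. The exact accuracies and timings will depend on your machine and scikit-learn version.

import time

#accuracy on the test set (mean accuracy returned by score())
print('Accuracy without PCA :', logisticRegr.score(test_img, test_lbl))
print('Accuracy with PCA    :', logisticRegr_pca.score(transformed_test_img, test_lbl))

#compare the time taken to fit logistic regression on both data sets
start = time.time()
LogisticRegression(solver='lbfgs').fit(train_img, train_lbl)
print('Fit time without PCA :', time.time() - start, 'seconds')

start = time.time()
LogisticRegression(solver='lbfgs').fit(transformed_train_img, train_lbl)
print('Fit time with PCA    :', time.time() - start, 'seconds')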

8. Image reconstruction

As mentioned above, one of the cool things about PCA is that we can go from the compressed representation (330 components) back to an approximation of the original high-dimensional data (784 components).

 
#project the full data set (standardized with the scaler fitted earlier) onto the 330 components
lower_dimensional_data = pca.transform(scaler.transform(mnist.data))
#map the compressed representation back to the original 784-dimensional pixel space
approximation = scaler.inverse_transform(pca.inverse_transform(lower_dimensional_data))
plt.figure(figsize=(8,4));
#Original Image
plt.subplot(1, 2, 1);
plt.imshow(mnist.data[1].reshape(28,28),
              cmap = plt.cm.gray, interpolation='nearest',
              clim=(0, 255));
plt.xlabel('784 components', fontsize = 14)
plt.title('Original Image', fontsize = 20);
#Reconstruction from 330 principal components
plt.subplot(1, 2, 2);
plt.imshow(approximation[1].reshape(28, 28),
              cmap = plt.cm.gray, interpolation='nearest',
              clim=(0, 255));
plt.xlabel('330 components', fontsize = 14)
plt.title('95% of Explained Variance', fontsize = 20);

I hope this has given you a good understanding of how to use PCA to speed up machine learning algorithms.

Feel free to contact us for more details and discussions.

