Step by Step Approach to Principal Component Analysis using Python

Principal Component Analysis (PCA) is used for dimensionality reduction of large data sets. In my previous post, A Complete Guide to Principal Component Analysis – PCA in Machine Learning, I explained what PCA is and the complete concept behind the technique. This post continues from that one. If you already have a basic understanding of how PCA works, you may continue; otherwise it is highly recommended to go through the above-mentioned post first.

In this post I will walk through the following steps on the ‘MNIST original’ data set.

Download and load the ‘MNIST original’ Data set.
Split data into Train and Test sets
Standardize the data
How to compute Principal Components and Transform the data
Plot explained variance as a function of number of Components
Apply Logistic Regression to the Transformed Data
Measure model performance on the PCA-transformed data and on the original data (without PCA)
Compare the time taken to fit the Logistic Regression model on the PCA-transformed data and on the original data

1. Download and load the ‘MNIST original’ Data set

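# Note: fetch_mldata was removed in scikit-learn 0.22; on newer versions,
# fetch_openml('mnist_784', version=1) loads the same data.
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')
mnist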

# This is an image data set: each row holds the 784 pixel values of a 28x28 digit.
import pandas as pd
mnist_df = pd.DataFrame(mnist.data).describe()
mnist_df_label = pd.DataFrame(mnist.target).describe()

2. Split Data into Training and Test Sets

from sklearn.model_selection import train_test_split
# Hold out 1/7 of the images as the test set
train_img, test_img, train_lbl, test_lbl = train_test_split(
    mnist.data, mnist.target, test_size=1/7.0, random_state=0)

3. Standardize the Data

Standardization re-scales the features so that each one has a mean of zero and a standard deviation of one, the properties of a standard normal distribution.
In PCA we are interested in the components that maximize the variance. If one feature (e.g. human height) varies less than another (e.g. weight) only because of their respective scales (meters vs. kilos), PCA might determine that the direction of maximal variance more closely corresponds with the ‘weight’ axis if those features are not scaled.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(train_img) #fit only on training set
#Apply transformation on both training and test set
train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)
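
To make the height and weight example above concrete, here is a minimal sketch on synthetic data. The numbers (means, spreads, sample size) are made up for illustration and are not part of the MNIST workflow; the sketch only shows how an unscaled feature can dominate the first principal component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic, independent features on very different scales (made-up numbers).
rng = np.random.default_rng(0)
height_m = rng.normal(1.70, 0.10, size=500)   # spread of about 0.1 m
weight_kg = rng.normal(70.0, 10.0, size=500)  # spread of about 10 kg
X = np.column_stack([height_m, weight_kg])

# Without scaling, the first principal component points almost entirely
# along the weight axis, simply because weight has the larger numbers.
raw_pc1 = PCA(n_components=1).fit(X).components_[0]

# After standardization both features have mean 0 and standard deviation 1,
# so neither dominates and the variance is shared roughly equally.
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print("PC1 on raw data (height, weight):", np.round(raw_pc1, 3))
print("Explained variance ratio after scaling:", np.round(scaled_ratio, 3))
```

On the raw data the weight entry of the first component should be close to 1 in absolute value, while after scaling the two components should explain similar shares of the variance.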

4. Compute Principal Components

I will retain 95% of the variance, so I pass .95 as the number-of-components parameter to PCA. With a value between 0 and 1, PCA keeps the smallest number of components whose combined explained variance reaches 95%.

from sklearn.decomposition import PCA
pca = PCA(.95)
pca.fit(train_img)
transformed_train_img = pca.transform(train_img)

Now we can find out how many components were kept to cover 95% of the variance using pca.n_components_ as follows:

print(pca.n_components_)
# Outputs 330 components

We can check the variance explained by each principal component using pca.explained_variance_:

print(pca.explained_variance_)
# Prints the variance of each retained component, largest first.

print(pca.explained_variance_ratio_)
# Prints each component's share of the total variance.

cum = pca.explained_variance_ratio_.cumsum()
print(cum)
# Prints the cumulative share of variance explained as components are added.
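
The link between the cumulative sum and the number of components PCA keeps can be checked by hand. The sketch below is an illustration only: it assumes scikit-learn's small bundled digits data set (8x8 images, 64 features) as a stand-in for MNIST so that it runs quickly and offline, and the names full_pca, k_manual and pca_95 are introduced here for the example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data: 8x8 digit images (64 features) bundled with scikit-learn.
X = StandardScaler().fit_transform(load_digits().data)

# Fit PCA with all components and accumulate the explained variance ratios.
full_pca = PCA().fit(X)
cum = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
k_manual = int(np.argmax(cum >= 0.95)) + 1

# Let PCA make the same choice by passing the target variance directly.
pca_95 = PCA(0.95).fit(X)

print("components chosen by hand:", k_manual)
print("components chosen by PCA(0.95):", pca_95.n_components_)
```

Both counts should agree, which is why passing .95 saves you from inspecting the cumulative sum yourself.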

5. Plot explained variance as a function of number of Components

%matplotlib notebook
# The line above is a Jupyter magic; omit it when running outside a notebook.
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(8, 4))
# Cumulative explained variance, in percent
plt.plot(np.cumsum(pca.explained_variance_ratio_ * 100))

plt.xlim(0, 400)
plt.ylim(0, 100)

plt.title('Cumulative Explained Variance as a Function of the Number of Components')
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
# Plot the explained-variance thresholds superimposed on the curve
plt.axhline(y=95, color='k', linestyle='--', label='95% Explained Variance')
plt.axhline(y=90, color='c', linestyle='--', label='90% Explained Variance')
plt.axhline(y=85, color='r', linestyle='--', label='85% Explained Variance')
plt.legend(loc='best')
plt.show()

The idea behind going from 784 features to 330 components is to reduce the running time of a supervised learning algorithm (in this case logistic regression), which we will see in the next steps. One of the cool things about PCA is that we can go from a compressed representation (330 components) back to an approximation of the original high-dimensional data (784 features).

6. Apply Logistic Regression to the Transformed Data

Step 1: Import the model you want to use


from sklearn.linear_model import LogisticRegression

Step 2: Make an instance of the Model


# Parameters which are not specified are set to their defaults.
# In older scikit-learn versions the default solver (liblinear) is incredibly slow
# on this data, which is why I am using 'lbfgs' (the default since version 0.22).
logisticRegr = LogisticRegression(solver='lbfgs')  # Limited-memory Broyden–Fletcher–Goldfarb–Shanno
logisticRegr_pca = LogisticRegression(solver='lbfgs')

In numerical optimization, the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is an iterative method for solving unconstrained nonlinear optimization problems; 'lbfgs' is its limited-memory variant.

Step 3: Training the model on the data, storing the information learned from the data


#fitting logistic regression for original data
logisticRegr.fit(train_img, train_lbl)

#fitting logistic regression for pca_transformed data
logisticRegr_pca.fit(transformed_train_img, train_lbl)

Step 4: Predict the labels of new data (new images)

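Prediction uses the information the model learned during training. The model can predict for one observation or for many at once. For the PCA model, the test images must first pass through the same pca.transform that was applied to the training images.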
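
A minimal sketch of predicting for one observation and for several is shown here. It is an illustration under stated assumptions rather than the post's original code: it uses scikit-learn's small bundled digits data set as a stand-in for MNIST so that it runs quickly, introduces the name transformed_test_img for the PCA-transformed test set, and sets max_iter=1000 (not in the original) to give the solver room to converge.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in for MNIST: 8x8 digit images bundled with scikit-learn.
digits = load_digits()
train_img, test_img, train_lbl, test_lbl = train_test_split(
    digits.data, digits.target, test_size=1/7.0, random_state=0)

# Fit the scaler and PCA on the training set only, then apply both to the test set.
scaler = StandardScaler().fit(train_img)
pca = PCA(.95).fit(scaler.transform(train_img))
transformed_train_img = pca.transform(scaler.transform(train_img))
transformed_test_img = pca.transform(scaler.transform(test_img))

# max_iter=1000 is an assumption added for this sketch.
logisticRegr_pca = LogisticRegression(solver='lbfgs', max_iter=1000)
logisticRegr_pca.fit(transformed_train_img, train_lbl)

# predict expects a 2-D array, so a single image is passed as a one-row slice.
one_prediction = logisticRegr_pca.predict(transformed_test_img[0:1])
many_predictions = logisticRegr_pca.predict(transformed_test_img[0:10])

print("one observation:", one_prediction)
print("first ten observations:", many_predictions)
```

Slicing with [0:1] rather than indexing with [0] keeps the input two-dimensional, which is the shape predict expects.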

7. Measuring Model Performance
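
One way to compare accuracy and fitting time with and without PCA is sketched below. It is a hedged illustration, not the post's original measurement: it uses scikit-learn's small bundled digits data set as a stand-in for MNIST, the helper fit_and_score is a hypothetical name introduced here, and max_iter=1000 is an added assumption. Timings on such a small data set may not reflect what you would see on the full 784-feature MNIST data.

```python
import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in for MNIST: 8x8 digit images bundled with scikit-learn.
digits = load_digits()
train_img, test_img, train_lbl, test_lbl = train_test_split(
    digits.data, digits.target, test_size=1/7.0, random_state=0)

# Standardize using statistics from the training set only.
scaler = StandardScaler().fit(train_img)
train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)

# Keep the components that explain 95% of the variance.
pca = PCA(.95).fit(train_img)


def fit_and_score(train_X, test_X):
    """Fit logistic regression; return (test accuracy, seconds spent fitting)."""
    model = LogisticRegression(solver='lbfgs', max_iter=1000)
    start = time.perf_counter()
    model.fit(train_X, train_lbl)
    elapsed = time.perf_counter() - start
    return model.score(test_X, test_lbl), elapsed


acc_original, secs_original = fit_and_score(train_img, test_img)
acc_pca, secs_pca = fit_and_score(pca.transform(train_img), pca.transform(test_img))

print("original data: accuracy %.3f, fit time %.3f s" % (acc_original, secs_original))
print("PCA data:      accuracy %.3f, fit time %.3f s" % (acc_pca, secs_pca))
```

The score method returns the mean accuracy on the given test data, so the two accuracy values can be compared directly.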

8. Image reconstruction

One of the cool things about PCA is that we can go from a compressed representation back to an approximation of the original high-dimensional data (784 pixel values). The code below refits PCA on the raw, unstandardized pixel data, so the number of components it keeps need not match the 330 found on the standardized training set; the plot label reads the actual count from pca.n_components_.

import matplotlib.pyplot as plt

# Refit PCA (still keeping 95% of the variance) on the raw pixel data
lower_dimensional_data = pca.fit_transform(mnist.data)
approximation = pca.inverse_transform(lower_dimensional_data)

plt.figure(figsize=(8, 4))

# Original image
plt.subplot(1, 2, 1)
plt.imshow(mnist.data[1].reshape(28, 28),
           cmap=plt.cm.gray, interpolation='nearest',
           clim=(0, 255))
plt.xlabel('784 components', fontsize=14)
plt.title('Original Image', fontsize=20)

# Reconstruction from the retained principal components (noted as 154 for the raw data)
plt.subplot(1, 2, 2)
plt.imshow(approximation[1].reshape(28, 28),
           cmap=plt.cm.gray, interpolation='nearest',
           clim=(0, 255))
plt.xlabel('%d components' % pca.n_components_, fontsize=14)
plt.title('95% of Explained Variance', fontsize=20)
plt.show()
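
What inverse_transform does can be written out by hand: with the default settings (no whitening), it multiplies the compressed data by the component matrix and adds the feature means back. The sketch below checks this on scikit-learn's small bundled digits data set, used here only as a stand-in for MNIST; the names Z, manual and mse are introduced for the example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data: raw 8x8 digit images with pixel values from 0 to 16.
X = load_digits().data

# Keep the components that explain 95% of the variance.
pca = PCA(.95).fit(X)
Z = pca.transform(X)                      # compressed representation
approximation = pca.inverse_transform(Z)  # back to the original 64 features

# The same reconstruction written out: project back and add the mean.
manual = Z @ pca.components_ + pca.mean_

# Mean squared reconstruction error per pixel.
mse = np.mean((X - approximation) ** 2)

print("compressed shape:", Z.shape, "reconstructed shape:", approximation.shape)
print("matches manual formula:", np.allclose(approximation, manual))
print("mean squared reconstruction error:", round(float(mse), 4))
```

The reconstruction is an approximation because the discarded components carried the remaining share of the variance.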

I hope this has given you a good understanding of how to use PCA to speed up machine learning algorithms.

Feel free to contact us for more details and discussions.
