## Step by Step Approach to Principal Component Analysis using Python

Principal Component Analysis (PCA) is used for dimensionality reduction of large data sets. In my previous post, A Complete Guide to Principal Component Analysis – PCA in Machine Learning, I explained what PCA is and the complete concept behind the technique. This post continues from that one. If you already have a basic understanding of how PCA works, you can carry on; otherwise I highly recommend going through the previous post first.

In this post I will walk through the following steps on the ‘MNIST original’ data set.

• Split data into Train and Test sets
• Standardize the data
• Compute Principal Components and transform the data
• Plot explained variance as a function of the number of Components
• Apply Logistic Regression to the transformed data
• Measure model performance with the PCA-transformed data and with the original data (without PCA)
• Compare the time taken to fit the Logistic Regression model on the PCA-transformed data and on the original data

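```
# fetch_mldata has been removed from recent scikit-learn releases;
# fetch_openml('mnist_784') downloads the same 70,000 MNIST images.
# Note: with fetch_openml, mnist.target holds the digit labels as strings ('0'..'9').
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
mnist
```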
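```
# This is an image data set: each row is a 28x28 image flattened to 784 pixel values.
import pandas as pd
mnist_df = pd.DataFrame(mnist.data).describe()          # summary statistics of the pixels
mnist_df_label = pd.DataFrame(mnist.target).describe()  # summary statistics of the labels
```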

## 2. Split Data into Training and Test Sets

```
from sklearn.model_selection import train_test_split
# Hold out 1/7 of the images as the test set; random_state makes the split reproducible
train_img, test_img, train_lbl, test_lbl = train_test_split(
    mnist.data, mnist.target, test_size=1/7.0, random_state=0)
```
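As a quick sanity check of how the split behaves, here is a minimal sketch on scikit-learn's small bundled digits data set. Using `load_digits` instead of MNIST is an assumption made only so the example runs without a download. It shows that the two parts together cover every sample and that fixing `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Small 8x8 digits data set bundled with scikit-learn (a stand-in for MNIST)
digits = load_digits()

X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=1/7.0, random_state=0)

# Same arguments and the same random_state give exactly the same split
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    digits.data, digits.target, test_size=1/7.0, random_state=0)

print(X_train.shape, X_test.shape)
print("reproducible:", np.array_equal(X_test, X_test2))
```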

## 3. Standardize the Data

• Standardization involves re-scaling the features such that they have the properties of a standard normal distribution with a mean of zero and a standard deviation of one.
• In PCA we are interested in the components that maximize the variance. If one feature (e.g. human height) varies less than another (e.g. weight) only because of their respective scales (meters vs. kilos), PCA might determine that the direction of maximal variance corresponds more closely with the ‘weight’ axis if those features are not scaled.

```
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(train_img)  # fit only on the training set
# Apply the same transformation to both the training and the test set
train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)
```
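To make the height and weight argument above concrete, here is a small sketch on synthetic data. The numbers used to generate heights and weights are hypothetical and chosen only for illustration. It compares the first principal component before and after standardization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical synthetic data: height in meters, weight in kilos
height_m = rng.normal(1.70, 0.10, size=500)
weight_kg = 70.0 + 60.0 * (height_m - 1.70) + rng.normal(0.0, 8.0, size=500)
X = np.column_stack([height_m, weight_kg])

# First principal component on the raw data: dominated by the large-scale feature
raw_pc1 = PCA(n_components=1).fit(X).components_[0]

# First principal component after standardization: both features contribute equally
X_scaled = StandardScaler().fit_transform(X)
scaled_pc1 = PCA(n_components=1).fit(X_scaled).components_[0]

print("raw loadings    (height, weight):", np.round(raw_pc1, 3))
print("scaled loadings (height, weight):", np.round(scaled_pc1, 3))
```

On the raw data the loading on weight is close to 1 in absolute value simply because kilos span a wider numeric range than meters. After standardization the two loadings have the same magnitude.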

## 4. Compute Principal Components

I will keep 95% of the variance, so I pass .95 as the number-of-components parameter to PCA. scikit-learn then selects the number of components needed for the cumulative explained variance to reach 95%.

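```
from sklearn.decomposition import PCA
pca = PCA(.95)      # keep enough components to explain 95% of the variance
pca.fit(train_img)  # learn the components from the training set only
transformed_train_img = pca.transform(train_img)
```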

Now we can find out how many components were needed to reach 95% of the variance using pca.n_components_ as follows:

```
print(pca.n_components_)
# this outputs 330 components
```

We can check the variance explained by each of the principal components using pca.explained_variance_:

```
print(pca.explained_variance_)
# Variance captured by each principal component, largest first; run the code to see the complete output.
```
```
print(pca.explained_variance_ratio_)
# Fraction of the total variance captured by each component; run the code to see the complete output.
```
```
cum = pca.explained_variance_ratio_.cumsum()
print(cum)
# Cumulative fraction of variance explained by the first k components; run the code to see the complete output.
```
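The cumulative sum above is also what ties `PCA(.95)` to a concrete number of components. The following sketch, which uses scikit-learn's small bundled digits data set as a stand-in for MNIST (an assumption, to keep it fast), counts the components by hand and compares the result with what scikit-learn selects.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_digits().data)

# Fit with all components and accumulate the explained variance ratios
full_pca = PCA(svd_solver="full").fit(X)
cum = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 0.95
k_manual = int(np.searchsorted(cum, 0.95) + 1)

# Let scikit-learn choose the number of components for the same threshold
auto_pca = PCA(n_components=0.95, svd_solver="full").fit(X)

print("counted by hand:", k_manual)
print("chosen by PCA(0.95):", auto_pca.n_components_)
```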

## 5. Plot explained variance as a function of number of Components

```
%matplotlib notebook
# (the line above is a Jupyter magic; remove it when running as a plain script)
import numpy as np
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 4))
plt.plot(np.cumsum(pca.explained_variance_ratio_ * 100))

plt.xlim(0, 400)
plt.ylim(0, 100)

plt.title('Cumulative Explained Variance as a Function of the Number of Components')
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
# Superimpose reference lines for the 95%, 90% and 85% explained-variance levels
plt.axhline(y=95, color='k', linestyle='--', label='95% Explained Variance')
plt.axhline(y=90, color='c', linestyle='--', label='90% Explained Variance')
plt.axhline(y=85, color='r', linestyle='--', label='85% Explained Variance')
plt.legend(loc='best')
plt.show()
```

The idea behind going from 784 features to 330 components is to reduce the running time of a supervised learning algorithm (in this case logistic regression), which we will see in the next steps. One of the cool things about PCA is that we can go from a compressed representation (330 components) back to an approximation of the original high-dimensional data (784 features).

## 6. Apply Logistic Regression to the Transformed Data

#### Step 1: Import the model you want to use

```
from sklearn.linear_model import LogisticRegression

```

#### Step 2: Make an instance of the Model

```
# Parameters which are not specified are set to their defaults.
# The 'liblinear' solver (the default before scikit-learn 0.22) is very slow on this data,
# which is why I am using 'lbfgs' (limited-memory Broyden–Fletcher–Goldfarb–Shanno)
logisticRegr = LogisticRegression(solver='lbfgs')      # model for the original data
logisticRegr_pca = LogisticRegression(solver='lbfgs')  # model for the PCA-transformed data
```

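In numerical optimization, the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is an iterative method for solving unconstrained nonlinear optimization problems. The 'lbfgs' solver uses its limited-memory variant (L-BFGS), which keeps only a few recent updates instead of a full approximation of the Hessian.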

#### Step 3: Training the model on the data, storing the information learned from the data

```
#fitting logistic regression for original data
logisticRegr.fit(train_img, train_lbl)

#fitting logistic regression for pca_transformed data
logisticRegr_pca.fit(transformed_train_img, train_lbl)
```
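To compare how long the two fits take, one option is to wrap each fit in a timer. The sketch below is only an illustration under stated assumptions: it uses scikit-learn's small bundled digits data set instead of MNIST so that it runs quickly, `fit_seconds` is a hypothetical helper, and `max_iter=1000` is my own choice to give the solver room to converge. Timings depend on the machine and the data set, so no particular result is implied.

```python
import time
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=1/7.0, random_state=0)

# Standardize, then project onto the components that explain 95% of the variance
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
pca = PCA(0.95).fit(X_train_std)
X_train_pca = pca.transform(X_train_std)

def fit_seconds(features, labels):
    """Hypothetical helper: return the wall-clock time of one model fit."""
    model = LogisticRegression(solver="lbfgs", max_iter=1000)
    start = time.perf_counter()
    model.fit(features, labels)
    return time.perf_counter() - start

t_original = fit_seconds(X_train_std, y_train)
t_pca = fit_seconds(X_train_pca, y_train)
print(f"original: {X_train_std.shape[1]} features, {t_original:.3f} s")
print(f"PCA:      {X_train_pca.shape[1]} components, {t_pca:.3f} s")
```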

#### Step 4: Predict the labels of new data (new images)

This step uses the information the model learned during training to predict the labels of new images.
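A minimal sketch of prediction and of measuring accuracy with and without PCA might look like the following. It again uses scikit-learn's bundled digits data set rather than MNIST (an assumption, so that it runs in seconds), and the variable names are illustrative rather than those used elsewhere in this post.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=1/7.0, random_state=0)

# Fit the scaler and PCA on the training set only, then apply both to the test set
scaler = StandardScaler().fit(X_train)
X_train_std, X_test_std = scaler.transform(X_train), scaler.transform(X_test)
pca = PCA(0.95).fit(X_train_std)
X_train_pca, X_test_pca = pca.transform(X_train_std), pca.transform(X_test_std)

model = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X_train_std, y_train)
model_pca = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X_train_pca, y_train)

# Predict one observation: predict expects a 2D array, hence the reshape
one_pred = model_pca.predict(X_test_pca[0].reshape(1, -1))
print("predicted:", one_pred[0], "actual:", y_test[0])

# Mean accuracy on the held-out test set, with and without PCA
acc_original = model.score(X_test_std, y_test)
acc_pca = model_pca.score(X_test_pca, y_test)
print(f"accuracy without PCA: {acc_original:.3f}")
print(f"accuracy with PCA:    {acc_pca:.3f}")
```

The test images must go through the same scaler and the same PCA that were fitted on the training images; fitting either of them again on the test set would leak information and change the feature space.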

## 8. Image reconstruction

As noted earlier, PCA lets us go from the compressed representation back to an approximation of the original high-dimensional data (784 pixel values per image). The code below reconstructs an image from its principal components and plots it next to the original.

```
# Note: this refits PCA on the raw (unstandardized) pixels so the images stay on the 0-255 scale
lower_dimensional_data = pca.fit_transform(mnist.data)
approximation = pca.inverse_transform(lower_dimensional_data)

plt.figure(figsize=(8, 4))
# Original image (first image in the data set)
plt.subplot(1, 2, 1)
plt.imshow(mnist.data[0].reshape(28, 28),
           cmap=plt.cm.gray, interpolation='nearest',
           clim=(0, 255))
plt.xlabel('784 components', fontsize=14)
plt.title('Original Image', fontsize=20)
# Reconstruction of the same image from the retained principal components
plt.subplot(1, 2, 2)
plt.imshow(approximation[0].reshape(28, 28),
           cmap=plt.cm.gray, interpolation='nearest',
           clim=(0, 255))
plt.xlabel(f'{pca.n_components_} components', fontsize=14)
plt.title('95% of Explained Variance', fontsize=20)
```
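The quality of the reconstruction can also be checked numerically. The sketch below, on scikit-learn's bundled digits data set (an assumption, to avoid downloading MNIST), measures how much of the total variance survives the round trip through `transform` and `inverse_transform` and compares it with the sum of `explained_variance_ratio_`.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data

# Compress to the components explaining 95% of the variance, then reconstruct
pca = PCA(n_components=0.95, svd_solver="full")
X_low = pca.fit_transform(X)
X_approx = pca.inverse_transform(X_low)

# Share of the total variance that survives the round trip
total = np.sum((X - X.mean(axis=0)) ** 2)
lost = np.sum((X - X_approx) ** 2)
retained = 1.0 - lost / total

print(X.shape, "->", X_low.shape, "->", X_approx.shape)
print(f"retained variance: {retained:.4f}")
print(f"sum of explained_variance_ratio_: {pca.explained_variance_ratio_.sum():.4f}")
```

The two printed fractions agree because the squared reconstruction error of PCA is exactly the variance carried by the discarded components.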

I hope this has given you a good understanding of how to use PCA to speed up machine learning algorithms.
