# Data Visualization using plotly, matplotlib, seaborn and squarify | Data Science

Data visualization is one of the most important activities in Exploratory Data Analysis. It also helps in preparing business reports, visual dashboards, and data storytelling. In this post I explain how to ask questions of a data set and answer them with self-explanatory graphs. You will learn how to use Python libraries such as plotly, matplotlib, seaborn, and squarify to plot those graphs.

Key takeaways from this post are:

• Asking questions from data set
• Univariate Analysis
• Bivariate Analysis
• Analysis of more than 3 variables
• 3D Visualization
• Case Study on employee Attrition Rate using HR Data Set
```import warnings
warnings.filterwarnings('ignore')
!pip install plotly
!pip install squarify
```
```import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import plotly
import plotly.offline as pyoff
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode, iplot, plot
import plotly.graph_objs as go
import squarify # for tree maps
%matplotlib inline
```

## plotly

• Modern Visualization for the data Era

#### Line Chart in plotly

• 2 numeric variables with 1-1 mapping, i.e., situations where each x value has exactly one corresponding y value
```x=[1, 2, 3]
y=[3, 1, 6]
iplot([go.Scatter(x=x,
y=y,
text = [str(i) for i in (zip(x,y))],
textposition = 'top center')])
```
###### In offline mode, figures can be exported only as an HTML file
```from plotly.offline import plot
plot([go.Scatter(x=x,
y=y,
text = [str(i) for i in (zip(x,y))],
textposition = 'top center')],
output_type='file' ,
filename='temp-histogram.jpeg',image='jpeg',auto_open=False)
```
`output -> 'temp-histogram.jpeg.html'`

Note that this is a bare chart with no annotations. Later in the activity we will add a title and x and y axis labels.

#### Basic Bar chart in plotly

• 1 Categorical variable
```data = [go.Bar(
x=['x', 'y', 'z'],
y=[10, 20, 15])]
iplot(data)
```

#### Histogram in plotly

• 1 numeric variable
```n = 1000
x = np.random.randn(n)
data = [go.Histogram(x=x,
                     marker=dict(
                         # a color can be given in HEX or rgb format, e.g. 'rgb(200,0,0)';
                         # HEX is generally preferred because it is a single string that is easy to pass as a variable
                         color='#CC0E1D'  # Lava (#CC0E1D)
                     ))]
layout = go.Layout(title = "Histogram of {} random numbers".format(n))
fig = go.Figure(data= data, layout=layout)
iplot(fig)
```

#### Boxplot in plotly

• 1 Numeric variable
```from IPython.display import Image
Image("img/boxplot.png")
```
```np.random.seed(0) # Set seed for reproducibility
n = 10
r1 = np.random.randn(n)
r2 = np.random.randn(n)
trace0 = go.Box(
y=r1,
name = 'Box1',
marker = dict(
color = '#AA0505',
)
)
trace1 = go.Box(
y=r2,
name = 'Box2',
marker = dict(
color = '#B97D10',
)
)
data = [trace0, trace1]
layout = go.Layout(title = "Boxplot of 2 sets of random numbers")
fig = go.Figure(data= data, layout=layout)
iplot(fig)
```

#### Pie chart in plotly

• 1 Categorical variable
```labels = ["Pre processing and Visualization", "Model Building", "Misc"]
values = [80,10,10]
trace = go.Pie(labels=labels, values=values)
layout = go.Layout(title = 'Percentage of time spent on Data Science projects')
data = [trace]
fig = go.Figure(data= data,layout=layout)
iplot(fig)
```

#### Scatter plot in plotly

• 2 numeric variables
• One x might have multiple corresponding y values
```np.random.seed(0)
n = 20
x=np.random.randint(0,100,n)
y=np.random.randint(0,100,n)
data = [go.Scatter(x=x,y=y,
text = [str(i) for i in (zip(x,y))],
textposition = 'top center',
marker = dict(color = 'rgba(17, 157, 255, 0.8)',
size = 10), mode = 'markers')]
layout = go.Layout(title = 'Scatter plot')
fig = go.Figure(data= data,layout=layout)
iplot(fig)
```

### Tree map

https://plot.ly/python/treemaps/

```squarify.plot(sizes=[13,22,35,5], label=["group A", "group B", "group C",
"group D"], alpha=.7 )
plt.show()
```

### Heatmap

```np.random.rand(2, 2)
```
```# trace = go.Heatmap(z=[[1, 20], [22, 1]], x=['Monday', 'Tuesday'],y=['Morning', 'Afternoon'])
# data=[trace]
# iplot(data)
sns.heatmap(np.random.rand(2, 2))
```

# Case Study

### hr_data Description

Education 1 ‘Below College’ 2 ‘College’ 3 ‘Bachelor’ 4 ‘Master’ 5 ‘Doctor’

EnvironmentSatisfaction 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

JobInvolvement 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

JobSatisfaction 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

PerformanceRating 1 ‘Low’ 2 ‘Good’ 3 ‘Excellent’ 4 ‘Outstanding’

RelationshipSatisfaction 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

WorkLifeBalance 1 ‘Bad’ 2 ‘Good’ 3 ‘Better’ 4 ‘Best’

```hr_data = pd.read_csv("HR_Attrition.csv")
```

## Pre-processing

```hr_data.head()
```

## Checking the number of unique values in each column

```for i in hr_data.columns:
    print("Number of unique values in {} column are {} \nThe unique values are {}".format(
        i, len(hr_data[i].unique()), hr_data[i].unique()))
    print("---------------------- \n")
```
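
A more compact way to get the same counts is pandas' `DataFrame.nunique`. The sketch below runs on a small made-up frame; `toy` and its values are illustrative stand-ins, not the HR data.

```python
import pandas as pd

# Hypothetical toy frame standing in for hr_data
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Education": [1, 3, 3, 5],
    "Age": [29, 41, 35, 52],
})

# Number of distinct values per column, as a Series indexed by column name
unique_counts = toy.nunique()
print(unique_counts)

# Distinct values of each column, collected in a plain dict
unique_values = {col: sorted(toy[col].unique().tolist()) for col in toy.columns}
print(unique_values)
```

Columns with a small count in `unique_counts` are the natural candidates for categorical treatment.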

## Observations:

```- Most columns have fewer than 4 unique levels
- NumCompaniesWorked and PercentSalaryHike have fewer than 15 unique values, so we can convert them into categorical variables for analysis purposes.
This is fairly subjective; you can also continue with these as integer values.

```
##### Replacing the integer codes with the values in the description
• hr_data.Education = hr_data.Education.replace(to_replace=[1,2,3,4,5],value=['Below College', 'College', 'Bachelor', 'Master', 'Doctor'])
• hr_data.EnvironmentSatisfaction = hr_data.EnvironmentSatisfaction.replace(to_replace=[1,2,3,4],value=['Low', 'Medium', 'High', 'Very High'])
• hr_data.JobInvolvement = hr_data.JobInvolvement.replace(to_replace=[1,2,3,4],value=['Low', 'Medium', 'High', 'Very High'])
• hr_data.JobSatisfaction = hr_data.JobSatisfaction.replace(to_replace=[1,2,3,4],value=['Low', 'Medium', 'High', 'Very High'])
• hr_data.PerformanceRating = hr_data.PerformanceRating.replace(to_replace=[1,2,3,4],value=['Low', 'Good', 'Excellent', 'Outstanding'])
• hr_data.RelationshipSatisfaction = hr_data.RelationshipSatisfaction.replace(to_replace=[1,2,3,4],value=['Low', 'Medium', 'High', 'Very High'])
• hr_data.WorkLifeBalance = hr_data.WorkLifeBalance.replace(to_replace=[1,2,3,4],value=['Bad', 'Good', 'Better', 'Best'])
```Education_dict = {1:'Below College',
2:'College',
3:'Bachelor',
4:'Master',
5:'Doctor',
}
EnvironmentSatisfaction_dict = {1:'Low',
2:'Medium',
3:'High',
4:'Very High',
}
JobInvolvement_dict = {1:'Low',
2:'Medium',
3:'High',
4:'Very High',
}
JobSatisfaction_dict = {1:'Low',
2:'Medium',
3:'High',
4:'Very High',
}
PerformanceRating_dict = {1:'Low',
2:'Good',
3:'Excellent',
4:'Outstanding',
}
RelationshipSatisfaction_dict = {1:'Low',
2:'Medium',
3:'High',
4:'Very High',
}
WorkLifeBalance_dict = {1:'Bad',
2:'Good',
3:'Better',
4:'Best',
}
```
```hr_data = hr_data.replace({
"Education":Education_dict,
"EnvironmentSatisfaction":EnvironmentSatisfaction_dict,
"JobInvolvement":JobInvolvement_dict,
"JobSatisfaction":JobSatisfaction_dict,
"PerformanceRating":PerformanceRating_dict,
"RelationshipSatisfaction":RelationshipSatisfaction_dict,
"WorkLifeBalance":WorkLifeBalance_dict
})
```
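
To see what the nested-dictionary form of `DataFrame.replace` does, here is a minimal sketch on a made-up two-column frame (`toy` and `mapped` are illustrative names). Each outer key names a column, and the inner dictionary is applied to that column only.

```python
import pandas as pd

# Hypothetical toy frame with two of the coded columns
toy = pd.DataFrame({"Education": [1, 3, 5], "WorkLifeBalance": [4, 1, 2]})

# Outer keys are column names; inner dicts map integer codes to labels
mapped = toy.replace({
    "Education": {1: "Below College", 2: "College", 3: "Bachelor", 4: "Master", 5: "Doctor"},
    "WorkLifeBalance": {1: "Bad", 2: "Good", 3: "Better", 4: "Best"},
})
print(mapped)
```

Because the mapping is scoped per column, the code `1` becomes 'Below College' in Education but 'Bad' in WorkLifeBalance.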
##### Extract categorical columns

For the purpose of this analysis, we treat every column with 15 or fewer levels as categorical. The following lines of code extract all the columns that satisfy this condition.

```cat_cols = []
for i in hr_data.columns:
    # treat the column as categorical if it is of object type or has 15 or fewer levels
    if hr_data[i].dtype == 'object' or len(np.unique(hr_data[i])) <= 15:
        cat_cols.append(i)
        print("{} : {} : {} ".format(i, len(np.unique(hr_data[i])), np.unique(hr_data[i])))
```

## Check if the above columns are categorical in the data set

##### Type Conversion
• n-dimensional type conversion to 'category' was not implemented in older pandas versions, so we convert one column at a time
```for i in cat_cols:
    hr_data[i] = hr_data[i].astype('category')
```
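
One way to confirm that the conversion took effect is to inspect the dtypes afterwards. The sketch below uses a made-up frame (`toy` and its values are illustrative) and pandas' `select_dtypes` to list the categorical columns.

```python
import pandas as pd

# Hypothetical toy frame: two low-cardinality columns and one numeric column
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No"],
    "Education": [1, 3, 3],
    "MonthlyIncome": [5993, 5130, 2090],
})

# Convert the chosen columns one at a time, as in the loop above
for col in ["Attrition", "Education"]:
    toy[col] = toy[col].astype("category")

print(toy.dtypes)

# Names of all columns that now have the category dtype
categorical_cols = toy.select_dtypes(include="category").columns.tolist()
print(categorical_cols)
```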

# Exploratory Data Analysis

## Univariate Analysis

### 1. What is the attrition rate in the company?

##### Attrition in numbers (pandas)
```hr_data.Attrition.value_counts()
```
```plt.figure()
hr_data.Attrition.value_counts().plot(kind='bar',
figsize=(6,3), color="blue", alpha = 0.7, fontsize=13)
plt.title('Attrition rate (in numbers)')
plt.grid()
plt.show()
```

This is one way to tell matplotlib to plot the graphs in the notebook

## Attrition rate in percentage (pandas)

```((hr_data.Attrition.value_counts()/sum(hr_data.Attrition.value_counts()))*100).plot(
kind='bar', figsize=(6,3), color=["blue"], alpha = 0.7, fontsize=16)
plt.ylim([0,100])
plt.title('Attrition Rate (in percentage)')
plt.ylabel('Percentage Attrition',fontsize = 14)
plt.grid(True)
plt.show()
```
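
The percentages can also be obtained directly with `value_counts(normalize=True)`, which returns proportions instead of counts. A minimal sketch on a made-up series (the values are illustrative, not the HR data):

```python
import pandas as pd

# Hypothetical attrition labels: three employees stayed, one left
attrition = pd.Series(["No", "No", "No", "Yes"], name="Attrition")

# normalize=True returns proportions; multiply by 100 for percentages
pct = attrition.value_counts(normalize=True) * 100
print(pct)
```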

## plotly In percentages

```temp = hr_data.Attrition.value_counts()
trace = go.Bar(x=temp.index,
y= np.round(temp.astype(float)/temp.values.sum(),2),
text = np.round(temp.astype(float)/temp.values.sum(),2),
textposition = 'auto',
name = 'Attrition')
data = [trace]
layout = go.Layout(autosize=False, width=600, height=400,title =
"Attrition Distribution"
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)
del temp
```

## 2. What is the Gender Distribution in the company?

```temp = hr_data.Gender.value_counts()
temp
```
```data = [go.Bar(
x=temp.index,
y= np.round(temp.astype(float)/temp.values.sum(),2),
text = np.round(temp.astype(float)/temp.values.sum(),2),
textposition = 'auto',
)]
layout = go.Layout(
autosize=False,
width=600,
height=400,title = "Gender Distribution",
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)
del temp
```
```temp = hr_data.Gender.value_counts()
temp
```
###### Steps to create a bar chart with counts for a categorical variable in plotly
• create an object and store the counts (optional)
• create a bar object
• pass the x values
• pass the y values
• optional :
• text to be displayed
• text position
• color of the bar
• name of the bar (trace in plotly terminology)
• create a layout object
• title – font and size of title
• x axis – font and size of xaxis text
• y axis – font and size of yaxis text
• create a figure object:
• plot the figure object
```# create a table with value counts
temp = hr_data.Gender.value_counts()
# creating a Bar chart object of plotly
data = [go.Bar(
x=temp.index.astype(str), # x axis values
y=np.round(temp.values.astype(float)/temp.values.sum()*100, 2),
text=['{}%'.format(i) for i in
np.round(temp.values.astype(float)/temp.values.sum()*100, 2)],
textposition = 'auto', # specify at which position on the bar the text should appear
marker = dict(color = '#0047AB'),)] # change color of the bar
# color used here Cobalt Blue
# these are used to define the layout options
layout = go.Layout(
autosize=False, # auto size the graph? use False if you are specifying the height and width
width=800, # width of the figure in pixels
height=600, # height of the figure in pixels
title = "Distribution of {} column".format('Gender'), # title of the figure
# more granular control on the title font
titlefont=dict(
family='Courier New, monospace', # font family
size=16, # size of the font
color='black' # color of the font
),
# granular control on the axes objects
xaxis=dict(
tickfont=dict(
family='Courier New, monospace', # font family
size=16, # size of ticks displayed on the x axis
color='black' # color of the font
)
),
yaxis=dict(
title='Percentage',
titlefont=dict(
size=16,
color='black'
),
tickfont=dict(
family='Courier New, monospace', # font family
size=16, # size of ticks displayed on the y axis
color='black' # color of the font
)
),
font = dict(
family='Courier New, monospace', # font family
color = "white",# color of the font
size = 12 # size of the font displayed on the bar
)
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)
del temp
```

## We will save the above layout in an object and define a function for future use

```def generate_layout_bar(col_name):
    layout_bar = go.Layout(
        autosize=False,  # use False when specifying the height and width explicitly
        width=800,  # width of the figure in pixels
        height=600,  # height of the figure in pixels
        title="Distribution of {} column".format(col_name),  # title of the figure
        # more granular control on the title font
        titlefont=dict(
            family='Courier New, monospace',  # font family
            size=14,  # size of the font
            color='black'  # color of the font
        ),
        # granular control on the axes objects
        xaxis=dict(
            tickfont=dict(
                family='Courier New, monospace',  # font family
                size=14,  # size of ticks displayed on the x axis
                color='black'  # color of the font
            )
        ),
        yaxis=dict(title='Percentage', titlefont=dict(size=14, color='black'),
                   tickfont=dict(family='Courier New, monospace', size=14, color='black')),
        font=dict(family='Courier New, monospace', color="white", size=12))

    return layout_bar

```

## Defining a function to plot the bar charts

```def plot_bar(col_name):
    # create a table with value counts
    temp = hr_data[col_name].value_counts()
    # percentage share of each level, rounded to 2 decimals
    percentages = np.round(temp.values.astype(float) / temp.values.sum() * 100, 2)
    # creating a Bar chart object of plotly
    data = [go.Bar(
        x=temp.index.astype(str),
        y=percentages,
        text=['{}%'.format(i) for i in percentages],
        textposition='auto',  # position on the bar where the text should appear
        marker=dict(color='#0047AB'),)]
    layout_bar = generate_layout_bar(col_name=col_name)
    fig = go.Figure(data=data, layout=layout_bar)
    return iplot(fig)
```

## 4. Which department has the highest number of employees? (Department)

```plot_bar('Department')
```

## 5. What is the most common educational background of the employees (Education Field)

```plot_bar('EducationField')
```

## 6. In what roles are the employees working and what is the common job role? (Job Role)

```plot_bar('JobRole')
```

## 7. Is the workforce in the company young? (Age)

```plot_bar('Age')
```

Age is a continuous variable, so it makes more sense to plot a histogram rather than a bar chart.

## Histogram

```data = [go.Histogram(x=hr_data.Age,marker=dict(color='#CC0E1D'))]
layout = go.Layout(title = "Histogram of Age")
fig = go.Figure(data= data, layout=layout)
iplot(fig)
```

## 8. What is the income distribution in the company?(Monthly Income)

```data = [go.Histogram(x=hr_data.MonthlyIncome,
marker=dict(
color='#CC0E1D'))]
layout = go.Layout(title = "Histogram of Income")
fig = go.Figure(data= data, layout=layout)
iplot(fig)
```

Observations:

```- We see that the income column has a long tailed distribution
- Binning might give better insights into the distribution
```

## Let us bin the Income column

```hr_data['Income_Bins'] = np.digitize(hr_data.MonthlyIncome,
list(range(0,hr_data.MonthlyIncome.max()+10,2500)),right=True)

list(range(0,hr_data.MonthlyIncome.max()+10,2500))

hr_data['Income_Bins'].value_counts()
```
```hr_data['Income_Bins'] = hr_data['Income_Bins'].replace(to_replace=[1,2,3,4,5,6,7,8],
value=['Bin1','Bin2','Bin3',
'Bin4','Bin5','Bin6','Bin7','Bin8'])
```
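
To make the binning step concrete, the sketch below applies the same `np.digitize` call to a handful of made-up incomes (the values are illustrative). With `right=True`, a value `x` gets bin index `i` when `edges[i-1] < x <= edges[i]`, so a value sitting exactly on an edge falls into the lower bin.

```python
import numpy as np

# Hypothetical monthly incomes
incomes = np.array([1000, 2500, 2501, 7400, 19999])

# Bin edges every 2500 units: 0, 2500, 5000, ..., 20000
edges = list(range(0, int(incomes.max()) + 10, 2500))

# right=True makes each bin closed on the right: (0, 2500], (2500, 5000], ...
bins = np.digitize(incomes, edges, right=True)
labels = ["Bin{}".format(b) for b in bins]

print(edges)
print(bins)
print(labels)
```

Here 2500 lands in bin 1 while 2501 lands in bin 2, and the highest income falls in bin 8, which matches the eight labels used in the replacement step.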
```temp = hr_data['Income_Bins'].value_counts()
temp=temp.sort_index()
```
```trace1 = go.Bar(x = temp.index,
y=np.round(temp.values.astype(float)/temp.values.sum()*100, 2),
text=['{}%'.format(i) for i in
np.round(temp.values.astype(float)/temp.values.sum()*100, 2)],
textposition = 'auto',
name = 'Income_Bins')
data = [trace1]
# these are used to define the layout options
layout = generate_layout_bar('Income_Bins')
fig = go.Figure(data=data, layout=layout)
iplot(fig)
print(list(range(0,hr_data.MonthlyIncome.max()+10,2500)))
```

## 1. Does one gender travel a greater distance than the other? (Gender and Distance from home)

```trace1 = go.Box(y = hr_data.DistanceFromHome[hr_data.Gender=='Male'],name = 'Male',
boxpoints = 'all',jitter = 1
)
# boxpoints specifies which points to plot
# jitter controls how widely the points are spread sideways
trace2 = go.Box(y = hr_data.DistanceFromHome[hr_data.Gender=='Female'],name= 'Female',
boxpoints = 'all',jitter = 1
)
data = [trace1,trace2]
layout = go.Layout(width = 1000,
height = 500,title = 'Distance from home and Gender')
fig = go.Figure(data=data,layout = layout)
iplot(fig)
```

## Distance Bins and Gender

```hr_data['Distance_Bins']=(np.digitize(
hr_data.DistanceFromHome,[0,5,15,np.max(hr_data.DistanceFromHome)],right=True))

temp = hr_data.groupby(['Distance_Bins','Gender']).size().to_frame()
temp = temp.reset_index()
temp.columns = ['Distance_Bins','Gender','Count']
temp
```
```trace1 = go.Bar(x = temp.Distance_Bins[temp.Gender=='Male'],
y = temp.Count[temp.Gender=='Male'],
text = temp.Count[temp.Gender=='Male'],
textposition = 'auto',
name = 'Male')
trace2 = go.Bar(x = temp.Distance_Bins[temp.Gender=='Female'],
y = temp.Count[temp.Gender=='Female'],
text = temp.Count[temp.Gender=='Female'],
textposition = 'auto',
name = 'Female')
data = [trace1,trace2]
layout = go.Layout(width = 700,
height = 500,title = 'Gender and Distance bins',
yaxis = dict(title='Count'))
fig = go.Figure(data=data, layout=layout)
iplot(fig)
```

## Observations:

```- Irrespective of the distance bin, there is a global pattern i.e every bin has more male employees
```

## 2. Are employees working overtime getting better ratings? (Over Time and Performance Rating.)

```temp = hr_data.groupby(['OverTime','PerformanceRating']).size().to_frame()
temp = temp.reset_index()
temp.columns = ['OverTime','PerformanceRating','Count']
temp
```
```hr_data.OverTime.value_counts()
```
```trace1 = go.Bar(x = temp.OverTime[temp.PerformanceRating=='Excellent'],
y = temp.Count[temp.PerformanceRating=='Excellent']/temp.Count.sum(),
name = 'Excellent')
trace2 = go.Bar(x = temp.OverTime[temp.PerformanceRating=='Outstanding'],
y = temp.Count[temp.PerformanceRating=='Outstanding']/temp.Count.sum(),
name = 'Outstanding')
data = [trace1,trace2]
layout = go.Layout(width = 800,
height = 600,title = 'OverTime and PerformanceRating')
fig = go.Figure(data=data, layout=layout)
iplot(fig)
fig
```

All the proportions in the chart add up to one, so we can compare the bars globally.
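
Dividing every count by the grand total answers "what share of all employees is in this cell?". To compare ratings within each overtime group instead, the counts can be normalized per group. A minimal sketch on made-up data (`toy` and its values are illustrative), using `pd.crosstab`:

```python
import pandas as pd

# Hypothetical toy data: 2 employees with overtime, 4 without
toy = pd.DataFrame({
    "OverTime": ["Yes", "Yes", "No", "No", "No", "No"],
    "PerformanceRating": ["Excellent", "Outstanding", "Excellent",
                          "Excellent", "Excellent", "Outstanding"],
})

# Global proportions: every cell divided by the grand total (sums to 1 overall)
global_share = pd.crosstab(toy.OverTime, toy.PerformanceRating, normalize="all")

# Row-wise proportions: each OverTime group sums to 1 on its own
within_group = pd.crosstab(toy.OverTime, toy.PerformanceRating, normalize="index")

print(global_share)
print(within_group)
```

The row-wise table removes the effect of group size, which matters when far fewer employees work overtime than not.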

## 3. Does working longer with a manager have any relationship with Job satisfaction? (Years With Current Manager, Job Satisfaction)

```yearscurrman_jobsat = hr_data.groupby(['YearsWithCurrManager','JobSatisfaction']).size().to_frame()
yearscurrman_jobsat = yearscurrman_jobsat.reset_index()
yearscurrman_jobsat.columns = ['YearsWithCurrManager','JobSatisfaction','Counts']

np.random.seed(0)
yearscurrman_jobsat.sample(frac =0.1)
```
```tracelow = go.Bar(
x=yearscurrman_jobsat.YearsWithCurrManager[
yearscurrman_jobsat.JobSatisfaction=='Low'],
y = yearscurrman_jobsat.Counts[
yearscurrman_jobsat.JobSatisfaction=='Low'],
text = yearscurrman_jobsat.Counts[
yearscurrman_jobsat.JobSatisfaction=='Low'],
textposition = 'auto',
name = 'Low')
tracemedium = go.Bar(
x = yearscurrman_jobsat.YearsWithCurrManager[
yearscurrman_jobsat.JobSatisfaction=='Medium'],
y = yearscurrman_jobsat.Counts[
yearscurrman_jobsat.JobSatisfaction=='Medium'],
text = yearscurrman_jobsat.Counts[
yearscurrman_jobsat.JobSatisfaction=='Medium'],
textposition = 'auto',
name = 'Medium')
traceHigh = go.Bar(x = yearscurrman_jobsat.YearsWithCurrManager[
yearscurrman_jobsat.JobSatisfaction=='High'],
y = yearscurrman_jobsat.Counts[yearscurrman_jobsat.JobSatisfaction=='High'],
text = yearscurrman_jobsat.Counts[
yearscurrman_jobsat.JobSatisfaction=='High'],
textposition = 'auto',
name = 'High')
traceVHigh = go.Bar(x = yearscurrman_jobsat.YearsWithCurrManager[
yearscurrman_jobsat.JobSatisfaction=='Very High'],
y = yearscurrman_jobsat.Counts[yearscurrman_jobsat.JobSatisfaction=='Very High'],
text = yearscurrman_jobsat.Counts[yearscurrman_jobsat.JobSatisfaction=='Very High'],
textposition = 'auto',
name = 'Very High')
data = [tracelow, tracemedium, traceHigh, traceVHigh]
layout = go.Layout(width = 1000,
barmode='stack',height=600,title='YearsWithCurrManager and Job Satisfaction',
xaxis = dict(title='YearsWithCurrManager'),yaxis=dict(title='Counts',
range=[0,yearscurrman_jobsat.Counts.max()+10]))
fig = go.Figure(data=data, layout=layout)
iplot(fig)
```

We observe that the red bars are higher than the green bars only after 2 years, so we can infer that employees generally tend to be comfortable working with their manager after 2 years.

## 4. Are married employees staying far from the office? (Marital status and Distance from home)

```hr_data.MaritalStatus.unique()
```
```hr_data.DistanceFromHome[hr_data.MaritalStatus=='Divorced'].describe()
```
```hr_data.DistanceFromHome[hr_data.MaritalStatus=='Married'].describe()
```
```hr_data.DistanceFromHome[hr_data.MaritalStatus=='Single'].describe()
```
```tracediv = go.Box(y = hr_data.DistanceFromHome[hr_data.MaritalStatus=='Divorced'],
name = 'Divorced')
tracemarried = go.Box(y = hr_data.DistanceFromHome[hr_data.MaritalStatus=='Married'],
name= 'Married')
tracesin = go.Box(y = hr_data.DistanceFromHome[hr_data.MaritalStatus=='Single'],
name= 'Single')
data = [tracediv,tracemarried,tracesin]
layout = go.Layout(width = 800,
height = 500,title = 'Distance from home and Marital Status')
fig = go.Figure(data=data,layout = layout)
iplot(fig)
```

## 5. Is there any relationship between Attrition and Gender?

```Gender_Attrition = hr_data.groupby(['Gender','Attrition']).size().to_frame()
Gender_Attrition = Gender_Attrition.reset_index()
Gender_Attrition.columns = ['Gender','Attrition','Count']
Gender_Attrition
```
```trace1 = go.Bar(x = Gender_Attrition.Gender[Gender_Attrition.Attrition=='Yes'],
y = Gender_Attrition.Count[Gender_Attrition.Attrition=='Yes'],
text = Gender_Attrition.Count[Gender_Attrition.Attrition=='Yes'],
textposition = 'auto',
name = 'Yes')
trace2 = go.Bar(x = Gender_Attrition.Gender[Gender_Attrition.Attrition=='No'],
y = Gender_Attrition.Count[Gender_Attrition.Attrition=='No'],
text = Gender_Attrition.Count[Gender_Attrition.Attrition=='No'],
textposition = 'auto',
name = 'No')
data = [trace1,trace2]
layout = go.Layout(width = 800,
height = 600,title = 'Gender and Attrition')
fig = go.Figure(data=data, layout=layout)
iplot(fig)
```

## 6. Employees who spend more years in the company tend to leave. Verify if this is true.(Years at company and Attrition)

```hr_data.YearsAtCompany[hr_data.Attrition=='Yes'].describe()
```
```hr_data.YearsAtCompany[hr_data.Attrition=='No'].describe()
```
```trace1 = go.Box(y = hr_data.YearsAtCompany[hr_data.Attrition=='Yes'],name = 'Yes',
boxpoints = 'all',jitter = 1
)
# boxpoints specifies which points to plot
# jitter controls how widely the points are spread sideways
trace2 = go.Box(y = hr_data.YearsAtCompany[hr_data.Attrition=='No'],name= 'No',
boxpoints = 'all',jitter = 1
)
data = [trace1,trace2]
layout = go.Layout(width = 800,
height = 500,title = 'YearsAtCompany and Attrition')
fig = go.Figure(data=data,layout = layout)
iplot(fig)
```

## 7. Is a particular age group more prone to leaving the company? (Age and Attrition)

```hr_data.Age[hr_data.Attrition=='Yes'].describe()
```
```hr_data.Age[hr_data.Attrition=='No'].describe()
```
```trace1 = go.Box(y = hr_data.Age[hr_data.Attrition=='Yes'],name = 'Yes')
trace2 = go.Box(y = hr_data.Age[hr_data.Attrition=='No'],name= 'No')
data = [trace1,trace2]
layout = go.Layout(width = 800,
height = 500,title = 'Age and Attrition')
fig = go.Figure(data=data,layout = layout)
iplot(fig)
```

You can also bin the age column and do the same.
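
A minimal sketch of that idea, assuming `pd.cut` with hand-picked bin edges; the edges, labels, and toy values below are illustrative choices, not taken from the HR data.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Age": [22, 27, 34, 38, 45, 58],
    "Attrition": ["Yes", "Yes", "No", "Yes", "No", "No"],
})

# Assumed bin edges; intervals are right-closed: (17, 30], (30, 45], (45, 60]
toy["Age_Bins"] = pd.cut(toy.Age, bins=[17, 30, 45, 60],
                         labels=["18-30", "31-45", "46-60"])

# Share of leavers within each age bin
attrition_rate = (toy.Attrition == "Yes").groupby(toy.Age_Bins, observed=False).mean()
print(attrition_rate)
```

The resulting series can be passed to a bar chart in the same way as the value counts earlier in the post.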

## 8. Employees earning less tend to leave the company. Verify if this is true. (Monthly Income vs Attrition)

```trace1 = go.Box(y = hr_data.MonthlyIncome[hr_data.Attrition=='Yes'],name = 'Yes')
trace2 = go.Box(y = hr_data.MonthlyIncome[hr_data.Attrition=='No'],name= 'No')
data = [trace1,trace2]
layout = go.Layout(width = 800,
height = 500,title = 'Income and Attrition')
fig = go.Figure(data=data,layout = layout)
iplot(fig)
```

## 9. How do Age and Monthly Income vary?

```trace = go.Scatter(x=hr_data.Age ,
y= hr_data.MonthlyIncome,
name = 'Age and MonthlyIncome',
mode= 'markers')
data = [trace]
layout = go.Layout(title = ' Age and Monthly Income distribution',
xaxis = dict(title = 'Age'),
yaxis = dict(title = 'Monthly Income'))
fig = go.Figure(data=data,layout=layout)
iplot(fig)
```

## 10. Does Years With Current Manager have anything to do with Years Since Last Promotion?

```trace = go.Scatter(x=hr_data.YearsWithCurrManager ,
y=hr_data.YearsSinceLastPromotion,
name = 'YearsWithCurrManager and YearsSinceLastPromotion',
mode= 'markers')
data = [trace]
layout = go.Layout(title = ' YearsWithCurrManager and YearsSinceLastPromotion distribution',
xaxis = dict(title = 'YearsWithCurrManager'),
yaxis = dict(title = 'YearsSinceLastPromotion'))
fig = go.Figure(data=data,layout=layout)
iplot(fig)
```

## 1. What is the relationship between number of companies worked, age and attrition? (Number of companies worked, Age, Attrition)

```data = []
for i in np.sort(hr_data.NumCompaniesWorked.unique()):
    # ages of employees who left, for this number of companies
    data.append(go.Box(
        y=hr_data.Age[(hr_data.NumCompaniesWorked == i) & (hr_data.Attrition == 'Yes')],
        marker=dict(color='#CC0E1D'),  # red
        name="{}- Yes".format(str(i))))
    # ages of employees who stayed, for this number of companies
    data.append(go.Box(
        y=hr_data.Age[(hr_data.NumCompaniesWorked == i) & (hr_data.Attrition == 'No')],
        marker=dict(color='#588061'),  # green
        name="{}- No".format(str(i))))
layout = go.Layout(
autosize=False, # auto size the graph? use False if you are specifying the height and width
width=1000, # width of the figure in pixels
height=600, # height of the figure in pixels
title = "Boxplot of {} column based on {} ".format('Age','NumCompaniesWorked'), # title of the figure
# more granular control on the title font
titlefont=dict(
family='Courier New, monospace', # font family
size=14, # size of the font
color='black' # color of the font
),
# granular control on the axes objects
xaxis=dict(title='Number of Companies worked and attrition',
tickfont=dict(family = 'Courier New, monospace',
size=10,
color='black',
)),
yaxis=dict(
# range=[0,100],
title='Age',
titlefont=dict(size=14,color='black'),
tickfont=dict(family='Courier New, monospace', size=14,color='black')),)
fig = go.Figure(data=data, layout=layout)
iplot(fig)
```

Observe the last two plots, how are they different from others?

### 2. What is the relationship between total working years, number of companies worked and attrition (Total Working Years , Number of companies and Attrition.)

#### Generate a new feature using Total Working Years and Number of companies worked

```hr_data['TotalWorkingYears_NumCompWorked'] = np.round(
hr_data.TotalWorkingYears / (hr_data.NumCompaniesWorked.astype(int)+1))
# adding 1 to avoid dividing by 0

```
```trace0 = go.Box(y= hr_data.TotalWorkingYears_NumCompWorked[hr_data.Attrition=='Yes'],
name = 'Yes')
trace1 = go.Box(y = hr_data.TotalWorkingYears_NumCompWorked[hr_data.Attrition=='No'],
name = 'No')
data =[trace0,trace1]
layout = go.Layout(width = 900,
height = 600,
title = 'Ratio of Total Working Years to Number of Companies worked vs Attrition',
titlefont=dict(
family='Courier New, monospace', # font family
size=14, # size of the font
color='black' # color of the font
),
# granular control on the axes objects
xaxis=dict(
tickfont=dict(
family='Courier New, monospace',
size=10,
color='black'
)
),
yaxis=dict(
# range=[0,100],
title='(TotalWorkingYears/NumCompWorked)',
titlefont=dict(size=14, color='black'),
tickfont=dict(family='Courier New, monospace',
size=14,
color='black' # color of the font
)
),
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)
```

## 3. Do marital status and distance from home affect attrition? (Marital Status, Distance From Home and Attrition)

```data = []
for i in np.sort(hr_data.MaritalStatus.unique()):
    # distances of employees who left, for this marital status
    data.append(go.Box(
        y=hr_data.DistanceFromHome[(hr_data.MaritalStatus == i) & (hr_data.Attrition == 'Yes')],
        marker=dict(color='#CC0E1D'),  # red
        name="{}- Yes".format(str(i))))
    # distances of employees who stayed, for this marital status
    data.append(go.Box(
        y=hr_data.DistanceFromHome[(hr_data.MaritalStatus == i) & (hr_data.Attrition == 'No')],
        marker=dict(color='#588061'),  # green
        name="{}- No".format(str(i))))
layout = go.Layout(
autosize=False,
width=1000, # width of the figure in pixels
height=600, # height of the figure in pixels
title = "Boxplot of {} column based on {} ".format('DistanceFromHome','MaritalStatus'),
titlefont=dict(
family='Courier New, monospace', # font family
size=14, # size of the font
color='black' # color of the font
),
# granular control on the axes objects
xaxis=dict(
tickfont=dict(
family='Courier New, monospace', # font family
size=10, # size of ticks displayed on the x axis
color='black' # color of the font
)
),
yaxis=dict(
# range=[0,100],
title='Distance travelled', titlefont=dict(size=14, color='black'),
tickfont=dict(family='Courier New, monospace',
size=14, color='black' # color of the font
)))
fig = go.Figure(data=data, layout=layout)
iplot(fig)
```

## >3 variables, 3D plots.

```n = 1500
# Extracting the x, y, z values
temp = hr_data.iloc[0:n,]
temp.shape
```
```trace1 = go.Scatter3d(
x=temp.PercentSalaryHike[temp.Attrition=='Yes'],
y=temp.YearsAtCompany[temp.Attrition=='Yes'],
z=temp.DistanceFromHome[temp.Attrition=='Yes'],
mode='markers',name ='Yes',
marker=dict(
size=temp.YearsInCurrentRole[temp.Attrition=='Yes']+2,
color='#CC0E1D', # Ferrari red
# colorscale='Viridis', # choose a colorscale
opacity=1
)
)
trace2 = go.Scatter3d(
x=temp.PercentSalaryHike[temp.Attrition=='No'],
y=temp.YearsAtCompany[temp.Attrition=='No'],
z=temp.DistanceFromHome[temp.Attrition=='No'],
mode='markers',name ='No',
marker=dict(
size=temp.YearsInCurrentRole[temp.Attrition=='No']+2,
color='rgb(0,255,0)', #green
# colorscale='Viridis', # choose a colorscale
opacity=0.9,
)
)
data = [trace1,trace2]
layout = go.Layout(
scene = dict(
xaxis = dict(
title='PercentSalaryHike',
backgroundcolor="black",
showbackground=True,
titlefont=dict(
size=16,
color='black'
)
),
yaxis = dict(
title='YearsAtCompany',
showbackground=True,
backgroundcolor="black",
titlefont=dict(
size=16,
color='black'
)
),
zaxis = dict(
title='DistanceFromHome',
backgroundcolor="black",
showbackground=True,
titlefont=dict(
size=16,
color='black'
)
)
),
width=1000, # width of the figure in pixels
height=800, # height of the figure in pixels
)
fig = go.Figure(data=data, layout=layout)
fig['layout'].update(title= "PercentSalaryHike, YearsAtCompany, DistanceFromHome, YearsInCurrentRole and Attrition")
iplot(fig, filename='3d-scatter-colorscale')
```

## Scree Plot
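
The list `sse` plotted in this section is assumed to hold the k-means sum of squared errors for 2 to 9 clusters; the post does not show how it was computed. The sketch below is one possible way to produce such a list, using a basic NumPy implementation of Lloyd's algorithm on random stand-in data. The function `kmeans_sse`, the array `X`, and the choice of features are all hypothetical. In practice, scikit-learn's `KMeans` exposes the same quantity as its `inertia_` attribute.

```python
import numpy as np

def kmeans_sse(X, k, n_iter=50, seed=0):
    """Run a basic Lloyd's k-means and return the sum of squared errors."""
    rng = np.random.default_rng(seed)
    # start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # squared distance of every point to every center, shape (n, k)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # move each center to the mean of its points; keep it if the cluster is empty
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # squared distance of each point to its nearest final center, summed
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

# Hypothetical stand-in for scaled numeric columns such as
# Age, MonthlyIncome and DistanceFromHome
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# one SSE value per number of clusters k = 2..9
sse = [kmeans_sse(X, k) for k in range(2, 10)]
print(sse)
```

The "elbow" of the resulting curve, where adding clusters stops reducing the error much, is the usual visual cue for picking the number of clusters.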

```x=list(range(2,10))
y=sse
data = [go.Scatter(x=x, # number of clusters
y=y, # sum of squared errors
text = [str(i) for i in (zip(x,y))], # text to display on hover
textposition = 'top center',
line = dict(color = ('rgb(205, 12, 24)')) # line color
)]
layout = go.Layout(title ='Scree plot (Sum of Squared errors)')
fig = go.Figure(data=data,layout=layout)
iplot(fig)
```
```trace0 = go.Scatter3d(
x=hr_data.Age[hr_data.Attrition=='Yes'],
y=hr_data.MonthlyIncome[hr_data.Attrition=='Yes'],
z=hr_data.DistanceFromHome[hr_data.Attrition=='Yes'],
mode='markers',name ='Yes',
marker=dict(
size=4,
color=hr_data.colors_clusters[hr_data.Attrition=='Yes'],
# colorscale='Viridis', # choose a colorscale
opacity=1
)
)
trace1 = go.Scatter3d(
x=hr_data.Age[hr_data.Attrition=='No'],
y=hr_data.MonthlyIncome[hr_data.Attrition=='No'],
z=hr_data.DistanceFromHome[hr_data.Attrition=='No'],
mode='markers',name ='No',
marker=dict(
size=4,
color=hr_data.colors_clusters[hr_data.Attrition=='No'],
# colorscale='Viridis', # choose a colorscale
opacity=0.75
)
)
data = [trace0,trace1]
layout = go.Layout(
scene = dict(
xaxis = dict(
title='Age',
backgroundcolor="black",
showbackground=True,
titlefont=dict(
size=16,
color='black'
)
),
yaxis = dict(title='MonthlyIncome', showbackground=True, backgroundcolor="black",
titlefont=dict(size=16, color='black')),
zaxis = dict(title='DistanceFromHome',
backgroundcolor="black", showbackground=True,
titlefont=dict(size=16, color='black'))),
width=1000, # width of the figure in pixels
height=800, # height of the figure in pixels
margin = dict(b=15),)
fig = go.Figure(data=data, layout=layout)
fig['layout'].update(
title= "Understanding attrition by using the clusters.")
iplot(fig)
```

One way to check whether you have chosen the right number of clusters is to see whether you can give each cluster a meaningful name in business terms.

That is all for now. I have also created a report on Employee Attrition Rate Analysis, which you may like to check as well. Please read it using the link below.

Report on Employee Attrition Rate Analysis

Thank you for reading. Your comments, thoughts on this post are most welcome.
