Data Visualization using plotly, matplotlib, seaborn and squarify

Data visualization is one of the most important activities in Exploratory Data Analysis. It helps in preparing business reports, building visual dashboards, storytelling and other important tasks. In this post I explain how to ask questions of the data and get self-explanatory graphs in return. You will learn to use Python libraries such as plotly, matplotlib, seaborn and squarify to plot those graphs.

Key takeaways from this post are:

  • Asking questions from data set
  • Univariate Analysis
  • Bivariate Analysis
  • Analysis of more than 3 variables
  • 3D Visualization
  • Case Study on employee Attrition Rate using HR Data Set
import warnings
warnings.filterwarnings('ignore')
!pip install plotly
!pip install squarify
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import plotly
import plotly.offline as pyoff
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode, iplot, plot
import plotly.graph_objs as go
import squarify # for tree maps
%matplotlib inline

plotly

  • Modern Visualization for the data Era

Line Chart in plotly

  • 2 numeric variables with a 1-1 mapping, i.e. situations where each x value has exactly 1 corresponding y value
x=[1, 2, 3]
y=[3, 1, 6]
iplot([go.Scatter(x=x,
y=y,
text = [str(i) for i in (zip(x,y))],
textposition = 'top center')])
In offline mode, plot() saves the chart as an HTML file, even when an image format is requested:
from plotly.offline import plot
plot([go.Scatter(x=x,
y=y,
text = [str(i) for i in (zip(x,y))],
textposition = 'top center')],output_type='file' ,filename='temp-histogram.jpeg',image='jpeg',auto_open=False)
output -> 'temp-histogram.jpeg.html'

Note that this is a bare chart with no labels. Later in the activity we will add a title, x labels and y labels.

Basic Bar chart in plotly

  • 1 Categorical variable
data = [go.Bar(
x=['x', 'y', 'z'],
y=[10, 20, 15])]
iplot(data)

Histogram in plotly

  • 1 numeric variable
n = 1000
x = np.random.randn(n)
data = [go.Histogram(x=x,
    marker=dict(
        color='#CC0E1D', # Lava (#CC0E1D)
        # color = 'rgb(200,0,0)'
        # you can provide the color in HEX or rgb format; programmers generally prefer HEX because it is a single string value that is easy to pass as a variable
    ))]
layout = go.Layout(title = "Histogram of {} random numbers".format(n))
fig = go.Figure(data= data, layout=layout)
iplot(fig)
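
Since the color can be given in either notation, it can help to convert a HEX string into the rgb form. A minimal sketch, assuming a hypothetical helper named hex_to_rgb (it is not part of plotly):

```python
def hex_to_rgb(hex_color):
    """Convert a '#RRGGBB' string to the 'rgb(r,g,b)' string form."""
    value = hex_color.lstrip('#')
    if len(value) != 6:
        raise ValueError("expected a color like '#CC0E1D'")
    # Read each pair of hex digits as one 0-255 channel
    r, g, b = (int(value[i:i + 2], 16) for i in (0, 2, 4))
    return 'rgb({},{},{})'.format(r, g, b)


print(hex_to_rgb('#CC0E1D'))  # rgb(204,14,29)
```

The returned string can be passed wherever the HEX value was used, for example as the marker color.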

Boxplot in plotly

  • 1 Numeric variable
from IPython.display import Image
Image("img/boxplot.png")
np.random.seed(0) # Set seed for reproducibility
n = 10
r1 = np.random.randn(n)
r2 = np.random.randn(n)
trace0 = go.Box(
y=r1,
name = 'Box1',
marker = dict(
color = '#AA0505',
)
)
trace1 = go.Box(
y=r2,
name = 'Box2',
marker = dict(
color = '#B97D10',
)
)
data = [trace0, trace1]
layout = go.Layout(title = "Boxplot of 2 sets of random numbers")
fig = go.Figure(data= data, layout=layout)
iplot(fig)

Pie chart in plotly

  • 1 Categorical variable
labels = ["Pre processing and Visualization", "Model Building", "Misc"]
values = [80,10,10]
trace = go.Pie(labels=labels, values=values)
layout = go.Layout(title = 'Percentage of time spent on Data Science projects')
data = [trace]
fig = go.Figure(data= data,layout=layout)
iplot(fig)
Note: We do not recommend pie charts. One reason is that the total is not always obvious; another is that having many levels makes the chart cluttered.

Scatter plot in plotly

  • 2 numeric variables
  • One x might have multiple corresponding y values
np.random.seed(0)
n = 20
x=np.random.randint(0,100,n)
y=np.random.randint(0,100,n)
data = [go.Scatter(x=x,y=y,text = [str(i) for i in (zip(x,y))],textposition = 'top center', marker = dict(color = 'rgba(17, 157, 255, 0.8)', size = 10), mode = 'markers')]
layout = go.Layout(title = 'Scatter plot')
fig = go.Figure(data= data,layout=layout)
iplot(fig)

Tree map

https://plot.ly/python/treemaps/

squarify.plot(sizes=[13,22,35,5], label=["group A", "group B", "group C", "group D"], alpha=.7 )
plt.show()

Heatmap

np.random.rand(2, 2)
# trace = go.Heatmap(z=[[1, 20], [22, 1]], x=['Monday', 'Tuesday'],y=['Morning', 'Afternoon'])
# data=[trace]
# iplot(data)
sns.heatmap(np.random.rand(2, 2))

Case Study

Now let us use our newfound skills to extract insights from a dataset

hr_data Description

Education: 1 ‘Below College’, 2 ‘College’, 3 ‘Bachelor’, 4 ‘Master’, 5 ‘Doctor’

EnvironmentSatisfaction: 1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’

JobInvolvement: 1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’

JobSatisfaction: 1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’

PerformanceRating: 1 ‘Low’, 2 ‘Good’, 3 ‘Excellent’, 4 ‘Outstanding’

RelationshipSatisfaction: 1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’

WorkLifeBalance: 1 ‘Bad’, 2 ‘Good’, 3 ‘Better’, 4 ‘Best’

hr_data = pd.read_csv("HR_Attrition.csv")

Pre-processing

hr_data.head()

Checking the datatypes

Checking the number of unique values in each column

for i in hr_data.columns:
    print ("Number of unique values in {} column are {} \n The unique values are {}".format(i, len(hr_data[i].unique()),hr_data[i].unique()))
    print ("---------------------- \n")

Observations:

- Most columns have fewer than 4 unique levels
- NumCompaniesWorked and PercentSalaryHike have fewer than 15 unique values, so we can convert them into categorical variables for analysis purposes. This is fairly subjective; you can also keep them as integer values.

Replacing the integer codes with the labels from the description above
  • hr_data.Education = hr_data.Education.replace(to_replace=[1,2,3,4,5],value=['Below College', 'College', 'Bachelor', 'Master', 'Doctor'])
  • hr_data.EnvironmentSatisfaction = hr_data.EnvironmentSatisfaction.replace(to_replace=[1,2,3,4],value=['Low', 'Medium', 'High', 'Very High'])
  • hr_data.JobInvolvement = hr_data.JobInvolvement.replace(to_replace=[1,2,3,4],value=['Low', 'Medium', 'High', 'Very High'])
  • hr_data.JobSatisfaction = hr_data.JobSatisfaction.replace(to_replace=[1,2,3,4],value=['Low', 'Medium', 'High', 'Very High'])
  • hr_data.PerformanceRating = hr_data.PerformanceRating.replace(to_replace=[1,2,3,4],value=['Low', 'Good', 'Excellent', 'Outstanding'])
  • hr_data.RelationshipSatisfaction = hr_data.RelationshipSatisfaction.replace(to_replace=[1,2,3,4],value=['Low', 'Medium', 'High', 'Very High'])
  • hr_data.WorkLifeBalance = hr_data.WorkLifeBalance.replace(to_replace=[1,2,3,4],value=['Bad', 'Good', 'Better', 'Best'])
Education_dict = {1:'Below College',
2:'College',
3:'Bachelor',
4:'Master',
5:'Doctor',
}
EnvironmentSatisfaction_dict = {1:'Low',
2:'Medium',
3:'High',
4:'Very High',
}
JobInvolvement_dict = {1:'Low',
2:'Medium',
3:'High',
4:'Very High',
}
JobSatisfaction_dict = {1:'Low',
2:'Medium',
3:'High',
4:'Very High',
}
PerformanceRating_dict = {1:'Low',
2:'Good',
3:'Excellent',
4:'Outstanding',
}
RelationshipSatisfaction_dict = {1:'Low',
2:'Medium',
3:'High',
4:'Very High',
}
WorkLifeBalance_dict = {1:'Bad',
2:'Good',
3:'Better',
4:'Best',
}
hr_data = hr_data.replace({"Education":Education_dict,
"EnvironmentSatisfaction":EnvironmentSatisfaction_dict,
"JobInvolvement":JobInvolvement_dict,
"JobSatisfaction":JobSatisfaction_dict,
"PerformanceRating":PerformanceRating_dict,
"RelationshipSatisfaction":RelationshipSatisfaction_dict,
"WorkLifeBalance":WorkLifeBalance_dict,
})
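
When recoding integers into labels, it is worth checking that no code was left without a label. A minimal sketch on a toy column (the values, including the invalid code 9, are made up for illustration), using Series.map, which returns NaN for any code missing from the dictionary:

```python
import pandas as pd

# Toy stand-in for the Education column; 9 is a deliberately invalid code
toy = pd.DataFrame({'Education': [1, 3, 5, 2, 9]})

education_labels = {1: 'Below College', 2: 'College', 3: 'Bachelor',
                    4: 'Master', 5: 'Doctor'}

# map() turns codes that are missing from the dictionary into NaN
mapped = toy['Education'].map(education_labels)

# Collect the original codes that could not be translated
unmapped = toy.loc[mapped.isna(), 'Education'].tolist()
print(mapped.tolist())
print("codes without a label:", unmapped)
```

By contrast, replace() leaves unknown codes untouched, so a stray value can go unnoticed among the labels.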
Extract categorical columns

For the purpose of this analysis we treat every column with 15 or fewer levels as categorical. The following lines of code extract all the columns that satisfy this condition.

cat_cols = []
for i in hr_data.columns:
    if hr_data[i].dtype == 'object' or len(np.unique(hr_data[i])) <= 15: # treat the column as categorical if it has 15 or fewer levels
        cat_cols.append(i)
        print("{} : {} : {} ".format(i,len(np.unique(hr_data[i])),np.unique(hr_data[i])))
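
The same selection rule can be written with pandas helpers. A minimal sketch on a toy frame (column values are made up; max_levels is an assumed name for the threshold of 15):

```python
import pandas as pd

toy = pd.DataFrame({
    'Attrition': ['Yes', 'No', 'No', 'Yes'] * 5,       # text column
    'JobLevel': [1, 2, 3, 4] * 5,                       # only 4 distinct integers
    'MonthlyIncome': list(range(1000, 21000, 1000)),    # 20 distinct values
})

max_levels = 15  # threshold for treating a column as categorical

# Keep a column if it is non-numeric or has at most max_levels distinct values
cat_cols = [col for col in toy.columns
            if not pd.api.types.is_numeric_dtype(toy[col])
            or toy[col].nunique() <= max_levels]
print(cat_cols)
```

Here MonthlyIncome is left out because it is numeric and has more than 15 distinct values.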

Print the categorical column names

Check if the above columns are categorical in the data set

Type Conversion
  • n-dimensional type conversion to 'category' is not implemented yet, so the columns are converted one at a time
for i in cat_cols:
    hr_data[i] = hr_data[i].astype('category')
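
Depending on the pandas version, the loop can also be replaced by a single astype call that takes a column-to-dtype dictionary. A small sketch on a toy frame with made-up values, assuming a reasonably recent pandas:

```python
import pandas as pd

toy = pd.DataFrame({'Gender': ['Male', 'Female', 'Male'],
                    'OverTime': ['Yes', 'No', 'No'],
                    'Age': [41, 49, 37]})
cat_cols = ['Gender', 'OverTime']

# astype accepts a {column: dtype} mapping, so each listed column is converted
toy = toy.astype({col: 'category' for col in cat_cols})
print(toy.dtypes)
```

Columns that are not listed in the dictionary, such as Age here, keep their original type.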

Categorical attributes summary

Extracting Numeric Columns

Exploratory Data Analysis

Univariate Analysis

1. What is the attrition rate in the company?

Attrition in numbers (pandas)
hr_data.Attrition.value_counts()
plt.figure()
hr_data.Attrition.value_counts().plot(kind='bar', figsize=(6,3), color="blue", alpha = 0.7, fontsize=13)
plt.title('Attrition rate (in numbers)')
plt.grid()
plt.show()

Calling plt.show() is one way to make matplotlib draw the graph in the notebook.

Attrition rate in percentage (pandas)

((hr_data.Attrition.value_counts()/sum(hr_data.Attrition.value_counts()))*100).plot(kind='bar', figsize=(6,3), color=["blue"], alpha = 0.7, fontsize=16)
plt.ylim([0,100])
plt.title('Attrition Rate (in percentage)')
plt.ylabel('Percentage Attrition',fontsize = 14)
plt.grid(True)
plt.show()
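
The percentage calculation can be shortened with value_counts(normalize=True), which returns proportions directly. A minimal sketch on a made-up Attrition column:

```python
import pandas as pd

# Made-up sample: 8 employees stayed, 2 left
attrition = pd.Series(['No'] * 8 + ['Yes'] * 2, name='Attrition')

# normalize=True returns proportions; multiply by 100 for percentages
pct = attrition.value_counts(normalize=True) * 100
print(pct.round(1))
```

The resulting Series can be plotted with .plot(kind='bar') in the same way as the counts.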

plotly In percentages

temp = hr_data.Attrition.value_counts()
trace = go.Bar(x=temp.index,
y= np.round(temp.astype(float)/temp.values.sum(),2), 
text = np.round(temp.astype(float)/temp.values.sum(),2),
textposition = 'auto',
name = 'Attrition')
data = [trace]
layout = go.Layout(autosize=False, width=600, height=400,title = 
"Attrition Distribution"
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)
del temp

2. What is the Gender Distribution in the company?

temp = hr_data.Gender.value_counts()
temp
data = [go.Bar(
x=temp.index,
y= np.round(temp.astype(float)/temp.values.sum(),2),
text = np.round(temp.astype(float)/temp.values.sum(),2),
textposition = 'auto',
)]
layout = go.Layout(
autosize=False,
width=600,
height=400,title = "Gender Distribution",
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)
del temp
temp = hr_data.Gender.value_counts()
temp
Steps to create a bar chart with counts for a categorical variable in plotly
  • Steps to create a bar chart with counts for a categorical variable
    • create an object and store the counts (optional)
    • create a bar object
      • pass the x values
      • pass the y values
      • optional :
        • text to be displayed
        • text position
        • color of the bar
        • name of the bar (trace in plotly terminology)
    • create a layout object
      • title – font and size of title
      • x axis – font and size of xaxis text
      • y axis – font and size of yaxis text
    • create a figure object:
      • add data
      • add layout
    • plot the figure object
# create a table with value counts
temp = hr_data.Gender.value_counts()
# creating a Bar chart object of plotly
data = [go.Bar(
x=temp.index.astype(str), # x axis values
y=np.round(temp.values.astype(float)/temp.values.sum(),4)*100, # y axis values
text = ['{}%'.format(i) for i in np.round(temp.values.astype(float)/temp.values.sum(),4)*100],
# text to be displayed on the bar, we are doing this to display the '%' symbol along with the number on the bar
textposition = 'auto', # specify at which position on the bar the text should appear
marker = dict(color = '#0047AB'),)] # change color of the bar
# color used here Cobalt Blue
# these are used to define the layout options
layout = go.Layout(
autosize=False, # auto size the graph? use False if you are specifying the height and width
width=800, # width of the figure in pixels
height=600, # height of the figure in pixels
title = "Distribution of {} column".format('Gender'), # title of the figure
# more granular control on the title font
titlefont=dict(
family='Courier New, monospace', # font family
size=16, # size of the font
color='black' # color of the font
),
# granular control on the axes objects
xaxis=dict(
tickfont=dict(
family='Courier New, monospace', # font family
size=16, # size of ticks displayed on the x axis
color='black' # color of the font
)
),
yaxis=dict(
title='Percentage',
titlefont=dict(
size=16,
color='black'
),
tickfont=dict(
family='Courier New, monospace', # font family
size=16, # size of ticks displayed on the y axis
color='black' # color of the font
)
),
font = dict(
family='Courier New, monospace', # font family
color = "white",# color of the font
size = 12 # size of the font displayed on the bar
)
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)
del temp

We will save the above layout in an object and define a function for future use

def generate_layout_bar(col_name):
    layout_bar = go.Layout(
        autosize=False, # auto size the graph? use False if you are specifying the height and width
        width=800, # width of the figure in pixels
        height=600, # height of the figure in pixels
        title = "Distribution of {} column".format(col_name), # title of the figure
        # more granular control on the title font
        titlefont=dict(
            family='Courier New, monospace', # font family
            size=14, # size of the font
            color='black' # color of the font
        ),
        # granular control on the axes objects
        xaxis=dict(
            tickfont=dict(
                family='Courier New, monospace', # font family
                size=14, # size of ticks displayed on the x axis
                color='black' # color of the font
            )
        ),
        yaxis=dict(
            # range=[0,100],
            title='Percentage',
            titlefont=dict(
                size=14,
                color='black'
            ),
            tickfont=dict(
                family='Courier New, monospace', # font family
                size=14, # size of ticks displayed on the y axis
                color='black' # color of the font
            )
        ),
        font = dict(
            family='Courier New, monospace', # font family
            color = "white", # color of the font
            size = 12 # size of the font displayed on the bar
        )
    )
    return layout_bar

Defining a function to plot the bar charts

def plot_bar(col_name):
    # create a table with value counts
    temp = hr_data[col_name].value_counts()
    # creating a Bar chart object of plotly
    data = [go.Bar(
        x=temp.index.astype(str), # x axis values
        y=np.round(temp.values.astype(float)/temp.values.sum(),4)*100, # y axis values
        text = ['{}%'.format(i) for i in np.round(temp.values.astype(float)/temp.values.sum(),4)*100],
        # text to be displayed on the bar, we are doing this to display the '%' symbol along with the number on the bar
        textposition = 'auto', # specify at which position on the bar the text should appear
        marker = dict(color = '#0047AB'),)] # change color of the bar
    # color used here Cobalt Blue
    layout_bar = generate_layout_bar(col_name=col_name)
    fig = go.Figure(data=data, layout=layout_bar)
    return iplot(fig)

3. How many people travel? (Business Travel)

4. Which department has the highest number of employees? (Department)

plot_bar('Department')

5. What is the most common educational background of the employees (Education Field)

plot_bar('EducationField')

6. In what roles are the employees working and what is the common job role? (Job Role)

plot_bar('JobRole')

7. Is the workforce in the company young? (Age)

plot_bar('Age')

Age is a continuous variable, it makes more sense to plot a histogram rather than a bar chart

Histogram

data = [go.Histogram(x=hr_data.Age,
    marker=dict(
        color='#CC0E1D', # Lava (#CC0E1D)
        # color = 'rgb(200,0,0)'
    ))]
layout = go.Layout(title = "Histogram of Age")
fig = go.Figure(data= data, layout=layout)
iplot(fig)

8. What is the income distribution in the company?(Monthly Income)

data = [go.Histogram(x=hr_data.MonthlyIncome,
    marker=dict(
        color='#CC0E1D', # Lava (#CC0E1D)
        # color = 'rgb(200,0,0)'
    ))]
layout = go.Layout(title = "Histogram of Income")
fig = go.Figure(data= data, layout=layout)
iplot(fig)

Observations:

- We see that the income column has a long tailed distribution
- Binning might give better insights into the distribution

Let us bin the Income column

hr_data['Income_Bins'] = np.digitize(hr_data.MonthlyIncome,list(range(0,hr_data.MonthlyIncome.max()+10,2500)),right=True)

list(range(0,hr_data.MonthlyIncome.max()+10,2500))

hr_data['Income_Bins'].value_counts()
hr_data['Income_Bins'] = hr_data['Income_Bins'].replace(to_replace=[1,2,3,4,5,6,7,8],
value=['Bin1','Bin2','Bin3',
'Bin4','Bin5','Bin6','Bin7','Bin8'])
temp = hr_data['Income_Bins'].value_counts()
temp=temp.sort_index()
trace1 = go.Bar(x = temp.index,
y = (temp.values.astype(float)/sum(temp.values))*100,
text = ['{}%'.format(i) for i in np.round(temp.values.astype(float)/temp.values.sum(),4)*100],
# text to be displayed on the bar, we are doing this to display the '%' symbol along with the number on the bar
textposition = 'auto',
name = 'Income_Bins')
data = [trace1]
# these are used to define the layout options
layout = generate_layout_bar('Income_Bins')
fig = go.Figure(data=data, layout=layout)
iplot(fig)
print(list(range(0,hr_data.MonthlyIncome.max()+10,2500)))
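
Bin numbers such as Bin1 or Bin2 do not tell the reader which income range they cover. A minimal sketch, using made-up income values and an assumed label format, of how pd.cut can build the same right-closed bins as np.digitize(..., right=True) while attaching readable labels:

```python
import numpy as np
import pandas as pd

income = pd.Series([1009, 2500, 2501, 6200, 19999])   # made-up values
edges = list(range(0, 20001, 2500))                    # 0, 2500, ..., 20000

# np.digitize with right=True: bin k holds edges[k-1] < x <= edges[k]
codes = np.digitize(income.to_numpy(), edges, right=True)

# pd.cut uses the same right-closed intervals by default and accepts labels
labels = ['{}-{}'.format(lo + 1, hi) for lo, hi in zip(edges[:-1], edges[1:])]
binned = pd.cut(income, bins=edges, labels=labels)

print(codes.tolist())
print(binned.tolist())
```

Because pd.cut returns an ordered categorical, value_counts().sort_index() keeps the bins in income order on the x axis.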

Bivariate Analysis

1. Is a particular gender travelling more distance than the other? (Gender and Distance from home)

trace1 = go.Box(y = hr_data.DistanceFromHome[hr_data.Gender=='Male'],name = 'Male',
boxpoints = 'all',jitter = 1
)
# boxpoints specifies which points to plot ('all' shows every observation)
# jitter specifies how far the points are spread sideways so that they do not overlap
trace2 = go.Box(y = hr_data.DistanceFromHome[hr_data.Gender=='Female'],name= 'Female',
boxpoints = 'all',jitter = 1
)
data = [trace1,trace2]
layout = go.Layout(width = 1000,
height = 500,title = 'Distance from home and Gender')
fig = go.Figure(data=data,layout = layout)
iplot(fig)

Distance Bins and Gender

hr_data['Distance_Bins']=(np.digitize(hr_data.DistanceFromHome,[0,5,15,np.max(hr_data.DistanceFromHome)],right=True))

temp = hr_data.groupby(['Distance_Bins','Gender']).size().to_frame()
temp = temp.reset_index()
temp.columns = ['Distance_Bins','Gender','Count']
temp
trace1 = go.Bar(x = temp.Distance_Bins[temp.Gender=='Male'],
y = temp.Count[temp.Gender=='Male'],
text = temp.Count[temp.Gender=='Male'],
textposition = 'auto',
name = 'Male')
trace2 = go.Bar(x = temp.Distance_Bins[temp.Gender=='Female'],
y = temp.Count[temp.Gender=='Female'],
text = temp.Count[temp.Gender=='Female'],
textposition = 'auto',
name = 'Female')
data = [trace1,trace2]
layout = go.Layout(width = 700,
height = 500,title = 'Gender and Distance bins',
yaxis = dict(title='Count'))
fig = go.Figure(data=data, layout=layout)
iplot(fig)

Observations:

- Irrespective of the distance bin there is a global pattern, i.e. every bin has more male employees. This has to do with the overall gender distribution in the data.
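
One way to remove the effect of the overall gender distribution is to compare shares within each gender instead of raw counts. A minimal sketch on made-up data, using pd.crosstab with normalize='columns':

```python
import pandas as pd

# Made-up sample: 6 male and 4 female employees across 3 distance bins
toy = pd.DataFrame({
    'Gender': ['Male'] * 6 + ['Female'] * 4,
    'Distance_Bins': [1, 1, 1, 2, 2, 3, 1, 1, 2, 3],
})

# normalize='columns' divides each count by its column (gender) total,
# so every gender column sums to 1
share = pd.crosstab(toy['Distance_Bins'], toy['Gender'], normalize='columns')
print(share.round(2))
```

In this toy sample half of each gender falls in bin 1, even though the raw counts (3 versus 2) would suggest a difference.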

2. Are employees working overtime getting better ratings? (Over Time and Performance Rating.)

temp = hr_data.groupby(['OverTime','PerformanceRating']).size().to_frame()
temp = temp.reset_index()
temp.columns = ['OverTime','PerformanceRating','Count']
temp
hr_data.OverTime.value_counts()
trace1 = go.Bar(x = temp.OverTime[temp.PerformanceRating=='Excellent'],
y = temp.Count[temp.PerformanceRating=='Excellent']/temp.Count.sum(),
name = 'Excellent')
trace2 = go.Bar(x = temp.OverTime[temp.PerformanceRating=='Outstanding'],
y = temp.Count[temp.PerformanceRating=='Outstanding']/temp.Count.sum(),
name = 'Outstanding')
data = [trace1,trace2]
layout = go.Layout(width = 800,
height = 600,title = 'OverTime and PerformanceRating')
fig = go.Figure(data=data, layout=layout)
iplot(fig)
fig

All the proportions are computed against the total head count and add up to one, so we can compare the numbers globally

3. Does working longer with a manager have any relationship with Job satisfaction? (Years With Current Manager, Job Satisfaction)

yearscurrman_jobsat = hr_data.groupby(['YearsWithCurrManager','JobSatisfaction']).size().to_frame()
yearscurrman_jobsat = yearscurrman_jobsat.reset_index()
yearscurrman_jobsat.columns = ['YearsWithCurrManager','JobSatisfaction','Counts']

np.random.seed(0)
yearscurrman_jobsat.sample(frac =0.1)
tracelow = go.Bar(x = yearscurrman_jobsat.YearsWithCurrManager[yearscurrman_jobsat.JobSatisfaction=='Low'],
y = yearscurrman_jobsat.Counts[yearscurrman_jobsat.JobSatisfaction=='Low'],
text = yearscurrman_jobsat.Counts[yearscurrman_jobsat.JobSatisfaction=='Low'],
textposition = 'auto',
name = 'Low')
tracemedium = go.Bar(x = yearscurrman_jobsat.YearsWithCurrManager[yearscurrman_jobsat.JobSatisfaction=='Medium'],
y = yearscurrman_jobsat.Counts[yearscurrman_jobsat.JobSatisfaction=='Medium'],
text = yearscurrman_jobsat.Counts[yearscurrman_jobsat.JobSatisfaction=='Medium'],
textposition = 'auto',
name = 'Medium')
traceHigh = go.Bar(x = yearscurrman_jobsat.YearsWithCurrManager[yearscurrman_jobsat.JobSatisfaction=='High'],
y = yearscurrman_jobsat.Counts[yearscurrman_jobsat.JobSatisfaction=='High'],
text = yearscurrman_jobsat.Counts[yearscurrman_jobsat.JobSatisfaction=='High'],
textposition = 'auto',
name = 'High')
traceVHigh = go.Bar(x = yearscurrman_jobsat.YearsWithCurrManager[yearscurrman_jobsat.JobSatisfaction=='Very High'],
y = yearscurrman_jobsat.Counts[yearscurrman_jobsat.JobSatisfaction=='Very High'],
text = yearscurrman_jobsat.Counts[yearscurrman_jobsat.JobSatisfaction=='Very High'],
textposition = 'auto',
name = 'Very High')
data = [tracelow, tracemedium, traceHigh, traceVHigh]
layout = go.Layout(width = 1000,
barmode='stack',
height = 600,title = 'YearsWithCurrManager and Job Satisfaction',
xaxis = dict(title = 'YearsWithCurrManager'),
yaxis = dict(title = 'Counts',range=[0, yearscurrman_jobsat.Counts.max()+10]))
fig = go.Figure(data=data, layout=layout)
iplot(fig)

We observe that the red bars are higher than the green bars only after 2 years, so we can infer that employees generally tend to become comfortable working with their manager after 2 years.

4. Are married employees staying far from the office? (Marital status and Distance from home)

hr_data.MaritalStatus.unique()
hr_data.DistanceFromHome[hr_data.MaritalStatus=='Divorced'].describe()
hr_data.DistanceFromHome[hr_data.MaritalStatus=='Married'].describe()
hr_data.DistanceFromHome[hr_data.MaritalStatus=='Single'].describe()
tracediv = go.Box(y = hr_data.DistanceFromHome[hr_data.MaritalStatus=='Divorced'], name = 'Divorced')
tracemarried = go.Box(y = hr_data.DistanceFromHome[hr_data.MaritalStatus=='Married'], name= 'Married')
tracesin = go.Box(y = hr_data.DistanceFromHome[hr_data.MaritalStatus=='Single'], name= 'Single')
data = [tracediv,tracemarried,tracesin]
layout = go.Layout(width = 800,
height = 500,title = 'Distance from home and Marital Status')
fig = go.Figure(data=data,layout = layout)
iplot(fig)

5. Is there any relationship between Attrition and Gender?

Gender_Attrition = hr_data.groupby(['Gender','Attrition']).size().to_frame()
Gender_Attrition = Gender_Attrition.reset_index()
Gender_Attrition.columns = ['Gender','Attrition','Count']
Gender_Attrition
trace1 = go.Bar(x = Gender_Attrition.Gender[Gender_Attrition.Attrition=='Yes'],
y = Gender_Attrition.Count[Gender_Attrition.Attrition=='Yes'],
text = Gender_Attrition.Count[Gender_Attrition.Attrition=='Yes'],
textposition = 'auto',
name = 'Yes')
trace2 = go.Bar(x = Gender_Attrition.Gender[Gender_Attrition.Attrition=='No'],
y = Gender_Attrition.Count[Gender_Attrition.Attrition=='No'],
text = Gender_Attrition.Count[Gender_Attrition.Attrition=='No'],
textposition = 'auto',
name = 'No')
data = [trace1,trace2]
layout = go.Layout(width = 800,
height = 600,title = 'Gender and Attrition')
fig = go.Figure(data=data, layout=layout)
iplot(fig)

6. Employees who spend more years in the company tend to leave. Verify if this is true. (Years at company and Attrition)

hr_data.YearsAtCompany[hr_data.Attrition=='Yes'].describe()
hr_data.YearsAtCompany[hr_data.Attrition=='No'].describe()
trace1 = go.Box(y = hr_data.YearsAtCompany[hr_data.Attrition=='Yes'],name = 'Yes',
boxpoints = 'all',jitter = 1
)
# boxpoints specifies which points to plot ('all' shows every observation)
# jitter specifies how far the points are spread sideways so that they do not overlap
trace2 = go.Box(y = hr_data.YearsAtCompany[hr_data.Attrition=='No'],name= 'No',
boxpoints = 'all',jitter = 1
)
data = [trace1,trace2]
layout = go.Layout(width = 800,
height = 500,title = 'YearsAtCompany and Attrition')
fig = go.Figure(data=data,layout = layout)
iplot(fig)

7. Is a particular age group more prone to leaving the company? (Age and Attrition)

hr_data.Age[hr_data.Attrition=='Yes'].describe()
hr_data.Age[hr_data.Attrition=='No'].describe()
trace1 = go.Box(y = hr_data.Age[hr_data.Attrition=='Yes'],name = 'Yes')
trace2 = go.Box(y = hr_data.Age[hr_data.Attrition=='No'],name= 'No')
data = [trace1,trace2]
layout = go.Layout(width = 800,
height = 500,title = 'Age and Attrition')
fig = go.Figure(data=data,layout = layout)
iplot(fig)

You can also bin the age column and do the same.
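
A minimal sketch of that binning idea on made-up data. The bin edges and labels below are assumptions chosen for illustration, not values taken from the HR data set:

```python
import pandas as pd

toy = pd.DataFrame({
    'Age': [22, 27, 29, 34, 38, 41, 45, 52, 58, 33],
    'Attrition': ['Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'No', 'No', 'No', 'No'],
})

# Right-closed bins: (17, 30], (30, 40], (40, 50], (50, 60]
toy['Age_Bins'] = pd.cut(toy['Age'], bins=[17, 30, 40, 50, 60],
                         labels=['18-30', '31-40', '41-50', '51-60'])

# The mean of a True/False column is the share of True values,
# so this gives the attrition rate (in percent) per age bin
left = toy['Attrition'] == 'Yes'
rate = left.groupby(toy['Age_Bins'], observed=False).mean() * 100
print(rate.round(1))
```

The resulting rate per bin can then be drawn as a bar chart, which shows directly which age group leaves most often.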

8. Employees earning less tend to leave the company. Verify if this is true. (Monthly Income vs Attrition)

trace1 = go.Box(y = hr_data.MonthlyIncome[hr_data.Attrition=='Yes'],name = 'Yes')
trace2 = go.Box(y = hr_data.MonthlyIncome[hr_data.Attrition=='No'],name= 'No')
data = [trace1,trace2]
layout = go.Layout(width = 800,
height = 500,title = 'Income and Attrition')
fig = go.Figure(data=data,layout = layout)
iplot(fig)

9. How do Age and Monthly Income vary?

trace = go.Scatter(x=hr_data.Age ,
y= hr_data.MonthlyIncome,
name = 'Age and MonthlyIncome',
mode= 'markers')
data = [trace]
layout = go.Layout(title = ' Age and Monthly Income distribution',
xaxis = dict(title = 'Age'),
yaxis = dict(title = 'Monthly Income'))
fig = go.Figure(data=data,layout=layout)
iplot(fig)

10. Does Years With Curr Manager have anything to do with Years Since Last Promotion?

trace = go.Scatter(x=hr_data.YearsWithCurrManager ,
y= hr_data.YearsSinceLastPromotion,
name = 'YearsWithCurrManager and YearsSinceLastPromotion',
mode= 'markers')
data = [trace]
layout = go.Layout(title = ' YearsWithCurrManager and YearsSinceLastPromotion distribution',
xaxis = dict(title = 'YearsWithCurrManager'),
yaxis = dict(title = 'YearsSinceLastPromotion'))
fig = go.Figure(data=data,layout=layout)
iplot(fig)
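
A scatter plot shows the shape of a relationship; a correlation coefficient puts a number on its linear strength. A small sketch with made-up values (not taken from the HR data), using the Series.corr method of pandas:

```python
import pandas as pd

toy = pd.DataFrame({
    'YearsWithCurrManager': [0, 1, 2, 3, 4, 5],
    'YearsSinceLastPromotion': [0, 0, 1, 1, 2, 3],
})

# Pearson correlation: +1 is a perfect increasing line, 0 means no linear relation
pearson = toy['YearsWithCurrManager'].corr(toy['YearsSinceLastPromotion'])
print(round(pearson, 2))
```

A value close to 1 or -1 supports what the scatter plot suggests, while a value near 0 only rules out a linear relationship.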

3 variables

1. What is the relationship between number of companies worked, age and attrition? (Number of companies worked, Age, Attrition)

data = []
for i in np.sort(hr_data.NumCompaniesWorked.unique()):
    # one red box for employees who left and one green box for those who stayed
    data.append(go.Box(y = hr_data.Age[(hr_data.NumCompaniesWorked==i) & (hr_data.Attrition=='Yes')],
        marker = dict(
            color = '#CC0E1D',
        ),
        name = "{}- Yes".format(str(i))))
    data.append(go.Box(y = hr_data.Age[(hr_data.NumCompaniesWorked==i) & (hr_data.Attrition=='No')],
        marker = dict(
            color = '#588061',
        ),
        name = "{}- No".format(str(i))))
layout = go.Layout(
    autosize=False, # auto size the graph? use False if you are specifying the height and width
    width=1000, # width of the figure in pixels
    height=600, # height of the figure in pixels
    title = "Boxplot of {} column based on {} ".format('Age','NumCompaniesWorked'), # title of the figure
    # more granular control on the title font
    titlefont=dict(
        family='Courier New, monospace', # font family
        size=14, # size of the font
        color='black' # color of the font
    ),
    # granular control on the axes objects
    xaxis=dict(
        title='Number of companies worked and attrition',
        tickfont=dict(
            family='Courier New, monospace', # font family
            size=10, # size of ticks displayed on the x axis
            color='black' # color of the font
        )
    ),
    yaxis=dict(
        # range=[0,100],
        title='Age',
        titlefont=dict(
            size=14,
            color='black'
        ),
        tickfont=dict(
            family='Courier New, monospace', # font family
            size=14, # size of ticks displayed on the y axis
            color='black' # color of the font
        )
    ),
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)

Observe the last two plots, how are they different from others?

Creating new features and plotting

2. What is the relationship between total working years, number of companies worked and attrition (Total Working Years , Number of companies and Attrition.)

Generate a new feature using Total Working Years and Number of companies worked

hr_data['TotalWorkingYears_NumCompWorked'] = np.round(hr_data.TotalWorkingYears / (hr_data.NumCompaniesWorked.astype(int)+1)) # adding 1 to avoid dividing by 0

hr_data.TotalWorkingYears_NumCompWorked.head()
trace0 = go.Box(y= hr_data.TotalWorkingYears_NumCompWorked[hr_data.Attrition=='Yes'],name = 'Yes')
trace1 = go.Box(y = hr_data.TotalWorkingYears_NumCompWorked[hr_data.Attrition=='No'],name = 'No')
data =[trace0,trace1]
layout = go.Layout(width = 900,
    height = 600,
    title = 'Ratio of Total Working Years and Number of Companies worked vs Attrition',
    titlefont=dict(
        family='Courier New, monospace', # font family
        size=14, # size of the font
        color='black' # color of the font
    ),
    # granular control on the axes objects
    xaxis=dict(
        tickfont=dict(
            family='Courier New, monospace',
            size=10,
            color='black'
        )
    ),
    yaxis=dict(
        # range=[0,100],
        title='(TotalWorkingYears/NumCompWorked)',
        titlefont=dict(
            size=14,
            color='black'
        ),
        tickfont=dict(
            family='Courier New, monospace',
            size=14,
            color='black' # color of the font
        )
    ),
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)

3. Do marital status and distance from home affect attrition? (Marital Status, Distance From Home and Attrition)

data = []
for i in np.sort(hr_data.MaritalStatus.unique()):
    data.append(go.Box(y = hr_data.DistanceFromHome[(hr_data.MaritalStatus==i) & (hr_data.Attrition=='Yes')],
        marker = dict(color = '#CC0E1D'), # red
        name = "{}- Yes".format(str(i)))
    )
    data.append(go.Box(y = hr_data.DistanceFromHome[(hr_data.MaritalStatus==i) & (hr_data.Attrition=='No')],
        marker = dict(color = '#588061'), # green
        name = "{}- No".format(str(i)))
    )
layout = go.Layout(
    autosize=False,
    width=1000, # width of the figure in pixels
    height=600, # height of the figure in pixels
    title = "Boxplot of {} column based on {} ".format('DistanceFromHome','MaritalStatus'),
    titlefont=dict(
        family='Courier New, monospace', # font family
        size=14, # size of the font
        color='black' # color of the font
    ),
    # granular control on the axes objects
    xaxis=dict(
        tickfont=dict(
            family='Courier New, monospace', # font family
            size=10, # size of ticks displayed on the x axis
            color='black' # color of the font
        )
    ),
    yaxis=dict(
        # range=[0,100],
        title='Distance travelled',
        titlefont=dict(
            size=14,
            color='black'
        ),
        tickfont=dict(
            family='Courier New, monospace',
            size=14,
            color='black' # color of the font
        )
    ),
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)

Extras

More than 3 variables, 3D plots

n = 1500
# Extracting the x, y, z values
temp = hr_data.iloc[0:n,]
temp.shape
trace1 = go.Scatter3d(
x=temp.PercentSalaryHike[temp.Attrition=='Yes'],
y=temp.YearsAtCompany[temp.Attrition=='Yes'],
z=temp.DistanceFromHome[temp.Attrition=='Yes'],
mode='markers',name ='Yes',
marker=dict(
size=temp.YearsInCurrentRole[temp.Attrition=='Yes']+2,
color='#CC0E1D', # Ferrari red
# colorscale='Viridis', # choose a colorscale
opacity=1
)
)
trace2 = go.Scatter3d(
x=temp.PercentSalaryHike[temp.Attrition=='No'],
y=temp.YearsAtCompany[temp.Attrition=='No'],
z=temp.DistanceFromHome[temp.Attrition=='No'],
mode='markers',name ='No',
marker=dict(
size=temp.YearsInCurrentRole[temp.Attrition=='No']+2,
color='rgb(0,255,0)', #green
# colorscale='Viridis', # choose a colorscale
opacity=0.9,
)
)
data = [trace1,trace2]
layout = go.Layout(
scene = dict(
xaxis = dict(
title='PercentSalaryHike',
backgroundcolor="black",
showbackground=True,
titlefont=dict(
size=16,
color='black'
)
),
yaxis = dict(
title='YearsAtCompany',
showbackground=True,
backgroundcolor="black",
titlefont=dict(
size=16,
color='black'
)
),
zaxis = dict(
title='DistanceFromHome',
backgroundcolor="black",
showbackground=True,
titlefont=dict(
size=16,
color='black'
)
)
),
width=1000, # width of the figure in pixels
height=800, # height of the figure in pixels
)
fig = go.Figure(data=data, layout=layout)
fig['layout'].update(title= "PercentSalaryHike, YearsAtCompany, DistanceFromHome, YearsInCurrentRole and Attrition")
iplot(fig, filename='3d-scatter-colorscale')

Scree Plot

x=list(range(2,10))
y=sse
data = [go.Scatter(x=x, # number of clusters
y=y, # sum of squared errors
text = [str(i) for i in (zip(x,y))], # text to display on hover
textposition = 'top center',
line = dict(color = ('rgb(205, 12, 24)')) # line color
)]
layout = go.Layout(title ='Scree plot (Sum of Squared errors)')
fig = go.Figure(data=data,layout=layout)
iplot(fig)
trace0 = go.Scatter3d(
x=hr_data.Age[hr_data.Attrition=='Yes'],
y=hr_data.MonthlyIncome[hr_data.Attrition=='Yes'],
z=hr_data.DistanceFromHome[hr_data.Attrition=='Yes'],
mode='markers',name ='Yes',
marker=dict(
size=4,
color=hr_data.colors_clusters[hr_data.Attrition=='Yes'],
# colorscale='Viridis', # choose a colorscale
opacity=1
)
)
trace1 = go.Scatter3d(
x=hr_data.Age[hr_data.Attrition=='No'],
y=hr_data.MonthlyIncome[hr_data.Attrition=='No'],
z=hr_data.DistanceFromHome[hr_data.Attrition=='No'],
mode='markers',name ='No',
marker=dict(
size=4,
color=hr_data.colors_clusters[hr_data.Attrition=='No'],
# colorscale='Viridis', # choose a colorscale
opacity=0.75
)
)
data = [trace0,trace1]
layout = go.Layout(
    scene = dict(
        xaxis = dict(
            title='Age',
            backgroundcolor="black",
            showbackground=True,
            titlefont=dict(
                size=16,
                color='black'
            )
        ),
        yaxis = dict(
            title='MonthlyIncome',
            showbackground=True,
            backgroundcolor="black",
            titlefont=dict(
                size=16,
                color='black'
            )
        ),
        zaxis = dict(
            title='DistanceFromHome',
            backgroundcolor="black",
            showbackground=True,
            titlefont=dict(
                size=16,
                color='black'
            )
        )
    ),
    width=1000, # width of the figure in pixels
    height=800, # height of the figure in pixels
    margin = dict(b=15),)
fig = go.Figure(data=data, layout=layout)
fig['layout'].update(title= "Understanding attrition by using the clusters.")
iplot(fig)

One of the metrics for checking whether you have chosen the correct number of clusters is whether you can give every cluster a name that makes sense in business terms.
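
The scree plot above reads its y values from a list named sse, one value for each number of clusters from 2 to 9. A minimal sketch of one way such a list could be produced, assuming a plain k-means clustering written in NumPy on made-up 2D points. The function name kmeans_sse and the data are hypothetical and are not the clustering used for the plots above:

```python
import numpy as np


def kmeans_sse(points, k, n_iter=50, seed=0):
    """Run a basic Lloyd's k-means and return the sum of squared errors."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every center: shape (n, k)
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Move each center to the mean of its points; keep it if the cluster is empty
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # SSE: squared distance of every point to its nearest final center
    dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(dist.min(axis=1).sum())


rng = np.random.default_rng(0)
# Made-up data: three blobs of 50 points each
points = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
                    for c in ([0, 0], [5, 5], [0, 5])])

# One SSE value per cluster count, matching x = list(range(2, 10))
sse = [kmeans_sse(points, k) for k in range(2, 10)]
print([round(s, 1) for s in sse])
```

The number of clusters is usually read off at the "elbow", the point after which adding clusters no longer reduces the SSE by much.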

This is all for now. I have also created a report on Employee Attrition Rate Analysis that you may like to check. Please read it using the link below.

Report on Employee Attrition Rate Analysis

Thank you for reading. Your comments, thoughts on this post are most welcome.
