Key takeaways from this post are:
- Asking questions of a data set
- Univariate Analysis
- Bivariate Analysis
- Analysis of more than 3 variables
- 3D Visualization
- Case Study on employee Attrition Rate using HR Data Set
import warnings
warnings.filterwarnings('ignore')

!pip install plotly
!pip install squarify

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import plotly
import plotly.offline as pyoff
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode, iplot, plot
import plotly.graph_objs as go
import squarify  # for tree maps

%matplotlib inline
plotly
- Modern Visualization for the data Era
Line Chart in plotly
- 2 numeric variables with a one-to-one mapping, i.e., each x value has exactly one corresponding y value
x = [1, 2, 3]
y = [3, 1, 6]
iplot([go.Scatter(x=x, y=y,
                  text=[str(i) for i in zip(x, y)],
                  textposition='top center')])

In offline mode, plot() saves the chart as an HTML file. The image argument does not write an image to disk; it sets the format of the image that is downloaded when the HTML file is opened in a browser.
from plotly.offline import plot
plot([go.Scatter(x=x, y=y,
                 text=[str(i) for i in zip(x, y)],
                 textposition='top center')],
     output_type='file', filename='temp-histogram.jpeg',
     image='jpeg', auto_open=False)
output -> 'temp-histogram.jpeg.html'
Note that this is a bare chart with no information. Later in the activity we will add a title, x labels and y labels.
Basic Bar chart in plotly
- 1 Categorical variable
data = [go.Bar(x=['x', 'y', 'z'], y=[10, 20, 15])]
iplot(data)

Histogram in plotly
- 1 numeric variable
n = 1000
x = np.random.randn(n)
data = [go.Histogram(
    x=x,
    marker=dict(
        color='#CC0E1D',  # Lava (#CC0E1D)
        # color='rgb(200,0,0)'
        # You can provide the color in HEX or rgb format. Programmers generally
        # prefer HEX, as it is a single string value and easy to pass as a variable.
    ))]
layout = go.Layout(title="Histogram of {} random numbers".format(n))
fig = go.Figure(data=data, layout=layout)
iplot(fig)

Boxplot in plotly
- 1 Numeric variable
from IPython.display import Image
Image("img/boxplot.png")

np.random.seed(0)  # Set seed for reproducibility
n = 10
r1 = np.random.randn(n)
r2 = np.random.randn(n)
trace0 = go.Box(y=r1, name='Box1', marker=dict(color='#AA0505'))
trace1 = go.Box(y=r2, name='Box2', marker=dict(color='#B97D10'))
data = [trace0, trace1]
layout = go.Layout(title="Boxplot of 2 sets of random numbers")
fig = go.Figure(data=data, layout=layout)
iplot(fig)

Pie chart in plotly
- 1 Categorical variable
labels = ["Pre processing and Visualization", "Model Building", "Misc"]
values = [80, 10, 10]
trace = go.Pie(labels=labels, values=values)
layout = go.Layout(title='Percentage of time spent on Data Science projects')
data = [trace]
fig = go.Figure(data=data, layout=layout)
iplot(fig)

Note: We do not recommend pie charts. The total is not always obvious, and a variable with many levels makes the chart cluttered.
Scatter plot in plotly
- 2 numeric variables
- One x might have multiple corresponding y values
np.random.seed(0)
n = 20
x = np.random.randint(0, 100, n)
y = np.random.randint(0, 100, n)
data = [go.Scatter(x=x, y=y,
                   text=[str(i) for i in zip(x, y)],
                   textposition='top center',
                   marker=dict(color='rgba(17, 157, 255, 0.8)', size=10),
                   mode='markers')]
layout = go.Layout(title='Scatter plot')
fig = go.Figure(data=data, layout=layout)
iplot(fig)

Tree map
https://plot.ly/python/treemaps/
squarify.plot(sizes=[13, 22, 35, 5],
              label=["group A", "group B", "group C", "group D"],
              alpha=.7)
plt.show()

Heatmap
np.random.rand(2, 2)

# trace = go.Heatmap(z=[[1, 20], [22, 1]], x=['Monday', 'Tuesday'], y=['Morning', 'Afternoon'])
# data = [trace]
# iplot(data)
sns.heatmap(np.random.rand(2, 2))

Case Study
Now let us use our newfound skills to extract insights from a dataset.
hr_data Description
Education: 1 ‘Below College’, 2 ‘College’, 3 ‘Bachelor’, 4 ‘Master’, 5 ‘Doctor’
EnvironmentSatisfaction: 1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’
JobInvolvement: 1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’
JobSatisfaction: 1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’
PerformanceRating: 1 ‘Low’, 2 ‘Good’, 3 ‘Excellent’, 4 ‘Outstanding’
RelationshipSatisfaction: 1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’
WorkLifeBalance: 1 ‘Bad’, 2 ‘Good’, 3 ‘Better’, 4 ‘Best’
hr_data = pd.read_csv("HR_Attrition.csv")
Pre-processing
hr_data.head()

Checking the datatypes



Checking the number of unique values in each column
for i in hr_data.columns:
    print("Number of unique values in {} column are {} \n The unique values are {}".format(
        i, len(hr_data[i].unique()), hr_data[i].unique()))
    print("---------------------- \n")




Observations:
- Most columns have fewer than 4 unique levels
- NumCompaniesWorked and PercentSalaryHike have fewer than 15 unique values, so we can convert them to categorical values for analysis purposes. This is fairly subjective; you can also keep them as integer values.
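As a compact alternative to printing every unique value, the per-column counts can be summarised with `nunique()`. This is a minimal sketch on a made-up table: `toy`, its values, and the `threshold` of 5 are assumptions for illustration only, not the real hr_data.

```python
import pandas as pd

# Toy stand-in for hr_data; the real columns and values come from HR_Attrition.csv.
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes", "No", "No"],
    "Education": [1, 2, 3, 3, 4, 5],
    "MonthlyIncome": [5993, 5130, 2090, 2909, 3468, 3068],
})

# Number of distinct values per column, smallest first.
unique_counts = toy.nunique().sort_values()
print(unique_counts)

# Columns at or below a chosen cut-off are candidates for categorical treatment.
threshold = 5  # hypothetical cut-off for this tiny table
candidates = unique_counts[unique_counts <= threshold].index.tolist()
print(candidates)  # ['Attrition', 'Education']
```

Where to draw the cut-off remains a judgment call, as noted in the observations.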
Replacing the integer codes with the labels from the description
- hr_data.Education = hr_data.Education.replace(to_replace=[1,2,3,4,5],value=['Below College', 'College', 'Bachelor', 'Master', 'Doctor'])
- hr_data.EnvironmentSatisfaction = hr_data.EnvironmentSatisfaction.replace(to_replace=[1,2,3,4],value=['Low', 'Medium', 'High', 'Very High'])
- hr_data.JobInvolvement = hr_data.JobInvolvement.replace(to_replace=[1,2,3,4],value=['Low', 'Medium', 'High', 'Very High'])
- hr_data.JobSatisfaction = hr_data.JobSatisfaction.replace(to_replace=[1,2,3,4],value=['Low', 'Medium', 'High', 'Very High'])
- hr_data.PerformanceRating = hr_data.PerformanceRating.replace(to_replace=[1,2,3,4],value=['Low', 'Good', 'Excellent', 'Outstanding'])
- hr_data.RelationshipSatisfaction = hr_data.RelationshipSatisfaction.replace(to_replace=[1,2,3,4],value=['Low', 'Medium', 'High', 'Very High'])
- hr_data.WorkLifeBalance = hr_data.WorkLifeBalance.replace(to_replace=[1,2,3,4],value=['Bad', 'Good', 'Better', 'Best'])
Education_dict = {1: 'Below College', 2: 'College', 3: 'Bachelor', 4: 'Master', 5: 'Doctor'}
EnvironmentSatisfaction_dict = {1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High'}
JobInvolvement_dict = {1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High'}
JobSatisfaction_dict = {1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High'}
PerformanceRating_dict = {1: 'Low', 2: 'Good', 3: 'Excellent', 4: 'Outstanding'}
RelationshipSatisfaction_dict = {1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High'}
WorkLifeBalance_dict = {1: 'Bad', 2: 'Good', 3: 'Better', 4: 'Best'}
hr_data = hr_data.replace({
    "Education": Education_dict,
    "EnvironmentSatisfaction": EnvironmentSatisfaction_dict,
    "JobInvolvement": JobInvolvement_dict,
    "JobSatisfaction": JobSatisfaction_dict,
    "PerformanceRating": PerformanceRating_dict,
    "RelationshipSatisfaction": RelationshipSatisfaction_dict,
    "WorkLifeBalance": WorkLifeBalance_dict
})
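If you want unexpected codes to stand out, one possible variation is `Series.map`, which turns any code missing from the dictionary into NaN (whereas `replace` leaves it unchanged). The sketch below uses a made-up two-column table; `toy`, `label_maps`, and `labelled` are hypothetical names, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in with two of the coded columns.
toy = pd.DataFrame({"Education": [1, 3, 5, 2], "WorkLifeBalance": [4, 1, 2, 3]})

label_maps = {
    "Education": {1: "Below College", 2: "College", 3: "Bachelor", 4: "Master", 5: "Doctor"},
    "WorkLifeBalance": {1: "Bad", 2: "Good", 3: "Better", 4: "Best"},
}

labelled = toy.copy()
for col, mapping in label_maps.items():
    # map() yields NaN for any code that is missing from the mapping,
    # so a typo or an unexpected code shows up as a missing value.
    labelled[col] = toy[col].map(mapping)

print(labelled)

# Count of codes that could not be mapped, per column.
unmapped = labelled.isna().sum()
print(unmapped)
```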
Extract categorical columns
For the purpose of this analysis, we treat every column with 15 or fewer levels as categorical. The following lines of code extract all the columns that satisfy this condition.
cat_cols = []
for i in hr_data.columns:
    # treat the column as categorical if it holds strings or has 15 or fewer levels
    if hr_data[i].dtype == 'object' or len(np.unique(hr_data[i])) <= 15:
        cat_cols.append(i)
        print("{} : {} : {} ".format(i, len(np.unique(hr_data[i])), np.unique(hr_data[i])))

Print the categorical column names

Check if the above columns are categorical in the data set

Type Conversion
- In older pandas versions, n-dimensional (multi-column) type conversion to 'category' is not implemented, so we convert one column at a time
for i in cat_cols:
    hr_data[i] = hr_data[i].astype('category')
Categorical attributes summary


Extracting Numeric Columns


Exploratory Data Analysis
Univariate Analysis
1. What is the attrition rate in the company?
Attrition in numbers (pandas)
hr_data.Attrition.value_counts()

plt.figure()
hr_data.Attrition.value_counts().plot(kind='bar', figsize=(6, 3), color="blue",
                                      alpha=0.7, fontsize=13)
plt.title('Attrition rate (in numbers)')
plt.grid()
plt.show()

This is one way to tell matplotlib to plot the graphs in the notebook
Attrition rate in percentage (pandas)
((hr_data.Attrition.value_counts() / sum(hr_data.Attrition.value_counts())) * 100).plot(
    kind='bar', figsize=(6, 3), color=["blue"], alpha=0.7, fontsize=16)
plt.ylim([0, 100])
plt.title('Attrition Rate (in percentage)')
plt.ylabel('Percentage Attrition', fontsize=14)
plt.grid(True)
plt.show()

Attrition rate as proportions (plotly)
temp = hr_data.Attrition.value_counts()
trace = go.Bar(x=temp.index,
               y=np.round(temp.astype(float) / temp.values.sum(), 2),
               text=np.round(temp.astype(float) / temp.values.sum(), 2),
               textposition='auto',
               name='Attrition')
data = [trace]
layout = go.Layout(autosize=False, width=600, height=400, title="Attrition Distribution")
fig = go.Figure(data=data, layout=layout)
iplot(fig)
del temp
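The share of each level, computed above by dividing the counts by their sum, can also be obtained directly with `value_counts(normalize=True)`. A minimal sketch on made-up data (the `attrition` series below is a toy stand-in, not the real column):

```python
import pandas as pd

# Toy stand-in for hr_data.Attrition.
attrition = pd.Series(["No", "No", "Yes", "No", "Yes", "No", "No", "No"], name="Attrition")

counts = attrition.value_counts()                # absolute numbers per level
shares = attrition.value_counts(normalize=True)  # proportions that sum to 1
percentages = (shares * 100).round(2)            # the same shares as percentages

print(counts)
print(percentages)
```

Either `shares` or `percentages` can then be passed as the y values of the bar chart.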

2. What is the Gender Distribution in the company?
temp = hr_data.Gender.value_counts()
temp

data = [go.Bar(x=temp.index,
               y=np.round(temp.astype(float) / temp.values.sum(), 2),
               text=np.round(temp.astype(float) / temp.values.sum(), 2),
               textposition='auto')]
layout = go.Layout(autosize=False, width=600, height=400, title="Gender Distribution")
fig = go.Figure(data=data, layout=layout)
iplot(fig)
del temp

temp = hr_data.Gender.value_counts()
temp

Steps to create a bar chart with counts for a categorical variable in plotly
- create an object and store the counts (optional)
- create a bar object
  - pass the x values
  - pass the y values
  - optional:
    - text to be displayed
    - text position
    - color of the bar
    - name of the bar (trace in plotly terminology)
- create a layout object
  - title: font and size of title
  - x axis: font and size of x-axis text
  - y axis: font and size of y-axis text
- create a figure object
  - add data
  - add layout
- plot the figure object
# create a table with value counts
temp = hr_data.Gender.value_counts()
# creating a Bar chart object of plotly
data = [go.Bar(
    x=temp.index.astype(str),  # x axis values
    y=np.round(temp.values.astype(float) / temp.values.sum(), 4) * 100,  # y axis values in percent
    text=['{}%'.format(i) for i in np.round(temp.values.astype(float) / temp.values.sum(), 4) * 100],
    textposition='auto',  # specify at which position on the bar the text should appear
    marker=dict(color='#0047AB'))]  # color of the bar: Cobalt Blue
# these are used to define the layout options
layout = go.Layout(
    autosize=False,  # auto size the graph? use False if you are specifying the height and width
    width=800,       # width of the figure in pixels
    height=600,      # height of the figure in pixels
    title="Distribution of {} column".format('Gender'),  # title of the figure
    # more granular control on the title font
    titlefont=dict(
        family='Courier New, monospace',  # font family
        size=16,                          # size of the font
        color='black'                     # color of the font
    ),
    # granular control on the axes objects
    xaxis=dict(
        tickfont=dict(
            family='Courier New, monospace',  # font family
            size=16,                          # size of ticks displayed on the x axis
            color='black'                     # color of the font
        )
    ),
    yaxis=dict(
        title='Percentage',
        titlefont=dict(size=16, color='black'),
        tickfont=dict(
            family='Courier New, monospace',  # font family
            size=16,                          # size of ticks displayed on the y axis
            color='black'                     # color of the font
        )
    ),
    font=dict(
        family='Courier New, monospace',  # font family
        color="white",                    # color of the font
        size=12                           # size of the font displayed on the bar
    )
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)
del temp

We will save the above layout in an object and define a function for future use
def generate_layout_bar(col_name):
    layout_bar = go.Layout(
        autosize=False,  # auto size the graph? use False if you are specifying the height and width
        width=800,       # width of the figure in pixels
        height=600,      # height of the figure in pixels
        title="Distribution of {} column".format(col_name),  # title of the figure
        # more granular control on the title font
        titlefont=dict(
            family='Courier New, monospace',  # font family
            size=14,                          # size of the font
            color='black'                     # color of the font
        ),
        # granular control on the axes objects
        xaxis=dict(
            tickfont=dict(
                family='Courier New, monospace',  # font family
                size=14,                          # size of ticks displayed on the x axis
                color='black'                     # color of the font
            )
        ),
        yaxis=dict(
            title='Percentage',
            titlefont=dict(size=14, color='black'),
            tickfont=dict(family='Courier New, monospace', size=14, color='black')),
        font=dict(family='Courier New, monospace', color="white", size=12))
    return layout_bar
Defining a function to plot the bar charts
def plot_bar(col_name):
    # create a table with value counts
    temp = hr_data[col_name].value_counts()
    # creating a Bar chart object of plotly
    data = [go.Bar(
        x=temp.index.astype(str),
        y=np.round(temp.values.astype(float) / temp.values.sum(), 4) * 100,
        text=['{}%'.format(i) for i in np.round(temp.values.astype(float) / temp.values.sum(), 4) * 100],
        textposition='auto',  # specify at which position on the bar the text should appear
        marker=dict(color='#0047AB'))]
    layout_bar = generate_layout_bar(col_name=col_name)
    fig = go.Figure(data=data, layout=layout_bar)
    return iplot(fig)
3. How many people travel? (Business Travel)

4. Which department has the highest number of employees? (Department)
plot_bar('Department')

5. What is the most common educational background of the employees (Education Field)
plot_bar('EducationField')

6. In what roles are the employees working and what is the common job role? (Job Role)
plot_bar('JobRole')

7. Is the workforce in the company young? (Age)
plot_bar('Age')

Age is a continuous variable, so it makes more sense to plot a histogram than a bar chart.
Histogram
data = [go.Histogram(x=hr_data.Age, marker=dict(color='#CC0E1D'))]
layout = go.Layout(title="Histogram of Age")
fig = go.Figure(data=data, layout=layout)
iplot(fig)

8. What is the income distribution in the company?(Monthly Income)
data = [go.Histogram(x=hr_data.MonthlyIncome, marker=dict(color='#CC0E1D'))]
layout = go.Layout(title="Histogram of Income")
fig = go.Figure(data=data, layout=layout)
iplot(fig)

Observations:
- We see that the income column has a long-tailed distribution
- Binning might give better insights into the distribution
Let us bin the Income column
hr_data['Income_Bins'] = np.digitize(
    hr_data.MonthlyIncome,
    list(range(0, hr_data.MonthlyIncome.max() + 10, 2500)),
    right=True)
list(range(0, hr_data.MonthlyIncome.max() + 10, 2500))
hr_data['Income_Bins'].value_counts()

hr_data['Income_Bins'] = hr_data['Income_Bins'].replace(
    to_replace=[1, 2, 3, 4, 5, 6, 7, 8],
    value=['Bin1', 'Bin2', 'Bin3', 'Bin4', 'Bin5', 'Bin6', 'Bin7', 'Bin8'])
temp = hr_data['Income_Bins'].value_counts()
temp = temp.sort_index()
trace1 = go.Bar(
    x=temp.index,
    y=(temp.values.astype(float) / sum(temp.values)) * 100,
    text=['{}%'.format(i) for i in np.round(temp.values.astype(float) / temp.values.sum(), 4) * 100],
    textposition='auto',
    name='Income_Bins')
data = [trace1]
# these are used to define the layout options
layout = generate_layout_bar('Income_Bins')
fig = go.Figure(data=data, layout=layout)
iplot(fig)
print(list(range(0, hr_data.MonthlyIncome.max() + 10, 2500)))
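To see how `np.digitize` with `right=True` assigns the bin numbers used above, here is a small worked example. The `incomes` array is made up for illustration; the edges are built the same way as in the post, in steps of 2500.

```python
import numpy as np

# Hypothetical monthly incomes, chosen to sit on and around bin edges.
incomes = np.array([1009, 2500, 2501, 7499, 19999])

# Same construction as in the post: edges every 2500 up to just past the maximum.
edges = list(range(0, int(incomes.max()) + 10, 2500))
print(edges)  # [0, 2500, 5000, 7500, 10000, 12500, 15000, 17500, 20000]

# With right=True a value x gets bin i when edges[i-1] < x <= edges[i],
# so a value exactly on an edge falls into the lower bin.
bin_ids = np.digitize(incomes, edges, right=True)
print(bin_ids)  # [1 1 2 3 8]

labels = ["Bin{}".format(i) for i in bin_ids]
print(labels)  # ['Bin1', 'Bin1', 'Bin2', 'Bin3', 'Bin8']
```

Bin1 therefore covers incomes above 0 and up to 2500, Bin2 covers incomes above 2500 and up to 5000, and so on.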

Bivariate Analysis
1. Does one gender travel a longer distance than the other? (Gender and Distance from home)
# boxpoints specifies which points to plot
# jitter specifies how far apart the points are spread
trace1 = go.Box(y=hr_data.DistanceFromHome[hr_data.Gender == 'Male'],
                name='Male', boxpoints='all', jitter=1)
trace2 = go.Box(y=hr_data.DistanceFromHome[hr_data.Gender == 'Female'],
                name='Female', boxpoints='all', jitter=1)
data = [trace1, trace2]
layout = go.Layout(width=1000, height=500, title='Distance from home and Gender')
fig = go.Figure(data=data, layout=layout)
iplot(fig)

Distance Bins and Gender
hr_data['Distance_Bins'] = np.digitize(
    hr_data.DistanceFromHome,
    [0, 5, 15, np.max(hr_data.DistanceFromHome)],
    right=True)
temp = hr_data.groupby(['Distance_Bins', 'Gender']).size().to_frame()
temp = temp.reset_index()
temp.columns = ['Distance_Bins', 'Gender', 'Count']
temp

trace1 = go.Bar(x=temp.Distance_Bins[temp.Gender == 'Male'],
                y=temp.Count[temp.Gender == 'Male'],
                text=temp.Count[temp.Gender == 'Male'],
                textposition='auto', name='Male')
trace2 = go.Bar(x=temp.Distance_Bins[temp.Gender == 'Female'],
                y=temp.Count[temp.Gender == 'Female'],
                text=temp.Count[temp.Gender == 'Female'],
                textposition='auto', name='Female')
data = [trace1, trace2]
layout = go.Layout(width=700, height=500, title='Gender and Distance bins',
                   yaxis=dict(title='Count'))
fig = go.Figure(data=data, layout=layout)
iplot(fig)

Observations:
- Irrespective of the distance bin, there is a global pattern, i.e., every bin has more male employees than female employees
2. Are employees working overtime getting better ratings? (Over Time and Performance Rating.)
temp = hr_data.groupby(['OverTime', 'PerformanceRating']).size().to_frame()
temp = temp.reset_index()
temp.columns = ['OverTime', 'PerformanceRating', 'Count']
temp

hr_data.OverTime.value_counts()

trace1 = go.Bar(x=temp.OverTime[temp.PerformanceRating == 'Excellent'],
                y=temp.Count[temp.PerformanceRating == 'Excellent'] / temp.Count.sum(),
                name='Excellent')
trace2 = go.Bar(x=temp.OverTime[temp.PerformanceRating == 'Outstanding'],
                y=temp.Count[temp.PerformanceRating == 'Outstanding'] / temp.Count.sum(),
                name='Outstanding')
data = [trace1, trace2]
layout = go.Layout(width=800, height=600, title='OverTime and PerformanceRating')
fig = go.Figure(data=data, layout=layout)
iplot(fig)
fig

All the proportions add up to one because each count is divided by the grand total, so we can compare the numbers globally.
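Dividing by the grand total answers "what share of all employees falls in each cell". Because the overtime groups differ in size, it may also help to look at shares within each group. The sketch below contrasts the two with `pd.crosstab`; the `toy` table is made up for illustration and is not the real hr_data.

```python
import pandas as pd

# Toy stand-in for the OverTime and PerformanceRating columns.
toy = pd.DataFrame({
    "OverTime": ["Yes", "Yes", "No", "No", "No", "No", "Yes", "No"],
    "PerformanceRating": ["Excellent", "Outstanding", "Excellent", "Excellent",
                          "Outstanding", "Excellent", "Excellent", "Excellent"],
})

# Share of all employees in each (OverTime, PerformanceRating) cell; sums to 1 overall.
global_share = pd.crosstab(toy["OverTime"], toy["PerformanceRating"], normalize="all")

# Share within each OverTime group; every row sums to 1.
within_group = pd.crosstab(toy["OverTime"], toy["PerformanceRating"], normalize="index")

print(global_share)
print(within_group)
```

In this toy table the "Outstanding" cells have the same global share (0.125 each), yet the within-group share is higher for the overtime group (1/3 versus 0.2), which is the comparison the question is really about.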
3. Does working longer with a manager have any relationship with Job satisfaction? (Years With Current Manager, Job Satisfaction)
yearscurrman_jobsat = hr_data.groupby(['YearsWithCurrManager', 'JobSatisfaction']).size().to_frame()
yearscurrman_jobsat = yearscurrman_jobsat.reset_index()
yearscurrman_jobsat.columns = ['YearsWithCurrManager', 'JobSatisfaction', 'Counts']
np.random.seed(0)
yearscurrman_jobsat.sample(frac=0.1)

# one bar trace per job satisfaction level, in the order Low, Medium, High, Very High
data = []
for level in ['Low', 'Medium', 'High', 'Very High']:
    subset = yearscurrman_jobsat[yearscurrman_jobsat.JobSatisfaction == level]
    data.append(go.Bar(x=subset.YearsWithCurrManager,
                       y=subset.Counts,
                       text=subset.Counts,
                       textposition='auto',
                       name=level))
layout = go.Layout(width=1000, barmode='stack', height=600,
                   title='YearsWithCurrManager and Job Satisfaction',
                   xaxis=dict(title='YearsWithCurrManager'),
                   yaxis=dict(title='Counts',
                              range=[0, yearscurrman_jobsat.Counts.max() + 10]))
fig = go.Figure(data=data, layout=layout)
iplot(fig)

We observe that the red bars are higher than the green bars only after 2 years, so we can infer that employees generally tend to become comfortable working with a manager after 2 years.
4. Are married employees staying far from the office? (Marital status and Distance from home)
hr_data.MaritalStatus.unique()

hr_data.DistanceFromHome[hr_data.MaritalStatus=='Divorced'].describe()

hr_data.DistanceFromHome[hr_data.MaritalStatus=='Married'].describe()

hr_data.DistanceFromHome[hr_data.MaritalStatus=='Single'].describe()

tracediv = go.Box(y=hr_data.DistanceFromHome[hr_data.MaritalStatus == 'Divorced'],
                  name='Divorced')
tracemarried = go.Box(y=hr_data.DistanceFromHome[hr_data.MaritalStatus == 'Married'],
                      name='Married')
tracesin = go.Box(y=hr_data.DistanceFromHome[hr_data.MaritalStatus == 'Single'],
                  name='Single')
data = [tracediv, tracemarried, tracesin]
layout = go.Layout(width=800, height=500, title='Distance from home and Marital Status')
fig = go.Figure(data=data, layout=layout)
iplot(fig)

5. Is there any relationship between Attrition and Gender?
Gender_Attrition = hr_data.groupby(['Gender', 'Attrition']).size().to_frame()
Gender_Attrition = Gender_Attrition.reset_index()
Gender_Attrition.columns = ['Gender', 'Attrition', 'Count']
Gender_Attrition

trace1 = go.Bar(x=Gender_Attrition.Gender[Gender_Attrition.Attrition == 'Yes'],
                y=Gender_Attrition.Count[Gender_Attrition.Attrition == 'Yes'],
                text=Gender_Attrition.Count[Gender_Attrition.Attrition == 'Yes'],
                textposition='auto', name='Yes')
# the labels on the 'No' bars must use the 'No' counts
trace2 = go.Bar(x=Gender_Attrition.Gender[Gender_Attrition.Attrition == 'No'],
                y=Gender_Attrition.Count[Gender_Attrition.Attrition == 'No'],
                text=Gender_Attrition.Count[Gender_Attrition.Attrition == 'No'],
                textposition='auto', name='No')
data = [trace1, trace2]
layout = go.Layout(width=800, height=600, title='Gender and Attrition')
fig = go.Figure(data=data, layout=layout)
iplot(fig)

6. Employees who spend more years in the company tend to leave. Verify if this is true.(Years at company and Attrition)
hr_data.YearsAtCompany[hr_data.Attrition=='Yes'].describe()

hr_data.YearsAtCompany[hr_data.Attrition=='No'].describe()

# boxpoints specifies which points to plot
# jitter specifies how far apart the points are spread
trace1 = go.Box(y=hr_data.YearsAtCompany[hr_data.Attrition == 'Yes'],
                name='Yes', boxpoints='all', jitter=1)
trace2 = go.Box(y=hr_data.YearsAtCompany[hr_data.Attrition == 'No'],
                name='No', boxpoints='all', jitter=1)
data = [trace1, trace2]
layout = go.Layout(width=800, height=500, title='YearsAtCompany and Attrition')
fig = go.Figure(data=data, layout=layout)
iplot(fig)

7. Is a particular age group more prone to leaving the company? (Age and Attrition)
hr_data.Age[hr_data.Attrition=='Yes'].describe()

hr_data.Age[hr_data.Attrition=='No'].describe()

trace1 = go.Box(y=hr_data.Age[hr_data.Attrition == 'Yes'], name='Yes')
trace2 = go.Box(y=hr_data.Age[hr_data.Attrition == 'No'], name='No')
data = [trace1, trace2]
layout = go.Layout(width=800, height=500, title='Age and Attrition')
fig = go.Figure(data=data, layout=layout)
iplot(fig)

You can also bin the age column and do the same.
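One way to do that is sketched below with `pd.cut`, followed by the attrition rate per age bin. The `toy` table, the bin edges, and the bin labels are all made up for illustration; suitable edges for the real data are a judgment call.

```python
import pandas as pd

# Toy stand-in for the Age and Attrition columns.
toy = pd.DataFrame({
    "Age": [19, 24, 29, 33, 38, 41, 47, 52, 58, 26],
    "Attrition": ["Yes", "Yes", "No", "No", "No", "Yes", "No", "No", "No", "Yes"],
})

# Hypothetical bin edges; by default each bin is open on the left and closed
# on the right, e.g. (17, 30].
edges = [17, 30, 45, 60]
toy["Age_Bins"] = pd.cut(toy["Age"], bins=edges, labels=["18-30", "31-45", "46-60"])

# Fraction of employees in each age bin who left the company.
left = toy["Attrition"].eq("Yes")
attrition_rate = left.groupby(toy["Age_Bins"], observed=False).mean()
print(attrition_rate)
```

The resulting rates can be plotted as a bar chart in the same way as the other binned columns.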
8. Employees earning less tend to leave the company. Verify if this is true. (Monthly Income vs Attrition)

trace1 = go.Box(y=hr_data.MonthlyIncome[hr_data.Attrition == 'Yes'], name='Yes')
trace2 = go.Box(y=hr_data.MonthlyIncome[hr_data.Attrition == 'No'], name='No')
data = [trace1, trace2]
layout = go.Layout(width=800, height=500, title='Income and Attrition')
fig = go.Figure(data=data, layout=layout)
iplot(fig)


9. How do Age and Monthly Income vary?

trace = go.Scatter(x=hr_data.Age, y=hr_data.MonthlyIncome,
                   name='Age and MonthlyIncome', mode='markers')
data = [trace]
layout = go.Layout(title='Age and Monthly Income distribution',
                   xaxis=dict(title='Age'),
                   yaxis=dict(title='Monthly Income'))
fig = go.Figure(data=data, layout=layout)
iplot(fig)

10. Does Years With Curr Manager have anything to do with Years Since Last Promotion?

trace = go.Scatter(x=hr_data.YearsWithCurrManager, y=hr_data.YearsSinceLastPromotion,
                   name='YearsWithCurrManager and YearsSinceLastPromotion', mode='markers')
data = [trace]
layout = go.Layout(title='YearsWithCurrManager and YearsSinceLastPromotion distribution',
                   xaxis=dict(title='YearsWithCurrManager'),
                   yaxis=dict(title='YearsSinceLastPromotion'))
fig = go.Figure(data=data, layout=layout)
iplot(fig)

3 variables
1. What is the relationship between number of companies worked, age and attrition? (Number of companies worked, Age, Attrition)

data = []
for i in np.sort(hr_data.NumCompaniesWorked.unique()):
    data.append(go.Box(
        y=hr_data.Age[hr_data.NumCompaniesWorked == i][hr_data.Attrition == 'Yes'],
        marker=dict(color='#CC0E1D'),
        name="{}- Yes".format(str(i))))
    data.append(go.Box(
        y=hr_data.Age[hr_data.NumCompaniesWorked == i][hr_data.Attrition == 'No'],
        marker=dict(color='#588061'),
        name="{}- No".format(str(i))))
layout = go.Layout(
    autosize=False,  # auto size the graph? use False if you are specifying the height and width
    width=1000,      # width of the figure in pixels
    height=600,      # height of the figure in pixels
    title="Boxplot of {} column based on {} ".format('Age', 'NumCompaniesWorked'),  # title of the figure
    # more granular control on the title font
    titlefont=dict(
        family='Courier New, monospace',  # font family
        size=14,                          # size of the font
        color='black'                     # color of the font
    ),
    # granular control on the axes objects
    xaxis=dict(
        title='Number of Companies worked and attrition',
        tickfont=dict(family='Courier New, monospace', size=10, color='black')),
    yaxis=dict(
        # range=[0,100],
        title='Age',
        titlefont=dict(size=14, color='black'),
        tickfont=dict(family='Courier New, monospace', size=14, color='black')))
fig = go.Figure(data=data, layout=layout)
iplot(fig)

Observe the last two plots, how are they different from others?
Creating new features and plotting
2. What is the relationship between total working years, number of companies worked and attrition (Total Working Years , Number of companies and Attrition.)
Generate a new feature using Total Working Years and Number of companies worked
# adding 1 to avoid dividing by 0
hr_data['TotalWorkingYears_NumCompWorked'] = np.round(
    hr_data.TotalWorkingYears / (hr_data.NumCompaniesWorked.astype(int) + 1))
hr_data.TotalWorkingYears_NumCompWorked.head()
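To make the feature concrete, here is the same formula applied to a few made-up rows. The `toy` table and the column name `YearsPerCompany` are hypothetical; the real feature is computed on hr_data as shown above.

```python
import numpy as np
import pandas as pd

# Toy rows: total working years and number of companies worked.
toy = pd.DataFrame({
    "TotalWorkingYears": [8, 10, 7, 0],
    "NumCompaniesWorked": [8, 1, 0, 1],
})

# Same formula as in the post: years divided by (companies + 1), rounded.
# The +1 avoids division by zero when NumCompaniesWorked is 0.
toy["YearsPerCompany"] = np.round(
    toy.TotalWorkingYears / (toy.NumCompaniesWorked + 1))

print(toy)
```

A low value (for example 8 years spread over 8 + 1 employers, which rounds to 1) suggests frequent job changes, while a high value suggests long stays with each employer.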

trace0 = go.Box(y=hr_data.TotalWorkingYears_NumCompWorked[hr_data.Attrition == 'Yes'], name='Yes')
trace1 = go.Box(y=hr_data.TotalWorkingYears_NumCompWorked[hr_data.Attrition == 'No'], name='No')
data = [trace0, trace1]
layout = go.Layout(
    width=900, height=600,
    title='Ratio of Total Working Years and Number of Companies worked vs Attrition',
    titlefont=dict(
        family='Courier New, monospace',  # font family
        size=14,                          # size of the font
        color='black'                     # color of the font
    ),
    # granular control on the axes objects
    xaxis=dict(
        tickfont=dict(family='Courier New, monospace', size=10, color='black')),
    yaxis=dict(
        # range=[0,100],
        title='(TotalWorkingYears/NumCompWorked)',
        titlefont=dict(size=14, color='black'),
        tickfont=dict(family='Courier New, monospace', size=14, color='black')))
fig = go.Figure(data=data, layout=layout)
iplot(fig)

3. Do marital status and distance from home affect attrition? (Marital Status, Distance From Home and Attrition)
data = []
for i in np.sort(hr_data.MaritalStatus.unique()):
    data.append(go.Box(
        y=hr_data.DistanceFromHome[hr_data.MaritalStatus == i][hr_data.Attrition == 'Yes'],
        marker=dict(color='#CC0E1D'),  # red
        name="{}- Yes".format(str(i))))
    data.append(go.Box(
        y=hr_data.DistanceFromHome[hr_data.MaritalStatus == i][hr_data.Attrition == 'No'],
        marker=dict(color='#588061'),  # green
        name="{}- No".format(str(i))))
layout = go.Layout(
    autosize=False,
    width=1000,  # width of the figure in pixels
    height=600,  # height of the figure in pixels
    title="Boxplot of {} column based on {} ".format('DistanceFromHome', 'MaritalStatus'),
    titlefont=dict(
        family='Courier New, monospace',  # font family
        size=14,                          # size of the font
        color='black'                     # color of the font
    ),
    # granular control on the axes objects
    xaxis=dict(
        tickfont=dict(
            family='Courier New, monospace',  # font family
            size=10,                          # size of ticks displayed on the x axis
            color='black'                     # color of the font
        )
    ),
    yaxis=dict(
        # range=[0,100],
        title='Distance travelled',
        titlefont=dict(size=14, color='black'),
        tickfont=dict(family='Courier New, monospace', size=14, color='black')))
fig = go.Figure(data=data, layout=layout)
iplot(fig)

Extras
>3 variables, 3D plots.
n = 1500
# Extracting the x, y, z values
temp = hr_data.iloc[0:n, ]
temp.shape

trace1 = go.Scatter3d(
    x=temp.PercentSalaryHike[temp.Attrition == 'Yes'],
    y=temp.YearsAtCompany[temp.Attrition == 'Yes'],
    z=temp.DistanceFromHome[temp.Attrition == 'Yes'],
    mode='markers', name='Yes',
    marker=dict(
        size=temp.YearsInCurrentRole[temp.Attrition == 'Yes'] + 2,
        color='#CC0E1D',  # Ferrari red
        # colorscale='Viridis',  # choose a colorscale
        opacity=1))
trace2 = go.Scatter3d(
    x=temp.PercentSalaryHike[temp.Attrition == 'No'],
    y=temp.YearsAtCompany[temp.Attrition == 'No'],
    z=temp.DistanceFromHome[temp.Attrition == 'No'],
    mode='markers', name='No',
    marker=dict(
        size=temp.YearsInCurrentRole[temp.Attrition == 'No'] + 2,
        color='rgb(0,255,0)',  # green
        # colorscale='Viridis',  # choose a colorscale
        opacity=0.9))
data = [trace1, trace2]
layout = go.Layout(
    scene=dict(
        xaxis=dict(title='PercentSalaryHike', backgroundcolor="black", showbackground=True,
                   titlefont=dict(size=16, color='black')),
        yaxis=dict(title='YearsAtCompany', backgroundcolor="black", showbackground=True,
                   titlefont=dict(size=16, color='black')),
        zaxis=dict(title='DistanceFromHome', backgroundcolor="black", showbackground=True,
                   titlefont=dict(size=16, color='black'))),
    width=1000,  # width of the figure in pixels
    height=800,  # height of the figure in pixels
)
fig = go.Figure(data=data, layout=layout)
fig['layout'].update(
    title="PercentSalaryHike, YearsAtCompany, DistanceFromHome, YearsInCurrentRole and Attrition")
iplot(fig, filename='3d-scatter-colorscale')


Scree Plot
x = list(range(2, 10))
y = sse
data = [go.Scatter(x=x,  # number of clusters
                   y=y,  # sum of squared errors
                   text=[str(i) for i in zip(x, y)],  # text to display on hover
                   textposition='top center',
                   line=dict(color='rgb(205, 12, 24)'))]  # line color
layout = go.Layout(title='Scree plot (Sum of Squared errors)')
fig = go.Figure(data=data, layout=layout)
iplot(fig)
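The list `sse` plotted above is not defined in the code shown in this post; it presumably comes from a clustering step. As a rough sketch of how such a list could be produced, the block below runs a basic k-means (Lloyd's algorithm) for 2 to 9 clusters and records the sum of squared errors for each. The choice of k-means, the helper `kmeans_sse`, and the synthetic `points` are all assumptions for illustration, not the author's actual method or data.

```python
import numpy as np

def kmeans_sse(points, k, n_iter=50, seed=0):
    """Run a basic Lloyd's k-means and return the final sum of squared errors."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared distance from every point to every center: shape (n_points, k).
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties out
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # SSE: squared distance of each point to its nearest center, summed.
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(dists.min(axis=1).sum())

# Synthetic stand-in for the numeric columns that would be clustered.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

x = list(range(2, 10))                    # number of clusters, as in the plot
sse = [kmeans_sse(points, k) for k in x]  # one SSE value per cluster count
print(list(zip(x, sse)))
```

With real data the columns would normally be scaled before clustering, since k-means is distance-based. The "elbow" in the scree plot, where the SSE stops dropping sharply, is a common heuristic for choosing the number of clusters.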



trace0 = go.Scatter3d(
    x=hr_data.Age[hr_data.Attrition == 'Yes'],
    y=hr_data.MonthlyIncome[hr_data.Attrition == 'Yes'],
    z=hr_data.DistanceFromHome[hr_data.Attrition == 'Yes'],
    mode='markers', name='Yes',
    marker=dict(
        size=4,
        color=hr_data.colors_clusters[hr_data.Attrition == 'Yes'],
        # colorscale='Viridis',  # choose a colorscale
        opacity=1))
trace1 = go.Scatter3d(
    x=hr_data.Age[hr_data.Attrition == 'No'],
    y=hr_data.MonthlyIncome[hr_data.Attrition == 'No'],
    z=hr_data.DistanceFromHome[hr_data.Attrition == 'No'],
    mode='markers', name='No',
    marker=dict(
        size=4,
        color=hr_data.colors_clusters[hr_data.Attrition == 'No'],
        # colorscale='Viridis',  # choose a colorscale
        opacity=0.75))
data = [trace0, trace1]
layout = go.Layout(
    scene=dict(
        xaxis=dict(title='Age', backgroundcolor="black", showbackground=True,
                   titlefont=dict(size=16, color='black')),
        yaxis=dict(title='MonthlyIncome', backgroundcolor="black", showbackground=True,
                   titlefont=dict(size=16, color='black')),
        zaxis=dict(title='DistanceFromHome', backgroundcolor="black", showbackground=True,
                   titlefont=dict(size=16, color='black'))),
    width=1000,  # width of the figure in pixels
    height=800,  # height of the figure in pixels
    margin=dict(b=15))
fig = go.Figure(data=data, layout=layout)
fig['layout'].update(title="Understanding attrition by using the clusters.")
iplot(fig)

One way to check whether you have chosen the right number of clusters is to see whether you can give every cluster a meaningful name in business terms.
This is all for now. I have also created a report on Employee Attrition Rate Analysis that you may like to check as well. Please read it using the link below.
Report on Employee Attrition Rate Analysis
Thank you for reading. Your comments, thoughts on this post are most welcome.