the name suggests, Conditional Probability is the probability of an event under
some given condition. And based on the condition our sample space reduces to
the conditional element.
For example, find the probability of a person subscribing for the insurance given that he has taken the house loan. Here sample space is restricted to the persons who have taken house loan.
Continue reading “Conditional Probability with examples For Data Science”
Probability in itself is a huge topic to study. Applications of probability are found everywhere whether it is medical science, share market trading, sports, gaming Industry and many more. However in this post my focus is on to explain the topics which are needed to understand data science and machine learning concepts.
Continue reading “Probability Basics for Data Science”
Variance and Standard Deviation are the most commonly used measures of variability and spread. Variability and spread are nothing but the process to know how much data is being varying from the mean point. And Variance tells us the average distance of all data points from the mean point. Standard deviation is just the square root of the variance. As variance is calculated in squared unit (explained below in the post) and hence to come up a value having unit equal to the data points, we take square root of the variance and it is called as Standard Deviation.
Continue reading “Variance, Standard Deviation and Other Measures of Variability and Spread”
k-Nearest Neighbors or kNN algorithm is very easy and powerful Machine Learning algorithm. It can be used for both classification as well as regression that is predicting a continuous value. The very basic idea behind kNN is that it starts with finding out the k-nearest data points known as neighbors of the new data point for which we need to make the prediction. And then if it is regression then take the conditional mean of the neighbors y-value and that is the predicted value for new data point. If it is classification then it takes the mode (majority value) of the neighbors y value and that becomes the predicted class of the new data point.
Continue reading “A Complete Guide to K-Nearest Neighbors Algorithm – KNN using Python”
Principal Component Analysis or PCA is used for dimensionality reduction of the large data set. In my previous post A Complete Guide to Principal Component Analysis – PCA in Machine Learning , I have explained what is PCA and the complete concept behind the PCA technique. This post is in continuation of previous post, However if you have the basic understanding of how PCA works then you may continue else it is highly recommended to go through above mentioned post first.
Continue reading “Step by Step Approach to Principal Component Analysis using Python”
Principal Component Analysis or PCA is a widely used technique for dimensionality reduction of the large data set. Reducing the number of components or features costs some accuracy and on the other hand, it makes the large data set simpler, easy to explore and visualize. Also, it reduces the computational complexity of the model which makes machine learning algorithms run faster. It is always a question and debatable how much accuracy it is sacrificing to get less complex and reduced dimensions data set. we don’t have a fixed answer for this however we try to keep most of the variance while choosing the final set of components.
Continue reading “A Complete Guide to Principal Component Analysis – PCA in Machine Learning”
Linkedin is a professional networking platform. where employers and employees can connect to each other. LinkedIn had 630 million registered members in 200 countries as of june 2019. And 2 new members join LinkedIn per second. These numbers are larger than the population of some of the countries.
Continue reading “How to Use LinkedIn to Drive Traffic to Your Blog”
Logistic regression is the most widely used machine learning algorithm for classification problems. In its original form it is used for binary classification problem which has only two classes to predict. However with little extension and some human brain, logistic regression can easily be used for multi class classification problem. In this post I will be explaining about binary classification. I will also explain about the reason behind maximizing log likelihood function.
Continue reading “What is Logistic Regression?”
Multicollinearity occurs in a multi linear model where we have more than one predictor variables. So Multicollinearity exist when we can linearly predict one predictor variable (note not the target variable) from other predictor variables with significant degree of accuracy. It means two or more predictor variables are highly correlated. But not the vice versa means if there is low correlation among predictors then also multicollinearity may exist.
Continue reading “What is Multicollinearity?”
In R, stepAIC is one of the most commonly used search method for feature selection. We try to keep on minimizing the stepAIC value to come up with the final set of features. “stepAIC” does not necessarily means to improve the model performance, however it is used to simplify the model without impacting much on the performance. So AIC quantifies the amount of information loss due to this simplification. AIC stands for Akaike Information Criteria.
Continue reading “What is stepAIC in R?”
Feature selection is a way to reduce the number of features and hence reduce the computational complexity of the model. Many times feature selection becomes very useful to overcome with overfitting problem. It helps us in determining the smallest set of features that are needed to predict the response variable with high accuracy. if we ask the model, does adding new features, necessarily increase the model performance significantly? if not then why to add those new features which are only going to increase model complexity.
Continue reading “Feature Selection Techniques in Regression Model”
In today’s world we are generating large amount of data every second. while tweeting, chating, writing or even speaking, we are fabricating corpse of data. Most of the data is in textual and unstructured form. Hence to make this data understandable by computer, we need to process it. NLP technique helps us in processing the data and helps us to get useful insights from it.