What is Linear Regression?
It is a predictive analysis technique. It predicts one variable, called the response variable (or dependent variable), from one or more explanatory variables (also called independent or predictor variables). When there is only one predictor variable, the method is called simple regression; otherwise it is called multiple regression.
Linear regression consists of finding the best-fitting straight line (a plane in the case of multiple regression) through the points. The best-fitting line is called the regression line. On a scatter plot of the data, the technique fits the single straight line that best describes the points.
Equation of the straight line (model): y = b0 + b1·x
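The least-squares values of b0 and b1 are calculated as follows, where x̄ and ȳ are the means of the x and y values:
Slope of the straight line: b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²
Intercept on the Y-axis: b0 = ȳ - b1·x̄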
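As a minimal sketch, the slope and intercept of the least-squares line can be computed in a few lines of Python. The function name `fit_line` and the sample data are assumptions made for illustration; they do not come from the original text.

```python
def fit_line(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1*x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Illustrative data (made up for this example).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)
print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.2f}")
```

For this made-up data the means are 3 and 4, so the slope works out to 6 / 10 and the intercept to 4 minus three times the slope.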
How to choose our best fitted line? – Residual Analysis
One popular approach is to test the model on the historical data (the x and y values used to construct the regression model). The values of the predictor variable (x) are inserted into the regression model, which generates a predicted value of the response variable (y) for each of them. These predicted values (Ypred) are then compared with the actual y values to determine how much error the regression line produces.
This difference, Y - Ypred, is referred to as the residual.
The least squares line (also called the regression line or best-fitted line) is the line that minimizes the sum of the squares of these residuals.
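The residual calculation can be sketched as follows. The helper names and the data are hypothetical; the coefficients 2.2 and 0.6 are the least-squares values for this particular made-up data set, and the second line is an arbitrary alternative chosen only for comparison.

```python
def residuals(x, y, b0, b1):
    """Residual for each point: actual y minus predicted y."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def sum_squared_residuals(x, y, b0, b1):
    """Sum of squared residuals for the line y = b0 + b1*x."""
    return sum(r ** 2 for r in residuals(x, y, b0, b1))


# Illustrative data (made up for this example).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Least-squares line for this data.
sse_best = sum_squared_residuals(x, y, 2.2, 0.6)

# A slightly different line gives a larger sum of squared residuals.
sse_other = sum_squared_residuals(x, y, 2.0, 0.7)

print(f"least-squares line: {sse_best:.2f}")
print(f"alternative line:   {sse_other:.2f}")
```

Whatever other intercept and slope are tried on the same data, the sum of squared residuals should not fall below the value obtained for the least-squares line.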
Testing the Assumptions of the Regression Model
After developing the regression model (that is, after finding the best-fitted line), we need to test the four assumptions of simple regression analysis:
- The model is linear.
- The error terms have constant variance.
- The error terms are independent.
- The error terms are normally distributed.
These assumptions can be tested using a residual plot.
Now what is this residual plot?
A residual plot is a graph in which the residuals of a particular regression model are plotted against their associated values of x, as ordered pairs (X, Y - Ypred).
Let’s look at the underlying assumptions one by one:
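To make the ordered pairs concrete, here is a small sketch that builds the points a residual plot would display. The function name `residual_pairs`, the data, and the coefficients (2.2 and 0.6, the least-squares fit for this made-up data) are assumptions for illustration.

```python
def residual_pairs(x, y, b0, b1):
    """Return the (x, y - y_pred) pairs that a residual plot displays."""
    return [(xi, yi - (b0 + b1 * xi)) for xi, yi in zip(x, y)]


# Illustrative data (made up) and its least-squares coefficients.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
pairs = residual_pairs(x, y, 2.2, 0.6)

for xi, r in pairs:
    print(f"x = {xi}   residual = {r:+.2f}")
```

These pairs can be handed to any plotting tool (for example matplotlib's `pyplot.scatter`), with x on the horizontal axis and the residual on the vertical axis. For a least-squares line with an intercept, the residuals sum to zero, so the points sit on both sides of the horizontal zero line.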
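- If the model is linear, the residuals scatter randomly around zero across the whole range of x, with no systematic shape. A curved pattern in the residual plot (for example a U shape or an inverted U) indicates that a straight line does not describe the relationship, so the linearity assumption fails.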
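- The assumption of constant error variance is called homoscedasticity. If the error variances are not constant, it is called heteroscedasticity. In a residual plot, heteroscedasticity typically appears as a fan or funnel shape, with the spread of the residuals growing or shrinking as x increases.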
- If the error terms are not independent, the residuals will follow a pattern. When an error term depends on the one next to it, the value of each residual is a function of the neighbouring residual, so consecutive residuals show runs or regular alternation rather than random variation.
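- A healthy residual plot (one with normally distributed residuals) shows the points scattered randomly in a horizontal band around zero, with most residuals close to zero, few far from it, and roughly as many above zero as below.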
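The visual checks above can be complemented by rough numeric summaries. The sketch below is only an illustration under stated assumptions: `lag1_autocorrelation` and `spread_ratio` are hypothetical helper names, the residual values are made up, and neither function is a formal statistical test. The autocorrelation helper also assumes the residuals are listed in observation order.

```python
def lag1_autocorrelation(res):
    """Correlation between each residual and the one that follows it.

    Values near 0 are consistent with independent errors; values
    well away from 0 suggest neighbouring residuals are related.
    """
    n = len(res)
    mean = sum(res) / n
    num = sum((res[i] - mean) * (res[i + 1] - mean) for i in range(n - 1))
    den = sum((r - mean) ** 2 for r in res)
    return num / den


def spread_ratio(x, res):
    """Mean squared residual in the larger-x half over the smaller-x half.

    Values near 1 are consistent with constant variance; values far
    from 1 suggest the spread changes with x.
    """
    ordered = [r for _, r in sorted(zip(x, res))]
    half = len(ordered) // 2
    low = sum(r ** 2 for r in ordered[:half]) / half
    high = sum(r ** 2 for r in ordered[-half:]) / half
    return high / low


x = [1, 2, 3, 4, 5, 6, 7, 8]

# Made-up residuals that drift steadily upward (dependent errors).
drifting = [-1.0, -0.8, -0.5, -0.1, 0.2, 0.6, 0.7, 0.9]
autocorr = lag1_autocorrelation(drifting)

# Made-up residuals whose spread grows with x (fan shape).
fanning = [0.1, -0.1, 0.2, -0.2, 0.8, -0.9, 1.2, -1.1]
ratio = spread_ratio(x, fanning)

print(f"lag-1 autocorrelation of drifting residuals: {autocorr:.2f}")
print(f"spread ratio of fanning residuals: {ratio:.1f}")
```

Here the drifting residuals give a clearly positive autocorrelation and the fanning residuals give a spread ratio far above 1, matching the patterns described for dependent errors and heteroscedasticity. What counts as "too far" from 0 or 1 depends on the sample size, so these numbers are best read alongside the residual plot rather than in place of it.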