Linear regression is one of the most commonly used statistical techniques. It tests the relationship between two variables: if you have measurements of both, regression fits the line that best approximates how one changes with the other. That line has a slope (remember rise/run from high school algebra?).
In simple linear regression, where you're using one variable to predict another, your y and x, if you will, regression gives you the best-fitting line: the one that minimizes the error of your prediction.
I've been using retail examples in my previous posts in this little series, so let's imagine we're trying to predict sales at a particular price for a sweet North Face jacket. It's so great. If we set the price at $65 for a week, we'll have a sales volume for that week. If we set it at $62 the next week, we'll have a sales volume for that week too. Change the price every week for a good stretch of time, and we'll have a set of prices and a matching set of sales figures.
Now, in this scenario, we control the price, so we'd like to predict what the sales volume (y) will be at a given price (x). Regression will give us the best-fitting line for our data. There's a lot of noise in that data, from customers' purchasing patterns, seasons, other things happening in the store, and so on, and that noise affects how accurate the prediction will be; we call it error. Still, we should be able to make an estimate with some degree of certainty. There are statistical measures of how accurate our prediction is, that is, how much error we have, but generally, the more linear our original data is, the better our predictions are going to be.
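To make that concrete, here's a minimal sketch of fitting that line with NumPy. The weekly price and sales numbers below are made up for illustration, not real North Face data:

```python
import numpy as np

# Hypothetical weekly observations: jacket price (x) and units sold (y).
price = np.array([65, 62, 70, 58, 60, 68, 63, 66], dtype=float)
sales = np.array([120, 135, 95, 150, 140, 102, 128, 110], dtype=float)

# Ordinary least squares fit of y = slope * x + intercept.
slope, intercept = np.polyfit(price, sales, deg=1)

# Use the line to predict sales volume at a price we haven't tried yet.
predicted = slope * 64 + intercept
print(f"slope: {slope:.2f}, intercept: {intercept:.2f}")
print(f"predicted weekly sales at $64: {predicted:.0f}")
```

With made-up data like this, the slope comes out negative, which is what you'd expect: raise the price, sell fewer jackets.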
We can build more complex models in regression. The one-variable model is generally called "simple" and the more complex ones "multiple," but we always need to consider that each added variable likely brings more variation and error into the model, so we need to balance explanatory power against error.
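A multiple regression is the same least-squares idea with more columns. Here's a hedged sketch that adds a second, entirely hypothetical predictor (weekly ad spend) to the made-up jacket data:

```python
import numpy as np

# Hypothetical data: price plus a second predictor, weekly ad spend.
price = np.array([65, 62, 70, 58, 60, 68, 63, 66], dtype=float)
ads = np.array([500, 700, 400, 800, 650, 450, 600, 550], dtype=float)
sales = np.array([120, 135, 95, 150, 140, 102, 128, 110], dtype=float)

# Design matrix with an intercept column: sales ≈ b0 + b1*price + b2*ads.
X = np.column_stack([np.ones_like(price), price, ads])
coefs, _, rank, _ = np.linalg.lstsq(X, sales, rcond=None)
b0, b1, b2 = coefs
print(f"intercept={b0:.1f}, price coef={b1:.2f}, ads coef={b2:.2f}")
```

Note the tradeoff in action: ad spend and price tend to move together here, so the second variable muddies the individual coefficients even as it adds explanatory power.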
Regression also comes with a few assumptions about your data:

- linear relationship between the variables
- equal variance of the errors across the distribution (homoscedasticity)
- errors (residuals) are normally distributed
- no autocorrelation and no multicollinearity (the independent variables shouldn't be strongly correlated with each other)
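You can get a rough read on some of these assumptions by looking at the residuals of the fit. A small sketch, using the same made-up jacket data as above (plotting residuals is the usual way to eyeball this; these printed summaries are just a stand-in):

```python
import numpy as np

# Same hypothetical price/sales data as earlier.
price = np.array([65, 62, 70, 58, 60, 68, 63, 66], dtype=float)
sales = np.array([120, 135, 95, 150, 140, 102, 128, 110], dtype=float)

slope, intercept = np.polyfit(price, sales, deg=1)
residuals = sales - (slope * price + intercept)

# With an intercept in the model, residuals should average out to ~zero.
print(f"mean residual: {residuals.mean():.4f}")

# Rough homoscedasticity check: residual spread at low prices should be
# similar to the spread at high prices.
order = np.argsort(price)
low, high = np.array_split(residuals[order], 2)
print(f"spread at low prices: {low.std():.1f}, at high prices: {high.std():.1f}")
```

If the spread grows noticeably as the predictor increases, or the residuals trend rather than scatter, those assumptions are in doubt and the error estimates on your predictions will be off.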