The R-squared measure typically falls between 0 and 1, where 0 means none of the variance in y is explained by the predictor variable and 1 means 100% of the variance is explained by the predictor variable. This is a very handy measure: it distills all the math behind regression into one number, and one that has a built-in scale (bigger is better, smaller is worse).
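To calculate the R-squared measure, you first need to calculate the prediction for each of your known data points to get pred(x), read as “prediction using x”.
The first step is to get SStot, the Total Sum of Squares: the sum of the squared differences between the average of y and each individual y value.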
SStot = SUM( (avg(y) – y)^2 )
The second step is to get the Residual Sum of Squares. A residual, as defined in the next section, is the difference between your prediction (pred(x)) and the actual result (y).
SSres = SUM( (pred(x) – y)^2 )
Taking a step back for a second, what do these two calculations tell us? SStot gives you the total “natural” variation of the y variable around its average. SSres gives you the variation that is left over after the model has made its predictions. If SSres is small compared to SStot, the model has captured most of that natural variation, so R-squared will be close to one. You want SSres to be as small as you can get it, meaning you want the predictions to be as close as possible to the actual values of y.
With that in mind, the final calculation of R-squared is to divide SSres by SStot and subtract that quotient from 1.
R-squared = 1 – (SSres / SStot)
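The two sums and the final formula can be sketched in a few lines of Python. This is only a minimal illustration, not a library function: the name `r_squared` and its arguments `y` (actual values) and `pred` (the model's predictions for the same observations) are made up for this example.

```python
def r_squared(y, pred):
    """Return 1 - SSres / SStot for actual values y and predictions pred."""
    if len(y) != len(pred):
        raise ValueError("y and pred must have the same length")

    avg_y = sum(y) / len(y)

    # SStot: squared distance of every actual value from the average of y.
    ss_tot = sum((avg_y - yi) ** 2 for yi in y)

    # SSres: squared distance of every prediction from the actual value.
    ss_res = sum((pi - yi) ** 2 for pi, yi in zip(pred, y))

    if ss_tot == 0:
        # Every y is identical, so there is no variance to explain.
        raise ValueError("y is constant, so R-squared is undefined")

    return 1 - ss_res / ss_tot
```

Calling `r_squared(y, pred)` with predictions that match the actual values exactly gives SSres = 0 and therefore an R-squared of 1.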
R-Squared Example
Here’s an example with an imaginary linear regression model f(x).
[table]obs, y, avg(y) – y, (avg(y) – y)^2, f(x), f(x) – y, (f(x) – y)^2
1,4,2,4,5,1,1
2,10,-4,16,9,-1,1
3,2,4,16,4,2,4
4,8,-2,4,8,0,0
5,6,0,0,5,-1,1
SUM,30,,40,,,7
AVG,6,,,,,[/table]
The table above shows the individual calculations, with SStot = 40 and SSres = 7. The next step is to take the quotient SSres / SStot = 7 / 40 = 0.175.
Now we subtract that quotient from one: 1 – 0.175 = 0.825. We now know that the model explains 82.5% of the natural variance in the y variable.
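If you want to check the table by machine, the short Python sketch below rebuilds each squared column from the y and f(x) values and then applies the formula. The variable names (`f_x`, `tot_sq`, `res_sq`) are just labels chosen for this illustration.

```python
# Values copied from the example table.
y = [4, 10, 2, 8, 6]
f_x = [5, 9, 4, 8, 5]   # predictions from the imaginary model f(x)

avg_y = sum(y) / len(y)

# Column (avg(y) - y)^2, one entry per observation.
tot_sq = [(avg_y - yi) ** 2 for yi in y]

# Column (f(x) - y)^2, one entry per observation.
res_sq = [(fi - yi) ** 2 for fi, yi in zip(f_x, y)]

ss_tot = sum(tot_sq)
ss_res = sum(res_sq)
r2 = 1 - ss_res / ss_tot

print(f"SStot = {ss_tot:.0f}, SSres = {ss_res:.0f}")   # SStot = 40, SSres = 7
print(f"R-squared = {r2:.3f}")                         # R-squared = 0.825
```

The printed sums match the SUM row of the table, and the result is 1 minus 7 / 40.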
R-squared is an excellent, simple tool for evaluating a regression model with one predictor variable. However, once you add a second predictor (x) variable, you’ll need to move on to Adjusted R-squared, which takes into account the number of predictor variables in the model.
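As a rough preview, the usual textbook form of Adjusted R-squared scales the unexplained share (1 – R-squared) by (n – 1) / (n – p – 1), where n is the number of observations and p is the number of predictor variables. The sketch below assumes that standard form; the function name `adjusted_r_squared` and its parameters are chosen for illustration only.

```python
def adjusted_r_squared(r2, n, p):
    """Adjust R-squared r2 for n observations and p predictor variables."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")

    # The unexplained share (1 - r2) is inflated as predictors are added,
    # so an extra predictor only helps if it lowers SSres enough.
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)
```

For a fixed R-squared and number of observations, the adjusted value shrinks as p grows, which is the penalty for adding predictors that do not pull their weight.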