Simple Linear Regression

Suppose a scatter plot of two variables shows a high positive correlation between the variables. We would like to draw a straight line on the scatter plot that comes very close to as many of the points as possible. This straight line is called the line of best fit.


If we could somehow write the equation of this line (y = mx + b) we could then predict y values given x values. This straight line we are talking about is called a regression line. A regression line is a straight line (actually its equation) that describes how a response variable y depends on an explanatory variable x.

Consider the regression line shown below.

The difference between an actual point and its fitted (predicted) point is called a residual. For example, when x = 6 the fitted point is 2.1 (1.33 + 0.14*6) and the actual point is 1.5, resulting in a residual of -0.6 (i.e. 1.5 - 2.1). You will notice that some of the residuals are positive (actual points are above the regression line) and some of the residuals are negative (actual points are below the regression line). If you were to move along this line from left to right, summing the residuals, you would discover that the sum of all the residuals is zero. Because the positive and negative residuals cancel, their plain sum cannot tell us how well the line fits. To overcome this problem we could square all the residuals and then find their sum. The regression line of best fit is the line that minimizes the sum of all the squared residuals.

The Least Squares Criterion:
The regression line, called the line of best fit, is the line for which the sum of the squares of all the residuals is a minimum.
In statistics a fitted or predicted point is called y-hat, $\hat{y}$. A residual is defined to be:
$(y - \hat{y})$
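Written out in symbols, the least squares criterion can be sketched as follows. The index i, the count n (the number of data points), and the subscripted symbols are notation introduced here for illustration; m and b are the slope and intercept from y = mx + b.

```latex
\[
\text{residual}_i = y_i - \hat{y}_i, \qquad
\hat{y}_i = m x_i + b, \qquad
\text{choose } m \text{ and } b \text{ to minimize } \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 .
\]
```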

Let's construct a regression model in which we use the mean of the variable as the predicted or fitted value. In the table below, the variable is y and we will use the mean of y as the predicted value of y for each y.

Note: the regression equation or model is: $\hat{y} = \bar{y}$, where $\bar{y}$ is the mean of y.
The residual sum of squares is 92.8. This can be thought of as a measure of how well our model fits the actual data. The smaller the residual sum of squares, the better the model fits the data; the larger the residual sum of squares, the more poorly the model fits the data. Let's see if we can improve our prediction of y by utilizing a second variable. This second variable x is called an explanatory variable. Hopefully it will account for, or explain, some of the variation in the variable y.

Using the regression equation shown above (don't worry about where this equation came from), we see that this model does a much better job of predicting y than the model in which we use the mean of y as the predicted value. Notice the residual sum of squares has been reduced from 92.8 down to 14.4, a reduction of (92.8 - 14.4)/92.8 = 0.845 (84.5%) in the sum of squared residuals (errors).

The Coefficient of Determination and Explained Variation

The residual sum of squares (92.8) obtained when the mean of y is used for prediction is called the total variation in y. The residual sum of squares when the regression equation is used (14.4) is called the unexplained variation of the model. The difference between the two sums of squares (92.8 - 14.4 = 78.4) is called the explained variation in y. The ratio of the explained variation to the total variation is called the coefficient of determination and is denoted by R squared.
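The arithmetic behind R squared can be checked with a few lines of Python. This is only a sketch using the two sums of squares quoted above; the variable names are chosen here for illustration.

```python
# Sums of squares quoted in the text above.
total_variation = 92.8        # residual sum of squares when the mean of y is the prediction
unexplained_variation = 14.4  # residual sum of squares when the regression line is used

# Explained variation is what the regression line removes from the total.
explained_variation = total_variation - unexplained_variation

# Coefficient of determination: explained variation as a share of total variation.
r_squared = explained_variation / total_variation

print(f"explained variation = {explained_variation:.1f}")
print(f"R squared = {r_squared:.3f}")
```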

We can say that the regression model which uses x as the explanatory variable explains 84.5% of the total variation in y, or equivalently that this model does not account for 15.5% of the variation in y. Some other variable or variables must account for the unexplained variation in y. We could look for other variables besides x that may have an influence on y and include them in our regression equation. The resulting regression model would be an example of multiple linear regression rather than simple linear regression.

The Regression Equation


The regression equation for an explanatory variable x and a response variable y is: $\hat{y} = mx + b$, where m is the slope (the regression coefficient) and b is the y-intercept.
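For a line written in the slope-intercept form y = mx + b used earlier, the values of m and b that satisfy the least squares criterion are given by the standard textbook formulas below. The formula originally shown at this point is not reproduced in these notes, so this is a reconstruction from the usual definition; the bars denote sample means and n is the number of data points.

```latex
\[
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - m\,\bar{x} .
\]
```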

Today, few people calculate a regression equation by hand. Technology comes to our rescue: we only have to enter the x values and the y values and then push the "go" button. For the previous problem, your CS software presents the following output.

The standard error of the estimate is similar to the standard deviation: it measures the typical size of the residuals. It is used in making an interval estimate of the y values for a given x value. From the regression equation, y = 13.2 when x = 3. This is known as a point estimate. Use your software to make a 95% confidence interval estimate for y when x = 3.
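As a rough picture of what happens behind the "go" button, the sketch below fits a least-squares line and computes the standard error of the estimate in plain Python. The data for the example above are not reproduced in these notes, so the sketch borrows the GPA and statistics grade columns from the table in the Multiple Regression section. The function names `fit_line` and `standard_error` are hypothetical, and the formulas are the standard textbook ones, not the internals of any particular software package.

```python
import math

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def standard_error(xs, ys, slope, intercept):
    """Standard error of the estimate: sqrt(residual sum of squares / (n - 2))."""
    rss = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(rss / (len(xs) - 2))

# Borrowed data (assumption): GPA and statistics grade from the table later in these notes.
gpa = [3.2, 2.5, 3.7, 2.0, 2.4]
grade = [85, 70, 98, 65, 72]

slope, intercept = fit_line(gpa, grade)
s_e = standard_error(gpa, grade, slope, intercept)
print(f"fitted line: grade = {slope:.2f} * GPA + {intercept:.2f}")
print(f"standard error of the estimate = {s_e:.2f}")
```

The interval estimate itself also needs a critical value from the t distribution, which the software supplies; that step is left out of this sketch.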

The Regression Coefficient

In the above example we see the regression equation Y = 2.8X + 4.8. The regression coefficient is 2.8. What does the number 2.8 mean? If X increases by one unit, then the predicted Y increases by 2.8 units. The regression coefficient tells us how much the response variable is predicted to change with a one-unit change in the explanatory variable.

For example: Suppose the regression equation Y = 1200X + 6000 models the selling price of a house (Y) in terms of heated living space (X) in square feet. If the heated space of a house increases by 1 square foot, then the predicted selling price increases by $1,200. If the heated space is increased by 10 square feet, then the predicted selling price increases by $12,000.
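Both regression equations above can be tried out with a small helper. This is a minimal sketch: the function name `predict` and the starting size of 1500 square feet are illustrative assumptions, not values from the examples.

```python
def predict(slope, intercept, x):
    """Point estimate from a simple regression equation Y = slope * X + intercept."""
    return slope * x + intercept

# Y = 2.8X + 4.8 from the earlier example: point estimate at x = 3.
y_at_3 = predict(2.8, 4.8, 3)

# Y = 1200X + 6000 (house price): effect of adding 10 square feet.
# The starting size of 1500 square feet is an arbitrary illustrative value;
# the change in price is the same for any starting size.
price_change = predict(1200, 6000, 1510) - predict(1200, 6000, 1500)

print(round(y_at_3, 1))
print(price_change)
```

Because the model is a straight line, the price change depends only on the slope and the change in X, which is exactly what the regression coefficient expresses.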

Multiple Regression

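In simple linear regression, the regression equation contains one explanatory variable and one response variable. In multiple regression, there are several explanatory variables and one response variable, and the regression equation is $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$, where $b_0$ is the intercept and $b_1, \ldots, b_k$ are the regression coefficients of the k explanatory variables.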


For example: The statistics instructor at Bainbridge College wishes to predict a student's final score in statistics given the student's current GPA and age. The instructor selects five students from the past semester's course and records the data shown in the table below.

Student    GPA(X1)    AGE(X2)    Statistics Grade(X3)
1          3.2        22         85
2          2.5        19         70
3          3.7        32         98
4          2.0        26         65
5          2.4        21         72

Statistics Grade(X3) is the response variable while AGE(X2) and GPA(X1) are the explanatory variables. The multiple regression results are shown below.

The regression equation is: X3' = 18.889 + 17.714X1 + 0.426X2
The model predicts a statistics score of about 82.7 for a 25-year-old student whose GPA is 3.
(18.889 + 17.714*3 + 0.426*25 = 82.681)

We see that 99.4% of the variation in statistics scores is accounted for by GPA and age. The regression coefficient for GPA (17.714) means that for each one-unit increase in GPA the predicted statistics score increases by 17.714 points, provided age remains constant. The regression coefficient for age (0.426) means that for each additional year of age the predicted statistics score increases by 0.426 points, provided GPA does not change.
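The software output can be approximated from the five rows of the table with plain Python. The sketch below solves the least-squares equations for two explanatory variables by centering the data and applying Cramer's rule. The function name `fit_two_predictors` is hypothetical, and this is one standard way to do the calculation, not necessarily how the software does it. It should give coefficients close to those reported above.

```python
def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 via the centered normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [v - m1 for v in x1]   # deviations of x1 from its mean
    d2 = [v - m2 for v in x2]   # deviations of x2 from its mean
    dy = [v - my for v in y]    # deviations of y from its mean
    s11 = sum(a * a for a in d1)
    s22 = sum(a * a for a in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * b for a, b in zip(d1, dy))
    s2y = sum(a * b for a, b in zip(d2, dy))
    det = s11 * s22 - s12 ** 2  # determinant of the 2x2 system (Cramer's rule)
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2

# Data from the table above.
gpa = [3.2, 2.5, 3.7, 2.0, 2.4]
age = [22, 19, 32, 26, 21]
grade = [85, 70, 98, 65, 72]

b0, b1, b2 = fit_two_predictors(gpa, age, grade)

# R squared = 1 - unexplained variation / total variation.
fitted = [b0 + b1 * g + b2 * a for g, a in zip(gpa, age)]
mean_grade = sum(grade) / len(grade)
unexplained = sum((y - f) ** 2 for y, f in zip(grade, fitted))
total = sum((y - mean_grade) ** 2 for y in grade)
r_squared = 1 - unexplained / total

# Predicted score for a 25-year-old student with a GPA of 3.
prediction = b0 + b1 * 3.0 + b2 * 25

print(f"b0={b0:.3f}, b1={b1:.3f}, b2={b2:.3f}")
print(f"R squared={r_squared:.3f}, prediction={prediction:.1f}")
```

With only five students and two explanatory variables, a very high R squared is easy to obtain, so the fit should be read as an illustration of the method.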

Here is how to interpret a regression coefficient in multiple regression:

A regression coefficient for a particular explanatory variable represents the change in the response variable for a one unit change in the explanatory variable, provided all other explanatory variables are held constant.

Making Inferences About The Regression Coefficients

Now we are ready to determine whether the regression equation is of any real value in predicting the response variable. We will conduct a hypothesis test on each regression coefficient. The null hypothesis is always H0: B = 0, which implies that the corresponding explanatory variable is of no use in predicting y, and the alternative hypothesis is Ha: B not equal to zero, which implies that the explanatory variable is useful in predicting y. Your StatCrunch software will run this procedure for you. We agree that if the P-value of the t statistic is less than 0.05, the regression coefficient is significant and the explanatory variable is useful for prediction.
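The t statistic reported for each coefficient is typically computed as sketched below. The symbols b (the estimated coefficient), SE(b) (its standard error), n (the number of observations), and k (the number of explanatory variables) are notation introduced here and are not taken from the software output.

```latex
\[
H_0: B = 0, \qquad H_a: B \neq 0, \qquad
t = \frac{b}{SE(b)}, \qquad \text{degrees of freedom} = n - k - 1 .
\]
```

The P-value is the probability, under H0, of a t statistic at least this far from zero in either direction; the 0.05 cutoff above is applied to that value.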