Suppose a scatter plot of two variables shows a high positive correlation
between the variables. We would like to draw a straight line on the scatter
plot that comes very close to as many of the points as possible. This straight
line is called the line of best fit.

If we could somehow write the equation of this line (y = mx + b) we could then
predict y values given x values. This straight line we are talking about is
called a regression line. A regression line is a straight line (actually its
equation) that describes how a response variable y depends on an explanatory
variable x.
Consider the regression line shown below.

The difference between a fitted (predicted) point and an actual point is called a
residual. For example, when x = 6 the fitted point is 2.1 ( 1.33 + .14*6) and the
actual point is 1.5, resulting in a residual of -0.6 ( e.g. 1.5 - 2.1).
You will notice that some of the residuals are positive (actual points are above
the regression line) and some of the residuals are negative (actual points are
below the regression line). If you were to move along this line from left to
right, summing the residuals , you would discover that the sum of all the
residuals is zero. To overcome this problem we could square all the residuals
and then find their sum. The regression line of best fit is the line that
minimizes the sum of all the squared residuals.
The least Squares Criterion:
The regression line, called the line of best fit, is the line for which the sum
of the squares of all the residuals is a minimum.
In statistics a fitted or
predicted point is called y-hat,
.
A
residual is defined to be:
(y-
)
Lets construct a regression model in which we use the mean of the variable as
the predicted or fitted value. In the table below, the variable is y and we will
use the mean of y a as the predicted value of y for each y.

Note: the regression equation or model is :
The residual sum of squares is 92.8. This can be thought of as a measure of how
well our model fits the actual data. The smaller the residual sum of squares the
better the model fits the data and the larger the residual sum of squares the
poorer the model fits the data. Lets see if we can improve our prediction of y
by utilizing a second variable. This second variable x is called an explanatory
variable. Hopefully it will account or explain some of the variation in
the variable y.

Using the regression equation shown above (don't worry about where this equation
came from), we see that this model does a much
better job of predicting y than the model in which we use the mean of y as the
predicted value. Notice the residual sum of squares has been reduced from 92.8
down to 14.4 resulting in an (92.8-14.4)/92.8 = .845 (84.5%) reduction in
the residuals or errors.
The Coefficient of Determination and Explained Variation
The residual sum of squares (92.8) obtained when the mean of y is used for
prediction is called the total variation in y. The residual sum of squares when
the regression equation is used (14.4) is called the unexplained variation of
the model The difference between the two sum of squares squares(92.8-14.4) is called the
explained variation
in y. The ratio of the explained variation to the total variation is
called the coefficient of variation and is denoted by R square.

The Regression Equation


The standard error of the estimate is similar to the standard deviation. It is used in making an interval estimation of the y values for a given x value. From the regression equation, y = 13.2 when x =3. This is known as a point estimate. Use your software to make a 95% confidence interval estimation for y when x =3.
The Regression coefficient
In the above example we see the regression equation Y = 2.8X +4.8. The regression coefficient is 2.8. We should ask, what is the meaning of the number 2.8? It is easy to see that if X increases by one unit then Y will increase by 2.8 units. The regression coefficient tells us how much the response variable will change with a one unit change in the explanatory variable.
For example: Suppose the regression equation Y =1200X + 6000 models the selling price of a house (Y) in terms of heated living space (X) in square feet. If the heated space of a house increases by 1 square foot, then the selling price increases by $1200. If the number of square feet of heated space is increased by 10 square feet, then the selling price increases by $12,000.
Multiple RegressionIn simple linear regression, the regression equation contains one explanatory variable, and one response variable. In multiple regression, there are several explanatory variables and one response variable and the regression equation is

For example: The statistics instructor at Bainbridge College wishes to predict a student's final score in statistics given their current GPA and age. The instructor selects five students from the past semesters course and records the data shown in the table below.
Student GPA(X1) AGE(X2) Statistics Grade(X3)
<
The regression equation is: X3' = 18.889 + 17.714X1 + 0.426X3Here is how to interpret a regression coefficient in multiple regression:
A regression coefficient for a particular explanatory variable represents the change in the response variable for a one unit change in the explanatory variable, provided all other explanatory variables are held constant.
Making Inferences About The Regression Coefficients
Now we are ready to determine whether the regression line is of any real value in predicting the response variable. We will conduct a hypothesis test on each regression coefficient. The null hypothesis is always: H0: B=0 which implies that B is of no use in predicting y, and the alternative hypothesis is Ha: B not equal to zero which implies that B is useful in predicting y. Your StatCrunch software will run this procedure for you. We agree that if the P value of the t statistic is less than 0.05 we have a significant regression coefficient which is useful for prediction.