# Regression

Regression analysis involves modelling and exploring the relationships between variables in order to solve problems.


## Introduction

The **central assumption** in this analysis is that if the data on a scatter plot appear to follow a linear relationship, a linear relationship can be assumed and a corresponding linear formula can be derived:

Y = β_{0} + β_{1}X + ϵ

Where:
β_{0} is the intercept;
β_{1} is the slope;
ϵ is the error term.

The beta values are known as the regression coefficients.

The ϵ term is required to account for the fact that the observed values do not all fall exactly on a straight line.

The variance σ^{2} of the ϵ term can be estimated by:

σ̂^{2} = ss_{e} / (n − 2)

## Variability Decomposition

In the above equation, the numerator ss_{e} is known as the error sum of squares.

Another important value is the regression sum of squares, ss_{r}.

The sum of these two terms is the total sum of squares, ss_{t}:

ss_{t} = ss_{r} + ss_{e}

The ss_{r} and ss_{t} values are used to calculate an important quantity known as the coefficient of determination, r^{2}, which is defined as:

r^{2} = ss_{r} / ss_{t}

This ratio represents the proportion of the variability in the responses that is explained by the predictor and taken into account in the model (essentially the accuracy of the estimations performed in the analysis).

The closer r^{2} is to 1, the better the estimators fit the data; at a value of 1, all of the variation in the responses is explained by the estimator used.

The difference 1 − r^{2} represents the proportion of the variation due to natural variability in the values gathered.

What is known as the "sample correlation coefficient", r, is the square root of the coefficient of determination r^{2} (taking the sign of the slope), and gives a sample estimate of the population correlation coefficient ρ.
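The variability decomposition above can be sketched numerically. The data points below are made up for illustration, and the line is fitted with the standard least-squares formulas:

```python
# Sketch of the variability decomposition for a fitted line,
# using a small made-up data set.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope and intercept (see the next section).
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * x for x in xs]

ss_t = sum((y - y_bar) ** 2 for y in ys)              # total sum of squares
ss_e = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # error sum of squares
ss_r = ss_t - ss_e                                    # regression sum of squares

sigma2_hat = ss_e / (n - 2)   # estimate of the error variance
r2 = ss_r / ss_t              # coefficient of determination
print(round(r2, 4))
```

For this near-linear data set r^{2} comes out close to 1, i.e. almost all of the variation in the responses is explained by the fitted line.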

## Least Squares Estimators

To find the beta values we use what are known as least squares estimators, which estimate the coefficients from the X_{i} values (all X values in the scatter plot), their average X̄, and the corresponding Y values with average Ȳ.

The beta values are based on fractions of these calculations:

b_{1} = Σ(X_{i} − X̄)(Y_{i} − Ȳ) / Σ(X_{i} − X̄)^{2}

b_{0} = Ȳ − b_{1}X̄
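The least squares estimators can be sketched directly. The data below are made up so that they fall exactly on the line Y = 6.00 − 0.02X used as the example in the next section:

```python
# Minimal sketch of the least-squares estimators for slope and intercept.
xs = [10.0, 20.0, 30.0, 40.0]
ys = [5.8, 5.6, 5.4, 5.2]   # made-up points lying on y = 6.00 - 0.02 x

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# b1 = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
beta1_hat = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
# b0 = y_bar - b1 * x_bar
beta0_hat = y_bar - beta1_hat * x_bar

print(round(beta0_hat, 2), round(beta1_hat, 2))   # 6.0 -0.02
```

Since the points lie exactly on a line, the estimators recover its intercept and slope.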

## Types of Questions

### True Average Change/Values

The equation for Y has the same form as the equation of a straight line (y = mx + b), and as such allows us to input values to get an output.

Given an equation Y = 6.00 – 0.02X + ϵ

There will be a true average decrease in Y of 0.02 for a given increase in X of 1.

There will be a true average increase in Y of 0.2 for a given decrease in X of 10.

Similarly, we can obtain outputs of Y for inputs of X, giving us reaction times or amounts of element Y, depending on what the variables represent.

For the same equation Y = 6.00 – 0.02X + ϵ

There will be a true average value of Y = 4 for X value 100. (Y = 6 – (0.02 × 100))
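These readings off the example equation can be checked with a short sketch (the ϵ term averages to zero, so it drops out of true average values):

```python
# True average responses from the example equation Y = 6.00 - 0.02 X.
def mean_response(x, beta0=6.00, beta1=-0.02):
    return beta0 + beta1 * x

print(mean_response(100))   # true average value at X = 100 -> 4.0

# True average change for an increase of 1 in X is the slope, -0.02.
change = mean_response(51) - mean_response(50)
```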

### Probability

The response Y in the regression formula can be modelled as a normal distribution:

Y ~ N(β_{0} + β_{1}x, σ^{2})

σ will be given.

Therefore, we can find Z values and, using the Z tables, estimate the probability of Y lying between two given values.
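This probability calculation can be sketched with the standard library's normal distribution, which evaluates the same Φ(z) values as the Z tables. The σ below is made up for illustration:

```python
from statistics import NormalDist

# Assumed example values: the equation Y = 6.00 - 0.02 X with a
# made-up sigma = 0.1, evaluated at x = 100.
beta0, beta1, sigma = 6.00, -0.02, 0.1
x = 100.0
mu = beta0 + beta1 * x   # true average response at this x, = 4.0

# P(3.9 < Y < 4.1) as a difference of two CDF values; equivalent to
# looking up Z = (y - mu) / sigma in the Z tables.
dist = NormalDist(mu, sigma)
p = dist.cdf(4.1) - dist.cdf(3.9)
print(round(p, 4))   # 0.6827
```

Both bounds here are one σ from the mean, so the result is the familiar ≈ 68% of a normal distribution within one standard deviation.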

### Hypothesis Testing

With the given interval:

b_{1} ± t_{α/2, n−2} · s / √(Σ(X_{i} − X̄)^{2})

and the estimated standard deviation of the data:

s = √(ss_{e} / (n − 2))

we can calculate both one- and two-sided confidence intervals for a given b_{1} value using this information.

This provides estimates of b_{1} based on the t_{n−2} distribution and gives us an indication of the true value within a range (based on a given α).
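A numeric sketch of a two-sided confidence interval for the slope; the summary numbers are made up, and the critical value is read from the t-table (t_{0.025, 12} ≈ 2.179 for n = 14):

```python
# Two-sided confidence interval for the slope b1, from made-up summaries.
beta1_hat = -0.02   # estimated slope (illustrative)
s = 0.05            # estimated standard deviation, sqrt(ss_e / (n - 2))
s_xx = 500.0        # sum((x_i - x_bar)^2) from the data
n = 14
t_crit = 2.179      # t_{alpha/2, n-2} from the t-table, alpha = 0.05

# Standard error of b1 and the resulting interval.
se_beta1 = s / s_xx ** 0.5
lower = beta1_hat - t_crit * se_beta1
upper = beta1_hat + t_crit * se_beta1
print(round(lower, 4), round(upper, 4))
```

A one-sided interval uses t_{α, n−2} instead and keeps only one of the two bounds.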

#### Averages

For finding a confidence interval for the true average of the required output at a given x we use:

ŷ ± t_{α/2, n−2} · s · √(1/n + (x − X̄)^{2} / Σ(X_{i} − X̄)^{2})

where ŷ = b_{0} + b_{1}x, and if we are assessing an interval for a future value rather than the average, we insert a value of 1 under the square root, which widens the interval by increasing the variability accounted for.
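The effect of that extra "1" can be seen in a short sketch contrasting the two half-widths; all of the summary numbers below are made up, with the critical value again from the t-table:

```python
# Half-widths of the mean-response interval vs the future-value interval,
# from made-up summary statistics.
n = 14
x_bar, s_xx = 25.0, 500.0
sigma_hat = 0.05    # sqrt(ss_e / (n - 2))
t_crit = 2.179      # t_{0.025, 12} from the t-table
x0 = 30.0           # the x value we are estimating at

common = 1 / n + (x0 - x_bar) ** 2 / s_xx
half_mean = t_crit * sigma_hat * common ** 0.5        # true average interval
half_pred = t_crit * sigma_hat * (1 + common) ** 0.5  # extra "1" for a future value
print(half_mean < half_pred)   # True
```

The future-value interval is always wider, because a single future observation carries its own ϵ variability on top of the uncertainty in the fitted line.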

## End

This is the end of this topic. Return to the main subject page for Numerical Methods & Statistics.