# Regression


Regression analysis models the relationship between a response variable and one or more predictor variables, so that the relationship can be quantified and used for prediction.

## Introduction

The central assumption in this analysis is that, when the data on a scatter plot appear to follow a straight-line trend, a linear relationship can be assumed and a corresponding linear formula derived:

Y = β0 + β1X + ϵ

Where: β0 is the intercept; β1 is the slope; ϵ is the error term.

The beta values are known as the regression coefficients.

The error term ϵ accounts for the fact that the observed values do not all fall exactly on a straight line.

The variance of ϵ, σ2, is estimated by s2, defined by:

s2 = SSE / (n − 2)

## Variability Decomposition

In the above equation, the numerator SSE is known as the error sum of squares:

SSE = Σ(yi − ŷi)2

Another important value is the regression sum of squares:

SSR = Σ(ŷi − ȳ)2

The total of these terms is the total sum of squares, SST:

SST = SSR + SSE

The SSR and SSE values are used to calculate an important quantity known as the coefficient of determination, r2, which is defined as:

r2 = SSR / SST

This ratio represents the proportion of the variability in the response that is explained by the predictor and taken into account in the model.

The closer r2 is to 1, the better the fit: at r2 = 1, all of the variation in the response is explained by the model.

The remaining proportion, 1 − r2, represents variation due to the natural variability in the values gathered.

The sample correlation coefficient, r, is the square root of the coefficient of determination r2 (taking the sign of the slope), and gives a sample estimate of the population correlation coefficient ρ.
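A minimal Python sketch of the sample correlation coefficient, using the standard Sxy / √(Sxx·Syy) formula (the x and y data here are illustrative values, not from the text):

```python
import math

# Illustrative data (hypothetical values chosen for the example).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products about the means.
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Sample correlation coefficient: a sample estimate of rho.
r = sxy / math.sqrt(sxx * syy)
```

For data this close to a straight line, r is near 1, and r squared equals the coefficient of determination for the least squares fit.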

## Least Squares Estimators

To find the beta values we use what are known as least squares estimators, which are computed from the xi values, their mean x̄, the yi values, and their mean ȳ. The beta estimates are based on fractions of these calculations:

β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)2

β̂0 = ȳ − β̂1x̄
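A short Python sketch of the least squares estimators, applied to illustrative data that lie exactly on the line y = 1 + 2x (hypothetical values, chosen so the answer is known):

```python
# Illustrative data lying exactly on y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Slope estimate: Sxy / Sxx.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)

# Intercept estimate: y_bar - b1 * x_bar.
b0 = y_bar - b1 * x_bar
```

Because the data are exactly linear here, the estimators recover the slope 2 and intercept 1 exactly; with noisy data they give the closest straight line in the least squares sense.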

## Types of Questions

### True Average Change/Values

The equation for Y has the same form as the general equation of a straight line (y = mx + b), and as such allows us to input values to get an output.

Given an equation Y = 6.00 – 0.02X + ϵ

There will be a true average decrease in Y of 0.02 for a given increase in X of 1.

There will be a true average increase in Y of 0.2 for a given decrease in X of 10.

Similarly, we can obtain outputs of Y for inputs of X, giving us reaction times or amounts of element Y, depending on what the variables represent.

For the same equation Y = 6.00 – 0.02X + ϵ

There will be a true average value of Y = 4 for X value 100. (Y = 6 – (0.02 × 100))
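The substitution above can be checked with a one-line sketch (the coefficients are the ones from the example equation in this section):

```python
def predicted_y(x):
    """Mean response for the example line Y = 6.00 - 0.02X."""
    return 6.00 - 0.02 * x

# True average value of Y at X = 100.
y_at_100 = predicted_y(100)
```

This evaluates 6 − (0.02 × 100) = 4, matching the worked example.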

### Probability

For a fixed value of X, the response Y can be modelled as a normal distribution:

Y ~ N(β0 + β1x, σ2)

σ will be given.

Therefore, we can find Z values and, using the Z tables, estimate the probability of Y lying between two given values.
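A small sketch of this Z-value approach in Python, using the example line from this section; the value σ = 0.5 is an assumed figure for illustration, since the text only says σ will be given:

```python
import math

def normal_cdf(z):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Example line Y = 6.00 - 0.02X; sigma = 0.5 is an assumed value.
b0, b1, sigma = 6.00, -0.02, 0.5
x = 100.0
mean = b0 + b1 * x  # true average of Y at this x (here 4.0)

# P(3.5 < Y < 4.5): convert the bounds to Z values, then use the CDF
# in place of a Z table.
z_lo = (3.5 - mean) / sigma
z_hi = (4.5 - mean) / sigma
p = normal_cdf(z_hi) - normal_cdf(z_lo)
```

Here the bounds are one standard deviation either side of the mean, so the probability is about 0.683, the familiar Z-table value for P(−1 < Z < 1).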

### Hypothesis Testing

A confidence interval for the slope β1 is given by:

β̂1 ± t(α/2, n−2) · s / √Sxx

where s = √(SSE / (n − 2)) is the estimated standard deviation of the data. Using this information we can calculate both one and two sided confidence intervals for a given β1 value. This provides estimates of β1 from the t distribution with n − 2 degrees of freedom and gives us an indication of the value within a range (based on a given α).
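A sketch of a two-sided 95% confidence interval for the slope, on illustrative data (hypothetical values, not from the text); the t critical value is hardcoded from a t table rather than computed:

```python
import math

# Illustrative data (hypothetical values chosen for the example).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Estimate the standard deviation from the residuals: s^2 = SSE / (n - 2).
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

# 95% two-sided interval, t distribution with n - 2 = 3 degrees of
# freedom; t(0.025, 3) = 3.182 taken from a t table.
t_crit = 3.182
half_width = t_crit * s / math.sqrt(sxx)
ci = (b1 - half_width, b1 + half_width)
```

Since the interval here excludes zero, the data would let us reject the hypothesis that the slope is zero at the 5% level.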

#### Averages

For finding a confidence interval for the true average response at a given value x*, we use:

ŷ ± t(α/2, n−2) · s · √(1/n + (x* − x̄)2 / Sxx)

and if we are assessing an interval for a single future value (a prediction interval), we insert a value of 1 under the square root:

ŷ ± t(α/2, n−2) · s · √(1 + 1/n + (x* − x̄)2 / Sxx)

which widens the interval by increasing the variability.
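A sketch comparing the two half-widths on illustrative data (hypothetical values; the t value is again hardcoded from a t table, and x* = 3.5 is an arbitrary point of interest):

```python
import math

# Illustrative data (hypothetical values chosen for the example).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

t_crit = 3.182   # t(0.025, 3) from a t table
x_star = 3.5     # point at which we estimate the response

# Half-width of the CI for the true average response at x_star.
ci_half = t_crit * s * math.sqrt(1.0 / n + (x_star - x_bar) ** 2 / sxx)

# Prediction interval for a single future value: the extra 1 under
# the square root accounts for the individual observation's own
# variability, so this interval is always wider.
pi_half = t_crit * s * math.sqrt(1.0 + 1.0 / n + (x_star - x_bar) ** 2 / sxx)
```

Comparing the two half-widths makes the effect of the inserted 1 concrete: the prediction interval is strictly wider than the interval for the mean response at the same x*.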
