Descriptive Measures

From Uni Study Guides

Descriptive measures summarise data numerically rather than just graphically.

Measures of Centre

The mean (otherwise known as the average) is the sum of the numerical data divided by the number of data points.

The median is the middle value of a data set sorted from smallest to largest. If the data set has an odd number of values, the median is the value directly in the middle. If the data set has an even number of values, the median is the average of the two middle values. The median is often preferred to the mean when the data contain outliers, because it is less affected by extreme values.
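
As a minimal sketch of the two definitions above, the following Python computes the median by hand and compares it with the standard library's `statistics` module. The data values and the helper name `manual_median` are illustrative assumptions, not part of the original guide.

```python
from statistics import mean, median

def manual_median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical data set; 100 acts as an outlier.
data = [2, 3, 5, 7, 100]

print(mean(data))           # 23.4, pulled upwards by the outlier
print(manual_median(data))  # 5, unaffected by the outlier
```

Replacing 100 with 8 would change the mean noticeably but leave the median at 5, which is the sense in which the median resists outliers.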

Quartiles and Percentiles

Quartiles divide a sorted data set into four parts, each containing 25% of the observations. The first (lower) quartile is the value with 25% of the observations below it and 75% above. The second quartile is the median, with 50% of the observations on either side. The third (upper) quartile is the value with 75% of the observations below it and 25% above. The lower quartile is therefore the median of the lower half of the data set, and the upper quartile is the median of the upper half. (If the number of observations is odd, include the median in both halves.)
Percentiles work the same way: the 25th percentile is the value with 25% of the observations below it, which is the lower quartile. The 98th percentile has 98% of the observations below it and 2% above.
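
The median-of-halves rule described above can be sketched in Python as follows. The function name `quartiles` and the sample data are assumptions made for illustration. Other quartile conventions exist and can give slightly different values on small data sets.

```python
from statistics import median

def quartiles(values):
    """Return (q1, q2, q3) using the median-of-halves rule.

    When the number of observations is odd, the median is included
    in both the lower and the upper half.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        lower = ordered[:half + 1]  # includes the median
        upper = ordered[half:]      # includes the median
    else:
        lower = ordered[:half]
        upper = ordered[half:]
    return median(lower), median(ordered), median(upper)

print(quartiles([1, 2, 3, 4, 5, 6, 7, 8, 9]))  # (3, 5, 7)
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))     # (2.5, 4.5, 6.5)
```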

Five Number Summary

The five number summary provides a quick look at the data using five values: the smallest value, the lower quartile, the median, the upper quartile and the largest value.
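
A rough sketch of a five number summary in Python is shown below. It assumes the standard library's `statistics.quantiles` with `method="inclusive"`, which interpolates between data points and so may differ from the median-of-halves rule for some data set sizes. The function name `five_number_summary` and the data are hypothetical.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    # n=4 asks for the three cut points that split the data into quarters.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical data set
print(five_number_summary(data))
```

For this data set the summary should come out as 1, 3, 5, 7 and 9.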

Variability

Variability is the dispersion of the data points in a data set, often measured by how far they deviate from the mean. Two data sets may have the same mean and/or median while the spread of their data points differs greatly. The deviation from the mean is the difference between a value and the mean of the data set, e.g. $(x_1 - \bar{x})$.

Sample Variance

The sample variance, denoted $s^2$, measures the spread of a data set. It is the sum of the squared deviations from the mean divided by $n - 1$, which is close to the average squared deviation.

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Rearranged for easier computation:

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} x_i^2 - \frac{n}{n - 1} \bar{x}^2$$
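
To check that the two forms agree, here is a minimal Python sketch that evaluates both and compares them with `statistics.variance` from the standard library. The function names and the data are assumptions chosen for illustration.

```python
import math
from statistics import variance

def sample_variance(values):
    """Definition: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def sample_variance_shortcut(values):
    """Rearranged form: sum of squares over (n - 1), minus n/(n - 1) times the squared mean."""
    n = len(values)
    x_bar = sum(values) / n
    return sum(x * x for x in values) / (n - 1) - n / (n - 1) * x_bar ** 2

# Hypothetical data: the mean is 5 and the squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(sample_variance(data))           # 32/7, approximately 4.571
print(sample_variance_shortcut(data))  # same value up to rounding
print(variance(data))                  # standard library result
```

The rearranged form is convenient by hand, but in floating-point arithmetic it subtracts two large, nearly equal numbers when the values are large relative to their spread, so the definitional form is usually the safer choice in code.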

Sample Standard Deviation

The sample standard deviation, $s$, is the square root of the sample variance $s^2$. Taking the square root gives a measure of spread in the same units as the values in the data set.
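
As a short illustration, the sketch below takes the square root of the sample variance and compares it with `statistics.stdev`. The data and the centimetre units are hypothetical.

```python
import math
from statistics import stdev, variance

# Hypothetical measurements, say in centimetres.
data = [2, 4, 4, 4, 5, 5, 7, 9]

s_squared = variance(data)  # sample variance, in cm^2
s = math.sqrt(s_squared)    # sample standard deviation, in cm

print(s_squared)  # 32/7, approximately 4.571
print(s)          # approximately 2.138
```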

Interquartile Range

The interquartile range measures the spread of the middle half of the data, between the lower and upper quartiles. It is calculated as the difference between the third and first quartiles. Because only the middle half of the data is considered, it is much less affected by outliers than the full range.
iqr = upper quartile (q3) - lower quartile (q1)

Outliers

By a common rule of thumb, a value is an outlier if it lies below (q1 - 1.5 * iqr) or above (q3 + 1.5 * iqr). Such a value is classified as a mild outlier, unless it also lies below (q1 - 3 * iqr) or above (q3 + 3 * iqr), in which case it is classified as an extreme outlier.
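
The rule above can be sketched in Python as follows. The function name `classify_outlier`, its return labels and the example quartiles are assumptions for illustration. "Beyond" is read here as strictly outside the limits.

```python
def classify_outlier(value, q1, q3):
    """Classify a value as 'extreme', 'mild' or 'none' using the iqr limits."""
    iqr = q3 - q1
    # Check the wider (3 * iqr) limits first, since extreme outliers
    # also lie beyond the 1.5 * iqr limits.
    if value < q1 - 3 * iqr or value > q3 + 3 * iqr:
        return "extreme"
    if value < q1 - 1.5 * iqr or value > q3 + 1.5 * iqr:
        return "mild"
    return "none"

# Hypothetical quartiles: q1 = 3, q3 = 7, so iqr = 4.
# Mild limits are -3 and 13; extreme limits are -9 and 19.
print(classify_outlier(10, 3, 7))  # none
print(classify_outlier(14, 3, 7))  # mild
print(classify_outlier(20, 3, 7))  # extreme
```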

Boxplots

A boxplot, sometimes known as a box-and-whisker plot, shows the five number summary and outliers clearly. The central box spans from the lower quartile to the upper quartile, and the line running horizontally through the box is the median. The vertical lines (whiskers) above and below the central box extend to the most extreme data points that are not outliers, and outliers are plotted separately.


[Figure: Box Plot.PNG]
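
The positions of the whiskers and the separately plotted outliers can be worked out without any plotting library. The sketch below is one way to do it, assuming the quartiles are already known and using the 1.5 * iqr limits. The function name `boxplot_whiskers` and the data are hypothetical.

```python
def boxplot_whiskers(values, q1, q3):
    """Return (lower whisker end, upper whisker end, sorted list of outliers)."""
    iqr = q3 - q1
    low_limit = q1 - 1.5 * iqr
    high_limit = q3 + 1.5 * iqr
    # Whiskers reach the most extreme points that are not outliers.
    inside = [x for x in values if low_limit <= x <= high_limit]
    outliers = sorted(x for x in values if x < low_limit or x > high_limit)
    return min(inside), max(inside), outliers

# Hypothetical data with quartiles q1 = 3 and q3 = 8 (median-of-halves rule).
# iqr = 5, so the limits are -4.5 and 15.5.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
print(boxplot_whiskers(data, 3, 8))  # (1, 9, [20])
```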


