Statistics Basics – Descriptive vs Inferential Statistics

Descriptive Statistics
Statistics that quantitatively describe an observed data set. The analysis is performed on, and conclusions are drawn from, the observed data only; it does not take any larger population of data into account.

Inferential Statistics
Statistics that make inferences about a larger population of data based on the observed data set. The analysis treats the observed data as a sample taken from a larger population, and infers or predicts characteristics of that population.

Statistics Basics – Measures of Central Tendency & Measures of Variability

Measures of Central Tendency and Measures of Variability are frequently used in data analysis.  This post provides simple definitions of the common measures.

 

Measures of Central Tendency

Mean / Average – sum of all data points or observations in a dataset divided by the total number of data points or observations in the dataset.

The mean or average of this dataset with 5 numbers {2, 4, 6, 8, 10} is: 6

Mean = (sum of all data points) / (number of data points) = (2 + 4 + 6 + 8 + 10) / 5 = 30 / 5 = 6
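For readers who like to check the arithmetic in code, here is a minimal Python sketch of the same calculation. The variable names are only illustrative, and the `statistics` module from the standard library is used as a cross-check.

```python
from statistics import mean

data = [2, 4, 6, 8, 10]

# Manual calculation: sum of the values divided by how many there are.
manual_mean = sum(data) / len(data)

# Cross-check with the standard library.
library_mean = mean(data)

print(manual_mean)                   # 6.0
print(manual_mean == library_mean)   # True
```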

Median – with the values (data points) in the dataset listed in increasing (ascending) order, the median is the midpoint of the values, such that there are an equal number of data points above and below the median.  If there is an odd number of data points in the dataset, then the median is the single middle value. If there is an even number of data points in the dataset, then the median is the mean/average of the two middle values.

The median of the same dataset {2, 4, 6, 8, 10} is:  6
This dataset has an odd number of data points (5), and the middle data point is the value 6, with 2 numbers below (2, 4) and 2 numbers above (8, 10).

Using an example of a dataset with an even number of data points:
The median of this dataset {2, 4, 6, 8, 10, 12} is: (6 + 8) / 2 = 7
Since there are 2 middle data points (6, 8), we calculate the mean of those 2 numbers to determine the median.
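The odd and even cases can be sketched in a few lines of Python. The helper name `median_manual` is made up for this illustration; the standard library's `statistics.median` follows the same rule and is used as a cross-check.

```python
from statistics import median


def median_manual(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)          # the values must be in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd count: the single middle value
    # even count: the mean of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_data = [2, 4, 6, 8, 10]
even_data = [2, 4, 6, 8, 10, 12]

print(median_manual(odd_data))    # 6
print(median_manual(even_data))   # 7.0
print(median(even_data))          # 7.0
```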

Mode – the data point that appears the most times in the dataset.

Using our original dataset {2, 4, 6, 8, 10}, each of the values appears only once, so none appears more often than the others and this dataset does not have a mode.

Using a new dataset {2, 2, 4, 4, 4, 4, 6, 8, 8, 8, 10}, the Mode in this case is: 4
4 is the value that appears the most times in the dataset.
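Here is a small Python sketch that counts how often each value appears. The helper name `find_modes` is hypothetical, and it follows the convention used in this post: if every value appears only once, the dataset is treated as having no mode, so an empty list is returned. Other tools may handle that case differently.

```python
from collections import Counter


def find_modes(values):
    """Return the most frequent value(s), or [] if nothing repeats."""
    if not values:
        return []
    counts = Counter(values)              # value -> number of appearances
    highest = max(counts.values())
    if highest == 1:
        return []                         # no value repeats: no mode
    return [value for value, count in counts.items() if count == highest]


print(find_modes([2, 4, 6, 8, 10]))                       # []
print(find_modes([2, 2, 4, 4, 4, 4, 6, 8, 8, 8, 10]))     # [4]
```

The function returns a list because a dataset can have more than one value tied for the most appearances.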

Measures of Variability

Min – the minimum (smallest) value of all the values in the dataset.
Min {2, 3, 3, 4, 5, 5, 5, 6, 7, 1, 3, 2, 7, 7, 8, 2, 3, 9} is 1.

Max – the maximum (largest) value of all the values in the dataset.
Max {2, 3, 3, 4, 5, 5, 5, 6, 7, 1, 3, 2, 7, 7, 8, 2, 3, 9} is 9.
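In Python, both values come straight from the built-in `min` and `max` functions. A quick sketch using the dataset above:

```python
data = [2, 3, 3, 4, 5, 5, 5, 6, 7, 1, 3, 2, 7, 7, 8, 2, 3, 9]

smallest = min(data)   # the data does not need to be sorted first
largest = max(data)

print(smallest)   # 1
print(largest)    # 9
```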

Variance – a calculated value that quantifies how dispersed the values in the dataset are around their average/mean value.  It is the average of the squared differences from the mean.

Variance of {2, 3, 4, 5, 6} is calculated as follows …

First find the Mean.  Mean = (2 + 3 + 4 + 5 + 6) / 5 = 4

Then, find the Squared Differences from the Mean … where ^2 means squared …
(2 - 4)^2 = (-2)^2 = 4
(3 - 4)^2 = (-1)^2 = 1
(4 - 4)^2 = (0)^2 = 0
(5 - 4)^2 = (1)^2 = 1
(6 - 4)^2 = (2)^2 = 4
Average of Squared Differences (this is the Variance): (4 + 1 + 0 + 1 + 4) / 5 = 10 / 5 = 2
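The same steps can be sketched in Python as shown below. One assumption worth calling out: dividing by the number of data points, as this post does, gives what is usually called the population variance, which is what `statistics.pvariance` computes. The related `statistics.variance` function divides by one less than the number of data points (the sample variance), so it gives a different answer for the same data.

```python
from statistics import pvariance, variance

data = [2, 3, 4, 5, 6]

# Step 1: find the mean.
mean_value = sum(data) / len(data)                        # 4.0

# Step 2: squared differences from the mean.
squared_diffs = [(x - mean_value) ** 2 for x in data]     # [4.0, 1.0, 0.0, 1.0, 4.0]

# Step 3: average of the squared differences (divide by n).
manual_variance = sum(squared_diffs) / len(data)

print(manual_variance)     # 2.0
print(pvariance(data))     # matches the manual result (divides by n)
print(variance(data))      # 2.5, because it divides by n - 1
```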

Standard Deviation – a calculated value that quantifies how dispersed the values in the dataset are around their mean, expressed in the same units as the data itself.  It is the square root of the Variance (defined above).

For the above dataset, Standard Deviation {2, 3, 4, 5, 6} = Square Root (2) ≈ 1.414
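As a quick check in Python, the sketch below takes the square root of the variance worked out earlier. It assumes the same divide-by-n (population) convention as the variance example in this post, which corresponds to `statistics.pstdev` in the standard library.

```python
import math
from statistics import pstdev

data = [2, 3, 4, 5, 6]

# Variance as defined in this post: average of squared differences from the mean.
mean_value = sum(data) / len(data)
variance_value = sum((x - mean_value) ** 2 for x in data) / len(data)

# Standard deviation is the square root of the variance.
manual_std_dev = math.sqrt(variance_value)

print(round(manual_std_dev, 3))    # 1.414
print(round(pstdev(data), 3))      # 1.414
```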

Kurtosis – a calculated value that describes how heavy the tails of the distribution of the dataset are, usually compared with the tails of a normal distribution*.

Skewness – a calculated value that describes how asymmetric the distribution of the dataset is.  A perfectly symmetric distribution, such as a normal distribution*, has a skewness of 0.

* A normal distribution, also known as the bell curve, is a probability distribution in which most values are toward the center (closer to the average) and fewer and fewer observations occur as you go further from the center.
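This post does not give formulas for Skewness and Kurtosis, so the sketch below is an assumption on my part: it uses one common convention based on the average of the cubed and fourth-power differences from the mean, each scaled by the variance. Under this convention a normal distribution has a skewness of 0 and a kurtosis of 3. The function names are made up for this example, and statistical packages often apply extra corrections or subtract 3 from the kurtosis, so their numbers may differ.

```python
def central_moment(values, power):
    """Average of (value - mean) raised to the given power."""
    mean_value = sum(values) / len(values)
    return sum((x - mean_value) ** power for x in values) / len(values)


def skewness(values):
    """Third moment scaled by variance^1.5 (assumed convention; needs variance > 0)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Fourth moment scaled by variance^2 (assumed convention; needs variance > 0)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


symmetric = [2, 3, 4, 5, 6]
right_tailed = [1, 2, 2, 3, 10]    # one large value stretches the right tail

print(skewness(symmetric))         # 0.0 for this symmetric dataset
print(skewness(right_tailed) > 0)  # True: the longer tail is on the right
print(kurtosis(symmetric))         # about 1.7, lower than the normal distribution's 3
```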

Range – the difference between the largest number in the dataset and the smallest number in the dataset.
Range {2, 4, 6, 8, 10} = 10 - 2 = 8
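In Python, the Range is simply the built-in `max` minus the built-in `min`. A minimal sketch (the variable name `data_range` is chosen to avoid clashing with Python's own `range` function):

```python
data = [2, 4, 6, 8, 10]

# Largest value minus smallest value.
data_range = max(data) - min(data)

print(data_range)   # 8
```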

 

Thanks for reading!