INTRO TO DATA SCIENCE: MEASURES OF DISPERSION

In our discussion of descriptive statistics, we’ve considered the measures of central tendency—mean, median and mode. These help us categorize typical values in a group—such as the mean height of your classmates or the most frequently purchased car brand (the mode) in a given country.
When we’re talking about a group, the entire group is called the population. Sometimes a population is quite large, such as the people likely to vote in the next U.S. presidential election, which is a number in excess of 100,000,000 people. For practical reasons, the polling organizations trying to predict who will become the next president work with carefully selected small subsets of the population known as samples. Many of the polls in the 2016 election had sample sizes of about 1000 people.
In this section, we continue discussing basic descriptive statistics. We introduce measures of dispersion (also called measures of variability) that help you understand how spread out the values are. For example, in a class of students, there may be a bunch of students whose height is close to the average, with smaller numbers of students who are considerably shorter or taller.
For our purposes, we’ll calculate each measure of dispersion both by hand and with functions from the statistics module, using the following population of 10 six-sided die rolls:

1, 3, 4, 2, 6, 5, 3, 4, 5, 2

Variance

To determine the variance, we begin with the mean of these values, which is 3.5. You obtain this result by dividing the sum of the face values, 35, by the number of rolls, 10.
For simplicity, we’re calculating the population variance. There is a subtle difference between the population variance and the sample variance. Instead of dividing by n (the number of die rolls in our example), sample variance divides by n - 1. The difference is pronounced for small samples and becomes insignificant as the sample size increases. The statistics module provides the functions pvariance and variance to calculate the population variance and sample variance, respectively. Similarly, the statistics module provides the functions pstdev and stdev to calculate the population standard deviation and sample standard deviation, respectively.
Next, we subtract the mean from every die value (this produces some negative results):

-2.5, -0.5, 0.5, -1.5, 2.5, 1.5, -0.5, 0.5, 1.5, -1.5

Then, we square each of these results (yielding only positives):

6.25, 0.25, 0.25, 2.25, 6.25, 2.25, 0.25, 0.25, 2.25, 2.25

Finally, we calculate the mean of these squares, which is 2.25 (22.5 / 10)—this is the population variance. Squaring the difference between each die value and the mean of all die values emphasizes outliers—the values that are farthest from the mean. As we get deeper into data analytics, sometimes we’ll want to pay careful attention to outliers, and sometimes we’ll want to ignore them. The following code uses the statistics module’s pvariance function to confirm our manual result:

In [1]: import statistics
In [2]: statistics.pvariance([1, 3, 4, 2, 6, 5, 3, 4, 5, 2])
Out[2]: 2.25
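
The manual steps described above can also be sketched in plain Python. This is only an illustration of the arithmetic, not how the statistics module is implemented; the variable names (rolls, mean, differences, squares, population_variance) are our own choices for this example.

```python
import statistics

rolls = [1, 3, 4, 2, 6, 5, 3, 4, 5, 2]

# Step 1: the mean is the sum of the face values divided by the number of rolls
mean = sum(rolls) / len(rolls)

# Step 2: subtract the mean from every die value (some results are negative)
differences = [roll - mean for roll in rolls]

# Step 3: square each difference (no result is negative)
squares = [difference ** 2 for difference in differences]

# Step 4: the population variance is the mean of the squares
population_variance = sum(squares) / len(squares)

print(mean)                 # 3.5
print(population_variance)  # 2.25
print(population_variance == statistics.pvariance(rolls))  # True
```

Keeping each step in its own list makes it easy to compare the intermediate values with the hand calculation.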

Standard Deviation

The standard deviation is the square root of the variance (in this case, 1.5), which tones down the effect of the outliers. The smaller the variance and standard deviation are, the closer the data values are to the mean and the less overall dispersion (that is, spread) there is between the values and the mean. The following code calculates the population standard deviation with the statistics module’s pstdev function, confirming our manual result:

In [3]: statistics.pstdev([1, 3, 4, 2, 6, 5, 3, 4, 5, 2])
Out[3]: 1.5

Passing the pvariance function’s result to the math module’s sqrt function confirms our result of 1.5:

In [4]: import math
In [5]: math.sqrt(statistics.pvariance([1, 3, 4, 2, 6, 5, 3, 4, 5, 2]))
Out[5]: 1.5
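
To see how the population and sample versions of these functions differ on the same die rolls, here is a minimal sketch. It treats the 10 rolls first as a whole population and then as a sample, purely for comparison; the variable names are our own.

```python
import statistics

rolls = [1, 3, 4, 2, 6, 5, 3, 4, 5, 2]

# Population versions divide the sum of squared differences (22.5) by n = 10
population_variance = statistics.pvariance(rolls)
population_stdev = statistics.pstdev(rolls)

# Sample versions divide by n - 1 = 9, so they come out slightly larger
sample_variance = statistics.variance(rolls)
sample_stdev = statistics.stdev(rolls)

print(population_variance, sample_variance)
print(population_stdev, round(sample_stdev, 2))
```

With only 10 values, the sample variance (22.5 / 9 = 2.5) is noticeably larger than the population variance (2.25). The gap narrows as the number of values grows.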

Advantage of Population Standard Deviation vs. Population Variance

Suppose you’ve recorded the March Fahrenheit temperatures in your area. You might have 31 numbers such as 19, 32, 28 and 35. The units for these numbers are degrees.
When you square the differences between your temperatures and their mean to calculate the population variance, the units of the population variance become “degrees squared.” When you take the square root of the population variance to calculate the population standard deviation, the units once again become degrees, which are the same units as your temperatures.
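
As a rough illustration of the units point, the sketch below uses only the four example readings mentioned above (19, 32, 28 and 35), not a full month of data. The variable names and the printed labels are our own.

```python
import statistics

# Only the four example readings from the text, in degrees Fahrenheit
temperatures = [19, 32, 28, 35]

variance = statistics.pvariance(temperatures)  # units: degrees squared
std_dev = statistics.pstdev(temperatures)      # units: degrees

print(f'population variance: {variance} degrees squared')
print(f'population standard deviation: {std_dev:.2f} degrees')
```

The standard deviation of about 6 degrees can be compared directly with the readings themselves, whereas the variance of 36.25 degrees squared cannot.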
