Module learning objectives

Describe 3 measures of centrality
Explain the mathematical notation used for calculating the mean
Write a function which calculates the mean for any vector
Write a function which calculates the median for any vector
Describe how the reliability of a sample mean will scale with increasing sample size

What are measures of centrality?

You’ve just collected a lot of data and graphed heights. Although informative, a graphical display of these data is difficult to summarize – we need to describe these heights with a single number that will be meaningful and allow us to do statistics.

We can do this with a measure of centrality, the concept that one number in the “center” of the data set is a good summary of all the values. Below are examples of different measures of centrality.

The mean is the average and the measure of centrality that you are probably most familiar with. This is a good measure to use when the data are normally distributed. We describe it in detail below.
The median is the value in the middle of the data set. Half of the observations lie above the median and half below. When the data are normally distributed, the median and the mean will be very close to each other. When your data are not normally distributed (skewed to the left or right) the median is a more appropriate measure of centrality (see the animation below).

The mode is the value (height, in our case) that occurs most frequently in the data set. It’s not typically used in statistics, and we won’t cover it further here.

Taking the mean

The mean of a variable is the sum of its values, divided by the number of values.

This concept can be represented with the equation below. In our case, each “x” represents a giraffe height (i.e. a single observation), and the numerical subscript indicates its order in the sample. We’ll use \({\bar{x}}\) (read “x-bar”) to represent the mean of the height variable.

\[\begin{equation} \tag{1} \Large{\bar{x}} = \frac{x_1 + x_2 + ... + x_n}{n} \end{equation}\]

To make this more efficient, instead of writing “\({x_1 + x_2 + ... + x_n}\)”, we can use the uppercase sigma symbol \(\sum{}\) to represent summation of all the observations.

\[\begin{equation} \tag{2} \Large{\bar{x}} = \frac{\sum_{i=1}^{n}{x_i}}{n} \end{equation}\]

This might look intimidating, but equation (2) is really showing the same thing as (1). Let’s go through the steps again, breaking the symbols apart a bit (see annotated equation (3) below). The sigma means ‘add up’. What are we adding up? All the heights “x”. The “i = ” part indicates which term to begin with. For our purposes, this will always be the first observation, hence \(i\) = 1. The character on top of the sigma is the last observation we include in our summation. In this case it’s n – because we’re adding all n = 50 observations in each group of giraffes. In both equations, we still divide by the total number of observations in each group we have: again, n.

\[\begin{equation} \tag{3} \vcenter{\img[width=400px]{images/03_mean/eq_annotated.png}} \end{equation}\]
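As a small worked example of equation (2), suppose a sample has only n = 3 observations: \(x_1 = 2\), \(x_2 = 4\) and \(x_3 = 9\). These values are hypothetical and are not taken from the giraffe data. Substituting them in gives:

```latex
\[
\bar{x} = \frac{\sum_{i=1}^{3}{x_i}}{3} = \frac{x_1 + x_2 + x_3}{3} = \frac{2 + 4 + 9}{3} = \frac{15}{3} = 5
\]
```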

Notation for sample vs population

Recall our discussion about a sample versus a population. Different symbols are used to represent the mean for each of these. We’ve already discussed \(\bar{x}\) for the sample mean. The analogous symbol for the population mean is \({\mu}\) (read “mu”). Additionally, when referring to the size of the population, we will use a capital \({N}\) instead of a lowercase one.

Code it up

Using (2), it’s easy to translate this equation into code in R. The heights recorded from island 1 have been stored in a vector called heights_island1. Below we show the first few observations from this vector, using the head() function.

head(heights_island1)

## [1]  7.038865 13.154339  8.086511  8.159990  6.004716  9.455408

Use the interactive window below to calculate the mean “by hand”.
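If you are working outside the interactive window, the calculation might look like the sketch below. The vector heights_sample is a hypothetical stand-in that holds only the six values printed by head() above, not the full heights_island1 vector, so its mean will differ from the mean of the whole sample.

```r
# Hypothetical stand-in: only the six values shown by head(heights_island1)
heights_sample <- c(7.038865, 13.154339, 8.086511, 8.159990, 6.004716, 9.455408)

# Numerator of equation (2): add up every observation
total_height <- sum(heights_sample)

# Denominator of equation (2): the number of observations, n
n <- length(heights_sample)

# The mean is the sum divided by n
mean_by_hand <- total_height / n
mean_by_hand

# Check the result against R's built-in mean()
all.equal(mean_by_hand, mean(heights_sample))
```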

Create your own function

Now it’s your turn to write your own function. Call it “my_mean” and have it calculate the mean of any given vector. You’re going to use the rules for writing a function in R that you’ve used previously. As a reminder, you’ll use function() and embed your code (that you completed in the window above) within curly brackets {}. The advantage of making a “homemade” function is that you can string together all the steps from the previous exercise into a single command.

You can also complete the exercise above in RStudio on your local computer. This way you will be able to save your my_mean() function and script for future use.
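One possible version of my_mean() is sketched below. It is only one way to write it, and it assumes that the input is a numeric vector with no missing values. The test vector c(2, 4, 9) is made up for illustration.

```r
# A minimal sketch of my_mean(); assumes x is numeric and has no NA values
my_mean <- function(x) {
  total <- sum(x)   # add up all observations
  n <- length(x)    # count the observations
  total / n         # the last evaluated expression is the return value
}

# Try it on a small made-up vector: (2 + 4 + 9) / 3 = 5
my_mean(c(2, 4, 9))
```

Comparing the result with R’s built-in mean() on the same vector is a quick way to check your own version.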

Take a tea break!

Taking the median

To calculate the median go through the following steps:

Assess whether there is an odd or even number of observations
Order all observations from smallest to largest
If an odd number, then the median is the middle value at position: (n + 1) / 2
If an even number, then:
- Find the value at the position: n / 2
- Find the value at the position: (n / 2) + 1
- The median will be the mean of the values at these two positions.
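The steps above can be traced on two small vectors. The values are hypothetical (not from the giraffe data) and were chosen so that one vector has an odd number of observations and the other an even number.

```r
# Hypothetical example vectors
odd_values  <- c(9, 3, 7, 1, 5)   # n = 5, an odd number of observations
even_values <- c(9, 3, 8, 1)      # n = 4, an even number of observations

# Odd case: order the values, then take the one at position (n + 1) / 2
sorted_odd <- sort(odd_values)               # 1 3 5 7 9
n_odd <- length(sorted_odd)
median_odd <- sorted_odd[(n_odd + 1) / 2]    # position 3, so the median is 5

# Even case: order the values, then average positions n / 2 and (n / 2) + 1
sorted_even <- sort(even_values)             # 1 3 8 9
n_even <- length(sorted_even)
lower <- sorted_even[n_even / 2]             # position 2, value 3
upper <- sorted_even[(n_even / 2) + 1]       # position 3, value 8
median_even <- (lower + upper) / 2           # (3 + 8) / 2 = 5.5
```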

Before you write your own median function, two concepts need to be introduced: 1) the modulus operator %% and 2) if...else statements.

The modulus operation gives the remainder after division of one number by another. For example, in R, 11 %% 5 returns 1, which is the remainder of 11 divided by 5. If the modulus operation returns 0, then there is no remainder. It is useful to apply the modulus operation x %% 2 to determine whether a number x is even or odd by testing whether the result is exactly equal to 0. See example code below.

> 10 %% 2

## [1] 0

> 10 %% 2 == 0

## [1] TRUE

> 11 %% 2

## [1] 1

> 11 %% 2 == 0

## [1] FALSE

An if...else statement is useful when you want to specify distinct outcomes for objects dependent on whether they meet your set criteria. See below.
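As a minimal illustration, the sketch below combines an if...else statement with the %% operator to label a number as even or odd. The variable names n and parity, and the value 7, are made up for this example.

```r
# Hypothetical value to test
n <- 7

if (n %% 2 == 0) {
  # This branch runs when the condition is TRUE (no remainder)
  parity <- "even"
} else {
  # This branch runs when the condition is FALSE (remainder of 1)
  parity <- "odd"
}

parity   # 7 %% 2 is 1, so this is "odd"
```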

Now that you have a sense for how the %% operator can be used to test whether a number is EVEN or ODD, and how if...else statements work, use both of these concepts in the window below to write your own function that calculates the median of any vector.
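If you get stuck, here is one possible sketch. The name my_median is an assumption (it simply mirrors my_mean), and the function assumes a numeric vector with at least one value. Note that sort() silently drops missing values by default, which may or may not be what you want.

```r
# A minimal sketch of a median function; assumes x is numeric with length >= 1
my_median <- function(x) {
  sorted <- sort(x)       # order observations from smallest to largest
  n <- length(sorted)     # number of observations

  if (n %% 2 == 0) {
    # Even n: mean of the values at positions n / 2 and (n / 2) + 1
    (sorted[n / 2] + sorted[(n / 2) + 1]) / 2
  } else {
    # Odd n: the middle value at position (n + 1) / 2
    sorted[(n + 1) / 2]
  }
}

my_median(c(9, 3, 7, 1, 5))   # odd n: middle value of 1 3 5 7 9
my_median(c(9, 3, 8, 1))      # even n: mean of 3 and 8
```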

Things to think about

Remember that the sample mean is an estimate of the entire population’s mean (which would often be impossibly large to measure). How reliably does the mean of a sample represent the population mean? Warning: if a small sample has been used, the sample mean may not be reliable at all! Estimates from small samples are subject to the whims of randomness. On the other hand, the larger the sample, the closer the sample size gets to the population size, and the more reliable the sample estimate becomes.

Pressing ‘Play’ on the plot below will illustrate this concept.

The animation above shows the values of means calculated from increasingly larger samples: small samples on the left and larger samples to the right (on the x-axis).
Each point on the zig-zag line is the mean calculated from a random sample. The true mean of the population is 0.
The y-axis shows what the mean is for a sample of that particular size. Though the y-values vary here, remember that if the sample were a good estimate of the population, the y-values should be very close to 0.
You can see that when the samples are small, the sample mean isn’t necessarily a good representation of the population that it was sampled from, and that is not a good thing.
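The same idea can be explored numerically with a short simulation. The sketch below is not the code behind the animation. It assumes a normally distributed population with mean 0 and standard deviation 1 (the text only states that the true mean is 0), and the sample sizes and seed are arbitrary choices.

```r
# Arbitrary seed so the random draws are reproducible
set.seed(42)

# Assumed sample sizes, from small to large
sample_sizes <- c(5, 50, 500, 5000)

# For each size, draw one random sample from the assumed population
# (normal, mean 0, sd 1) and calculate its mean
sample_means <- sapply(sample_sizes, function(n) {
  mean(rnorm(n, mean = 0, sd = 1))
})

# Compare each sample mean with the true population mean of 0
data.frame(sample_size = sample_sizes, sample_mean = sample_means)
```

Rerunning the code with different seeds should show the means from the smallest samples jumping around the most, while the means from the largest samples tend to stay close to 0.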

For further reading see the Law of Large Numbers.

This project was created entirely in RStudio using R Markdown and published on GitHub pages.

Mean, Median, and Mode