## Statistical Outliers Impossible in Small Samples

**How many pieces of data are needed before it’s possible for one of them to be an Outlier?**

One definition of an Outlier is a piece of data whose value falls at least **2 sample standard deviations from the mean**.

After introducing this idea to my Statistics class, I decided to make a simple applet in GeoGebra with 4 values on a number line which you could drag freely to change their values, with arrows pointing at the outlier thresholds, i.e. at two sample standard deviations above and below the mean:

But something unexpected happened. However I arranged the values, I could not drag one of them far enough away from the others for it to become an outlier because the threshold would accelerate away up or down the number line. I came to the compelling conclusion that outliers, by this definition, are impossible with a sample of only 4 pieces of data.

Twenty minutes earlier, before my Year 12 had legged it for the bus, we had found an outlier in a sample of 10 pieces of data. So if it’s impossible to have an outlier when n=4, but possible when n=10, the minimum sample size for an outlier to be theoretically possible must be between these values.

**Solving the problem**

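Take a generalised sample of 4 pieces of data, with values *a*, *b*, *c*, *d*. For the piece of data with value *a* to be an outlier by the above definition, because of the symmetry (two standard deviations above *or* below the mean), the following is sufficient, writing $\bar{x}$ for the sample mean and $s$ for the sample standard deviation:

$$a - \bar{x} \ge 2s, \qquad \bar{x} = \frac{a+b+c+d}{4}, \qquad s = \sqrt{\frac{(a-\bar{x})^2+(b-\bar{x})^2+(c-\bar{x})^2+(d-\bar{x})^2}{3}}$$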

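But to keep the standard deviation as small as possible it makes sense for *b*, *c* and *d* to take the same value, so:

$$a - \frac{a+3b}{4} \ge 2\sqrt{\frac{\left(a-\frac{a+3b}{4}\right)^2 + 3\left(b-\frac{a+3b}{4}\right)^2}{3}}$$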

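Even better, we can eliminate *b* altogether by setting *b* = 0 (shifting every value by the same amount changes neither the standard deviation nor any value's distance from the mean):

$$a - \frac{a}{4} \ge 2\sqrt{\frac{\left(a-\frac{a}{4}\right)^2 + 3\left(\frac{a}{4}\right)^2}{3}}$$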

We could solve this without much fuss, but there’s a clear chance at this point to generalise. Why just stick with 4 pieces of data when we can have *n* pieces of data instead?

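There is actually another opportunity to generalise here: we can leave room to change the definition of an outlier itself. Instead of outliers having to be at least 2 standard deviations from the mean, **let’s redefine an outlier to be a piece of data whose value is at least *q* standard deviations from the mean**. With one value *a* and the other *n* − 1 values all 0, the condition becomes:

$$a - \frac{a}{n} \ge q\sqrt{\frac{\left(a-\frac{a}{n}\right)^2 + (n-1)\left(\frac{a}{n}\right)^2}{n-1}}$$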

Life is too short to manipulate this with pen and paper, so simplify it by employing Wolfram Alpha. As you might expect, *a* cancels:

$$\frac{n-1}{\sqrt{n}} \ge q$$
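For anyone who would rather check the cancellation by hand, here is a sketch of the algebra. It assumes *a* > 0 and a sample made up of one value *a* and *n* − 1 zeros, so that the mean is *a*/*n*; the symbol $s$ for the sample standard deviation is notation introduced for this sketch.

$$
\begin{aligned}
s^2 &= \frac{1}{n-1}\left[\left(a-\frac{a}{n}\right)^2 + (n-1)\left(\frac{a}{n}\right)^2\right]
     = \frac{a^2}{n^2}\cdot\frac{(n-1)^2 + (n-1)}{n-1}
     = \frac{a^2}{n}, \\
a - \frac{a}{n} \ge q\,s
  &\iff \frac{a(n-1)}{n} \ge \frac{q\,a}{\sqrt{n}}
  \iff \frac{n-1}{\sqrt{n}} \ge q .
\end{aligned}
$$

So the standard deviation of this extreme sample is just $a/\sqrt{n}$, which is why *a* drops out of the final condition.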

We can solve this for *q* or *n* and both results are meaningful in their own way.

$$q \le \frac{n-1}{\sqrt{n}} \qquad\qquad n \ge \left(\frac{q+\sqrt{q^2+4}}{2}\right)^2$$

When an outlier is defined as a piece of data whose value is at least *q* standard deviations from the mean,

- the inequality on the left is the constraint on *q* such that outliers are possible.
- the inequality on the right is the constraint on *n*, i.e. the minimum number of pieces of data in the sample, such that outliers are possible.

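When *q* = 2 (the common definition of an outlier), the inequality on the right gives us **the minimum number of pieces of data required for outliers to be possible:**

$$n \ge \left(\frac{2+\sqrt{2^2+4}}{2}\right)^2 = \left(1+\sqrt{2}\right)^2 = 3+2\sqrt{2} \approx 5.83$$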

At this point I modified my GeoGebra applet to include a similar number line for 5 pieces of data and another for 6. Thankfully, the graphics agreed with the algebra: with 6 pieces of data (6>5.83), the mean had enough inertia for me to generate outliers.
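The same check can be run numerically without GeoGebra. The sketch below assumes the most extreme arrangement from the derivation (one value *a*, every other value 0) and measures how many sample standard deviations *a* sits from the mean; `extreme_z` is a name made up for this illustration.

```python
from statistics import mean, stdev

def extreme_z(n, a=1.0):
    """Distance of a from the mean, in sample standard deviations,
    for the sample [a, 0, 0, ..., 0] of size n."""
    sample = [a] + [0.0] * (n - 1)
    return (a - mean(sample)) / stdev(sample)

for n in (4, 5, 6):
    z = extreme_z(n)
    verdict = "outlier possible" if z >= 2 else "no outlier possible"
    print(f"n = {n}: z = {z:.3f} ({verdict})")
```

Changing `a` makes no difference to the result, which matches the way the thresholds kept running away from the dragged point in the applet.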

But the left inequality is also quite compelling.

$$q \le \frac{n-1}{\sqrt{n}}$$

This is the constraint on the definition of an outlier, in terms of *n*, such that outliers are possible. Here are a few values of *q* calculated for given values of *n*:
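Values like these are easy to regenerate. A minimal sketch in Python, assuming the left inequality is *q* ≤ (*n* − 1)/√*n*, which is what the one-value-against-zeros setup simplifies to. The sample sizes listed are an arbitrary illustrative choice, and `max_q` is a name made up for this sketch.

```python
from math import sqrt

def max_q(n):
    """Largest q for which an outlier is still possible in a sample of size n,
    assuming the bound q <= (n - 1) / sqrt(n)."""
    return (n - 1) / sqrt(n)

# Illustrative sample sizes only
for n in (4, 5, 6, 10, 25, 100):
    print(f"n = {n:>3}: q <= {max_q(n):.3f}")
```

The bound first reaches 2 at *n* = 6 and grows roughly like √*n* after that.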

**So why is the common definition of an outlier “2” standard deviations from the mean?**

95% of the Normal distribution is within 1.96 standard deviations of the mean. Statistical experiments often use 95% as their confidence level, and therefore we look to see whether any pieces of data fall beyond this threshold. But 1.96 is painfully close to an integer, so we settled on 2 for the sake of nicer numbers.

See the table below for some common confidence levels and the associated *q* value (or *z* value, if you’re used to the Normal Distribution).
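These values can be computed with Python's standard library. A small sketch: the three confidence levels are picked for illustration, and `z_for_confidence` is a hypothetical helper name.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """Two-sided z value: the central `level` of a standard Normal
    distribution lies within z standard deviations of the mean."""
    return NormalDist().inv_cdf(0.5 + level / 2)

# Illustrative confidence levels only
for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {z_for_confidence(level):.3f}")
```

The 95% row gives the familiar 1.96.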

**Some final remarks:**

With more data, the maximum value of *q* increases. This seems to make sense: if you have more data fitting a pattern, the outlier thresholds are closer to the mean; or, an outlier has to be defined as being *more* standard deviations from the mean before it becomes impossible to have outliers.

Scenario: You have 98 pieces of data that are clustered fairly tightly around the mean, and a 99th piece of data that is distinctly separate from the others, but you’ve done the maths and it is not more than 2 standard deviations from the mean. You then get a new, 100th piece of data, which is close to the mean. This pulls the outlier thresholds in ever so slightly, meaning the 99th piece is now an outlier because there are now enough other pieces of data fitting a pattern.

I can imagine a general scenario where somebody might want to tighten the definition of an outlier, perhaps to make *q* = 1.5 or even *q* = 1, to exclude more data, accepting only those pieces that are clustered tightly around the mean. For example, if bolts for an aircraft were being strength tested, you might want to be extra stringent. Conversely, with 100 pieces of data, outliers are possible as long as *q* ≤ 9.9, and I can’t think why anyone would ever set *q* so large unless they had reason to be extremely lenient. Since the maximum *q* increases as *n* increases, and reliable statistical calculations require large *n*, I think we can safely assume that outliers are nearly always going to be possible.

Possible extensions: What’s the minimum sample size in which at least 2 outliers can be found? At least m outliers?

An interesting exploration – something I have also been thinking about for a real-world application.

In my case instead of using mean and std. dev., I am using median and rank statistics for outliers like inter-quartile range. Could be an interesting follow-up exploration / nice segue into robust statistics.

A very interesting idea. I’ve created another construction to check it, improving the original functionality.

http://tube.geogebra.org/material/show/id/1332661