Probability and Sampling Fundamentals
Week 8: The multivariate normal distribution and large sample theory
In Week 7 we learned about bivariate continuous random variables with a brief discussion on how to extend these ideas to a continuous random vector. The first part of this week's material will introduce the most commonly encountered standard distribution for a continuous random vector, the multivariate normal distribution, with a particular focus on the bivariate case. The second part will introduce some large sample theory including the law of large numbers and one of the most important theorems in probability, the central limit theorem.
Week 8 learning material aims
The material in week 8 covers:
the multivariate normal distribution;
calculating marginal and conditional distributions for the bivariate normal;
the weak law of large numbers;
the central limit theorem;
the normal approximation to the binomial and Poisson distribution.
Vector-matrix notation for expected value and variance
In Week 4 and Week 7 we have looked at the expected value ("mean"), variance and covariance of random variables. In this section we will introduce the vector-matrix notation for the mean and (co)variance.
Suppose we have two random variables $X_1$ and $X_2$ and their expected values are given by $E(X_1) = \mu_1$ and $E(X_2) = \mu_2$.
We can think of $X_1$ and $X_2$ as being part of a vector $\boldsymbol{X} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$. Similarly, we can arrange the expected values into a vector
$$E(\boldsymbol{X}) = \boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}.$$
We can arrange the variances $\operatorname{Var}(X_1) = \sigma_1^2$ and $\operatorname{Var}(X_2) = \sigma_2^2$ as well as the covariance $\operatorname{Cov}(X_1, X_2) = \sigma_{12}$ into a symmetric matrix, called the variance matrix or covariance matrix (or even sometimes variance-covariance matrix):
$$\operatorname{Var}(\boldsymbol{X}) = \boldsymbol{\Sigma} = \begin{pmatrix} \operatorname{Var}(X_1) & \operatorname{Cov}(X_1, X_2) \\ \operatorname{Cov}(X_2, X_1) & \operatorname{Var}(X_2) \end{pmatrix} = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}.$$
Notice that the values in the top right and bottom left of the matrix are the same; this is because the order of arguments does not matter for covariance, i.e. $\operatorname{Cov}(X_1, X_2) = \operatorname{Cov}(X_2, X_1)$.
This vector-matrix notation comes in especially handy when we look at linear transformations of random vectors.
In Week 4 we have seen that for univariate random variables $X$ and $Y$, the linear function $Y = aX + b$ has expected value and variance
$$E(Y) = aE(X) + b, \qquad \operatorname{Var}(Y) = a\operatorname{Var}(X)a,$$
where $a$ and $b$ are constants. You might, at this stage, be wondering why we have written the variance of $Y$ as $a\operatorname{Var}(X)a$ rather than $a^2\operatorname{Var}(X)$, but we will see that the formula for the multivariate case has exactly that form.
Let's go back to the vector $\boldsymbol{X} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$ and define a matrix $\boldsymbol{A} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$ and a vector $\boldsymbol{b} = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}$.
We can then define a random vector $\boldsymbol{Y} = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}$ as a linear function
$$\boldsymbol{Y} = \boldsymbol{A}\boldsymbol{X} + \boldsymbol{b},$$
or, equivalently,
$$\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} a_{11}X_1 + a_{12}X_2 + b_1 \\ a_{21}X_1 + a_{22}X_2 + b_2 \end{pmatrix}.$$
Then the expected value and variance are given by
$$E(\boldsymbol{Y}) = \boldsymbol{A}E(\boldsymbol{X}) + \boldsymbol{b} = \boldsymbol{A}\boldsymbol{\mu} + \boldsymbol{b}, \qquad \operatorname{Var}(\boldsymbol{Y}) = \boldsymbol{A}\boldsymbol{\Sigma}\boldsymbol{A}^\top.$$
The above formulae hold for vectors $\boldsymbol{X}$ and $\boldsymbol{Y}$ of any dimension, not just bivariate vectors.
Note that matrix multiplication is not commutative (i.e. the order in the multiplication matters), so we cannot write the covariance matrix as $\boldsymbol{A}^2\boldsymbol{\Sigma}$, which would be more similar to the formula for the univariate case.
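For example, we can check these formulae numerically with a short Python sketch using numpy; the particular choices of $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$, $\boldsymbol{A}$ and $\boldsymbol{b}$ below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative choices of mean vector, covariance matrix, A and b
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 1.0],
              [3.0, -2.0]])
b = np.array([0.0, 1.0])

# Simulate a large sample of X and transform it to Y = AX + b
X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ A.T + b

# Empirical mean and covariance of Y ...
print(Y.mean(axis=0))           # close to A @ mu + b
print(np.cov(Y, rowvar=False))  # close to A @ Sigma @ A.T

# ... compared with the theoretical values E(Y) = A mu + b and Var(Y) = A Sigma A^T
print(A @ mu + b)
print(A @ Sigma @ A.T)
```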
Supplement 1
Derivation of the formulae
We can work out the expected value, variances and covariance of $Y_1$ and $Y_2$ using the rules we have learned in Week 7.
For the expected values,
$$E(Y_1) = E(a_{11}X_1 + a_{12}X_2 + b_1) = a_{11}\mu_1 + a_{12}\mu_2 + b_1,$$
$$E(Y_2) = E(a_{21}X_1 + a_{22}X_2 + b_2) = a_{21}\mu_1 + a_{22}\mu_2 + b_2.$$
We can recognise that this is the same as
$$\begin{pmatrix} a_{11}\mu_1 + a_{12}\mu_2 + b_1 \\ a_{21}\mu_1 + a_{22}\mu_2 + b_2 \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} + \begin{pmatrix} b_1 \\ b_2 \end{pmatrix},$$
which means nothing other than
$$E(\boldsymbol{Y}) = \boldsymbol{A}E(\boldsymbol{X}) + \boldsymbol{b} = \boldsymbol{A}\boldsymbol{\mu} + \boldsymbol{b}.$$
Let's now turn to the variances and covariances. These are a lot more complicated, so let's only work out one variance:
$$\operatorname{Var}(Y_1) = \operatorname{Var}(a_{11}X_1 + a_{12}X_2 + b_1) = a_{11}^2\sigma_1^2 + a_{12}^2\sigma_2^2 + 2a_{11}a_{12}\sigma_{12}.$$
If we compare this to the top-left entry of
$$\boldsymbol{A}\boldsymbol{\Sigma}\boldsymbol{A}^\top = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}\begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}\begin{pmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \end{pmatrix},$$
which is $a_{11}^2\sigma_1^2 + a_{12}^2\sigma_2^2 + 2a_{11}a_{12}\sigma_{12}$, we can observe that these are the same. Hence,
$$\operatorname{Var}(\boldsymbol{Y}) = \boldsymbol{A}\boldsymbol{\Sigma}\boldsymbol{A}^\top.$$
The multivariate normal distribution
In Week 6 we learned about the univariate normal distribution. There is also a multivariate family of normal distributions. To begin with, let's consider the bivariate case. A univariate normal distribution has one random variable; a bivariate normal distribution is made up of two random variables. The two variables in the bivariate normal are both normally distributed, and they have a normal distribution when they are added together. Let's consider the bivariate normal distribution using the following example.
Example 1
Karl Pearson, a very famous statistician, analysed 1078 pairs of heights (in inches) of fathers and their adult sons. Let $X_1$ be the height of a father and $X_2$ the height of his adult son.
Let's assume that a bivariate normal distribution is appropriate for these data and the corresponding parameters are
$$\mu_1 = 67.7, \quad \mu_2 = 68.7, \quad \sigma_1 = 2.74, \quad \sigma_2 = 2.81,$$
where $\mu_1$ is the mean height for fathers, $\mu_2$ is the mean height for adult sons, $\sigma_1$ is the standard deviation for the height of fathers and $\sigma_2$ is the standard deviation for the height of adult sons.
In general, to characterise the bivariate normal distribution, we need the following parameters:
the mean and variance for $X_1$ and $X_2$. These can be denoted as $\mu_1$, $\mu_2$, $\sigma_1^2$ and $\sigma_2^2$;
the covariance between $X_1$ and $X_2$, denoted as $\sigma_{12} = \operatorname{Cov}(X_1, X_2)$.
So we need a total of five parameters; however, only one of these parameters, the covariance $\sigma_{12}$, is needed to specify the dependence between the two random variables.
Rather than list all these parameters separately, it is more convenient and useful for calculations to write these in the vector-matrix notation we have just seen, where we have a mean vector and a covariance matrix
$$\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}.$$
The covariance matrix can also be written in terms of the correlation $\rho_{12}$. In Week 4 we defined correlation as
$$\rho_{12} = \frac{\operatorname{Cov}(X_1, X_2)}{\sqrt{\operatorname{Var}(X_1)\operatorname{Var}(X_2)}},$$
which is equivalent to
$$\rho_{12} = \frac{\sigma_{12}}{\sigma_1\sigma_2}.$$
We can rearrange this to be
$$\sigma_{12} = \rho_{12}\sigma_1\sigma_2.$$
We can now write that
$$\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 \\ \rho_{12}\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}.$$
If the random vector $\boldsymbol{X}$ follows a bivariate (or multivariate) normal with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$, we write $\boldsymbol{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.
Example 2
In Example 1, say we are also told that $\rho_{12} = 0.50$. So the mean vector and the covariance matrix in this example would be
$$\boldsymbol{\mu} = \begin{pmatrix} 67.7 \\ 68.7 \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} 2.74^2 & 3.85 \\ 3.85 & 2.81^2 \end{pmatrix},$$
since $\sigma_{12} = \rho_{12}\sigma_1\sigma_2 = 0.50 \cdot 2.74 \cdot 2.81 = 3.85$.
The vector-matrix notation shown above makes it easier to generalise to more than two random variables. In general, the multivariate normal distribution (MVN) is made up of $p$ random variables and has some generic $p$-dimensional mean vector $\boldsymbol{\mu}$ and $p \times p$ covariance matrix $\boldsymbol{\Sigma}$.
In Week 6 we discussed the characteristic bell-shaped curve of the normal density curve. The contour lines of the joint density of multivariate normal distributions have a characteristic elliptical shape to them. Below are some examples of bivariate random variables where $X_1$ and $X_2$ both follow standard normal distributions ($N(0, 1)$) with varying amounts of correlation. The contours on the plot are in fact ellipses (for a two-dimensional MVN) centred on $\boldsymbol{\mu} = (0, 0)$ (in red). The elliptical regions moving outwards from the centre contain, respectively, 50%, 90%, 95% and 99% of the total probability.
If we plot the data in Example 1 we can see the characteristic elliptical shape. Here, there is clearly a positive relationship between the father's and son's heights.
Probability density function of multivariate normal
Let's now define the probability density function for a multivariate normal distribution.
Definition 1
Multivariate normal p.d.f.
Suppose that the random vector $\boldsymbol{X}$ can take any value in $\mathbb{R}^p$ and that $\boldsymbol{X}$ has the p.d.f.
$$f_{\boldsymbol{X}}(\boldsymbol{x}) = \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}}\exp\!\left(-\frac{(\boldsymbol{x} - \boldsymbol{\mu})^\top\boldsymbol{\Sigma}^{-1}(\boldsymbol{x} - \boldsymbol{\mu})}{2}\right)$$
for all $\boldsymbol{x} \in \mathbb{R}^p$; then $\boldsymbol{X}$ is said to have a multivariate normal distribution, with mean $E(\boldsymbol{X}) = \boldsymbol{\mu}$ and (co)variance matrix $\operatorname{Var}(\boldsymbol{X}) = \boldsymbol{\Sigma}$, written
$$\boldsymbol{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma}).$$
Note:
$|\boldsymbol{\Sigma}|$ corresponds to the determinant of $\boldsymbol{\Sigma}$ and
$\boldsymbol{\Sigma}^{-1}$ refers to the inverse of $\boldsymbol{\Sigma}$.
Example 4
Suppose that $\boldsymbol{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with
$$\boldsymbol{\mu} = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} 5 & 2 \\ 2 & 4 \end{pmatrix}.$$
What is the p.d.f. of $\boldsymbol{X}$?
Answer:
First, $|\boldsymbol{\Sigma}| = 5 \cdot 4 - 2 \cdot 2 = 16$ and
$$\boldsymbol{\Sigma}^{-1} = \frac{1}{16}\begin{pmatrix} 4 & -2 \\ -2 & 5 \end{pmatrix}.$$
Then
$$\begin{aligned} f_{\boldsymbol{X}}(\boldsymbol{x}) &= (2\pi)^{-1}(16)^{-1/2}\exp\!\left[-\tfrac{1}{2}\begin{pmatrix} x_1 - 1 & x_2 - 2 \end{pmatrix}\frac{1}{16}\begin{pmatrix} 4 & -2 \\ -2 & 5 \end{pmatrix}\begin{pmatrix} x_1 - 1 \\ x_2 - 2 \end{pmatrix}\right] \\ &= \frac{1}{8\pi}\exp\!\left[-\tfrac{1}{32}\left(4[x_1 - 1]^2 + 5[x_2 - 2]^2 - 4[x_1 - 1][x_2 - 2]\right)\right] \\ &= \frac{1}{8\pi}\exp\!\left[-\tfrac{1}{32}\left(4x_1^2 + 5x_2^2 - 16x_2 - 4x_1x_2 + 16\right)\right]. \end{aligned}$$
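To check this calculation numerically, here is a short Python sketch (using scipy's multivariate_normal; the evaluation point is an arbitrary choice for illustration) comparing the density from the formula above with the library's value.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, 2.0])
Sigma = np.array([[5.0, 2.0],
                  [2.0, 4.0]])

x1, x2 = 0.5, 3.0  # arbitrary evaluation point

# Density from the formula derived in Example 4
manual = (1 / (8 * np.pi)) * np.exp(
    -(4 * x1**2 + 5 * x2**2 - 16 * x2 - 4 * x1 * x2 + 16) / 32
)

# Density from scipy's implementation of the MVN p.d.f.
library = multivariate_normal(mean=mu, cov=Sigma).pdf([x1, x2])

print(manual, library)  # the two values agree
```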
Linear functions
In Week 6 we have seen that if $X$ has a univariate normal distribution, $X \sim N(\mu, \sigma^2)$, then $aX + b \sim N(a\mu + b, a^2\sigma^2)$. A similar property holds for the multivariate normal distribution.
Proposition 1
Linear functions of MVN
Suppose $\boldsymbol{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$; then
$$\boldsymbol{A}\boldsymbol{X} + \boldsymbol{b} \sim N(\boldsymbol{A}\boldsymbol{\mu} + \boldsymbol{b},\ \boldsymbol{A}\boldsymbol{\Sigma}\boldsymbol{A}^\top).$$
In other words, all this means is that linear functions of a normally distributed random vector are again normally distributed, just like in the univariate case.
Again just like in the univariate case, we can standardise a multivariate normal distribution.
If we choose $\boldsymbol{\Sigma}^{-1/2}$ to be the inverse matrix square root of $\boldsymbol{\Sigma}$, i.e. $\boldsymbol{\Sigma}^{-1} = \boldsymbol{\Sigma}^{-1/2}\left(\boldsymbol{\Sigma}^{-1/2}\right)^\top$, then if
$\boldsymbol{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ we can standardise $\boldsymbol{X}$ as follows:
$$\boldsymbol{\Sigma}^{-1/2}(\boldsymbol{X} - \boldsymbol{\mu}) \sim N(\boldsymbol{0}, \boldsymbol{I}).$$
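As an illustration, the sketch below standardises simulated draws from a bivariate normal using the symmetric inverse square root of $\boldsymbol{\Sigma}$ (computed here from an eigendecomposition; the mean and covariance are illustrative choices) and checks that the result has mean $\boldsymbol{0}$ and identity covariance.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative bivariate normal
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.5]])

# Symmetric inverse square root of Sigma via its eigendecomposition
vals, vecs = np.linalg.eigh(Sigma)
Sigma_inv_half = vecs @ np.diag(1 / np.sqrt(vals)) @ vecs.T

X = rng.multivariate_normal(mu, Sigma, size=200_000)
Z = (X - mu) @ Sigma_inv_half.T   # standardised vectors, one per row

print(Z.mean(axis=0))             # close to (0, 0)
print(np.cov(Z, rowvar=False))    # close to the identity matrix
```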
Task 1
Suppose that the continuous random vector
$$\boldsymbol{X} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N\!\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & -\tfrac{1}{2} \\ -\tfrac{1}{2} & 1 \end{pmatrix}\right).$$
Identify the distributions of:
(i) $X_1 + X_2$,
(ii) $X_1 - X_2$,
(iii) $X_2 - X_1$,
(iv) $3X_1 - 2X_2 + 1$.
Show answer
(i) $X_1 + X_2 = \begin{pmatrix} 1 & 1 \end{pmatrix}\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$, so
$$E(X_1 + X_2) = \begin{pmatrix} 1 & 1 \end{pmatrix}\boldsymbol{\mu} = \begin{pmatrix} 1 & 1 \end{pmatrix}\begin{pmatrix} 0 \\ 0 \end{pmatrix} = 0$$
$$\operatorname{Var}(X_1 + X_2) = \begin{pmatrix} 1 & 1 \end{pmatrix}\boldsymbol{\Sigma}\begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 1 & 1 \end{pmatrix}\begin{pmatrix} 1 & -\tfrac{1}{2} \\ -\tfrac{1}{2} & 1 \end{pmatrix}\begin{pmatrix} 1 \\ 1 \end{pmatrix} = 1$$
Using Proposition 1, then, $X_1 + X_2 \sim N(0, 1)$.
(ii) $X_1 - X_2 = \begin{pmatrix} 1 & -1 \end{pmatrix}\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$, so
$$E(X_1 - X_2) = \begin{pmatrix} 1 & -1 \end{pmatrix}\boldsymbol{\mu} = \begin{pmatrix} 1 & -1 \end{pmatrix}\begin{pmatrix} 0 \\ 0 \end{pmatrix} = 0$$
$$\operatorname{Var}(X_1 - X_2) = \begin{pmatrix} 1 & -1 \end{pmatrix}\boldsymbol{\Sigma}\begin{pmatrix} 1 \\ -1 \end{pmatrix} = \begin{pmatrix} 1 & -1 \end{pmatrix}\begin{pmatrix} 1 & -\tfrac{1}{2} \\ -\tfrac{1}{2} & 1 \end{pmatrix}\begin{pmatrix} 1 \\ -1 \end{pmatrix} = 3$$
Using Proposition 1, then, $X_1 - X_2 \sim N(0, 3)$.
(iii) $X_2 - X_1 = \begin{pmatrix} -1 & 1 \end{pmatrix}\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$, so
$$E(X_2 - X_1) = \begin{pmatrix} -1 & 1 \end{pmatrix}\boldsymbol{\mu} = \begin{pmatrix} -1 & 1 \end{pmatrix}\begin{pmatrix} 0 \\ 0 \end{pmatrix} = 0$$
$$\operatorname{Var}(X_2 - X_1) = \begin{pmatrix} -1 & 1 \end{pmatrix}\boldsymbol{\Sigma}\begin{pmatrix} -1 \\ 1 \end{pmatrix} = \begin{pmatrix} -1 & 1 \end{pmatrix}\begin{pmatrix} 1 & -\tfrac{1}{2} \\ -\tfrac{1}{2} & 1 \end{pmatrix}\begin{pmatrix} -1 \\ 1 \end{pmatrix} = 3$$
Using Proposition 1, then, $X_2 - X_1 \sim N(0, 3)$.
(iv) $3X_1 - 2X_2 + 1 = \begin{pmatrix} 3 & -2 \end{pmatrix}\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} + 1$, so
$$E(3X_1 - 2X_2 + 1) = \begin{pmatrix} 3 & -2 \end{pmatrix}\boldsymbol{\mu} + 1 = \begin{pmatrix} 3 & -2 \end{pmatrix}\begin{pmatrix} 0 \\ 0 \end{pmatrix} + 1 = 1$$
$$\operatorname{Var}(3X_1 - 2X_2 + 1) = \begin{pmatrix} 3 & -2 \end{pmatrix}\boldsymbol{\Sigma}\begin{pmatrix} 3 \\ -2 \end{pmatrix} = \begin{pmatrix} 3 & -2 \end{pmatrix}\begin{pmatrix} 1 & -\tfrac{1}{2} \\ -\tfrac{1}{2} & 1 \end{pmatrix}\begin{pmatrix} 3 \\ -2 \end{pmatrix} = 19$$
Using Proposition 1, then, $3X_1 - 2X_2 + 1 \sim N(1, 19)$.
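The variances above are just the quadratic forms $\boldsymbol{a}\boldsymbol{\Sigma}\boldsymbol{a}^\top$ for the different coefficient vectors $\boldsymbol{a}$; a few lines of Python reproduce them.

```python
import numpy as np

Sigma = np.array([[1.0, -0.5],
                  [-0.5, 1.0]])

# Coefficient vectors for (i) X1+X2, (ii) X1-X2, (iii) X2-X1, (iv) 3X1-2X2 (+1)
for a in ([1, 1], [1, -1], [-1, 1], [3, -2]):
    a = np.array(a, dtype=float)
    print(a, a @ Sigma @ a)   # variances: 1, 3, 3, 19
```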
Marginal distributions
The marginal distribution of a subset of variables in an MVN can be found by simply taking the relevant subset of the means and the relevant subset of the covariance matrix for the variables you are interested in.
Proposition 2
Marginal distributions for bivariate normal
Let $X_1$ and $X_2$ be bivariate normal random variables, and suppose
$$\boldsymbol{X} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N\!\left(\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 \\ \rho_{12}\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}\right).$$
Then
$$X_1 \sim N(\mu_1, \sigma_1^2),$$
$$X_2 \sim N(\mu_2, \sigma_2^2).$$
An important consequence of this property is that the marginal distribution of every single variable of a multivariate normal random vector is again normal.
Example 5
In Example 1, calculate the marginal distributions of $X_1$ and $X_2$.
Answer:
We know
$$\boldsymbol{X} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N\!\left(\begin{pmatrix} 67.7 \\ 68.7 \end{pmatrix}, \begin{pmatrix} 2.74^2 & 3.85 \\ 3.85 & 2.81^2 \end{pmatrix}\right).$$
Therefore
$$X_1 \sim N(67.7, 2.74^2) \quad \text{and} \quad X_2 \sim N(68.7, 2.81^2).$$
If we plot these variables separately we can see that both variables have the typical bell-shaped curve as we would expect for data which follows a normal distribution.
[Figure: marginal probability density curves for the heights of fathers (inches) and the heights of sons (inches).]
Suppose that the continuous random vector
$$\boldsymbol{X} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N\!\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & -\tfrac{1}{2} \\ -\tfrac{1}{2} & 1 \end{pmatrix}\right).$$
Identify the marginal distributions of $X_1$ and $X_2$.
Find
(i) $E(X_1)$ and $\operatorname{Var}(X_1)$,
(ii) $E(X_2)$ and $\operatorname{Var}(X_2)$,
(iii) $\operatorname{Cov}(X_1, X_2)$ and $\rho(X_1, X_2)$.
Show answer
$X_1 \sim N(0, 1)$ and $X_2 \sim N(0, 1)$
(i) $E(X_1) = 0$, $\operatorname{Var}(X_1) = 1$,
(ii) $E(X_2) = 0$, $\operatorname{Var}(X_2) = 1$,
(iii) $\operatorname{Cov}(X_1, X_2) = -\tfrac{1}{2}$ and
$$\rho(X_1, X_2) = \frac{\operatorname{Cov}(X_1, X_2)}{\sqrt{\operatorname{Var}(X_1)\cdot\operatorname{Var}(X_2)}} = \frac{-\tfrac{1}{2}}{\sqrt{1 \cdot 1}} = -\tfrac{1}{2}.$$
Conditional distributions
Another important property of the MVN distribution is that if $X_1$ and $X_2$ have a multivariate normal distribution, then the conditional distribution of $X_1$ given that $X_2 = x_2$ also has a normal distribution.
Proposition 3
Conditional distributions for bivariate normal
Let $X_1$ and $X_2$ be bivariate normal random variables, and suppose
$$\boldsymbol{X} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N\!\left(\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 \\ \rho_{12}\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}\right).$$
Then
$$X_1 \mid X_2 = x_2 \sim N\!\left(\mu_1 + \frac{\sigma_1}{\sigma_2}\rho_{12}(x_2 - \mu_2),\ (1 - \rho_{12}^2)\sigma_1^2\right)$$
and
$$X_2 \mid X_1 = x_1 \sim N\!\left(\mu_2 + \frac{\sigma_2}{\sigma_1}\rho_{12}(x_1 - \mu_1),\ (1 - \rho_{12}^2)\sigma_2^2\right).$$
Let's take a moment to try and understand what is going on here, focusing on the conditional distribution of $X_1 \mid X_2 = x_2$:
the conditional mean is equal to the mean of $X_1$ ($\mu_1$) plus a term which is positive if the observed value $x_2$ is larger than the mean of $X_2$, and negative if $x_2$ is smaller than the mean of $X_2$ (assuming $\rho_{12}$ is positive; if $\rho_{12}$ is negative the opposite is true).
the conditional variance $(1 - \rho_{12}^2)\sigma_1^2$ is smaller than the marginal variance $\sigma_1^2$, and gets smaller as the correlation gets stronger.
Example 6
In Example 1, calculate the conditional distribution of fathers' heights given that a son's height is equal to 65 inches.
Answer:
We know
$$\boldsymbol{X} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N\!\left(\begin{pmatrix} 67.7 \\ 68.7 \end{pmatrix}, \begin{pmatrix} 2.74^2 & 3.85 \\ 3.85 & 2.81^2 \end{pmatrix}\right).$$
We want the conditional distribution of $X_1 \mid X_2 = 65$. Looking at Proposition 3 we can pull out all of the relevant pieces of information we need to calculate the conditional mean and variance:
$$\mu_1 = 67.7, \quad \mu_2 = 68.7, \quad \sigma_1 = 2.74, \quad \sigma_2 = 2.81, \quad \rho_{12} = 0.50.$$
We can then substitute these values into the formulae from Proposition 3 to get
$$\begin{aligned} E(X_1 \mid X_2 = 65) &= \mu_1 + \frac{\sigma_1}{\sigma_2}\rho_{12}(x_2 - \mu_2) \\ &= 67.7 + \frac{2.74}{2.81} \cdot 0.50 \cdot (65 - 68.7) \\ &= 65.90 \end{aligned}$$
$$\begin{aligned} \operatorname{Var}(X_1 \mid X_2 = 65) &= (1 - \rho_{12}^2)\sigma_1^2 \\ &= (1 - 0.50^2) \cdot 2.74^2 \\ &= 5.63 \end{aligned}$$
Therefore, $X_1 \mid X_2 = 65 \sim N(65.90, 5.63)$.
So the conditional mean (65.90) is smaller than the mean for fathers ($\mu_1 = 67.7$) since we know that the son's height is smaller than the mean height for sons ($\mu_2 = 68.7$). We can also see that the conditional variance (5.63) is smaller than the marginal variance for fathers ($2.74^2 = 7.51$).
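The conditional mean and variance in Example 6 can be reproduced with a few lines of Python, which also makes it easy to repeat the calculation for other observed heights.

```python
# Conditional distribution of X1 (father) given X2 = x2 (son), as in Proposition 3
mu1, mu2 = 67.7, 68.7
sigma1, sigma2 = 2.74, 2.81
rho = 0.50

x2 = 65.0
cond_mean = mu1 + (sigma1 / sigma2) * rho * (x2 - mu2)   # about 65.90
cond_var = (1 - rho**2) * sigma1**2                      # about 5.63

print(cond_mean, cond_var)
```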
This video discusses the bivariate normal distribution using Example 1. Apologies for the poor sound quality.
Suppose that the continuous random vector
$$\boldsymbol{X} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N\!\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & -\tfrac{1}{2} \\ -\tfrac{1}{2} & 1 \end{pmatrix}\right).$$
Identify the conditional distribution of $X_1$ given $X_2 = x_2$.
Show answer
We have
$$\mu_1 = 0, \quad \mu_2 = 0, \quad \sigma_1 = 1, \quad \sigma_2 = 1, \quad \rho_{12} = -\tfrac{1}{2}.$$
We can then substitute these values into the formulae from Proposition 3 to get
$$\begin{aligned} E(X_1 \mid X_2 = x_2) &= \mu_1 + \frac{\sigma_1}{\sigma_2}\rho_{12}(x_2 - \mu_2) \\ &= 0 + 1 \cdot \left(-\tfrac{1}{2}\right) \cdot (x_2 - 0) \\ &= -\frac{x_2}{2} \end{aligned}$$
$$\begin{aligned} \operatorname{Var}(X_1 \mid X_2 = x_2) &= (1 - \rho_{12}^2)\sigma_1^2 \\ &= \left(1 - \left(\tfrac{1}{2}\right)^2\right) \cdot 1 \\ &= \tfrac{3}{4} \end{aligned}$$
Here is a video worked solution for all of Task 1.
The results above for the marginal and conditional distributions are for the bivariate case. These results can be generalised to the multivariate normal as shown below.
Marginal distributions
Let the random vector $\boldsymbol{X}$ be split into two blocks, $\boldsymbol{X}_1$ and $\boldsymbol{X}_2$, and suppose
$$\boldsymbol{X} = \begin{pmatrix} \boldsymbol{X}_1 \\ \boldsymbol{X}_2 \end{pmatrix} \sim N\!\left(\begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{12}^\top & \boldsymbol{\Sigma}_{22} \end{pmatrix}\right).$$
Then
$$\boldsymbol{X}_1 \sim N(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11}),$$
$$\boldsymbol{X}_2 \sim N(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_{22}).$$
Conditional distributions
Let the random vector $\boldsymbol{X}$ be split into two blocks, $\boldsymbol{X}_1$ and $\boldsymbol{X}_2$, and suppose
$$\boldsymbol{X} = \begin{pmatrix} \boldsymbol{X}_1 \\ \boldsymbol{X}_2 \end{pmatrix} \sim N\!\left(\begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{12}^\top & \boldsymbol{\Sigma}_{22} \end{pmatrix}\right).$$
Then
$$\boldsymbol{X}_1 \mid \boldsymbol{X}_2 = \boldsymbol{x}_2 \sim N\!\left(\boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\boldsymbol{x}_2 - \boldsymbol{\mu}_2),\ \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{12}^\top\right)$$
and
$$\boldsymbol{X}_2 \mid \boldsymbol{X}_1 = \boldsymbol{x}_1 \sim N\!\left(\boldsymbol{\mu}_2 + \boldsymbol{\Sigma}_{12}^\top\boldsymbol{\Sigma}_{11}^{-1}(\boldsymbol{x}_1 - \boldsymbol{\mu}_1),\ \boldsymbol{\Sigma}_{22} - \boldsymbol{\Sigma}_{12}^\top\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}\right).$$
Notice that
the conditional mean is linear in $\boldsymbol{x}$: it passes through the mean $(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2)$ and has a steeper slope the stronger the correlation.
the conditional variance is smaller than the marginal variance, and gets smaller as the correlation gets stronger.
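These block formulae are straightforward to evaluate numerically. The sketch below applies them to an illustrative three-dimensional normal (the mean vector and covariance matrix are made up for the example), conditioning the first two components on the third.

```python
import numpy as np

# Illustrative 3-dimensional normal, split as X1 = (first two components), X2 = (third)
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.6, 0.5],
                  [0.6, 1.5, 0.4],
                  [0.5, 0.4, 1.0]])

idx1, idx2 = [0, 1], [2]
mu1, mu2 = mu[idx1], mu[idx2]
S11 = Sigma[np.ix_(idx1, idx1)]
S12 = Sigma[np.ix_(idx1, idx2)]
S22 = Sigma[np.ix_(idx2, idx2)]

x2 = np.array([3.0])   # observed value of the conditioning block

# Conditional mean and covariance of X1 | X2 = x2 (the formulae above)
S22_inv = np.linalg.inv(S22)
cond_mean = mu1 + S12 @ S22_inv @ (x2 - mu2)
cond_cov = S11 - S12 @ S22_inv @ S12.T

print(cond_mean)
print(cond_cov)
```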
Independence
We have seen that uncorrelated random variables are not necessarily independent: their relationship might be entirely non-linear.
The multivariate normal distribution is an exception to this. For the multivariate normal distribution absence of correlation and independence are one and the same thing. The reason for this is that the multivariate normal distribution only allows for linear dependency between its components, as we have seen when we have looked at the conditional distributions.
Large sample theory
In probability, we study limits to understand the long-term behaviour of random processes and sequences of random variables. In general, a limit tells us the value that a function approaches as that function's inputs get closer and closer to some number (often infinity). This may not, on the face of it, seem particularly useful. However, studying limits can often lead to simplified formulas for otherwise unsolvable probability models, which can lead to insights into complex problems.
In Week 1, we discussed the concept of relative frequency when interpreting a probability, which is an intuitive way of interpreting a probability as simply the frequency with which that outcome occurs in the long run, when the experiment is repeated a large number of times. This idea is illustrated in the example below.
Example 7
Real-world example
John Kerrich's famous experiment
Whilst visiting relatives in Copenhagen in 1940, John Kerrich, a British mathematician, was caught up in the Nazi invasion and interned in a prisoner of war camp. During his time in the camp, Kerrich conducted an experiment tossing a coin 10,000 times and recording the number of heads obtained. The following graph shows the proportion of heads for 0 - 2000 tosses using the data recorded by Kerrich.
[Figure: proportion of heads against the number of coin tosses (0 to 2000) in Kerrich's data.]
The figure shows wide fluctuations in the proportion of heads at the beginning of the experiment which eventually settle down close to the proportion we would expect of 0.5.
This example illustrates the Law of Large Numbers, which justifies the use of simulation to approximate the probability $P(A)$ of an event $A$ occurring. A consequence of the Law of Large Numbers is that in repeated trials of a random experiment the proportion of trials in which $A$ occurs converges to $P(A)$.
Let $X_1, X_2, \ldots$ be an independent and identically distributed sequence of random variables with finite expectation $\mu$. For $n = 1, 2, \ldots$, let
$$S_n = X_1 + \cdots + X_n.$$
Then the Law of Large Numbers says that the average, or cumulative mean, $\bar{X}_n = S_n / n$, converges to $\mu$ as $n \to \infty$.
Example 8
In the Kerrich experiment, $A$ is the event
$$A = \{\text{the coin toss results in a head}\},$$
and using what we have learned so far we can identify this experiment as a series of Bernoulli trials in which the probability of tossing a head is equal to $\tfrac{1}{2}$.
Let
$$X_k = \begin{cases} 1, & \text{if } A \text{ occurs on the } k\text{th toss}, \\ 0, & \text{otherwise.} \end{cases}$$
Then $X_k \sim \operatorname{Ber}\!\left(\tfrac{1}{2}\right)$ and therefore $\mu = E(X_k) = \tfrac{1}{2}$.
From the figure we can see that as the number of trials increases, the average $\bar{X}_n = S_n/n = (X_1 + \cdots + X_n)/n$, which is just the proportion of trials in which $A$ occurs, tends towards $P(A) = \mu = \tfrac{1}{2}$.
That is, the proportion of $n$ trials in which $A$ (heads) occurs converges to $P(A)$ as $n \to \infty$.
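A simulation along the lines of Kerrich's experiment takes only a few lines of Python: toss a fair coin many times and track the running proportion of heads.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 2000
tosses = rng.integers(0, 2, size=n)                    # 1 = head, 0 = tail, fair coin
running_prop = np.cumsum(tosses) / np.arange(1, n + 1)

# Proportions after 10, 100, 1000 and 2000 tosses: they fluctuate early on
# and settle close to 0.5, as in the figure above.
print(running_prop[[9, 99, 999, 1999]])
```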
To discuss the Law of Large Numbers more formally, let's define what convergence in probability means.
Definition 2
Convergence in probability
Let $X_1, X_2, \ldots$ be a sequence of random variables defined on a sample space $S$. The sequence $X_n$ is said to converge in probability to a constant $\mu$ if, for every $\epsilon > 0$,
$$P(|X_n - \mu| < \epsilon) \to 1 \quad \text{as } n \to \infty.$$
For most of the results we will study in this chapter, the key quantity of interest will be the cumulative mean, i.e. we study the sample mean
$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$$
as the sample size $n$ increases to infinity, i.e. $n \to \infty$.
The figures below illustrate this definition. They show the distribution of $\bar{X}_n$ for $n = 1$, $n = 10$ and $n = 100$. We are considering the probability that $\bar{X}_n$ is within a "tube" of width $2\epsilon$ around $\mu$. We can see that as we increase $n$, the probability that $\bar{X}_n$ lies in the interval $(\mu - \epsilon, \mu + \epsilon)$ becomes higher. If we were to keep increasing $n$ then this probability would become 1, as mandated by the definition. This can be seen by noting that in the final figure, the whole distribution is contained within an $\epsilon$ distance of the actual mean $\mu$.
In Example 7, we were interested in the average $\bar{X}_n = S_n/n = (X_1 + \cdots + X_n)/n$, which was just the proportion of trials in which the tossed coin resulted in heads, as the number of tosses increased.
Let's now state the Weak Law of Large Numbers, which shows that the sample mean of an independent sample of size $n$, drawn from any arbitrary distribution (as long as this distribution does not have too heavy tails), is increasingly concentrated around its mean as $n$ grows.
Theorem 1
The weak law of large numbers
Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random variables, each with finite expected value $\mu$. For $n = 1, 2, \ldots$, let
$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
Then for any $\epsilon > 0$,
$$P(|\bar{X}_n - \mu| < \epsilon) \to 1 \quad \text{as } n \to \infty.$$
In other words, the probability that the absolute value of the difference between the sample mean $\bar{X}_n$ and the expected value $\mu$ is less than some very small number $\epsilon$ tends towards 1 as $n$ goes to infinity.
The proof of this theorem is beyond the scope of this course but is provided as supplementary material.
Supplement 3
A proof of the weak law
One proof uses a result called Chebyshev's inequality (although this proof only works when the variance also exists).
A key part of the proof of the Weak Law of Large Numbers is the so-called Chebyshev inequality.
Let $X$ be a random variable with finite expected value $\mu$.
If $c$ is a real constant such that $E[(X - c)^2]$ is finite, then for any value $\epsilon > 0$,
$$P(|X - c| < \epsilon) \ge 1 - \frac{1}{\epsilon^2}E[(X - c)^2].$$
In particular, if $X$ has finite variance, $\sigma^2 = E[(X - \mu)^2]$, then for any value $\epsilon > 0$,
$$P(|X - \mu| < \epsilon) \ge 1 - \frac{\sigma^2}{\epsilon^2}.$$
This result is easily proved when $X$ is a continuous random variable with probability density function $f_X(x)$. In this case,
$$\begin{aligned} E[(X - c)^2] &= \int_{-\infty}^{\infty} (x - c)^2 f_X(x)\,dx \\ &\ge \int_{-\infty}^{c-\epsilon} (x - c)^2 f_X(x)\,dx + \int_{c+\epsilon}^{\infty} (x - c)^2 f_X(x)\,dx \\ &\ge \int_{-\infty}^{c-\epsilon} \epsilon^2 f_X(x)\,dx + \int_{c+\epsilon}^{\infty} \epsilon^2 f_X(x)\,dx \\ &= \epsilon^2\left[\int_{-\infty}^{c-\epsilon} f_X(x)\,dx + \int_{c+\epsilon}^{\infty} f_X(x)\,dx\right] \\ &= \epsilon^2\left[1 - P(|X - c| < \epsilon)\right] \\ \Longrightarrow \quad P(|X - c| < \epsilon) &\ge 1 - \frac{1}{\epsilon^2}E[(X - c)^2]. \end{aligned}$$
The first line is a definition. The second line removes a non-negative contribution to the integral from $c - \epsilon$ to $c + \epsilon$. The third line replaces $(x - c)^2$ by its smallest value, $\epsilon^2$, on each remaining region of integration. The fourth line tidies up. The fifth line identifies the integrals with the probability of a particular event and the last line rearranges.
The second part of the theorem follows immediately from the first by putting $c = \mu$.
We can now prove the Weak Law of Large Numbers. We will prove this result for the simple case where the random variables have finite variance sigma squared; however this assumption is not required for the weak law to hold.
For all $n \ge 1$, $\bar{X}_n$ has expected value $\mu$ and variance $\sigma^2/n$. By Chebyshev's inequality, for any $\epsilon > 0$,
$$P(|\bar{X}_n - \mu| < \epsilon) \ge 1 - \frac{\sigma^2/n}{\epsilon^2} \to 1 \quad \text{as } n \to \infty.$$
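As a numerical illustration of this bound, the sketch below estimates $P(|\bar{X}_n - \mu| < \epsilon)$ by simulation for uniform random variables and compares it with the Chebyshev lower bound $1 - \sigma^2/(n\epsilon^2)$; the choice of distribution, $\epsilon$ and sample sizes is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)

# X_i ~ Uniform(0, 1), so mu = 0.5 and sigma^2 = 1/12
mu, var = 0.5, 1 / 12
eps = 0.05

for n in (10, 100, 1000):
    xbar = rng.uniform(0, 1, size=(5000, n)).mean(axis=1)
    est = np.mean(np.abs(xbar - mu) < eps)      # simulated P(|Xbar_n - mu| < eps)
    bound = max(0.0, 1 - var / (n * eps**2))    # Chebyshev lower bound
    print(n, est, bound)                        # both tend to 1 as n grows
```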
Informally, the laws of large numbers (there is also a Strong Law, which involves another form of stochastic convergence and which we will say no more about) tell us that the probability distribution of $\bar{X}_n$ becomes more and more concentrated at its expected value $\mu$ as $n \to \infty$. Although interesting and important, this does not help us to calculate probabilities of interest associated with $\bar{X}_n$ since it does not tell us how close $\bar{X}_n$ is to $\mu$ for a given value of $n$. The central limit theorem (CLT) provides a means of doing this, at least approximately.
The central limit theorem
Theorem 2
Let $X_1, X_2, \ldots, X_n$ be a sequence of independent and identically distributed random variables, each with a finite mean $\mu$ and a finite variance $\sigma^2$. Then
$$\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \longrightarrow N(0, 1) \quad \text{as } n \to \infty,$$
in the sense that the c.d.f. of the left-hand side tends to the c.d.f. of the standard normal distribution.
The central limit theorem is often used in one of the following two equivalent forms, which can be obtained by re-arranging the terms.
1. $\sum_{i=1}^{n} X_i$ approximately follows the $N(n\mu, n\sigma^2)$ distribution for 'sufficiently large' $n$.
2. $\bar{X}_n$ approximately follows the $N\!\left(\mu, \frac{\sigma^2}{n}\right)$ distribution for 'sufficiently large' $n$.
Let's focus on 2. for now. Whatever the value of $n$, the rules for the expected value and variance of a linear function tell us that $E(\bar{X}_n) = \mu$ and $\operatorname{Var}(\bar{X}_n) = \frac{\sigma^2}{n}$. What the central limit theorem tells us is that the shape of the distribution of $\bar{X}_n$ tends to the normal distribution.
The useful and counter-intuitive thing about the central limit theorem is that this happens no matter what the shape of the original distribution is (unless it has too heavy tails and no finite variance). For most distributions, a normal distribution is approached very quickly as n increases.
Task 4
Verify that $E(\bar{X}_n) = \mu$ and $\operatorname{Var}(\bar{X}_n) = \frac{\sigma^2}{n}$.
Show answer
Using that $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ and that the $X_i$ are independent,
$$\begin{aligned} E(\bar{X}_n) &= E\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n}\underbrace{\sum_{i=1}^{n}\underbrace{E(X_i)}_{=\mu}}_{=n\mu} = \mu, \\ \operatorname{Var}(\bar{X}_n) &= \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\underbrace{\sum_{i=1}^{n}\underbrace{\operatorname{Var}(X_i)}_{=\sigma^2}}_{=n\sigma^2} = \frac{\sigma^2}{n}. \end{aligned}$$
The simplest way to illustrate the central limit theorem is using a graphical example.
Example 10
Exponential to normal
Suppose the random variable $X \sim \operatorname{Exp}(1.5)$. The figure below shows a sample of 100 points from this distribution.
You can see very clearly from this plot that this is a highly skewed distribution and therefore non-normal.
A further simulation was carried out from the same model. This time, a sample of $n = 5$ simulated values was obtained and the sample mean calculated. This was done 1,000 times and the sample means are displayed in histogram (i) below. Though skewed, the distribution of sample means is a lot less skewed than that of the original data. The remaining histograms repeat the simulation for even larger sample sizes, (ii) $n = 10$, (iii) $n = 25$ and (iv) $n = 100$. Clearly, as $n$ increases, the distribution of the sample means becomes more symmetric and looks more and more like the bell-shaped curve of the normal distribution. It can also be seen that as $n$ increases the spread of the distribution decreases (note that the scale on the horizontal axes differs between these plots).
[Figure: histograms of 1,000 sample means from the Exp(1.5) distribution for (i) n = 5, (ii) n = 10, (iii) n = 25 and (iv) n = 100.]
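The simulation described in Example 10 can be reproduced with a short Python sketch: draw repeated samples from an Exp(1.5) distribution and look at how the sample means behave as $n$ grows (the number of repetitions matches the 1,000 used above).

```python
import numpy as np

rng = np.random.default_rng(5)

rate = 1.5        # X ~ Exp(1.5), so E(X) = 1/1.5 and Var(X) = 1/1.5**2
reps = 1000       # number of sample means per sample size

for n in (5, 10, 25, 100):
    samples = rng.exponential(scale=1 / rate, size=(reps, n))
    means = samples.mean(axis=1)
    # As n grows, the sample means concentrate around 1/1.5 ~= 0.667,
    # their spread shrinks, and their histogram looks increasingly normal.
    print(n, means.mean().round(3), means.std().round(3))
```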
Example 11
It is usual for even very large financial transactions, to the value of hundreds of thousands of pounds, to be settled to fractions of pence. Suppose, instead, that financial institutions agreed to round all settlements of transactions between them to the nearest whole £1. In one year, a certain institution makes 1500 transactions. What is the probability that this institution will lose more than £5 over the course of the year?
To answer this question, let's begin by defining $X_i$ to be the difference in cost (in £) between the computed cost of the $i$-th transaction and its true cost ($i = 1, 2, \ldots, 1500$). For each transaction the most that an institution can lose is 50 pence (£0.50) and the most that it can gain is also 50 pence (£0.50), since a transaction will either be rounded up or rounded down to the nearest pound. We can therefore assume that $X_i \sim \operatorname{Un}(-0.5, 0.5)$.
We are not interested in the amount lost for a single transaction, but rather the total amount lost in the year over all 1500 transactions. The difference between the total cost of the 1500 transactions and the computed cost is
$$S_{1500} = X_1 + \cdots + X_{1500}.$$
Although we know the distribution for each $X_i$, we don't know the distribution of $S_{1500}$. We can, however, approximate it using the central limit theorem.
Using the results from Week 6 we can calculate $\mu = E(X_i) = \frac{-0.5 + 0.5}{2} = 0$ and $\sigma^2 = \operatorname{Var}(X_i) = \frac{(0.5 - (-0.5))^2}{12} = \frac{1}{12}$.
Then, using the central limit theorem,
$$S_n = \sum_{i=1}^{n} X_i \overset{\text{approx}}{\sim} N(n\mu, n\sigma^2), \qquad \text{so} \qquad S_{1500} \overset{\text{approx}}{\sim} N\!\left(0, \frac{1500}{12}\right).$$
We can then follow the same process as in Week 6 to find that the probability that the institution loses more than £5 in total is
$$\begin{aligned} P(S_{1500} < -5) &= 1 - P(S_{1500} < 5) \quad \text{(by symmetry of the normal distribution about 0)} \\ &= 1 - P\!\left(\frac{S_{1500} - 0}{\sqrt{1500/12}} < \frac{5 - 0}{\sqrt{1500/12}}\right) \\ &= 1 - P(Z < 0.45) \\ &= 1 - \Phi(0.45) \\ &= 1 - 0.674 \\ &= 0.326 \end{aligned}$$
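The final probability can be checked with scipy: standardising is equivalent to evaluating a normal c.d.f. with mean 0 and standard deviation $\sqrt{1500/12}$.

```python
from scipy.stats import norm

# S_1500 is approximately N(0, 1500/12)
sd = (1500 / 12) ** 0.5
prob = norm.cdf(-5, loc=0, scale=sd)   # P(S_1500 < -5)

print(prob)  # about 0.327 (the 0.326 above comes from rounding z to 0.45)
```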
Task 5
The life-times of video projector light bulbs are known to follow an exponential distribution with a mean life-time of $\frac{1}{\lambda} = 90$ hours. The university uses projectors for 8500 hours per semester. What is the probability that 100 light bulbs will be sufficient for the semester?
Show answer
Let $X_i$ be the life-time of the $i$-th light bulb.
Then the total life-time of the 100 bulbs is $S_{100} = X_1 + \cdots + X_{100}$.
Since the life-time of a light bulb is exponentially distributed, the mean and standard deviation of an individual life-time are $\mu = \sigma = 90$.
Then, using the central limit theorem,
$$S_{100} \overset{\text{approx}}{\sim} N(100 \cdot 90,\ 100 \cdot 90^2).$$
We are looking for
$$\begin{aligned} P(S_{100} > 8500) &= 1 - P(S_{100} < 8500) \\ &= 1 - P\!\left(\frac{S_{100} - 9000}{\sqrt{90^2 \cdot 100}} < \frac{8500 - 9000}{\sqrt{90^2 \cdot 100}}\right) \\ &= 1 - P(Z < -0.56) \\ &= 1 - (1 - P(Z < 0.56)) \\ &= 0.7123. \end{aligned}$$
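The same calculation can be done directly with scipy (without rounding the z-value).

```python
from scipy.stats import norm

# S_100 is approximately N(9000, 900^2)
prob = norm.sf(8500, loc=9000, scale=900)   # P(S_100 > 8500)

print(prob)  # about 0.711 (the 0.7123 above uses z rounded to 0.56)
```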
Consider the Binomial distribution $X \sim \operatorname{Bi}(1000, 0.4)$ and suppose you wish to calculate $P(X > 661)$. The shortest way to calculate this directly is
$$P(X > 661) = P(X = 662) + P(X = 663) + \cdots + P(X = 1000),$$
which involves 339 separate calculations! The central limit theorem allows us to make an approximation using a normal distribution.
Let $X_1, X_2, \ldots, X_n$ be a sequence of independent and identical $\operatorname{Bern}(\theta)$ random variables. From Week 3 we know that
$$E(X_i) = \theta \quad \text{and} \quad \operatorname{Var}(X_i) = \theta(1 - \theta).$$
The sum of these variables is $X = \sum_{i=1}^{n} X_i \sim \operatorname{Bi}(n, \theta)$.
The central limit theorem tells us that $X = \sum_{i=1}^{n} X_i$ approximately follows a $N(n\theta, n\theta(1 - \theta))$ distribution, which in turn means that
$$\operatorname{Bi}(n, \theta) \approx N(n\theta, n\theta(1 - \theta)),$$
providing $n$ is large enough and $\theta$ is not too close to zero or one. Therefore, to calculate $P(X > 661)$ we simply approximate the binomial distribution with a normal and use the normal tables to calculate the probability.
Continuity correction
However in moving from a discrete binomial distribution to a continuous normal approximation we encounter the following problem.
Let $X \sim \operatorname{Bi}(100, 0.3)$, and consider calculating
$P(X < 50)$,
$P(X \le 50)$.
As the binomial is a discrete distribution these two probabilities are different. However, if we apply the central limit theorem and approximate $X \sim \operatorname{Bi}(100, 0.3)$ with $X \sim N(100 \cdot 0.3,\ 100 \cdot 0.3 \cdot 0.7) = N(30, 21)$ (its normal approximation), we have a problem. Using the normal approximation, $P(X < 50)$ and $P(X \le 50)$ give the same probability, because for a continuous distribution $P(X = 50) = 0$ (it's a single outcome). However, as $X$ has a discrete distribution, $P(X = 50) > 0$.
Therefore each time we approximate a discrete distribution with a continuous one we make the following continuity correction. The correction works by adding or subtracting 0.5 to the outcome as follows:
$P(X > x)$ is replaced with $P(X > x + 0.5)$.
$P(X \ge x)$ is replaced with $P(X \ge x - 0.5)$.
$P(X < x)$ is replaced with $P(X < x - 0.5)$.
$P(X \le x)$ is replaced with $P(X \le x + 0.5)$.
Essentially, if the probability to be calculated involves a strict inequality ($<$ or $>$), we shift $x$ by 0.5 in the direction that makes the calculated probability smaller than it would otherwise have been. In contrast, if it involves a non-strict inequality ($\le$ or $\ge$), we shift $x$ by 0.5 in the direction that makes the calculated probability larger than it would otherwise have been.
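As a computational illustration (my own sketch, not part of the notes), the helper below applies these four rules and then evaluates the normal approximation with `scipy`; the function name `binom_normal_approx` is just a convenient label.

```python
# Continuity-corrected normal approximation for a Bi(n, theta) probability.
from scipy import stats
import numpy as np

def binom_normal_approx(n, theta, x, op):
    """Approximate P(X op x) for X ~ Bi(n, theta) using N(n*theta, n*theta*(1 - theta))."""
    mu = n * theta
    sd = np.sqrt(n * theta * (1 - theta))
    if op == ">":
        return 1 - stats.norm.cdf(x + 0.5, mu, sd)
    if op == ">=":
        return 1 - stats.norm.cdf(x - 0.5, mu, sd)
    if op == "<":
        return stats.norm.cdf(x - 0.5, mu, sd)
    if op == "<=":
        return stats.norm.cdf(x + 0.5, mu, sd)
    raise ValueError("op must be one of >, >=, <, <=")

# P(X < 50) for X ~ Bi(100, 0.3), alongside the exact value P(X <= 49)
print(binom_normal_approx(100, 0.3, 50, "<"), stats.binom.cdf(49, 100, 0.3))
```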
Example 12
Let $X \sim \mathrm{Bi}(10000, 0.005)$. Using the normal distribution, calculate $P(X < 70)$.
Answer:
As $X \sim \mathrm{Bi}(10000, 0.005)$, then $X \overset{\text{approx}}{\sim} N(50, 49.75)$, so that $\sigma = \sqrt{49.75}$. Therefore
$$\begin{aligned}
P(X < 70) &= P\left(\frac{X - 50}{\sqrt{49.75}} < \frac{69.5 - 50}{\sqrt{49.75}}\right) \\
&= P(Z < 2.76) \quad \text{where } Z \sim N(0, 1) \\
&= \Phi(2.76) = 0.9971.
\end{aligned}$$
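If you would like to check this numerically, a minimal verification sketch (assuming `scipy` is available) is:

```python
# Example 12 check: exact binomial probability versus the continuity-corrected approximation.
from scipy import stats
import numpy as np

n, theta = 10000, 0.005
exact = stats.binom.cdf(69, n, theta)                        # P(X < 70) = P(X <= 69)
approx = stats.norm.cdf(69.5, loc=50, scale=np.sqrt(49.75))  # continuity-corrected normal value
print(exact, approx)
```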
Task 6
Let $X \sim \mathrm{Bi}(100, 0.4)$. Using the normal distribution calculate
(a) $P(X < 51)$,
(b) $P(X \ge 33)$,
(c) $P(35 \le X \le 41)$,
(d) $P(X = 38)$.
Show answer
As $X \sim \mathrm{Bi}(100, 0.4)$, then $X \overset{\text{approx}}{\sim} N(40, 24)$, so that $\sigma = \sqrt{24}$.
(a)
$$\begin{aligned}
P(X < 51) &= P\left(\frac{X - 40}{\sqrt{24}} < \frac{50.5 - 40}{\sqrt{24}}\right) \\
&= P(Z < 2.14) \quad \text{where } Z \sim N(0, 1) \\
&= \Phi(2.14) = 0.9838.
\end{aligned}$$
(b)
$$\begin{aligned}
P(X \ge 33) &= P\left(\frac{X - 40}{\sqrt{24}} \ge \frac{32.5 - 40}{\sqrt{24}}\right) \\
&= P(Z \ge -1.53) \\
&= P(Z \le 1.53) = 0.937.
\end{aligned}$$
(c)
$$\begin{aligned}
P(35 \le X \le 41) &= P\left(\frac{34.5 - 40}{\sqrt{24}} \le \frac{X - 40}{\sqrt{24}} \le \frac{41.5 - 40}{\sqrt{24}}\right) \\
&= P(-1.12 \le Z \le 0.31) \\
&= \Phi(0.31) - \Phi(-1.12) = 0.6217 - 0.1314 = 0.4903.
\end{aligned}$$
(d)
$$\begin{aligned}
P(X = 38) &= P\left(\frac{37.5 - 40}{\sqrt{24}} \le \frac{X - 40}{\sqrt{24}} \le \frac{38.5 - 40}{\sqrt{24}}\right) \\
&= P(-0.51 \le Z \le -0.31) \\
&= \Phi(-0.31) - \Phi(-0.51) = 0.3783 - 0.3050 = 0.0733.
\end{aligned}$$
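The answers above can be checked against the exact binomial probabilities; for instance, a minimal sketch for part (d), assuming `scipy` is available:

```python
# Task 6(d) check: exact P(X = 38) versus the continuity-corrected normal approximation.
from scipy import stats
import numpy as np

mu, sd = 40, np.sqrt(24)
exact = stats.binom.pmf(38, 100, 0.4)                                 # P(X = 38) exactly
approx = stats.norm.cdf(38.5, mu, sd) - stats.norm.cdf(37.5, mu, sd)  # P(37.5 <= Y <= 38.5)
print(exact, approx)
```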
Let $X_1, X_2, \dots, X_n$ be a sequence of independent and identical $\mathrm{Pois}(1)$ random variables. From Week 3 we know that
$$E(X_i) = 1 \quad \text{and} \quad \mathrm{Var}(X_i) = 1.$$
An important property of the Poisson distribution is that the sum of independent Poisson random variables has a Poisson distribution:
If $X \sim \mathrm{Pois}(\lambda)$ and $Y \sim \mathrm{Pois}(\mu)$ then $X + Y \sim \mathrm{Pois}(\lambda + \mu)$.
So if $X \sim \mathrm{Pois}(1)$ and $Y \sim \mathrm{Pois}(1)$ then $X + Y \sim \mathrm{Pois}(2)$.
Therefore $X = \sum_{i=1}^{n} X_i \sim \mathrm{Pois}(n)$.
Now the central limit theorem tells us that $X = \sum_{i=1}^{n} X_i$ approximately follows a $N(n, n)$ distribution, which in turn means
$$\mathrm{Pois}(n) \approx N(n, n),$$
providing $n$ is large enough. As with the binomial approximation we have to apply a continuity correction, as we are moving from a discrete to a continuous distribution.
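A short sketch (my own illustration, assuming `scipy` is available) of how the $\mathrm{Pois}(n)$ and $N(n, n)$ probabilities compare as $n$ grows:

```python
# Compare a Poisson tail probability with its continuity-corrected normal approximation.
from scipy import stats
import numpy as np

for n in (5, 50, 500):
    x = n + int(np.sqrt(n))                                    # a point about one s.d. above the mean
    exact = stats.poisson.cdf(x, n)                            # P(X <= x) for X ~ Pois(n)
    approx = stats.norm.cdf(x + 0.5, loc=n, scale=np.sqrt(n))  # continuity-corrected N(n, n) value
    print(n, exact, approx)
```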
Example 13
Let $X \sim \mathrm{Pois}(50)$ and calculate
(a) $P(X < 60)$,
(b) $P(50 \le X < 60)$.
Answer:
As $X \sim \mathrm{Pois}(50)$, then $X \overset{\text{approx}}{\sim} N(50, 50)$, so that $\sigma = \sqrt{50}$.
(a)
$$\begin{aligned}
P(X < 60) &= P\left(\frac{X - 50}{\sqrt{50}} < \frac{59.5 - 50}{\sqrt{50}}\right) \\
&= P(Z < 1.34) = 0.9099.
\end{aligned}$$
(b)
$$\begin{aligned}
P(50 \le X < 60) &= P\left(\frac{49.5 - 50}{\sqrt{50}} \le \frac{X - 50}{\sqrt{50}} < \frac{59.5 - 50}{\sqrt{50}}\right) \\
&= P(-0.07 \le Z < 1.34) \\
&= \Phi(1.34) - \Phi(-0.07) = 0.9099 - (1 - 0.5279) = 0.4378.
\end{aligned}$$
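A minimal verification sketch for part (b), assuming `scipy` is available:

```python
# Example 13(b) check: exact Poisson probability versus the normal approximation.
from scipy import stats
import numpy as np

exact = stats.poisson.cdf(59, 50) - stats.poisson.cdf(49, 50)         # P(50 <= X < 60) exactly
sd = np.sqrt(50)
approx = stats.norm.cdf(59.5, 50, sd) - stats.norm.cdf(49.5, 50, sd)  # continuity-corrected value
print(exact, approx)   # the approximation should be close to the 0.4378 found above
```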
Learning outcomes for week 8
By the end of week 8, you should be able to:
calculate linear functions of the multivariate normal distribution;
calculate marginal and conditional distributions of the multivariate normal (for the bivariate case);
state and use the central limit theorem;
calculate normal approximations to the binomial and Poisson distributions.
A summary of the most important concepts and written answers to all tasks are provided overleaf.
Week 8 summary
The multivariate normal distribution
Linear functions of MVN
Suppose $\boldsymbol{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then
$$\boldsymbol{A}\boldsymbol{X} + \boldsymbol{b} \sim N\!\left(\boldsymbol{A}\boldsymbol{\mu} + \boldsymbol{b},\, \boldsymbol{A}\boldsymbol{\Sigma}\boldsymbol{A}^\top\right).$$
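The rule can be illustrated numerically. The sketch below (my own, with made-up values of $\boldsymbol{A}$, $\boldsymbol{b}$, $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, and assuming `numpy` is available) computes the transformed parameters and compares them with a simulation.

```python
# Linear transformation of a multivariate normal: AX + b ~ N(A mu + b, A Sigma A^T).
import numpy as np

mu = np.array([1.0, 2.0])            # illustrative mean vector
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])       # illustrative covariance matrix
A = np.array([[1.0, 1.0],
              [2.0, -1.0]])
b = np.array([0.0, 3.0])

new_mean = A @ mu + b                # mean of AX + b
new_cov = A @ Sigma @ A.T            # covariance matrix of AX + b

# Empirical check by simulation
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mu, Sigma, size=100_000)
Y = X @ A.T + b                      # apply the linear transformation to each row
print(new_mean, Y.mean(axis=0))      # the two should be close
print(new_cov)
print(np.cov(Y, rowvar=False))       # should be close to new_cov
```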
Probability density function
Suppose that the random vector $\boldsymbol{X}$ can take any value in $\mathbb{R}^p$ and that $\boldsymbol{X}$ has the p.d.f.
$$f_{\boldsymbol{X}}(\boldsymbol{x}) = \frac{1}{(2\pi)^{p/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left(-\frac{(\boldsymbol{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu})}{2}\right)$$
for all $\boldsymbol{x} \in \mathbb{R}^p$; then $\boldsymbol{X}$ is said to have a multivariate normal distribution, with mean $E(\boldsymbol{X}) = \boldsymbol{\mu}$ and (co)variance matrix $\mathrm{Var}(\boldsymbol{X}) = \boldsymbol{\Sigma}$, written
$$\boldsymbol{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma}).$$
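As a quick check (my own sketch, with illustrative parameter values, assuming `scipy` is available), the density formula can be evaluated directly and compared with `scipy.stats.multivariate_normal`:

```python
# Evaluate the multivariate normal p.d.f. directly and via scipy (p = 2).
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])            # illustrative mean vector
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])       # illustrative covariance matrix
x = np.array([0.5, 0.5])

p = len(mu)
diff = x - mu
dens = np.exp(-diff @ np.linalg.inv(Sigma) @ diff / 2) / (
    (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma)))

print(dens, multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # the two values should agree
```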
Marginal distributions
Let $X_1$ and $X_2$ be bivariate normal random variables, and suppose
$$\boldsymbol{X} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N\!\left(\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 \\ \rho_{12}\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}\right).$$
Then
$$X_1 \sim N(\mu_1, \sigma_1^2),$$
$$X_2 \sim N(\mu_2, \sigma_2^2).$$
Conditional distributions
Let $X_1$ and $X_2$ be bivariate normal random variables, and suppose
$$\boldsymbol{X} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N\!\left(\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 \\ \rho_{12}\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}\right).$$
Then
$$X_1 \mid X_2 = x_2 \sim N\!\left(\mu_1 + \frac{\sigma_1}{\sigma_2}\rho_{12}(x_2 - \mu_2),\, (1 - \rho_{12}^2)\sigma_1^2\right)$$
and
$$X_2 \mid X_1 = x_1 \sim N\!\left(\mu_2 + \frac{\sigma_2}{\sigma_1}\rho_{12}(x_1 - \mu_1),\, (1 - \rho_{12}^2)\sigma_2^2\right).$$
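The conditional-distribution formula is easy to evaluate numerically; the sketch below (my own, with made-up parameter values, assuming `numpy` is available) computes the conditional mean and variance of $X_1$ given $X_2 = x_2$ and checks them by simulation.

```python
# Conditional mean and variance of X1 given X2 = x2 for a bivariate normal.
import numpy as np

mu1, mu2 = 0.0, 1.0
sigma1, sigma2 = 2.0, 1.5
rho = 0.6
x2 = 2.0

cond_mean = mu1 + (sigma1 / sigma2) * rho * (x2 - mu2)   # E(X1 | X2 = x2) = 0.8
cond_var = (1 - rho ** 2) * sigma1 ** 2                  # Var(X1 | X2 = x2) = 2.56

# Empirical check: keep simulated draws whose X2 value lies close to x2
rng = np.random.default_rng(3)
cov = np.array([[sigma1**2, rho * sigma1 * sigma2],
                [rho * sigma1 * sigma2, sigma2**2]])
sample = rng.multivariate_normal([mu1, mu2], cov, size=500_000)
near = sample[np.abs(sample[:, 1] - x2) < 0.05]
print(cond_mean, near[:, 0].mean())                      # roughly equal
print(cond_var, near[:, 0].var())                        # roughly equal
```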
Large sample theory
The weak law of large numbers
Let $X_1, X_2, \dots$ be a sequence of i.i.d. random variables, each with finite expected value $\mu$. For $n = 1, 2, \dots$, let
$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
Then for any $\epsilon > 0$,
$$P\!\left(\left|\bar{X}_n - \mu\right| < \epsilon\right) \to 1 \quad \text{as } n \to \infty.$$
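A simulation sketch of the weak law (my own illustration; the exponential distribution with mean 3 is an arbitrary choice, and `numpy` is assumed to be available):

```python
# Running sample mean of i.i.d. draws settles near the true mean as n grows.
import numpy as np

rng = np.random.default_rng(1)
mu = 3.0
x = rng.exponential(scale=mu, size=100_000)   # any distribution with a finite mean would do
for n in (10, 1_000, 100_000):
    print(n, x[:n].mean())                    # tends to settle close to mu = 3 as n grows
```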
The central limit theorem
Let $X_1, X_2, \dots, X_n$ be a sequence of independent and identically distributed random variables, each with a finite mean $\mu$ and a finite variance $\sigma^2$. Then for sufficiently large $n$ we have that
$$\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \to N(0, 1),$$
in the sense that the c.d.f. of the left-hand side tends to the c.d.f. of the standard normal distribution.
The central limit theorem is often used in one of the following two equivalent forms:
$\sum_{i=1}^{n} X_i$ approximately follows the $N(n\mu, n\sigma^2)$ distribution for 'sufficiently large' $n$.
$\bar{X}$ approximately follows the $N\!\left(\mu, \frac{\sigma^2}{n}\right)$ distribution for 'sufficiently large' $n$.
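A simulation sketch of the central limit theorem (my own illustration; uniform random variables and the sample sizes are arbitrary choices, and `numpy` is assumed to be available):

```python
# Standardised sample means of i.i.d. Uniform(0, 1) draws look standard normal.
import numpy as np

rng = np.random.default_rng(2)
n, reps = 50, 20_000
mu, sigma = 0.5, np.sqrt(1 / 12)          # mean and standard deviation of Uniform(0, 1)

xbar = rng.uniform(size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (xbar - mu) / sigma      # standardise as in the CLT statement
print(z.mean(), z.std())                  # should be close to 0 and 1
print(np.mean(z <= 1.96))                 # should be close to Phi(1.96) = 0.975
```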