In this tutorial we will explore how the Central Limit Theorem applies when working with data from different distributions.
Let's get started!
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import norm
Gaussian population
Let's begin with the most straightforward scenario: a population that follows a Gaussian distribution. We will generate the data for this population using the np.random.normal function.
mu = 10
sigma = 5
gaussian_population = np.random.normal(mu, sigma, 100_000)
The population has a mean of 10 and a standard deviation of 5 (since these are the true parameters we used to generate the data) and a total of 100,000 observations. We can visualize its histogram by running the following code:
sns.histplot(gaussian_population, stat="density")
plt.show()
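As a quick, optional check (not part of the original walkthrough), we can overlay the theoretical Gaussian density on top of this histogram using the norm object imported earlier. The snippet below is only a sketch and reuses the mu, sigma and gaussian_population variables defined above.
# Optional sketch: compare the histogram against the theoretical Gaussian density
sns.histplot(gaussian_population, stat="density")
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 200)
plt.plot(x, norm.pdf(x, loc=mu, scale=sigma), color="red", label="theoretical pdf")
plt.legend()
plt.show()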
Sampling from the population
Since this tutorial uses simulated data, we could very easily use the whole population to draw conclusions. For instance, if we didn't know the values of 𝜇 and 𝜎, we could get very close estimates of the true values by computing the mean and standard deviation of the whole population:
gaussian_pop_mean = np.mean(gaussian_population)
gaussian_pop_std = np.std(gaussian_population)
print(f"Gaussian population has mean: {gaussian_pop_mean:.1f} and std: {gaussian_pop_std:.1f}")
Gaussian population has mean: 10.0 and std: 5.0
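A small detail worth keeping in mind: np.std computes the population standard deviation by default (ddof=0). When estimating the standard deviation from a sample rather than the full population, the usual convention is to pass ddof=1. The snippet below is just an illustrative sketch of that difference; the sample size of 30 is an arbitrary choice.
# Illustrative sketch: population vs. sample standard deviation in NumPy
small_sample = np.random.choice(gaussian_population, size=30)
print(np.std(small_sample))          # ddof=0: divides by n (population formula)
print(np.std(small_sample, ddof=1))  # ddof=1: divides by n - 1 (sample formula)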
However, in real life this will most certainly not be possible, and we will need to use samples that are nowhere near as big as the population to draw conclusions about the behavior of the data. After all, this is what statistics is all about.
Depending on the sampling technique, we could encounter different properties; this is where the Central Limit Theorem comes in handy. For many distributions (but not all), the following is true:
The sum or average of a large number of independent and identically distributed random variables tends to follow a normal distribution, regardless of the distribution of the individual variables themselves. This is important because the normal distribution is well-understood and allows for statistical inference and hypothesis testing.
With this in mind, we need a way of averaging samples drawn from our population. For this, the sample_means function is defined:
def sample_means(data, sample_size):
    # Save all the means in a list
    means = []
    # For a big number of samples
    # This value does not impact the theorem but how nicely the histograms will look (more samples = better looking)
    for _ in range(10_000):
        # Get a sample of the data WITH replacement
        sample = np.random.choice(data, size=sample_size)
        # Save the mean of the sample
        means.append(np.mean(sample))
    # Return the means within a numpy array
    return np.array(means)
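Before breaking the function down, here is a quick sketch of how it could be used with the Gaussian population from above (the sample size of 30 is an arbitrary choice for illustration). The Central Limit Theorem predicts that the resulting means should be roughly normal, centered at 𝜇, with a spread close to 𝜎 divided by the square root of the sample size.
# Illustrative sketch: sample means from the Gaussian population (sample_size=30 is arbitrary)
gaussian_sample_means = sample_means(gaussian_population, sample_size=30)
print(f"Mean of sample means: {np.mean(gaussian_sample_means):.2f} (population mean: {mu})")
print(f"Std of sample means: {np.std(gaussian_sample_means):.2f} (sigma / sqrt(30): {sigma / np.sqrt(30):.2f})")
sns.histplot(gaussian_sample_means, stat="density")
plt.show()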
Let's break down the function above: