In this tutorial we will explore how the Central Limit Theorem applies when working with data from different distributions.
Let's get started!
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import norm
Gaussian population
Let's begin with the most straightforward scenario: a population that follows a Gaussian distribution. We will generate the data for this population using the np.random.normal function.
mu = 10
sigma = 5
gaussian_population = np.random.normal(mu, sigma, 100_000)
The population has a mean of 10 and a standard deviation of 5 (since these are the true parameters we used to generate the data) and a total of 100,000 observations. We can visualize its histogram by running the following code:
sns.histplot(gaussian_population, stat="density")
plt.show()
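As a quick, optional check (not part of the original walkthrough), we can overlay the theoretical Gaussian density on top of this histogram using the norm object imported earlier. The snippet below is only a sketch and reuses the mu, sigma and gaussian_population variables defined above.
# Optional sketch: compare the histogram against the theoretical Gaussian density
sns.histplot(gaussian_population, stat="density")
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 200)
plt.plot(x, norm.pdf(x, loc=mu, scale=sigma), color="red", label="theoretical pdf")
plt.legend()
plt.show()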
Sampling from the population
Since this tutorial uses simulated data, we could very easily use the whole population to draw conclusions. For instance, if we didn't know the values of 𝜇 and 𝜎, we could get very close estimates of the true values by computing the mean and standard deviation of the whole population:
gaussian_pop_mean = np.mean(gaussian_population)
gaussian_pop_std = np.std(gaussian_population)
print(f"Gaussian population has mean: {gaussian_pop_mean:.1f} and std: {gaussian_pop_std:.1f}")
Gaussian population has mean: 10.0 and std: 5.0
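A small detail worth keeping in mind: np.std computes the population standard deviation by default (ddof=0). When estimating the standard deviation from a sample rather than the full population, the usual convention is to pass ddof=1. The snippet below is just an illustrative sketch of that difference; the sample size of 30 is an arbitrary choice.
# Illustrative sketch: population vs. sample standard deviation in NumPy
small_sample = np.random.choice(gaussian_population, size=30)
print(np.std(small_sample))          # ddof=0: divides by n (population formula)
print(np.std(small_sample, ddof=1))  # ddof=1: divides by n - 1 (sample formula)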
However, in real life this will most certainly not be possible, and we will need to use samples that are nowhere near as big as the population to draw conclusions about the behavior of the data. After all, this is what statistics is all about.
Depending on the sampling technique, we could encounter different properties; this is where the Central Limit Theorem comes in handy. For many distributions (but not all), the following is true:
The sum or average of a large number of independent and identically distributed random variables tends to follow a normal distribution, regardless of the distribution of the individual variables themselves. This is important because the normal distribution is well-understood and allows for statistical inference and hypothesis testing.
With this in mind, we need a way of averaging samples drawn from our population. For this, the sample_means function is defined:
def sample_means(data, sample_size):
    # Save all the means in a list
    means = []
    # For a big number of samples
    # This value does not impact the theorem but how nicely the histograms will look (more samples = better looking)
    for _ in range(10_000):
        # Get a sample of the data WITH replacement
        sample = np.random.choice(data, size=sample_size)
        # Save the mean of the sample
        means.append(np.mean(sample))
    # Return the means within a numpy array
    return np.array(means)
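Before breaking the function down, here is a quick sketch of how it could be used with the Gaussian population from above (the sample size of 30 is an arbitrary choice for illustration). The Central Limit Theorem predicts that the resulting means should be roughly normal, centered at 𝜇, with a spread close to 𝜎 divided by the square root of the sample size.
# Illustrative sketch: sample means from the Gaussian population (sample_size=30 is arbitrary)
gaussian_sample_means = sample_means(gaussian_population, sample_size=30)
print(f"Mean of sample means: {np.mean(gaussian_sample_means):.2f} (population mean: {mu})")
print(f"Std of sample means: {np.std(gaussian_sample_means):.2f} (sigma / sqrt(30): {sigma / np.sqrt(30):.2f})")
sns.histplot(gaussian_sample_means, stat="density")
plt.show()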
Let's break down the function above: