In this tutorial we will see that relying solely on common summary statistics such as the mean, variance (or standard deviation), and correlation may not fully describe a dataset. It is therefore advisable to supplement these measures with visualization techniques and/or additional statistical measures to gain a deeper understanding of the data.
We will be working with a well-known dataset: Anscombe's quartet. The dataset is artificially constructed to illustrate that summary statistics can fail to capture important information present in a dataset, and that relying on them alone can be misleading. If you're interested, you can read more about Anscombe's quartet on its Wikipedia page.
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import utils
This dataset was initially constructed by the statistician Francis Anscombe to demonstrate both the importance of graphing data when analyzing it, and the effect of outliers and other influential observations on statistical properties.
df_anscombe = pd.read_csv('df_anscombe.csv')
df_anscombe.head()
Let's determine the number of groups present in this dataset.
df_anscombe.group.nunique()
4
This dataset comprises four groups of data, each containing two components, x and y. To analyze the data, we can compute the mean and variance of each group, as well as the correlation between x and y within each group. To get a better grasp of the data it is very useful to look at summary statistics for each column. Pandas has a very useful method, .describe(), which returns a new dataframe with summary statistics for each column: the count, mean, standard deviation, minimum value, 25%, 50% (median), and 75% quartiles, and maximum value. To compute these statistics per group, we can first group the data by the group column using the DataFrame.groupby function.
df_anscombe.groupby('group').describe()
x ... y
count mean std min 25% ... min 25% 50% 75% max
group ...
1 11.0 9.0 3.316625 4.0 6.5 ... 4.26 6.315 7.58 8.57 10.84
2 11.0 9.0 3.316625 4.0 6.5 ... 3.10 6.695 8.14 8.95 9.26
3 11.0 9.0 3.316625 4.0 6.5 ... 5.39 6.250 7.11 7.98 12.74
4 11.0 9.0 3.316625 8.0 8.0 ... 5.25 6.170 7.04 8.19 12.50
The groups appear to be quite similar: the mean and standard deviation of x are identical across the four groups, and those of y agree to two decimal places.
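We can verify this claim directly by computing just the mean and (sample) variance per group with .agg, instead of the full .describe() table. The snippet below is a standalone sketch: the quartet values are hard-coded from Anscombe's published table so it runs without the df_anscombe.csv file used in this tutorial.

```python
import numpy as np
import pandas as pd

# Anscombe's quartet, hard-coded so this snippet runs standalone.
# Groups 1-3 share the same x values; group 4 differs.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 5.56, 7.91, 6.89, 12.50]),
}
df = pd.concat(
    pd.DataFrame({'group': g, 'x': x, 'y': y}) for g, (x, y) in quartet.items()
)

# Mean and sample variance of x and y for each group
stats = df.groupby('group').agg(['mean', 'var'])
print(stats.round(2))
```

Rounded to two decimal places, every group reports the same x mean (9.0), x variance (11.0), and y mean (7.50), with y variances within 0.01 of each other.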
Additionally, we can analyze the correlation between x and y within each group. To obtain the correlation matrix for each group, we can follow the same approach as before: first group the data by the group column using DataFrame.groupby, and then apply the .corr function.
df_anscombe.groupby('group').corr()
x y
group
1 x 1.000000 0.816421
y 0.816421 1.000000
2 x 1.000000 0.816237
y 0.816237 1.000000
3 x 1.000000 0.816287
y 0.816287 1.000000
4 x 1.000000 0.816521
y 0.816521 1.000000
As observed, the correlation between x and y is the same across the four groups up to three decimal places. Moreover, the high correlation coefficients suggest a strong linear relationship between x and y within each group.
Despite the similarities in the statistical measures for the groups, it is still necessary to visualize the data to get a better understanding of the differences, if any.
fig, axs = plt.subplots(2, 2, figsize=(16, 8))
fig.subplots_adjust(hspace=0.5, wspace=0.2)
fig.suptitle("Anscombe's quartet", fontsize=16)

# Plot each group in its own subplot
for i, ax in enumerate(axs.flatten(), 1):
    group_data = df_anscombe[df_anscombe['group'] == i]
    ax.scatter(group_data['x'], group_data['y'])
    ax.set_title(f'Group {i}')
    ax.set_xlim(0, 20)
    ax.set_ylim(0, 15)

# Show the plot
plt.show()