In this notebook we will use the World Happiness Report dataset. The dataset consists of 2199 rows, where each row contains various happiness-related metrics for a certain country in a given year.
Linear regression is a statistical model used to estimate a linear relationship between two or more variables. Simple linear regression has one independent (explanatory) variable and one dependent (response) variable, while multiple linear regression has more than one explanatory variable.
In this notebook, we will create a linear regression model and fit it to one or more explanatory variables to predict the response. We will use scikit-learn, an open-source, commercially usable machine learning toolkit. It contains implementations of many of the machine learning and statistical algorithms that a data scientist or machine learning practitioner is likely to encounter.
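To make the difference between simple and multiple linear regression concrete, here is a minimal sketch using scikit-learn's LinearRegression. The data is synthetic and generated only for illustration (it is not the happiness dataset), and the names X, y, simple_model, and multiple_model are assumptions of this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, assumed for illustration only:
# y depends linearly on two explanatory variables plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.1, size=200)

# Simple linear regression: fit on a single explanatory variable.
# X[:, [0]] keeps the 2D shape (n_samples, 1) that scikit-learn expects.
simple_model = LinearRegression().fit(X[:, [0]], y)

# Multiple linear regression: fit on both explanatory variables.
multiple_model = LinearRegression().fit(X, y)

print("simple:  ", simple_model.coef_, simple_model.intercept_)
print("multiple:", multiple_model.coef_, multiple_model.intercept_)
```

The simple model learns one coefficient and the multiple model learns one per explanatory variable. With both variables available, the fitted coefficients should land close to the values used to generate the data.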
As usual, first import all the necessary libraries that you will use in the notebook.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# Import functions to create interactive widgets
import ipywidgets as widgets
from ipywidgets import interact_manual, fixed
# Import various functions from scikit-learn to help with the model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
The first thing we need to do is clean up the data. This code renames the columns so that they contain no spaces and drops any rows with missing values.
# Load the dataset
df = pd.read_csv('world_happiness.csv')
# Rename the columns so they don't contain spaces
df = df.rename(columns={i: "_".join(i.split(" ")).lower() for i in df.columns})
# Drop all of the rows which contain empty values. These will not be good for fitting.
df = df.dropna()
# Show the dataframe
df.head()
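To see what the two cleaning steps do in isolation, here is a small sketch on a toy dataframe. The frame toy and its values are made up for illustration. Only the rename and dropna calls mirror the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with spaces in the column names and one missing value
toy = pd.DataFrame({
    "Country name": ["A", "B", "C"],
    "Life Ladder": [7.1, np.nan, 5.3],
    "Log GDP per capita": [10.2, 9.8, 8.7],
})

# Replace spaces with underscores and lowercase every column name
toy = toy.rename(columns={c: "_".join(c.split(" ")).lower() for c in toy.columns})

# Drop rows containing at least one missing value and count how many were removed
n_before = len(toy)
toy = toy.dropna()
n_dropped = n_before - len(toy)

print(list(toy.columns))  # ['country_name', 'life_ladder', 'log_gdp_per_capita']
print(n_dropped)          # 1
```

Counting the dropped rows in the same way on the real dataframe is a quick check of how much data dropna removes before fitting.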
Have a closer look at the output of the cell above. The dataset consists of the following columns:
country_name: Name of the country where the data was taken.
year: The year when the data was taken.
life_ladder: The average of the estimates of life quality on a scale of 1 to 10 as given by a survey. In the survey, people subjectively estimate the quality of their own life.
log_gdp_per_capita: Logarithm of gross domestic product (GDP) per capita in purchasing power parity (PPP).
social_support: National average of responses to the binary question: “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”
healthy_life_expectancy_at_birth: Healthy life expectancy at birth.
freedom_to_make_life_choices: National average of responses to the binary question: “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”
generosity: Derived from responses to the question “Have you donated money to a charity in the past month?” and GDP.
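Several of the columns above are national averages of responses to a binary question. As a rough sketch of what that means, assume each "yes" is coded as 1 and each "no" as 0 (this coding and the responses below are made up for illustration). The average is then the fraction of respondents who answered "yes", so it lies between 0 and 1.

```python
import numpy as np

# Hypothetical survey responses for one country and year: 1 = "yes", 0 = "no"
responses = np.array([1, 0, 1, 1, 0, 1, 1, 1])

# The national average is the share of "yes" answers
national_average = responses.mean()

print(national_average)  # 0.75
```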