In this notebook we will use the World Happiness Report dataset. The dataset consists of 2199 rows, where each row contains various happiness-related metrics for a certain country in a given year.

Linear regression is a statistical model used to estimate a linear relationship between two or more variables. In simple linear regression, you have one independent (explanatory) variable and one dependent (response) variable; in multiple linear regression, you have more than one explanatory variable.
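In symbols, the two cases can be written as follows. The notation is the conventional one rather than anything defined in this notebook: $y$ is the response, $x$ (or $x_1, \dots, x_p$) are the explanatory variables, $\beta_0$ is the intercept, the remaining $\beta$ values are the slopes, and $\varepsilon$ is the error term.

Simple linear regression:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

Multiple linear regression with $p$ explanatory variables:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon$$

Fitting the model means estimating the $\beta$ values from the data.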

In this notebook, we will create a linear regression model and fit it to one or more explanatory variables to predict the response. We will use scikit-learn, an open-source, commercially usable machine learning toolkit. It contains implementations of many machine learning and statistical algorithms that data scientists and machine learning practitioners commonly encounter.
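As a rough preview of what fitting looks like in scikit-learn, here is a minimal sketch on synthetic data. The data, the variable names, and the chosen coefficients (intercept 1, slopes 2 and 3) are made up for illustration and have nothing to do with the happiness dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (an assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))        # two explanatory variables
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]     # response built from known coefficients

# Simple linear regression: use only the first explanatory variable
simple_model = LinearRegression().fit(X[:, [0]], y)

# Multiple linear regression: use both explanatory variables
multiple_model = LinearRegression().fit(X, y)

# Because the data has no noise, the multiple model should recover
# an intercept close to 1 and slopes close to 2 and 3
print(multiple_model.intercept_, multiple_model.coef_)
```

Note that `fit` expects the explanatory variables as a 2D array, which is why the single-variable case selects the column with `X[:, [0]]` rather than `X[:, 0]`.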

1. Import the Libraries

As usual, first import all the necessary libraries that you will use in the notebook.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# Import functions to create interactive widgets
import ipywidgets as widgets
from ipywidgets import interact_manual, fixed
# Import various functions from scikit-learn to help with the model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics

2. Import and Process the Data

The first thing we need to do is load and clean up the data. This step renames the columns so that they contain no spaces and drops any rows with missing values.
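To see what these two cleaning operations do before running them on the real file, here is a small sketch on a toy DataFrame. The column names and values are hypothetical and only mimic the style of the dataset.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (hypothetical column names and values)
toy = pd.DataFrame({
    "Country Name": ["A", "B", "C"],
    "Life Ladder": [7.1, np.nan, 5.4],
    "Log GDP Per Capita": [10.2, 9.8, 8.9],
})

# Replace spaces with underscores and lowercase every column name
toy = toy.rename(columns={c: "_".join(c.split(" ")).lower() for c in toy.columns})

# Drop every row that has at least one missing value (row "B" here)
toy = toy.dropna()

print(list(toy.columns))  # ['country_name', 'life_ladder', 'log_gdp_per_capita']
print(len(toy))           # 2
```

`dropna` keeps the original index labels of the surviving rows, so the toy frame is left with labels 0 and 2. If consecutive labels matter later, `reset_index(drop=True)` is one way to renumber them.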


# Load the dataset
df = pd.read_csv('world_happiness.csv')

# Rename the columns so they don't contain spaces, and lowercase them
df = df.rename(columns={i: "_".join(i.split(" ")).lower() for i in df.columns})

# Drop all of the rows which contain empty values. These will not be good for fitting.
df = df.dropna()

# Show the dataframe
df.head()

[Output: the first five rows of the cleaned DataFrame]

Have a closer look at the output of the cell above. The dataset consists of the following columns: