We will build a simple linear regression model to predict sales based on TV marketing expenses. We will investigate two different approaches to this problem: the Scikit-Learn linear regression model, and constructing and optimizing the sum-of-squares cost function with gradient descent from scratch.
Load the required packages:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
Open the Dataset
We will build a linear regression model for a simple Kaggle dataset, saved in the file tvmarketing.csv. The dataset has only two fields: TV marketing expenses (TV) and sales amount (Sales).
adv = pd.read_csv("tvmarketing.csv")
X = adv['TV']
Y = adv['Sales']
adv.head()
TV Sales
0 230.1 22.1
1 44.5 10.4
2 17.2 9.3
3 151.5 18.5
4 180.8 12.9
pandas has a function to make plots directly from DataFrame fields. By default, matplotlib is used as the backend.
adv.plot(x='TV', y='Sales', kind='scatter', c='black')
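Since matplotlib is the backend, DataFrame.plot returns a matplotlib Axes object, which can be customized further. A minimal sketch, using a few sample rows in place of the full tvmarketing.csv (the values are those shown by adv.head() above):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import pandas as pd

# A few sample rows standing in for tvmarketing.csv
adv = pd.DataFrame({"TV": [230.1, 44.5, 17.2, 151.5, 180.8],
                    "Sales": [22.1, 10.4, 9.3, 18.5, 12.9]})

# DataFrame.plot returns a matplotlib Axes, so the figure can be customized
ax = adv.plot(x="TV", y="Sales", kind="scatter", c="black")
ax.set_title("Sales vs. TV marketing expenses")
print(ax.get_xlabel())  # axis labels default to the column names -> "TV"
```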
Scikit-Learn
Create an estimator object for a linear regression model:
lr_sklearn = LinearRegression()
The estimator can learn from the data by calling the fit function. However, the data first needs to be reshaped into a 2D array:
X_sklearn = X.values[:, np.newaxis]
Y_sklearn = Y.values[:, np.newaxis]
print(f"Shape of new X array: {X_sklearn.shape}")
print(f"Shape of new Y array: {Y_sklearn.shape}")
Shape of new X array: (200, 1)
Shape of new Y array: (200, 1)
Fit the linear regression model by passing the X_sklearn and Y_sklearn arrays into the function lr_sklearn.fit:
lr_sklearn.fit(X_sklearn, Y_sklearn)
m_sklearn = lr_sklearn.coef_[0][0] # slope
b_sklearn = lr_sklearn.intercept_[0] # intercept
print(f"Linear regression using Scikit-Learn. Slope: {m_sklearn}. Intercept: {b_sklearn}")
Linear regression using Scikit-Learn. Slope: 0.047536640433019764. Intercept: 7.032593549127693
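Because a simple linear regression has a closed-form least-squares solution, the fitted slope and intercept can be cross-checked against np.polyfit with degree 1, which solves the same problem. A sketch on synthetic data (the real CSV is not bundled here; the true slope 0.05 and intercept 7.0 are chosen to resemble the fitted values above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the TV/Sales columns
rng = np.random.default_rng(0)
X = rng.uniform(0, 300, size=200)
Y = 7.0 + 0.05 * X + rng.normal(0, 1.5, size=200)

# Fit with Scikit-Learn, as above
lr = LinearRegression().fit(X[:, np.newaxis], Y[:, np.newaxis])
m, b = lr.coef_[0][0], lr.intercept_[0]

# np.polyfit(deg=1) solves the same least-squares problem in closed form
m_np, b_np = np.polyfit(X, Y, deg=1)
print(np.isclose(m, m_np), np.isclose(b, b_np))  # both should agree
```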
Now, to make predictions it is convenient to use the Scikit-Learn function predict.
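Like fit, predict expects a 2D array with one row per observation, and returns one prediction per row. A minimal sketch with a tiny synthetic fit standing in for the model trained on tvmarketing.csv:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny synthetic fit standing in for the model trained above
X = np.array([[10.0], [100.0], [200.0]])
Y = np.array([[8.0], [12.0], [17.0]])
model = LinearRegression().fit(X, Y)

# predict takes a 2D array: one row per new TV-spend value
X_new = np.array([[50.0], [120.0], [280.0]])
Y_pred = model.predict(X_new)
print(Y_pred.shape)  # (3, 1): one predicted sales value per input row
```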