We'll walk through building a linear regression model using a synthetic dataset. We'll generate our own data, normalize it, and then train a linear regression model using gradient descent with scikit-learn's SGDRegressor. Finally, we'll evaluate the model by making predictions and visualizing the results.


Introduction

Linear regression is one of the most widely used statistical techniques in predictive modeling. It models the relationship between independent variables (features) and a dependent variable (target) as a weighted sum of the features plus an intercept. In our example, we'll generate a synthetic dataset for predicting house prices based on several features. Then we'll train the model with gradient descent, an iterative optimization technique that repeatedly adjusts the model's parameters in the direction that reduces the prediction error.
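To give a rough feel for what gradient descent does before we hand the work to scikit-learn, here is a minimal sketch. It assumes a mean squared error loss and full-batch updates; the function name `gradient_descent_step`, the learning rate, and the toy data are illustrative choices, not part of the walkthrough's dataset. SGDRegressor differs in that it updates the parameters from individual samples rather than the full batch.

```python
import numpy as np

def gradient_descent_step(w, b, X, y, lr):
    """One full-batch gradient descent update for mean squared error."""
    error = X @ w + b - y                  # prediction error per sample
    grad_w = 2 * X.T @ error / len(y)      # gradient of MSE w.r.t. the weights
    grad_b = 2 * error.mean()              # gradient of MSE w.r.t. the intercept
    return w - lr * grad_w, b - lr * grad_b

# Toy data that follows y = 3x + 1 exactly (illustrative values only)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = 3 * X_toy[:, 0] + 1

w, b = np.zeros(1), 0.0
for _ in range(2000):
    w, b = gradient_descent_step(w, b, X_toy, y_toy, lr=0.05)

print(w, b)  # the weight should approach 3 and the intercept 1
```

Each step moves the parameters a small distance against the gradient, so the error shrinks a little on every iteration until the estimates settle near the values that generated the data.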


Creating the Synthetic Dataset

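For our experiment, we'll generate a dataset with 1000 samples. Each sample represents a house and has the following columns: House price in sqft (the target), No of bedrooms (1 to 6), No of floors (1 to 3), and Age of the house (0 to 100 years).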

We'll simulate the relationship between the target and the features with the following formula:

$$ \text{House price in sqft} = 200 + 100 \times (\text{No of bedrooms}) + 50 \times (\text{No of floors}) - 1.5 \times (\text{Age of the house}) + \epsilon $$

where $\epsilon$ represents Gaussian noise with mean 0 and standard deviation 25, as set in the code below.
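To make the formula concrete, the short sketch below evaluates it for a single house with the noise term set to zero. The helper name `expected_price` and the example house are hypothetical and used only for illustration.

```python
def expected_price(bedrooms, floors, age):
    """Noise-free value of the formula above (epsilon set to 0)."""
    return 200 + 100 * bedrooms + 50 * floors - 1.5 * age

# Hypothetical house: 3 bedrooms, 2 floors, 20 years old
print(expected_price(3, 2, 20))  # 200 + 300 + 100 - 30 = 570.0
```

Any generated price for such a house would differ from 570 only by the random noise added on top.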

Here’s how we create the dataset:

import numpy as np
import pandas as pd

# Set a random seed for reproducibility
np.random.seed(42)
n_samples = 1000

# Generate features
bedrooms = np.random.randint(1, 7, size=n_samples)   # 1 to 6 bedrooms
floors = np.random.randint(1, 4, size=n_samples)       # 1 to 3 floors
age = np.random.randint(0, 101, size=n_samples)        # Age between 0 and 100 years

# Define the true relationship parameters
coef_bedrooms = 100
coef_floors = 50
coef_age = -1.5
intercept = 200

# Create the target variable with some noise
noise = np.random.normal(0, 25, size=n_samples)
price = intercept + coef_bedrooms * bedrooms + coef_floors * floors + coef_age * age + noise

# Create a DataFrame to hold our data
data = pd.DataFrame({
    "House price in sqft": price,
    "No of bedrooms": bedrooms,
    "No of floors": floors,
    "Age of the house": age
})

print("First five rows of the synthetic dataset:")
print(data.head())

First five rows of the synthetic dataset:

   House price in sqft  No of bedrooms  No of floors  Age of the house
0           589.558940               4             2                59
1           750.062337               5             1                13
2           474.434078               3             2                74
3           674.218771               5             2                81
4           848.941759               5             3                11
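Before moving on, it can be reassuring to confirm that the generated data really encodes the intended relationship. The sketch below is one possible check, not part of the gradient descent workflow itself: it regenerates the same data and fits it with closed-form least squares via `np.linalg.lstsq`, a different technique from the SGDRegressor training used later. The variable names `A` and `estimates` are illustrative.

```python
import numpy as np

# Regenerate the dataset exactly as above so this block runs on its own
np.random.seed(42)
n_samples = 1000
bedrooms = np.random.randint(1, 7, size=n_samples)
floors = np.random.randint(1, 4, size=n_samples)
age = np.random.randint(0, 101, size=n_samples)
noise = np.random.normal(0, 25, size=n_samples)
price = 200 + 100 * bedrooms + 50 * floors - 1.5 * age + noise

# Design matrix with a leading column of ones for the intercept
A = np.column_stack([np.ones(n_samples), bedrooms, floors, age])
estimates, *_ = np.linalg.lstsq(A, price, rcond=None)

for name, value in zip(["intercept", "bedrooms", "floors", "age"], estimates):
    print(f"{name}: {value:.2f}")
```

The printed estimates should land close to 200, 100, 50, and -1.5, differing only because of the noise. They also give a reference point for judging the coefficients that gradient descent recovers.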

Data Normalization with StandardScaler