Regression models predict continuous numeric values; classification models predict discrete categories.
$$ y = w_1 x_1 + w_2 x_2 + b $$
Linear means that $y$ depends only on $x_1$ and $x_2$ themselves, and not on $x_1$ or $x_2$ raised to some power, or on any product of $x_1$ and $x_2$.
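To make the distinction concrete, here's a tiny sketch contrasting a linear function of two variables with two nonlinear ones; the weights (4, 5) and inputs (3, 2) are made up purely for illustration:

```python
x1, x2 = 3.0, 2.0

linear = 4 * x1 + 5 * x2 + 1             # only first powers of x1 and x2: linear
with_power = 4 * x1**2 + 5 * x2 + 1      # x1 squared: not linear
with_product = 4 * x1 * x2 + 1           # product of x1 and x2: not linear

print(linear, with_power, with_product)  # 23.0 47.0 25.0
```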
Regression means that given some independent variables ($x_1$ and $x_2$), we build a model or equation to predict the value of a dependent variable $y$.
Take a very simple, practical, and some would say utterly boring problem. Let's say $x_1$ represents the number of bedrooms in a house, $x_2$ represents the total square footage, and $y$ represents the price of the house. Let's assume that there exists a linear relationship between $x_1$, $x_2$, and $y$. Then, by learning the weights of the linear equation from some existing data about houses and their prices, we can build a very simple model with which to predict the price of a house given the number of bedrooms and the square footage.
Here I've generated synthetic data based on a simple linear relationship between the number of bedrooms ($x_1$), the total square footage ($x_2$), and the house price ($y$). The formula for our linear model is

$$ y = w_1 x_1 + w_2 x_2 + b $$

where $w_1$ and $w_2$ are the weights (coefficients) for $x_1$ and $x_2$, respectively, and $b$ is the intercept, with weights $w_1$ = 10,000 for bedrooms, $w_2$ = 150 for square footage, and a bias $b$ = 50,000.
This equation represents a linear relationship where $x_1$ and $x_2$ are the independent variables (number of bedrooms and square footage) and $y$ is the dependent variable (house price). Here's a sample of the synthetic data:
| Bedrooms (x1) | Sq Footage (x2) | Predicted Price (y) |
|---|---|---|
| 1 | 1277 | $251,550 |
| 4 | 2778 | $506,700 |
| 2 | 2828 | $494,200 |
| 1 | 3362 | $564,300 |
| 4 | 1705 | $345,750 |
| 4 | 3135 | $560,250 |
| 4 | 3222 | $573,300 |
| 4 | 2701 | $495,150 |
| 2 | 1537 | $300,550 |
| 4 | 3120 | $558,000 |
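A sample like the one above can be produced directly from the linear formula. Here's a minimal sketch of such a generator; the bedroom and square-footage ranges and the random seed are my assumptions, not from the original:

```python
import numpy as np

rng = np.random.default_rng(0)

# Weights and bias from the text
w1, w2, b = 10_000, 150, 50_000

# Draw random bedroom counts and square footages (ranges are assumptions)
x1 = rng.integers(1, 5, size=10)        # 1 to 4 bedrooms
x2 = rng.integers(1000, 3500, size=10)  # 1000 to 3499 sq ft

# Exact linear relationship, no noise added
y = w1 * x1 + w2 * x2 + b
print(np.column_stack((x1, x2, y)))
```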
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic data
x1_values = np.array([1, 4, 2, 1, 4, 4, 4, 4, 2, 4])
x2_values = np.array([1277, 2778, 2828, 3362, 1705, 3135, 3222, 2701, 1537, 3120])
predicted_prices = np.array([251550, 506700, 494200, 564300, 345750, 560250, 573300, 495150, 300550, 558000])

# Stack features and target into arrays suitable for model training
X = np.column_stack((x1_values, x2_values))
y = predicted_prices

# Create a linear regression model and fit it to the synthetic data
model = LinearRegression()
model.fit(X, y)

# Retrieve and display the coefficients and intercept from the fitted model
coefficients = model.coef_
intercept = model.intercept_
print("Coefficients:", coefficients)
print("Intercept:", intercept)

# Coefficients for the number of bedrooms (x1) and square footage (x2)
w1, w2 = model.coef_
# Intercept
b = model.intercept_

# Print the model equation
print(f"The linear model equation is: y = {w1:.2f} * x1 + {w2:.2f} * x2 + {b:.2f}")

# Predict the prices using the model
predicted_prices_by_model = model.predict(X)

# Plot the actual vs predicted prices
plt.figure(figsize=(10, 6))
plt.scatter(y, predicted_prices_by_model, color='blue', label='Predicted vs Actual')
plt.plot(y, y, color='red', label='Ideal Fit')  # Line for perfect predictions
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Comparison of Actual and Predicted Prices')
plt.legend()
plt.grid(True)
plt.show()
```
```
Coefficients: [10000.   150.]
Intercept: 50000
The linear model equation is: y = 10000.00 * x1 + 150.00 * x2 + 50000.00
```
The linear regression model has been successfully fitted to the synthetic data, and it accurately recovered the coefficients and intercept used to generate it:

- $w_1$ (bedrooms, $x_1$): 10,000
- $w_2$ (square footage, $x_2$): 150
- $b$ (intercept): 50,000
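With the recovered parameters, predicting the price of a new house is just a matter of plugging values into the equation. Here's a quick sketch using a hypothetical 3-bedroom, 2,000-square-foot house (numbers chosen for illustration):

```python
# Parameters recovered by the fit
w1, w2, b = 10_000, 150, 50_000

# A hypothetical new house: 3 bedrooms, 2000 sq ft
bedrooms, sqft = 3, 2000
price = w1 * bedrooms + w2 * sqft + b
print(f"Predicted price: ${price:,}")  # Predicted price: $380,000
```

Equivalently, the fitted scikit-learn model would give the same answer via `model.predict([[3, 2000]])`.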