Regression models predict continuous numeric values; classification models predict discrete categories.
$$ y = w_1 x_1 + w_2 x_2 + b $$
Linear means that $y$ depends only on $x_1$ and $x_2$ themselves, and not on $x_1$ or $x_2$ raised to some power, or on any product of $x_1$ and $x_2$.
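To make the distinction concrete, here's a tiny sketch contrasting a linear function of two variables with two nonlinear ones; the weights (4, 5) and inputs (3, 2) are made up purely for illustration:

```python
x1, x2 = 3.0, 2.0

linear = 4 * x1 + 5 * x2 + 1             # only first powers of x1 and x2: linear
with_power = 4 * x1**2 + 5 * x2 + 1      # x1 squared: not linear
with_product = 4 * x1 * x2 + 1           # product of x1 and x2: not linear

print(linear, with_power, with_product)  # 23.0 47.0 25.0
```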
Regression means that given some independent variables ($x_1$ and $x_2$), we build a model or equation to predict the value of a dependent variable $y$.
Take a very simple, practical, and some would say utterly boring problem. Let's say $x_1$ represents the number of bedrooms in a house, $x_2$ represents the total square footage, and $y$ represents the price of the house. Let's assume that there exists a linear relationship between $x_1$, $x_2$, and $y$. Then, by learning the weights of the linear equation from some existing data about houses and their prices, we can build a very simple model with which to predict the price of a house given the number of bedrooms and the square footage.
Here I've generated synthetic data based on a simple linear relationship between the number of bedrooms ($x_1$), the total square footage ($x_2$), and the house price ($y$). The formula for our linear model is

$$ y = w_1 x_1 + w_2 x_2 + b $$

where $w_1$ and $w_2$ are the weights (coefficients) for $x_1$ and $x_2$, respectively, and $b$ is the intercept, with weights $w_1$ = 10,000 for bedrooms, $w_2$ = 150 for square footage, and a bias $b$ = 50,000.
This equation represents a linear relationship where $x_1$ and $x_2$ are the independent variables (number of bedrooms and square footage) and $y$ is the dependent variable (house price). Here's a sample of the synthetic data:
| Bedrooms (x1) | Sq Footage (x2) | Predicted Price (y) |
|---|---|---|
| 1 | 1277 | $251,550 |
| 4 | 2778 | $506,700 |
| 2 | 2828 | $494,200 |
| 1 | 3362 | $564,300 |
| 4 | 1705 | $345,750 |
| 4 | 3135 | $560,250 |
| 4 | 3222 | $573,300 |
| 4 | 2701 | $495,150 |
| 2 | 1537 | $300,550 |
| 4 | 3120 | $558,000 |
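A sample like the one above can be produced directly from the linear formula. Here's a minimal sketch of such a generator; the bedroom and square-footage ranges and the random seed are my assumptions, not from the original:

```python
import numpy as np

rng = np.random.default_rng(0)

# Weights and bias from the text
w1, w2, b = 10_000, 150, 50_000

# Draw random bedroom counts and square footages (ranges are assumptions)
x1 = rng.integers(1, 5, size=10)        # 1 to 4 bedrooms
x2 = rng.integers(1000, 3500, size=10)  # 1000 to 3499 sq ft

# Exact linear relationship, no noise added
y = w1 * x1 + w2 * x2 + b
print(np.column_stack((x1, x2, y)))
```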
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic data
x1_values = np.array([1, 4, 2, 1, 4, 4, 4, 4, 2, 4])
x2_values = np.array([1277, 2778, 2828, 3362, 1705, 3135, 3222, 2701, 1537, 3120])
predicted_prices = np.array([251550, 506700, 494200, 564300, 345750, 560250, 573300, 495150, 300550, 558000])

# Stack features and target into arrays suitable for model training
X = np.column_stack((x1_values, x2_values))
y = predicted_prices

# Create a linear regression model and fit it to the synthetic data
model = LinearRegression()
model.fit(X, y)

# Retrieve and display the coefficients and intercept from the fitted model
coefficients = model.coef_
intercept = model.intercept_
print("Coefficients:", coefficients)
print("Intercept:", intercept)

# Coefficients for the number of bedrooms (x1) and square footage (x2)
w1, w2 = model.coef_
# Intercept
b = model.intercept_

# Print the model equation
print(f"The linear model equation is: y = {w1:.2f} * x1 + {w2:.2f} * x2 + {b:.2f}")

# Predict the prices using the model
predicted_prices_by_model = model.predict(X)

# Plot the actual vs predicted prices
plt.figure(figsize=(10, 6))
plt.scatter(y, predicted_prices_by_model, color='blue', label='Predicted vs Actual')
plt.plot(y, y, color='red', label='Ideal Fit')  # Line for perfect predictions
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Comparison of Actual and Predicted Prices')
plt.legend()
plt.grid(True)
plt.show()
```
```
Coefficients: [10000.   150.]
Intercept: 50000
The linear model equation is: y = 10000.00 * x1 + 150.00 * x2 + 50000.00
```
The linear regression model has been successfully fitted to the synthetic data, and it accurately recovered the coefficients and intercept used to generate it:

- $w_1$ (bedrooms, $x_1$): 10,000
- $w_2$ (square footage, $x_2$): 150
- $b$ (intercept): 50,000
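With the recovered parameters, predicting the price of a new house is just a matter of plugging values into the equation. Here's a quick sketch using a hypothetical 3-bedroom, 2,000-square-foot house (numbers chosen for illustration):

```python
# Parameters recovered by the fit
w1, w2, b = 10_000, 150, 50_000

# A hypothetical new house: 3 bedrooms, 2000 sq ft
bedrooms, sqft = 3, 2000
price = w1 * bedrooms + w2 * sqft + b
print(f"Predicted price: ${price:,}")  # Predicted price: $380,000
```

Equivalently, the fitted scikit-learn model would give the same answer via `model.predict([[3, 2000]])`.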