Logistic Regression and SVM to Classify Blood Pressure Categories

Classifying blood pressure into "normal" and "elevated" categories is vital for managing cardiovascular health. Today, we’ll dive into how Logistic Regression and Support Vector Machines (SVM) can effectively categorize blood pressure readings based on systolic and diastolic values. We’ll explore the strengths of each model, visualize decision boundaries, and interpret the results to gain actionable insights.

The goal of a classifier is to differentiate between these two groups so that, when new, unlabeled data comes in, we can predict whether a patient's blood pressure is likely normal or elevated. But how does it achieve this?

To illustrate, let’s imagine we’ve measured systolic and diastolic blood pressure for a group of patients. These two metrics are the key indicators used to determine whether a patient has high blood pressure. Specifically, we have a label vector (y) that marks each patient as having either normal or elevated blood pressure, and a feature matrix (X) with two measurements per patient: one for systolic pressure and one for diastolic pressure. Below, I’ve provided the measurement ranges for both categories:

Blood Pressure Category	Systolic Range (mm Hg)	Diastolic Range (mm Hg)
Normal	90-120	60-80
Elevated	120-129	80-89
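
To make the feature matrix (X) and the label vector (y) concrete, here is a minimal sketch with four hypothetical patients. The readings are made up for illustration and simply fall inside the ranges in the table above.

```python
import numpy as np

# Hypothetical readings for four patients (illustrative values only).
# Each row is one patient: [systolic, diastolic] in mm Hg.
X = np.array([
    [105.0, 70.0],
    [118.0, 76.0],
    [124.0, 84.0],
    [127.0, 88.0],
])

# One label per row of X, in the same order.
y = np.array(["normal", "normal", "elevated", "elevated"])

print(X.shape)  # (4, 2): four patients, two features each
print(y.shape)  # (4,): one label per patient
```

A classifier learns a rule from pairs like these, then applies it to new rows of X that have no label yet.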

1. Dataset Overview: Understanding Blood Pressure Categories

For this task, we used synthetic data representing patients with:

Normal Blood Pressure: Systolic between 90-120 mm Hg and diastolic between 60-80 mm Hg.
Elevated Blood Pressure: Systolic between 120-129 mm Hg and diastolic between 80-89 mm Hg.

Each patient is labeled as either "normal" or "elevated," forming the target variable for our models.

2. Why Use Logistic Regression and SVM?

Logistic Regression is widely used for binary classification, making it a natural choice for distinguishing between two blood pressure categories. It outputs a probability for each class rather than just a label, and it is easy to interpret because it assigns one weight to each feature.
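
To show what "probabilities" and "weights" look like in practice, here is a minimal sketch using scikit-learn. The data generation, the random seed, the use of StandardScaler, and the example patient are all assumptions made for this illustration, not necessarily the setup used elsewhere in this post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data drawn from the ranges described above (assumed seed and size).
rng = np.random.default_rng(0)
n = 100
X_normal = np.column_stack([rng.uniform(90, 120, n), rng.uniform(60, 80, n)])
X_elevated = np.column_stack([rng.uniform(120, 129, n), rng.uniform(80, 89, n)])
X = np.vstack([X_normal, X_elevated])
y = np.array([0] * n + [1] * n)  # 0 = normal, 1 = elevated

# Standardize the features, then fit a logistic regression model.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# One weight per (standardized) feature: [systolic, diastolic].
weights = model.named_steps["logisticregression"].coef_[0]
print("Weights (systolic, diastolic):", weights)

# Estimated probability that a hypothetical new patient is "elevated".
new_patient = np.array([[125.0, 85.0]])
proba_elevated = model.predict_proba(new_patient)[0, 1]
print("P(elevated) for 125/85:", round(float(proba_elevated), 3))
```

Because the features are standardized inside the pipeline, each weight describes the effect of a one-standard-deviation change in that feature, which makes the two weights directly comparable.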

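Support Vector Machines (SVM), on the other hand, work well for linearly separable data: they pick the boundary that maximizes the margin, that is, the distance between the boundary and the closest points of each class. That makes SVM a good fit for tasks like this one, where the two synthetic categories do not overlap and we want a clear boundary between them.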
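
As a rough illustration of margin maximization, the sketch below fits a linear SVM with scikit-learn and computes the margin width as 2 / ||w||, where w is the learned weight vector. The data generation, seed, and C value are assumptions for this example, and the margin is measured in standardized units, not mm Hg.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data drawn from the ranges described above (assumed seed and size).
rng = np.random.default_rng(0)
n = 100
X_normal = np.column_stack([rng.uniform(90, 120, n), rng.uniform(60, 80, n)])
X_elevated = np.column_stack([rng.uniform(120, 129, n), rng.uniform(80, 89, n)])
X = np.vstack([X_normal, X_elevated])
y = np.array([0] * n + [1] * n)  # 0 = normal, 1 = elevated

# Linear-kernel SVM on standardized features.
svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
svm.fit(X, y)

svc = svm.named_steps["svc"]
w = svc.coef_[0]                          # normal vector of the separating line
margin_width = 2.0 / np.linalg.norm(w)    # distance between the two margin lines
n_support = svc.support_vectors_.shape[0]

print("Margin width (standardized units):", round(float(margin_width), 3))
print("Number of support vectors:", n_support)
```

Only the support vectors, the points on or inside the margin, determine where the boundary ends up; moving any other point slightly would not change the fitted line.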

3. Data Preparation and Feature Engineering

We split the data into training and test sets so we can evaluate the models on patients they have not seen. We also scaled the systolic and diastolic features, which helps both the SVM fit and the visualization. To focus on the most relevant range, we limited the y-axis (diastolic) to 60-90 mm Hg in our visualization.

# Import necessary libraries
import numpy as np
import pandas as pd

# Set random seed for reproducibility
np.random.seed(41)

# Define ranges for each category
normal_systolic_range = (90, 120)
normal_diastolic_range = (60, 80)

elevated_systolic_range = (120, 129)
elevated_diastolic_range = (80, 89)

# Generate synthetic data
num_patients = 200
num_normal = num_patients // 2
num_elevated = num_patients - num_normal

# Generate normal category blood pressure data
normal_systolic = np.random.uniform(*normal_systolic_range, num_normal)
normal_diastolic = np.random.uniform(*normal_diastolic_range, num_normal)
normal_labels = ['normal'] * num_normal

# Generate elevated category blood pressure data
elevated_systolic = np.random.uniform(*elevated_systolic_range, num_elevated)
elevated_diastolic = np.random.uniform(*elevated_diastolic_range, num_elevated)
elevated_labels = ['elevated'] * num_elevated

# Combine data into a DataFrame
systolic_values = np.concatenate([normal_systolic, elevated_systolic])
diastolic_values = np.concatenate([normal_diastolic, elevated_diastolic])
labels = normal_labels + elevated_labels

data = pd.DataFrame({
    'Systolic': systolic_values,
    'Diastolic': diastolic_values,
    'Category': labels
})

Summary Statistics

import matplotlib.pyplot as plt
import seaborn as sns

summary_stats = data.groupby('Category').agg({
    'Systolic': ['mean', 'median', 'std'],
    'Diastolic': ['mean', 'median', 'std']
})

# Visualizations: Histograms and Boxplots
plt.figure(figsize=(14, 6))

# Histograms for systolic and diastolic pressures
plt.subplot(1, 2, 1)
sns.histplot(data=data, x='Systolic', hue='Category', kde=True, bins=20)
plt.title('Distribution of Systolic Pressure by Category')
plt.xlabel('Systolic Pressure (mm Hg)')

plt.subplot(1, 2, 2)
sns.histplot(data=data, x='Diastolic', hue='Category', kde=True, bins=20)
plt.title('Distribution of Diastolic Pressure by Category')
plt.xlabel('Diastolic Pressure (mm Hg)')
plt.tight_layout()
plt.show()

custom_palette = {"normal": "skyblue", "elevated": "salmon"}

# Boxplots for systolic and diastolic pressures by category
plt.figure(figsize=(8, 6))
sns.boxplot(x='Category', y='Systolic', data=data, palette=custom_palette)
plt.title('Boxplot of Systolic Pressure by Category')
plt.xlabel('Blood Pressure Category')
plt.ylabel('Systolic Pressure (mm Hg)')
plt.show()

plt.figure(figsize=(8, 6))
sns.boxplot(x='Category', y='Diastolic', data=data, palette=custom_palette)
plt.title('Boxplot of Diastolic Pressure by Category')
plt.xlabel('Blood Pressure Category')
plt.ylabel('Diastolic Pressure (mm Hg)')
plt.show()

# Correlation analysis
correlation = data[['Systolic', 'Diastolic']].corr()

# Display the summary statistics and the correlation matrix
print(summary_stats)
print(correlation)
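
The train/test split and feature scaling mentioned at the start of this section could look something like the sketch below. The 80/20 split, the stratification, the seeds, and the choice of StandardScaler are assumptions for illustration; the data is regenerated here so the snippet runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Regenerate synthetic data from the same ranges (assumed seed).
rng = np.random.default_rng(41)
n = 100
X_normal = np.column_stack([rng.uniform(90, 120, n), rng.uniform(60, 80, n)])
X_elevated = np.column_stack([rng.uniform(120, 129, n), rng.uniform(80, 89, n)])
X = np.vstack([X_normal, X_elevated])
y = np.array([0] * n + [1] * n)  # 0 = normal, 1 = elevated

# Hold out 20% of patients for testing, keeping the class balance in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=41
)

# Fit the scaler on the training set only, then apply it to both sets,
# so no information from the test set leaks into training.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)  # (160, 2) (40, 2)
```

Fitting the scaler on the training data alone is the detail that matters most here: the test set should be transformed with statistics it did not contribute to.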

4. Model Training and Decision Boundary Visualization