One of the primary uses of PCA is to reduce the dimensionality of large datasets. By transforming the data into a smaller number of principal components, PCA simplifies high-dimensional data while retaining most of the important information, or variation, present in the original dataset.
If a dataset is small, we can afford to examine each feature individually. If the dataset has thousands of features, we would likely forgo this and instead explore dimensionality reduction techniques, which condense the information in a large number of features into a small number of derived features, or alternative feature selection methods, which isolate the most important features from a large candidate pool.
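As a quick illustration of the feature-selection route, here is a minimal sketch using scikit-learn's SelectKBest; the synthetic dataset and the choice of k=10 are placeholders for illustration, not recommendations:

```python
# A minimal feature-selection sketch using scikit-learn's SelectKBest.
# The synthetic dataset and k=10 are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 1,000 samples with 500 candidate features
X, y = make_classification(n_samples=1000, n_features=500, random_state=0)

# Keep the 10 features with the strongest univariate relationship to y
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)  # (1000, 10)
```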
PCA projects the entire dataset onto a different feature space, prioritizing the dimensions that explain the most variance in the data. We can leverage PCA to reduce computational complexity by dropping the low-variance dimensions.
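To make this concrete, here is a minimal sketch using scikit-learn's PCA on the digits dataset; keeping enough components to explain roughly 95% of the variance is an illustrative choice, not a fixed rule:

```python
# A minimal PCA sketch with scikit-learn; the 0.95 variance
# threshold is an illustrative choice, not a fixed rule.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)  # 1,797 samples x 64 features

# Keep enough principal components to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("variance explained:", pca.explained_variance_ratio_.sum().round(3))
```

The low-variance dimensions are exactly the components PCA discards here: the remaining columns of X_reduced carry most of the original variation at a fraction of the dimensionality.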
PCA relies on a few key concepts (illustrated in the sketch after this list):
Variance quantifies how spread out our data is.
Covariance measures how two features of a dataset vary with respect to one another; it captures the direction of the relationship between the two variables.
The covariance matrix stores the covariances between every pair of variables in the dataset, with each feature's variance along the diagonal.
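These quantities are straightforward to compute directly; here is a small NumPy sketch, where the three-feature dataset is made up purely for illustration:

```python
# Computing variance, covariance, and the covariance matrix with NumPy.
# The three-feature dataset below is made up for illustration.
import numpy as np

X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 2.1],
              [2.2, 2.9, 0.9],
              [1.9, 2.2, 1.2],
              [3.1, 3.0, 0.3]])

# Sample variance of each feature (how spread out each column is)
print(np.var(X, axis=0, ddof=1))

# Covariance matrix: entry (i, j) is the covariance of features i and j;
# the diagonal holds each feature's variance.
cov = np.cov(X, rowvar=False)
print(cov)
```

PCA diagonalizes exactly this covariance matrix: its eigenvectors become the principal components, and its eigenvalues give the variance each component explains.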