Projection of 2D data onto a line

Consider 100 data points generated in a narrow strip within a 2D space.

set.seed(1)
x = runif(100)                      # x-coordinates drawn uniformly from [0, 1]
y = 0.4*x + 0.8 + 0.1*runif(100)    # y follows a line, plus uniform noise
data1 = data.frame(x = x, y = y)
plot(data1$x, data1$y, xlab = "x", ylab = "y", xlim = c(0,1), ylim = c(0.5, 1.5))

Now we can approximate the 2D data points with a single variable. One approach is to fit a linear regression and then project each 2D point onto the regression line. The resulting scalar projection serves as a condensed one-variable representation of the original 2D data.

plot(data1$x, data1$y, xlab = "x", ylab = "y", xlim = c(0,1), ylim = c(0.5, 1.5))
model = lm(y ~ x, data = data1)
abline(model, col = 2)

x0 = c(0, model$coefficients[1])  # a point on the fitted line (x = 0, y = intercept)
k = model$coefficients[2]         # slope of the fitted line
dir = c(1, k)/sqrt(1 + k*k)       # unit vector along the fitted line

# Scalar projection of each point onto the line direction, relative to x0
x_proj = apply(as.matrix(data1), 1, function(x) (x - x0) %*% dir)
# Map the scalar projections back to 2D coordinates on the line
data1_projected = t(sapply(x_proj, function(p) p*dir + x0))

plot(data1$x, data1$y, xlab = "x", ylab = "y", xlim = c(0,1), ylim = c(0.5, 1.5))
abline(model, col = 2)
points(data1_projected[,1], data1_projected[,2], pch = 16, col = 4)
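
As a quick sanity check (a sketch, not part of the original analysis), every projected point should lie on the fitted line, so its residual against y = b0 + b1*x should vanish up to floating-point error:

b = model$coefficients
max(abs(data1_projected[,2] - (b[1] + b[2]*data1_projected[,1])))   # ~0 (machine precision)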

Principal Component Analysis

Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction. It identifies the underlying patterns in a dataset by transforming the original variables into a new set of orthogonal variables called principal components. These components are ordered by the amount of variance they explain, allowing PCA to capture the most significant information in the data with fewer dimensions. PCA is widely used in various fields such as data analysis, image processing, and pattern recognition.

The detailed steps of Principal Component Analysis (PCA) are as follows, with a compact worked example after the list:

(1) Data Standardization: If the features of the dataset are measured in different units or have different scales, it’s crucial to standardize the data by subtracting the mean and dividing by the standard deviation for each feature. This step ensures that all features contribute equally to the analysis.

(2) Covariance Matrix Computation: Compute the covariance matrix of the standardized dataset. The covariance matrix summarizes the relationships between pairs of variables, showing how they vary together.

(3) Eigenvalue Decomposition: Perform eigenvalue decomposition on the covariance matrix to find its eigenvectors and eigenvalues. Eigenvectors represent the directions of maximum variance in the dataset, while eigenvalues indicate the magnitude of variance along each eigenvector.

(4) Selection of Principal Components: Sort the eigenvectors based on their corresponding eigenvalues in descending order. The eigenvectors with the highest eigenvalues (principal components) capture the most variance in the data. Choose the desired number of principal components (often determined based on the amount of variance explained).

(5) Projection onto Principal Components: Project the original data onto the selected principal components. This is achieved by multiplying the standardized data matrix by the matrix of eigenvectors, resulting in a new dataset with reduced dimensionality.

(6) Interpretation and Analysis: Analyze the transformed dataset to understand the underlying structure and patterns. Each principal component represents a combination of the original features, and their contributions can be interpreted to gain insights into the data. Visualize the data in the reduced-dimensional space to explore clusters, trends, or outliers.
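
The sketch below walks through steps (1)-(5) on a small random matrix using base R; the toy matrix and the choice of two components are assumptions made purely for illustration. It standardizes the columns, builds the covariance matrix, applies eigen() for the decomposition, and projects the data onto the leading components.

set.seed(10)
toy = matrix(rnorm(30), nrow = 10, ncol = 3)   # toy data: 10 observations, 3 features

Z = scale(toy, center = TRUE, scale = TRUE)    # (1) standardize each feature
C = cov(Z)                                     # (2) covariance matrix
ed = eigen(C)                                  # (3) eigenvalue decomposition
ed$values                                      # eigenvalues, sorted in decreasing order
V = ed$vectors[, 1:2]                          # (4) keep the two leading eigenvectors
scores = Z %*% V                               # (5) project onto the selected PCs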

Below is a generic implementation of PCA in R.

pca <- function(dat) {
  # dat: a numeric matrix (rows = observations, columns = variables)
  # Output: a list containing PCA outputs
  
  # Centering the data (set scale = TRUE to also standardize, as in step (1))
  centered_data = scale(dat, center = TRUE, scale = FALSE)
  
  # Calculating the covariance matrix
  covariance_matrix = cov(centered_data)
  
  # Singular Value Decomposition (SVD) of the covariance matrix
  svd_result = svd(covariance_matrix)
  
  # Extracting important information: for a symmetric positive semi-definite
  # covariance matrix, the singular values equal the eigenvalues
  eigenvalues = svd_result$d
  eigenvectors = svd_result$v
  
  # Calculating the principal component scores
  scores = centered_data %*% eigenvectors
  
  # Output list containing eigenvalues, eigenvectors, and scores
  pca_output = list(
    eigenvalues = eigenvalues,
    eigenvectors = eigenvectors,
    scores = scores
  )
  
  return(pca_output)
}
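
As a quick sanity check (a sketch on a small random matrix, not part of the original analysis), we can compare the function against base R's prcomp. Eigenvectors are determined only up to sign, so the scores are compared via absolute values:

set.seed(2)
toy = matrix(rnorm(40), nrow = 10, ncol = 4)
res = pca(toy)
ref = prcomp(toy, center = TRUE, scale. = FALSE)

res$eigenvalues                          # should match ref$sdev^2, the PC variances
ref$sdev^2
max(abs(abs(res$scores) - abs(ref$x)))   # ~0: scores agree up to per-component sign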

We now apply PCA to a set of gene expression data collected under various experimental conditions.

exp_data = read.csv(file='./extra/data/01D/exp_data.csv', row.names = 1)
dat = t(as.matrix(exp_data))  # transpose so rows are conditions and columns are genes

results = pca(dat)
eig = results$eigenvalues
vec = results$eigenvectors
pcs = results$scores

print("First 10 eigenvalues:")
## [1] "First 10 eigenvalues:"
print(eig[1:10])
##  [1] 28.7116848  4.9885988  3.5726823  2.7650767  1.8227011  1.3424053
##  [7]  0.9642870  0.7652421  0.6465776  0.5685587

We can assess the proportion of variance explained to decide how many principal components (PCs) to retain.

plot_pc_contributions <- function(eigenvalues, npc) {
  # eigenvalues: a vector of all eigenvalues
  # npc: number of PCs to be considered
  
  total_variance = sum(eigenvalues)
  variance_proportion = eigenvalues / total_variance
  
  variance_proportion_plotted = variance_proportion[1:npc]
  
  # barplot() returns the bar midpoints; reuse them so the overlaid
  # line aligns with the bars
  bar_centers = barplot(cumsum(variance_proportion_plotted), 
                        main = "Principal Component Contributions",
                        xlab = "Number of principal components",
                        ylab = "Proportion of Variance Explained",
                        names.arg = seq_len(npc))
  
  # Individual proportions overlaid in red
  lines(bar_centers, variance_proportion_plotted, col = "red", type = "b", pch = 19)
  
  # Bars show the cumulative proportion; the red line shows individual proportions
  legend("topright", 
         legend = c("Cumulative", "Individual"),
         col = c("grey", "red"), 
         pch = c(15, 19), 
         lty = c(0, 1),
         xpd = TRUE)
}

plot_pc_contributions(eigenvalues = eig, npc = 10)
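
A complementary heuristic (sketched below; the 90% cutoff is an arbitrary choice for illustration) is to retain the smallest number of PCs whose cumulative proportion of explained variance reaches a target:

cum_var = cumsum(eig) / sum(eig)
which(cum_var >= 0.90)[1]   # smallest number of PCs explaining at least 90% of variance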

After projecting the data onto the first two principal components, we can visualize them in a 2D plot. Here, the projected data exhibit a sequential structure across the experimental conditions, following the order A, B, C, D, E, F, G, H, I. By fitting a principal curve to this 2D data, we can trace the trajectory of gene expression dynamics.

library(princurve)
x = pcs[,1]  # PC1 scores
y = pcs[,2]  # PC2 scores
curve_fit <- principal_curve(pcs[,1:2])  # fit a principal curve in the PC1-PC2 plane

plot(x, y, xlab = "PC1", ylab = "PC2")
text(x, y, labels = rownames(pcs), pos = 3, col = "blue")

lines(curve_fit, col = "red")
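
The fitted object also lets us read the trajectory off directly: principal_curve() returns, among other fields, ord (the index ordering of the samples along the curve) and lambda (each sample's arc-length position). A brief sketch:

rownames(pcs)[curve_fit$ord]    # conditions ordered along the fitted trajectory
round(curve_fit$lambda, 2)      # arc-length position of each condition on the curve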