Computing PCA from SVD

Last updated: 2017-03-28

Code version: 666a06c

library(readr)
library(h5)
library(dplyr)

To compute the PCA of the response matrix, we first scaled the columns of the matrix (see the IPython notebook) and then computed its singular value decomposition. As a refresher, for an \(n \times p\) matrix \(X\) whose columns each have mean \(0\), the covariance matrix \(C\) is \[C= X^{T}X/(n-1)\]

The eigenvalue decomposition of the covariance matrix \(C\) gives us the principal components:

\[C=VLV^{T}\]

Recall that any matrix \(X\) has a singular value decomposition

\[X = USV^{T}\]

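Substituting the SVD into the definition of \(C\) and using \(U^{T}U = I\) gives \[C=VSU^{T}USV^{T}/(n-1)= V\frac{S^2}{n-1}V^{T}\]

Comparing this with \(C=VLV^{T}\), the right singular vectors \(V\) are the principal directions, the eigenvalues are \(L = S^2/(n-1)\), and the principal component scores of \(X\) are \(XV = USV^{T}V = US\).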
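The equivalence above can be checked numerically. The following is a minimal sketch on a small simulated matrix, not the DeepSEA response matrix; the dimensions `n` and `p` and all variable names are chosen here purely for illustration.

```r
# Toy check on simulated data (not the DeepSEA matrix).
set.seed(1)
n <- 20
p <- 5
# Center the columns so that each has mean 0
X <- scale(matrix(rnorm(n * p), nrow = n, ncol = p), center = TRUE, scale = FALSE)

# Route 1: eigendecomposition of the covariance matrix C = X^T X / (n - 1)
C <- crossprod(X) / (n - 1)
eig <- eigen(C, symmetric = TRUE)

# Route 2: SVD of X itself
sv <- svd(X)
evals_svd <- sv$d^2 / (n - 1)             # eigenvalues L = S^2 / (n - 1)
scores_svd <- sweep(sv$u, 2, sv$d, "*")   # scores US (column j scaled by d[j])

# The eigenvalues agree, US equals XV, and the scores match the
# eigenvector projection up to the sign of each column
all.equal(evals_svd, eig$values)
all.equal(scores_svd, unname(X %*% sv$v))
all.equal(abs(scores_svd), abs(unname(X %*% eig$vectors)))
```

The absolute values in the last comparison are there because eigenvectors and singular vectors are only defined up to sign, so the two routes may flip individual components.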

We’ll first pull the features from the DeepSEA website

feature_url <- "http://deepsea.princeton.edu/media/help/features.txt"
features <- read_delim(feature_url,delim="\t",col_names = c("Source",
                                                            "CellType",
                                                            "DataType",
                                                            "Treatment",
                                                            "AUC"),skip = 1)

Parsed with column specification:
cols(
  Source = col_character(),
  CellType = col_character(),
  DataType = col_character(),
  Treatment = col_character(),
  AUC = col_character()
)

Now we’ll load the SVD we computed with dask

train_svdf <- "/media/nwknoblauch/Data/DeepSEA/train_svd_50_3.h5"

tsvdf <- h5file(train_svdf,mode='r')

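# Read the truncated SVD factors: U (left singular vectors), D (singular values), V (right singular vectors)
tu <- tsvdf["U"][]
td <- tsvdf["D"][]
tv <- tsvdf["V"][]
# Scale column j of U by the j-th singular value to get the scores US;
# a plain tu*td would recycle td down the rows instead of across the columns
s_tu <- sweep(tu, 2, as.numeric(td), "*")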
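Forming \(US\) means multiplying column \(j\) of \(U\) by the \(j\)-th singular value, and this is an easy place to slip in R because `*` between a matrix and a vector recycles the vector down each column. A small sketch with made-up values (the names `U` and `d` below are toy stand-ins, not the objects loaded from the HDF5 file) shows the difference:

```r
U <- matrix(1, nrow = 4, ncol = 2)   # toy stand-in for U (4 rows, 2 components)
d <- c(10, 100)                      # toy singular values

recycled <- U * d                    # d is recycled down each column, so rows get scaled
by_column <- sweep(U, 2, d, "*")     # column j is multiplied by d[j]
via_diag <- U %*% diag(d)            # same as by_column, written as the matrix product US

recycled
by_column
```

`sweep()` avoids building a diagonal matrix, which is convenient when there are many components; `U %*% diag(d)` is the more literal translation of the formula.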

Save the results

tu_df <- as_data_frame(s_tu)
colnames(tu_df) <- paste0("PC_",1:50)

d_df <- data_frame(evals=td^2/(ncol(tv)-1),ind=1:length(td))
#saveRDS(d_df,"../data/DeepSea_evals_df.RDS")
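Eigenvalues of this form are often summarised as the share of variance carried by each component. The sketch below uses hypothetical singular values and a hypothetical row count `n`, not the values stored in the HDF5 file. Note that with a truncated SVD the shares are relative to the retained components only, not to the total variance of \(X\).

```r
d <- c(4, 2, 1)   # hypothetical singular values
n <- 11           # hypothetical number of rows of X

evals <- d^2 / (n - 1)            # eigenvalues of the covariance matrix
prop_var <- evals / sum(evals)    # share of the retained variance per component
cum_var <- cumsum(prop_var)       # cumulative share

data.frame(ind = seq_along(d), evals = evals, prop_var = prop_var, cum_var = cum_var)
```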

features <- mutate(features,AUC=as.numeric(AUC)) %>% filter(!is.na(AUC))

Warning in eval(substitute(expr), envir, enclos): NAs introduced by
coercion

ntu_df <- bind_cols(features,slice(tu_df,-1))
#saveRDS(ntu_df,"../data/DeepSeaPCA_df.RDS")

Session information

sessionInfo()

R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readr_1.1.0     workflowr_0.4.0 rmarkdown_1.3   dplyr_0.5.0    
[5] ggplot2_2.2.1   h5_0.9.8       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.10     rstudioapi_0.6   knitr_1.15.1     whisker_0.3-2   
 [5] magrittr_1.5     hms_0.3          munsell_0.4.3    colorspace_1.3-2
 [9] R6_2.2.0         stringr_1.2.0    plyr_1.8.4       tools_3.3.3     
[13] grid_3.3.3       gtable_0.2.0     DBI_0.6          git2r_0.18.0    
[17] htmltools_0.3.5  assertthat_0.1   yaml_2.1.14      lazyeval_0.2.0  
[21] rprojroot_1.2    digest_0.6.12    tibble_1.2       base64enc_0.1-3 
[25] curl_2.3         rsconnect_0.7    evaluate_0.10    labeling_0.3    
[29] stringi_1.1.2    scales_0.4.1     backports_1.0.5  jsonlite_1.3

This R Markdown site was created with workflowr

Nicholas Knoblauch
