Last updated: 2017-03-28
Code version: 666a06c
library(readr)
library(h5)
library(dplyr)
To compute the PCA of the response matrix, we first scaled the columns of the matrix (see IPython notebook) and then computed its singular value decomposition. As a refresher, remember that for a matrix \(X\) with \(n\) rows (for which the mean of each column is \(0\)), the covariance matrix \(C\) is \[C= X^{T}X/(n-1)\]
The eigenvalue decomposition of the covariance matrix \(C\) gives us the principal axes (the columns of \(V\)), with the corresponding variances on the diagonal of \(L\):
\[C=VLV^{T}\]
Remember that for any matrix \(X\), the Singular Value Decomposition of that matrix \(X\) is
\[X = USV^{T}\]
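Since \(U^{T}U = I\) and \(S\) is diagonal, it’s easy to show that \[C=VSU^{T}USV^{T}/(n-1)= V\frac{S^2}{n-1}V^{T}\]
This means that \(L = S^2/(n-1)\) and that the columns of \(US\) (equivalently \(XV\)) are the principal components of \(X\).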
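The identities above can be checked numerically. The following is a minimal sketch on a small simulated matrix (the data, the dimensions, and all variable names are made up for illustration and are not part of the DeepSEA analysis); it compares the eigendecomposition of \(C\) with the SVD of \(X\) using base R only.

```r
# Simulated, column-centered matrix; n and p are arbitrary illustration values.
set.seed(1)
n <- 20
p <- 5
M <- matrix(rnorm(n * p), nrow = n, ncol = p)
X <- sweep(M, 2, colMeans(M))          # subtract each column's mean

# Eigendecomposition of the covariance matrix C = X^T X / (n - 1)
C <- crossprod(X) / (n - 1)
eig <- eigen(C, symmetric = TRUE)

# SVD of X itself: X = U S V^T
sv <- svd(X)

# Eigenvalues of C should equal S^2 / (n - 1)
evals_match <- isTRUE(all.equal(eig$values, sv$d^2 / (n - 1)))

# US should equal XV, the projection of X onto the principal axes
scores_svd <- sv$u %*% diag(sv$d)
scores_proj <- X %*% sv$v
scores_match <- isTRUE(all.equal(scores_svd, scores_proj))

# prcomp should agree with US up to the sign of each column
pc <- prcomp(X, center = FALSE)
prcomp_match <- isTRUE(all.equal(abs(unname(pc$x)), abs(scores_svd)))

print(c(evals = evals_match, scores = scores_match, prcomp = prcomp_match))
```

The comparison with prcomp uses absolute values because singular vectors are only defined up to sign.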
We’ll first pull the feature annotations from the DeepSEA website:
feature_url <- "http://deepsea.princeton.edu/media/help/features.txt"
features <- read_delim(feature_url,delim="\t",col_names = c("Source",
"CellType",
"DataType",
"Treatment",
"AUC"),skip = 1)
Parsed with column specification:
cols(
Source = col_character(),
CellType = col_character(),
DataType = col_character(),
Treatment = col_character(),
AUC = col_character()
)
Now we’ll load the SVD we computed in Dask:
train_svdf <- "/media/nwknoblauch/Data/DeepSEA/train_svd_50_3.h5"
tsvdf <- h5file(train_svdf,mode='r')
tu <- tsvdf["U"][]
td <- tsvdf["D"][]
tv <- tsvdf["V"][]
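# Scale column j of U by singular value j (US); a plain tu*td would recycle td down the rows instead
s_tu <- sweep(tu, 2, as.numeric(td), "*")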
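When forming \(US\) in R, the way a vector is recycled against a matrix matters. The sketch below uses toy stand-ins for U and the singular values (the names U and d and their values are hypothetical, and it assumes the singular values are held in a plain vector with one entry per column of U). It contrasts elementwise `*`, which recycles the vector down each column, with two ways of scaling column j by d[j].

```r
# Toy stand-ins: U is 4 x 2, d holds one (made-up) singular value per column.
U <- matrix(1, nrow = 4, ncol = 2)
d <- c(10, 100)

# Elementwise product: d is recycled in column-major order,
# so the values of d alternate down the rows.
recycled <- U * d

# Column-wise scaling: column j is multiplied by d[j].
by_column <- sweep(U, 2, d, "*")

# The matrix product US gives the same result as the column-wise scaling.
via_diag <- U %*% diag(d)

print(recycled)
print(by_column)
```

In this toy case the first row of recycled is (10, 10), while the first row of by_column is (10, 100). No recycling warning is raised, because the number of matrix elements is a multiple of the vector length.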
Assemble the results into data frames (the saveRDS calls below are commented out):
tu_df <- as_data_frame(s_tu)
colnames(tu_df) <- paste0("PC_",1:50)
d_df <- data_frame(evals=td^2/(ncol(tv)-1),ind=1:length(td))
#saveRDS(d_df,"../data/DeepSea_evals_df.RDS")
features <- mutate(features,AUC=as.numeric(AUC)) %>% filter(!is.na(AUC))
Warning in eval(substitute(expr), envir, enclos): NAs introduced by
coercion
ntu_df <- bind_cols(features,slice(tu_df,-1))
#saveRDS(ntu_df,"../data/DeepSeaPCA_df.RDS")
sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] readr_1.1.0 workflowr_0.4.0 rmarkdown_1.3 dplyr_0.5.0
[5] ggplot2_2.2.1 h5_0.9.8
loaded via a namespace (and not attached):
[1] Rcpp_0.12.10 rstudioapi_0.6 knitr_1.15.1 whisker_0.3-2
[5] magrittr_1.5 hms_0.3 munsell_0.4.3 colorspace_1.3-2
[9] R6_2.2.0 stringr_1.2.0 plyr_1.8.4 tools_3.3.3
[13] grid_3.3.3 gtable_0.2.0 DBI_0.6 git2r_0.18.0
[17] htmltools_0.3.5 assertthat_0.1 yaml_2.1.14 lazyeval_0.2.0
[21] rprojroot_1.2 digest_0.6.12 tibble_1.2 base64enc_0.1-3
[25] curl_2.3 rsconnect_0.7 evaluate_0.10 labeling_0.3
[29] stringi_1.1.2 scales_0.4.1 backports_1.0.5 jsonlite_1.3
This R Markdown site was created with workflowr