Last updated: 2017-03-28
Code version: fa81b47
import itertools
import h5py
from sklearn import preprocessing
from itertools import groupby
import numpy as np
from dask.delayed import delayed
import dask.array as da  # needed below for the dask array operations
trainf = h5py.File('./deepsea_train/train.mat','r') # train.mat is a MATLAB v7.3 file, i.e. HDF5
traindata = trainf['/traindata']   # the label matrix
#traindata = trainf['/trainxdata'] # the sequence data lives in '/trainxdata'
One of the quirks of dask is that it’s easier to work with if the data is in individual files, or at least individual HDF5 datasets, so that’s what I’ve done here.
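The loop below relies on colchunks (blocks of column indices) and chunkfiles (one output file per block), which were set up earlier. A minimal sketch of what that setup might look like; the chunk width and file names here are my assumptions, not necessarily what was actually used:
rows, cols = traindata.shape                       # hypothetical: take dimensions from the on-disk matrix
chunksize = 100000                                 # assumed block width
colchunks = [list(range(j, min(j + chunksize, cols)))
             for j in range(0, cols, chunksize)]
chunkfiles = ['./deepsea_train/chunk_%d.h5' % i for i in range(len(colchunks))]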
for chunkfile, colchunk in zip(chunkfiles, colchunks):
    print(chunkfile)
    chunkfs = h5py.File(chunkfile, 'w')
    tdata = traindata[:, colchunk]  # pull one block of columns into memory
    tds = chunkfs.create_dataset("chunk", (rows, len(colchunk)),
                                 dtype="uint8", data=tdata)
    chunkfs.close()
What’s going on here is that we’re loading our data back, but using what’s known as “delayed evaluation”. The basic idea is that you give the program a long series of instructions; it constructs a computation graph from them, but doesn’t compute any of it until it’s “asked” to.
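As a toy illustration of the idea (my example, not part of the original analysis), nothing below touches any data until compute() is called:
import dask.array as da

lazy = (da.ones((4, 4), chunks=(2, 2)) * 2).sum()  # builds a task graph; computes nothing
print(lazy)                                        # prints a description of the graph, not a number
print(lazy.compute())                              # 32.0; evaluation happens only here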
dsets = [h5py.File(fn, 'r')['/chunk'] for fn in chunkfiles]           # open each chunk file read-only
arrays = [da.from_array(dset, chunks=(919, 1000)) for dset in dsets]  # wrap each dataset as a lazy dask array
Next we concatenate the chunks into one big (still lazy) matrix \(X\) and center it: we compute each row’s mean across the columns and subtract that row-mean vector from every column.
x = da.concatenate(arrays, axis=1)  # one big lazy matrix
mx = da.mean(x, axis=1)             # per-row means
x = x - mx[:, None]                 # broadcast the row means across the columns
This is where we perform the compressed SVD. Instead of performing a complete SVD (which would require the construction of a matrix much larger than memory), we compute only the first 50 singular values and vectors, i.e. the rank-50 approximation \(X \approx U \Sigma V^{T}\). To further improve performance, we use a randomized algorithm that approximates the SVD rather than computing it exactly.
u, s, v = da.linalg.svd_compressed(x, 50, n_power_iter=4)  # still lazy
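For intuition, svd_compressed follows the randomized scheme of Halko, Martinsson and Tropp; here is a small dense numpy sketch of the same idea (my illustration, not dask’s actual implementation):
def randomized_svd(A, k, n_power_iter=4):
    omega = np.random.randn(A.shape[1], k)   # random test matrix
    Y = A.dot(omega)                         # sample the range of A
    for _ in range(n_power_iter):            # power iterations sharpen the spectrum
        Y = A.dot(A.T.dot(Y))
    Q, _ = np.linalg.qr(Y)                   # orthonormal basis for the sampled range
    B = Q.T.dot(A)                           # small k-by-n matrix
    u_small, s, v = np.linalg.svd(B, full_matrices=False)
    return Q.dot(u_small), s, v              # lift back to the original space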
Here’s the line where we actually “ask” dask to compute everything.
cu, cs, cv = da.compute(u, s, v, num_workers=6)  # walk the whole graph on 6 workers
The resulting matrices are pretty big, so we store them in HDF5.
cudf = h5py.File('./train_svd_50_3.h5','w')
cudf["U"] = cu  # left singular vectors
cudf["D"] = cs  # singular values
cudf["V"] = cv  # right singular vectors
cudf.close()
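If we want to sanity-check the factorization later, the factors can be read back and multiplied out. A hypothetical sketch (not in the original analysis) that reconstructs the first ten columns:
with h5py.File('./train_svd_50_3.h5', 'r') as f:
    U = f['U'][:]                     # left singular vectors
    D = f['D'][:]                     # singular values
    Vslice = f['V'][:, :10]           # first ten columns of the right factor
approx = np.dot(U, D[:, None] * Vslice)   # rank-50 reconstruction of those columns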
sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] readr_1.1.0 workflowr_0.4.0 rmarkdown_1.3 dplyr_0.5.0
[5] ggplot2_2.2.1 h5_0.9.8
loaded via a namespace (and not attached):
[1] Rcpp_0.12.10 rstudioapi_0.6 knitr_1.15.1 whisker_0.3-2
[5] magrittr_1.5 hms_0.3 munsell_0.4.3 colorspace_1.3-2
[9] R6_2.2.0 stringr_1.2.0 plyr_1.8.4 tools_3.3.3
[13] grid_3.3.3 gtable_0.2.0 DBI_0.6 git2r_0.18.0
[17] htmltools_0.3.5 assertthat_0.1 yaml_2.1.14 lazyeval_0.2.0
[21] rprojroot_1.2 digest_0.6.12 tibble_1.2 base64enc_0.1-3
[25] curl_2.3 rsconnect_0.7 evaluate_0.10 labeling_0.3
[29] stringi_1.1.2 scales_0.4.1 backports_1.0.5 jsonlite_1.3
This R Markdown site was created with workflowr