Last updated: 2017-03-28

Code version: fa81b47

Libraries

import itertools
import h5py
from sklearn import preprocessing
from itertools import groupby
import numpy as np
from dask.delayed import delayed
import dask.array as da  # referenced as `da` in the code below

Importing data

trainf = h5py.File('./deepsea_train/train.mat','r') # HDF5 file
traindata = trainf['/traindata']
#traindata = trainf['/trainxdata']

One of the quirks of dask is that it’s easier to work with when the data is in individual files, or at least individual HDF5 datasets, so here I split the training matrix into column chunks and write each chunk to its own file.

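# colchunks (column indices per chunk), chunkfiles (output paths) and rows
# (row count of traindata) are not defined in this excerpt.
for chunkfile, cols in zip(chunkfiles, colchunks):
    print(chunkfile)
    # The context manager closes the file even if the write fails.
    with h5py.File(chunkfile,'w') as chunkfs:
        tdata = traindata[:,cols]
        chunkfs.create_dataset("chunk",(rows,len(cols)),dtype="uint8",data=tdata)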
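The loop above relies on colchunks and chunkfiles, which are not shown on this page. As a minimal sketch of how they might be built, assuming contiguous column blocks of a fixed width: the helper name make_column_chunks, the chunk width and the file name pattern below are all hypothetical and not taken from the original analysis.

```python
def make_column_chunks(n_cols, chunk_width):
    """Split column indices 0..n_cols-1 into contiguous lists of indices.

    Lists are used (rather than slices) so that len(chunk) works,
    as it does in the chunk-writing loop.
    """
    return [list(range(start, min(start + chunk_width, n_cols)))
            for start in range(0, n_cols, chunk_width)]


# Tiny illustrative sizes; the real matrix is far wider.
colchunks = make_column_chunks(n_cols=10, chunk_width=4)

# Hypothetical output paths, one HDF5 file per column chunk.
chunkfiles = ["./chunk_{:03d}.h5".format(i) for i in range(len(colchunks))]
```

The last chunk is simply shorter when the width does not divide the number of columns evenly.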

The next step loads the data back, this time using what’s known as “delayed evaluation”. The basic idea is that you give the program a long series of instructions and it constructs a computation graph from them, but it doesn’t compute any of it until it’s “asked” to.

dsets = [h5py.File(fn,'r')['/chunk'] for fn in chunkfiles]  # open read-only
arrays = [da.from_array(dset, chunks=(919, 1000)) for dset in dsets]
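To see delayed evaluation on something small, here is a sketch that wraps an in-memory NumPy array instead of an HDF5 dataset. The array contents and chunk shape are made up for illustration; only from_array, sum and compute from dask are used.

```python
import numpy as np
import dask.array as da

# A small in-memory stand-in for one HDF5 chunk.
data = np.arange(12).reshape(3, 4)

# Wrap it as a dask array split into two column blocks of width 2.
lazy = da.from_array(data, chunks=(3, 2))

# This only records the operations in a task graph; nothing is computed yet.
total = (lazy * 2).sum()

# The graph is executed only when we explicitly ask for the result.
result = total.compute()
```

Until compute is called, total is still a dask array describing the work to be done rather than a number.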

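Our matrix \(X\) is centered before the decomposition. Note that with axis=1 the mean is taken along each row of the array as stored here, so every row has its own mean subtracted.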

x = da.concatenate(arrays,axis=1)
mx=da.mean(x,axis=1)
x=x-mx[:,None]
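The same centering pattern can be checked on a tiny NumPy array. This is only a sketch with made-up numbers; it shows what axis=1 and the [:,None] indexing do in the step above.

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 60.0]])

# axis=1 averages across the columns, giving one mean per row.
mx = x.mean(axis=1)

# mx has shape (2,); mx[:, None] has shape (2, 1), so broadcasting
# subtracts each row's mean from every entry of that row.
centered = x - mx[:, None]
```

After this step every row of centered averages to zero, while the column means are generally not zero.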

This is where we perform the compressed SVD. Instead of performing a complete SVD (which would require the construction of a matrix much larger than memory), we only compute the first 50 singular values and the corresponding singular vectors. To further improve performance, we’re using a randomized algorithm that approximates the SVD rather than computing it exactly.

u,s,v = da.linalg.svd_compressed(x,50,n_power_iter=4)
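To make the idea of a randomized, truncated SVD concrete, here is a minimal NumPy sketch of one common scheme: project onto a random subspace, optionally refine it with power iterations, then take an exact SVD of the small projected matrix. It is an illustration under stated assumptions, not dask's implementation; the function name randomized_svd, the n_oversamples parameter and the test matrix are hypothetical.

```python
import numpy as np


def randomized_svd(a, k, n_power_iter=0, n_oversamples=10, seed=0):
    """Approximate the top-k SVD of a using a random projection."""
    rng = np.random.default_rng(seed)
    n = a.shape[1]

    # Sample the range of `a` with a random Gaussian test matrix.
    omega = rng.standard_normal((n, k + n_oversamples))
    q, _ = np.linalg.qr(a @ omega)

    # Power iterations sharpen the subspace when singular values decay
    # slowly; re-orthogonalizing each time keeps them numerically stable.
    for _ in range(n_power_iter):
        q, _ = np.linalg.qr(a.T @ q)
        q, _ = np.linalg.qr(a @ q)

    # Exact SVD of the small projected matrix, then map back.
    b = q.T @ a
    u_b, s, vt = np.linalg.svd(b, full_matrices=False)
    u = q @ u_b
    return u[:, :k], s[:k], vt[:k, :]


# Illustrative input: an exactly rank-5 matrix of shape 60 x 200.
rng = np.random.default_rng(1)
a = rng.standard_normal((60, 5)) @ rng.standard_normal((5, 200))
u, s, vt = randomized_svd(a, k=5, n_power_iter=2)
```

The expensive factorizations here only ever touch matrices with k + n_oversamples columns, which is why this kind of approach scales to inputs that do not fit in memory.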

Here’s the line where we actually “ask” dask to compute everything.

cu,cs,cv=da.compute(u,s,v,num_workers=6)

The resulting matrices are pretty big, so we store them in an HDF5 file (the singular values go under the key "D").

cudf = h5py.File('./train_svd_50_3.h5','w')
cudf["U"]=cu
cudf["D"]=cs
cudf["V"]=cv
cudf.close()

Session information

sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readr_1.1.0     workflowr_0.4.0 rmarkdown_1.3   dplyr_0.5.0    
[5] ggplot2_2.2.1   h5_0.9.8       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.10     rstudioapi_0.6   knitr_1.15.1     whisker_0.3-2   
 [5] magrittr_1.5     hms_0.3          munsell_0.4.3    colorspace_1.3-2
 [9] R6_2.2.0         stringr_1.2.0    plyr_1.8.4       tools_3.3.3     
[13] grid_3.3.3       gtable_0.2.0     DBI_0.6          git2r_0.18.0    
[17] htmltools_0.3.5  assertthat_0.1   yaml_2.1.14      lazyeval_0.2.0  
[21] rprojroot_1.2    digest_0.6.12    tibble_1.2       base64enc_0.1-3 
[25] curl_2.3         rsconnect_0.7    evaluate_0.10    labeling_0.3    
[29] stringi_1.1.2    scales_0.4.1     backports_1.0.5  jsonlite_1.3    

This R Markdown site was created with workflowr