How to efficiently prewhiten fMRI timeseries the “right” way


I recently got around to reading the high-profile PNAS paper titled “Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates” (Eklund et al. 2016), which called attention to the importance of properly processing and analyzing fMRI data.

This led me to an earlier paper titled “Does parametric fMRI analysis with SPM yield valid results?” (Eklund et al. 2012), which identified mismodeling of temporal autocorrelation as a major factor driving high false positive rates.  From a statistical perspective, accounting for temporal autocorrelation is vital for obtaining correct standard errors on model coefficients, which are the basis for the hypothesis tests used to identify areas of activation in the brain, a primary goal in many fMRI studies.  (In fact, many group analyses ignore first-level standard errors altogether, but these should ideally be incorporated into the second-level model.)

The 2012 paper pointed out two primary reasons for mismodeling of temporal autocorrelation: (a) using too low a model order, such as AR(1), especially for faster TRs, and (b) assuming the same AR coefficients across the brain.  They also recommend spatially smoothing the estimated AR coefficients to improve the estimates.

As a statistician, I was excited to try out this “proper” approach to prewhitening.  Increasing the model order was no big deal (I used an AR(6) model for TR=0.72 seconds), but doing the prewhitening at the voxel level was a bit of a hassle.  After much trial and error and a few new gray hairs, here’s what I’ve learned.  Some of the computational strategies are specific to R, but they can probably be easily adapted to MATLAB or Python.

  1. AR coefficients definitely seem to vary across the brain.
  2. Averaging across subjects greatly improves the smoothness of the coefficient maps, which is presumably indicative of better estimation.
  3. The prewhitening matrix (the square root of the inverse covariance matrix) needs to be computed quickly, since you’re going to be computing it separately for every voxel in the brain.

I’ll go into each of these in turn below, but first, let’s review why and how prewhitening is actually performed.  fMRI task activation studies typically rely on fitting a regression model at many different locations in the brain, and ordinary least squares regression assumes uncorrelated errors.  This assumption is the basis of how we compute standard errors of coefficient estimates, which in turn are the basis of hypothesis testing, which allows us to identify regions of the brain that are active in response to different tasks.  The validity of these hypothesis tests depends primarily on (1) how correction for multiple comparisons was performed (the focus of the 2016 paper mentioned above) and (2) the accuracy of the standard errors of the model coefficients.  If we ignore temporal autocorrelation, we will underestimate the true standard errors and obtain more false positives.
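
To see this concretely, here’s a toy R example (my own, not from either paper): with a smooth block-design regressor and positively autocorrelated AR(1) errors, the nominal OLS standard error understates the true sampling variability of the estimated coefficient.

set.seed(1)
nT <- 200
x  <- rep(rep(0:1, each = 20), length.out = nT)        # boxcar "task" regressor
bhat <- se_nom <- numeric(2000)
for (s in 1:2000) {
  y <- as.numeric(arima.sim(list(ar = 0.5), n = nT))   # AR(1) noise, true beta = 0
  fit <- summary(lm(y ~ x))
  bhat[s]   <- fit$coefficients["x", "Estimate"]
  se_nom[s] <- fit$coefficients["x", "Std. Error"]
}
sd(bhat)      # empirical SD of the estimates (the "true" standard error)
mean(se_nom)  # average nominal OLS standard error (noticeably smaller)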

So how do we fix this?  If y (T×1) is the time series for a single voxel and X (T×k) is the design matrix containing an intercept, column(s) corresponding to the expected BOLD response under the task paradigm, and several columns for nuisance parameters such as motion and drift, then the regression model we want to estimate is y = Xβ + ε, where ε ~ Normal(0, σ²V).  The β coefficients corresponding to the task-related columns of X represent the level of activation in response to each task at this particular voxel.  V is a T×T temporal correlation matrix.  If we knew V, we could simply pre-multiply both sides of the equation by W = V^(-1/2), giving Wy = WXβ + Wε, where Wε has variance/covariance matrix σ²WVW = σ²I (W is symmetric).  This results in a regression model with uncorrelated errors, which we can fit as usual to obtain estimates and valid standard errors for β.  (Ideally this is iterated a few times: re-estimate the AR coefficients from the prewhitened residuals and pre-multiply again by the new W at each iteration.)

So how do we find W?  We first fit the original regression equation y = Xβ + ε to get the residuals, then fit an AR(p) model to the residuals to get the p AR coefficient estimates.  We can then compute V^(-1) directly (it has a simple form; see point 3 below) and take its square root to get W.  Simple enough.

The challenge is how to do this for every voxel in the brain and every subject in our analysis without it becoming a major computational burden.  If we intend to average the coefficients across our n subjects (which we probably should – see point 2 below), then our workflow should look like:

(a) estimate the p AR coefficients for each location in the brain and each subject (resulting in a total of p*n images)

(b) average them across subjects (resulting in a total of p images)

(c) compute the prewhitening matrix W for each location in the brain (resulting in N T×T matrices, where N is the number of voxels in the brain), and

(d) fit the regression model Wy = WXβ + Wε at each voxel to get estimates and standard errors for β for each subject and voxel, which can then be passed on to a group-level model.

If this sounds like a lot of work, here are several reasons to do it anyway:

1. AR coefficients vary across the brain.

For my AR(6) model fit on motor task data from the Human Connectome Project, here are the estimates of the AR coefficients (from a single iteration), averaged across ~500 subjects.  Each image is on a different scale, since the magnitude of the coefficients tends to decrease with the time lag.  What all of the coefficients share, however, is striking spatial variation.  And the differences are not minor – the first coefficient varies from nearly zero in the medial temporal lobe (MTL) to over 0.2 in the angular gyrus and occipital cortex, while the second coefficient is largely positive but shows areas of negative autocorrelation in the MTL and the inferior frontal cortex.

 

[Figures: estimated AR coefficient maps, Coefficient 1 through Coefficient 6]

2. Efficiency of coefficient estimates can be improved by averaging across subjects.

[The term “efficiency” in the title of this post has a double meaning: estimation efficiency has to do with obtaining estimates that have low variance, meaning they are close to the truth (assuming they are unbiased), while computational efficiency (see point 3) is the ability to compute stuff quickly.  Both are important, but for very different reasons.  Estimation efficiency matters because it influences accuracy of “activated regions”; computational efficiency matters because time is scarce.]

For the same AR(6) model, here are the estimates of the first AR coefficient averaged across 500 subjects, 20 subjects, and for a single subject.  The averaged images are much smoother than the single-subject image.  Another option would be to perform a small amount of spatial smoothing, especially if you want to allow for differences across subjects.

[Figures: first AR coefficient map, averaged across 500 subjects, averaged across 20 subjects, and for a single subject]

3. The prewhitening matrix can be computed quickly for each voxel.

Hopefully you’re now convinced that voxel-wise prewhitening and averaging coefficient estimates across subjects are important.  But if it takes weeks to run and debug, let’s get real – it’s probably not going to happen.  Luckily, each of the four steps in the workflow can be done reasonably quickly:

(a) Estimate the p AR coefficients for each location in the brain and each subject.  Instead of looping through all voxels to obtain residuals, this can be done very quickly by keeping everything in matrix form.  If Y is the T×V matrix of voxel time series, then B = (X’X)^(-1)X’Y is the k×V matrix of coefficient estimates, and E = Y - XB is the T×V matrix of residuals.  Then you just need to loop over voxels to fit an AR model to each column of E.
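
Here’s a minimal R sketch of this step for a single subject (variable names are my own; it assumes Y, X, and the AR order p are already in memory):

# Step (a) for one subject: OLS residuals in matrix form, then an AR(p) fit per voxel
p <- 6                                        # AR model order
B <- solve(crossprod(X), crossprod(X, Y))     # k x V coefficient matrix, (X'X)^(-1) X'Y
E <- Y - X %*% B                              # T x V residual matrix
ar_coefs <- matrix(NA, nrow = p, ncol = ncol(Y))
for (v in seq_len(ncol(Y))) {
  fit <- ar(E[, v], aic = FALSE, order.max = p, method = "yule-walker")
  ar_coefs[, v] <- fit$ar                     # p AR coefficients for voxel v
}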

(b) Average the coefficients across subjects. Sum and divide by n.

(c) Compute the prewhitening matrix W for each location in the brain. I chose to do this by looping through voxels and computing W for each voxel.  This means that for each voxel, you’ll end up computing the same W n times, once for each subject.  Alternatively, the N different prewhitening matrices (one for each voxel) can be computed once and stored, which can be done efficiently using sparse matrices with the Matrix package in R.  But actually W can be computed very quickly, so it’s no big deal to do it on the fly.

To compute W, you first need to compute V^(-1).  For an AR(p) model with coefficients φ1, …, φp, V^(-1) is (proportional to) AA’, where A is a banded lower-triangular matrix with 1’s on the main diagonal, -φ1 on the first lower off-diagonal, -φ2 on the second lower off-diagonal, etc., and zeros everywhere else (including the entire upper triangle).  (You might notice that the main diagonal of the resulting V^(-1) has all its elements equal except for the first p and last p elements.  This is a boundary effect due to fitting an AR(p) model with fewer than p preceding time points at the beginning and end of the timeseries.  The first p time points of W, y and X should therefore be discarded prior to model fitting.)
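
As a rough sketch, here’s one way to build this matrix in R using sparse matrices from the Matrix package (phi holds the p averaged AR coefficients and nT is the number of time points; the function name is my own):

library(Matrix)
# Construct the banded matrix A and the (approximate) precision matrix V^(-1) = AA'
getInvV <- function(phi, nT) {
  A <- Diagonal(nT)                                     # 1's on the main diagonal
  for (j in seq_along(phi)) {
    # -phi_j on the j-th lower off-diagonal
    A <- A + sparseMatrix(i = (j + 1):nT, j = 1:(nT - j), x = -phi[j], dims = c(nT, nT))
  }
  A %*% t(A)                                            # proportional to V^(-1)
}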

Once you have V^(-1), compute its eigenvectors U and eigenvalues d (via an eigendecomposition or singular value decomposition) to write V^(-1) = UDU’, where D = diag{d}.  Then W is simply V^(-1/2) = UD^(1/2)U’.  This takes about 0.2 seconds, compared with about 30 seconds using the sqrtm() function in R.
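
In R, that step might look like this (a sketch; V_inv is the sparse matrix returned by getInvV() above):

# W = V^(-1/2) via eigendecomposition: V^(-1) = U D U', so W = U D^(1/2) U'
getW <- function(V_inv) {
  ei <- eigen(as.matrix(V_inv), symmetric = TRUE)
  ei$vectors %*% diag(sqrt(pmax(ei$values, 0))) %*% t(ei$vectors)  # guard tiny negative eigenvalues
}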

(d) Fit the regression model Wy = WXβ + Wε at each voxel.  Here it’s probably easiest to just loop through voxels, since W and therefore WX differ for each voxel, so the matrix-based approach used in (a) no longer works.  Since W is fast to compute, this isn’t so bad.
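
A sketch of the per-voxel loop for one subject, using the helper functions above (avg_coefs is the p x V matrix of averaged AR coefficients):

# Step (d) for one subject: prewhiten and refit the model at each voxel
nT <- nrow(Y)
p  <- nrow(avg_coefs)
beta_hat <- se_hat <- matrix(NA, nrow = ncol(X), ncol = ncol(Y))
for (v in seq_len(ncol(Y))) {
  W   <- getW(getInvV(avg_coefs[, v], nT))
  Wy  <- as.numeric(W %*% Y[, v])[-(1:p)]      # discard the first p time points
  WX  <- (W %*% X)[-(1:p), , drop = FALSE]
  fit <- lm(Wy ~ WX - 1)                       # X already contains an intercept column
  beta_hat[, v] <- coef(fit)
  se_hat[, v]   <- sqrt(diag(vcov(fit)))
}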

That’s about it!  If you have an experience or perspective to share, please add it in the comments.  I would love to hear your thoughts.

R function to write CIFTI files


As I wrote about previously, CIFTI files are a new and useful file format for storing, analyzing and visualizing imaging data on the cortical surface AND subcortical areas (gray matter structures and cerebellum).  Much of the imaging data available through the Human Connectome Project is in CIFTI format, so anyone who wants to be able to work with that data needs to be able to work with CIFTI files.  Some tools have been developed to do this, but they are not yet available for R (to my knowledge at the time of writing).  My solution until recently was to just use MATLAB for all of my analyses, but I’m working on a new project that uses INLA, which has been programmed in R.  Cue the tedious back-and-forth between R and MATLAB.  Today I finally had enough, so I wrote up a quick function to write CIFTI file(s) from a vector or matrix of values.  (For an intro on how to use MATLAB from within R, see this previous post.)  For visualization purposes, the Connectome Workbench is still needed, but this has drastically streamlined my workflow.  Later I may wrap this into a package, but for now here’s the function code.

You’ll need to replace the default values for the template and toolboxloc arguments, but I leave them in as an example.  The CIFTI file you point to should be of the lower resolution variety.  As mentioned in the function comments, this assumes you have MATLAB installed and can run it from the command line (I have “module load matlab” in my .bashrc file).  You’ll also need to download the cifti-matlab or fieldtrip toolbox, both of which include the ft_read_cifti and ft_write_cifti functions.  I use cifti-matlab because fieldtrip has a bunch of other stuff I don’t need.  You also need a fair amount of memory to read and write CIFTI files, so I usually do this on the cluster.  My laptop has 8GB of RAM, which is not enough, but my desktop has 32G, and this seems to work.  Once I have the CIFTI files written, I just transfer them to my machine to visualize them with the Connectome Workbench.

Comments and suggestions welcome!

makeCIFTIs <- function(table, names, hemisphere='both', 
 template = '~/HCP/data/groupICA/groupICA_3T_Q1-Q6related468_MSMsulc_d25.ica/melodic_IC_ftb.dlabel.nii', 
 toolboxloc = '~/matlab_toolboxes/cifti-matlab/',
 scratchloc = '~/'){

 # table = VxQ matrix, each column will be written as a CIFTI file
 # table can be a vector if only a single CIFTI is needed
 # V = number of voxels in both or one hemisphere
 # Q = number of CIFTI files to be written
 # names = vector length Q, containing file names to be written (no extension)
 # hemisphere = 'both', 'left', 'right', or 'all' (includes subcortical voxels)
 # if hemisphere='both', the first half of table should be the left hemisphere
 # template = name of existing CIFTI file that can be used as a template
 # toolboxloc = location and name of folder containing cifti-matlab or fieldtrip toolbox
 # scratchloc = location where CSV files will be written then deleted

 # this function writes a MATLAB script and executes it
 # must have MATLAB installed
 # must have the cifti-matlab or fieldtrip toolbox installed
 # currently can only write CIFTIs with a vector of values (e.g. not a timeseries or a set of label maps)

if(is.vector(table)) table <- matrix(table, nrow=length(table), ncol=1)
 Q <- ncol(table)
 V <- nrow(table)

 if(hemisphere=='all') if(V != 96854) stop('Number of rows must equal 96854 with hemisphere="all"')
 if(hemisphere=='both') if(V != 64984) stop('Number of rows must equal 64984 with hemisphere="both"')
 if(hemisphere=='left') if(V != 32492) stop('Number of rows must equal 32492 with hemisphere="left"')
 if(hemisphere=='right') if(V != 32492) stop('Number of rows must equal 32492 with hemisphere="right"')

# Write table to CSV in scratch location
 tmp <- table
 tmp[is.na(tmp)] <- 0 #set NAs to zero
 fname <- file.path(scratchloc, 'tmp.csv')
 write.table(tmp, file=fname, row.names=FALSE, col.names=FALSE, sep=',')

 # Write MATLAB script

 line1 <- paste0("addpath '", toolboxloc, "'")
 line2 <- paste0("cifti = ft_read_cifti('",template,"');") #reads in structure with cell 'indexmax'
 line3 <- paste0("indsL = (cifti.brainstructure == 1);", #L cortex only
 "indsR = (cifti.brainstructure == 2);", #R cortex only
 "indsLR = (cifti.brainstructure <= 2);", #L and R cortices only
 "inds_nan = isnan(cifti.indexmax);") #location of NaNs

 line5 <- paste0("cd '", getwd(), "'")
 line6 <- paste0("fname = '",fname,"';",
 "coefs = csvread(fname);")

 line7 <- paste0("names = {'",paste(names, collapse="';'"),"'};")

 line8 <- paste0("for k=1:",Q)
 line9 <- "k"
 line10 <- "vals = coefs(:,k);"
 line11 <- "cifti.indexmax = cifti.indexmax*0;"

 if(hemisphere=='all') line12 <- "cifti.indexmax = vals;"
 if(hemisphere=='left') line12 <- "cifti.indexmax(indsL) = vals;"
 if(hemisphere=='right') line12 <- "cifti.indexmax(indsR) = vals;"
 if(hemisphere=='both') line12 <- "cifti.indexmax(indsLR) = vals;"

 line13 <- "cifti.indexmax(inds_nan) = 0/0;" #put in NaNs
 line14 <- "ft_write_cifti(names{k}, cifti, 'parameter', 'indexmax');"
 line15 <- "end"

 matlab_lines <- c(line1, line2, line3, line5, line6, line7, line8,
 line9, line10, line11, line12, line13, line14, line15)
 writeLines(matlab_lines, con=file.path(scratchloc, 'myscript.m'))

 system("matlab -nodisplay -r \"run('~/myscript.m'); exit\"")

 file.remove(fname)
 file.remove(file.path(scratchloc, 'myscript.m'))
}
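
For example (with made-up map names and a hypothetical template path), writing two CIFTI files from a 64984x2 matrix of cortical surface values would look something like:

vals <- matrix(rnorm(64984 * 2), ncol = 2)   # two maps over both cortical surfaces
makeCIFTIs(table = vals, names = c('map1', 'map2'), hemisphere = 'both',
           template = '~/HCP/templates/some_template.dlabel.nii',   # hypothetical path
           toolboxloc = '~/matlab_toolboxes/cifti-matlab/')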

And since a pretty picture always helps, here’s one of the products of my current analysis based on a CIFTI file I wrote this way, visualized using the Connectome Workbench.

[Figure: beta1_sphere.png]

A layman’s guide to working with CIFTI files


[Update: Here is another nice intro to CIFTI files, by Jo Etzel at WashU.]

My research group recently began working with the Human Connectome Project (HCP) dataset, a large database of resting-state fMRI, task fMRI and other brain imaging and demographic data for 500+ subjects.  The HCP is pretty unique in that it combines a large number of subjects with long scans and a standardized scanning protocol.  This distinguishes it from other large neuroimaging datasets, such as ABIDE, ADHD-200 or the 1000 Functional Connectomes Project, which are grassroots compilations of data collected across multiple sites with different scanning equipment, acquisition protocols, preprocessing pipelines and quality control procedures.  Most of the resting-state scans in those datasets are 5-10 minutes long, whereas in the HCP there are 60 minutes of resting-state data collected for each subject.  

Our group and many others are interested in using the HCP to develop and evaluate different ways of analyzing brain imaging data.  However, some aspects of the HCP are new to many researchers, and it’s not always obvious from an outside perspective how to work with and interpret the data.  One example of this is data stored in “grayordinates” using the CIFTI file format.  While the HCP also contains data in volumetric space (in NIFTI file format), working with the grayordinates is appealing for a number of reasons that I won’t elaborate on here, except to say that the idea is to isolate the gray matter, which consists of cortical, subcortical and cerebellar areas.  However, working with CIFTI files can be daunting for a new user, and the documentation is not always very user-friendly.

I recently started working with these files, so based on my experience here is a basic introduction to navigating, reading and writing CIFTI files with MATLAB.  This explanation is based on my limited experience, but it will hopefully be helpful for other novice users.  Please feel free to make additions or corrections in the comments!

Part 1: Navigating CIFTI files

The first thing to know about CIFTI files is that they contain several parts, similar to a structure in MATLAB or a list in R.  In that sense, they are very different from NIFTI files, which are essentially just 3D or 4D arrays of intensities.  Furthermore, there are several types of CIFTI files, including time series (*.dtseries.nii), parcellations (*.dlabel.nii), and scalar images (*.dscalar.nii), and the parts or “fields” contained in each type of CIFTI file are different (though they have several commonalities).

Consider a CIFTI file representing time series data.  The time series themselves are stored in a VxT matrix, where V is the number of grayordinate “voxels” and T is the number of time points in the series.  (Technically “voxels” only refer to volumetric space, but it’s convenient to have a single term for both surface and volumetric elements… please forgive the abuse of notation.)  The total number of voxels in a CIFTI file is around 90,000, including about 60,000 surface voxels (about 30,000 per hemisphere) and 30,000 subcortical and cerebellar voxels.  The surface voxels are a 2D representation of the cortical gray matter, while the subcortical voxels are still in 3D/volumetric space.  Each surface grayordinate is essentially an average of several cortical gray matter voxels in the original volumetric scan.  So what else does a CIFTI file contain?  Common to all types of CIFTI files are the following fields:

  • brainstructure: a vector of length V with a numerical indicator (1-21) for the major brain structure that each voxel forms part of
  • brainstructurelabel: a vector of length 21 with the name of each major brain structure (e.g. CORTEX_LEFT, CORTEX_RIGHT, CAUDATE_LEFT, CAUDATE_RIGHT, CEREBELLUM_LEFT, CEREBELLUM_RIGHT, etc.)
  • pos: a Vx3 matrix of the x-y-z coordinates of each voxel.  However, only the subcortical and cerebellar voxels have coordinates specified; for surface voxels, the values are missing.

In addition, each type of CIFTI file contains additional information.

  • A time series (*.dtseries.nii) CIFTI file contains time (1xT) and dtseries (VxT), where time is the timing (in seconds) of each time point (0, 0.72, 1.44, …) and dtseries is the time series of each voxel.
  • A parcellation (*.dlabel.nii) CIFTI file contains some parcellation field myfield (Vx1) and myfieldlabel (1xQ), where Q is the number of parcels.  For example, in the HCP groupICA parcellation files (melodic_IC_ftb.dlabel.nii), the parcellation field is named “indexmax”, and Q is the number of ICA components.  (As a side note, I believe “indexmax” refers to how spatial components from a groupICA were used to create a parcellation, namely by identifying the component with the maximum z-score at each voxel.)  In the FreeSurfer anatomical parcellation files ending in aparc.32k_fs_LR.dlabel.nii, the parcellation field is named “x100307_aparc”, and Q=70.
  • A scalar (*.dscalar.nii) CIFTI file contains one or more fields field1 (Vx1), field2 (Vx1), etc., each representing a scalar image of some sort.  For example, in the HCP, groupICA spatial maps are represented as this type of file, with the fields x1, …, xQ representing the Q spatial maps.  Structural images for each subject are also represented as .dscalar files.  For example, the files ending in MyelinMap_BC.32k_fs_LR.dscalar.nii contain the field myelinmap_bc, which contains the estimated myelin content of each of the roughly 60,000 surface voxels.

You might have noticed that I never mentioned where CIFTI files store the location of each surface voxel.  That’s because they don’t!  This is because the surface can be “inflated” to varying degrees, from very “wiggly” (showing full anatomical detail) to completely flattened (showing no anatomical detail).  Depending on the degree of inflation, the 3-dimensional location of each surface voxel changes!  For a given degree of inflation, the x-y-z locations of the surface voxels are stored in a different type of file, called a GIFTI.  I’ll mention below how to read and navigate these files in MATLAB too.

Part 2: Reading CIFTI files with MATLAB

I use the FieldTrip toolbox, which includes the function ft_read_cifti.  Once you have the toolbox downloaded and added to the path, reading a CIFTI file is as simple as mycifti = ft_read_cifti(‘fname.dsomething.nii’).  mycifti is then a MATLAB structure containing fields such as the ones described above.  Each field can be accessed or altered with the “.” operator, i.e. mycifti.fieldname.

As mentioned above, CIFTI files do not contain the x-y-z coordinates of the surface voxels.  These are instead contained in GIFTI files with names ending in “surf.gii”.  In the HCP dataset, the level of inflation is indicated in the file name and includes “flat”, “inflated”, “mid thickness”, and “very_inflated”.  The left- and right-hemispheres are stored in separate GIFTI files.  The GIFTI toolbox can be used to read GIFTI files into MATLAB.  After installing and adding the toolbox, mygifti = gifti(‘fname.surf.gii’) results in a MATLAB structure with several fields, including vertices, a VLx3 matrix containing the x-y-z coordinates of each voxel, where VL is around 30,000, the number of surface voxels in one hemisphere of the brain.  To map the voxels in a GIFTI file to the corresponding voxels in a CIFTI file, reference the brainstructure field: all voxels with brainstructure==1 (CORTEX_LEFT) map to those in a GIFTI file for the left hemisphere; all voxels with brainstructure==2 (CORTEX_RIGHT) map to those in a GIFTI file for the right hemisphere.

Part 3: Writing CIFTI files with MATLAB

I have found this to be the most challenging aspect of working with CIFTI files, in part because the documentation for the ft_write_cifti function seems to be somewhat limited.  Furthermore, since there are different types of CIFTI files, you cannot simply run ft_write_cifti(‘fname’, mycifti), even if mycifti is simply the result of mycifti = ft_read_cifti(‘fname.d*.nii’).  Instead, you must provide some additional information in the form of “optional” key-value pairs.  Also, in general I’ve found that you must respect the original file type.  For example, to save a parcellation you should start by reading in a .dlabel file; to save a static image you should start by reading in a .dscalar.nii file; etc.  Here are a few examples.

  • ft_write_cifti(‘fname’, mycifti, ‘parameter’, ‘dtseries’) saves the object mycifti, which has a dtseries field containing the functional data, as the CIFTI file fname.dtseries.nii.
  • ft_write_cifti(‘fname’, mycifti, ‘parameter’, ‘x1’, ‘parameter’, ‘x2’) saves the object mycifti, which has fields x1 and x2 containing two static images, as the CIFTI file fname.dscalar.nii.
  • ft_write_cifti(‘fname’, ‘mycifti’, ‘parcellation’, ‘parcels’, ‘parameter’, ‘parcels’) or ft_write_cifti(‘fname’, ‘mycifti’, ‘parameter’, ‘parcels’) saves the object mycifti, which has a field parcels containing the labels of each voxel, as the CIFTI file fname.dscalar.nii.  (I haven’t figured out how to save a .dlabel file.  Running ft_write_cifti(‘fname’, ‘mycifti’, ‘parcellation’, ‘parcels’) throws an error that indicates that ‘parameter’ is required.  Specifying ‘parameter’ and ‘parcellation’ or just ‘parameter’ results in a .dscalar file.  If anyone knows how to do this, I’d love to know!)

One aspect of working with CIFTI files that I have conspicuously ignored is visualization.  This is obviously a very important aspect of working with imaging data, but for now let’s just say it’s next on my to-do list.  Luckily another member of our group, Mary Beth Nebel, has started working with the Connectome Workbench, and has produced some beautiful figures, including the ones below.  I hope to pick her brain soon and report back!

Scan-rescan reliability by ICA parcel (Q=100):

[Figure: I2C2_2400]

Group ICA parcellations:

[Figure: ICAmaps]

Derivation and interpretation of leverage in SLR


I was recently looking online for an explicit derivation of leverage – the diagonals of the “hat matrix” – in simple linear regression. If X is the n×p matrix of explanatory variables in a linear model, then the hat matrix is H = X(X’X)^(-1)X’, so called because it puts the “hat” on the predicted values, since Ŷ = HY.

Leverage (h_i) has a lot of nice properties that can be quite informative for model diagnostics. For example, if Y_i were to change by 1 unit, then Ŷ_i would change by h_i. Leverage is also proportional to the uncertainty in the predicted values, since Var(Ŷ) = σ²H, where σ² is the error variance.

In the simple linear regression (SLR) case, if your model contains an intercept then leverage is of the form

h_i = 1/n + (x_i − x̄)² / Σ_j (x_j − x̄)².

Therefore, leverage is similar to Mahalanobis distance, which measures distance from the “center” of the data, standardized by the “scale” or variance. The same is true for multiple linear regression with an intercept. However, the relationship isn’t entirely obvious (at least to me) from the hat matrix formula, and I couldn’t find the proof anywhere online (I’m sure it’s in someone’s linear models class notes, but I didn’t find it from my digging around). So I wrote it up and posted it to my Resources page.
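
For anyone curious, here is a condensed version of the algebra (my own sketch, assuming a design matrix X = (1, x) with an intercept and a single predictor x):

\[
X'X = \begin{pmatrix} n & \sum_j x_j \\ \sum_j x_j & \sum_j x_j^2 \end{pmatrix},
\qquad
(X'X)^{-1} = \frac{1}{n\sum_j (x_j - \bar{x})^2}
\begin{pmatrix} \sum_j x_j^2 & -\sum_j x_j \\ -\sum_j x_j & n \end{pmatrix},
\]
\[
h_i = (1, x_i)\,(X'X)^{-1}\begin{pmatrix} 1 \\ x_i \end{pmatrix}
    = \frac{\sum_j x_j^2 - 2x_i\sum_j x_j + n x_i^2}{n\sum_j (x_j - \bar{x})^2}
    = \frac{\sum_j (x_j - \bar{x})^2 + n(x_i - \bar{x})^2}{n\sum_j (x_j - \bar{x})^2}
    = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}.
\]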

One interesting thing to note: if your model does not contain an intercept, this relationship doesn’t hold.  However, if you center the X’s, the relationship above again holds, but without the 1/n term at the beginning.

Tips for submitting to arXiv for the first time


Today I successfully submitted my first paper to arXiv!  We’ve submitted this paper to a journal, but it hasn’t been published yet, so we wanted to get a pre-print up before advertising the corresponding software packages.  Unfortunately, the process of submitting to arXiv wasn’t painless.  Now that I’ve figured out some of the quirks, however, hopefully your first submission can go a little more smoothly than mine did.

After registering, the first unpleasant surprise was that arXiv doesn’t accept PDF documents created from LaTeX code.  You must submit your LaTeX code and all supporting files.  As far as I can tell, these have to be uploaded one-by-one.  Sorry, I didn’t figure out a way around that one (but let me know if you do!).

After verifying that my code compiled without errors on my computer, I uploaded all my documents, hit “Process”, and … long page of errors!  It turns out arXiv doesn’t accept .bib files and instead requires .bbl files for BibTeX.  So I thought, “this will be easy” and uploaded my .bbl file instead.  I again hit “Process”, and … another long page of errors!

I wasn’t sure how to interpret these errors, so … I gave up for a while.  A few weeks later, I decided to give it another go.  I realized that I had made a couple of mistakes the first time:

  1. Several of my graphics were in PDF format.  This is fine, but I needed to include this line at the top of my preamble (top of my .tex document) to ensure that the .tex file is built using PDFLaTeX instead of LaTeX:
    \pdfoutput=1
  2. I had probably uploaded some unnecessary files, such as graphics that weren’t included in the final version.  ArXiv doesn’t like this.  Make sure only necessary files are uploaded.

Fixing those two things finally resulted in an error-free compilation.  Hooray!  (If at this point you are still struggling with compilation errors, I highly recommend reading arXiv’s help page on Considerations for TeX Submissions.)

One final annoyance was the abstract.  I had included an abstract in my document, but arXiv has its own abstract form, and my current abstract exceeded the character limits.  Once I had pared it down, however, I got a couple of errors related to “unknown characters”.  I didn’t have anything crazy in the text, so this seemed a little ridiculous, but I went ahead and removed a ‘%’ sign and a couple of quotation marks and hyphens, which took care of the errors. I then went back and removed the abstract from the .tex document and re-uploaded it.  Submission (finally) complete!

Props to arXiv for a very nice set of help pages, the most relevant of which for this post is Submitting an Article.  This includes links to the TeX Submissions help page referenced above, among many others.

P.S. If you’re curious about the paper, it’s titled “Improving Reliability of Subject-Level Resting-State fMRI Parcellation with Shrinkage Estimators” and is available here.

Three ways to use MATLAB from R


Being a statistician working in neuroimaging is a little like living abroad and trying to speak a foreign language. For example, my first language is English, but I spent my first summer as a PhD student doing research at LMU in Munich, Germany. I had taken German in college and could have really basic conversations, but for more demanding situations I would switch to English in a heartbeat, given the option.

In the world of neuroimaging, or at least many corners of it, MATLAB is the official language. Many tools and programs for the analysis of fMRI data are programmed in MATLAB, and many of my collaborators program exclusively in MATLAB. I know MATLAB well enough to do basic things like reading in data, looping and playing with matrices, but beyond the basics, like many statisticians I am much more comfortable in R. Therefore, I often find myself straddling the line between R and MATLAB.

For example, in one of my projects, I used R to read in the fMRI time series, create summary statistics, and do some transformations on those statistics. I then used MATLAB to perform clustering on the results using a method developed by one of my collaborators. Finally, I again used R to visualize the results using brainR.  For that project, it was convenient enough to write separate programs for the different R and MATLAB parts. I saved the results of the first R part in a CSV file using write.table, read that file into MATLAB using csvread, saved the output of the MATLAB part in another CSV file, and finally read that into R using read.table. Not exactly elegant, but it did the job.

However, I also created and ran a simulation that required the same basic process but over many iterations, and the manual workflow I just described was not a realistic option. I needed a way to run MATLAB automatically from within my R program. (In my case, I eventually compromised and did the clustering in R for the purposes of the simulation, a la option 4 below.) I experimented with different ways to use MATLAB from within R, and I found three primary options. Below I’ll show how to do each of these and discuss their pros and cons, but in short the options are (1) to execute a single MATLAB line using the system() command in R; (2) to use a package like R.matlab to send code to the MATLAB server; or (3) to write out a MATLAB program within R using the writeLines() function and run the entire program using the system() command in R.

Option 1: Run a single MATLAB command at a time using system()

The system command in R runs the contents of a character string as a shell command – basically anything you’d type into a Terminal window on a Mac or in a Linux environment, such as the Hopkins cluster. For example, typing system("ls") will return a list of the files in the working directory. If your system has the ability to run MATLAB from the command line (true for the Hopkins cluster), you can take advantage of that through R by typing system("matlab -nodisplay -r 'stuff; to; do; in; matlab;'").

Tip: If you’re working locally on a Mac with MATLAB installed, you’d need to go to the directory where MATLAB is installed: for example, this works on my computer to open MATLAB on the command line: /Applications/MATLAB_R2013a.app/bin/matlab -nodisplay

Pros!

  • Quick and easy
  • Possible to run multiple lines, separated by semicolons

Cons😦

  • All commands must be contained within a single line
  • Loops and line-to-line dependencies are somewhat tricky
  • Can’t pass data between MATLAB and R
  • If there is an error in your code, the execution will stop and MATLAB will remain open until you re-type exit.
system('matlab -nodisplay -r "a=2; b=1; display(a+b); exit"')
                                 < M A T L A B (R) >
                       Copyright 1984-2013 The MathWorks, Inc.
                         R2013b (8.2.0.701) 64-bit (glnxa64)
                                   August 13, 2013


To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.


ans =

     3

To see an example of the last con – “If there is an error in your code, the execution will stop and MATLAB will remain open until you re-type exit” – suppose I had mis-spelled display as displya:

system('matlab -nodisplay -r "a=2; b=1; displya(a+b); exit;"')
                                 < M A T L A B (R) >
                       Copyright 1984-2013 The MathWorks, Inc.
                         R2013b (8.2.0.701) 64-bit (glnxa64)
                                   August 13, 2013


To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.


Undefined function 'displya' for input arguments of type 'double'.

>>

The >> prompt indicates that we’re still in the MATLAB environment. Simply type exit to close the MATLAB connection and return to R.

Option 2: Use R.matlab to send code to the MATLAB server

R.matlab is a package that communicates with MATLAB, can read and write MAT files, and can pass objects to (and receive objects from) MATLAB. It’s a little tricky to use at first, but the ability to pass objects between R and MATLAB can be very useful. However, this ability is limited to objects of a reasonable size, and can be slow.

Pros!

  • Can spread commands over multiple lines
  • Loops and dependencies are easier
  • Can pass data back and forth using setVariable() and getVariable()

Cons😦

  • Can be tricky to use
  • Connection to the MATLAB server can be buggy
  • Clunky output
  • Can’t run multiple jobs in parallel on the same node, since the MATLAB server can only be opened once per node, making it limited for simulations
  • Passing large objects back and forth doesn’t always work
# load the R.matlab library and start MATLAB server
library(R.matlab)
Matlab$startServer()
matlab <- Matlab()
isOpen <- open(matlab)
                            < M A T L A B (R) >
                  Copyright 1984-2013 The MathWorks, Inc.
                    R2013b (8.2.0.701) 64-bit (glnxa64)
                              August 13, 2013

To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.

Running MatlabServer v2.2.0
MATLAB v7.x or higher detected.
Saving with option -V6.
Added InputStreamByteWrapper to dynamic Java CLASSPATH.
----------------------
MATLAB server started!
----------------------
Trying to open server socket (port 9999)...done.
Connected to client.
if (!isOpen) throw("MATLAB server is not running: waited 30 seconds.")

# set a variable in R and send to MATLB
x <- 10
setVariable(matlab, x = x)
Received cmd: 3
Will read MAT file: "/tmp/RtmpK3nM1k/file1821f2a4d0339.mat"
evaluate(matlab, "x")
Received cmd: 1
"eval" string: "x"

x =

    10

Sent byte: 0
# set a couple of variables in MATLAB
evaluate(matlab, "y=20; z=x+y")
Received cmd: 1
"eval" string: "y=20; z=x+y"

z =

    30

Sent byte: 0
# pass a variable from MATLAB to R
z <- getVariable(matlab, "z")
Received cmd: 1
"eval" string: "variables = {'z'};"
Sent byte: 0
Received cmd: 2
save(tmpname, '-V6', 'z')
z
$z
     [,1]
[1,]   30

attr(,"header")
attr(,"header")$description
[1] "MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Fri Aug 15 09:57:56 2014                                        "

attr(,"header")$version
[1] "5"

attr(,"header")$endian
[1] "little"

answer=0
# close the MATLAB server
close(matlab)
Received cmd: -1
-----------------------
MATLAB server shutdown!
-----------------------
> >>
>

Option 3: Write an entire MATLAB program using writeLines() and run using system()

The writeLines function can be used to create a MATLAB script (a .m file) from a character vector in R. Each element of the vector will become a separate line in the script. This approach can be useful when you want the MATLAB script to vary based on something in the R environment, such as a subject ID or parameters in a simulation. If, on the other hand, the MATLAB script never changes, you can just create the script in a text editor, rather than using writeLines. In either case, the script can be run from within R using the system command.

Pros!

  • Can run multiple lines with loops and dependencies
  • Easier to use than R.matlab
  • No problems running multiple jobs in parallel on the same node

Cons😦

  • Passing data directly between MATLAB and R isn’t possible, so must write data to CSV or similar
  • As in option 1, if there is an error in the MATLAB code, the execution will stop and MATLAB will remain open until you type exit.
#set a variable in R and save in a csv file
x <- 10
write.table(x, file='~/x.csv', sep=",", row.names=FALSE, col.names=FALSE)

#make a vector where each element is a line of MATLAB code
#matlab code reads in our variable x, creates two variables y and z, 
#and write z in a csv file
matlab.lines <- c(
    "x = csvread('~/x.csv')",
    "y=20",
    "z=x+y",
    "csvwrite('~/z.csv', z)")

#create a MATLAB script containing all the commands in matlab.lines
writeLines(matlab.lines, con="~/myscript.m")

#run our MATLAB script
system("matlab -nodisplay -r \"run('~/myscript.m'); exit\"")
# read in the variable z we created in MATLAB
z <- read.table("~/z.csv")
z
#remove the temporary files we used to pass data between R and MATLAB
system("rm ~/x.csv ~/z.csv")

Option 4: Stick with one or the other

Of course, another solution is to stick with either R or MATLAB, if it’s feasible. Opening the MATLAB server can be slow and clunky, making it problematic in a simulation setting, and passing large amounts of data back and forth can be difficult or impossible. In my simulation, I actually ended up avoiding MATLAB entirely for the sake of speed.

However, it takes time to re-write existing code in a different language, especially one you aren’t super comfortable with (Google Translate for code, anyone?).  I’d love to one day be as comfortable with MATLAB as I am with R, just as I hope to one day speak German as well as I speak Spanish (and to speak Spanish as well as I speak English).  Of course, the only way to do any of these things is to practice coding/speaking, but full immersion is time-consuming and not always practical, and the solutions above can be helpful in the meantime.

Finally, thank you to John Muschelli for explaining option 3. I also heard a rumor that he might be writing a follow-up post with some additional tips…

If you have another way of interfacing between R and MATLAB, I’d love to hear about it!

Why I’m a Screen evangelist (and how to get started)


I’ve recently made several upgrades to my workflow that make me say “WHY OH WHY did it take me this long to start doing this?”  One of the big upgrades was using Sublime + Enhanced-R as explained by Alyssa Frazee to send code line-by-line to the Terminal with a simple keystroke.  Copy and paste no more!  It makes my fingers hurt just thinking about my old ways.  Thing is, I picked up on the Sublime workflow months after Alyssa and others in my department had started using it.  I read Alyssa’s blog post back in August, but it took me a while to take action.

Not this time!  Following a recommendation from Taki Shinohara, I recently started using GNU screen, or “screen” for short, and I am in nerd heaven.  I’m so excited about this that I am constantly asking everyone if they are “using screen yet?”.  I am sure I am getting on some nerves.  But seriously, it’s the awesomest.  Here’s why.

I often work on my department’s cluster interactively, whether for debugging code that I will submit as a batch job later, or just running code that doesn’t really need to be submitted as a batch job.  I love this way of working:  I get the best of the cluster and my computer at the same time. I can use tons of memory (100G in some cases) and my computer doesn’t get bogged down.  At the same time, I can debug as I go, easily track my code’s progress, and write code as I go.  However, there is one major issue that, until recently, stopped me from working this way most of the time: the connection to the cluster could be lost, taking all my code’s progress down with it.  This can happen for a myriad of reasons: a faulty internet connection, I close my computer, I need to go home for the day, the connection simply times out after a while, and so on.  To get over these limitations, I would often use interactive sessions for the purpose of debugging and code building, but submit my code as a batch job when I needed it to run without interruption.  Alternatively, I would often work interactively, but save my progress in data files periodically and reload the saved files if the connection was lost for any reason.  This workflow does the job, but it’s not convenient or failproof.

Enter screen.  Screen allows me to work interactively for an indefinite length of time without ever having to worry about losing the connection.  A quick rundown:

1. I log onto the cluster and start a new screen by typing ‘screen’.

2. I get on a cluster node (‘qrsh’), start R or MATLAB, and submit whatever code I want to run.

3. When I am ready to stop for the time being, I type ctrl-A ctrl-D to “detach” the screen.

I can then close my computer and move on with my life, and my session will still be there the next time I need it.

4. When I’m ready to look at it again, I log onto the cluster and type ‘screen -r’ to “reattach” the screen.  And it’s back, just like magic.  This works even if code is running when I want to stop working.

Screen also works if the connection is inadvertently lost.  If you close your computer or lose your internet connection before detaching the screen, you simply need to “force detach” before reattaching it.  To do this, log onto the cluster, type ‘screen -D’ to force detach and then ‘screen -r’ to reattach.

You can also have more than one screen session open at once, running completely independently.  In this case, each screen session has an ID, and to reattach screen with ID 12345, you’d type ‘screen -r 12345’.  If you don’t know the screen ID, simply type ‘screen -r’ and screen will return a list of each session and their IDs.

Keep in mind that it’s good practice to exit screen (type ‘ctrl-A ctrl-K’ to kill an attached session) when you’re done working to make sure you aren’t taking up resources that might be needed by other users.

That’s it.  Simple and super convenient.  As long as screen is installed on your cluster, you don’t have to do anything additionally to start using it and stop losing work!