The Role of Centering in Dual Regression

Dual regression is maybe the simplest way to obtain subject-specific estimates of ICA-based resting-state networks (RSNs).  RSNs are regions of the brain that tend to act in a coordinated manner in the absence of a specific task, such as the three shown below (from a 50-component ICA of the Human Connectome Project). 

Popular ICA toolboxes for fMRI analysis like GIFT and MELODIC can perform dual regression, but such a simple approach is also straightforward to implement on one’s own.  I end up doing this a lot (and would love to hear in the comments whether others do too or prefer to use existing implementations).  This post is about the important role of centering in dual regression, which is easy to overlook (at least for me) when implementing this simple technique.

In short, given a set of group-level independent components (ICs) each representing a resting-state network, dual regression estimates the subject-level ICs and their time courses by fitting two regression models.  In Regression 1, the group ICs are regressed against the subject’s fMRI data to estimate the subject-level time courses associated with each IC.  In Regression 2, those estimated time courses are regressed against the subject’s fMRI data to obtain estimates of subject-level ICs (spatial maps).  To formalize it a bit:

  • T – number of volumes (time points) in fMRI timeseries
  • V – number of vertices or voxels within the brain
  • Q – number of ICs
  • \mathbf{Y}_i\ ({T\times V}) – fMRI timeseries for subject i
  • \mathbf{S}_{grp}\ ({Q\times V}) – group-level ICs
  • \mathbf{S}_i\ ({Q\times V}) – subject-level ICs (to be estimated)
  • \mathbf{A}_i\ ({T\times Q}) – subject-level IC time courses (to be estimated)

Based on the probabilistic ICA model \mathbf{Y}_i =\mathbf{A}_i\mathbf{S}_i +\mathbf{E}_i, dual regression fits the following two regression models:

Regression 1: \mathbf{Y}_i' =\mathbf{S}_{grp}'\mathbf{A}_i' +\mathbf{E}_{i1}'.

Note that \mathbf{Y}_i has been transposed so that the V locations are treated as data observations.  \mathbf{S}_{grp}' is the design matrix, and we obtain the OLS estimates of the coefficients, \hat{\mathbf{A}}_i. This is a multivariate (and multiple) regression model, since the response \mathbf{Y}_i' is a matrix with T columns. In the absence of additional assumptions, this is just equivalent to fitting T separate linear regression models, each with V data observations.  (This makes it very computationally convenient, since we can use matrix operations to estimate the entire coefficient matrix in one step as \hat{\mathbf{A}}_i' = (\mathbf{S}_{grp}\mathbf{S}_{grp}')^{-1}\mathbf{S}_{grp}\mathbf{Y}_i'.)

Regression 2: \mathbf{Y}_i = \hat{\mathbf{A}}_i\mathbf{S}_i +\mathbf{E}_{i2}.

Note that in this model, \mathbf{Y}_i is not transposed, so that the T time points, rather than the V locations, are treated as data observations.  \hat{\mathbf{A}}_i serves as the design matrix, and we obtain the coefficient estimates \hat{\mathbf{S}}_i. Again, this can be estimated efficiently in one step using matrix operations as \hat{\mathbf{S}}_i = (\hat{\mathbf{A}}_i'\hat{\mathbf{A}}_i)^{-1}\hat{\mathbf{A}}_i'\mathbf{Y}_i.
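
To make the matrix algebra concrete, here is a minimal sketch of the two regressions in R on simulated data (the dimensions and object names are mine, and centering, the subject of the rest of this post, is deliberately ignored here):

# Minimal sketch of dual regression on simulated data (no centering yet)
set.seed(1)
nT <- 200; nV <- 5000; nQ <- 10                  # T, V, Q in the notation above
S_grp  <- matrix(rnorm(nQ * nV), nQ, nV)         # group ICs (Q x V)
A_true <- matrix(rnorm(nT * nQ), nT, nQ)         # true time courses (T x Q)
Y      <- A_true %*% S_grp + matrix(rnorm(nT * nV), nT, nV)   # "fMRI data" (T x V)

# Regression 1: design = S_grp', response = Y'; OLS gives A_hat' = (S S')^{-1} S Y'
A_hat <- t(solve(S_grp %*% t(S_grp), S_grp %*% t(Y)))         # T x Q

# Regression 2: design = A_hat, response = Y; OLS gives S_hat = (A'A)^{-1} A' Y
S_hat <- solve(crossprod(A_hat), t(A_hat) %*% Y)              # Q x V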

Since neither regression model includes an intercept, both the data and design matrix must be centered in order to avoid model misspecification.  And since each model involves a different data matrix and design matrix, we need to worry about centering 4 different matrices. First, though, a brief review of the role of the intercept and centering in linear regression. 

Why center in linear regression?

The goal of simple linear regression is to describe the linear relationship between an explanatory variable x and a response variable y.  The left-hand scatterplot below shows a classic motivating example. In this model, the true intercept is not zero, and fitting the model without the intercept results in a flawed estimate of the relationship between x and y (red line). The model that does include the intercept (blue line) correctly captures the relationship.

Generally, an intercept is required in linear regression, unless (1) there is reason to believe that the intercept is truly zero, or (2) both the response and (each) predictor have been centered.  An example of the former might be the relationship between gestational age of a fetus and its size (although that relationship would likely not be linear).  If the response and predictor(s) have been centered, however, the estimated intercept is exactly zero and can therefore be excluded from the model.  The right-hand scatterplot below shows the same data after centering x and y, overlaid with the estimated linear regression lines with and without an intercept.  The two fits are now exactly the same, showing that the intercept can be dropped from the model.
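
A tiny R illustration of the same point (made-up data, not the data shown in the scatterplots):

# Made-up data: true intercept is 5, true slope is 2
set.seed(1)
x <- runif(100, 10, 20)
y <- 5 + 2 * x + rnorm(100)

coef(lm(y ~ x))          # intercept near 5, slope near 2
coef(lm(y ~ x - 1))      # forcing the intercept to zero distorts the slope

# After centering both variables, the no-intercept fit recovers the correct slope
xc <- x - mean(x); yc <- y - mean(y)
coef(lm(yc ~ xc - 1))    # slope near 2, matching the fit with an intercept
coef(lm(yc ~ xc))        # and the estimated intercept is numerically zero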

Dual regression requires centering across time and space

Coming back to dual regression, then, we should either include an intercept in each model OR properly center both the “data” and the “design matrix” in both regressions.  I find it simpler to do the latter and will therefore focus on centering.  A key point is that we need to center across observations in both regressions, where in Regression 1 the observations are the V locations, and in Regression 2 the observations are the T time points. Therefore, the data must be centered across both space (remove the global or gray matter signal) and time (remove the mean image, centering each time course at zero).

This double-centering can be performed in two steps, and it turns out that it doesn’t matter which is done first.  To see this, allow me to get a bit matrix-y for a moment.  Note that the mean across time for each location is \tfrac{1}{T}\mathbf{1}_T'\mathbf{Y}_i, and the data centered across time is \tilde{\mathbf{Y}}_i = \mathbf{Y}_i -  \tfrac{1}{T}\mathbf{1}_T\mathbf{1}_T'\mathbf{Y}_i =(\mathbf{I}_T -  \tfrac{1}{T}\mathbf{1}_T\mathbf{1}_T')\mathbf{Y}_i.  This thing (\mathbf{I}_T -  \tfrac{1}{T}\mathbf{1}_T\mathbf{1}_T') is a centering matrix, which is symmetric and idempotent.  Now to center the other way, we can do the same thing to \tilde{\mathbf{Y}}_i after transposing.  The data centered across time then space is \tilde{\tilde{\mathbf{Y}}}_i' = (\mathbf{I}_V -  \tfrac{1}{V}\mathbf{1}_V\mathbf{1}_V')\tilde{\mathbf{Y}}_i' =(\mathbf{I}_V -  \tfrac{1}{V}\mathbf{1}_V\mathbf{1}_V')\mathbf{Y}_i'(\mathbf{I}_T -  \tfrac{1}{T}\mathbf{1}_T\mathbf{1}_T').  By doing it the other way (centering across space first, then time), we would have ended up with (\mathbf{I}_T -  \tfrac{1}{T}\mathbf{1}_T\mathbf{1}_T')\mathbf{Y}_i(\mathbf{I}_V -  \tfrac{1}{V}\mathbf{1}_V\mathbf{1}_V'), which is just \tilde{\tilde{\mathbf{Y}}}_i.  This also shows that the second centering does not somehow “mess up” the first centering, since centering in either order leads to the same result.
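
Here is a quick numerical check of this in R (simulated data; the helper functions are mine):

# Centering across time then space equals centering across space then time
set.seed(1)
Y <- matrix(rnorm(100 * 400), 100, 400)                # T = 100 time points, V = 400 locations

center_time  <- function(Y) sweep(Y, 2, colMeans(Y))   # demean each location's time course
center_space <- function(Y) sweep(Y, 1, rowMeans(Y))   # demean across locations at each time point

max(abs(center_space(center_time(Y)) - center_time(center_space(Y))))   # essentially zero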

This double-centering prior to Regression 1 (along with centering the group ICs \mathbf{S}_{grp} across locations, e.g. \tilde{\mathbf{S}}_{grp}' =(\mathbf{I}_V -  \tfrac{1}{V}\mathbf{1}_V\mathbf{1}_V')\mathbf{S}_{grp}') also happens to take care of the second regression, because \hat{\mathbf{A}}_i ends up centered.  To see this, note that \hat{\mathbf{A}}_i' = (\tilde{\mathbf{S}}_{grp}\tilde{\mathbf{S}}_{grp}')^{-1}\tilde{\mathbf{S}}_{grp}\tilde{\tilde{\mathbf{Y}}}_i'.  This can be written \hat{\mathbf{A}}_i' = (\tilde{\mathbf{S}}_{grp}\tilde{\mathbf{S}}_{grp}')^{-1}\tilde{\mathbf{S}}_{grp}\tilde{\mathbf{Y}}_i'(\mathbf{I}_T -  \tfrac{1}{T}\mathbf{1}_T\mathbf{1}_T'), where \tilde{\mathbf{Y}}_i' is only centered across locations (required for Regression 1).  The part of \hat{\mathbf{A}}_i' preceding the centering matrix is what we would have gotten if we had not already centered \mathbf{Y}_i across time.  Therefore, by centering \mathbf{Y}_i across time as well as space, our estimate of \mathbf{A}_i is also centered across time.  So both the data and design matrix for Regression 2 are centered, and we’re good to go.

What happens if we don’t center both ways?

Let’s see what happens if we fail to center the fMRI timeseries data both ways.  Using subject 100206 of the Human Connectome Project, we’ll use dual regression to estimate the subject-level ICs associated with the three group ICs shown at the beginning.  Each slideshow below shows one of the ICs (motor, auditory and executive control), and toggles through the cases where the data was centered across space only, time only, or both space and time.

The most blatantly obvious problem occurs when we center across space only, essentially removing the global or gray matter signal.  We haven’t removed the “mean image”, which is then essentially absorbed into the IC estimates.  When we center across time only, we remove the mean image but not the global signal.  In this case, we see more widespread “activation” in all three of the ICs.  We’d also expect (perhaps even greater) contamination of the time courses associated with each IC in this case, since by failing to center across space, Regression 1 will be misspecified.

[Slideshows: subject-level estimates of the motor, auditory and executive control ICs under each centering scheme]

So in conclusion, before performing dual regression using your own implementation, center, center, and center some more!

How to efficiently prewhiten fMRI timeseries the “right” way

I recently got around to reading the high-profile PNAS paper titled “Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates” (Eklund et al. 2016), which called attention to the importance of properly processing and analyzing fMRI data.

This led me to an earlier paper titled “Does parametric fMRI analysis with SPM yield valid results?” (Eklund et al. 2012), which identified mismodeling of temporal autocorrelation as a major factor driving high false positive rates.  From a statistical perspective, accounting for temporal autocorrelation is vital to obtain correct standard errors on model coefficients, which are the basis for hypothesis tests used to identify areas of activation in the brain, a primary goal in many fMRI studies.  (In fact many group analyses actually ignore first-level standard errors, but these should ideally be incorporated into the second-level model.)

The 2012 paper pointed out two primary reasons for mismodeling of temporal autocorrelation: (a) using too low of a model order, such as AR(1), especially for faster TRs, and (b) assuming the same AR coefficients across the brain.  They also recommend smoothing the coefficients to improve the estimates.

As a statistician, I was excited to try out this “proper” approach to prewhitening.  Increasing the model order was no big deal (I used an AR(6) model for TR=0.72 seconds), but doing the prewhitening at the voxel level was a bit of a hassle.  After much trial and error and a few new gray hairs, here’s what I’ve learned.  Some of the computational strategies are specific to R, but they can probably be easily adapted to MATLAB or Python.

  1. AR coefficients definitely seem to vary across the brain.
  2. Averaging across subjects greatly improves the smoothness of the coefficient maps, which is presumably indicative of better estimation.
  3. The prewhitening matrix (the square root inverse covariance matrix) needs to be computed quickly, since you’re going to be doing it separately for every voxel in the brain.

I’ll go into each of these in turn below, but first, let’s review why and how prewhitening is actually performed.  fMRI task activation studies typically rely on fitting a regression model at many different locations in the brain, and ordinary least squares regression assumes uncorrelated errors.  This assumption is the basis of how we compute standard errors of coefficient estimates, which in turn are the basis of hypothesis testing, which allows us to identify regions of the brain that are active in response to different tasks.  The validity of these hypothesis tests depends primarily on (1) how correction for multiple comparisons was performed (the focus of the 2016 paper mentioned above) and (2) the accuracy of the standard errors of the model coefficients.  If we ignore temporal autocorrelation, we will underestimate the true standard errors and obtain more false positives.

So how do we fix this?  If y (T×1) is the time series for a single voxel and X (T×k) is the design matrix containing an intercept, column(s) corresponding to the expected BOLD response under the task paradigm, and several columns for nuisance parameters such as motion and drift, then the regression model we want to estimate is y = Xβ + ε, where ε ~ Normal(0, σ²V).  The β coefficients corresponding to the task-related columns of X represent the level of activation in response to each task at this particular voxel.  V is a T×T temporal correlation matrix.  If we knew V, then we could simply pre-multiply both sides of the equation by W = V^(-1/2), giving Wy = WXβ + Wε, where Wε has variance/covariance matrix σ²WVW = σ²I.  This results in a regression model with uncorrelated errors, which we can fit as usual to obtain estimates and valid standard errors for β.  (In practice V is estimated rather than known, so ideally the whole procedure is iterated a few times, re-estimating the residuals and recomputing W at each pass.)

So how do we find W?  We first solve the original regression equation y = Xβ + ε to get an estimate of the residuals, then fit an AR(p) model to the residuals to get the p AR coefficient estimates.  We can then compute V^(-1) directly (it has a simple form, see point 3 below), then take its square root to get W.  Simple enough.

The challenge is how to do this for every voxel in the brain and every subject in our analysis without it presenting a major computational inconvenience.  If we intend to average the coefficients across our n subjects (which we probably should – see point 2 below), then our work flow should look like:

(a) estimate the p AR coefficients for each location in the brain and each subject (resulting in a total of p*n images)

(b) average them across subjects (resulting in a total of p images)

(c) compute the prewhitening matrix W for each location in the brain (resulting in N T×T matrices, where N is the number of voxels in the brain), and

(d) fit the regression model Wy = WXβ + Wε at each voxel to get estimates and standard errors for β for each subject and voxel, which can then be passed on to a group-level model.

If this sounds like a lot of work, here are several reasons to do it anyway:

1. AR coefficients vary across the brain.

For my AR(6) model fit on motor task data from the Human Connectome Project, here are the estimates of the AR coefficients (from a single iteration), averaged across ~500 subjects.  Each image is on a different scale, since the magnitude of the coefficients tends to decrease with the time lag.  However, what is shared across the coefficients is the fact that there are striking spatial variations.  And the spatial differences are not minor – the first coefficient varies from nearly zero in the medial temporal lobe (MTL) to over 0.2 in the angular gyrus and occipital cortex, while the second coefficient is largely positive but with some areas of negative autocorrelation in the MTL and the inferior frontal cortex.

 

[Maps of estimated AR coefficients 1 through 6, averaged across subjects]

2. Efficiency of coefficient estimates can be improved by averaging across subjects.

[The term “efficiency” in the title of this post has a double meaning: estimation efficiency has to do with obtaining estimates that have low variance, meaning they are close to the truth (assuming they are unbiased), while computational efficiency (see point 3) is the ability to compute stuff quickly.  Both are important, but for very different reasons.  Estimation efficiency matters because it influences accuracy of “activated regions”; computational efficiency matters because time is scarce.]

For the same AR(6) model, here are the estimates of the first AR coefficient averaged across 500 subjects, 20 subjects, and for a single subject.  The averaged images are much smoother than the single-subject image.  Another option would be to perform a small amount of spatial smoothing, especially if you want to allow for differences across subjects.

[Maps of the first AR coefficient: averaged across 500 subjects, averaged across 20 subjects, and for a single subject]

3. The prewhitening matrix can be computed quickly for each voxel.

Hopefully you’re now convinced that voxel-wise prewhitening and averaging coefficient estimates across subjects are important.  But if it takes weeks to run and debug, let’s get real – it’s probably not going to happen. Luckily, each of the four steps in the workflow can be done reasonably quickly:

(a) Estimate the p AR coefficients for each location in the brain and each subject.  Instead of looping through all voxels to obtain residuals, this can be done very quickly by keeping everything in matrix form.  Letting Y be the T×V matrix of voxel time series, then B = (X’X)^(-1)X’Y is the k×V matrix of coefficient estimates, and E = Y – XB is the T×V matrix of residuals.  Then you just need to loop over voxels to fit an AR model to each column of E.
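
In R, this step might look something like the following sketch (Y and X are assumed to already exist as above; the Yule-Walker fit via ar() is just one reasonable choice):

# Step (a): residuals via matrix operations, then an AR(p) fit per voxel
p <- 6                                        # AR model order
B <- solve(crossprod(X), t(X) %*% Y)          # k x V coefficient estimates, (X'X)^(-1) X'Y
E <- Y - X %*% B                              # T x V residuals

ar_coefs <- matrix(NA, p, ncol(Y))            # p x V matrix of AR coefficient estimates
for (v in seq_len(ncol(Y))) {
  ar_coefs[, v] <- ar(E[, v], aic = FALSE, order.max = p, method = "yule-walker")$ar
}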

(b) Average the coefficients across subjects. Sum and divide by n.

(c) Compute the prewhitening matrix W for each location in the brain. I chose to do this by looping through voxels and computing W for each voxel.  This means that for each voxel, you’ll end up computing the same W n times, once for each subject.  Alternatively, the N different prewhitening matrices (one for each voxel) can be computed once and stored, which can be done efficiently using sparse matrices with the Matrix package in R.  But actually W can be computed very quickly, so it’s no big deal to do it on the fly.

To compute W, you first need to compute V^(-1).  For an AR(p) model with coefficients φ1, …, φp, V^(-1) is (proportional to) AA', where A is a banded lower-triangular matrix with 1’s on the main diagonal, -φ1 on the first lower off-diagonal, -φ2 on the second lower off-diagonal, etc., and zeros everywhere else (including the entire upper triangle).  (You might notice that the main diagonal of the resulting V^(-1) has all its elements equal except for the first p and last p elements.  This is a boundary effect due to fitting an AR(p) model with fewer than p time points at the beginning and end of the timeseries.  The first p time points of W, y and X should therefore be discarded prior to model fitting.)

Once you have V^(-1), compute its eigenvectors U and eigenvalues d (via an eigendecomposition or singular value decomposition) to write V^(-1) = UDU', where D = diag{d}.  Then W is simply V^(-1/2) = UD^(1/2)U'.  This takes about 0.2 seconds, compared with about 30 seconds using the sqrtm() function in R.
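
Here is one way this calculation might look in R, following the construction just described (avg_ar is assumed to be the p×V matrix of subject-averaged AR coefficients from step (b); the proportionality constant is ignored since it does not affect the resulting estimates or standard errors):

# Compute W = V^(-1/2) for one voxel from its AR coefficients
make_W <- function(phi, nT) {
  p <- length(phi)
  A <- diag(nT)                                   # 1's on the main diagonal
  for (k in seq_len(p)) {
    A[cbind((k + 1):nT, 1:(nT - k))] <- -phi[k]   # -phi_k on the k-th lower off-diagonal
  }
  Vinv <- tcrossprod(A)                           # AA', proportional to V^(-1)
  e <- eigen(Vinv, symmetric = TRUE)
  e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)   # W = U D^(1/2) U'
}

W <- make_W(avg_ar[, 1], nrow(Y))                 # prewhitening matrix for the first voxel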

(d) Fit the regression model Wy = WXβ + Wε at each voxel.  Here it’s probably easiest just to loop through, since W and therefore WX are different for each voxel, so the matrix-based approach used in (a) no longer works.  Since W is fast to compute, this isn’t so bad.
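
A corresponding sketch of this last step, reusing make_W from above (again, the object names are mine):

# Step (d): prewhiten and fit the model voxel-by-voxel
p <- nrow(avg_ar)
beta_hat <- se_hat <- matrix(NA, ncol(X), ncol(Y))
for (v in seq_len(ncol(Y))) {
  W  <- make_W(avg_ar[, v], nrow(Y))
  Wy <- (W %*% Y[, v])[-(1:p)]                    # discard the first p time points
  WX <- (W %*% X)[-(1:p), , drop = FALSE]
  fit <- summary(lm(Wy ~ WX - 1))                 # no extra intercept: X already contains one
  beta_hat[, v] <- fit$coefficients[, "Estimate"]
  se_hat[, v]   <- fit$coefficients[, "Std. Error"]
}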

That’s about it!  If you have an experience or perspective to share, please add them in the comments.  I would love to hear your thoughts.

R function to write CIFTI files

As I wrote about previously, CIFTI files are a new and useful file format for storing, analyzing and visualizing imaging data on the cortical surface AND subcortical areas (gray matter structures and cerebellum).  Much of the imaging data available through the Human Connectome Project is in CIFTI format, so anyone who wants to be able to work with that data needs to be able to work with CIFTI files.  Some tools have been developed to do this, but they are not yet available for R (to my knowledge at the time of writing).  My solution until recently was to just use MATLAB for all of my analyses, but I’m working on a new project that uses INLA, which has been programmed in R.  Cue the tedious back-and-forth between R and MATLAB.  Today I finally had enough, so I wrote up a quick function to write CIFTI file(s) from a vector or matrix of values.  (For an intro on how to use MATLAB from within R, see this previous post.)  For visualization purposes, the Connectome Workbench is still needed, but this has drastically streamlined my workflow.  Later I may wrap this into a package, but for now here’s the function code.

You’ll need to replace the default values for the template and toolboxloc arguments, but I leave them in as an example.  The CIFTI file you point to should be of the lower resolution variety.  As mentioned in the function comments, this assumes you have MATLAB installed and can run it from the command line (I have “module load matlab” in my .bashrc file).  You’ll also need to download the cifti-matlab or fieldtrip toolbox, both of which include the ft_read_cifti and ft_write_cifti functions.  I use cifti-matlab because fieldtrip has a bunch of other stuff I don’t need.  You also need a fair amount of memory to read and write CIFTI files, so I usually do this on the cluster.  My laptop has 8GB of RAM, which is not enough, but my desktop has 32G, and this seems to work.  Once I have the CIFTI files written, I just transfer them to my machine to visualize them with the Connectome Workbench.

Comments and suggestions welcome!

makeCIFTIs <- function(table, names, hemisphere='both', 
 template = '~/HCP/data/groupICA/groupICA_3T_Q1-Q6related468_MSMsulc_d25.ica/melodic_IC_ftb.dlabel.nii', 
 toolboxloc = '~/matlab_toolboxes/cifti-matlab/',
 scratchloc = '~/'){

 # table = VxQ matrix, each column will be written as a CIFTI file
 # table can be a vector if only a single CIFTI is needed
 # V = number of voxels in both or one hemisphere
 # Q = number of CIFTI files to be written
 # names = vector length Q, containing file names to be written (no extension)
 # hemisphere = 'both', 'left', 'right', or 'all' (includes subcortical voxels)
 # if hemisphere='both', the first half of table should be the left hemisphere
 # template = name of existing CIFTI file that can be used as a template
 # toolboxloc = location and name of folder containing cifti-matlab or fieldtrip toolbox
 # scratchloc = location where CSV files will be written then deleted

 # this function writes a MATLAB script and executes it
 # must have MATLAB installed
 # must have the cifti-matlab or fieldtrip toolbox installed
 # currently can only write CIFTIs with a vector of values (e.g. not a timeseries or a set of label maps)

if(is.vector(table)) table <- matrix(table, nrow=length(table), ncol=1)
 Q <- ncol(table)
 V <- nrow(table)

 if(hemisphere=='all') if(V != 96854) stop('Number of rows must equal 96854 with hemisphere="all"')
 if(hemisphere=='both') if(V != 64984) stop('Number of rows must equal 64984 with hemisphere="both"')
 if(hemisphere=='left') if(V != 32492) stop('Number of rows must equal 32492 with hemisphere="left"')
 if(hemisphere=='right') if(V != 32492) stop('Number of rows must equal 32492 with hemisphere="right"')

# Write table to CSV in scratch location
 tmp <- table
 tmp[is.na(tmp)] <- 0 #set NAs to zero
 fname <- file.path(scratchloc, 'tmp.csv')
 write.table(tmp, file=fname, row.names=FALSE, col.names=FALSE, sep=',')

 # Write MATLAB script

 line1 <- paste0("addpath '", toolboxloc, "'")
 line2 <- paste0("cifti = ft_read_cifti('",template,"');") #reads in structure with cell 'indexmax'
 line3 <- paste0("indsL = (cifti.brainstructure == 1);", #L cortex only
 "indsR = (cifti.brainstructure == 2);", #R cortex only
 "indsLR = (cifti.brainstructure <= 2);", #L and R cortices only
 "inds_nan = isnan(cifti.indexmax);") #location of NaNs

 line5 <- paste0("cd '", getwd(), "'")
 line6 <- paste0("fname = '",fname,"';",
 "coefs = csvread(fname);")

 line7 <- paste0("names = {'", paste(names, collapse="';'"), "'};")

 line8 <- paste0("for k=1:",Q)
 line9 <- "k"
 line10 <- "vals = coefs(:,k);"
 line11 <- "cifti.indexmax = cifti.indexmax*0;"

 if(hemisphere=='all') line12 <- "cifti.indexmax = vals;"
 if(hemisphere=='left') line12 <- "cifti.indexmax(indsL) = vals;"
 if(hemisphere=='right') line12 <- "cifti.indexmax(indsR) = vals;"
 if(hemisphere=='both') line12 <- "cifti.indexmax(indsLR) = vals;"

 line13 <- "cifti.indexmax(inds_nan) = 0/0;" #put in NaNs
 line14 <- "ft_write_cifti(names{k}, cifti, 'parameter', 'indexmax');"
 line15 <- "end"

 matlab_lines <- c(line1, line2, line3, line5, line6, line7, line8,
 line9, line10, line11, line12, line13, line14, line15)
 writeLines(matlab_lines, con=file.path(scratchloc, 'myscript.m'))

 system("matlab -nodisplay -r \"run('~/myscript.m'); exit\"")

 file.remove(fname)
 file.remove(file.path(scratchloc, 'myscript.m'))
}
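
For example, a hypothetical call writing two surface maps might look like this (the values are fake, and as noted above the template and toolboxloc defaults would need to point to your own files):

# Hypothetical usage: write two CIFTI files from a 64984 x 2 matrix of surface values
vals <- matrix(rnorm(64984 * 2), nrow = 64984, ncol = 2)
makeCIFTIs(table = vals, names = c('map1', 'map2'), hemisphere = 'both')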

And since a pretty picture always helps, here’s one of the products of my current analysis based on a CIFTI file I wrote this way, visualized using the Connectome Workbench.

[Figure: estimated coefficient (beta) map on the spherical cortical surface, visualized in the Connectome Workbench]

A layman’s guide to working with CIFTI files

[Update: Here is another nice intro to CIFTI files, by Jo Etzel at WashU.]

My research group recently began working with the Human Connectome Project (HCP) dataset, a large database of resting-state fMRI, task fMRI and other brain imaging and demographic data for 500+ subjects.  The HCP is pretty unique in that it combines a large number of subjects with long scans and a standardized scanning protocol.  This distinguishes it from other large neuroimaging datasets, such as ABIDE, ADHD-200 or the 1000 Functional Connectomes Project, which are grassroots compilations of data collected across multiple sites with different scanning equipment, acquisition protocols, preprocessing pipelines and quality control procedures.  Most of the resting-state scans in those datasets are 5-10 minutes long, whereas in the HCP there are 60 minutes of resting-state data collected for each subject.  

Our group and many others are interested in using the HCP to develop and evaluate different ways of analyzing brain imaging data.  However, some aspects of the HCP are new to many researchers, and it’s not always obvious from an outside perspective how to work with and interpret the data.  One example of this is data stored in “grayordinates” using the CIFTI file format.  While the HCP also contains data in volumetric space (in NIFTI file format), working with the grayordinates is appealing for a number of reasons that I won’t elaborate on here, except to say that the idea is to isolate the gray matter, which consists of cortical, subcortical and cerebellar areas.  However, working with CIFTI files can be daunting for a new user, and the documentation is not always very user-friendly.

I recently started working with these files, so based on my experience here is a basic introduction to navigating, reading and writing CIFTI files with MATLAB.  This explanation is based on my limited experience, but it will hopefully be helpful for other novice users.  Please feel free to make additions or corrections in the comments!

Part 1: Navigating CIFTI files

The first thing to know about CIFTI files is that they contain several parts, similar to a structure in MATLAB or a list in R.  In that sense, they are very different from NIFTI files, which are essentially just 3D or 4D arrays of intensities.  Furthermore, there are several types of CIFTI files, including times series (*.dtseries.nii), parcellations (*.dlabel.nii), and scalar images (*.dscalar.nii), and the parts or “fields” contained in each type of CIFTI file are different (though they have several commonalities). 

Consider a CIFTI file representing time series data.  The time series themselves are stored in a VxT matrix, where V is the number of grayordinate “voxels” and T is the number of time points in the series.  (Technically “voxels” only refer to volumetric space, but it’s convenient to have a single term for both surface and volumetric elements… please forgive the abuse of notation.)  The total number of voxels in a CIFTI file is around 90,000, including about 60,000 surface voxels (about 30,000 per hemisphere) and 30,000 subcortical and cerebellar voxels.  The surface voxels are a 2D representation of the cortical gray matter, while the subcortical voxels are still in 3D/volumetric space.  Each surface grayordinate is essentially an average of several cortical gray matter voxels in the original volumetric scan.  So what else does a CIFTI file contain?  Common to all types of CIFTI files are the following fields:

  • brainstructure: a vector of length V with a numerical indicator (1-21) for the major brain structure that each voxel forms part of
  • brainstructurelabel: a vector of length 21 with the name of each major brain structure (e.g. CORTEX_LEFT, CORTEX_RIGHT, CAUDATE_LEFT, CAUDATE_RIGHT, CEREBELLUM_LEFT, CEREBELLUM_RIGHT, etc.)
  • pos: a Vx3 matrix of the x-y-z coordinates of each voxel.  However, only the subcortical and cerebellar voxels have coordinates specified; for surface voxels, the values are missing.

In addition, each type of CIFTI file contains additional information.

  • A time series (*.dtseries.nii) CIFTI file contains time (1xT) and dtseries (VxT), where time is the timing (in seconds) of each time point (0, 0.72, 1.44, …) and dtseries is the time series of each voxel.
  • A parcellation (*.dlabel.nii) CIFTI file contains some parcellation field myfield (Vx1) and myfieldlabel (1xQ), where Q is the number of parcels.  For example, in the HCP groupICA parcellation files (melodic_IC_ftb.dlabel.nii), the parcellation field is named “indexmax”, and Q is the number of ICA components.  (As a side note, I believe “indexmax” refers to how spatial components from a groupICA were used to create a parcellation, namely by identifying the component with the maximum z-score at each voxel.)  In the FreeSurfer anatomical parcellation files ending in aparc.32k_fs_LR.dlabel.nii, the parcellation field is named “x100307_aparc”, and Q=70.
  • A scalar (*.dscalar.nii) CIFTI file contains one or more fields field1 (Vx1), field2 (Vx1), etc., each representing a scalar image of some sort.  For example, in the HCP, groupICA spatial maps are represented as this type of file, with the fields x1, …, xQ representing the Q spatial maps.  Structural images for each subject are also represented as .dscalar files.  For example, the files ending in MyelinMap_BC.32k_fs_LR.dscalar.nii contain the field myelinmap_bc, which contains the estimated myelin content of each of the roughly 60,000 surface voxels.

You might have noticed that I never mentioned where CIFTI files store the location of each surface voxel.  That’s because they don’t!  This is because the surface can be “inflated” to varying degrees, from very “wiggly” (showing full anatomical detail) to completely flattened (showing no anatomical detail).  Depending on the degree of inflation, the 3-dimensional location of each surface voxel changes!  For a given degree of inflation, the x-y-z locations of the surface voxels are stored in a different type of file, called a GIFTI.  I’ll mention below how to read and navigate these files in MATLAB too.

Part 2: Reading CIFTI files with MATLAB

I use the FieldTrip toolbox, which includes the function ft_read_cifti.  Once you have the toolbox downloaded and have added it to the path, reading a CIFTI file is as simple as mycifti = ft_read_cifti(‘fname.dsomething.nii’).  mycifti is then a MATLAB structure containing fields such as the ones described above.  Each field can be accessed or altered with the “.” operator, i.e. mycifti.fieldname.

As mentioned above, CIFTI files do not contain the x-y-z coordinates of the surface voxels.  These are instead contained in GIFTI files with names ending in “surf.gii”.  In the HCP dataset, the level of inflation is indicated in the file name and includes “flat”, “inflated”, “midthickness”, and “very_inflated”.  The left- and right-hemispheres are stored in separate GIFTI files.  The GIFTI toolbox can be used to read GIFTI files into MATLAB.  After installing and adding the toolbox, mygifti = gifti(‘fname.surf.gii’) results in a MATLAB structure with several fields, including vertices, a VL x 3 matrix containing the x-y-z coordinates of each voxel, where VL is around 30,000, the number of surface voxels in one hemisphere of the brain.  To map the voxels in a GIFTI file to the corresponding voxels in a CIFTI file, reference the brainstructure field: all voxels with brainstructure==1 (CORTEX_LEFT) map to those in a GIFTI file for the left hemisphere; all voxels with brainstructure==2 (CORTEX_RIGHT) map to those in a GIFTI file for the right hemisphere.

Part 3: Writing CIFTI files with MATLAB

I have found this to be the most challenging aspect of working with CIFTI files, in part because the documentation for the ft_write_cifti function seems to be somewhat limited.  Furthermore, since there are different types of CIFTI files, you cannot simply run ft_write_cifti(‘fname’, mycifti), even if mycifti is simply the result of mycifti = ft_read_cifti(‘fname.d*.nii’).  Instead, you must provide some additional information in the form of “optional” key-value pairs.  Also, in general I’ve found that you must respect the original file type.  For example, to save a parcellation you should start by reading in a .dlabel file; to save a static image you should start by reading in a .dscalar.nii file; etc.  Here are a few examples.

  • ft_write_cifti(‘fname’, mycifti, ‘parameter’, ‘dtseries’) saves the object mycifti, which has a dtseries field containing the functional data, as the CIFTI file fname.dtseries.nii.
  • ft_write_cifti(‘fname’, mycifti, ‘parameter’, ‘x1’, ‘parameter’, ‘x2’) saves the object mycifti, which has fields x1 and x2 containing two static images, as the CIFTI file fname.dscalar.nii.
  • ft_write_cifti(‘fname’, mycifti, ‘parcellation’, ‘parcels’, ‘parameter’, ‘parcels’) or ft_write_cifti(‘fname’, mycifti, ‘parameter’, ‘parcels’) saves the object mycifti, which has a field parcels containing the labels of each voxel, as the CIFTI file fname.dscalar.nii.  (I haven’t figured out how to save a .dlabel file.  Running ft_write_cifti(‘fname’, mycifti, ‘parcellation’, ‘parcels’) throws an error that indicates that ‘parameter’ is required.  Specifying ‘parameter’ and ‘parcellation’ or just ‘parameter’ results in a .dscalar file.  If anyone knows how to do this, I’d love to know!)

One aspect of working with CIFTI files that I have conspicuously ignored is visualization.  This is obviously a very important aspect of working with imaging data, but for now let’s just say it’s next on my to-do list.  Luckily another member of our group, Mary Beth Nebel, has started working with the Connectome Workbench, and has produced some beautiful figures, including the ones below.  I hope to pick her brain soon and report back!

Scan-rescan reliability (I2C2) by ICA parcel (Q=100): [figure]

Group ICA parcellations: [figure]

Derivation and interpretation of leverage in SLR

I was recently looking online for an explicit derivation of leverage — the diagonals of the “hat matrix” — in simple linear regression. If X is the n×p matrix of explanatory variables in a linear model, then the hat matrix is H = X(X'X)^(-1)X', so called because it puts the “hat” on the predicted values, since Ŷ = HY.

Leverage (hi) has a lot of nice properties that can be quite informative for model diagnostics. For example, if Yi were to change by 1 unit, then Ŷi would change by hi. Leverage is also proportional to the uncertainty in the predicted values Ŷi, since Var(Ŷ) = σ²H, where σ² is the variance of the model residuals.

In the simple linear regression (SLR) case, if your model contains an intercept then leverage is of the form

hi = 1/n + (xi − x̄)² / Σj (xj − x̄)²

Therefore, leverage is similar to Mahalanobis distance, which measures distance from the “center” of the data, standardized by the “scale” or variance. The same is true for multivariate linear regression with an intercept. However, the relationship isn’t entirely obvious (at least to me) from the hat matrix formula, and I couldn’t find the proof anywhere online (I’m sure it’s in someone’s linear models class notes but I didn’t find it from my digging around). So I wrote it up and posted it to my Resources page.

One interesting thing to note: if your model does not contain an intercept, this relationship doesn’t hold.  However, if you center the X’s, the relationship above again holds, but without the 1/n term at the beginning.
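
A quick numerical check of both of these facts in R (simulated data):

# Leverage in SLR: hat-matrix diagonal vs. the closed-form expression
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

X <- cbind(1, x)                                   # design with intercept
H <- X %*% solve(crossprod(X)) %*% t(X)
h <- 1 / n + (x - mean(x))^2 / sum((x - mean(x))^2)
max(abs(diag(H) - h))                              # essentially zero

# Centered x, no intercept: same formula without the 1/n term
Xc <- matrix(x - mean(x), ncol = 1)
Hc <- Xc %*% solve(crossprod(Xc)) %*% t(Xc)
max(abs(diag(Hc) - (h - 1 / n)))                   # essentially zero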

Tips for submitting to arXiv for the first time

Today I successfully submitted my first paper to arXiv!  We’ve submitted this paper to a journal, but it hasn’t been published yet, so we wanted to get a pre-print up before advertising the corresponding software packages.  Unfortunately, the process of submitting to arXiv wasn’t painless.  Now that I’ve figured out some of the quirks, however, hopefully your first submission can go a little more smoothly than mine did.

After registering, the first unpleasant surprise was that arXiv doesn’t accept PDF documents created from LaTeX code.  You must submit your LaTeX code and all supporting files.  As far as I can tell, these have to be uploaded one-by-one.  Sorry, I didn’t figure out a way around that one (but let me know if you do!).

After verifying that my code compiled without errors on my computer, I uploaded all my documents, hit “Process”, and … long page of errors!  It turns out arXiv doesn’t accept .bib files and instead requires .bbl files for BibTeX.  So I thought, “this will be easy” and uploaded my .bbl file instead.  I again hit “Process”, and … another long page of errors!

I wasn’t sure how to interpret these errors, so … I gave up for a while.  A few weeks later, I decided to give it another go.  I realized that I had made a couple of mistakes the first time:

  1. Several of my graphics were in PDF format.  This is fine, but I needed to include this line at the top of my preamble (top of my .tex document) to ensure that the .tex file is built using PDFLaTeX instead of LaTeX:
    \pdfoutput=1
  2. I had probably uploaded some unnecessary files, such as graphics that weren’t included in the final version.  ArXiv doesn’t like this.  Make sure only necessary files are uploaded.

Fixing those two things finally resulted in an error-free compilation.  Hooray!  (If at this point you are still struggling with compilation errors, I highly recommend reading arXiv’s help page on Considerations for TeX Submissions.)

One final annoyance was the abstract.  I had included an abstract in my document, but arXiv has its own abstract form, and my current abstract exceeded the character limits.  Once I had pared it down, however, I got a couple of errors related to “unknown characters”.  I didn’t have anything crazy in the text, so this seemed a little ridiculous, but I went ahead and removed a ‘%’ sign and a couple of quotation marks and hyphens, which took care of the errors. I then went back and removed the abstract from the .tex document and re-uploaded it.  Submission (finally) complete!

Props to arXiv for a very nice set of help pages, the most relevant of which for this post is Submitting an Article.  This includes links to the TeX Submissions help page referenced above, among many others.

P.S. If you’re curious about the paper, it’s titled “Improving Reliability of Subject-Level Resting-State fMRI Parcellation with Shrinkage Estimators” and is available here.

Three ways to use MATLAB from R

Being a statistician working in neuroimaging is a little like living abroad and trying to speak a foreign language. For example, my first language is English, but I spent my first summer as a PhD student doing research at LMU in Munich, Germany. I had taken German in college and could have really basic conversations, but for more demanding situations I would switch to English in a heartbeat, given the option.

In the world of neuroimaging, or at least many corners of it, MATLAB is the official language. Many tools and programs for the analysis of fMRI data are programmed in MATLAB, and many of my collaborators program exclusively in MATLAB. I know MATLAB well enough to do basic things like reading in data, looping and playing with matrices, but beyond the basics, like many statisticians I am much more comfortable in R. Therefore, I often find myself straddling the line between R and MATLAB.

For example, in one of my projects, I used R to read in the fMRI time series, create summary statistics, and do some transformations on those statistics. I then used MATLAB to perform clustering on the results using a method developed by one of my collaborators. Finally, I again used R to visualize the results using brainR.  For that project, it was convenient enough to write separate programs for the different R and MATLAB parts. I saved the results of the first R part in a CSV file using write.table, read that file into MATLAB using csvread, saved the output of the MATLAB part in another CSV file, and finally read that into R using read.table. Not exactly elegant, but it did the job.

However, I also created and ran a simulation that required the same basic process but over many iterations, and the manual workflow I just described was not a realistic option. I needed a way to run MATLAB automatically from within my R program. (In my case, I eventually compromised and did the clustering in R for the purposes of the simulation, a la option 4 below.) I experimented with different ways to use MATLAB from within R, and I found three primary options. Below I’ll show how to do each of these and discuss their pros and cons, but in short the options are (1) to execute a single MATLAB line using the system() command in R; (2) to use a package like R.matlab to send code to the MATLAB server; or (3) to write out a MATLAB program within R using the writeLines() function and run the entire program using the system() command in R.

Option 1: Run a single MATLAB command at a time using system()

The system command in R runs the contents of a character string as a shell command – basically anything you’d type into a Terminal window on a Mac or in a Linux environment, such as the Hopkins cluster. For example, typing system("ls") will return a list of the files in the working directory. If your system has the ability to run MATLAB from the command line (true for the Hopkins cluster), you can take advantage of that through R by typing system("matlab -nodisplay -r 'stuff; to; do; in; matlab;'").

Tip: If you’re working locally on a Mac with MATLAB installed, you’d need to go to the directory where MATLAB is installed: for example, this works on my computer to open MATLAB on the command line: /Applications/MATLAB_R2013a.app/bin/matlab -nodisplay

Pros!

  • Quick and easy
  • Possible to run multiple lines, separated by semicolons

Cons 😦

  • All commands must be contained within a single line
  • Loops and line-to-line dependencies are somewhat tricky
  • Can’t pass data between MATLAB and R
  • If there is an error in your code, the execution will stop and MATLAB will remain open until you re-type exit.
system('matlab -nodisplay -r "a=2; b=1; display(a+b); exit"')
                                 < M A T L A B (R) >
                       Copyright 1984-2013 The MathWorks, Inc.
                         R2013b (8.2.0.701) 64-bit (glnxa64)
                                   August 13, 2013


To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.


ans =

     3

To see an example of the last con – “If there is an error in your code, the execution will stop and MATLAB will remain open until you re-type exit” – suppose I had mis-spelled display as displya:

system('matlab -nodisplay -r "a=2; b=1; displya(a+b); exit;"')
                                 < M A T L A B (R) >
                       Copyright 1984-2013 The MathWorks, Inc.
                         R2013b (8.2.0.701) 64-bit (glnxa64)
                                   August 13, 2013


To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.


Undefined function 'displya' for input arguments of type 'double'.

>>

The MATLAB prompt (>>) indicates that we’re still in the MATLAB environment. Simply type exit to close the MATLAB connection and return to R.

Option 2: Use R.matlab to send code to the MATLAB server

R.matlab is a package that communicates with MATLAB, can read and write MAT files, and can pass objects to (and receive objects from) MATLAB. It’s a little tricky to use at first, but the ability to pass objects between R and MATLAB can be very useful. However, this ability is limited to objects of a reasonable size, and can be slow.

Pros!

  • Can spread commands over multiple lines
  • Loops and dependencies are easier
  • Can pass data back and forth using setVariable() and getVariable()

Cons 😦

  • Can be tricky to use
  • Connection to the MATLAB server can be buggy
  • Clunky output
  • Can’t run multiple jobs in parallel on the same node, since the MATLAB server can only be opened once on each node, making it limited for simulations
  • Passing large objects back and forth doesn’t always work
# load the R.matlab library and start MATLAB server
library(R.matlab)
Matlab$startServer()
matlab <- Matlab()
isOpen <- open(matlab)
                            < M A T L A B (R) >
                  Copyright 1984-2013 The MathWorks, Inc.
                    R2013b (8.2.0.701) 64-bit (glnxa64)
                              August 13, 2013

To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.

Running MatlabServer v2.2.0
MATLAB v7.x or higher detected.
Saving with option -V6.
Added InputStreamByteWrapper to dynamic Java CLASSPATH.
----------------------
MATLAB server started!
----------------------
Trying to open server socket (port 9999)...done.
Connected to client.
if (!isOpen) throw("MATLAB server is not running: waited 30 seconds.")

# set a variable in R and send to MATLAB
x <- 10
setVariable(matlab, x = x)
Received cmd: 3
Will read MAT file: "/tmp/RtmpK3nM1k/file1821f2a4d0339.mat"
evaluate(matlab, "x")
Received cmd: 1
"eval" string: "x"

x =

    10

Sent byte: 0
# set a couple of variables in MATLAB
evaluate(matlab, "y=20; z=x+y")
Received cmd: 1
"eval" string: "y=20; z=x+y"

z =

    30

Sent byte: 0
# pass a variable from MATLAB to R
z <- getVariable(matlab, "z")
Received cmd: 1
"eval" string: "variables = {'z'};"
Sent byte: 0
Received cmd: 2
save(tmpname, '-V6', 'z')
z
$z
     [,1]
[1,]   30

attr(,"header")
attr(,"header")$description
[1] "MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Fri Aug 15 09:57:56 2014                                        "

attr(,"header")$version
[1] "5"

attr(,"header")$endian
[1] "little"

answer=0
# close the MATLAB server
close(matlab)
Received cmd: -1
-----------------------
MATLAB server shutdown!
-----------------------
> >>
>

Option 3: Write an entire MATLAB program using writeLines() and run using system()

The writeLines function can be used to create a MATLAB script (a .m file) from a character vector in R. Each element of the vector will become a separate line in the script. This approach can be useful when you want the MATLAB script to vary based on something in the R environment, such as a subject ID or parameters in a simulation. If, on the other hand, the MATLAB script never changes, you can just create the script in a text editor, rather than using writeLines. In either case, the script can be run from within R using the system command.

Pros!

  • Can run multiple lines with loops and dependencies
  • Easier to use than R.matlab
  • No problems running multiple jobs in parallel on the same node

Cons 😦

  • Passing data directly between MATLAB and R isn’t possible, so must write data to CSV or similar
  • As in option 1, if there is an error in the MATLAB code, the execution will stop and MATLAB will remain open until you type exit.
#set a variable in R and save in a csv file
x <- 10
write.table(x, file='~/x.csv', sep=",", row.names=FALSE, col.names=FALSE)

#make a vector where each element is a line of MATLAB code
#matlab code reads in our variable x, creates two variables y and z, 
#and writes z in a csv file
matlab.lines <- c(
    "x = csvread('~/x.csv')",
    "y=20",
    "z=x+y",
    "csvwrite('~/z.csv', z)")

#create a MATLAB script containing all the commands in matlab.lines
writeLines(matlab.lines, con="~/myscript.m")

#run our MATLAB script
system("matlab -nodisplay -r \"run('~/myscript.m'); exit\"")
# read in the variable z we created in MATLAB
z <- read.table("~/z.csv")
z
#remove the temporary files we used to pass data between R and MATLAB
system("rm ~/x.csv ~/z.csv")

Option 4: Stick with one or the other

Of course, another solution is to stick with either R or MATLAB, if it’s feasible. Opening the MATLAB server can be slow and clunky, making it problematic in a simulation setting, and passing large amounts of data back and forth can be difficult or impossible. In my simulation, I actually ended up avoiding MATLAB entirely for the sake of speed.

However, it takes time to re-write existing code in a different language, especially one you aren’t super comfortable with (Google Translate for code, anyone?). I’d love to one day be as comfortable with MATLAB as I am with R, as I hope to one day speak German as well as I speak Spanish (and to speak Spanish as well as I do English). Of course the only way to do any of these things is to practice coding/speaking, but full immersion is time consuming and not always practical, and the solutions above can be helpful in the meantime.

Finally, thank you to John Muschelli for explaining option 3. I also heard a rumor that he might be writing a follow-up post with some additional tips…

If you have another way of interfacing between R and MATLAB, I’d love to hear about it!