Why I’m a Screen evangelist (and how to get started)


I’ve recently made several upgrades to my workflow that make me say “WHY OH WHY did it take me this long to start doing this?”  One of the big upgrades was using Sublime + Enhanced-R as explained by Alyssa Frazee to send code line-by-line to the Terminal with a simple keystroke.  Copy and paste no more!  It makes my fingers hurt just thinking about my old ways.  Thing is, I picked up on the Sublime workflow months after Alyssa and others in my department had started using it.  I read Alyssa’s blog post back in August, but it took me a while to take action.

Not this time!  Following a recommendation from Taki Shinohara, I recently started using GNU screen, or “screen” for short, and I am in nerd heaven.  I’m so excited about this that I am constantly asking everyone if they are “using screen yet?”.  I am sure I am getting on some nerves.  But seriously, it’s the awesomest.  Here’s why.

I often work on my department’s cluster interactively, whether I’m debugging code that I will later submit as a batch job or just running code that doesn’t really need to be a batch job.  I love this way of working: I get the best of the cluster and my computer at the same time.  I can use tons of memory (100 GB in some cases) without my own computer getting bogged down, and I can still debug, track my code’s progress, and write new code as I go.  However, there is one major issue that, until recently, stopped me from working this way most of the time: the connection to the cluster could be lost, taking all my code’s progress down with it.  This can happen for a myriad of reasons: a faulty internet connection, closing my laptop, going home for the day, the connection simply timing out after a while, and so on.  To get around this, I would often use interactive sessions for debugging and code building, but submit my code as a batch job when I needed it to run without interruption.  Alternatively, I would work interactively but save my progress to data files periodically, reloading the saved files if the connection was lost.  This workflow does the job, but it’s neither convenient nor foolproof.

Enter screen.  Screen allows me to work interactively for an indefinite length of time without ever having to worry about losing the connection.  A quick rundown:

1. I log onto the cluster and start a new screen session by typing ‘screen’.

2. I get on a cluster node (‘qrsh’), start R or MATLAB, and submit whatever code I want to run.

3. When I am ready to stop for the time being, I type ctrl-A ctrl-D to “detach” the screen.

I can then close my computer and move on with my life, and my session will still be there the next time I need it.

4. When I’m ready to look at it again, I log onto the cluster and type ‘screen -r’ to “reattach” the screen.  And it’s back, just like magic.  This works even if code is running when I want to stop working.

Screen also works if the connection is inadvertently lost.  If you close your computer or lose your internet connection before detaching the screen, you simply need to “force detach” before reattaching it.  To do this, log onto the cluster, type ‘screen -D’ to force detach and then ‘screen -r’ to reattach.

You can also have more than one screen session open at once, running completely independently.  In this case, each screen session has an ID, and to reattach the session with ID 12345, you’d type ‘screen -r 12345’.  If you don’t know the screen ID, simply type ‘screen -r’ and screen will print a list of all sessions and their IDs.

Keep in mind that it’s good practice to exit screen (type ‘ctrl-A ctrl-K’ to kill an attached session) when you’re done working to make sure you aren’t taking up resources that might be needed by other users.

That’s it.  Simple and super convenient.  As long as screen is installed on your cluster, you don’t have to do anything else to start using it and stop losing work!

What is quantitative MRI, and why does it matter?


This post is going to be a gentle introduction to MRI, which is a type of brain imaging and my current focus of research.  I’m planning a future post or two about my research, so I thought I’d provide a little background for people who don’t know much, if anything, about MRI.  I am by training a statistician by way of industrial engineering, so I am by no means a neurologist, radiologist, physicist or other type of “-ist”, meaning this is going to be pretty basic.  This particular post will be about something called relaxation times and quantitative MRI scans.

I got interested in neuroimaging about a year ago, thanks to Elizabeth Sweeney, who mentored me through the process of getting up to speed.  It’s a fascinating area for biostatistics, as it combines big data, nice visualizations, and interesting and impactful biological questions.  As imaging technologies like MRI get more sophisticated, scientists are really just starting to understand the details about how the brain works.  And statisticians are just starting to get involved in this learning process, so there are a lot of opportunities to look at existing questions and problems from a new, more statistical perspective.

For about the past year, I have been working on a project having to do with quantitative MRI with Elizabeth, Taki Shinohara, and people at the NINDS.  I’ll talk more about that project in a future post, but for now here’s a basic introduction to the topic.

Imagine yourself lying flat in an MRI scanner.  The scanner creates a strong magnetic field of 3-7 Tesla.  Under that magnetic field, the protons in your brain, which were previously pointing in every direction, all align in one direction, parallel to the field.  The scanner then sends in a brief radiofrequency pulse that knocks the protons out of alignment, and when the pulse ends, they “relax” back into alignment with the field.  MRI scanners can measure the “T1 relaxation time”: how long it takes for your protons to relax back.  At least, they can try to.  More on that in a bit.

But first, why measure the T1 relaxation time?  Interest in relaxation times began in 1971, when it was discovered that tumors have longer relaxation times than normal tissue.  Since then, studies have established that people with certain disorders, including multiple sclerosis (MS), tend to have widespread increases in T1 relaxation times, indicating neurological degradation, including loss of axons and myelin in the brain.  And while some affected tissue, like tumors and lesions, is easy to see on a regular, unitless MRI scan, other tissue with elevated T1 doesn’t look visibly different from healthy tissue.  That’s where quantitative images that measure T1 (“T1 maps”) come in.  Though affected tissue may not look different to the naked eye on the scan, the measured T1 tells the story.  If we can measure T1 in normal-appearing gray and white matter, we can better understand the widespread neurodegenerative impact of diseases like MS.  Through longitudinal studies of T1, we can better understand long-term disease burden and hopefully design better treatments and relapse-detection methods.

So if T1 maps are so great, what’s the problem?  There are several major issues with T1 maps compared with the more common clinical MRI scans, such as T1-weighted images, which show contrast between tissues but lack interpretable units.  Here’s an example of a T1 map (A) and a T1-weighted image (B).

[Figure: a T1 map (A) next to a T1-weighted image (B)]

  • As is clearly seen, the T1 map is much noisier, with a poorer signal-to-noise ratio than the T1-weighted image.  This introduces variability into the estimation of T1 and makes detecting subtle differences in T1 much more difficult.
  • T1 maps are more difficult to obtain, since the MRI machine must be very carefully calibrated.
  • They take longer to acquire, which can be problematic for some patient groups.
  • T1 maps are not included in most standard protocols.
  • Historically T1 maps have not been routinely acquired, which means they are missing from many long-term, longitudinal studies of multiple sclerosis and other neurodegenerative diseases.

This is where my research comes in.  Given a standard protocol, consisting of 4 clinical MRI scans, we want to predict the T1 map.  If we can do this with a reasonable degree of accuracy, we can overcome the problems above, as our estimated T1 maps will:

  • Have a better signal-to-noise ratio, similar to the contrast images used in the T1 map prediction model.
  • Be easy to obtain using only the standard protocol of T1-weighted, T2-weighted, FLAIR and PD-weighted volumes.
  • Not require any additional scan time.
  • Be retroactively available for longitudinal and cross-sectional imaging studies of MS and other neurodegenerative diseases.

Stay tuned for a future post on how we are trying to do this.

10 reasons to switch to ggplot


Making plots is a necessary and useful task for anyone who works with data. While making standard plots is a pretty straightforward task in most statistical programming languages, including R, things can get complicated when you use R’s base graphics to make custom plots (the plots you actually want to make). As with many of life’s problems, Hadley Wickham comes to the rescue. His R package ggplot2 is based on the principles outlined in Leland Wilkinson’s landmark 1999 book The Grammar of Graphics (hence “gg”). Full disclosure: I haven’t read the book. But I have been using ggplot exclusively for over a year, and I have become a believer in the gg approach to visualization. There are a lot of R users out there who are resistant to switching from base graphics to ggplot, and while there are some legitimate downsides and probably settings where it doesn’t make sense, for most users I believe it’s well worth the time and effort required to get started with ggplot. I mean, the New York Times thinks so! So for all those doubters (or newbies) out there, here goes my best shot at convincing you that switching to ggplot is worth the effort.

1. It can do quick-and-dirty and complex, so you only need one system

Many base users turn to lattice when they want more complexity, but with ggplot you only have to learn one system. The two most important functions in ggplot are qplot() (“q” for “quick”) and ggplot(). qplot() has syntax very similar to plot(), and it’s a great place for new users to start. ggplot(), on the other hand, gives you the power to create layered plots that tell a more complex story.

library(ggplot2)
# preview the diamonds dataset that comes with ggplot2
head(diamonds)
##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
qplot(x=carat, y=price, data=diamonds, geom="point") +
  ggtitle("I'm a qplot")
ggplot(data=diamonds, aes(x=carat, y=price)) + geom_point() +
  ggtitle("I'm a ggplot")

[Figure: price vs. carat scatterplots made with qplot (left) and ggplot (right)]

2. The default colors and other aesthetics are nicer.

In the example below, the qplot was just as easy to make but looks much prettier. The axis titles, tickmarks, margins and points are all much better in the ggplot default settings. The base plot could be made to look nicer, but that requires more work.

#make a plot using base
plot(x=diamonds$carat, y=diamonds$price, type="p")
title(main="I'm a base plot")
#make a "quick" plot using ggplot2
qplot(x=carat, y=price, data=diamonds, geom="point") +
  ggtitle("I'm a qplot")

[Figure: the same scatterplot made with base graphics (left) and qplot (right)]

3. Never again lose an axis title (or get told your pdf can’t be created) due to misspecified outer or inner margins.

We’ve all been there.
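As a quick illustration (the margin values below are deliberately too tight, purely to show the failure mode), base graphics will happily drop your axis titles, while the ggplot version sizes its margins around the labels automatically:

# base: with only 2 margin lines, the axis titles land outside the figure region and get cut off
par(mar=c(2, 2, 1, 1))  # bottom, left, top, right margins, in lines of text
plot(x=diamonds$carat, y=diamonds$price, xlab="carat", ylab="price")
# ggplot: no margin bookkeeping needed
ggplot(data=diamonds, aes(x=carat, y=price)) + geom_point()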

4. You can save plots (or the beginnings of a plot) as objects.

This is great when you want to make several slightly different versions of the same basic plot without having to repeat a lot of code.

# make the basis for a plot using ggplot and save it as an object, p
p <- ggplot(data = diamonds, aes(x = carat, y = price))
# add a geom (points) and display the plot
p + geom_point()

[Figure: scatterplot produced by p + geom_point()]
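The saved object can then be reused to build slightly different versions of the plot; a couple of illustrative variations (not from the original code):

# reuse p: same base plot, now colored by clarity
p + geom_point(aes(colour=clarity))
# reuse p again: transparent points with a smoother on top
p + geom_point(alpha=0.1) + geom_smooth()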

5. Multivariate exploration is greatly simplified through faceting and coloring.

Use facet_grid() or facet_wrap() to create a separate plot for each value of a factor variable. We don’t have to change any of the original plotting code, just add the facet command to it. Faceting can also be done on more than one categorical variable to create a grid of plots.

p + geom_point() + facet_grid(. ~ color, labeller = label_both)

[Figure: price vs. carat scatterplots faceted by color]

When it comes to continuous variables or factor variables with many levels, coloring by that variable may be more practical. Again, ggplot makes this very easy:

# color by a continuous variable
ggplot(data = diamonds, aes(x = carat, y = price, colour = depth)) + geom_point()
# color by a factor variable
ggplot(data = diamonds, aes(x = carat, y = price, colour = color)) + geom_point()

[Figure: price vs. carat colored by depth (continuous) and by color (factor)]

6. Easily build plots in layers to tell a more complete story.

For example, we might want to summarize the data in the previous plot with a smoother on top of the points. With ggplot, we can simply add the geom_smooth command. Each geom just adds another layer to the plot.

ggplot(data=diamonds, aes(x=carat, y=price, colour=clarity)) +
  geom_point(alpha=0.1) + geom_smooth()

[Figure: price vs. carat colored by clarity, with smoothers overlaid]

The smoothing method can be changed through the method argument of geom_smooth() (or stat_smooth(), which offers more fine-grained control).  I wanted the dots to be transparent so the smoothers could stand out, so I set alpha=0.1, meaning 10% opacity.
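For example, to swap in a straight-line fit instead of the default smoother (same data as above, purely for illustration):

ggplot(data=diamonds, aes(x=carat, y=price, colour=clarity)) +
  geom_point(alpha=0.1) + geom_smooth(method="lm")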

7. Let your plots evolve (or devolve) with minimal changes to code

Plotting is often very exploratory. With ggplot, it’s easy to add complexity, such as faceting, and equally easy to take it away. Base can be used to make most of the same plots, but it requires a lot of manual work that isn’t easy to undo to get back to a simpler version of the plot.

Here’s how some of the plots above would be accomplished using base. All these steps are feasible, but not trivial to do (or undo).

  • Faceting requires setting up a plot matrix using par(mfrow=c(nrow,ncol)), specifying different subsets of the data, and looping through these subsets.  The number of plots and universal axis limits must be pre-computed and set manually (see the sketch after this list).
  • Coloring by a different variable would require specifying each subset of the data and using points() or lines() to add each subset to the plot, specifying a different color for each subset.
  • To add a smoother, we would first compute the smoother and add it using lines().
  • To color by a continuous variable, you’d have to do something pretty clever.  I’d probably set up a palette with a large number of colors, say 20, using rainbow() or some other continuous scale, then discretize the continuous variable into 20 bins and use one color for each bin.
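To make the contrast concrete, here is a rough sketch (not from the original post) of manual faceting by cut in base graphics, next to the ggplot equivalent:

# base: manual "faceting" by cut, with grid size and axis limits set by hand
cuts <- levels(diamonds$cut)
par(mfrow=c(2, 3))
for (cc in cuts) {
  sub <- diamonds[diamonds$cut == cc, ]
  plot(sub$carat, sub$price,
       xlim=range(diamonds$carat), ylim=range(diamonds$price),
       xlab="carat", ylab="price", main=cc)
}
# ggplot: one added layer does the same job
ggplot(data=diamonds, aes(x=carat, y=price)) + geom_point() + facet_wrap(~ cut)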

Making complicated plots with base often just requires brute force. In a future post, I’m planning to show a side-by-side comparison of making the same plot using ggplot versus base. (Spoiler alert: ggplot wins!)

8. Make layered histograms and other cool plots

This is one of my favorite ways to compare distributions of a variable across groups. geom_histogram works well for two groups, but geom_density is easier to look at for several groups.

ggplot(data=diamonds, aes(x=price, fill=cut)) +
  geom_density(alpha=0.3)

[Figure: overlaid density plots of price by cut]
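For just two groups, an overlaid histogram along the same lines might look like this (the subset to two cut levels is only for illustration):

# overlay histograms for two groups; position="identity" prevents stacking
two_cuts <- subset(diamonds, cut %in% c("Fair", "Ideal"))
ggplot(data=two_cuts, aes(x=price, fill=cut)) +
  geom_histogram(alpha=0.3, position="identity")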

You can also make maps, consultants’ charts, plots with backgrounds, and other cool plots using ggplot.

9. It’s not that hard.

It’s really not hard to get started if you begin with qplot and build from there.  Many plots can be made using qplot, and many of the documentation pages start with a qplot example.  I actually made the switch to ggplot by forcing myself to use qplot any time I would otherwise have reached for plot.

10. The documentation is great.

ggplot is very well documented on the ggplot2 website and on Stack Overflow, and there are plenty of other tutorials and resources out there as well.

Of course, there are also some drawbacks and caveats worth considering:

  1. ggplot is often slower than base graphics.
  2. The default colors can be difficult to change.
  3. You might need to change the structure of your data frame to make certain plots.  The functions cast() and melt() from the reshape package are worth getting familiar with, and Hadley Wickham’s Tidy Data presentation is a good place to get started (see the small reshaping example after this list).
  4. lattice is another “newfangled” (true quote) alternative to base graphics that many people like.  Here’s a nice comparison of ggplot2 and lattice, and here are a couple more.
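As a tiny illustration of the reshaping point in item 3 (this sketch uses reshape2, the successor to reshape, and a made-up toy data frame):

library(reshape2)
# a small "wide" data frame: one measurement per column
wide <- data.frame(id=1:3, before=c(2.1, 3.4, 1.8), after=c(2.9, 3.1, 2.5))
# melt to "long" format: one row per (id, variable) pair, the shape ggplot usually wants
long <- melt(wide, id.vars="id")
# the former column names can now be mapped to an aesthetic
ggplot(data=long, aes(x=factor(id), y=value, fill=variable)) +
  geom_bar(stat="identity", position="dodge")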