Derivation and interpretation of leverage in SLR


I was recently looking online for an explicit derivation of leverage — the diagonals of the “hat matrix” — in simple linear regression. If X is the n × p matrix of explanatory variables in a linear model, then the hat matrix is H = X(X’X)⁻¹X’, so called because it puts the “hat” on the predicted values, since Ŷ = HY.

Leverage (hᵢ) has a lot of nice properties that can be quite informative for model diagnostics. For example, if Yᵢ were to change by 1 unit, then Ŷᵢ would change by hᵢ units. Leverage is also proportional to the uncertainty in the predicted values Ŷᵢ, since Var(Ŷ) = σ²H, where σ² is the error variance of the model.
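
As a quick illustration (my own sketch, not from the original post), these properties are easy to check numerically in R; the data here are simulated just for the example:

# compute the hat matrix directly and check its properties
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)

X <- cbind(1, x)                        # design matrix with intercept
H <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix H = X (X'X)^(-1) X'

fit <- lm(y ~ x)
all.equal(as.numeric(H %*% y), unname(fitted(fit)))  # Y-hat = H Y
all.equal(unname(diag(H)), unname(hatvalues(fit)))   # leverage h_i = i-th diagonal of H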

In the simple linear regression (SLR) case, if your model contains an intercept then leverage is of the form

hᵢ = 1/n + (xᵢ − x̄)² / Σⱼ (xⱼ − x̄)²

Therefore, leverage is similar to Mahalanobis distance, which measures distance from the “center” of the data, standardized by the “scale” or variance. The same is true for multiple linear regression with an intercept. However, the relationship isn’t entirely obvious (at least to me) from the hat matrix formula, and I couldn’t find the proof anywhere online (I’m sure it’s in someone’s linear models class notes but I didn’t find it from my digging around). So I wrote it up and posted it to my Resources page.

One interesting thing to note: if your model does not contain an intercept, this relationship doesn’t hold.  However, if you center the X’s, the relationship above again holds, but without the 1/n term at the beginning.
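
As a numerical sanity check (again my own sketch, not part of the original derivation), the identity, its connection to Mahalanobis distance, and the centering remark can all be verified in R on simulated data:

# verify h_i = 1/n + (x_i - xbar)^2 / sum((x_j - xbar)^2) for SLR with an intercept
set.seed(2)
n <- 30
x <- runif(n)
y <- 1 + 2 * x + rnorm(n)

h.lm   <- hatvalues(lm(y ~ x))
h.form <- 1/n + (x - mean(x))^2 / sum((x - mean(x))^2)
all.equal(unname(h.lm), h.form)                  # TRUE

# connection to Mahalanobis distance: h_i = 1/n + D_i^2 / (n - 1)
D2 <- mahalanobis(matrix(x), mean(x), var(x))
all.equal(h.form, 1/n + D2 / (n - 1))            # TRUE

# with no intercept the identity fails, but with centered x it holds again,
# minus the leading 1/n term
xc <- x - mean(x)
all.equal(unname(hatvalues(lm(y ~ xc - 1))), xc^2 / sum(xc^2))  # TRUE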

Tips for submitting to arXiv for the first time


Today I successfully submitted my first paper to arXiv!  We’ve submitted this paper to a journal, but it hasn’t been published yet, so we wanted to get a pre-print up before advertising the corresponding software packages.  Unfortunately, the process of submitting to arXiv wasn’t painless.  Now that I’ve figured out some of the quirks, however, hopefully your first submission can go a little more smoothly than mine did.

After registering, the first unpleasant surprise was that arXiv doesn’t accept PDF documents created from LaTeX code.  You must submit your LaTeX code and all supporting files.  As far as I can tell, these have to be uploaded one-by-one.  Sorry, I didn’t figure out a way around that one (but let me know if you do!).

After verifying that my code compiled without errors on my computer, I uploaded all my documents, hit “Process”, and … long page of errors!  It turns out arXiv doesn’t accept .bib files and instead requires .bbl files for BibTeX.  So I thought, “this will be easy” and uploaded my .bbl file instead.  I again hit “Process”, and … another long page of errors!

I wasn’t sure how to interpret these errors, so … I gave up for a while.  A few weeks later, I decided to give it another go.  I realized that I had made a couple of mistakes the first time:

  1. Several of my graphics were in PDF format.  This is fine, but I needed to include the line \pdfoutput=1 at the top of my preamble (top of my .tex document) to ensure that the .tex file is built using PDFLaTeX instead of LaTeX.
  2. I had probably uploaded some unnecessary files, such as graphics that weren’t included in the final version.  ArXiv doesn’t like this.  Make sure only necessary files are uploaded.

Fixing those two things finally resulted in an error-free compilation.  Hooray!  (If at this point you are still struggling with compilation errors, I highly recommend reading arXiv’s help page on Considerations for TeX Submissions.)

One final annoyance was the abstract.  I had included an abstract in my document, but arXiv has its own abstract form, and my current abstract exceeded the character limits.  Once I had pared it down, however, I got a couple of errors related to “unknown characters”.  I didn’t have anything crazy in the text, so this seemed a little ridiculous, but I went ahead and removed a ‘%’ sign and a couple of quotation marks and hyphens, which took care of the errors. I then went back and removed the abstract from the .tex document and re-uploaded it.  Submission (finally) complete!

Props to arXiv for a very nice set of help pages, the most relevant of which for this post is Submitting an Article.  This includes links to the TeX Submissions help page referenced above, among many others.

P.S. If you’re curious about the paper, it’s titled “Improving Reliability of Subject-Level Resting-State fMRI Parcellation with Shrinkage Estimators” and is available here.

Three ways to use MATLAB from R


Being a statistician working in neuroimaging is a little like living abroad and trying to speak a foreign language. For example, my first language is English, but I spent my first summer as a PhD student doing research at LMU in Munich, Germany. I had taken German in college and could have really basic conversations, but for more demanding situations I would switch to English in a heartbeat, given the option.

In the world of neuroimaging, or at least many corners of it, MATLAB is the official language. Many tools and programs for the analysis of fMRI data are programmed in MATLAB, and many of my collaborators program exclusively in MATLAB. I know MATLAB well enough to do basic things like reading in data, looping and playing with matrices, but beyond the basics, like many statisticians I am much more comfortable in R. Therefore, I often find myself straddling the line between R and MATLAB.

For example, in one of my projects, I used R to read in the fMRI time series, create summary statistics, and do some transformations on those statistics. I then used MATLAB to perform clustering on the results using a method developed by one of my collaborators. Finally, I again used R to visualize the results using brainR.  For that project, it was convenient enough to write separate programs for the different R and MATLAB parts. I saved the results of the first R part in a CSV file using write.table, read that file into MATLAB using csvread, saved the output of the MATLAB part in another CSV file, and finally read that into R using read.table. Not exactly elegant, but it did the job.

However, I also created and ran a simulation that required the same basic process but over many iterations, and the manual workflow I just described was not a realistic option. I needed a way to run MATLAB automatically from within my R program. (In my case, I eventually compromised and did the clustering in R for the purposes of the simulation, a la option 4 below.) I experimented with different ways to use MATLAB from within R, and I found three primary options. Below I’ll show how to do each of these and discuss their pros and cons, but in short the options are (1) to execute a single MATLAB line using the system() command in R; (2) to use a package like R.matlab to send code to the MATLAB server; or (3) to write out a MATLAB program within R using the writeLines() function and run the entire program using the system() command in R.

Option 1: Run a single MATLAB command at a time using system()

The system command in R runs the contents of a character string as a shell command – basically anything you’d type into a Terminal window on a Mac or in a Linux environment, such as the Hopkins cluster. For example, typing system("ls") will return a list of the files in the working directory. If your system has the ability to run MATLAB from the command line (true for the Hopkins cluster), you can take advantage of that through R by typing system("matlab -nodisplay -r 'stuff; to; do; in; matlab;'").

Tip: If you’re working locally on a Mac with MATLAB installed, you’d need to go to the directory where MATLAB is installed: for example, this works on my computer to open MATLAB on the command line: /Applications/ -nodisplay

Pros :)
  • Quick and easy
  • Possible to run multiple lines, separated by semicolons

Cons :(

  • All commands must be contained within a single line
  • Loops and line-to-line dependencies are somewhat tricky
  • Can’t pass data between MATLAB and R
  • If there is an error in your code, the execution will stop and MATLAB will remain open until you re-type exit.
system('matlab -nodisplay -r "a=2; b=1; display(a+b); exit"')
                                 < M A T L A B (R) >
                       Copyright 1984-2013 The MathWorks, Inc.
                         R2013b 64-bit (glnxa64)
                                   August 13, 2013

To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.

ans =

     3

To see an example of the last con – “If there is an error in your code, the execution will stop and MATLAB will remain open until you re-type exit” – suppose I had mis-spelled display as displya:

system('matlab -nodisplay -r "a=2; b=1; displya(a+b); exit;"')
                                 < M A T L A B (R) >
                       Copyright 1984-2013 The MathWorks, Inc.
                         R2013b 64-bit (glnxa64)
                                   August 13, 2013

To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.

Undefined function 'displya' for input arguments of type 'double'.


The MATLAB prompt (>>) indicates that we’re still in the MATLAB environment. Simply type exit to close MATLAB and return to R.

Option 2: Use R.matlab to send code to the MATLAB server

R.matlab is a package that communicates with MATLAB, can read and write MAT files, and can pass objects to (and receive objects from) MATLAB. It’s a little tricky to use at first, but the ability to pass objects between R and MATLAB can be very useful. However, this ability is limited to objects of a reasonable size, and can be slow.

Pros :)
  • Can spread commands over multiple lines
  • Loops and dependencies are easier
  • Can pass data back and forth using setVariable() and getVariable()

Cons :(

  • Can be tricky to use
  • Connection to the MATLAB server can be buggy
  • Clunky output
  • Can’t run multiple jobs in parallel on the same node, since the MATLAB server can only be opened once on each node, making it limited for simulations
  • Passing large objects back and forth doesn’t always work
# load the R.matlab library and start the MATLAB server
library(R.matlab)
matlab <- Matlab()
isOpen <- open(matlab)
                            < M A T L A B (R) >
                  Copyright 1984-2013 The MathWorks, Inc.
                      R2013b 64-bit (glnxa64)
                                August 13, 2013

To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.

Running MatlabServer v2.2.0
MATLAB v7.x or higher detected.
Saving with option -V6.
Added InputStreamByteWrapper to dynamic Java CLASSPATH.
MATLAB server started!
Trying to open server socket (port 9999)...done.
Connected to client.
if (!isOpen) throw("MATLAB server is not running: waited 30 seconds.")

# set a variable in R and send to MATLAB
x <- 10
setVariable(matlab, x = x)
Received cmd: 3
Will read MAT file: "/tmp/RtmpK3nM1k/file1821f2a4d0339.mat"
evaluate(matlab, "x")
Received cmd: 1
"eval" string: "x"

x =

    10

Sent byte: 0
# set a couple of variables in MATLAB
evaluate(matlab, "y=20; z=x+y")
Received cmd: 1
"eval" string: "y=20; z=x+y"

z =

    30

Sent byte: 0
# pass a variable from MATLAB to R
z <- getVariable(matlab, "z")
Received cmd: 1
"eval" string: "variables = {'z'};"
Sent byte: 0
Received cmd: 2
save(tmpname, '-V6', 'z')
z
$z
     [,1]
[1,]   30

attr(,"header")
attr(,"header")$description
[1] "MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Fri Aug 15 09:57:56 2014"

attr(,"header")$version
[1] "5"

attr(,"header")$endian
[1] "little"

# close the MATLAB server
close(matlab)
Received cmd: -1
MATLAB server shutdown!

Option 3: Write an entire MATLAB program using writeLines() and run using system()

The writeLines function can be used to create a MATLAB script (a .m file) from a character vector in R. Each element of the vector will become a separate line in the script. This approach can be useful when you want the MATLAB script to vary based on something in the R environment, such as a subject ID or parameters in a simulation. If, on the other hand, the MATLAB script never changes, you can just create the script in a text editor, rather than using writeLines. In either case, the script can be run from within R using the system command.

Pros :)
  • Can run multiple lines with loops and dependencies
  • Easier to use than R.matlab
  • No problems running multiple jobs in parallel on the same node

Cons :(

  • Passing data directly between MATLAB and R isn’t possible, so you must write data to a CSV file or similar
  • As in option 1, if there is an error in the MATLAB code, the execution will stop and MATLAB will remain open until you type exit.
#set a variable in R and save in a csv file
x <- 10
write.table(x, file='~/x.csv', sep=",", row.names=FALSE, col.names=FALSE)

#make a vector where each element is a line of MATLAB code
#matlab code reads in our variable x, creates two variables y and z,
#and writes z to a csv file
matlab.lines <- c(
    "x = csvread('~/x.csv')",
    "y = 20",
    "z = x + y",
    "csvwrite('~/z.csv', z)")

#create a MATLAB script containing all the commands in matlab.lines
writeLines(matlab.lines, con="~/myscript.m")

#run our MATLAB script
system("matlab -nodisplay -r \"run('~/myscript.m'); exit\"")
# read in the variable z we created in MATLAB
z <- read.table("~/z.csv")
#remove the temporary files we used to pass data between R and MATLAB
system("rm ~/x.csv ~/z.csv")
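
To illustrate the point about letting the script vary with the R environment (a hypothetical sketch of my own; the subject ID and file names are made up), the MATLAB lines can be built with paste0():

# build and run a subject-specific MATLAB script from R
subject.id <- "subj01"    # made-up ID for illustration
matlab.lines <- c(
    paste0("x = csvread('~/", subject.id, "_x.csv');"),
    "y = 20;",
    "z = x + y;",
    paste0("csvwrite('~/", subject.id, "_z.csv', z);"))
writeLines(matlab.lines, con=paste0("~/myscript_", subject.id, ".m"))
system(paste0("matlab -nodisplay -r \"run('~/myscript_", subject.id, ".m'); exit\""))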

Option 4: Stick with one or the other

Of course, another solution is to stick with either R or MATLAB, if it’s feasible. Opening the MATLAB server can be slow and clunky, making it problematic in a simulation setting, and passing large amounts of data back and forth can be difficult or impossible. In my simulation, I actually ended up avoiding MATLAB entirely for the sake of speed.

However, it takes time to re-write existing code in a different language, especially one you aren’t super comfortable with (Google Translate for code, anyone?). I’d love to one day be as comfortable in MATLAB as I am in R, just as I hope to one day speak German as well as I speak Spanish (and to speak Spanish as well as I speak English). Of course, the only way to do any of these things is to practice coding/speaking, but full immersion is time consuming and not always practical, and the solutions above can be helpful in the meantime.

Finally, thank you to John Muschelli for explaining option 3. I also heard a rumor that he might be writing a follow-up post with some additional tips…

If you have another way of interfacing between R and MATLAB, I’d love to hear about it!

Why I’m a Screen evangelist (and how to get started)


I’ve recently made several upgrades to my workflow that make me say “WHY OH WHY did it take me this long to start doing this?”  One of the big upgrades was using Sublime + Enhanced-R as explained by Alyssa Frazee to send code line-by-line to the Terminal with a simple keystroke.  Copy and paste no more!  It makes my fingers hurt just thinking about my old ways.  Thing is, I picked up on the Sublime workflow months after Alyssa and others in my department had started using it.  I read Alyssa’s blog post back in August, but it took me a while to take action.

Not this time!  Following a recommendation from Taki Shinohara, I recently started using GNU screen, or “screen” for short, and I am in nerd heaven.  I’m so excited about this that I am constantly asking everyone if they are “using screen yet?”.  I am sure I am getting on some nerves.  But seriously, it’s the awesomest.  Here’s why.

I often work on my department’s cluster interactively, whether for debugging code that I will submit as a batch job later, or just running code that doesn’t really need to be submitted as a batch job.  I love this way of working:  I get the best of the cluster and my computer at the same time. I can use tons of memory (100G in some cases) and my computer doesn’t get bogged down.  At the same time, I can debug as I go, easily track my code’s progress, and write code as I go.  However, there is one major issue that, until recently, stopped me from working this way most of the time: the connection to the cluster could be lost, taking all my code’s progress down with it.  This can happen for a myriad of reasons: a faulty internet connection, I close my computer, I need to go home for the day, the connection simply times out after a while, and so on.  To get over these limitations, I would often use interactive sessions for the purpose of debugging and code building, but submit my code as a batch job when I needed it to run without interruption.  Alternatively, I would often work interactively, but save my progress in data files periodically and reload the saved files if the connection was lost for any reason.  This workflow does the job, but it’s not convenient or failproof.

Enter screen.  Screen allows me to work interactively for an indefinite length of time without ever having to worry about losing the connection.  A quick rundown:

1. I log onto the cluster and start a new screen by typing ‘screen’.

2. I get on a cluster node (‘qrsh’), start R or MATLAB, and submit whatever code I want to run.

3. When I am ready to stop for the time being, I type ctrl-A ctrl-D to “detach” the screen.

I can then close my computer and move on with my life, and my session will still be there the next time I need it.

4. When I’m ready to look at it again, I log onto the cluster and type ‘screen -r’ to “reattach” the screen.  And it’s back, just like magic.  This works even if code is running when I want to stop working.

Screen also works if the connection is inadvertently lost.  If you close your computer or lose your internet connection before detaching the screen, you simply need to “force detach” before reattaching it.  To do this, log onto the cluster, type ‘screen -D’ to force detach and then ‘screen -r’ to reattach.

You can also have more than one screen session open at once, running completely independently.  In this case, each screen session has an ID, and to reattach the screen with ID 12345, you’d type ‘screen -r 12345’.  If you don’t know the screen ID, simply type ‘screen -r’ and screen will return a list of the sessions and their IDs.

Keep in mind that it’s good practice to exit screen (type ‘ctrl-A ctrl-K’ to kill an attached session) when you’re done working to make sure you aren’t taking up resources that might be needed by other users.

That’s it.  Simple and super convenient.  As long as screen is installed on your cluster, you don’t have to do anything additionally to start using it and stop losing work!

What is quantitative MRI, and why does it matter?


This post is going to be a gentle introduction to MRI, which is a type of brain imaging and my current focus of research.  I’m planning a future post or two about my research, so I thought I’d provide a little background for people who don’t know much, if anything, about MRI.  I am by training a statistician by way of industrial engineering, so I am by no means a neurologist, radiologist, physicist or other type of “-ist”, meaning this is going to be pretty basic.  This particular post will be about something called relaxation times and quantitative MRI scans.

I got interested in neuroimaging about a year ago, thanks to Elizabeth Sweeney, who mentored me through the process of getting up to speed.  It’s a fascinating area for biostatistics, as it combines big data, nice visualizations, and interesting and impactful biological questions.  As imaging technologies like MRI get more sophisticated, scientists are really just starting to understand the details about how the brain works.  And statisticians are just starting to get involved in this learning process, so there are a lot of opportunities to look at existing questions and problems from a new, more statistical perspective.

For about the past year, I have been working on a project having to do with quantitative MRI with Elizabeth, Taki Shinohara, and people at the NINDS.  I’ll talk more about that project in a future post, but for now here’s a basic introduction to the topic.

Imagine yourself lying flat in an MRI scanner.  The scanner creates a strong magnetic field of 3-7 Tesla.  Under that magnetic field, the protons in your brain, which were previously pointing in every direction, align with the field.  The scanner then applies a radiofrequency pulse that knocks the protons out of alignment, and once the pulse is switched off, the protons “relax” back toward their aligned state.  MRI scanners can measure the “T1 relaxation time”, the time constant describing how quickly the protons relax back.  At least, they can try to.  More on that in a bit.
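
To make “measuring T1” a bit more concrete, here is a toy sketch (my own illustration with made-up numbers, not from the original post): T1 is the time constant of an exponential return of the signal toward its equilibrium value, so for a single voxel one could estimate it by fitting the recovery curve M(t) = M0(1 − exp(−t/T1)) to signal measured at several recovery times.

# toy example: estimate T1 for one "voxel" by fitting an exponential recovery curve
set.seed(1)
times  <- c(100, 300, 600, 1000, 2000, 4000)            # recovery times (ms)
M0     <- 1000                                          # equilibrium signal (arbitrary units)
T1     <- 900                                           # "true" T1 in ms (made-up value)
signal <- M0 * (1 - exp(-times / T1)) + rnorm(length(times), sd=20)

fit <- nls(signal ~ M0hat * (1 - exp(-times / T1hat)),
           start=list(M0hat=max(signal), T1hat=500))
coef(fit)["T1hat"]   # estimated T1 for this voxel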

But first, why measure the T1 relaxation time?  The interest in relaxation times began in 1971, when it was discovered that tumors have higher relaxation times than regular tissue.  Since then, studies have established that people with certain disorders, including multiple sclerosis (MS), tend to have widespread increases in T1 relaxation times, indicating neurological degradation, including loss of axons and myelin in the brain.  And while some affected tissue, such as tumors and lesions, is easy to see on a regular, unitless MRI scan, other tissue with elevated T1 doesn’t look visibly different on MRI from healthy tissue.  That’s where quantitative images that measure T1 (“T1 maps”) come in.  Though affected tissue may not look different to the naked eye on the scan, the measured T1 tells the story.  If we can measure T1 in normal-appearing gray and white matter, we can better understand the widespread neuro-degenerative impacts of diseases like MS.  Through longitudinal studies of T1, we can better understand long-term disease burden and hopefully design better treatments and relapse detection methods.

So if T1 maps are so great, what’s the problem?  There are several major issues with T1 maps, compared with the more common clinical MRI scans, such as T1-weighted, which offer a contrast image but lack interpretable units.  Here’s an example of a T1 map (A) and a T1-weighted image (B).

  • As is clearly seen, the T1 map has high levels of noise and poor signal to noise ratio compared with the T1-weighted image.  This introduces high levels of variability into the estimation of T1 levels and makes detecting subtle differences in T1 much more difficult.
  • T1 maps are more difficult to obtain, since the MRI machine must be very carefully calibrated.
  • They take longer to acquire, which can be problematic for some patient groups.
  • T1 maps are not included in most standard protocols.
  • Historically T1 maps have not been routinely acquired, which means they are missing from many long-term, longitudinal studies of multiple sclerosis and other neurodegenerative diseases.

This is where my research comes in.  Given a standard protocol, consisting of 4 clinical MRI scans, we want to predict the T1 map.  If we can do this with a reasonable degree of accuracy, we can overcome the problems above, as our estimated T1 maps will:

  • Have better signal to noise ratio, similar to the contrast images used in the T1 map prediction model.
  • Be easy to obtain using only the standard protocol of T1-weighted, T2-weighted, FLAIR and PD-weighted volumes.
  • Not require any additional scan time.
  • Be retroactively available for longitudinal and cross-sectional imaging studies of MS and other neurodegenerative diseases.

Stay tuned for a future post on how we are trying to do this.

10 reasons to switch to ggplot


Making plots is a necessary and useful task for anyone who works with data. While making standard plots is a pretty straightforward task in most statistical programming languages, including R, things can get complicated when you use R’s base graphics to make custom plots (the plots you actually want to make). As with many of life’s problems, Hadley Wickham comes to the rescue. His R package ggplot2 is based on the principles outlined in Leland Wilkinson’s landmark 1999 book The Grammar of Graphics (hence “gg”). Full disclosure: I haven’t read the book. But I have been using ggplot exclusively for over a year, and I have become a believer in the gg approach to visualization. There are a lot of R users out there who are resistant to switching from base graphics to ggplot, and while there are some legitimate downsides and probably settings where it doesn’t make sense, for most users I believe it’s well worth the time and effort required to get started with ggplot. I mean, the New York Times thinks so! So for all those doubters (or newbies) out there, here goes my best shot at convincing you that switching to ggplot is worth the effort.

1. It can do quick-and-dirty and complex, so you only need one system

Many base users turn to lattice when they want more complexity, but with ggplot you only have to learn one system. The most important functions in ggplot are: qplot() (“q” for “quick”) and ggplot(). qplot() shares very similar syntax with plot(), and it’s a great place for new users to start. ggplot(), on the other hand, gives you the power to create layered plots that tell a more complex story.

# load ggplot2 and look at the first rows of the diamonds dataset
library(ggplot2)
head(diamonds)
##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
qplot(x=carat, y=price, data=diamonds, geom="point") +
  ggtitle("I'm a qplot")
ggplot(data=diamonds, aes(x=carat, y=price)) + geom_point() +
  ggtitle("I'm a ggplot")


2. The default colors and other aesthetics are nicer.

In the example below, the qplot was just as easy to make but looks much prettier. The axis titles, tickmarks, margins and points are all much better in the ggplot default settings. The base plot could be made to look nicer, but that requires more work.

#make a plot using base
plot(x=diamonds$carat, y=diamonds$price, type="p")
title(main="I'm a base plot")
#make a "quick" plot using ggplot2
qplot(x=carat, y=price, data=diamonds, geom="point") +
  ggtitle("I'm a qplot")


3. Never again lose an axis title (or get told your pdf can’t be created) due to misspecified outer or inner margins.

We’ve all been there.

4. You can save plots (or the beginnings of a plot) as objects.

This is great when you want to make several slightly different versions of the same basic plot without having to repeat a lot of code.

# make the basis for a plot using ggplot and save it as an object, p
p <- ggplot(data = diamonds, aes(x = carat, y = price))
# add a geom (points) and display the plot
p + geom_point()


5. Multivariate exploration is greatly simplified through faceting and coloring.

Use facet_grid() or facet_wrap() to create a separate plot for each value of a factor variable. We don’t have to change any of the original plotting code, just add the facet command to it. Faceting can also be done on more than one categorical variable to create a grid of plots.

p + geom_point() + facet_grid(. ~ color, labeller = label_both)


When it comes to continuous variables or factor variables with many levels, coloring by that variable may be more practical. Again, ggplot makes this very easy:

# color by a continuous variable
ggplot(data = diamonds, aes(x = carat, y = price, colour = depth)) + geom_point()
# color by a factor variable
ggplot(data = diamonds, aes(x = carat, y = price, colour = color)) + geom_point()


6. Easily build plots in layers to tell a more complete story.

For example, we might want to summarize the data in the previous plot with a smoother on top of the points. With ggplot, we can simply add the geom_smooth command. Each geom just adds another layer to the plot.

ggplot(data=diamonds, aes(x=carat, y=price, colour=clarity)) +
  geom_point(alpha=0.1) + geom_smooth()


The default smoother can be changed through the method argument of geom_smooth() (or use stat_smooth() for more fine-grained control). I wanted the dots to be transparent so the smoothers could stand out, so I set alpha=0.1, meaning 10% opacity.
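
For example (a quick illustration of my own), swapping in a straight-line fit is just a matter of changing the method argument:

ggplot(data=diamonds, aes(x=carat, y=price, colour=clarity)) +
  geom_point(alpha=0.1) + geom_smooth(method="lm")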

7. Let your plots evolve (or devolve) with minimal changes to code

Plotting is often very exploratory. With ggplot, it’s easy to add complexity, such as faceting, and equally easy to take it away. Base can be used to make most of the same plots, but it requires a lot of manual work that isn’t easy to undo to get back to a simpler version of the plot.

Here’s how some of the plots above would be accomplished using base (a rough sketch follows the list). All these steps are feasible, but not trivial to do (or undo).

  • Faceting requires setting up a plot matrix using par(mfrow=c(nrow,ncol)), specifying different subsets of the data, and looping through these subsets. The number of plots and universal axis limits must be pre-computed and set manually.
  • Coloring by a different variable would require specifying each subset of the data and using points() or lines() to add each subset to the plot, specifying a different color for each subset.
  • To add a smoother, we would first compute the smoother and add it using lines().
  • To color by a continuous variable, you’d have to do something pretty clever. I’d probably set up a palette with a large number of colors, say 20, using rainbow or some other continuous scale, then discretize the continuous variable so that the discrete version has 20 values and use one color for each unique value.

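To give a flavor of the extra bookkeeping involved, here is a rough base-graphics sketch of the first two bullets (my own illustration, not from the original post), using the diamonds data loaded earlier:

# manual "faceting" by cut: one panel per level, shared axis limits set by hand
cuts <- levels(diamonds$cut)
par(mfrow=c(2, 3))
for (cc in cuts) {
  d <- diamonds[diamonds$cut == cc, ]
  plot(d$carat, d$price, pch=".", xlab="carat", ylab="price", main=cc,
       xlim=range(diamonds$carat), ylim=range(diamonds$price))
}
par(mfrow=c(1, 1))

# manual coloring by the factor 'color': add each subset with points()
plot(diamonds$carat, diamonds$price, type="n", xlab="carat", ylab="price")
cols <- rainbow(nlevels(diamonds$color))
for (i in seq_len(nlevels(diamonds$color))) {
  d <- diamonds[diamonds$color == levels(diamonds$color)[i], ]
  points(d$carat, d$price, pch=".", col=cols[i])
}
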
Making complicated plots with base often just requires brute force. In a future post, I’m planning to show a side-by-side comparison of making the same plot using ggplot versus base. (Spoiler alert: ggplot wins!)

8. Make layered histograms and other cool plots

This is one of my favorite ways to compare distributions of a variable across groups. geom_histogram works well for two groups, but geom_density is easier to look at for several groups.

# compare the distribution of price across cut levels with layered densities
ggplot(data=diamonds, aes(x=price, fill=cut)) +
  geom_density(alpha=0.3)


You can also make maps, consultants’ charts, plots with backgrounds, and other cool plots using ggplot.

9. It’s not that hard.

It’s really not hard to get started if you start with qplot and build from there. Many plots can be made using qplot, and many of the documentation pages start with a qplot example. I actually made the switch to ggplot by forcing myself to use qplot instead of plot any time I wanted to make a plot.

10. The documentation is great.

ggplot is very well-documented on the ggplot2 website and on Stack Overflow, and there are plenty of other resources out there.

Of course, there are also some drawbacks and caveats worth considering:

  1. ggplot is often slower than base graphics.
  2. The default colors can be difficult to change.
  3. You might need to change the structure of your data frame to make certain plots. The functions cast() and melt() from the reshape package are worth getting familiar with (see the small example after this list). Hadley Wickham’s Tidy Data presentation is a good place to get started.
  4. lattice is another “newfangled” (true quote) alternative to base graphics that many people like; there are several nice comparisons of ggplot2 and lattice online.
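
For item 3 above, here is a small sketch (my own example with made-up data) of melting a wide data frame into the long format that ggplot expects. It uses melt() from reshape2, the successor to the reshape package mentioned above:

library(reshape2)

# wide format: one column per time point
wide <- data.frame(id=1:5,
                   before=rnorm(5, mean=10),
                   after =rnorm(5, mean=12))

# long format: one row per (id, time) pair, ready for ggplot aesthetics
long <- melt(wide, id.vars="id", variable.name="time", value.name="score")

ggplot(long, aes(x=time, y=score, group=id)) + geom_line()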