Making plots is a necessary and useful task for anyone who works with data. While making the standard plots is pretty straightforward in most statistical programming languages, including R, things can get complicated when it comes to using R’s base graphics to make custom plots (the plots you actually want to make). As with many of life’s problems, Hadley Wickham comes to the rescue. His ggplot2 package is based on the principles outlined in Leland Wilkinson’s landmark 1999 book The Grammar of Graphics (hence “gg”). Full disclosure: I haven’t read the book. But I have been using ggplot exclusively for over a year, and I have become a believer in the gg approach to visualization.

There are a lot of R users out there who are resistant to switching from base graphics to ggplot. There are some legitimate downsides, and probably settings where it doesn’t make sense, but for most users I believe it’s well worth the time and effort required to get started with ggplot. I mean, the New York Times thinks so! So for all those doubters (or newbies) out there, here goes my best shot at convincing you that switching to ggplot is worth the effort.
1. It can do quick-and-dirty and complex, so you only need one system
Many base users turn to lattice when they want more complexity, but with ggplot you only have to learn one system. The two most important functions in ggplot are qplot() (“q” for “quick”) and ggplot(). qplot() shares very similar syntax with plot(), and it’s a great place for new users to start. ggplot(), on the other hand, gives you the power to create layered plots that tell a more complex story.
    library(ggplot2)
    # load the diamonds dataset
    head(diamonds)

    ##   carat       cut color clarity depth table price    x    y    z
    ## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
    ## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
    ## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
    ## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
    ## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
    ## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
    qplot(x=carat, y=price, data=diamonds, geom="point") + ggtitle("I'm a qplot")
    ggplot(data=diamonds, aes(x=carat, y=price)) + geom_point() + ggtitle("I'm a ggplot")
2. The default colors and other aesthetics are nicer.
In the example below, the qplot was just as easy to make as the base plot but looks much prettier. The axis titles, tick marks, margins and points are all much better in the ggplot default settings. The base plot could be made to look nicer, but that requires more work.
    # make a plot using base
    plot(x=diamonds$carat, y=diamonds$price, type="p")
    title(main="I'm a base plot")

    # make a "quick" plot using ggplot2
    qplot(x=carat, y=price, data=diamonds, geom="point") + ggtitle("I'm a qplot")
3. Never again lose an axis title (or get told your pdf can’t be created) due to misspecified outer or inner margins.
We’ve all been there.
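As a rough sketch of what this looks like in practice: ggplot lays out the title, axis labels and legend itself, so saving to PDF needs little more than a file name and a page size. The labels, file location and dimensions below are arbitrary choices made for illustration.

```r
library(ggplot2)

# Build a plot with a title and axis labels. No par(mar = ...) or
# par(oma = ...) calls are needed to make room for them.
p_margins <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point() +
  labs(title = "Price by carat", x = "Carat", y = "Price (US dollars)")

# Write the plot to a PDF. The path and the 7 x 5 inch size are
# illustrative assumptions; use whatever suits your document.
out_file <- file.path(tempdir(), "diamonds.pdf")
ggsave(filename = out_file, plot = p_margins, width = 7, height = 5)
```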
4. You can save plots (or the beginnings of a plot) as objects.
This is great when you want to make several slightly different versions of the same basic plot without having to repeat a lot of code.
    # make the basis for a plot using ggplot and save it as an object, p
    p <- ggplot(data = diamonds, aes(x = carat, y = price))

    # add a geom (points) and display the plot
    p + geom_point()
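To make the "several slightly different versions" idea concrete, here is a small sketch that builds three variants from the same saved object. The variant names (p_points, p_faint, p_log) and the particular tweaks are my own choices for illustration.

```r
library(ggplot2)

# The shared starting point: data and aesthetic mappings, no geoms yet.
p <- ggplot(data = diamonds, aes(x = carat, y = price))

# Three variations built from the same object. Adding layers returns a
# new plot object, so p itself is never modified.
p_points <- p + geom_point()
p_faint  <- p + geom_point(alpha = 0.1)
p_log    <- p + geom_point(alpha = 0.1) + scale_x_log10() + scale_y_log10()

# Printing an object draws it.
print(p_log)
```

Because p stays untouched, you can keep branching off it without worrying about one experiment leaking into the next.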
5. Multivariate exploration is greatly simplified through faceting and coloring.
Use facet_wrap() or facet_grid() to create a separate plot for each value of a factor variable. We don’t have to change any of the original plotting code, just add the facet command to it. Faceting can also be done on more than one categorical variable to create a grid of plots.
    p + geom_point() + facet_grid(. ~ color, labeller = label_both)
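Faceting on two categorical variables might look like the sketch below, which puts cut on the rows and color on the columns. The choice of cut as the second variable and the alpha value are assumptions made for illustration.

```r
library(ggplot2)

# Same base plot as before: data and aesthetics only.
p <- ggplot(data = diamonds, aes(x = carat, y = price))

# Rows are defined by cut (5 levels) and columns by color (7 levels),
# giving a 5 x 7 grid of panels with shared axes.
p_grid <- p +
  geom_point(alpha = 0.1) +
  facet_grid(cut ~ color, labeller = label_both)

print(p_grid)
```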
When it comes to continuous variables or factor variables with many levels, coloring by that variable may be more practical. Again, ggplot makes this very easy:
    # color by a continuous variable
    ggplot(data = diamonds, aes(x = carat, y = price, colour = depth)) + geom_point()

    # color by a factor variable
    ggplot(data = diamonds, aes(x = carat, y = price, colour = color)) + geom_point()
6. Easily build plots in layers to tell a more complete story.
For example, we might want to summarize the data in the previous plot with a smoother on top of the points. With ggplot, we can simply add the geom_smooth() command. Each geom just adds another layer to the plot.
    ggplot(data=diamonds, aes(x=carat, y=price, colour=clarity)) +
      geom_point(alpha=0.1) +
      geom_smooth()
The default smoother can be changed through the method argument of geom_smooth(), or by using stat_smooth() for more fine-grained control. I wanted the dots to be transparent so the smoothers could stand out, so I set alpha=0.1 in geom_point().
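As one hedged illustration of changing the smoother, the sketch below asks geom_smooth() for a straight-line fit per clarity level instead of the default. Picking a linear fit and hiding the confidence band are my choices here, not something the plot above requires.

```r
library(ggplot2)

# method = "lm" fits a linear model within each clarity group;
# se = FALSE hides the confidence band around each line.
p_lm <- ggplot(data = diamonds, aes(x = carat, y = price, colour = clarity)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", se = FALSE)

print(p_lm)
```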
7. Let your plots evolve (or devolve) with minimal changes to code
Plotting is often very exploratory. With ggplot, it’s easy to add complexity, such as faceting, and equally easy to take it away. Base can be used to make most of the same plots, but it requires a lot of manual work that isn’t easy to undo to get back to a simpler version of the plot.
Here’s how some of the plots above would be accomplished using base. All these steps are feasible, but not trivial to do (or undo).
- Faceting requires setting up a plot matrix using par(mfrow=c(nrow,ncol)), specifying different subsets of the data, and looping through these subsets. The number of plots and universal axis limits must be pre-computed and set manually.
- Coloring by a different variable would require specifying each subset of the data and using points() or lines() to add each subset to the plot, specifying a different color for each subset.
- To add a smoother, we would first compute the smoother and then add it to the plot using a low-level function such as lines().
- To color by a continuous variable, you’d have to do something pretty clever. I’d probably set up a palette with a large number of colors, say 20, using rainbow() or some other continuous scale, then discretize the continuous variable so that the discrete version has 20 values and use one color for each unique value.
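The steps in the list above could be sketched in base roughly as follows. This is only one way to do it: faceting by cut, using lowess() as the smoother, and the plotting symbol are all assumptions made for the example.

```r
library(ggplot2)  # loaded only for the diamonds dataset

# Faceting: set up a 1 x 5 plot matrix, pre-compute shared axis limits,
# then loop over the subsets by hand.
cut_levels <- levels(diamonds$cut)
x_limits   <- range(diamonds$carat)
y_limits   <- range(diamonds$price)
old_par    <- par(mfrow = c(1, length(cut_levels)))
for (lv in cut_levels) {
  sub <- diamonds[diamonds$cut == lv, ]
  plot(x = sub$carat, y = sub$price, xlim = x_limits, ylim = y_limits,
       main = lv, xlab = "carat", ylab = "price")
}
par(old_par)  # restore the single-panel layout

# Coloring by a continuous variable: discretize depth into 20 bins and
# give each bin one color from a 20-color palette.
n_colors   <- 20
palette_20 <- rainbow(n_colors)
depth_bin  <- cut(diamonds$depth, breaks = n_colors, labels = FALSE)
plot(x = diamonds$carat, y = diamonds$price,
     col = palette_20[depth_bin], pch = 16,
     xlab = "carat", ylab = "price")

# Smoother: compute it first, then draw it over the existing points.
smooth_fit <- lowess(x = diamonds$carat, y = diamonds$price)
lines(smooth_fit, col = "black", lwd = 2)
```

Note how much of this is bookkeeping (axis limits, panel counts, palettes) that has to be revisited whenever the plot changes.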
Making complicated plots with base often just requires brute force. In a future post, I’m planning to show a side-by-side comparison of making the same plot using ggplot versus base. (Spoiler alert: ggplot wins!)
8. Make layered histograms and other cool plots
This is one of my favorite ways to compare distributions of a variable across groups. geom_histogram works well for two groups, but geom_density is easier to look at for several groups.
    ggplot(data=diamonds, aes(x=price, fill=cut)) + geom_density(alpha=0.3)
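For the two-group case, a layered histogram might look like the sketch below. The two cut levels, the bin width and the transparency are arbitrary choices for illustration.

```r
library(ggplot2)

# Keep two of the five cut levels so the overlaid histograms stay readable.
two_cuts <- subset(diamonds, cut %in% c("Premium", "Ideal"))

# position = "identity" overlays the bars instead of stacking them;
# alpha keeps both groups visible where they overlap.
p_hist <- ggplot(data = two_cuts, aes(x = price, fill = cut)) +
  geom_histogram(position = "identity", alpha = 0.5, binwidth = 500)

print(p_hist)
```

Without position = "identity" the histogram bars are stacked by default, which makes the two distributions harder to compare.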
9. It’s not that hard.
It’s really not hard to get started if you start with qplot and build from there. Many plots can be made using qplot, and many of the documentation pages start with a qplot example. I actually made the switch to ggplot by forcing myself to use qplot instead any time I wanted to use plot.
10. The documentation is great.
ggplot is very well-documented on the ggplot2 website and on Stack Overflow. Some other resources are:
- Hadley Wickham’s ggplot2 book and paper
- Cookbook for R: Graphs
- This tutorial on ggplot2
- Some really nice videos by Roger Peng.
Of course, there are also some drawbacks and caveats worth considering:
- ggplot is often slower than base graphics.
- The default colors can be difficult to change.
- You might need to change the structure of your data frame to make certain plots. Functions such as melt() from the reshape package are worth getting familiar with. Hadley Wickham’s Tidy Data presentation is a good place to get started.
- lattice is another “newfangled” (true quote) alternative to base graphics that many people like. Here’s a nice comparison of ggplot2 and lattice. Here’s a couple more.
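To give a feel for the reshaping caveat, here is a minimal sketch that turns the three dimension columns of diamonds (x, y and z) into one long column so they can be mapped to a single aesthetic. It assumes the reshape2 package (the successor to reshape) is installed; the sample size and the column names dimension and mm are made up for the example.

```r
library(reshape2)  # assumption: reshape2 rather than the older reshape
library(ggplot2)

# The x, y and z measurements sit in three separate columns (wide format).
# Take a small random sample so the example runs quickly.
set.seed(1)
rows <- sample(nrow(diamonds), 1000)
wide <- as.data.frame(diamonds[rows, c("cut", "x", "y", "z")])

# melt() stacks the three measurement columns into one value column (mm),
# with a second column (dimension) recording where each value came from.
long <- melt(wide, id.vars = "cut", measure.vars = c("x", "y", "z"),
             variable.name = "dimension", value.name = "mm")

# With one column per aesthetic, dimension can now be mapped to fill.
p_long <- ggplot(data = long, aes(x = mm, fill = dimension)) +
  geom_density(alpha = 0.3)

print(p_long)
```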