Advanced Plotting in R

Outline

  • The following topics will be covered in this lecture:
  • Advanced plotting
    • ggplot2 basics
    • Graphics layers
    • Transformations and statistics

Advanced plotting

  • There are three main plotting systems in R,
    1. the base plotting system which we have seen already;
    2. the lattice package;
    3. and the ggplot2 package.
  • For the rest of the session, we’ll learn about the ggplot2 package, because it is a widely used plotting library in R for creating publication-quality graphics.
  • ggplot2 is built on the idea that any plot can be expressed from the same set of components:
    1. a data set,
    2. a coordinate system, and
    3. a set of geoms – the visual representation of data points.
  • The key to understanding ggplot2 is thinking about a figure in layers.
  • This idea may be familiar to you if you have used image editing programs like Photoshop, Illustrator, or Inkscape.
  • We will begin by loading the gapminder data again along with ggplot2:
library(gapminder)  # provides the gapminder data frame
library(ggplot2)    # provides ggplot(), aes(), and the geom_* layers

ggplot2 basics

  • Let's start off with an example:
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

  • The first thing we do is call the ggplot function.

  • This function lets R know that we're creating a new plot, and any of the arguments we give the ggplot function are the global options for the plot:

    • i.e., they apply to all layers on the plot.

ggplot2 basics

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

  • We've passed in two arguments to ggplot.

  • First, we tell ggplot what data we want to show on our figure, in this example the gapminder data we loaded earlier.

  • For the second argument, we passed in the aes function, which tells ggplot how variables in the data map to aesthetic properties of the figure;

    • in this case the aesthetic properties are the x and y locations.
  • Here we told ggplot we want to plot the “gdpPercap” column on the x-axis and the “lifeExp” column on the y-axis.

ggplot2 basics

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

  • Notice that we didn't need to pass these columns to aes explicitly (e.g. x = gapminder[, "gdpPercap"]);

    • this is because ggplot looks up each column name in the data frame supplied as data.
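
  • A minimal sketch of that lookup, assuming the gapminder and ggplot2 packages are installed. layer_data() is a ggplot2 helper that returns the data a layer will draw; the names p and built are illustrative only.

```r
library(gapminder)
library(ggplot2)

# aes() records the column names; ggplot resolves them against `data`.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

# Inspect the data that the first (and only) layer will draw.
built <- layer_data(p, 1)
head(built[, c("x", "y")])

# One plotted point per row of gapminder, spanning the same x range.
nrow(built) == nrow(gapminder)
all.equal(range(built$x), range(gapminder$gdpPercap))
```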

ggplot2 basics

  • By itself, the call to ggplot isn't enough to draw a figure:
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))

  • We need to tell ggplot how we want to visually represent the data, which we do by adding a new geom layer.

  • In our example, we used geom_point, which tells ggplot we want to visually represent the relationship between x and y as a scatterplot of points.
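
  • One way to see why the bare ggplot call draws no data is to count the layers stored on the plot object. This is a small sketch; base_plot and scatter are illustrative names.

```r
library(gapminder)
library(ggplot2)

# The call to ggplot alone sets up data and mappings but adds no layer.
base_plot <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
length(base_plot$layers)

# Adding a geom appends one layer, which is what actually draws the points.
scatter <- base_plot + geom_point()
length(scatter$layers)
```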

Layers

  • A scatterplot probably isn't the best way to visualize change over time, such as life expectancy by year.

  • Instead, let's tell ggplot to visualize the data as a line plot:

ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, by=country, color=continent)) +
  geom_line()

  • Instead of adding a geom_point layer, we've added a geom_line layer.

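  • We've also added the by mapping, which results in one line per country; ggplot2's documented name for this grouping aesthetic is group, and by appears to work here only because ggplot2 groups by every discrete variable mapped in the plot.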
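
  • A sketch of the same line plot written with group, the grouping aesthetic documented by ggplot2, assuming the gapminder and ggplot2 packages are installed. The name p_group is illustrative; layer_data() is used only to count the groups that will be drawn.

```r
library(gapminder)
library(ggplot2)

# group = country asks for one line per country.
p_group <- ggplot(data = gapminder,
                  mapping = aes(x = year, y = lifeExp,
                                group = country, color = continent)) +
  geom_line()

# The number of drawn groups matches the number of countries in the data.
n_groups <- length(unique(layer_data(p_group)$group))
n_countries <- length(unique(gapminder$country))
c(groups = n_groups, countries = n_countries)
```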

Layers

  • Q: what do you think we can do to visualize both lines and points on the plot?

  • A: We can add another layer to the plot:

ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, by=country, color=continent)) +
  geom_line() + geom_point()

Layers

  • It's important to note that each layer is drawn on top of the previous layer.

  • In this example, the points have been drawn on top of the lines.

  • Here's a demonstration:

ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, by=country)) +
  geom_line(mapping = aes(color=continent)) + geom_point()

  • In this example, the aesthetic mapping of color has been moved from the global plot options in ggplot to the geom_line layer so it no longer applies to the points.

    • Now we can clearly see that the points are drawn on top of the lines.
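
  • The drawing order can also be checked without looking at the figure. The sketch below lists the geom class of each layer in the order it was added; layer_geoms is an illustrative name.

```r
library(gapminder)
library(ggplot2)

p <- ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp, by = country)) +
  geom_line(mapping = aes(color = continent)) + geom_point()

# Layers are stored in the order they were added and drawn in that same order,
# so later layers end up on top of earlier ones.
layer_geoms <- unname(vapply(p$layers,
                             function(layer) class(layer$geom)[1],
                             character(1)))
layer_geoms
```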

Layers

  • So far, we've seen how to use an aesthetic (such as color) as a mapping to a variable in the data.

  • For example, when we use geom_line(mapping = aes(color=continent)), ggplot will give a different color to each continent.

  • But what if we want to change the color of all lines to blue?

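    • It may seem that geom_line(mapping = aes(color="blue")) should work, but it doesn't: inside aes(), "blue" is treated as a data value to be mapped through the color scale, not as the color itself.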
  • Since we don't want to create a mapping to a specific variable, we can move the color specification outside of the aes() function, like this: geom_line(color="blue").
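
  • A short sketch contrasting the two forms, assuming the gapminder and ggplot2 packages are installed. The names base, mapped and fixed are illustrative, and group is used as ggplot2's documented grouping aesthetic.

```r
library(gapminder)
library(ggplot2)

base <- ggplot(data = gapminder,
               mapping = aes(x = year, y = lifeExp, group = country))

# Inside aes(): "blue" is a constant data value sent through the color scale.
mapped <- base + geom_line(mapping = aes(color = "blue"))

# Outside aes(): "blue" is used directly as the line color.
fixed <- base + geom_line(color = "blue")

# Compare the colors that each layer will actually draw with.
unique(layer_data(mapped)$colour)
unique(layer_data(fixed)$colour)
```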

Layers

  • Adding a geom_line(color="blue") layer to the plot from the earlier demonstration draws blue lines on top of the continent-colored ones:
ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, by=country)) +
  geom_line(mapping = aes(color=continent)) + geom_line(color="blue")

Transformations and statistics in ggplot2

  • ggplot2 also makes it easy to overlay statistical models over the data – to demonstrate we'll go back to our first example:
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

  • Currently it's hard to see the relationship between the points due to some strong outliers in GDP per capita.

Transformations and statistics in ggplot2

  • We can change the scale of units on the x axis using the scale functions.

  • These control the mapping between the data values and visual values of an aesthetic.

  • We can also modify the transparency of the points using the alpha argument, which is especially helpful when you have a large amount of data that is tightly clustered.

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) + scale_x_log10()

Transformations and statistics in ggplot2

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) + scale_x_log10()

  • The scale_x_log10 function applied a log10 transformation to the values of the gdpPercap column before rendering them on the plot, so that each factor of 10 now corresponds to an increase of 1 on the transformed scale,

    • e.g. a GDP per capita of 1,000 is plotted at position 3 on the transformed x axis, a value of 10,000 at position 4, and so on; the tick labels still show the original GDP values.
  • This makes it easier to visualize the spread of data on the x-axis.
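
  • The arithmetic behind the transformed scale can be checked directly. This is a sketch assuming the gapminder and ggplot2 packages are installed; p_log and positions are illustrative names.

```r
library(gapminder)
library(ggplot2)

# Each factor of 10 in GDP per capita is one unit on the log10 scale.
positions <- log10(c(1000, 10000, 100000))
positions

p_log <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) + scale_x_log10()

# The x positions used for drawing are the log10-transformed values.
range(layer_data(p_log)$x)
range(log10(gapminder$gdpPercap))
```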

Fitting a simple regression

  • We can fit a simple relationship to the data by adding another layer, geom_smooth:
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point() + scale_x_log10() + geom_smooth(method="lm")

  • The method="lm" argument tells geom_smooth to fit the line with lm, the standard “linear model” regression function built into R.

  • The regression line is meant to describe the trend between an increase in income (in a log-10 scale) and the associated increase in life expectancy.

Fitting a simple regression

  • As a quick preview for the midterm, the lm function can be used directly as follows:
my_linear_model_object <- lm(lifeExp ~ log10(gdpPercap), data=gapminder)
  • Notice that in the above expression we use the special ~ operator to build a formula: the response variable goes on the left of the tilde and the predictors on the right.

  • This is meant to represent the formula

\[ Y_{\text{Life expectancy}} = \beta_0 + \beta_1 \log_{10}\left( X_{\text{GDP per capita}} \right) + \epsilon \]

  • The linear model is an object that has certain built-in methods that are standard in regression analysis.

    • Specifically, it automatically performs or knows how to perform much of the mathematical analysis we should run on the above equation.
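
  • A brief sketch of some standard methods that work on an lm object, using base R functions (coef, confint, predict). The model is refit here so the block is self-contained; new_gdp and predicted are illustrative names.

```r
library(gapminder)

my_linear_model_object <- lm(lifeExp ~ log10(gdpPercap), data = gapminder)

# Estimated intercept (beta_0) and slope (beta_1) of the formula above.
estimates <- coef(my_linear_model_object)
estimates

# 95% confidence intervals for both coefficients.
confint(my_linear_model_object)

# Predicted life expectancy at two GDP per capita values a factor of 10 apart;
# the difference between the two predictions equals the slope.
new_gdp <- data.frame(gdpPercap = c(1000, 10000))
predicted <- predict(my_linear_model_object, newdata = new_gdp)
predicted
```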

Fitting a simple regression

  • To learn a bit about the model and the statistical relationship we are studying, we can use the summary command:
summary(my_linear_model_object)

Call:
lm(formula = lifeExp ~ log10(gdpPercap), data = gapminder)

Residuals:
    Min      1Q  Median      3Q     Max 
-32.778  -4.204   1.212   4.658  19.285 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -9.1009     1.2277  -7.413 1.93e-13 ***
log10(gdpPercap)  19.3534     0.3425  56.500  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.62 on 1702 degrees of freedom
Multiple R-squared:  0.6522,    Adjusted R-squared:  0.652 
F-statistic:  3192 on 1 and 1702 DF,  p-value: < 2.2e-16
  • Notice that the Call line in the above summary shows our formula passed as the named argument formula.

    • Various pieces of analysis used for the midterm can be derived from the above information.
    • For example, we can see the p-value for the predictor log10(gdpPercap) and the \( R^2 \) value above.
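
  • The same quantities can be pulled out of the summary programmatically instead of being read off the printout. This sketch refits the model so that it runs on its own; model_summary, r_squared and p_value are illustrative names.

```r
library(gapminder)

my_linear_model_object <- lm(lifeExp ~ log10(gdpPercap), data = gapminder)
model_summary <- summary(my_linear_model_object)

# The coefficient table is a matrix with one row per term.
model_summary$coefficients

# R-squared and the p-value for the log10(gdpPercap) predictor.
r_squared <- model_summary$r.squared
p_value <- model_summary$coefficients["log10(gdpPercap)", "Pr(>|t|)"]
r_squared
p_value
```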

Setting additional plotting parameters

  • Back in the plot, we can make the line thicker by setting the size aesthetic in the geom_smooth layer:
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point() + scale_x_log10() + geom_smooth(method="lm", size=1.5)

  • There are two ways to specify an aesthetic. Here we set the size aesthetic to a fixed value by passing it as an argument to geom_smooth.

  • Previously in the lesson we used the aes function to map an aesthetic to a variable in the data.
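
  • In ggplot2 3.4.0 and later, line thickness is controlled by the linewidth aesthetic, and using size for lines gives a deprecation warning. The sketch below assumes such a version is installed; p_thick is an illustrative name.

```r
library(gapminder)
library(ggplot2)

# Same plot as above, with the line thickness set through linewidth
# (assumes ggplot2 >= 3.4.0).
p_thick <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point() + scale_x_log10() +
  geom_smooth(method = "lm", linewidth = 1.5)

# The smooth is the second layer; every piece of it uses the value we set.
unique(layer_data(p_thick, 2)$linewidth)
```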

Multi-panel figures

  • Earlier we visualized the change in life expectancy over time across all countries in one plot, but we can split this out over multiple panels by adding a layer of facet panels.

  • We start by making a subset of data including only countries located in the Americas.

  • This includes 25 countries, which will begin to clutter the figure.

americas <- gapminder[gapminder$continent == "Americas", ]
ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
  geom_line() + facet_wrap(~ country)

  • The facet_wrap layer also takes a “formula” as its argument, denoted by the tilde (~), telling R to draw a panel for each unique value in the country column of the americas subset.
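
  • A quick check that the faceting produces one panel per country in the subset, assuming the gapminder and ggplot2 packages are installed. The names p_facets, n_countries and n_panels are illustrative.

```r
library(gapminder)
library(ggplot2)

americas <- gapminder[gapminder$continent == "Americas", ]

# Number of distinct countries left after subsetting.
n_countries <- length(unique(americas$country))

p_facets <- ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
  geom_line() + facet_wrap(~ country)

# Each row of the layer data is assigned to a panel; count the panels used.
n_panels <- length(unique(layer_data(p_facets)$PANEL))
c(countries = n_countries, panels = n_panels)
```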