Writing functions in R

Outline

  • The following topics will be covered in this lecture:
    • What are functions
    • How to write functions
    • Combining functions
    • Defensive programming practices
    • Good programming habits

An introduction to functions

  • If we only had one data set to analyze, it would probably be faster to load the file into a spreadsheet and use that to plot simple statistics.

  • However, the gapminder data is updated periodically, and we may want to pull in that new information later and re-run our analysis again.

  • We may also obtain similar data from a different source in the future.

  • In this lesson, we'll learn how to write a function so that we can repeat several operations with a single command.

What is a function?

  • Functions gather a sequence of operations into a whole, preserving it for ongoing use.
  • Functions provide:
    1. a name we can remember and invoke it by;
    2. relief from the need to remember the individual operations;
    3. a defined set of inputs and expected outputs;
    4. rich connections to the larger programming environment; and
    5. a more maintainable use of our code, when we can re-use and re-apply the same methods without the clutter and possible bugs of re-writing the same method every time.
  • As the basic building block of most programming languages, user-defined functions constitute “programming” as much as any single abstraction can.
  • If you have written a function, you are a computer programmer.

Defining a function

  • Let's define a function fahr_to_kelvin() that converts temperatures from Fahrenheit to Kelvin:
fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}
  • We define fahr_to_kelvin() by assigning the output of function to the name fahr_to_kelvin.

  • The list of argument names is contained within parentheses.

  • The body of the function (the statements that are executed when it runs) is contained within curly braces ({}).

  • The statements in the body are indented by two spaces. This makes the code easier to read but does not affect how the code operates.

What are functions?

fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}
  • It is useful to think of creating functions like writing a cookbook.

  • First you define the “ingredients” that your function needs.

  • In this case, we only need one ingredient to use our function: “temp”.

  • After we list our ingredients, we then say what we will do with them; in this case, we take our ingredient and apply a set of mathematical operators to it.

  • When we call the function, the values we pass to it as arguments are assigned to those variables so that we can use them inside the function.

  • Inside the function, we use a return statement to send a result back to whoever asked for it.
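
  • As a minimal sketch of how arguments are matched, the value can be passed by position or by name; either way it is bound to temp inside the function. The names by_position and by_name are only illustrative.

```r
fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}

# The value 32 is matched to the argument temp by position ...
by_position <- fahr_to_kelvin(32)
# ... or explicitly by name; both calls are equivalent.
by_name <- fahr_to_kelvin(temp = 32)

identical(by_position, by_name)
```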

Returns

  • One feature of R that differs from many other languages is that the return statement is not required.
    • R automatically returns the value of the last expression evaluated in the body of the function.
  • But for clarity, it is often good practice to write the return statement explicitly.
  • An explicit return() can also be used for control flow, to exit the function at an arbitrary line instead of running to the last line;
    • using an if() statement, we can exit the function early and return a value based on a condition.
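  • Both behaviours can be sketched as follows. The function names fahr_to_kelvin_implicit() and describe_temp(), and the labels the latter returns, are hypothetical and chosen only for illustration.

```r
# Implicit return: the value of the last evaluated expression is returned.
fahr_to_kelvin_implicit <- function(temp) {
  ((temp - 32) * (5 / 9)) + 273.15
}

# Early return: leave the function as soon as a condition is met.
# (describe_temp and its labels are hypothetical, for illustration only.)
describe_temp <- function(temp_f) {
  if (temp_f <= 32) {
    return("freezing")
  }
  # Only reached when the condition above was FALSE.
  "above freezing"
}

fahr_to_kelvin_implicit(32)
describe_temp(20)
describe_temp(70)
```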

Running functions

  • Calling our own function is no different from calling any other function:
# freezing point of water
fahr_to_kelvin(32)
[1] 273.15
# boiling point of water
fahr_to_kelvin(212)
[1] 373.15
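  • Because arithmetic operators in R work element-wise on vectors, the same function should also accept several temperatures at once. A short sketch (temps_f and temps_k are illustrative names):

```r
fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}

# Freezing and boiling point of water in a single call
temps_f <- c(32, 212)
temps_k <- fahr_to_kelvin(temps_f)
temps_k
```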

Combining functions

  • The real power of functions comes from mixing, matching and combining them into ever-larger chunks to get the effect we want.

  • Let's define two functions that will convert temperature from Fahrenheit to Kelvin, and Kelvin to Celsius:

fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}

kelvin_to_celsius <- function(temp) {
  celsius <- temp - 273.15
  return(celsius)
}
  • Q: how can we define a function to convert directly from Fahrenheit to Celsius by reusing these two functions above?

  • A: consider the following code

fahr_to_celsius <- function(temp) {
  temp_k <- fahr_to_kelvin(temp)
  result <- kelvin_to_celsius(temp_k)
  return(result)
}
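  • As a usage sketch, the combined function can be checked against the reference points of water. The nested variant fahr_to_celsius_nested() is a hypothetical alternative that skips the intermediate variables; it is not part of the lesson's code.

```r
fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}

kelvin_to_celsius <- function(temp) {
  celsius <- temp - 273.15
  return(celsius)
}

fahr_to_celsius <- function(temp) {
  temp_k <- fahr_to_kelvin(temp)
  result <- kelvin_to_celsius(temp_k)
  return(result)
}

# Equivalent composition written as a nested call (hypothetical variant)
fahr_to_celsius_nested <- function(temp) {
  kelvin_to_celsius(fahr_to_kelvin(temp))
}

fahr_to_celsius(32)          # freezing point of water in Celsius
fahr_to_celsius_nested(212)  # boiling point of water in Celsius
```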

Interlude: Defensive Programming

  • Writing functions provides an efficient way to make R code re-usable and modular;
    • however, it is important to ensure that functions are only used in their intended use-cases.
  • Checking function parameters is related to the concept of defensive programming.
    • Defensive programming encourages us to frequently check conditions and throw an error if something is wrong.
  • These checks are referred to as assertion statements because we want to assert some condition is TRUE before proceeding.
    • They make it easier to debug because they give us a better idea of where the errors originate.

Checking conditions

  • Let's start by re-examining fahr_to_kelvin():
fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}
  • For this function to work as intended, the argument temp must be a numeric value;
    • otherwise, the mathematical procedure for converting between the two temperature scales will not work.
  • To create an error, we can use the function stop().
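  • To see why a check is worthwhile, here is a sketch of what the unchecked function does with non-numeric input. The variable names msg and res are illustrative.

```r
fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}

# A character string triggers an error from the arithmetic itself; the
# message does not say which argument of which function was wrong.
msg <- tryCatch(
  fahr_to_kelvin("32"),
  error = function(e) conditionMessage(e)
)
msg

# A factor does not stop the function at all: R only warns and returns NA.
res <- suppressWarnings(fahr_to_kelvin(as.factor(32)))
res
```

  • The second case is the more dangerous one, because a missing value can travel silently through the rest of an analysis.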

Checking conditions with if()

  • Since the argument temp must be a numeric vector, we could check for this condition with an if statement and throw an error if the condition was violated. We could augment our function above like so:
fahr_to_kelvin <- function(temp) {
  if (!is.numeric(temp)) {
    stop("temp must be a numeric vector.")
  }
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}
  • If we had multiple conditions or arguments to check, it would take many lines of code to check all of them.
  • Luckily R provides the convenience function stopifnot().
  • We can list as many requirements as we like, each of which should evaluate to TRUE;
    • stopifnot() throws an error if it finds one that is FALSE.
  • Listing these conditions also serves a secondary purpose as extra documentation for the function.
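  • A sketch of stopifnot() with several conditions at once. The function name fahr_to_kelvin_strict() and the two extra requirements (no missing values, nothing below absolute zero) are assumptions added for illustration; they are not part of the lesson's function.

```r
# Hypothetical variant with several assertions; the extra conditions
# (no missing values, nothing below absolute zero) are assumptions.
fahr_to_kelvin_strict <- function(temp) {
  stopifnot(
    is.numeric(temp),
    !anyNA(temp),
    all(temp >= -459.67)  # absolute zero in degrees Fahrenheit
  )
  ((temp - 32) * (5 / 9)) + 273.15
}

fahr_to_kelvin_strict(32)

# Conditions are checked in order; the first one that fails raises the error.
tryCatch(
  fahr_to_kelvin_strict(-500),
  error = function(e) conditionMessage(e)
)
```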

Checking conditions with stopifnot()

  • Let's try out defensive programming with stopifnot() by adding assertions to check the input to our function fahr_to_kelvin().

  • We want to assert the following: temp is a numeric vector.

fahr_to_kelvin <- function(temp) {
  stopifnot(is.numeric(temp))
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}

Checking conditions with stopifnot()

  • Our function still works when given proper input.
# freezing point of water
fahr_to_kelvin(temp = 32)
[1] 273.15
  • But fails instantly if given improper input.
# temp is a factor instead of numeric
fahr_to_kelvin(temp = as.factor(32))
Error in fahr_to_kelvin(temp = as.factor(32)): is.numeric(temp) is not TRUE

More on combining functions

  • Now, we're going to define a function that calculates the Gross Domestic Product of a nation from the data available in our dataset:
# Takes a dataset and multiplies the population column
# with the GDP per capita column.
calcGDP <- function(dat) {
  gdp <- dat$pop * dat$gdpPercap
  return(gdp)
}
  • We define calcGDP() by assigning the output of function to the name calcGDP.
  • The list of argument names is contained within parentheses.
  • Next, the body of the function is contained within curly braces ({}).
  • We've indented the statements in the body by two spaces.
  • This makes the code easier to read but does not affect how it operates.

More on combining functions

  • When we call the function, the values we pass to it are assigned to the arguments, which become variables inside the body of the function.
library(gapminder)
calcGDP(head(gapminder))
[1]  6567086330  7585448670  8758855797  9648014150  9678553274 11697659231
  • That's not very informative.
  • Let's add some more arguments so we can extract that per year and country…

More on combining functions

calcGDP <- function(dat, year = NULL, country = NULL) {
  if (!is.null(year)) {
    dat <- dat[dat$year %in% year, ]
  }
  if (!is.null(country)) {
    dat <- dat[dat$country %in% country, ]
  }
  gdp <- dat$pop * dat$gdpPercap

  new <- cbind(dat, gdp=gdp)
  return(new)
}
  • The function now subsets the provided data by year if the year argument isn't empty, then subsets the result by country if the country argument isn't empty.
  • Then it calculates the GDP for whatever subset emerges from the previous two steps.
  • The function then adds the GDP as a new column to the subsetted data and returns this as the final result.
  • You can see that the output is much more informative than a vector of numbers.
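  • The subsetting logic can also be tried without the gapminder package. The sketch below uses a small data frame named toy whose countries and numbers are made up purely for illustration.

```r
calcGDP <- function(dat, year = NULL, country = NULL) {
  if (!is.null(year)) {
    dat <- dat[dat$year %in% year, ]
  }
  if (!is.null(country)) {
    dat <- dat[dat$country %in% country, ]
  }
  gdp <- dat$pop * dat$gdpPercap

  new <- cbind(dat, gdp = gdp)
  return(new)
}

# Made-up numbers, used only to illustrate the subsetting logic.
toy <- data.frame(
  country   = c("A", "A", "B", "B"),
  year      = c(2002, 2007, 2002, 2007),
  pop       = c(100, 110, 200, 220),
  gdpPercap = c(10, 20, 30, 40)
)

calcGDP(toy)                              # no filter: all four rows
calcGDP(toy, year = 2007)                 # two rows, one per country
calcGDP(toy, year = 2007, country = "B")  # a single row
```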

More on combining functions

  • Let's take a look at what happens when we specify the year:
head(calcGDP(gapminder, year=2007))
      country continent year lifeExp      pop  gdpPercap          gdp
1 Afghanistan      Asia 2007  43.828 31889923   974.5803  31079291949
2     Albania    Europe 2007  76.423  3600523  5937.0295  21376411360
3     Algeria    Africa 2007  72.301 33333216  6223.3675 207444851958
4      Angola    Africa 2007  42.731 12420476  4797.2313  59583895818
5   Argentina  Americas 2007  75.320 40301927 12779.3796 515033625357
6   Australia   Oceania 2007  81.235 20434176 34435.3674 703658358894
  • Or for a specific country:
calcGDP(gapminder, country="Australia")
     country continent year lifeExp      pop gdpPercap          gdp
1  Australia   Oceania 1952  69.120  8691212  10039.60  87256254102
2  Australia   Oceania 1957  70.330  9712569  10949.65 106349227169
3  Australia   Oceania 1962  70.930 10794968  12217.23 131884573002
4  Australia   Oceania 1967  71.100 11872264  14526.12 172457986742
5  Australia   Oceania 1972  71.930 13177000  16788.63 221223770658
6  Australia   Oceania 1977  73.490 14074100  18334.20 258037329175
7  Australia   Oceania 1982  74.740 15184200  19477.01 295742804309
8  Australia   Oceania 1987  76.320 16257249  21888.89 355853119294
9  Australia   Oceania 1992  77.560 17481977  23424.77 409511234952
10 Australia   Oceania 1997  78.830 18565243  26997.94 501223252921
11 Australia   Oceania 2002  80.370 19546792  30687.75 599847158654
12 Australia   Oceania 2007  81.235 20434176  34435.37 703658358894

More on combining functions

  • Or both:
calcGDP(gapminder, year=2007, country="Australia")
    country continent year lifeExp      pop gdpPercap          gdp
1 Australia   Oceania 2007  81.235 20434176  34435.37 703658358894

Pass by value

  • Functions in R almost always behave as though they work on a copy of the data passed to them.
  • When we modify dat inside the function we are modifying the copy of the gapminder dataset stored in dat, not the original variable we gave as the first argument.

  • This is called “pass-by-value” and it makes writing code much safer:

    • you can always be sure that whatever changes you make within the body of the function stay inside the body of the function.
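
  • A minimal sketch of pass-by-value. The helper add_flag() and the data frame original are hypothetical names used only for this illustration.

```r
# add_flag() is a hypothetical helper that modifies its argument.
add_flag <- function(dat) {
  dat$flag <- TRUE  # changes only the function's local copy
  dat
}

original <- data.frame(x = 1:3)
modified <- add_flag(original)

names(original)  # the original data frame still has only the column x
names(modified)  # the returned copy has the extra column flag
```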

Function scope

  • The idea of pass-by-value is related to the concept of scoping:
    • any variables (or functions!) you create or modify inside the body of a function only exist for the lifetime of the function's execution.
  • When we call calcGDP(), the variables dat, gdp and new only exist inside the body of the function.
  • Even if we have variables of the same name in our interactive R session, they are not modified in any way when executing a function.
  • Recall the final lines of calcGDP():
  gdp <- dat$pop * dat$gdpPercap
  new <- cbind(dat, gdp=gdp)
  return(new)
}
  • Here, we calculate the GDP on our new subset and create a new data frame with that column added; gdp and new disappear once the function returns.
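  • Scoping can be sketched in a few lines. The function calc_total() and its argument names are hypothetical; the point is only that the gdp created inside the function does not touch the gdp defined outside it.

```r
# A variable named gdp in the interactive session
gdp <- "defined outside the function"

# calc_total() is a hypothetical example function.
calc_total <- function(pop, gdp_per_cap) {
  gdp <- pop * gdp_per_cap  # local variable; hides the outer gdp
  gdp
}

calc_total(10, 5)
gdp  # still the original string
```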

Testing and documenting code

  • It’s important to both test functions and document them.
  • Documentation helps you, and others, understand:
    1. what the purpose of your function is; and
    2. how to use it.
  • Testing makes sure that your function actually does what you think it does.

Testing and documenting code

  • When you first start out, your workflow will probably look a lot like this:
    1. Write a function
    2. Comment parts of the function to document its intended behaviour
    3. Load in the source file
    4. Experiment with it in the console to make sure it behaves as you expect
    5. Make any necessary bug fixes
    6. Repeat the process until you have documented, tested and benchmarked the code and not found any bugs.
  • Formal documentation for functions, written in separate .Rd files, gets turned into the documentation you see in help files.
  • The roxygen2 package allows R coders to write documentation alongside the function code and then process it into the appropriate .Rd files.
  • You will want to switch to this more formal method of writing documentation when you start writing more complicated R projects.
  • Formal automated tests can be written using the testthat package.
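  • As a rough sketch of what this more formal style might look like, the function below carries roxygen2-style comments (lines starting with #') and is followed by a small testthat test. The wording of the documentation and the choice of test cases are assumptions; the test is skipped if testthat is not installed.

```r
#' Convert temperatures from Fahrenheit to Kelvin
#'
#' @param temp A numeric vector of temperatures in degrees Fahrenheit.
#' @return A numeric vector of temperatures in Kelvin.
#' @examples
#' fahr_to_kelvin(32)
fahr_to_kelvin <- function(temp) {
  stopifnot(is.numeric(temp))
  ((temp - 32) * (5 / 9)) + 273.15
}

# Run the automated test only if testthat is available.
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("fahr_to_kelvin converts known reference points", {
    testthat::expect_equal(fahr_to_kelvin(32), 273.15)
    testthat::expect_equal(fahr_to_kelvin(212), 373.15)
    testthat::expect_error(fahr_to_kelvin("32"))
  })
}
```

  • The #' lines are ordinary comments to R itself, so the file still runs as a plain script; they only become .Rd files when processed by roxygen2 inside a package.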

Tips and future reading

  • R has some unique aspects that can be exploited when performing more complicated operations.
  • We will not be writing anything that requires knowledge of these more advanced concepts.
  • In the future, when you are comfortable writing functions in R, you can learn more by reading the R Language Manual or Advanced R by Hadley Wickham.