An introduction to programming in R -- Part 1

08/26/2020


FAIR USE ACT DISCLAIMER:
This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. Under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.

Outline

  • The following topics will be covered in this lecture:

    • What is R and RStudio
    • How to install packages and get help
    • How to use R as a calculator
    • Variables and data types

Introduction

  • This course will lean heavily on programming;

    • while it is possible to perform statistical analysis by hand for some very simple problems, any realistic problem solving must be done on a computer.
  • This course does not assume that you are already familiar with programming;

    • this course will also not require a deep knowledge of programming or computer science.
    • However, everyone is responsible for learning enough R to become proficient with standard modeling and plotting functions.
  • Students are encouraged to use the lessons in Software Carpentry as a free reference for scientific programming in R.

What is R?

  • There are a number of common choices of programming/scripting languages for performing statistical modeling, e.g.:
    • SAS
    • SPSS
    • STATA
    • Python
    • R
  • We will use R for the following reasons:
    1. it is free and open source software with extensive documentation and tutorials available;
    2. it has well established libraries for statistical modeling with a wide functionality;
    3. the “faraway” package has extensive educational examples available for running our analyses;
    4. there are free interactive, introductory lessons from DataCamp which will be used for the first homework.

RStudio

Figure: view of the RStudio development environment.
  • There is also a commonly used and supported integrated development environment for R, “RStudio”.
  • RStudio is highly recommended for all beginning programmers;
    • this is not the same thing as R, but a set of graphical tools to quickly write and develop code.
  • The figure on the left shows the RStudio environment as a collection of different windows.
  • In the left-most window is the console, where an interactive session of R is taking place.
  • R can be used as an “interactive” language, in which an interpreter accepts commands and returns a response in real time.
  • R can also be used as a “scripting” language, in which a script, i.e., a set of instructions, is given to R to perform and the output is directed according to the script.

RStudio – continued

  • In the following we will go through a tour of RStudio.

  • To follow along with this video tutorial, you need to download both R and RStudio.

  • As a prerequisite to this course, you need to have access to a computer where you can use R and RStudio, as well as download data and install packages.

  • An essential element of this class is to use modern statistical software to solve real-world problems.

  • In our activities in class, and in the project assignments, you will be expected to exercise basic skills in R for statistical analysis, including documenting your work.

    • One of the primary means to blend your research documentation together with your R code is with the use of RMarkdown files.
  • We will now begin the tutorial in RStudio.

Installing packages

Figure: the CRAN main webpage.
  • The strength of R as a language comes from the variety of packages/libraries that are available for use.
  • These libraries are mostly written by statistical scientists for free and public use in academic settings;
    • Note: some libraries have restrictions on use for commercial purposes.
  • These libraries, as with the current and development versions of the R language, are hosted by the CRAN project.
  • We also note, because this is a community repository, not all software is built to the same quality or with standard conventions.
  • However, we will mostly use what have become “standard” libraries, which are well maintained and widely accepted and supported by the community.

Installing packages – continued

  • We will often use the “faraway” package, which contains many example data sets to study; to install it, we can simply type:
install.packages('faraway')
  • The “install.packages()” function will initiate an installation of the library with the package manager.

    • This will connect your installation of R directly with CRAN, and handle all dependencies, so you don't need to do anything else.
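  • When a library has already been installed and we want to use it in our environment, we load it with “library()”, which stops with an error if the package is missing (“require()” would only return FALSE with a warning):

library(faraway)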
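  • As a minimal sketch, one way to check whether a package is available before loading it is shown below. The helper name package_available is hypothetical and is not part of R or of the course materials; requireNamespace() is a base R function.

```r
# Hypothetical helper: TRUE if the package can be loaded, FALSE otherwise.
package_available <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with every R installation, so this returns TRUE.
package_available("stats")

# Only point to the installation step when faraway is actually missing
# (installing requires an internet connection).
if (!package_available("faraway")) {
  message("faraway is not installed; run install.packages('faraway')")
}
```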

Getting help in RStudio

  • Whenever you are uncertain about the use of a function or a topic in general, you can use the “?” command in R to obtain a help file.
?install.packages
  • If you’re not sure which package a function is in, or how exactly it is spelled, you can do a fuzzy search:
??install.packages
  • This will pull up related documentation and help pages in a search format.
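  • The “?” and “??” commands are shorthand for the functions help() and help.search(). The sketch below shows these equivalents as comments, so that no help window opens when it is run, together with two other base R functions that can be handy when only part of a name is remembered. The variable name matches is only for illustration.

```r
# help("install.packages")         # same as ?install.packages
# help.search("install.packages")  # same as ??install.packages

# apropos() lists objects on the search path whose names match a pattern.
matches <- apropos("^install")
matches

# args() shows the arguments of a function without opening its help page.
args(log)
```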

R as a calculator

  • R accepts a set of human-readable instructions and translates these into operations that the computer carries out.

  • R can be used simply as a powerful calculator, for example:

    • if we enter a mathematical expression into an R console, R evaluates it and prints the result,
1 + 1
[1] 2

R as a calculator – continued

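  • R uses standard mathematical notation for its operations and follows the standard mathematical order of precedence; multiplication and division share one level, as do addition and subtraction, and operators on the same level are evaluated from left to right: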

  • Parentheses

(1 + 1)
[1] 2
  • Exponents
(1 + 1)^2
[1] 4
  • Division
(1 + 1)^2 / 4
[1] 1

R as a calculator – continued

  • Multiplication
(1 + 1)^2 / 4 * 3
[1] 3
  • Addition
(1 + 1)^2 / 4 * 3 + 1
[1] 4
  • Subtraction
(1 + 1)^2 / 4 * 3 + 1 - 2
[1] 2
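  • A few cases where the order of evaluation is easy to get wrong are sketched below. The variable names are only for illustration, and the comments give the value each expression evaluates to.

```r
# Multiplication and division share one precedence level and are
# evaluated left to right, so this is (8 / 4) * 2, not 8 / (4 * 2).
left_to_right <- 8 / 4 * 2     # 4
grouped       <- 8 / (4 * 2)   # 1

# Exponentiation binds more tightly than the unary minus sign.
neg_square    <- -2^2          # -4
square_of_neg <- (-2)^2        # 4

# Repeated exponentiation is evaluated right to left: 2^(3^2) = 2^9.
tower <- 2^3^2                 # 512
```

  • When in doubt, adding parentheses makes the intended order explicit.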

R as a calculator – continued

  • R also has many standard built-in mathematical functions and variables, e.g.,
log(1)
[1] 0
cos(pi)
[1] -1
sin(pi)
[1] 1.224647e-16
  • The notation “ae-16” refers to the mathematical expression \( a \times 10^{-16} \), where \( a \) is the leading coefficient.

  • Notice that R does not return exactly zero for \( \sin(\pi) \), as it is mathematically, but instead an extremely small number.

  • This has to do with the way in which numbers are encoded into programming languages – this will be discussed further shortly.

typeof(sin(pi))
[1] "double"
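  • As a rough illustration of the scale of this error, the sketch below compares sin(pi) with the machine epsilon of double precision, which R stores in the built-in list .Machine. The variable name eps is only for illustration.

```r
# Smallest positive number x for which 1 + x != 1 in double precision.
eps <- .Machine$double.eps
eps

# sin(pi) is not exactly zero, but it is smaller than machine epsilon.
abs(sin(pi)) < eps

# Printing more digits shows that pi itself is stored only approximately.
print(pi, digits = 17)
```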

Comparing things

  • Not all values in the computing language are numeric, and not all numerical values are built the same.

  • Consider the comparison operator “==” for evaluating if two inputs are the same,

sin(pi) == 0
[1] FALSE
0 == 0
[1] TRUE
  • This shows one of the dangers of trusting computer arithmetic to be exact: because sin(pi) is a floating point, double precision approximation, the comparison operator doesn't recognize it as equal to zero.

  • To compare two numeric values in R while allowing for this rounding error, the better approach is to use

all.equal(sin(pi), 0)
[1] TRUE
  • This function will take into account the finite-precision error in representing the true number.
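  • One caveat, sketched below with illustrative variable names: when the values differ, all.equal() returns a character description of the difference rather than FALSE, so it is usually wrapped in isTRUE() when a single logical value is needed.

```r
# all.equal() returns TRUE when the values agree within a tolerance ...
close_enough <- all.equal(0.1 + 0.2, 0.3)

# ... but a character description (not FALSE) when they do not.
not_close <- all.equal(1, 1.1)

# Wrapping the call in isTRUE() always gives a single TRUE or FALSE.
isTRUE(close_enough)   # TRUE
isTRUE(not_close)      # FALSE

# The exact comparison fails because of rounding in the last digits.
0.1 + 0.2 == 0.3       # FALSE
```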

Comparing things – continued

  • We can also compare if two inputs are not the same,
1 != 2
[1] TRUE
  • Also for example,
sin(pi) != 0
[1] TRUE

Comparing things – continued

  • Notice that the outputs of the earlier comparisons are either “TRUE” or “FALSE” – these are examples of logical values, which are the output of logical expressions.
typeof(TRUE)
[1] "logical"
  • We can also compare the relative size of different values
1 > 2
[1] FALSE
2 >= 2
[1] TRUE
-1 <= 0
[1] TRUE

Variables and assignment

  • Values such as the output of different expressions can be assigned a variable name,
my_variable <- 2 + 2
  • In the above expression, the operator “<-” tells R to associate the output of the expression \( 2 + 2 \) with “my_variable”.
my_variable
[1] 4
  • We can show the current variables in the environment using the command “ls()”
ls()
[1] "my_variable"

Variables and assignment – continued

  • We can re-assign a value to “my_variable” which will be stored in the environment and memory,
my_variable <- my_variable + my_variable
my_variable
[1] 8
  • Notice that the right hand side of the assignment operator “<-” is always evaluated first, then the assignment is given.

    • In this case, as above, we can define the new value of a variable in terms of its current value.
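  • The evaluation order can be traced step by step in a small sketch; the variable name counter is only for illustration.

```r
counter <- 1              # counter holds 1
counter <- counter + 1    # the right side gives 2, then 2 is stored
counter <- counter * 10   # the right side gives 20, then 20 is stored
counter

# exists() checks whether a name is currently defined.
exists("counter")
```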

Variables and assignment – continued

  • Key to writing “good” code is to use good variable naming (and commenting).

    • Generally, it is preferable to name variables with something descriptive, e.g.,
mean_sea_surface_temp <- 10
  • For longer names as above, we can use e.g.,

    • underscores;
    • periods; or
mean.sea.surface.temp <- 10
    • capital letters.
meanSeaSurfaceTemp <- 10
  • All the above are commonly used conventions and all are acceptable — the key is to be clear and consistent in your code.

Vectorization

  • R is a vectorized language, meaning that variables and functions can have vectors as values.

  • A vector in R is an ordered collection of values that all have the same data type.

    • The type of data will become increasingly important as we start using vectors.
  • A simple way to construct a vector is with the constructor function “c()”

c(1, 3, 6)
[1] 1 3 6

Vectorization – continued

  • The function takes an arbitrary number of elements as above, and creates a vector.
my_variable <- c(TRUE, pi)
my_variable
[1] 1.000000 3.141593
  • Notice that the output of the above expression looks different from the input — this is because R forces vectors to have data of a single type:
typeof(my_variable)
[1] "double"
  • Here, the value “TRUE” has been forced into its numeric counterpart “1”.
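  • When types are mixed, c() converts every element to the most general type present, in the order logical, integer, double, character. A short sketch, where the variable name mixed is only for illustration:

```r
typeof(c(TRUE, FALSE))   # "logical"
typeof(c(TRUE, 2L))      # "integer" (2L is the integer 2)
typeof(c(2L, 3.5))       # "double"
typeof(c(3.5, "a"))      # "character"

# With all four types present, everything becomes a character string.
mixed <- c(TRUE, 2L, 3.5, "a")
mixed                    # "TRUE" "2" "3.5" "a"
```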

Vectorization – continued

  • In the last example, we saw that a logical value “TRUE” was forced into a numeric value by the constructor function.

  • This variable “coercion” occurs in various situations, and we need to be careful with the results.

  • Q: what do you expect the result of the following to be?

1 == TRUE
  • A:
1 == TRUE
[1] TRUE
typeof(1)
[1] "double"
typeof(TRUE)
[1] "logical"
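  • The same coercion happens in arithmetic, which is sometimes used to count logical values. A small sketch follows; the vector flags is made up for illustration, and identical() is a base R function that compares the types as well as the values.

```r
1 == TRUE            # TRUE: TRUE is coerced to 1 before comparing
identical(1, TRUE)   # FALSE: the types "double" and "logical" differ
TRUE + TRUE          # 2

flags <- c(TRUE, FALSE, TRUE, TRUE)
sum(flags)           # 3: the number of TRUE entries
mean(flags)          # 0.75: the proportion of TRUE entries
```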

Vectorization – continued

  • Vectors store their data in a fixed order, so each entry can be accessed by its index, which starts at 1 in R:
my_variable[1]
[1] 1
my_variable[2]
[1] 3.141593
  • Mathematical operations can also be performed on vectors when their arguments accept vectors, and they can be applied element-wise on the vector entries:
sin(my_variable)
[1] 8.414710e-01 1.224647e-16
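  • A few more indexing and element-wise operations are sketched below on a made-up vector; the name temps and its values are only for illustration.

```r
temps <- c(10, 20, 30, 40)   # hypothetical example values

length(temps)           # 4
temps[length(temps)]    # 40: the last entry
temps[-1]               # 20 30 40: a negative index drops that entry
temps[c(1, 3)]          # 10 30

# Arithmetic is applied entry by entry.
temps * 2               # 20 40 60 80
temps + c(1, 2, 3, 4)   # 11 22 33 44
```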

Vectorization – continued

  • The colon operator constructs a vector automatically from a range of values, which is sometimes called a “slice”:
my_variable <- 1:5
my_variable
[1] 1 2 3 4 5
  • In general, the expression a:b returns the vector of values from a to b in steps of 1, counting down when a is larger than b:
10:5
[1] 10  9  8  7  6  5
4:10
[1]  4  5  6  7  8  9 10
  • This is often quite useful for extracting a subset of data from a large vector or matrix.
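  • For steps other than 1, the base R function seq() can be used instead of the colon operator. A brief sketch, where the variable names are only for illustration:

```r
typeof(1:5)                          # "integer"

quarters <- seq(0, 1, by = 0.25)     # 0.00 0.25 0.50 0.75 1.00
countdown <- seq(10, 0, by = -5)     # 10 5 0
thirds <- seq(0, 1, length.out = 3)  # 0.0 0.5 1.0

# A range can be used to extract part of a longer vector.
large <- seq(100, 1000, by = 100)
large[2:4]                           # 200 300 400
```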

Vectorization – continued

  • We can also combine a scalar with a vector in a mathematical operation, which is then applied element-wise to the entries of the vector
2^my_variable
[1]  2  4  8 16 32
  • Or use a vector as the index of a vector
my_variable[2:3]
[1] 2 3
  • The same holds for comparison operators, which are also applied element-wise.

  • Q: what do you expect to be the output of the following line?

1:10 > 5
  • A:
1:10 > 5
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
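  • A logical vector like this can be summarized with the base R functions which(), any() and all(). A short sketch, where the variable names are only for illustration:

```r
values <- 1:10
is_large <- values > 5

which(is_large)    # 6 7 8 9 10: the positions where the condition holds
any(values > 10)   # FALSE: no entry exceeds 10
all(values >= 1)   # TRUE: every entry is at least 1
```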

Vectorization – continued

  • Note that logical vectors are also useful for extracting subsets of data.

    • In particular, we can write a condition, evaluate it on the data, and extract all data points that satisfy it.
my_variable <- 1:10
my_index <- my_variable > 5
my_variable[my_index]
[1]  6  7  8  9 10
  • We might also have non-numeric vectors, such as
my_variable <- c('red', 'blue', 'green')
my_variable
[1] "red"   "blue"  "green"
  • For such a vector, a logical statement can also be quite useful,
my_index <- my_variable == 'red'
my_variable[my_index]
[1] "red"
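  • Conditions can also be combined or matched against several values at once. The sketch below uses the operators “&” (and), “|” (or) and “%in%”, which are not covered above; the vectors colors and values are made up for illustration.

```r
colors <- c('red', 'blue', 'green', 'red')

colors[colors != 'red']             # "blue" "green"
colors %in% c('red', 'green')       # TRUE FALSE TRUE TRUE

values <- 1:10
values[values > 3 & values <= 6]    # 4 5 6
values[values < 3 | values > 8]     # 1 2 9 10
```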