Subsetting data and vectorization part II

Instructions:

Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.

FAIR USE ACT DISCLAIMER:
This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. The Fair Use Copyright Disclaimer is under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.

Outline

  • The following topics will be covered in this lecture:

    • More on subsetting data
    • Dataframes
    • Vectorization

Factor subsetting

  • Now that we've explored the different ways to subset vectors, how do we subset the other data structures?

  • Factor subsetting works the same way as vector subsetting.

f <- factor(c("a", "a", "b", "c", "c", "d"))
f[f == "a"]
[1] a a
Levels: a b c d
f[f %in% c("b", "c")]
[1] b c c
Levels: a b c d
f[1:3]
[1] a a b
Levels: a b c d

Factor subsetting – continued

  • Skipping elements will not remove the level even if no more of that category exists in the factor:
f[-3]
[1] a a c c d
Levels: a b c d

Matrix subsetting

  • Matrices are also subsetted using the [ function. In this case it takes two arguments: the first applying to the rows, the second to its columns:
set.seed(1)
m <- matrix(rnorm(6*4), ncol=4, nrow=6)
m[3:4, c(3,1)]
            [,1]       [,2]
[1,]  1.12493092 -0.8356286
[2,] -0.04493361  1.5952808
  • You can leave the first or second arguments blank to retrieve all the rows or columns respectively:
m[, c(3,4)]
            [,1]        [,2]
[1,] -0.62124058  0.82122120
[2,] -2.21469989  0.59390132
[3,]  1.12493092  0.91897737
[4,] -0.04493361  0.78213630
[5,] -0.01619026  0.07456498
[6,]  0.94383621 -1.98935170

Matrix subsetting – continued

  • If we only access one row or column, R will automatically convert the result to a vector:
m[3,]
[1] -0.8356286  0.5757814  1.1249309  0.9189774
  • If you want to keep the output as a matrix, you need to specify a third argument; drop = FALSE:
m[3, , drop=FALSE]
           [,1]      [,2]     [,3]      [,4]
[1,] -0.8356286 0.5757814 1.124931 0.9189774

Matrix subsetting – continued

  • Unlike vectors, if we try to access a row or column outside of the matrix, R will throw an error:
m[, c(3,6)]
Error in m[, c(3, 6)]: subscript out of bounds
  • When dealing with multi-dimensional arrays, each argument to [ corresponds to a dimension. For example, a 3D array, the first three arguments correspond to the rows, columns, and depth dimension.

Matrix subsetting – continued

  • Because matrices are vectors, we can also subset using only one argument:
m[5]
[1] 0.3295078
  • This usually isn't useful, and often confusing to read. However it is useful to note that matrices are laid out in column-major format by default.

  • That is the elements of the vector are arranged column-wise:

matrix(1:6, nrow=2, ncol=3)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Matrix subsetting – continued

  • If you wish to populate the matrix by row, use byrow=TRUE:
matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
  • Matrices can also be subsetted using their rownames and column names instead of their row and column indices.

List subsetting

  • Now we'll introduce some new subsetting operators. There are three functions used to subset lists. We've already seen these when learning about atomic vectors and matrices: [, [[, and $.

  • Using [ will always return a list. If you want to subset a list, but not extract an element, then you will likely use [.

xlist <- list(a = "Software Carpentry", b = 1:10, data = head(iris))
xlist[1]
$a
[1] "Software Carpentry"
  • This returns a list with one element.

  • To extract individual elements of a list, you need to use the double-square bracket function: [[.

xlist[[1]]
[1] "Software Carpentry"

List subsetting – continued

  • You can't extract more than one element at once:
xlist[[1:2]]
Error in xlist[[1:2]]: subscript out of bounds
  • Nor use it to skip elements:
xlist[[-1]]
Error in xlist[[-1]]: invalid negative subscript in get1index <real>
  • But you can use names to both subset and extract elements:
xlist[["a"]]
[1] "Software Carpentry"

List subsetting – continued

  • The $ function is a shorthand way for extracting elements by name:
xlist$data
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Data frames

  • Remember the data frames are lists underneath the hood, so similar rules apply.

  • However they are also two dimensional objects:

    • [ with one argument will act the same way as for lists, where each list element corresponds to a column.
    • The resulting object will be a data frame:
require(gapminder)
head(gapminder[3])
# A tibble: 6 x 1
   year
  <int>
1  1952
2  1957
3  1962
4  1967
5  1972
6  1977
  • Similarly, [[ will act to extract a single column:
head(gapminder[["lifeExp"]])
[1] 28.801 30.332 31.997 34.020 36.088 38.438

Data frames – continued

  • The $ symbol provides a convenient shorthand to extract columns by name:
head(gapminder$year)
[1] 1952 1957 1962 1967 1972 1977
  • With two arguments, [ behaves the same way as for matrices:
gapminder[1:3,]
# A tibble: 3 x 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.

Data frames – continued

  • If we subset a single row, the result will be a data frame (because the elements are mixed types):
gapminder[3,]
# A tibble: 1 x 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1962    32.0 10267083      853.
  • But for a single column the result will be a vector (this can be changed with the third argument, drop = FALSE).

Data frames – continued

  • Most of R's functions are vectorized, meaning that the function will operate on all elements of a vector without needing to loop through and act on each element one at a time.

  • This makes writing code more concise, easy to read, and less error prone.

x <- 1:4
x * 2
[1] 2 4 6 8
  • The multiplication happened to each element of the vector.

Vectorization

  • We can also add two vectors together:
y <- 6:9
x + y
[1]  7  9 11 13
  • Each element of x was added to its corresponding element of y:
x:  1  2  3  4
    +  +  +  +
y:  6  7  8  9
---------------
    7  9 11 13
  • Comparison operators, logical operators, and many functions are also vectorized:
x > 2
[1] FALSE FALSE  TRUE  TRUE
a <- x > 3  # or, for clarity, a <- (x > 3)
a
[1] FALSE FALSE FALSE  TRUE

Vectorization and performance

  • R, while having the benefit of being an easy-to-learn language with powerful software, is not especially fast.

  • For many users, this doesn't pose an obstacle however as the vectorization of the language can actually make many computations in R competitive.

    • When a mathematic operation or function is run as a vectorized operation, the computer calls underlying C code that has been optimized for performance.
    • This is the same performance-gain technique that is used in, e.g., MATLAB and Python.
  • Though we have not discussed FOR loops yet, we will mention now that in general you should always try to write your operations vector-wise instead of with FOR loops in R.

A remark on element-wise vs. matrix multiplication

  • Very important: the operator * gives you element-wise multiplication!
  • To do matrix multiplication, we need to use the %*% operator:
m %*% matrix(1, nrow=4, ncol=1)
            [,1]
[1,]  0.06095586
[2,] -0.69883054
[3,]  1.78406103
[4,]  2.02709511
[5,]  1.89966366
[6,] -1.47614063
matrix(1:4, nrow=1) %*% matrix(1:4, ncol=1)
     [,1]
[1,]   30