Subsetting data and vectorization part II


  The following topics will be covered in this lecture:

    • More on subsetting data
    • Data frames
    • Vectorization

Factor subsetting

  • Now that we've explored the different ways to subset vectors, how do we subset the other data structures?

  • Factor subsetting works the same way as vector subsetting.

f <- factor(c("a", "a", "b", "c", "c", "d"))
f[f == "a"]
[1] a a
Levels: a b c d
f[f %in% c("b", "c")]
[1] b c c
Levels: a b c d
f[1:3]
[1] a a b
Levels: a b c d

Factor subsetting – continued

  • Skipping elements will not remove the level even if no more of that category exists in the factor:
f[-3]
[1] a a c c d
Levels: a b c d
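
A minimal sketch of the point above, reusing the factor f from the earlier slide. The use of droplevels() is an addition not covered in the lecture text; it is a base R function that discards levels with no remaining observations.

```r
f <- factor(c("a", "a", "b", "c", "c", "d"))

g <- f[-3]               # skip the third element, the only "b"
kept_levels <- levels(g) # "b" is still listed as a level
b_count <- table(g)[["b"]]  # and it is counted zero times

g2 <- droplevels(g)      # drop levels that no longer occur
new_levels <- levels(g2)

print(kept_levels)
print(new_levels)
```

Keeping unused levels is often what you want (for example, so that tables from different subsets line up); dropping them is an explicit, separate step.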

Matrix subsetting

  • Matrices are also subset using the [ function. In this case it takes two arguments: the first applies to the rows, the second to the columns:
m <- matrix(rnorm(6*4), ncol=4, nrow=6)
m[3:4, c(3,1)]
            [,1]       [,2]
[1,]  1.12493092 -0.8356286
[2,] -0.04493361  1.5952808
  • You can leave the first or second argument blank to retrieve all the rows or columns respectively:
m[, c(3,4)]
            [,1]        [,2]
[1,] -0.62124058  0.82122120
[2,] -2.21469989  0.59390132
[3,]  1.12493092  0.91897737
[4,] -0.04493361  0.78213630
[5,] -0.01619026  0.07456498
[6,]  0.94383621 -1.98935170

Matrix subsetting – continued

  • If we only access one row or column, R will automatically convert the result to a vector:
m[3, ]
[1] -0.8356286  0.5757814  1.1249309  0.9189774
  • If you want to keep the output as a matrix, you need to specify a third argument, drop = FALSE:
m[3, , drop=FALSE]
           [,1]      [,2]     [,3]      [,4]
[1,] -0.8356286 0.5757814 1.124931 0.9189774
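
The difference between the two results can be checked directly. This is a small sketch on a hypothetical integer matrix m_demo (not the random matrix m used above), so that the results are reproducible.

```r
m_demo <- matrix(1:24, nrow = 6, ncol = 4)  # hypothetical example matrix

row_vec <- m_demo[3, ]                # dimensions are dropped: plain vector
row_mat <- m_demo[3, , drop = FALSE]  # dimensions are kept: 1 x 4 matrix

print(is.matrix(row_vec))  # FALSE
print(dim(row_vec))        # NULL
print(dim(row_mat))        # 1 4
```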

Matrix subsetting – continued

  • Unlike vectors, if we try to access a row or column outside of the matrix, R will throw an error:
m[, c(3,6)]
Error in m[, c(3, 6)]: subscript out of bounds
  • When dealing with multi-dimensional arrays, each argument to [ corresponds to a dimension. For example, for a 3D array, the three arguments correspond to the rows, columns, and depth dimension.
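
As an illustration of subsetting a 3D array, here is a short sketch; the array arr and its dimensions are made up for this example.

```r
arr <- array(1:24, dim = c(2, 3, 4))  # 2 rows, 3 columns, 4 layers

one_value <- arr[2, 3, 1]   # row 2, column 3, layer 1
layer2    <- arr[, , 2]     # all rows and columns of layer 2 (a 2 x 3 matrix)
two_cols  <- arr[, 1:2, ]   # first two columns across every layer

print(one_value)      # 6
print(dim(layer2))    # 2 3
print(dim(two_cols))  # 2 2 4
```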

Matrix subsetting – continued

  • Because matrices are vectors, we can also subset using only one argument:
m[5]
[1] 0.3295078
  • This is rarely useful and is often confusing to read. It is worth noting, however, that matrices are laid out in column-major format by default.

  • That is, the elements of the vector are arranged column-wise:

matrix(1:6, nrow=2, ncol=3)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
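
The column-major layout determines which element a single index refers to. The sketch below uses the same 2 x 3 matrix, stored here under the hypothetical name cm; arrayInd() is a base R helper that converts a linear index into row and column positions.

```r
cm <- matrix(1:6, nrow = 2, ncol = 3)

by_single <- cm[5]      # fifth element, counting down the columns
by_pair   <- cm[1, 3]   # the same element addressed by row and column
position  <- arrayInd(5, dim(cm))  # linear index 5 is row 1, column 3

print(by_single)  # 5
print(by_pair)    # 5
print(position)
```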

Matrix subsetting – continued

  • If you wish to populate the matrix by row, use byrow=TRUE:
matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
  • Matrices can also be subset using their row names and column names instead of their row and column indices.
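
A minimal sketch of name-based subsetting. The matrix scores and its row and column names are hypothetical, chosen only for illustration.

```r
scores <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE,
                 dimnames = list(c("r1", "r2"), c("x", "y", "z")))

one_cell <- scores["r2", "y"]       # single element by row and column name
reordered <- scores[, c("z", "x")]  # select and reorder columns by name

print(one_cell)   # 5
print(reordered)
```

Names tend to be more robust than numeric indices when the column order of a matrix may change.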

List subsetting

  • Now we'll introduce some new subsetting operators. There are three functions used to subset lists. We've already seen these when learning about atomic vectors and matrices: [, [[, and $.

  • Using [ will always return a list. If you want to subset a list, but not extract an element, then you will likely use [.

xlist <- list(a = "Software Carpentry", b = 1:10, data = head(iris))
xlist[1]
$a
[1] "Software Carpentry"
  • This returns a list with one element.

  • To extract individual elements of a list, you need to use the double-square bracket function: [[.

xlist[[1]]
[1] "Software Carpentry"

List subsetting – continued

  • You can't extract more than one element at once:
xlist[[1:2]]
Error in xlist[[1:2]]: subscript out of bounds
  • Nor use it to skip elements:
xlist[[-1]]
Error in xlist[[-1]]: invalid negative subscript in get1index <real>
  • But you can use names to both subset and extract elements:
xlist[["a"]]
[1] "Software Carpentry"
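
To make the difference between [ and [[ concrete, here is a short sketch on a hypothetical list demo_list that mirrors the first two elements of xlist.

```r
demo_list <- list(a = "Software Carpentry", b = 1:10)  # hypothetical list

sub <- demo_list["a"]    # `[` returns a list containing the element
el  <- demo_list[["a"]]  # `[[` returns the element itself

print(class(sub))  # "list"
print(class(el))   # "character"

two <- demo_list[c("a", "b")]  # `[` can select several elements at once
print(length(two))             # 2
```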

List subsetting – continued

  • The $ function is a shorthand for extracting elements by name:
xlist$data
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
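
One practical difference between $ and [[ is worth a sketch: $ takes the name literally, while [[ evaluates its argument, so only [[ works when the name is stored in a variable. The list demo_list and the variable key are hypothetical.

```r
demo_list <- list(a = "Software Carpentry", b = 1:10)  # hypothetical list
key <- "b"

same <- identical(demo_list$b, demo_list[["b"]])  # both extract element "b"
via_variable <- demo_list[[key]]  # name looked up from the variable
literal <- demo_list$key          # looks for an element literally named "key"

print(same)          # TRUE
print(via_variable)
print(literal)       # NULL
```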

Data frames

  • Remember that data frames are lists under the hood, so similar rules apply.

  • However they are also two dimensional objects:

    • [ with one argument will act the same way as for lists, where each list element corresponds to a column.
    • The resulting object will be a data frame:
head(gapminder[3])
# A tibble: 6 x 1
   year
  <int>
1  1952
2  1957
3  1962
4  1967
5  1972
6  1977
  • Similarly, [[ will act to extract a single column:
head(gapminder[["lifeExp"]])
[1] 28.801 30.332 31.997 34.020 36.088 38.438

Data frames – continued

  • The $ symbol provides a convenient shorthand to extract columns by name:
head(gapminder$year)
[1] 1952 1957 1962 1967 1972 1977
  • With two arguments, [ behaves the same way as for matrices:
gapminder[1:3, ]
# A tibble: 3 x 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.

Data frames – continued

  • If we subset a single row, the result will be a data frame (because the elements are mixed types):
gapminder[3, ]
# A tibble: 1 x 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1962    32.0 10267083      853.
  • But for a single column the result will be a vector (this can be changed with the third argument, drop = FALSE).
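
The rules above can be checked on a small base R data frame. This is a sketch only: df is hypothetical, with values modeled on the printed output above. Note that the outputs shown in this lecture are tibbles, which, as far as we are aware, keep a single selected column as a tibble when subset with [, so the drop behavior below applies to plain data frames.

```r
df <- data.frame(year    = c(1952L, 1957L, 1962L),
                 lifeExp = c(28.8, 30.3, 32.0))  # hypothetical data

one_arg  <- df["year"]                  # list-style: still a data frame
extract  <- df[["year"]]                # a single column as a vector
col_vec  <- df[, "year"]                # matrix-style: dropped to a vector
col_df   <- df[, "year", drop = FALSE]  # matrix-style: kept as a data frame
one_row  <- df[3, ]                     # a single row stays a data frame

print(class(one_arg))  # "data.frame"
print(class(extract))  # "integer"
print(class(col_vec))  # "integer"
print(class(col_df))   # "data.frame"
print(class(one_row))  # "data.frame"
```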

Vectorization

  • Most of R's functions are vectorized, meaning that the function will operate on all elements of a vector without needing to loop through and act on each element one at a time.

  • This makes code more concise, easier to read, and less error-prone.

x <- 1:4
x * 2
[1] 2 4 6 8
  • The multiplication happened to each element of the vector.


  • We can also add two vectors together:
y <- 6:9
x + y
[1]  7  9 11 13
  • Each element of x was added to its corresponding element of y:
x:  1  2  3  4
    +  +  +  +
y:  6  7  8  9
    7  9 11 13
  • Comparison operators, logical operators, and many functions are also vectorized:
x > 2
[1] FALSE FALSE  TRUE  TRUE
a <- x > 3  # or, for clarity, a <- (x > 3)
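
A few more vectorized operations on the same x, as a short sketch. Logical results like these are what make the subsetting patterns from earlier in the lecture work.

```r
x <- 1:4

both <- x > 1 & x < 4   # element-wise logical AND
roots <- sqrt(x)        # many functions operate on every element
picked <- x[x > 2]      # a logical vector used as a subset index
n_large <- sum(x > 2)   # TRUE counts as 1, so this counts matches

print(both)     # FALSE TRUE TRUE FALSE
print(roots)
print(picked)   # 3 4
print(n_large)  # 2
```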

Vectorization and performance

  • R, while having the benefit of being an easy-to-learn language with powerful software, is not especially fast.

  • For many users this is not an obstacle, however, because vectorization can make many computations in R competitive.

    • When a mathematical operation or function is run as a vectorized operation, R calls underlying C code that has been optimized for performance.
    • This is the same performance-gain technique that is used in, e.g., MATLAB and Python (with NumPy).
  • Though we have not discussed for loops yet, we will mention now that in R you should generally prefer vectorized operations over for loops wherever possible.
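
To make the comparison concrete, the sketch below computes the same result with a for loop and with a vectorized expression. The names and the vector length are arbitrary choices for illustration; no timings are claimed here, but wrapping each version in system.time() is one way to compare them on your own machine.

```r
n <- 1e5
v <- seq_len(n)

# Loop version: square each element one at a time
squares_loop <- numeric(n)
for (i in seq_len(n)) {
  squares_loop[i] <- v[i]^2
}

# Vectorized version: one expression over the whole vector
squares_vec <- v^2

same_result <- identical(squares_loop, squares_vec)
print(same_result)  # TRUE
```

The vectorized form is also shorter and leaves no index bookkeeping to get wrong.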

A remark on element-wise vs. matrix multiplication

  • Very important: the operator * gives you element-wise multiplication!
  • To do matrix multiplication, we need to use the %*% operator:
m %*% matrix(1, nrow=4, ncol=1)
            [,1]
[1,]  0.06095586
[2,] -0.69883054
[3,]  1.78406103
[4,]  2.02709511
[5,]  1.89966366
[6,] -1.47614063
matrix(1:4, nrow=1) %*% matrix(1:4, ncol=1)
     [,1]
[1,]   30
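
The two operators give different results even for matrices of the same shape. A small sketch with two hypothetical 2 x 2 matrices A and B:

```r
A <- matrix(1:4, nrow = 2)  # columns are (1, 2) and (3, 4)
B <- matrix(5:8, nrow = 2)  # columns are (5, 6) and (7, 8)

elementwise <- A * B    # multiplies matching entries
product     <- A %*% B  # rows of A times columns of B

print(elementwise)
print(product)
```

Here A * B has entries 5, 12, 21, 32 (in column order), while A %*% B has entries 23, 34, 31, 46.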