The following topics will be covered in this lecture:
R has many powerful subset operators – mastering them will allow you to easily perform complex operations on any kind of dataset.
There are six different ways we can subset any kind of object, and three different subsetting operators for the different data structures.
Let's start with the workhorse of R: a simple numeric vector.
x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c('a', 'b', 'c', 'd', 'e')
x
a b c d e
5.4 6.2 7.1 4.8 7.5
So now that we've created a dummy vector to play with, how do we get at its contents?
To extract elements of a vector we can give their corresponding index, starting from one:
x[1]
a
5.4
x[4]
d
4.8
It may look different, but the square brackets operator is a function.
For vectors (and matrices), it means “get me the nth element”.
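To see that the square bracket really is a function, it can be called by name using backticks. This is a minimal sketch; the vector v is a throwaway name introduced here for illustration, so that x is left untouched.

```r
# A small named vector, defined here so the example is self-contained
v <- c(a = 5.4, b = 6.2, c = 7.1)

# The usual bracket syntax ...
usual <- v[2]

# ... gives the same result as calling the function `[` directly
direct <- `[`(v, 2)

identical(usual, direct)
```

In everyday code the bracket form is what you would write; the function form is shown only to make the point that subsetting is an ordinary function call.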
x[c(1, 3)]
a c
5.4 7.1
x[1:4]
a b c d
5.4 6.2 7.1 4.8
The : operator creates a sequence of numbers from the left element to the right.
1:4
[1] 1 2 3 4
c(1, 2, 3, 4)
[1] 1 2 3 4
x[c(1,1,3)]
a a c
5.4 5.4 7.1
x[6]
<NA>
NA
This is a vector of length one containing an NA, whose name is also NA.
If we ask for the 0th element, we get an empty vector:
x[0]
named numeric(0)
In many programming languages (C and Python, for example), the first element of a vector has an index of 0.
In R, the first element has an index of 1.
x[-2]
a c d e
5.4 7.1 4.8 7.5
x[c(-1, -5)] # or x[-c(1,5)]
b c d
6.2 7.1 4.8
A common stumbling block for novices occurs when trying to skip slices of a vector.
It's natural to try to negate a slice as follows:
x[-1:3]
Error in x[-1:3]: only 0's may be mixed with negative subscripts
The key to understanding this is remembering the order of operations.
The : operator is really a function, and the unary minus is applied before it. It therefore takes its first argument as -1 and its second as 3, and generates the sequence of numbers c(-1, 0, 1, 2, 3).
The correct solution is to wrap that function call in parentheses, so that the - operator applies to the result:
x[-(1:3)]
d e
4.8 7.5
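The precedence issue can be checked directly. The sketch below uses a throwaway vector v (a name assumed here, not part of the lecture) to confirm that -1:3 is read as (-1):3, and that the parenthesised form drops the first three elements.

```r
v <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

# Unary minus is applied before `:`, so these are the same sequence
same_sequence <- identical(-1:3, (-1):3)

# Parentheses force the sequence to be built first, then negated
negated <- -(1:3)
remaining <- v[negated]

same_sequence
names(remaining)
```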
x <- x[-4]
x
a b c e
5.4 6.2 7.1 7.5
x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we can name a vector 'on the fly'
x[c("a", "c")]
a c
5.4 7.1
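Subsetting by name is usually a much more reliable way to subset objects than subsetting by position: positions can shift as elements are removed, but names stay the same.
We can also use any logical vector to subset: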
x[c(FALSE, FALSE, TRUE, FALSE, TRUE)]
c e
7.1 7.5
Since comparison operators (e.g. >, <, ==) evaluate to logical vectors, we can also use them to succinctly subset vectors:
x[x > 7]
c e
7.1 7.5
Breaking it down, this statement first evaluates x > 7, generating the logical vector c(FALSE, FALSE, TRUE, FALSE, TRUE), and then selects the elements of x corresponding to the TRUE values.
We can use == to mimic the previous method of indexing by name (remember you have to use == rather than = for comparisons):
x[names(x) == "a"]
a
5.4
We often want to combine multiple logical criteria.
For example, we might want to find all the countries that are located in Asia or Europe and have life expectancies within a certain range.
Several operations for combining logical vectors exist in R:
&, the “logical AND” operator: returns TRUE if both the left and right are TRUE.
|, the “logical OR” operator: returns TRUE if either the left or right (or both) are TRUE.
You may sometimes see && and || instead of & and |. These two-character operators work on single values only: older versions of R looked at just the first element of each vector and ignored the rest, while R 4.3.0 and later raise an error if either operand has more than one element. In general you should not use the two-character operators in data analysis; save them for programming, i.e. deciding whether to execute a statement.
!, the “logical NOT” operator: converts TRUE to FALSE and FALSE to TRUE. It can negate a single logical condition (e.g. !TRUE becomes FALSE), or a whole vector of conditions (e.g. !c(TRUE, FALSE) becomes c(FALSE, TRUE)).
Additionally, you can compare the elements within a single vector using the all function (which returns TRUE if every element of the vector is TRUE) and the any function (which returns TRUE if one or more elements of the vector are TRUE).
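The operators above can be combined inside the square brackets. The following is a minimal sketch using a throwaway vector v with the same values as x; the thresholds 5, 7 and 8 are arbitrary choices made for illustration.

```r
v <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

# Element-wise AND: values strictly between 5 and 7
between <- v[v > 5 & v < 7]

# Element-wise OR: values below 5 or above 7
outside <- v[v < 5 | v > 7]

# NOT flips every element of the logical vector
not_large <- v[!(v > 7)]

# all() and any() collapse a logical vector to a single TRUE or FALSE
all_positive <- all(v > 0)
any_above_eight <- any(v > 8)
```

Note that & and | are used here rather than && and ||, because each condition is a vector with one entry per element of v.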
You should be aware that it is possible for multiple elements in a vector to have the same name. (For a data frame, columns can have the same name, although R tries to avoid this, but row names must be unique.)
Consider this example:
x <- 1:3
x
[1] 1 2 3
names(x) <- c('a', 'a', 'a')
x
a a a
1 2 3
x['a'] # only returns first value
a
1
x[names(x) == 'a'] # returns all three values
a a a
1 2 3
x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we start again by naming a vector 'on the fly'
x[-"a"]
Error in -"a": invalid argument to unary operator
However, we can use the != (not-equals) operator to construct a logical vector that will do what we want:
x[names(x) != "a"]
b c d e
6.2 7.1 4.8 7.5
Skipping multiple named elements is a little harder. Suppose we want to drop the "a" and "c" elements, so we try this:
x[names(x) != c("a", "c")]
b c d e
6.2 7.1 4.8 7.5
R did something, but it gave us a warning that we ought to pay attention to, and it apparently gave us the wrong answer (the "c" element is still included in the vector)!
So what does != actually do in this case? That's an excellent question…
names(x) != c("a", "c")
[1] FALSE TRUE TRUE TRUE TRUE
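When you use !=, R tries to compare each element of the left argument with the corresponding element of its right argument.
What happens when you compare vectors of different lengths? R recycles the shorter vector until it matches the length of the longer one, so here c("a", "c") is stretched to c("a", "c", "a", "c", "a"). The "c" element of x is therefore compared with "a", not "c", and is kept.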
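One way to see what R does with the shorter vector is to recycle it by hand. The sketch below uses rep_len from base R and a throwaway vector v (a name assumed here for illustration) with the same names as x.

```r
v <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

# Stretch c("a", "c") to the length of names(v), as R does internally
recycled <- rep_len(c("a", "c"), length(names(v)))
recycled

# Comparing against the hand-recycled vector gives the same logical
# vector as names(v) != c("a", "c"), but without the length warning
by_hand <- names(v) != recycled
by_hand
```

Lining up names(v) against recycled shows that "c" is only ever compared with "a", which is why it survives the filter.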
The way to get R to do what we really want (match each element of the left argument with all of the elements of the right argument) is to use the %in% operator.
The %in% operator goes through each element of its left argument, in this case the names of x, and asks, “Does this element occur in the second argument?”.
Here, since we want to exclude values, we also need a ! operator to change “in” to “not in”:
x[! names(x) %in% c("a","c") ]
b d e
6.2 4.8 7.5
At some point you will encounter functions in R that cannot handle missing, infinite, or undefined data.
There are a number of special functions you can use to filter out this data:
is.na will return all positions in a vector, matrix, or data.frame containing NA (or NaN).
is.nan and is.infinite will do the same for NaN and Inf.
is.finite will return all positions in a vector, matrix, or data.frame that do not contain NA, NaN or Inf.
na.omit will filter out all missing values from a vector.
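A short sketch of these functions on a single vector may help. The vector v and its values are made up for illustration; as.numeric is used at the end only to strip the bookkeeping attributes that na.omit attaches to its result.

```r
v <- c(1.5, NA, NaN, Inf, -Inf, 3)

is.na(v)        # TRUE for NA and NaN
is.nan(v)       # TRUE only for NaN
is.infinite(v)  # TRUE for Inf and -Inf
is.finite(v)    # TRUE only for ordinary numbers

# Keep only the finite values by subsetting with a logical vector
finite_only <- v[is.finite(v)]

# na.omit drops NA and NaN but keeps Inf and -Inf
without_missing <- as.numeric(na.omit(v))
```

Which filter is appropriate depends on the function you plan to call next: some tolerate infinite values but not missing ones, so is.finite and na.omit are not interchangeable.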