The following topics will be covered in this lecture:
cats <- data.frame(coat = c("calico", "black", "tabby"),
weight = c(2.1, 5.0, 3.2),
likes_string = c(1, 0, 1))
cats
coat weight likes_string
1 calico 2.1 1
2 black 5.0 0
3 tabby 3.2 1
The names given to the vector arguments in “data.frame()” become the column names.
Each column consists of a vector of uniform data type.
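As a minimal sketch of the two points above, the column names and the type of each column can be inspected directly. The name “cats_check” is introduced here only for illustration, and “stringsAsFactors = FALSE” is passed explicitly so that the result does not depend on the R version:

```r
# Rebuild the example dataframe under a separate, illustrative name.
cats_check <- data.frame(coat = c("calico", "black", "tabby"),
                         weight = c(2.1, 5.0, 3.2),
                         likes_string = c(1, 0, 1),
                         stringsAsFactors = FALSE)

# The argument names became the column names.
column_names <- names(cats_check)

# Each column is a single vector with one storage type.
column_types <- sapply(cats_check, typeof)

print(column_names)
print(column_types)
```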
If we want to extract a named column from a dataframe, this can be done with the “$” sign and the column name:
cats$weight
[1] 2.1 5.0 3.2
We might suppose that the scale used to measure the cats' weights read two kg too low.
In this case, we can re-assign values into the column weight as follows:
cats$weight
[1] 2.1 5.0 3.2
cats$weight <- cats$weight + 2
cats
coat weight likes_string
1 calico 4.1 1
2 black 7.0 0
3 tabby 5.2 1
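The re-assignment works because arithmetic on a vector is applied element by element. A small sketch of the same correction, assuming an illustrative dataframe “cats_fix” and a variable “offset” that are not part of the lecture:

```r
# Illustrative copy of the weight column before the correction.
cats_fix <- data.frame(weight = c(2.1, 5.0, 3.2))
offset <- 2  # assumed scale error in kg

# The addition is applied to every element of the column at once,
# and the result replaces the old column.
cats_fix$weight <- cats_fix$weight + offset

print(cats_fix$weight)
```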
A data structure related to the vector is the list.
Lists function as containers for heterogeneous data, allowing different types:
list_example <- list(1, "a", TRUE, 1+4i)
list_example
[[1]]
[1] 1
[[2]]
[1] "a"
[[3]]
[1] TRUE
[[4]]
[1] 1+4i
typeof(list_example[[1]])
[1] "double"
typeof(list_example[[2]])
[1] "character"
list_example + 2
Error in list_example + 2: non-numeric argument to binary operator
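Arithmetic fails on the list as a whole, but it works on a single element extracted with “[[ ]]”. The following sketch, which introduces the illustrative names “element_types” and “shifted”, inspects every element type and then adds 2 to the first element only:

```r
list_example <- list(1, "a", TRUE, 1+4i)

# Inspect the type of every element in one pass.
element_types <- sapply(list_example, typeof)
print(element_types)

# Arithmetic works once a single numeric element is extracted with [[ ]].
shifted <- list_example[[1]] + 2
print(shifted)
```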
cats
coat weight likes_string
1 calico 4.1 1
2 black 7.0 0
3 tabby 5.2 1
We know that dataframes contain homogeneous data in each column, but each row may be inhomogeneous.
Q: What kind of data structure do you think a dataframe is? Can it be a vector? Why or why not?
A: A dataframe cannot be a vector, because coercion would force all of its values into a single type; instead it operates as a list of vectors of equal length:
typeof(cats)
[1] "list"
typeof(cats$weight)
[1] "double"
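The list structure can be checked in a few more ways. This is a sketch under the assumption of a reduced, illustrative dataframe “cats_list”; it shows that list-style extraction with “[[ ]]” returns the same vector as “$”, and that every element of the list has the same length:

```r
cats_list <- data.frame(coat = c("calico", "black", "tabby"),
                        weight = c(4.1, 7.0, 5.2),
                        stringsAsFactors = FALSE)

# A dataframe passes the list test, with one list element per column.
frame_is_list <- is.list(cats_list)
one_element_per_column <- length(cats_list) == ncol(cats_list)

# List-style extraction gives the same vector as the "$" shortcut.
same_column <- identical(cats_list[["weight"]], cats_list$weight)

# Every column (list element) has the same length: the number of rows.
column_lengths <- sapply(cats_list, length)

print(c(frame_is_list, one_element_per_column, same_column))
print(column_lengths)
```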
cats$coat
[1] "calico" "black" "tabby"
typeof(cats$coat)
[1] "character"
as.numeric(cats$coat)
[1] NA NA NA
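Q: Can you hypothesize what the meaning is of this data? In versions of R before 4.0.0 the coats are printed with “Levels” and have type “integer”; what is a level, and why would the coats be “integer”?
A: Before R 4.0.0, R treated character strings in dataframes as categorical variables (factors) by default, storing each value as an integer code that points into a set of levels. In the output above, produced with a later version, the coats remain character strings, which is why “as.numeric()” returns “NA” for each of them.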
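To make the idea of levels and integer codes concrete, here is a minimal sketch that builds a factor explicitly with “factor()”. The names “coat_levels”, “coat_codes” and “coat_storage” are illustrative only:

```r
# Build a categorical variable explicitly, independent of the R version.
coat <- factor(c("calico", "black", "tabby"))

# The levels are the distinct categories, sorted alphabetically by default.
coat_levels <- levels(coat)

# Each value is stored as the integer position of its level.
coat_codes <- as.integer(coat)

# The underlying storage type is therefore "integer".
coat_storage <- typeof(coat)

print(coat_levels)
print(coat_codes)
print(coat_storage)
```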
Let's suppose that we need to include more information on our cats in our analysis, such as their ages:
age <- c(2, 3, 5)
cats <- cbind(cats, age)
cats
coat weight likes_string age
1 calico 4.1 1 2
2 black 7.0 0 3
3 tabby 5.2 1 5
The function “cbind()” binds the new vector to the dataframe as an additional column, named after the vector.
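Assigning to a new column name with “$” is another common way to add a column. The sketch below compares the two approaches on an illustrative dataframe “cats_age”; the names “by_cbind” and “by_dollar” are assumptions made for this example:

```r
cats_age <- data.frame(coat = c("calico", "black", "tabby"),
                       weight = c(4.1, 7.0, 5.2),
                       stringsAsFactors = FALSE)
age <- c(2, 3, 5)

# Approach 1: bind the vector as a new column.
by_cbind <- cbind(cats_age, age)

# Approach 2: assign the vector to a new column name.
by_dollar <- cats_age
by_dollar$age <- age

# Both results carry the same column names and the same age values.
print(names(by_cbind))
print(names(by_dollar))
```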
age <- c(2, 3, 5, 8, 9)
cats <- cbind(cats, age)
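A: Dataframes, like matrices, need to have consistent dimensions: every column must have the same number of rows, so a vector of length 5 cannot be bound to a dataframe with 3 rows: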
age <- c(2, 3, 5, 8, 9)
cats <- cbind(cats, age)
Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 5
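One way to guard against this error is to compare the vector length with the number of rows before binding. The following sketch assumes an illustrative dataframe “cats_dim” and uses “tryCatch()” to capture the error message rather than stop the script:

```r
cats_dim <- data.frame(coat = c("calico", "black", "tabby"),
                       weight = c(4.1, 7.0, 5.2),
                       stringsAsFactors = FALSE)
age_long <- c(2, 3, 5, 8, 9)

# Compare the vector length with the number of rows before binding.
lengths_match <- length(age_long) == nrow(cats_dim)

# tryCatch() returns the error message as a string instead of halting.
result <- tryCatch(cbind(cats_dim, age_long),
                   error = function(e) conditionMessage(e))

print(lengths_match)
print(result)
```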
dim(cats)
[1] 3 4
In the above we see the standard, matrix-style dimensions of the dataframe: the number of rows first, then the number of columns.
nrow(cats)
[1] 3
ncol(cats)
[1] 4
cats[2,4]
[1] 3
cats[1:2, 2:3]
weight likes_string
1 4.1 1
2 7.0 0
cats[,1]
[1] "calico" "black" "tabby"
cats[1,]
coat weight likes_string age
1 calico 4.1 1 2
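Besides numeric positions, rows and columns can also be selected by column name or by a logical condition. This is a short sketch on an illustrative copy of the data, “cats_idx”; the condition on weight is an assumption chosen for the example:

```r
cats_idx <- data.frame(coat = c("calico", "black", "tabby"),
                       weight = c(4.1, 7.0, 5.2),
                       likes_string = c(1, 0, 1),
                       age = c(2, 3, 5),
                       stringsAsFactors = FALSE)

# Select a cell by row number and column name (same cell as cats_idx[2, 4]).
age_second <- cats_idx[2, "age"]

# Select the rows for which a logical condition holds.
heavy <- cats_idx[cats_idx$weight > 5, ]

# drop = FALSE keeps a one-column dataframe instead of a bare vector.
coat_df <- cats_idx[, "coat", drop = FALSE]

print(age_second)
print(heavy)
print(coat_df)
```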
newRow <- list("tortoiseshell", 3.3, TRUE, 9)
cats <- rbind(cats, newRow)
cats
coat weight likes_string age
1 calico 4.1 1 2
2 black 7.0 0 3
3 tabby 5.2 1 5
4 tortoiseshell 3.3 1 9
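In versions of R before 4.0.0, where “coat” is stored as a factor, the coat of the fourth cat would appear as “NA” here, because “tortoiseshell” is not one of the existing levels. In the output above, “coat” is a character vector, so the new value is accepted.
Factors are strict in R: a factor only accepts values that are among its levels, so a new level has to be added before a new value can be stored. In the run below “coat” is not a factor, so “levels()” returns NULL; assigning levels to it anyway makes the column come out of “rbind()” as a factor with the single level “tortoiseshell”, and the three original coats turn into “NA”: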
levels(cats$coat)
NULL
levels(cats$coat) <- c(levels(cats$coat), "tortoiseshell")
cats <- rbind(cats, list("tortoiseshell", 3.3, TRUE, 9))
cats
coat weight likes_string age
1 <NA> 4.1 1 2
2 <NA> 7.0 0 3
3 <NA> 5.2 1 5
4 tortoiseshell 3.3 1 9
5 tortoiseshell 3.3 1 9
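The strictness of factors is easiest to see on a standalone factor. The following is a minimal sketch, not the lecture's own example; the names “rejected” and “accepted” are illustrative. It first assigns a value that is not a level, then adds the level and repeats the assignment:

```r
coat <- factor(c("calico", "black", "tabby"))

# Assigning a value that is not a level produces NA (with a warning,
# which is silenced here).
rejected <- coat
suppressWarnings(rejected[4] <- "tortoiseshell")

# Adding the level first lets the same assignment succeed.
accepted <- coat
levels(accepted) <- c(levels(accepted), "tortoiseshell")
accepted[4] <- "tortoiseshell"

print(rejected)
print(accepted)
```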
In general, we want to know whether a dataframe has missing values, and what kinds of variables are in it.
Several common functions report this, including “str()” and “summary()”:
str(cats)
'data.frame': 5 obs. of 4 variables:
$ coat : Factor w/ 1 level "tortoiseshell": NA NA NA 1 1
$ weight : num 4.1 7 5.2 3.3 3.3
$ likes_string: num 1 0 1 1 1
$ age : num 2 3 5 9 9
summary(cats)
coat weight likes_string age
tortoiseshell:2 Min. :3.30 Min. :0.0 Min. :2.0
NA's :3 1st Qu.:3.30 1st Qu.:1.0 1st Qu.:3.0
Median :4.10 Median :1.0 Median :5.0
Mean :4.58 Mean :0.8 Mean :5.6
3rd Qu.:5.20 3rd Qu.:1.0 3rd Qu.:9.0
Max. :7.00 Max. :1.0 Max. :9.0
We will often want to remove cases with missing data; “na.omit()” drops every row that contains an “NA”:
na.omit(cats)
coat weight likes_string age
4 tortoiseshell 3.3 1 9
5 tortoiseshell 3.3 1 9
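Before dropping rows, it can help to count where the missing values are. A sketch under the assumption of a small illustrative dataframe “cats_na”, using “is.na()” and “complete.cases()”:

```r
cats_na <- data.frame(coat = c(NA, NA, "tortoiseshell"),
                      weight = c(4.1, 7.0, 3.3),
                      stringsAsFactors = FALSE)

# Does the dataframe contain any missing value at all?
has_missing <- any(is.na(cats_na))

# How many missing values are in each column?
missing_per_column <- colSums(is.na(cats_na))

# Which rows have no missing value in any column?
complete_rows <- complete.cases(cats_na)

# Keeping only the complete rows selects the same rows as na.omit().
cleaned <- cats_na[complete_rows, ]

print(has_missing)
print(missing_per_column)
print(cleaned)
```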
cats[-1,]
coat weight likes_string age
2 <NA> 7.0 0 3
3 <NA> 5.2 1 5
4 tortoiseshell 3.3 1 9
5 tortoiseshell 3.3 1 9
cats[,-1]
weight likes_string age
1 4.1 1 2
2 7.0 0 3
3 5.2 1 5
4 3.3 1 9
5 3.3 1 9
write.csv(x = cats, file = "feline-data.csv", row.names = FALSE)
cats_from_file <- read.csv(file = "feline-data.csv")
cats_from_file
coat weight likes_string age
1 <NA> 4.1 1 2
2 <NA> 7.0 0 3
3 <NA> 5.2 1 5
4 tortoiseshell 3.3 1 9
5 tortoiseshell 3.3 1 9
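The write and read steps can be tried without leaving a file behind by using a temporary path. In this sketch “cats_out”, “csv_path” and “cats_back” are illustrative names, and the file location comes from “tempfile()” rather than the lecture's “feline-data.csv”:

```r
cats_out <- data.frame(coat = c("calico", "black", "tabby"),
                       weight = c(2.1, 5.0, 3.2),
                       likes_string = c(1, 0, 1),
                       stringsAsFactors = FALSE)

# Write to a temporary file so the working directory is left untouched.
csv_path <- tempfile(fileext = ".csv")
write.csv(x = cats_out, file = csv_path, row.names = FALSE)

# Read the file back; column types are inferred again from the text.
cats_back <- read.csv(file = csv_path, stringsAsFactors = FALSE)
unlink(csv_path)  # remove the temporary file

str(cats_back)
```

Note that the types are inferred from the file contents on reading, so a column may come back with a different numeric type than it was written with.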
coat, weight, likes_string
calico, 2.1, 1
black, 5.0, 0
tabby, 3.2, 1
tabby, 2.3 or 2.4, 1
Our friend was uncertain about the weight of the last cat and placed two values into the CSV.
We suppose that this is written in the file “feline-data_v2.csv”, which we will read with “read.csv()”:
cats_from_file <- read.csv(file="feline-data_v2.csv")
cats_from_file$weight
[1] "2.1" "5.0" "3.2" "2.3 or 2.4"
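When R read the inhomogeneous data in the weight column, it converted every value to character type, because “2.3 or 2.4” cannot be stored as a number.
In versions of R before 4.0.0, the characters would then be converted to factors automatically; this can be suppressed with an additional argument, which is the default behavior from R 4.0.0 onward: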
cats_from_file <- read.csv(file="feline-data_v2.csv", stringsAsFactors=FALSE)
cats_from_file$weight
[1] "2.1" "5.0" "3.2" "2.3 or 2.4"
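Once the column has been read as character, one possible next step is to find the entries that cannot be interpreted as numbers. This is a sketch, not part of the lecture: the file contents are supplied inline through the “text” argument of “read.csv()”, and the names “csv_text”, “weight_num” and “bad_rows” are illustrative:

```r
# Inline stand-in for the contents of the CSV file.
csv_text <- "coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
tabby,2.3 or 2.4,1"

cats_text <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Entries that are not valid numbers become NA (the warning is silenced).
weight_num <- suppressWarnings(as.numeric(cats_text$weight))

# Locate the rows that need to be corrected by hand.
bad_rows <- which(is.na(weight_num))

print(bad_rows)
print(cats_text$weight[bad_rows])
```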