DataFrames and factors


  • The following topics will be covered in this lecture:

    • A review of DataFrames
    • Lists
    • Factors
    • Manipulating dataframes
    • Basic file I/O

Data structures – a quick review

  • We will start by picking up with the “cats” dataframe we studied in the last activity.
cats <- data.frame(coat = c("calico", "black", "tabby"), 
                    weight = c(2.1, 5.0, 3.2), 
                    likes_string = c(1, 0, 1))
  • Notice that the arguments of the function “data.frame()” are three expressions associating a name with a vector.


  • Printing the variable “cats”, we see what tabular data looks like in a dataframe:
cats
    coat weight likes_string
1 calico    2.1            1
2  black    5.0            0
3  tabby    3.2            1
  • The assignment of the vectors to names in the arguments assigned the column names.

  • Each column consists of a vector of uniform data type.

  • If we want to extract a named column from a dataframe, this can be done with the “$” sign and the column name:

cats$weight
[1] 2.1 5.0 3.2
  • Each row, on the other hand, consists of multiple measurements (of different data types) corresponding to one specific case of the data set.
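  • As a minimal sketch of the column/row distinction, the block below rebuilds the data under the hypothetical name “cats_demo” (with “stringsAsFactors = FALSE” set so the result does not depend on the R version) and inspects the type of each column and the class of a single row:

```r
# Hypothetical stand-alone copy of the lecture's data frame
cats_demo <- data.frame(coat = c("calico", "black", "tabby"),
                        weight = c(2.1, 5.0, 3.2),
                        likes_string = c(1, 0, 1),
                        stringsAsFactors = FALSE)

# Every column is a single vector, so it has exactly one type
column_types <- sapply(cats_demo, typeof)
print(column_types)

# A row mixes types, so R returns it as a one-row data frame, not a vector
first_row <- cats_demo[1, ]
print(class(first_row))
```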

Dataframes – continued

  • We might suppose that the scale used to measure the cats' weights read two kilograms too low.

  • In this case, we can re-assign values into the column weight as follows:

cats$weight
[1] 2.1 5.0 3.2
cats$weight <- cats$weight + 2
  • We can verify that the assignment went into the column for weight in “cats”:
cats
    coat weight likes_string
1 calico    4.1            1
2  black    7.0            0
3  tabby    5.2            1


Lists

  • A data structure related to the vector is the list.

  • Lists function as containers for heterogeneous data, allowing different types:

list_example <- list(1, "a", TRUE, 1+4i)
list_example
[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE

[[4]]
[1] 1+4i

Lists – continued

  • In the last example no type coercion has taken place:
typeof(list_example[[1]])
[1] "double"
typeof(list_example[[2]])
[1] "character"
  • All of the original types have been respected, but because the data is allowed to be inhomogeneous we can't use vector operations on a list.
list_example + 2
Error in list_example + 2: non-numeric argument to binary operator
  • Here we see an error message because the “+” operator only knows how to operate on numeric arguments, or on ones that can be coerced to numeric; a list is neither.
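  • Arithmetic is still possible on individual list elements. The sketch below uses a hypothetical copy named “list_demo”; it extracts one element with double brackets, and uses “Filter()” with “lapply()” to operate only on the numeric elements:

```r
# Hypothetical copy of the lecture's list
list_demo <- list(1, "a", TRUE, 1+4i)

# The type of each element, checked one element at a time
element_types <- sapply(list_demo, typeof)
print(element_types)

# Double brackets extract a single element, which supports arithmetic
shifted_first <- list_demo[[1]] + 2
print(shifted_first)

# Keep only the numeric elements, then add 2 to each of them
numeric_only <- Filter(is.numeric, list_demo)
shifted_numeric <- lapply(numeric_only, function(x) x + 2)
```

  • Note that “is.numeric()” is FALSE for logical and complex values, so only the first element survives the filter here.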

Lists – continued

  • Recall our dataframe cats,
    coat weight likes_string
1 calico    4.1            1
2  black    7.0            0
3  tabby    5.2            1
  • We know that dataframes contain homogeneous data in each column, but each row may be inhomogeneous.

  • Q: What kind of data structure do you think a dataframe is? Can it be a vector? Why or why not?

  • A: A dataframe cannot be a vector, because coercion would force every entry to a single type; instead it is stored as a list of vectors:

typeof(cats)
[1] "list"
typeof(cats$weight)
[1] "double"
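  • Because a dataframe is a list of columns, list tools apply to it directly. A minimal sketch, using a hypothetical two-column dataframe “cats_demo”:

```r
# Hypothetical small data frame for illustration
cats_demo <- data.frame(weight = c(4.1, 7.0, 5.2),
                        likes_string = c(1, 0, 1))

is_a_list <- is.list(cats_demo)       # a data frame is built on a list
n_columns <- length(cats_demo)        # list length equals the number of columns
weight_vec <- cats_demo[["weight"]]   # list-style extraction of one column

print(is_a_list)
print(n_columns)
print(weight_vec)
```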


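Factors

  • Consider now the vector “coat” in the dataframe, stored as a factor:
cats$coat <- factor(cats$coat)
cats$coat
[1] calico black  tabby 
Levels: black calico tabby
typeof(cats$coat)
[1] "integer"
as.integer(cats$coat)
[1] 2 1 3
  • Q: Can you hypothesize what this data means? What is a level, and why is the type of “coat” “integer”?

  • A: Factors are how R represents categorical variables. Before R 4.0, “data.frame()” converted character strings to factors by default; since R 4.0 the conversion has to be requested, with “factor()” as above or with “stringsAsFactors = TRUE”.

    • in this case, the categories (levels) are “black”, “calico” and “tabby”;
    • each integer value encodes whether the case (or row) belongs to category 1, 2 or 3, where the category labels are sorted alphabetically.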
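  • The encoding can be checked by hand. In the sketch below (the stand-alone vector “coat” is a hypothetical copy of the column), looking up the integer codes in the levels recovers the original labels:

```r
# Hypothetical stand-alone factor with the same values as the coat column
coat <- factor(c("calico", "black", "tabby"))

coat_levels <- levels(coat)      # categories, sorted alphabetically
coat_codes <- as.integer(coat)   # position of each value within the levels
coat_type <- typeof(coat)        # storage type of the factor

# Indexing the levels by the codes gives back the labels
recovered <- coat_levels[coat_codes]
print(recovered)
```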

Manipulating dataframes

  • Let's suppose that we need to include more information on our cats in our analysis;

    • a friend has provided ages of all the cats for us:
age <- c(2, 3, 5)
  • We want to combine this into our dataframe, which can be done with “cbind()”:
cats <- cbind(cats, age)
cats
    coat weight likes_string age
1 calico    4.1            1   2
2  black    7.0            0   3
3  tabby    5.2            1   5
  • This function introduces the new vector as an additional column in the dataframe;

    • the variable name becomes the column name, and we assign the result back to “cats” so that the change is kept.
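  • A column can also be added by assigning to a new name with “$”. The sketch below, on a hypothetical dataframe “cats_demo”, shows both routes producing the same column names and values:

```r
# Hypothetical data frame and new column
cats_demo <- data.frame(coat = c("calico", "black", "tabby"),
                        weight = c(4.1, 7.0, 5.2),
                        stringsAsFactors = FALSE)
age <- c(2, 3, 5)

# Route 1: cbind() returns a new data frame with the extra column
by_cbind <- cbind(cats_demo, age)

# Route 2: assigning to a new column name modifies a copy in place
by_dollar <- cats_demo
by_dollar$age <- age

print(names(by_cbind))
print(names(by_dollar))
```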

Adding columns to dataframes

  • Q: Can you hypothesize what the output will be if we try to combine the following vector with the dataframe?
age <- c(2, 3, 5, 8, 9)
cats <- cbind(cats, age)
  • A: dataframes, like matrices, need to have consistent dimensions of the data;

    • this new column is too long, so we will get an error:
age <- c(2, 3, 5, 8, 9)
cats <- cbind(cats, age)
Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 5
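  • In a script it can help to check the length first, or to capture the error rather than stop. A minimal sketch with hypothetical names, using “tryCatch()” to turn the error into a message string:

```r
# Hypothetical data frame with three rows and a vector that is too long
cats_demo <- data.frame(coat = c("calico", "black", "tabby"),
                        weight = c(4.1, 7.0, 5.2))
age <- c(2, 3, 5, 8, 9)

# Compare the vector length with the number of rows before combining
lengths_match <- length(age) == nrow(cats_demo)
print(lengths_match)

# tryCatch() returns the error message instead of halting the script
result <- tryCatch(cbind(cats_demo, age),
                   error = function(e) conditionMessage(e))
print(result)
```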

Dataframe dimensions

  • We can examine the dimensions of a dataframe with standard functions:
dim(cats)
[1] 3 4
  • In the above we see the standard, matrix-style dimensions of the dataframe (rows, then columns);

    • these can also be extracted individually with “nrow()” and “ncol()”:
nrow(cats)
[1] 3
ncol(cats)
[1] 4

Dataframe indices

  • Entries of dataframes can be accessed directly using matrix indexing, with the row index first and the column index second:
cats[2, 4]
[1] 3
  • This can also be performed with slices:
cats[1:2, 2:3]
  weight likes_string
1    4.1            1
2    7.0            0

Dataframe indices – continued

  • Additionally, we can use specialized notation for accessing an entire row or column:
cats[, 1]
[1] calico black  tabby 
Levels: black calico tabby
cats[1, ]
    coat weight likes_string age
1 calico    4.1            1   2
  • Here, the blank in the place of the index tells R to extract the entire row or column.
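  • Indices do not have to be numbers. The sketch below, on a hypothetical dataframe “cats_demo”, selects by column name and by a logical condition on the rows:

```r
# Hypothetical data frame for illustration
cats_demo <- data.frame(coat = c("calico", "black", "tabby"),
                        weight = c(4.1, 7.0, 5.2),
                        age = c(2, 3, 5),
                        stringsAsFactors = FALSE)

# A column name can replace the column number: same entry as cats_demo[2, 3]
by_name <- cats_demo[2, "age"]

# A logical vector keeps the rows where the condition is TRUE
heavy <- cats_demo[cats_demo$weight > 5, ]

# A character vector selects several columns at once
two_cols <- cats_demo[, c("coat", "age")]

print(by_name)
print(heavy)
print(two_cols)
```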

Adding rows to dataframes

  • Let's suppose we have examined a new cat and we want to add a case to our dataframe:
newRow <- list("tortoiseshell", 3.3, TRUE, 9)
cats <- rbind(cats, newRow)
cats
    coat weight likes_string age
1 calico    4.1            1   2
2  black    7.0            0   3
3  tabby    5.2            1   5
4   <NA>    3.3            1   9
  • Notice the “NA” value in the above for the coat of the fourth cat.

    • While the row was added successfully, R issues a warning (“invalid factor level, NA generated”) and stores a “Not Available”, missing data entry for the coat.
  • Factors, like other vectors, are strict in R;

    • when we attempt to add a value that is not recognized as one of the categories, R treats this as missing data.

Adding levels to factors

  • We can access the levels of a factor vector with the “levels()” function:
levels(cats$coat)
[1] "black"  "calico" "tabby" 
  • If we want to add a new level to a factor, we can extend the existing levels and assign them back:
levels(cats$coat) <- c(levels(cats$coat), "tortoiseshell")
cats <- rbind(cats, list("tortoiseshell", 3.3, TRUE, 9))
cats
           coat weight likes_string age
1        calico    4.1            1   2
2         black    7.0            0   3
3         tabby    5.2            1   5
4          <NA>    3.3            1   9
5 tortoiseshell    3.3            1   9
  • Notice, “tortoiseshell” was now accepted as a category, but the NA value from the earlier attempt remains in row 4.
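  • The same behavior can be reproduced on a bare factor. In the sketch below the names “strict” and “extended” are hypothetical, and “suppressWarnings()” is used only to keep the expected warning out of the output:

```r
# Hypothetical stand-alone factor with three levels
coat <- factor(c("calico", "black", "tabby"))

# Assigning a value outside the known levels stores NA (with a warning)
strict <- coat
suppressWarnings(strict[4] <- "tortoiseshell")
print(strict)

# Extending the levels first makes the same assignment succeed
extended <- coat
levels(extended) <- c(levels(extended), "tortoiseshell")
extended[4] <- "tortoiseshell"
print(extended)
```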

Summarizing dataframes

  • In general, we want to know if a dataframe has missing values, and what kind of variables are in it.

  • Several common functions allow this, including

    • “str()” or the structure function:
str(cats)
'data.frame':   5 obs. of  4 variables:
 $ coat        : Factor w/ 4 levels "black","calico",..: 2 1 3 NA 4
 $ weight      : num  4.1 7 5.2 3.3 3.3
 $ likes_string: num  1 0 1 1 1
 $ age         : num  2 3 5 9 9
  • This tells us the dimensions of the dataframe, the column names, and the types of variables we are working with.

Summarizing dataframes – continued

  • We can also obtain a quick statistical summary of the data with the “summary()” function:
summary(cats)
            coat       weight      likes_string      age     
 black        :1   Min.   :3.30   Min.   :0.0   Min.   :2.0  
 calico       :1   1st Qu.:3.30   1st Qu.:1.0   1st Qu.:3.0  
 tabby        :1   Median :4.10   Median :1.0   Median :5.0  
 tortoiseshell:1   Mean   :4.58   Mean   :0.8   Mean   :5.6  
 NA's         :1   3rd Qu.:5.20   3rd Qu.:1.0   3rd Qu.:9.0  
                   Max.   :7.00   Max.   :1.0   Max.   :9.0  
  • The summary furthermore tells us how many missing values are present.
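  • Missing values can also be counted directly. A minimal sketch on a hypothetical dataframe “cats_demo”, using “is.na()” together with “colSums()”:

```r
# Hypothetical data frame with one missing value in each column
cats_demo <- data.frame(coat = factor(c("calico", NA, "tabby")),
                        weight = c(4.1, 7.0, NA))

# is.na() gives a logical matrix; column sums count the TRUE entries
na_per_column <- colSums(is.na(cats_demo))
print(na_per_column)

# A single TRUE/FALSE answer for the whole data frame
any_missing <- any(is.na(cats_demo))
print(any_missing)
```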

Removing rows

  • We will often want to remove cases with missing data;

    • this can be performed automatically with “na.omit()”:
na.omit(cats)
           coat weight likes_string age
1        calico    4.1            1   2
2         black    7.0            0   3
3         tabby    5.2            1   5
5 tortoiseshell    3.3            1   9
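  • “na.omit()” drops a row if any of its columns is missing. When only one column matters, a logical index on that column is a narrower alternative; the sketch below uses a hypothetical dataframe “cats_demo”:

```r
# Hypothetical data frame: row 2 lacks a coat, row 3 lacks a weight
cats_demo <- data.frame(coat = c("calico", NA, "tabby"),
                        weight = c(4.1, 7.0, NA),
                        stringsAsFactors = FALSE)

# Drops every row with at least one NA (rows 2 and 3)
all_complete <- na.omit(cats_demo)

# Drops only the rows where the coat is missing (row 2)
coat_known <- cats_demo[!is.na(cats_demo$coat), ]

print(all_complete)
print(coat_known)
```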

Removing rows or columns

  • We can also remove rows or columns by index, using a “-”:
cats[-1, ]
           coat weight likes_string age
2         black    7.0            0   3
3         tabby    5.2            1   5
4          <NA>    3.3            1   9
5 tortoiseshell    3.3            1   9
cats[, -1]
  weight likes_string age
1    4.1            1   2
2    7.0            0   3
3    5.2            1   5
4    3.3            1   9
5    3.3            1   9

File IO

  • Basic file input/output (IO) can be done with functions such as:
write.csv(x = cats, file = "feline-data.csv", row.names = FALSE)
cats_from_file <- read.csv(file = "feline-data.csv")
cats_from_file
           coat weight likes_string age
1        calico    4.1            1   2
2         black    7.0            0   3
3         tabby    5.2            1   5
4          <NA>    3.3            1   9
5 tortoiseshell    3.3            1   9
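  • The round trip can be tried without leaving files behind by writing to a temporary path. In this sketch the names “cats_demo”, “csv_path” and “csv_path_rn” are hypothetical; it also shows what happens when “row.names = FALSE” is left out:

```r
# Hypothetical data frame to write and read back
cats_demo <- data.frame(coat = c("calico", "black"),
                        weight = c(4.1, 7.0),
                        stringsAsFactors = FALSE)

# Write without row names, then read the file back
csv_path <- tempfile(fileext = ".csv")
write.csv(x = cats_demo, file = csv_path, row.names = FALSE)
cats_back <- read.csv(file = csv_path, stringsAsFactors = FALSE)

# With the default row.names = TRUE, the row names come back as a column "X"
csv_path_rn <- tempfile(fileext = ".csv")
write.csv(x = cats_demo, file = csv_path_rn)
cats_back_rn <- read.csv(file = csv_path_rn, stringsAsFactors = FALSE)

# Remove the temporary files
unlink(c(csv_path, csv_path_rn))

print(names(cats_back))
print(names(cats_back_rn))
```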

File IO – continued

  • We will suppose that we wish to analyze a new set of cat data that a friend gave us:
coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
tabby,2.3 or 2.4,1
  • Our friend was uncertain about the weight of the last cat and placed two values into the CSV.

  • We suppose that this is written in the file “feline-data_v2.csv”, which we will read with “read.csv()”:

cats_from_file <- read.csv(file="feline-data_v2.csv")
cats_from_file$weight
[1] "2.1"        "5.0"        "3.2"        "2.3 or 2.4"
  • Notice that weight looks quite different than before: the values are now character strings rather than numbers.

Factors – continued

  • When R read the inhomogeneous data in the weight column, it converted all of the values into character type;

    • in versions of R before 4.0 (or whenever “stringsAsFactors = TRUE” is set), the character vector is then converted automatically to a factor vector.
  • Converting characters to factors can be suppressed explicitly with an additional argument, which matches the default behavior since R 4.0:

cats_from_file <- read.csv(file="feline-data_v2.csv", stringsAsFactors=FALSE)
cats_from_file$weight
[1] "2.1"        "5.0"        "3.2"        "2.3 or 2.4"
  • This illustrates how a single erroneously entered value can change the type of an entire column.
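  • One way to locate such entries is to attempt the numeric conversion and look for the values that fail. The sketch below is an illustration rather than a prescribed cleaning method: it reads the same CSV content from a string (the names “csv_text” and “cats_raw” are hypothetical) and wraps “as.numeric()” in “suppressWarnings()” because a coercion warning is expected:

```r
# The CSV content from the slide, supplied as text so the block runs alone
csv_text <- "coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
tabby,2.3 or 2.4,1"

cats_raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# One non-numeric entry makes the whole column character
weight_type <- typeof(cats_raw$weight)
print(weight_type)

# as.numeric() yields NA for entries that are not numbers
weight_num <- suppressWarnings(as.numeric(cats_raw$weight))

# Rows whose weight could not be converted and need a manual decision
bad_rows <- which(is.na(weight_num))
print(cats_raw[bad_rows, ])
```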