DataFrames and factors

Instructions:

Use the left and right arrow keys to navigate the presentation forward and backward respectively. You can also use the arrows at the bottom right of the screen to navigate with a mouse.

FAIR USE ACT DISCLAIMER:
This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. The Fair Use Copyright Disclaimer is under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.

Outline

The following topics will be covered in this lecture:
- A review of DataFrames
- Lists
- Factors
- Manipulating dataframes
- Basic file I/O

Data structures – a quick review

We will start by picking up with the “cats” dataframe we studied in the last activity.

cats <- data.frame(coat = c("calico", "black", "tabby"), 
                    weight = c(2.1, 5.0, 3.2), 
                    likes_string = c(1, 0, 1))

Notice that the arguments of the function “data.frame()” are three expressions associating a name with a vector.

Dataframes

Printing the variable “cats”, we see what tabular data looks like in a dataframe:

cats

    coat weight likes_string
1 calico    2.1            1
2  black    5.0            0
3  tabby    3.2            1

The assignment of the vectors to names in the arguments assigned the column names.
Each column consists of a vector of uniform data type.
If we want to extract a named column from a dataframe, this can be done with the “$” sign and the column name:

cats$weight

[1] 2.1 5.0 3.2

Each row, on the other hand, consists of multiple measurements (of different data types) corresponding to one specific case of the data set.

Dataframes – continued

We might suppose that the scale used to measure the cats' weights was off by two kgs.
In this case, we can re-assign values into the column weight as follows:

cats$weight

[1] 2.1 5.0 3.2

cats$weight <- cats$weight + 2

We can verify that the assignment went into the column for weight in “cats”,

cats

    coat weight likes_string
1 calico    4.1            1
2  black    7.0            0
3  tabby    5.2            1

Lists

A data structure related to vectors are lists.
Lists function as containers for heterogeneous data, allowing different types:

list_example <- list(1, "a", TRUE, 1+4i)
list_example

[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE

[[4]]
[1] 1+4i

Lists – continued

In the last example no type coercion has taken place;

typeof(list_example[[1]])

[1] "double"

typeof(list_example[[2]])

[1] "character"

all of the original types have been respected, but because the data is allowed to be inhomogeneous we can't use vector operations on a list.

list_example + 2

Error in list_example + 2: non-numeric argument to binary operator

Here we see an error message because the “+” operator only knows how to operate on numeric arguments, or ones that can be coerced into one.

Lists – continued

Recall our dataframe cats,

cats

    coat weight likes_string
1 calico    4.1            1
2  black    7.0            0
3  tabby    5.2            1

We know that dataframes contain homogeneous data in each column, but each row may be inhomogeneous.
Q:What kind of data structure do you think a dataframe is? Can it be a vector? Why or why not?
A: A dataframe cannot be a vector because of coercion rules — instead it operates as a list of vectors:

typeof(cats)

[1] "list"

typeof(cats$weight)

[1] "double"

Factors

Consider now the vector “coat” in the dataframe:

cats$coat

[1] "calico" "black"  "tabby"

typeof(cats$coat)

[1] "character"

as.numeric(cats$coat)

[1] NA NA NA

Q: can you hypothesize what the meaning is of this data? What is a level, and why are the coats “integer”?
A: R likes to treat character strings in dataframes as categorical variables;
- in this case, the categories are “black”, “calico” and “tabby”;
- each integer value is encoding whether the case (or row) belongs to category 1, 2 or 3, where category labels are sorted alphanumerically.

Manipulating dataframes

Let's suppose that we need to include more information on our cats in our analysis;
- a friend has provided ages of all the cats for us:

age <- c(2, 3, 5)

We want to combine this into our dataframe, which can be done with “cbind()”

cats <- cbind(cats, age)
cats

    coat weight likes_string age
1 calico    4.1            1   2
2  black    7.0            0   3
3  tabby    5.2            1   5

This function introduces the new vector as an additional column in the dataframe;
- the variable name is defined the column name, and we reassign the dataframe to cats recursively.

Adding columns to dataframes

Q: can you hypothesize what will be the output if we try to combine the following vector with the dataframe?

age <- c(2, 3, 5, 8, 9)
cats <- cbind(cats, age)

A: dataframes, like matrices, need to have consistent dimensions of the data;
- this new column is too long, so we will get an error:

age <- c(2, 3, 5, 8, 9)
cats <- cbind(cats, age)

Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 5

Dataframe dimensions

We can examine the dimensions of a dataframe with standard functions:

dim(cats)

[1] 3 4

In the above we see the standard, matrix style dimensions of the dataframe.
- these can also be extracted individually with “nrow” and “ncol”:

nrow(cats)

[1] 3

ncol(cats)

[1] 4

Dataframe indices

Entries of dataframes can be accessed directly using matrix indexing:

cats[2,4]

[1] 3

This can also be performed with slices:

cats[1:2, 2:3]

  weight likes_string
1    4.1            1
2    7.0            0

Dataframe indices

Additionally, we can use specialized notation for accessing an entire row or column:

cats[,1]

[1] "calico" "black"  "tabby"

cats[1,]

    coat weight likes_string age
1 calico    4.1            1   2

Here, the blank in the place of the index tells R to extract the entire row or column.

Adding rows to dataframes

Let's suppose we have examined a new cat and we want to add a case to our dataframe:

newRow <- list("tortoiseshell", 3.3, TRUE, 9)
cats <- rbind(cats, newRow)
cats

           coat weight likes_string age
1        calico    4.1            1   2
2         black    7.0            0   3
3         tabby    5.2            1   5
4 tortoiseshell    3.3            1   9

Notice, the “NA” value in the above for the coat of the fourth cat.
- While the row was added successfully, it produces a “Not Available”, missing data entry.
Factors, like other vectors, are strict in R;
- when we attempt to add a value that is not recognized as one of the categories, R treats this as missing data.

Adding levels to factors

We can access the levels of a factor vector with the “levels()” function:

levels(cats$coat)

NULL

If we want to re-assign new levels to a factor, we can do so recursively

levels(cats$coat) <- c(levels(cats$coat), "tortoiseshell")
cats <- rbind(cats, list("tortoiseshell", 3.3, TRUE, 9))
cats

           coat weight likes_string age
1          <NA>    4.1            1   2
2          <NA>    7.0            0   3
3          <NA>    5.2            1   5
4 tortoiseshell    3.3            1   9
5 tortoiseshell    3.3            1   9

Notice, “tortiseshell” was now accepted as a category, but the NA value remains.

Summarizing dataframes

In general, we want to know if a dataframe has missing values, and what kind of variables are in it.
Several common functions allow this, including
- “str()” or the structure function:

str(cats)

'data.frame':   5 obs. of  4 variables:
 $ coat        : Factor w/ 1 level "tortoiseshell": NA NA NA 1 1
 $ weight      : num  4.1 7 5.2 3.3 3.3
 $ likes_string: num  1 0 1 1 1
 $ age         : num  2 3 5 9 9

This tells us what the dimensions are, the column names are, and what types of variables we are working with.

Summarizing dataframes – continued

We can also obtain a quick statistical summary of the data with the “summary()” function:

summary(cats)

            coat       weight      likes_string      age     
 tortoiseshell:2   Min.   :3.30   Min.   :0.0   Min.   :2.0  
 NA's         :3   1st Qu.:3.30   1st Qu.:1.0   1st Qu.:3.0  
                   Median :4.10   Median :1.0   Median :5.0  
                   Mean   :4.58   Mean   :0.8   Mean   :5.6  
                   3rd Qu.:5.20   3rd Qu.:1.0   3rd Qu.:9.0  
                   Max.   :7.00   Max.   :1.0   Max.   :9.0

The summary furthermore tells us how many missing values are present.

Removing rows

We will often want to remove cases with missing data;
- this can be performed automatically with “na.omit()”

na.omit(cats)

           coat weight likes_string age
4 tortoiseshell    3.3            1   9
5 tortoiseshell    3.3            1   9

Removing rows or columns

We can also remove rows or columns by index, using a “-”

cats[-1,]

           coat weight likes_string age
2          <NA>    7.0            0   3
3          <NA>    5.2            1   5
4 tortoiseshell    3.3            1   9
5 tortoiseshell    3.3            1   9

cats[,-1]

  weight likes_string age
1    4.1            1   2
2    7.0            0   3
3    5.2            1   5
4    3.3            1   9
5    3.3            1   9

File IO

Basic file input/ output (IO) can be done with functions such as:

write.csv(x = cats, file = "feline-data.csv", row.names = FALSE)
cats_from_file <- read.csv(file = "feline-data.csv")
cats_from_file

           coat weight likes_string age
1          <NA>    4.1            1   2
2          <NA>    7.0            0   3
3          <NA>    5.2            1   5
4 tortoiseshell    3.3            1   9
5 tortoiseshell    3.3            1   9

File IO – continued

We will suppose that we wish to analyze a new set of cat data that a friend gave us:

coat,   weight,     likes_string
calico, 2.1,        1
black,  5.0,        0
tabby,  3.2,        1
tabby,  2.3 or 2.4, 1

Our friend was uncertain about the weight of the last cat and placed two values into the CSV.
We suppose that this is written in the file “feline-data_v2.csv” which we will read with “read.csv()”,

cats_from_file <- read.csv(file="feline-data_v2.csv")
cats_from_file$weight

[1] "2.1"        "5.0"        "3.2"        "2.3 or 2.4"

Notice that weight looks much different than before…

Factors – continued

When R read the inhomogeneous data in the weights column, it first converted the values into character type;
- when the character vector was seen by R in a dataframe, it then converted it automatically to a factor vector.
Converting characters to factors automatically can be suppressed with additional arguments:

cats_from_file <- read.csv(file="feline-data_v2.csv", stringsAsFactors=FALSE)
cats_from_file$weight

[1] "2.1"        "5.0"        "3.2"        "2.3 or 2.4"

However, this illustrates in general how erroneously entered data can cause many issues with type conversions.