Learning Objectives

Presentation of the Survey Data

We are studying the species and weight of animals caught in plots in our study area. The dataset is stored as a .csv file: each row holds information for a single animal, and the columns are record_id, month, day, year, plot, species, sex, and wgt (weight).

To put the data where R can find it, save surveys.csv in a directory named data inside your working directory.

To load our survey data, we need to locate the surveys.csv file. We will use read.csv() to load the contents of the CSV file into memory as a data.frame.

surveys <- read.csv('data/surveys.csv')

At this point, make sure all participants have the data loaded

This statement doesn’t produce any output because assignment doesn’t display anything. If we want to check that our data has been loaded, we can print the variable’s value: surveys

Wow… that was a lot of output. At least it means the data loaded properly. Let’s check the top (the first 6 lines) of this data.frame using the function head():

head(surveys)
  record_id month day year plot species sex wgt
1         1     7  16 1977    2      NL   M  NA
2         2     7  16 1977    3      NL   M  NA
3         3     7  16 1977    2      DM   F  NA
4         4     7  16 1977    7      DM   M  NA
5         5     7  16 1977    3      DM   M  NA
6         6     7  16 1977    1      PF   M  NA

What are data frames?

data.frame is the de facto data structure for most tabular data and what we use for statistics and plotting.

A data.frame is a collection of vectors of identical lengths. Each vector represents a column, and each vector can be of a different data type (e.g., characters, integers, factors). The str() function is useful to inspect the data types of the columns.

A data.frame can be created by the functions read.csv() or read.table(), in other words, when importing spreadsheets from your hard drive (or the web).

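Because each column is a vector, all of its values must share a single data type. When a column mixes types, R coerces (when possible) every value to the most general type that can represent them all; for example, numbers mixed with text all become character.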
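The coercion rule can be seen on plain vectors. This is a minimal sketch; the vector names are made up for illustration and are not part of the survey data.

```r
# Mixing types in one vector forces every value into a single type.
mixed_num_chr <- c(1, 2, "three")   # numbers and text -> character
mixed_lgl_num <- c(TRUE, FALSE, 3)  # logicals and numbers -> numeric

class(mixed_num_chr)  # "character"
class(mixed_lgl_num)  # "numeric"
mixed_lgl_num         # 1 0 3 (TRUE became 1, FALSE became 0)
```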

Inspecting data.frame objects

Let’s now check the structure of this data.frame in more detail with the function str():

str(surveys)
'data.frame':   35549 obs. of  8 variables:
 $ record_id: int  1 2 3 4 5 6 7 8 9 10 ...
 $ month    : int  7 7 7 7 7 7 7 7 7 7 ...
 $ day      : int  16 16 16 16 16 16 16 16 16 16 ...
 $ year     : int  1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 ...
 $ plot     : int  2 3 2 7 3 1 2 1 1 6 ...
 $ species  : Factor w/ 49 levels "","AB","AH","AS",..: 17 17 13 13 13 24 23 13 13 24 ...
 $ sex      : Factor w/ 6 levels "","F","M","P",..: 3 3 2 3 3 3 2 3 2 2 ...
 $ wgt      : int  NA NA NA NA NA NA NA NA NA NA ...

The first line tells us that we have 35549 observations of 8 variables. We also see the data types for each of the columns in our data.frame.
You can also see this information in the Environment tab if you click the blue arrow button to unfold this information:

Figure: a data.frame in the Environment tab

If you want to view the data, click on the grid icon on the right hand side of the line with surveys on it. This will open a new tab in the scripts area. You can scroll around, but you cannot edit.

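NOTE: Values that were blank in numeric columns of the CSV, such as wgt, are now listed as NA. This is what R uses for missing data (“not available”). Blank values in text columns such as species and sex are read as empty strings, which is why "" appears among their levels.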
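To see how blank cells are read, here is a small sketch that uses a hypothetical three-row CSV supplied inline (the name csv_text and its contents are invented for illustration). The stringsAsFactors = TRUE argument is set explicitly so that text columns become factors regardless of R version.

```r
# Hypothetical three-row CSV, supplied inline so no file is needed.
csv_text <- "record_id,species,wgt
1,NL,
2,,40
3,DM,48"

toy <- read.csv(text = csv_text, stringsAsFactors = TRUE)

is.na(toy$wgt)               # TRUE FALSE FALSE: the blank weight became NA
levels(toy$species)          # "" "DM" "NL": the blank species is an empty string
sum(is.na(toy$wgt))          # 1 missing weight
mean(toy$wgt, na.rm = TRUE)  # 44, computed from the non-missing values only
```

Many summary functions return NA when any input is missing, which is why na.rm = TRUE is passed to mean() here.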

The $ operator

You’ll notice from the str() command that every column name is preceded with the $ symbol. This is the way to access a single column of the data frame.

For example, you can look at just the species column by typing:

surveys$species

Note that R does not print the values one per line as a column; it prints them across the screen with line wrapping. The number in brackets at the start of each line is the index of the first value on that line.
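The same operator works on any data frame. The sketch below builds a tiny stand-in data frame (toy, a made-up name with made-up values) so it can be run without the survey file.

```r
# A small stand-in for the survey data (hypothetical values).
toy <- data.frame(
  species = c("NL", "DM", "DM", "PF"),
  wgt     = c(NA, 40, 48, 8),
  stringsAsFactors = FALSE
)

toy$species        # "NL" "DM" "DM" "PF"
class(toy$wgt)     # "numeric"
length(toy$wgt)    # 4: one value per row
```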

Summarizing Functions

We already saw how the functions head() and str() can be useful to check the content and the structure of a data.frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data.

  • Size:
    • dim() - returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
    • nrow() - returns the number of rows
    • ncol() - returns the number of columns
  • Content:
    • head() - shows the first 6 rows
    • tail() - shows the last 6 rows
  • Names:
    • names() - returns the column names (synonym of colnames() for data.frame objects)
    • rownames() - returns the row names
  • Summary:
    • str() - structure of the object and information about the class, length and content of each column
    • summary() - summary statistics for each column

Note: most of these functions are “generic”; they can be used on other types of objects besides data.frame.
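As a quick illustration, several of the functions listed above are applied here to a small made-up data frame (toy is a hypothetical stand-in, not the survey data).

```r
# A small stand-in for the survey data (hypothetical values).
toy <- data.frame(
  record_id = 1:4,
  species   = c("NL", "DM", "DM", "PF"),
  wgt       = c(NA, 40L, 48L, 8L),
  stringsAsFactors = FALSE
)

dim(toy)    # 4 3: rows first, then columns
nrow(toy)   # 4
ncol(toy)   # 3
names(toy)  # "record_id" "species" "wgt"
class(toy)  # "data.frame"
```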

Challenge

Can you answer the following questions?

  • What is the class of the object surveys?
  • How many rows and how many columns are in this object?
  • How many species have been recorded during these surveys?
  • What is the range of years over which this data has been recorded?
  • What is the mean weight of animals recorded?

Indexing and sequences

If we want to extract one or several values from a vector, we must provide one or several indices in square brackets. We will start simply with a vector.

animals <- c("mouse", "rat", "dog", "cat")
animals[2]
[1] "rat"

This will print rat, the second value in the vector animals.

R indices start at 1. Programming languages like Fortran, MATLAB, and R start counting at 1, because that’s what human beings typically do.

The : operator creates vectors of integers in increasing or decreasing order; try 1:10 and 10:1, for instance.
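For reference, this is what the two suggested sequences produce (the names up and down are only for illustration):

```r
up   <- 1:10   # 1 2 3 4 5 6 7 8 9 10
down <- 10:1   # 10 9 8 7 6 5 4 3 2 1

class(up)      # "integer"
length(down)   # 10
```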

animals[c(3, 2)]
[1] "dog" "rat"
animals[2:4]
[1] "rat" "dog" "cat"
more_animals <- animals[c(1:3, 2:4)]
more_animals
[1] "mouse" "rat"   "dog"   "rat"   "dog"   "cat"  

Show everything except a certain index by using the minus sign:

animals[-1]
[1] "rat" "dog" "cat"
animals[c(-2,-4)]
[1] "mouse" "dog"  

Indexing like this also works on data frames, but data frames have two dimensions: rows and columns. To extract specific data, we need to specify index values for both dimensions. You can think of these as “coordinates”. The syntax uses square brackets with the two indices separated by a comma: [row conditions, column conditions]. Row numbers come first, followed by column numbers.

If you have not already downloaded the surveys data and put it in a directory named data within your working directory, please do so now.

surveys[1, 1]     # first row, first column
surveys[1, 6]     # first row, 6th column
surveys[1:3, 7]   # first three rows of the 7th column
surveys[1:3, 4:7] # first three rows of the 4th through 7th columns
surveys[3, ]      # the 3rd row, all columns
surveys[, 8]      # all rows of the 8th column

Leaving the position before the comma blank returns all rows; leaving the position after the comma blank returns all columns.
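The row and column rules can be checked on a data frame small enough to read at a glance. This sketch uses a made-up data frame (toy, with invented values) rather than the survey data.

```r
# A small stand-in for the survey data (hypothetical values).
toy <- data.frame(
  record_id = 1:4,
  species   = c("NL", "DM", "DM", "PF"),
  wgt       = c(NA, 40L, 48L, 8L),
  stringsAsFactors = FALSE
)

toy[2, 3]     # 40: row 2, column 3
toy[1:2, 2]   # "NL" "DM": rows 1 and 2 of column 2, returned as a vector
toy[3, ]      # row 3 with all columns, returned as a one-row data frame
toy[, 3]      # NA 40 48 8: all rows of column 3
toy[-1, ]     # every row except the first
```

Note that selecting a single column returns a plain vector, while selecting a row returns a data frame, because a row can hold values of several types.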

Factors

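From the str() output, you can see that the columns species and sex are of a special class called factor. (In R 4.0.0 and later, read.csv() reads text columns as character vectors by default; pass stringsAsFactors = TRUE to get the factors shown here.) Factors are very useful but not necessarily intuitive, and therefore require some attention.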

Factors are used to represent categorical data. Factors can be ordered or unordered and are an important class for statistical analysis and for plotting.

Factors are stored as integers, and have labels associated with these unique integers. This is similar to the way stats programs like JMP or SPSS code the data. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.

Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:

sex <- factor(c("male", "female", "female", "male"))
sex
[1] male   female female male  
Levels: female male

R will assign 1 to the level "female" and 2 to the level "male" (because f comes before m in the alphabet, even though the first element in this vector is "male"). You can check the names of the levels by using the function levels(), and check the number of levels using nlevels():

levels(sex)
[1] "female" "male"  
nlevels(sex)
[1] 2
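The integer codes behind a factor can be inspected with as.integer(). The sketch below also shows a common pitfall when the labels look like numbers; the sizes vector is a made-up example, not part of the survey data.

```r
sex <- factor(c("male", "female", "female", "male"))
as.integer(sex)   # 2 1 1 2: "female" is level 1, "male" is level 2

# Hypothetical factor whose labels look like numbers.
sizes <- factor(c("10", "20", "10", "5"))
levels(sizes)                              # "10" "20" "5" (sorted as text)
codes  <- as.integer(sizes)                # 1 2 1 3: the codes, not the values
values <- as.numeric(as.character(sizes))  # 10 20 10 5: the original values
```

Converting through as.character() first is one way to recover the numbers that the labels represent.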

Sometimes the order of the levels does not matter; other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”) or because it is required by a particular type of analysis. Passing the levels explicitly sets their order, as shown here; comparing levels with operators such as < additionally requires creating the factor with ordered = TRUE:

food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
levels(food)
[1] "high"   "low"    "medium"
food <- factor(food, levels=c("low", "medium", "high"))
levels(food)
[1] "low"    "medium" "high"  

In R’s memory, these factors are represented by numbers (1, 2, 3). They are better than simple integer labels because factors are self-describing: "low", "medium", and "high" are more descriptive than 1, 2, 3. Which is low? You wouldn’t be able to tell with just integer data. Factors have this information built in. It is particularly helpful when there are many levels (like the species in our example data set).
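When the order of the levels is meaningful, the factor can also be declared as ordered, which makes comparisons between levels possible. This is a minimal sketch that reuses the food values from above with the ordered = TRUE argument of factor().

```r
food <- factor(
  c("low", "high", "medium", "high", "low", "medium", "high"),
  levels  = c("low", "medium", "high"),
  ordered = TRUE
)

food[1] < food[2]   # TRUE: "low" ranks below "high"
min(food)           # low
max(food)           # high
table(food)         # counts in level order: low 2, medium 2, high 3
```

Without ordered = TRUE, the levels still appear in the chosen order in tables and plots, but comparisons such as < are not meaningful.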

Converting to factors

Look at the data frame structure again:

str(surveys)
'data.frame':   35549 obs. of  8 variables:
 $ record_id: int  1 2 3 4 5 6 7 8 9 10 ...
 $ month    : int  7 7 7 7 7 7 7 7 7 7 ...
 $ day      : int  16 16 16 16 16 16 16 16 16 16 ...
 $ year     : int  1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 ...
 $ plot     : int  2 3 2 7 3 1 2 1 1 6 ...
 $ species  : Factor w/ 49 levels "","AB","AH","AS",..: 17 17 13 13 13 24 23 13 13 24 ...
 $ sex      : Factor w/ 6 levels "","F","M","P",..: 3 3 2 3 3 3 2 3 2 2 ...
 $ wgt      : int  NA NA NA NA NA NA NA NA NA NA ...

The type for “plot” is integer. But the plots could have been labeled with a lettering scheme rather than numbers. There is no concept that plot 4 + plot 2 = plot 6, but right now R thinks that is just fine. So we should really make it a factor.

You can convert this variable to a factor with the as.factor() command.

surveys$plot <- as.factor(surveys$plot)

Look at the structure of surveys again. plot is now a factor with 24 levels. From our point of view, that line of code changed the data type from int to Factor.

What actually happened was that R evaluated the code on the right of the <- (in this case, creating a factor from the integers in the plot column) and assigned the result to the object on the left of the <-. That object already existed, so its values were overwritten. That is OK in this case.
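The effect of the conversion is easy to verify on a small example. This sketch uses a made-up data frame (toy, with invented plot numbers) in place of surveys.

```r
# A small stand-in for the survey data (hypothetical values).
toy <- data.frame(plot = c(2L, 3L, 2L, 7L), wgt = c(NA, 40L, 48L, 8L))

class(toy$plot)                  # "integer"
toy$plot <- as.factor(toy$plot)  # overwrite the column with a factor
class(toy$plot)                  # "factor"
nlevels(toy$plot)                # 3 distinct plots
levels(toy$plot)                 # "2" "3" "7"
```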
