Learning Objectives
- load external data (CSV files) in memory using the survey table (surveys.csv) as an example
- understand the concept of a data.frame
- know how to access any element or section of a data.frame
- explore the structure and the content of the data in R
- understand what factors are and how to manipulate them
We are studying the species and weight of animals caught in plots in our study area. The dataset is stored as a .csv file: each row holds information for a single animal, and the columns represent:
- record_id
- month
- day
- year
- plot
- species (a 2 letter code, see the species.csv file for correspondence)
- sex (“M” for males and “F” for females)
- wgt (the weight in grams)
To get the data in a place where R can find it, download the surveys data and put it in a directory named "data" within your working directory for these exercises.
To load our survey data, we need to locate the surveys.csv file. We will use read.csv() to load the content of the CSV file into memory (as a data.frame).
surveys <- read.csv('data/surveys.csv')
At this point, make sure all participants have the data loaded
This statement doesn’t produce any output because assignment doesn’t display anything. If we want to check that our data has been loaded, we can print the variable’s value: surveys
Wow… that was a lot of output. At least it means the data loaded properly. Let’s check the top (the first 6 lines) of this data.frame using the function head():
head(surveys)
  record_id month day year plot species sex wgt
1         1     7  16 1977    2      NL   M  NA
2         2     7  16 1977    3      NL   M  NA
3         3     7  16 1977    2      DM   F  NA
4         4     7  16 1977    7      DM   M  NA
5         5     7  16 1977    3      DM   M  NA
6         6     7  16 1977    1      PF   M  NA
data.frame is the de facto data structure for most tabular data and what we use for statistics and plotting.
A data.frame is a collection of vectors of identical lengths. Each vector represents a column, and each vector can be of a different data type (e.g., characters, integers, factors). The str() function is useful to inspect the data types of the columns.
A data.frame can be created by the functions read.csv() or read.table(), in other words, when importing spreadsheets from your hard drive (or the web).
When the values in a column are of mixed types, R coerces them (when possible) to the most general type that can represent all of them. For example, a column that mixes numbers and text is read as text.
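As a minimal sketch of both ideas, the snippet below builds a small data.frame by hand from three equal-length vectors and then shows coercion in a single vector. The object names mini and mixed are hypothetical; the values in mini are copied from the first rows of the head() output above.

```r
# Three vectors of the same length, one per column
record_id <- c(1L, 2L, 3L)
species   <- c("NL", "NL", "DM")
wgt       <- c(NA, NA, NA)

# data.frame() binds them together as columns
mini <- data.frame(record_id, species, wgt)
str(mini)

# A vector can hold only one type, so mixed input is coerced
mixed <- c(1, "a", TRUE)
class(mixed)   # "character"
```

Each column keeps its own type, while within a single vector R falls back to the type that can hold every value.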
Let’s now check the structure of the surveys data.frame in more detail with the function str():
str(surveys)
'data.frame': 35549 obs. of 8 variables:
$ record_id: int 1 2 3 4 5 6 7 8 9 10 ...
$ month : int 7 7 7 7 7 7 7 7 7 7 ...
$ day : int 16 16 16 16 16 16 16 16 16 16 ...
$ year : int 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 ...
$ plot : int 2 3 2 7 3 1 2 1 1 6 ...
$ species : Factor w/ 49 levels "","AB","AH","AS",..: 17 17 13 13 13 24 23 13 13 24 ...
$ sex : Factor w/ 6 levels "","F","M","P",..: 3 3 2 3 3 3 2 3 2 2 ...
$ wgt : int NA NA NA NA NA NA NA NA NA NA ...
The first line tells us that we have 35549 observations of 8 variables. We also see the data types for each of the columns in our data.frame.
You can also see this information in the Environment tab if you click the blue arrow button to unfold it.
If you want to view the data, click on the grid icon on the right hand side of the line with surveys on it. This will open a new tab in the scripts area. You can scroll around, but you cannot edit.
NOTE: Values that were blank in the CSV are now listed as NA. This is what R uses for missing data (“not available”).
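To see this behaviour in isolation, here is a small sketch that writes a two-row CSV to a temporary file and reads it back. The file contents, the weight of 40, and the names csv_path and mini are made up for illustration and are not part of the surveys data. Note that a blank in a numeric column becomes NA, whereas a blank in a text column is read as an empty string, which is why "" appears among the factor levels in the str() output.

```r
# Write a tiny, hypothetical CSV with one blank weight
csv_path <- tempfile(fileext = ".csv")
writeLines(c("record_id,sex,wgt",
             "1,M,",
             "2,F,40"), csv_path)

mini <- read.csv(csv_path)

mini$wgt          # NA 40
is.na(mini$wgt)   # TRUE FALSE
```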
You’ll notice from the str() output that every column name is preceded by the $ symbol. This is the way to access a single column of the data frame.
For example, you can look at just the species column by typing:
surveys$species
Note that R doesn’t print the values as a column; it prints them across the width of the console and wraps them. The number in brackets at the left of each line is the index of the first value on that line.
We already saw how the functions head() and str() can be useful to check the content and the structure of a data.frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data.
- dim() - returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
- nrow() - returns the number of rows
- ncol() - returns the number of columns
- head() - shows the first 6 rows
- tail() - shows the last 6 rows
- names() - returns the column names (synonym of colnames() for data.frame objects)
- rownames() - returns the row names
- str() - structure of the object and information about the class, length and content of each column
- summary() - summary statistics for each column
Note: most of these functions are “generic”; they can be used on other types of objects besides data.frame.
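The sketch below tries several of these functions on a small stand-in data.frame, so it can be run without the CSV file. The name mini is hypothetical; its values are copied from the six rows shown by head(surveys).

```r
# A six-row stand-in for surveys, built from the head() output
mini <- data.frame(
  record_id = 1:6,
  species   = c("NL", "NL", "DM", "DM", "DM", "PF"),
  sex       = c("M", "M", "F", "M", "M", "M")
)

dim(mini)       # 6 3 (rows, then columns)
nrow(mini)      # 6
ncol(mini)      # 3
names(mini)     # "record_id" "species" "sex"
mini$species    # one column, accessed with $
summary(mini)   # per-column summary
```

The same calls work unchanged on surveys once it has been loaded.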
Can you use the functions above to answer questions about surveys?
If we want to extract one or several values from a vector, we must provide one or several indices in square brackets. We will start simply with a vector.
animals <- c("mouse", "rat", "dog", "cat")
animals[2]
[1] "rat"
This will print rat, the second value in the vector animals.
R indices start at 1. Programming languages like Fortran, MATLAB, and R start counting at 1, because that’s what human beings typically do.
The colon operator : creates vectors of consecutive integers in increasing or decreasing order; try 1:10 and 10:1, for instance.
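A quick sketch of the colon operator, used on its own and then as an index. The animals vector is redefined here only so the snippet runs by itself.

```r
1:10   # 1 2 3 4 5 6 7 8 9 10
10:1   # 10 9 8 7 6 5 4 3 2 1

animals <- c("mouse", "rat", "dog", "cat")
animals[4:1]   # "cat" "dog" "rat" "mouse"
```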
animals[c(3, 2)]
[1] "dog" "rat"
animals[2:4]
[1] "rat" "dog" "cat"
more_animals <- animals[c(1:3, 2:4)]
more_animals
[1] "mouse" "rat" "dog" "rat" "dog" "cat"
Show everything except certain indices by using the minus sign:
animals[-1]
[1] "rat" "dog" "cat"
animals[c(-2,-4)]
[1] "mouse" "dog"
Indexing like this also works for data frames, but they have two dimensions: rows and columns. If we want to extract specific data from a data frame, we need to specify the index values for both dimensions. You can think of these as “coordinates”. The syntax for this operation uses brackets, with the two dimensions separated by a comma: [row conditions, column conditions]. Row numbers come first, followed by column numbers.
If you have not already downloaded the surveys data and put it in the directory "data" within your working directory, please do so now.
surveys[1, 1] # first row, first column
surveys[1, 6] # first row, 6th column
surveys[1:3, 7] # first three rows of the 7th column
surveys[1:3, 4:7] # first three rows of the 4th through 7th columns
surveys[3, ] # the 3rd row, all columns
surveys[, 8] # the entire 8th column
Leaving the position before the comma empty returns all rows; leaving the position after the comma empty returns all columns.
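The same indexing can be tried on a small stand-in data.frame, which makes it easy to check each result by eye. This is only a sketch: mini is a hypothetical name, its values are copied from the six rows shown by head(surveys), and stringsAsFactors = FALSE is set so the text column stays character in any version of R.

```r
mini <- data.frame(
  record_id = 1:6,
  plot      = c(2L, 3L, 2L, 7L, 3L, 1L),
  species   = c("NL", "NL", "DM", "DM", "DM", "PF"),
  stringsAsFactors = FALSE
)

mini[1, 3]        # "NL": row 1, column 3
mini[1:3, 2]      # 2 3 2: rows 1 to 3 of column 2
mini[3, ]         # row 3, all columns (a one-row data.frame)
mini[, 2]         # all rows of column 2
mini[1:2, 2:3]    # a 2 x 2 block
```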
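From the str() output, you can see that the columns species and sex are of a special class called factor. (In R 4.0.0 and later, read.csv() imports text columns as character vectors unless you pass stringsAsFactors = TRUE.) Factors are very useful but not necessarily intuitive, and therefore require some attention.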
Factors are used to represent categorical data. Factors can be ordered or unordered and are an important class for statistical analysis and for plotting.
Factors are stored as integers, and have labels associated with these unique integers. This is similar to the way stats programs like JMP or SPSS code the data. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.
Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:
sex <- factor(c("male", "female", "female", "male"))
sex
[1] male female female male
Levels: female male
R will assign 1 to the level "female" and 2 to the level "male" (because f comes before m in the alphabet, even though the first element in this vector is "male"). You can check the names of the levels by using the function levels(), and check the number of levels using nlevels():
levels(sex)
[1] "female" "male"
nlevels(sex)
[1] 2
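To make the integer codes visible, the sketch below converts the sex factor with as.integer(). It then shows why care is needed when a factor's labels look like numbers. The years vector is a made-up illustration (only 1977 appears in the survey output above).

```r
sex <- factor(c("male", "female", "female", "male"))
as.integer(sex)      # 2 1 1 2: the codes behind the labels
as.character(sex)    # "male" "female" "female" "male"

# Hypothetical factor whose labels look like numbers
years <- factor(c("1990", "1983", "1977"))
as.integer(years)                 # 3 2 1: level codes, not the years
as.numeric(as.character(years))   # 1990 1983 1977
```

Converting through as.character() first recovers the labels before turning them into numbers.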
Sometimes the order of the levels does not matter; other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”) or because it is required by a particular type of analysis. Specifying the levels explicitly sets their order; if the factor is also created as ordered (ordered = TRUE), you can compare levels with operators such as <. The example below sets the order of the levels:
food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
levels(food)
[1] "high" "low" "medium"
food <- factor(food, levels=c("low", "medium", "high"))
levels(food)
[1] "low" "medium" "high"
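Setting levels alone fixes the order in which they are listed, but comparing values with < additionally needs an ordered factor. A minimal sketch, reusing the same food values and adding the ordered = TRUE argument of factor():

```r
food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"),
               levels = c("low", "medium", "high"),
               ordered = TRUE)

levels(food)        # "low" "medium" "high"
food[1] < food[2]   # TRUE: "low" is below "high"
min(food)           # low
table(food)         # counts, listed in level order
```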
In R’s memory, these factors are represented by numbers (1, 2, 3). They are better than using simple integer labels because factors are self describing: "low", "medium", and "high" are more descriptive than 1, 2, 3. Which is low? You wouldn’t be able to tell with just integer data. Factors have this information built in. It is particularly helpful when there are many levels (like the species in our example data set).
Look at the data frame structure again:
str(surveys)
'data.frame': 35549 obs. of 8 variables:
$ record_id: int 1 2 3 4 5 6 7 8 9 10 ...
$ month : int 7 7 7 7 7 7 7 7 7 7 ...
$ day : int 16 16 16 16 16 16 16 16 16 16 ...
$ year : int 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 ...
$ plot : int 2 3 2 7 3 1 2 1 1 6 ...
$ species : Factor w/ 49 levels "","AB","AH","AS",..: 17 17 13 13 13 24 23 13 13 24 ...
$ sex : Factor w/ 6 levels "","F","M","P",..: 3 3 2 3 3 3 2 3 2 2 ...
$ wgt : int NA NA NA NA NA NA NA NA NA NA ...
The type for “plot” is integer. But the plots could have been labeled with a lettering scheme rather than numbers. There is no concept that plot 4 + plot 2 = plot 6, but right now R thinks that is just fine. So we should really make it a factor.
You can convert this variable to a factor with the as.factor() command.
surveys$plot <- as.factor(surveys$plot)
Look at the structure of surveys. Now, plot is a factor with 24 levels. To us, that line of code changed the data type from int to Factor.
What actually happened was that R evaluated the code on the right of the <- (in this case, creating a factor from the integers in the plot column), and assigned that value to the object on the left of the <-, which already existed, so its values were overwritten. That is OK in this case.
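The conversion can be checked step by step on a small stand-in, sketched below. The name mini is hypothetical; the plot values are copied from the first six rows of the head() output, so only four of the 24 plots appear.

```r
mini <- data.frame(plot = c(2L, 3L, 2L, 7L, 3L, 1L))
class(mini$plot)     # "integer"

# Evaluate the right-hand side, then overwrite the existing column
mini$plot <- as.factor(mini$plot)

class(mini$plot)     # "factor"
levels(mini$plot)    # "1" "2" "3" "7"
nlevels(mini$plot)   # 4
```

If the original integers are needed later, one option is to assign the factor to a new column rather than overwriting plot.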