Subsetting data

Particularly for larger datasets, it can be tricky to remember the column number that corresponds to a particular variable. (Are species names in column 5 or 7? Oh, right… they are in column 6.) The column a variable sits in can also change if your script adds or removes columns. It is therefore often better to refer to a variable by its column name, which also makes your code easier to read and your intentions clearer.

You can operate on a particular column by selecting it with the $ sign. The column is then returned as a vector. For instance, to extract all the weights from our dataset, we can use surveys$wgt. You can use names(surveys) or colnames(surveys) to remind yourself of the column names.

In some cases, you may want to select more than one column. You can do this using square brackets: surveys[, c("wgt", "sex")].
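As a minimal sketch of the difference between the two selection styles, the example below uses a small made-up data frame, toy_surveys (a hypothetical stand-in for surveys, with invented values), that shares a few of its column names.

```r
# Hypothetical stand-in for surveys, with a few of the same column names
toy_surveys <- data.frame(
  species = c("DO", "DM", "DO", "PE"),
  sex     = c("F", "M", "F", "M"),
  wgt     = c(52, 40, 33, 21),
  stringsAsFactors = FALSE
)

# The $ sign returns the column as a plain vector
weights <- toy_surveys$wgt
is.vector(weights)        # TRUE

# Square brackets with several names return a data frame,
# with the columns in the order they were requested
wgt_sex <- toy_surveys[, c("wgt", "sex")]
is.data.frame(wgt_sex)    # TRUE
names(wgt_sex)            # "wgt" "sex"
```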

When analyzing data, though, we often want to look at partial statistics, such as the maximum value of a variable per species or the average value per plot.

One way to do this is to select the data we want and store it in a new data frame, using the subset() function. For instance, if we just want to look at the animals of the species “DO”:

surveys_DO <- subset(surveys, species == "DO")

subset() is a function in “base R”, which means that it comes automatically when you install R. Functions in R have a name, and then take arguments in parentheses. Type ?subset to call up the help for the subset function. You’ll see some examples of usage, and then a listing of the Arguments that subset() can take.

The first argument, x, is the object to be subsetted, in our case the data frame surveys. The second is subset, which the help tells us is a logical expression. Here we use species == "DO", a condition that tests whether the column species is equal to “DO”. The third argument is select, which specifies the columns to keep from the data frame.

DO_weights <- subset(surveys, species == "DO", select=c("sex","wgt"))
head(DO_weights)

##     sex wgt
## 68    F  52
## 292   F  33
## 294   F  50
## 317   F  48
## 323   F  31
## 337   F  41

If we leave select blank, the function returns all columns (the default).
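To make the role of each argument concrete, here is a small sketch on a made-up data frame (toy_surveys and its values are hypothetical, not the real surveys data). It compares subset() with and without select, then computes one partial statistic on the result.

```r
# Hypothetical stand-in for surveys
toy_surveys <- data.frame(
  record_id = 1:5,
  species   = c("DO", "DM", "DO", "PE", "DO"),
  sex       = c("F", "M", "F", "M", "M"),
  wgt       = c(52, 40, 33, 21, 48),
  stringsAsFactors = FALSE
)

# Without select, every column is kept
toy_DO <- subset(toy_surveys, species == "DO")
ncol(toy_DO)             # 4

# With select, only the listed columns are kept
toy_DO_weights <- subset(toy_surveys, species == "DO", select = c("sex", "wgt"))
ncol(toy_DO_weights)     # 2

# A partial statistic: the maximum weight among the "DO" animals
max(toy_DO_weights$wgt)  # 52
```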

In R, the test for “equals” is two equal signs (==), because a single equal sign can be used for assignment. Here are some other operators that R uses:

Operator	Description
<	less than
<=	less than or equal to
>	greater than
>=	greater than or equal to
==	exactly equal to
!=	not equal to
!x	Not x
x | y	x OR y
x & y	x AND y
isTRUE(x)	test if x is TRUE
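The operators in the table can be combined inside the condition passed to subset(). The sketch below uses a made-up data frame (toy_surveys is hypothetical) and also shows what happens when a value is missing: a comparison with NA yields NA, and subset() drops such rows.

```r
# Hypothetical stand-in for surveys; note the missing weight in row 2
toy_surveys <- data.frame(
  species = c("DO", "DM", "DO", "PE", "SS"),
  year    = c(1985, 1992, 1995, 1988, 2001),
  wgt     = c(52, NA, 33, 21, 48),
  stringsAsFactors = FALSE
)

# AND: both conditions must hold
heavy_DO <- subset(toy_surveys, species == "DO" & wgt > 40)
nrow(heavy_DO)        # 1

# OR: at least one condition must hold
early_or_late <- subset(toy_surveys, year < 1986 | year >= 2000)
nrow(early_or_late)   # 2

# Not equal
not_DO <- subset(toy_surveys, species != "DO")
nrow(not_DO)          # 3

# NA > 30 is NA, so the row with the missing weight is left out
heavy <- subset(toy_surveys, wgt > 30)
nrow(heavy)           # 3
```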

Challenge

Use subset() to print data collected since 1990.
How many animals were male and of species “PE”?
How many individuals of the species “DM” or “SS”?
What value is returned by surveys_DO$month[2] (guess before you execute!)

Adding a column to our dataset

Sometimes, you may have to add a new column to your dataset that represents a new variable. The easiest way to do this is to use the $ symbol before a column name that doesn’t exist yet:

surveys$logwgt <- log(surveys$wgt)

Look at the data set now (use the grid icon in the Environment tab). There is a new column, and the values are equal to the log of the values in the wgt column.

If you assign values to a column that doesn’t exist yet, R will create a new column with those values. Careful: if you had typed surveys$wgt <- log(surveys$wgt), you would have overwritten the original column.
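A quick way to convince yourself that the original column is untouched is to try this on a small made-up data frame first (toy_surveys below is hypothetical):

```r
# Hypothetical stand-in for surveys
toy_surveys <- data.frame(
  species = c("DO", "DM", "PE"),
  wgt     = c(52, 40, 21),
  stringsAsFactors = FALSE
)

# Assigning to a name that does not exist yet creates a new column
toy_surveys$logwgt <- log(toy_surveys$wgt)
names(toy_surveys)   # "species" "wgt" "logwgt"

# The original column still holds the untransformed values
toy_surveys$wgt      # 52 40 21

# Undoing the log recovers the original weights
all.equal(exp(toy_surveys$logwgt), toy_surveys$wgt)   # TRUE
```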

Removing columns

Just like you can select columns by their positions in the data.frame or by their names, you can remove them similarly.

To remove columns by number:

surveys_noDate <- surveys[, -c(2:4)]
colnames(surveys)

## [1] "record_id" "month"     "day"       "year"      "plot"      "species"  
## [7] "sex"       "wgt"       "logwgt"

colnames(surveys_noDate)

## [1] "record_id" "plot"      "species"   "sex"       "wgt"       "logwgt"

The easiest way to remove by name is to use the subset() function. This time we need to specify explicitly the argument select, since the default is to subset on rows (as above). The minus sign indicates the names of the columns to remove (note that the column names should not be quoted):

surveys_noDate2 <- subset(surveys, select=-c(month, day, year))
colnames(surveys_noDate2)

## [1] "record_id" "plot"      "species"   "sex"       "wgt"       "logwgt"
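If the names of the columns to drop are stored in a character vector (for example, because they are built elsewhere in a script), one alternative, sketched here on a made-up data frame, is to compute the names to keep with setdiff(). The names toy_surveys, drop_cols and keep_cols are hypothetical.

```r
# Hypothetical stand-in for surveys
toy_surveys <- data.frame(
  record_id = 1:3,
  month     = c(7, 7, 8),
  day       = c(16, 16, 1),
  year      = c(1977, 1977, 1978),
  wgt       = c(52, 40, 21)
)

# Column names to drop, as quoted strings this time
drop_cols <- c("month", "day", "year")

# Keep every column whose name is not in drop_cols
keep_cols <- setdiff(names(toy_surveys), drop_cols)

# drop = FALSE keeps the result a data frame even if one column remains
toy_noDate <- toy_surveys[, keep_cols, drop = FALSE]
names(toy_noDate)   # "record_id" "wgt"
```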

Removing rows

Rows typically have no meaningful names, so to remove them from the data.frame, use their positions:

surveys_missingRows <- surveys[-c(10, 50:70), ] # removing rows 10, and 50 to 70
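On a small made-up data frame (toy below is hypothetical), it is easy to check which rows remain. The sketch also shows that the remaining rows keep their original row labels unless you reset them.

```r
# Hypothetical six-row data frame
toy <- data.frame(id = 1:6, wgt = c(52, 40, 21, 33, 48, 19))

# Remove row 2 and rows 4 to 5
toy_fewer <- toy[-c(2, 4:5), ]
nrow(toy_fewer)       # 3
toy_fewer$id          # 1 3 6

# Row labels still refer to the original positions
rownames(toy_fewer)   # "1" "3" "6"

# Optionally renumber the rows from 1
rownames(toy_fewer) <- NULL
rownames(toy_fewer)   # "1" "2" "3"
```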

Building data frames in R

You can build a data frame from a series of vectors right in R.

land_animals <- data.frame(animal=c("dog", "cat", "bunny", "snake"),
                           feel=c("furry", "furry", "fluffy", "smooth"),
                           weight=c(45, 8, 5.5, 1.2))
                           
sea_animals <- data.frame(animal=c("fish", "sea cucumber", "sea urchin"),
                           feel=c("smooth", "squishy", "spiny"),
                           weight=c( 8, 1.1, 0.8))

more_info <- data.frame(legs=c(4,4,4,0,0,0,0), eyes=c(2,2,2,2,2,0,0))

land_animals

##   animal   feel weight
## 1    dog  furry   45.0
## 2    cat  furry    8.0
## 3  bunny fluffy    5.5
## 4  snake smooth    1.2

sea_animals

##         animal    feel weight
## 1         fish  smooth    8.0
## 2 sea cucumber squishy    1.1
## 3   sea urchin   spiny    0.8

more_info

##   legs eyes
## 1    4    2
## 2    4    2
## 3    4    2
## 4    0    2
## 5    0    2
## 6    0    0
## 7    0    0

You can place data frames next to each other using cbind() (column bind) and stack them on top of each other using rbind() (row bind).

With rbind() the number of columns and their names must be identical between the two objects:

all_animals <- rbind(land_animals, sea_animals)
all_animals
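When rbind() is given data frames, it matches columns by name rather than by position, so the column order may differ as long as the names agree. A minimal sketch with two made-up one-row data frames (first and second are hypothetical):

```r
# Same column names, different column order
first  <- data.frame(animal = "dog", weight = 45, stringsAsFactors = FALSE)
second <- data.frame(weight = 8, animal = "cat", stringsAsFactors = FALSE)

# Columns are matched by name; the result follows the first data frame's order
both <- rbind(first, second)
names(both)    # "animal" "weight"
both$animal    # "dog" "cat"
both$weight    # 45 8
```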

If not, you get an error:

rbind(land_animals, more_info)

## Error in rbind(deparse.level, ...): numbers of columns of arguments do not match

With cbind() additional columns must have the same number of elements as there are rows in the data.frame.

all_animals <- cbind(all_animals, more_info)
all_animals

RECYCLING

Be careful when the numbers of rows do not match in a cbind(): R tries to “recycle” the shorter object if the number of rows of the longer object is a multiple of that of the shorter one. Here, we cbind sea_animals (3 rows) with more_info with its first row removed, so that it has 6 rows. Since 6 is a multiple of 3, R recycles values to make it “work out”:

cbind(sea_animals, more_info[-1,])

So R wrote out sea_animals twice so that all 6 rows of more_info fit in the data.frame. This can be very bad if you’re not aware of it.

If the number of rows of the longer object is not a multiple of that of the shorter one, R returns an error:

cbind(land_animals, more_info)

## Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 4, 7
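One way to guard against silent recycling is to compare the row counts yourself before binding. The helper below, cbind_checked, is a hypothetical function written for this sketch (it is not part of base R), and sea and info are made-up data frames.

```r
# Hypothetical helper: refuse to bind unless the row counts match exactly
cbind_checked <- function(left, right) {
  if (nrow(left) != nrow(right)) {
    stop("row counts differ: ", nrow(left), " vs ", nrow(right))
  }
  cbind(left, right)
}

sea  <- data.frame(animal = c("fish", "sea cucumber", "sea urchin"),
                   stringsAsFactors = FALSE)
info <- data.frame(legs = c(0, 0, 0, 0, 0, 0))

# Plain cbind(sea, info) would recycle sea to 6 rows without any warning.
# The checked version stops instead; tryCatch() captures the message here.
result <- tryCatch(cbind_checked(sea, info),
                   error = function(e) conditionMessage(e))
result   # "row counts differ: 3 vs 6"
```

The trade-off is that the helper also rejects recycling you may have wanted, so it suits scripts where every row is meant to line up one to one.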
