Learning Objectives
- understand how to deal with missing data
- generate summary statistics from the data
- calculate basic statistics across levels of a factor (here species)
- generate plots from the data using the average weights of the animals as an example
If you have not already downloaded the surveys data and put it in a directory named "data"
within your working directory, please do so now.
Let’s get a closer look at our data. For instance, we might want to know how many animals we trapped in each plot, or how many of each species were caught.
The function table() tells us how many of each species we have:
table(surveys$species)
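As a minimal sketch of how table() counts, the example below uses a small made-up data frame (toy is a hypothetical stand-in, not the surveys data) to tabulate one column at a time and then cross-tabulate two:

```r
# Hypothetical toy data standing in for surveys (values are made up)
toy <- data.frame(
  species = c("DM", "DM", "PF", "DO", "PF", "DM"),
  plot    = c(1, 2, 1, 2, 2, 1)
)

# Number of animals of each species
species_counts <- table(toy$species)
species_counts

# Number of animals trapped in each plot
plot_counts <- table(toy$plot)
plot_counts

# Cross-tabulation: species (rows) by plot (columns)
table(toy$species, toy$plot)
```

The same calls should work on the real data by swapping toy for surveys.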
R has many built-in statistical functions, such as mean(), median(), max(), min(), var(), and sd(). Let’s start by calculating the average weight of all the animals using the function mean():
mean(surveys$wgt)
[1] NA
Hmm, we just get NA. That’s because we don’t have the weight for every animal, and missing data is recorded as NA. By default, most of R’s summary functions (mean(), median(), max(), sd(), and so on) return NA when the vector they operate on contains missing data. It’s a way to make sure that users know they have missing data and make a conscious decision about how to deal with it.
When dealing with simple statistics like the mean, the easiest way to ignore NA (the missing data) is to use na.rm=TRUE (rm stands for remove):
mean(surveys$wgt, na.rm=TRUE)
[1] 42.67243
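To see the effect of na.rm in isolation, here is a small sketch on a made-up vector (w is hypothetical, not taken from the surveys data):

```r
# Hypothetical weights; NA marks an animal that was not weighed
w <- c(40, NA, 44, 48, NA)

mean(w)                 # NA, because the vector contains missing values
mean(w, na.rm = TRUE)   # 44, the mean of the three recorded weights
sum(is.na(w))           # 2, the number of missing values
length(w)               # 5, length() counts the NA entries too
```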
In some cases, it might be useful to remove the missing data from the vector. For this purpose, R comes with the function na.omit():
wgt_noNA <- na.omit(surveys$wgt)
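A short sketch of what na.omit() returns, again on a made-up vector (the names w and w_noNA are hypothetical): the missing entries are dropped, and their positions are kept in an attribute.

```r
# Hypothetical weights with two missing values
w <- c(40, NA, 44, 48, NA)

w_noNA <- na.omit(w)

length(w_noNA)              # 3, only the recorded weights remain
attr(w_noNA, "na.action")   # positions of the removed elements
mean(w_noNA)                # 44, no na.rm needed any more
```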
For some applications it’s useful to keep all observations; for others, it might be best to remove every observation that contains missing data. The function complete.cases() identifies the rows that have no missing values, and we can use its result to keep only those rows:
surveys_complete <- surveys[complete.cases(surveys), ]
Note: What this is actually doing is rather interesting. The simple line complete.cases(surveys) returns a logical vector with one TRUE/FALSE value for each row of surveys, showing whether or not it is a complete case. When you put such a vector in the row position of the brackets [ , ] to index a data frame, it returns the rows that are marked TRUE.
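The two steps can be sketched separately on a made-up data frame (toy, keep, and toy_complete are hypothetical names used only for illustration):

```r
# Hypothetical toy data with missing values in both columns
toy <- data.frame(
  species = c("DM", "PF", NA, "DO"),
  wgt     = c(40, NA, 31, 52)
)

# Step 1: one TRUE/FALSE per row, TRUE when the row has no NA
keep <- complete.cases(toy)
keep                          # TRUE FALSE FALSE TRUE

# Step 2: use the logical vector in the row position to keep those rows
toy_complete <- toy[keep, ]
nrow(toy_complete)            # 2
```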
The new data.frame surveys_complete now has only complete data. Let’s see how many observations of each species were retained:
table(surveys_complete$species)
Many species have 0 values in this table, meaning there are no records of this species left (i.e. none had a complete record, for example because the weight was never recorded). Yet they still show up in this table. This is because R “remembers” all species that were found in the original dataset as levels of the factor, even though they aren’t included in this data set. This could get annoying later on (with plotting, etc.).
To remove the unused levels and make things clearer, we can redefine the levels for the factor “species” by re-establishing this as a factor.
surveys_complete$species <- factor(surveys_complete$species)
An equivalent way to do this is to use the droplevels() function. Applied to a whole data frame, it drops the unused levels from every factor column at once, which is handy when several columns need cleaning.
surveys_complete <- droplevels(surveys_complete)
Now see how things are looking:
table(surveys_complete$species)
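The behaviour of unused levels is easy to reproduce on a tiny made-up factor (sp and sp_sub are hypothetical names):

```r
# Hypothetical factor with three species codes
sp <- factor(c("DM", "PF", "DO", "DM"))

# Subsetting removes the PF observation but keeps PF as a level
sp_sub <- sp[sp != "PF"]
table(sp_sub)                 # PF is still listed, with a count of 0

# Either approach drops the unused level
levels(droplevels(sp_sub))    # "DM" "DO"
levels(factor(sp_sub))        # "DM" "DO"
```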
Make sure everyone has a clean copy of surveys_complete.
If you have been lost/confused/in the bathroom, you can catch up now with:
## CLEAR WORKSPACE?
#surveys <- read.csv('data/surveys.csv')
surveys$logwgt <- log(surveys$wgt)
surveys$plot <- as.factor(surveys$plot)
surveys_complete <- surveys[complete.cases(surveys), ]
surveys_complete <- droplevels(surveys_complete)
Now save the script!
To determine the number of elements found in a vector, we can use the function length() (e.g., length(surveys$wgt)). Using length(), how many animals have not had their weights recorded?
What is the range (minimum and maximum) of the weights?
What is the median weight for the males?
What if we want the maximum weight for each species, or the mean for each plot?
R comes with convenient functions for this kind of operation: the functions in the apply family.
For instance, tapply() allows us to repeat a function across each level of a factor. The format is:
tapply(columns_to_do_the_calculations_on, factor_to_sort_on, function)
If we want to calculate the maximum for each species (using the complete dataset):
tapply(surveys_complete$wgt, surveys_complete$species, max)
If we want to calculate the mean for each plot:
tapply(surveys_complete$wgt, surveys_complete$plot, mean)
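Extra arguments given after the function name are passed on to it, which is useful when the data still contain missing values. A minimal sketch on made-up data (toy is hypothetical):

```r
# Hypothetical toy data; one PF weight is missing
toy <- data.frame(
  species = factor(c("DM", "DM", "PF", "PF", "DO")),
  wgt     = c(40, 44, 8, NA, 50)
)

# Without na.rm, the group containing an NA yields NA
means_with_na <- tapply(toy$wgt, toy$species, mean)
means_with_na

# na.rm = TRUE is forwarded to mean() for every group
means_clean <- tapply(toy$wgt, toy$species, mean, na.rm = TRUE)
means_clean                   # DM 42, DO 50, PF 8
```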
Create a data frame (surveys_summary) that contains as columns:
- species: the 2 letter code for the species names
- mean_wgt: the mean weight for each species
- sd_wgt: the standard deviation for each species
- min_wgt: the minimum weight for each species
- max_wgt: the maximum weight for each species
Answers
species_max <- tapply(surveys_complete$wgt, surveys_complete$species, max)
species_min <- tapply(surveys_complete$wgt, surveys_complete$species, min)
species_mean <- tapply(surveys_complete$wgt, surveys_complete$species, mean)
species_sd <- tapply(surveys_complete$wgt, surveys_complete$species, sd)
nlevels(surveys_complete$species) # or length(species_mean)
surveys_summary <- data.frame(species=levels(surveys_complete$species),
mean_wgt=species_mean,
sd_wgt=species_sd,
min_wgt=species_min,
max_wgt=species_max)
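As an alternative sketch (not the lesson’s own answer), the four tapply() calls can be generated by looping over a named list of functions with sapply(). The data frame toy and the object names stats and toy_summary are hypothetical:

```r
# Hypothetical toy data with two animals per species
toy <- data.frame(
  species = factor(c("DM", "DM", "PF", "PF")),
  wgt     = c(40, 44, 8, 10)
)

# One tapply() call per statistic; sapply() binds the results into a
# matrix with one row per species and one column per statistic
stats <- sapply(
  list(mean_wgt = mean, sd_wgt = sd, min_wgt = min, max_wgt = max),
  function(f) tapply(toy$wgt, toy$species, f)
)

# Turn the matrix into a data frame with an explicit species column
toy_summary <- data.frame(species = rownames(stats), stats, row.names = NULL)
toy_summary
```

Writing the calls out one by one, as in the answer above, is easier to read; the loop mainly helps when the list of statistics grows.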
The best way to develop insight is often to visualize data. Visualization deserves an entire lecture (or course) of its own, but we can explore a few features of R’s base plotting package.
Let’s use the surveys_summary data that we generated and plot it. R has built-in plotting functions; as a first look, here are the counts per species:
barplot(table(surveys_complete$species))
The axis labels are too big though, so you can’t see them all. Let’s change that.
Tip: To learn about a function in R, e.g. barplot, we can read its help documentation by running help(barplot) or ?barplot.
barplot(surveys_summary$mean_wgt, cex.names=0.4)
Alternatively, we may want to flip the axes to have more room for the species names:
barplot(surveys_summary$mean_wgt, horiz=TRUE, las=1, cex.names=0.4)
Let’s also add some colors, a main title, and an axis label:
barplot(surveys_summary$mean_wgt, horiz=TRUE, las=1, cex.names=0.4,
col=c("lavender", "lightblue"), xlab="Weight (g)",
main="Mean weight per species")
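If the species labels do not appear next to the bars, one option is to pass them explicitly through the names.arg argument of barplot(). The sketch below uses made-up numbers (toy_means and bar_mids are hypothetical names):

```r
# Hypothetical mean weights (made-up numbers), named by species code
toy_means <- c(DM = 42, DO = 50, PF = 8)

# names.arg supplies the bar labels explicitly; cex.names scales them
bar_mids <- barplot(toy_means, names.arg = names(toy_means),
                    horiz = TRUE, las = 1, cex.names = 0.8,
                    col = c("lavender", "lightblue"),
                    xlab = "Weight (g)", main = "Mean weight per species")

# barplot() invisibly returns the bar midpoints, one per bar
bar_mids
```

With the lesson’s data, the corresponding argument would presumably be names.arg=surveys_summary$species, assuming the summary data frame built earlier.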
(Colors can also be specified by their hexadecimal values, #RRGGBB.)
There are lots of different ways to plot things. You can do plot(object) for most classes included in R base. To explore some of the possibilities:
?barplot
?boxplot
?plot.default
example(barplot)
If you wanted to output this plot to a PDF file rather than to the screen, you can specify where you want the plot to go with the pdf() function. If you wanted it to be a JPG, you would use the function jpeg() (other devices are available, including svg(), png(), and postscript()).
Be sure to add dev.off() at the end to finalize the file. For pdf(), you can create multiple pages in your file by generating multiple plots before calling dev.off().
jpeg("mean_per_species.jpg", height=400, width=500)
barplot(surveys_summary$mean_wgt, horiz=TRUE, las=1,
col=c("lavender", "lightblue"), xlab="Weight (g)",
main="Mean weight per species")
dev.off()
pdf("mean_per_species.pdf")
barplot(surveys_summary$mean_wgt, horiz=TRUE, las=1,
col=c("lavender", "lightblue"), xlab="Weight (g)",
main="Mean weight per species")
dev.off()
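The multi-page behaviour of pdf() can be sketched as follows. The file name toy_plots.pdf and the plotted numbers are made up, and the file is written to a temporary directory so nothing in the working directory is touched:

```r
# Hypothetical output file in R's temporary directory
out_file <- file.path(tempdir(), "toy_plots.pdf")

pdf(out_file)                                  # open the PDF device
barplot(c(a = 1, b = 2), main = "Page 1")      # first plot, first page
barplot(c(a = 2, b = 1), main = "Page 2")      # second plot, second page
dev.off()                                      # close the device and finalize the file

file.exists(out_file)                          # TRUE once the device is closed
```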
- Use variable <- value to assign a value to a variable in order to record it in memory.
- dim gives the dimensions of a data frame.
- Use object[x, y] to select a single element from a data frame.
- Use from:to to specify a sequence that includes the indices from from to to.
- Use # to add comments to programs.
- Use mean, max, min and sd to calculate simple statistics.
- Use apply to calculate statistics across the rows or columns of a data frame.
- Use plot to create simple visualizations.