The rest of the lessons about functions and analyzing multiple data sets will use a new data set about inflammation.
Say we are studying inflammation in patients who have been given a new treatment for arthritis, and need to analyze the first dozen data sets.
The files are stored in a .zip file here. Download it into your /data
directory and unzip it.
Make sure everyone has downloaded the zip file.
The data sets are stored in comma-separated values (CSV) format: each row holds information for a single patient, and the columns represent successive days. The first few rows of our first file look like this:
## 0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0
## 0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1
## 0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1
## 0,0,2,0,4,2,2,1,6,7,10,7,9,13,8,8,15,10,10,7,17,4,4,7,6,15,6,4,9,11,3,5,6,3,3,4,2,3,2,1
## 0,1,1,3,3,1,3,5,2,4,4,7,6,5,3,10,8,10,6,17,9,14,9,7,13,9,12,6,7,7,9,6,3,2,2,4,2,0,1,1
For each data set, we want to:
To do all that, we’ll have to learn a little bit about programming. But first, let’s make an analysis plan for the first file.
Load the first data file into R using read.csv
:
dat <- read.csv(file = "data/inflammation-01.csv", header = FALSE)
head(dat)
## [1] "data.frame"
## [1] 59
## [1] 40
## [1] 59 40
## [1] 16
Now let’s perform some common mathematical operations to learn about our inflammation data.
What if we need the maximum inflammation for all patients, or the average for each day?
To support this, we can use the apply
function.
apply
allows us to repeat a function on all of the rows (MARGIN = 1
) or columns (MARGIN = 2
) of a data frame.
Thus, to obtain the average inflammation of each patient we will need to calculate the mean of all of the rows (MARGIN = 1
) of the data frame.
avg_patient_inflammation <- apply(dat, 1, mean)
And to obtain the average inflammation of each day we will need to calculate the mean of all of the columns (MARGIN = 2
) of the data frame.
avg_day_inflammation <- apply(dat, 2, mean)
Since the second argument to apply
is MARGIN
, the above command is equivalent to apply(dat, MARGIN = 2, mean)
. We’ll learn why this is so in the next lesson.
Tip: Some common operations have more efficient alternatives. For example, you can calculate the row-wise or column-wise means with
rowMeans
andcolMeans
, respectively.
Let’s take a look at the average inflammation over time. Recall that we already saved them in the variable avg_day_inflammation
. Plotting the values is done with the function plot
.
plot(avg_day_inflammation)
plot(avg_day_inflammation, type="l")
Above, we gave the function plot
a vector of numbers corresponding to the average inflammation per day across all patients. plot
created a scatter plot where the y-axis is the average inflammation level and the x-axis is the order, or index, of the values in the vector, which in this case correspond to the 40 days of treatment.
Let’s have a look at two other statistics: the maximum and minimum inflammation per day.
max_day_inflammation <- apply(dat, 2, max)
plot(max_day_inflammation, type="l")
min_day_inflammation <- apply(dat, 2, min)
plot(min_day_inflammation, type="l")
Note you can show all of these plots together by modifying some R graphics parameters. This is done with the command par()
. Look at the help file for this function either by typing it in the search window, or by typing ?par()
. There are many, many parameters that can be adjusted here.
par(mfrow=c(1,3))
plot(avg_day_inflammation, type="l")
plot(max_day_inflammation, type="l")
plot(min_day_inflammation, type="l")
par(mfrow=c(1,1)) #reset to original
Write this plot to a jpeg called “daily-inflammation.jpg” with dimensions 300 x 600
jpeg("daily-inflammation.jpg", height=300, width=600)
par(mfrow=c(1,3))
plot(avg_day_inflammation, type="l")
plot(max_day_inflammation, type="l")
plot(min_day_inflammation, type="l")
dev.off()
## pdf
## 2
We have successfully plotted the avg, max and minimum inflammation per day for the first file. We would like to do the same for the other 11 the same way, but typing in the same commands repeatedly is tedious and error-prone.
This next challenge has several steps. Think about how you break down a difficult problem into manageable pieces.
analyze
that takes a filename as a parameter and writes a file with the 3 graphs you made earlier (average, min and max inflammation over time). i.e., analyze('data/inflammation-01.csv')
should produce the graphs already shown, while analyze('data/inflammation-02.csv')
should produce corresponding graphs for the second data set. Be sure to give your function a docstring.analyze <- function(filename, imagename){
# x is a file name for a data frame of numbers
# This function reads the csv file with inflammation data, calculates the mean, max and min inflmmation values across patients for each day, and then plots it side by side, and writes this plot to a jpeg file.
dat <- read.csv(file = filename, header = FALSE)
avg_day_inflammation <- apply(dat, 2, mean)
max_day_inflammation <- apply(dat, 2, max)
min_day_inflammation <- apply(dat, 2, min)
jpeg(imagename, height=300, width=600)
par(mfrow=c(1,3))
plot(avg_day_inflammation, type="l")
plot(max_day_inflammation, type="l")
plot(min_day_inflammation, type="l")
dev.off()
}
Try it out:
analyze("../../data/inflammation/inflammation-01.csv", "daily-inflammation-01.jpg")
## pdf
## 2
analyze("../../data/inflammation/inflammation-02.csv", "daily-inflammation-02.jpg")
## pdf
## 2
analyze("../../data/inflammation/inflammation-03.csv", "daily-inflammation-03.jpg")
## pdf
## 2
analyze("../../data/inflammation/inflammation-04.csv", "daily-inflammation-04.jpg")
## pdf
## 2
We now have a function called analyze to visualize a single data set. We could use it to explore all 12 of our current data sets like this:
# analyze('data/inflammation-01.csv')
# analyze('data/inflammation-02.csv')
# #...
# analyze('data/inflammation-12.csv')
but the chances of us typing all 12 filenames correctly aren’t great, and we’ll be even worse off if we get another hundred files. What we need is a way to tell R to do something once for each file, and that will be the subject of the next lesson.
for
loop does.The general form of a loop is:
for (variable in collection){
do things with variable
}
We can call the loop variable anything we like, but there must be a curly brace {
at the end of the line starting the loop, and we should indent the body of the loop.
Here is a loop that prints out the squares of a set of numbers
for (i in 1:10){
sq <- i^2
print(paste("The square of ", i, "is", sq))
}
## [1] "The square of 1 is 1"
## [1] "The square of 2 is 4"
## [1] "The square of 3 is 9"
## [1] "The square of 4 is 16"
## [1] "The square of 5 is 25"
## [1] "The square of 6 is 36"
## [1] "The square of 7 is 49"
## [1] "The square of 8 is 64"
## [1] "The square of 9 is 81"
## [1] "The square of 10 is 100"
It’s worth tracing the execution of this little program step by step. Since there are 10 instances in the sequence 1:10
, the indexing variable i
gets assigned to each of those values in turn. Each time, a variable sq
is evaluated The first time around, length is zero (the value assigned to it on line 1) and vowel is ‘a’. The statement adds 1 to the old value of length, producing 1, and updates length to refer to that new value. The next time around, vowel is ‘e’ and length is 1, so length is updated to be 2. After three more updates, length is 5; since there is nothing left in ‘aeiou’ for R to process, the loop finishes and the print
statement tells us our final answer.
2**4
. Write a function called expo that uses a loop to calculate the same result. Make the form be: expo(base, power)
expo <- function(base, power){
result <- 1
for (i in 1:power){
result <- result*base
}
return(result)
}
We now have almost everything we need to process all our data files. We need a way to be able to specify the collection that the for loop with operate over.
What we need is a function that finds files whose names match a pattern. We provide those patterns as strings: the character * matches zero or more characters, while ? matches any one character. We can use this to get the names of all the R files we have created so far:
list.files(pattern = "*.R")
## [1] "00-before-we-start.Rmd" "01-intro-to-R.docx"
## [3] "01-intro-to-R.html" "01-intro-to-R.Rmd"
## [5] "02-starting-with-data.Rmd" "03-data-frames.Rmd"
## [7] "04-analyzing-data.Rmd" "05-func-R-motivation.html"
## [9] "05-func-R-motivation.Rmd" "05-func-R.html"
## [11] "05-func-R.Rmd" "06-loops-R.html"
## [13] "06-loops-R.Rmd" "06-loops-R_files"
## [15] "07-best-practices.Rmd" "07-knitr-R.Rmd"
## [17] "0X-parkingLot.Rmd" "challenge-questions.Rmd"
## [19] "LogFunction.R" "skeleton-lessons.R"
or to get the names of all our CSV data files:
filenames <- list.files(pattern="*.csv", recursive = TRUE)
filenames
## [1] "../../data/inflammation/inflammation-01.csv"
## [2] "../../data/inflammation/inflammation-02.csv"
## [3] "../../data/inflammation/inflammation-03.csv"
## [4] "../../data/inflammation/inflammation-04.csv"
## [5] "../../data/inflammation/inflammation-05.csv"
## [6] "../../data/inflammation/inflammation-06.csv"
## [7] "../../data/inflammation/inflammation-07.csv"
## [8] "../../data/inflammation/inflammation-08.csv"
## [9] "../../data/inflammation/inflammation-09.csv"
## [10] "../../data/inflammation/inflammation-10.csv"
## [11] "../../data/inflammation/inflammation-11.csv"
## [12] "../../data/inflammation/inflammation-12.csv"
As these examples show, list.files()
result is a list of strings, which means we can loop over it to do something with each filename in turn. In our case, the “something” we want is our analyze function. Let’s test it by analyzing the first three files in the list:
filenames <- filenames[1:3]
for (f in 1:length(filenames)){
print (filenames[f])
analyze(filenames[f], paste("daily-inflammation", f, ".jpg"))
}
## [1] "../../data/inflammation/inflammation-01.csv"
## [1] "../../data/inflammation/inflammation-02.csv"
## [1] "../../data/inflammation/inflammation-03.csv"
analyze_all
that takes a filename pattern as its sole argument and runs analyze for each file whose name matches the pattern.for variable in collection
to process the elements of a collection one at a time.length(thing)
to determine the length of something that contains other values.list.files(pattern)
to create a list of files whose names match a pattern.*
in a pattern to match zero or more characters.We have now solved our original problem: we can analyze any number of data files with a single command. More importantly, we have met two of the most important ideas in programming:
Previous:Writing Functions Next: Best Practices for R