Inflammation Data

The rest of the lessons about functions and analyzing multiple data sets will use a new data set about inflammation.

Say we are studying inflammation in patients who have been given a new treatment for arthritis, and need to analyze the first dozen data sets.

The files are stored in a .zip file here. Download it into your /data directory and unzip it.

PAUSE

Make sure everyone has downloaded the zip file.

The data sets are stored in comma-separated values (CSV) format: each row holds information for a single patient, and the columns represent successive days. The first few rows of our first file look like this:

## 0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0
## 0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1
## 0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1
## 0,0,2,0,4,2,2,1,6,7,10,7,9,13,8,8,15,10,10,7,17,4,4,7,6,15,6,4,9,11,3,5,6,3,3,4,2,3,2,1
## 0,1,1,3,3,1,3,5,2,4,4,7,6,5,3,10,8,10,6,17,9,14,9,7,13,9,12,6,7,7,9,6,3,2,2,4,2,0,1,1

For each data set, we want to:

calculate the average inflammation per day across all patients, and
plot the result.

To do all that, we’ll have to learn a little bit about programming. But first, let’s make an analysis plan for the first file.

Load the first data file into R using read.csv:

dat <- read.csv(file = "data/inflammation-01.csv", header = FALSE)

head(dat)

Challenge

What class is this data?
How many patients (rows) does this data contain?
How many days (columns) do we have data for?
What is the inflammation level for patients 30 at day 20?

## [1] "data.frame"

## [1] 59

## [1] 40

## [1] 59 40

## [1] 16

Now let’s perform some common mathematical operations to learn about our inflammation data.

Challenge

What is the minimum inflammation on day 7?
What is the mean inflammation on day 12?
What is the maximum inflammation for patient 5

What if we need the maximum inflammation for all patients, or the average for each day?

To support this, we can use the apply function.

apply allows us to repeat a function on all of the rows (MARGIN = 1) or columns (MARGIN = 2) of a data frame.

Thus, to obtain the average inflammation of each patient we will need to calculate the mean of all of the rows (MARGIN = 1) of the data frame.

avg_patient_inflammation <- apply(dat, 1, mean)

And to obtain the average inflammation of each day we will need to calculate the mean of all of the columns (MARGIN = 2) of the data frame.

avg_day_inflammation <- apply(dat, 2, mean)

Since the second argument to apply is MARGIN, the above command is equivalent to apply(dat, MARGIN = 2, mean). We’ll learn why this is so in the next lesson.

Tip: Some common operations have more efficient alternatives. For example, you can calculate the row-wise or column-wise means with rowMeans and colMeans, respectively.

Plotting

Let’s take a look at the average inflammation over time. Recall that we already saved them in the variable avg_day_inflammation. Plotting the values is done with the function plot.

plot(avg_day_inflammation)

plot(avg_day_inflammation, type="l")

Above, we gave the function plot a vector of numbers corresponding to the average inflammation per day across all patients. plot created a scatter plot where the y-axis is the average inflammation level and the x-axis is the order, or index, of the values in the vector, which in this case correspond to the 40 days of treatment.

Let’s have a look at two other statistics: the maximum and minimum inflammation per day.

max_day_inflammation <- apply(dat, 2, max)
plot(max_day_inflammation, type="l")

min_day_inflammation <- apply(dat, 2, min)
plot(min_day_inflammation, type="l")

Note you can show all of these plots together by modifying some R graphics parameters. This is done with the command par(). Look at the help file for this function either by typing it in the search window, or by typing ?par(). There are many, many parameters that can be adjusted here.

par(mfrow=c(1,3))
plot(avg_day_inflammation, type="l")
plot(max_day_inflammation, type="l")
plot(min_day_inflammation, type="l")

par(mfrow=c(1,1)) #reset to original

Write this plot to a jpeg called “daily-inflammation.jpg” with dimensions 300 x 600

jpeg("daily-inflammation.jpg", height=300, width=600)
par(mfrow=c(1,3))
plot(avg_day_inflammation, type="l")
plot(max_day_inflammation, type="l")
plot(min_day_inflammation, type="l")
dev.off()

## pdf 
##   2

Next Steps

We have successfully plotted the avg, max and minimum inflammation per day for the first file. We would like to do the same for the other 11 the same way, but typing in the same commands repeatedly is tedious and error-prone.

Challenges

This next challenge has several steps. Think about how you break down a difficult problem into manageable pieces.

Write a function called analyze that takes a filename as a parameter and writes a file with the 3 graphs you made earlier (average, min and max inflammation over time). i.e., analyze('data/inflammation-01.csv') should produce the graphs already shown, while analyze('data/inflammation-02.csv') should produce corresponding graphs for the second data set. Be sure to give your function a docstring.

analyze <- function(filename, imagename){
  # x is a file name for a data frame of numbers
  # This function reads the csv file with inflammation data, calculates the mean, max and min inflmmation values across patients for each day, and then plots it side by side, and writes this plot to a jpeg file. 
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  max_day_inflammation <- apply(dat, 2, max)
  min_day_inflammation <- apply(dat, 2, min)
  
  jpeg(imagename, height=300, width=600)
  par(mfrow=c(1,3))
  plot(avg_day_inflammation, type="l")
  plot(max_day_inflammation, type="l")
  plot(min_day_inflammation, type="l")
  dev.off()

}

Try it out:

analyze("../../data/inflammation/inflammation-01.csv", "daily-inflammation-01.jpg")

## pdf 
##   2

analyze("../../data/inflammation/inflammation-02.csv", "daily-inflammation-02.jpg")

## pdf 
##   2

analyze("../../data/inflammation/inflammation-03.csv", "daily-inflammation-03.jpg")

## pdf 
##   2

analyze("../../data/inflammation/inflammation-04.csv", "daily-inflammation-04.jpg")

## pdf 
##   2

Next Steps

We now have a function called analyze to visualize a single data set. We could use it to explore all 12 of our current data sets like this:

# analyze('data/inflammation-01.csv')
# analyze('data/inflammation-02.csv')
# #...
# analyze('data/inflammation-12.csv')

but the chances of us typing all 12 filenames correctly aren’t great, and we’ll be even worse off if we get another hundred files. What we need is a way to tell R to do something once for each file, and that will be the subject of the next lesson.

Analyzing Multiple Data Sets

Objectives

Explain what a for loop does.
Correctly write for loops to repeat simple calculations.
Trace changes to a loop variable as the loop runs.
Trace changes to other variables as they are updated by a for loop.
Explain what a list is.
Create and index lists of simple values.
Use a library function to get a list of filenames that match a simple wildcard pattern.
Use a for loop to process multiple files.

For Loops

The general form of a loop is:

for (variable in collection){
    do things with variable
}

We can call the loop variable anything we like, but there must be a curly brace { at the end of the line starting the loop, and we should indent the body of the loop.

Here is a loop that prints out the squares of a set of numbers

for (i in 1:10){
  sq <- i^2
  print(paste("The square of ", i, "is", sq))
}

## [1] "The square of  1 is 1"
## [1] "The square of  2 is 4"
## [1] "The square of  3 is 9"
## [1] "The square of  4 is 16"
## [1] "The square of  5 is 25"
## [1] "The square of  6 is 36"
## [1] "The square of  7 is 49"
## [1] "The square of  8 is 64"
## [1] "The square of  9 is 81"
## [1] "The square of  10 is 100"

It’s worth tracing the execution of this little program step by step. Since there are 10 instances in the sequence 1:10, the indexing variable i gets assigned to each of those values in turn. Each time, a variable sq is evaluated The first time around, length is zero (the value assigned to it on line 1) and vowel is ‘a’. The statement adds 1 to the old value of length, producing 1, and updates length to refer to that new value. The next time around, vowel is ‘e’ and length is 1, so length is updated to be 2. After three more updates, length is 5; since there is nothing left in ‘aeiou’ for R to process, the loop finishes and the print statement tells us our final answer.

Challenges

Exponentiation is built into R: 2**4. Write a function called expo that uses a loop to calculate the same result. Make the form be: expo(base, power)

expo <- function(base, power){
  result <- 1
  for (i in 1:power){
    result <- result*base
  }
  return(result)
  
}

Processing Multiple Files

We now have almost everything we need to process all our data files. We need a way to be able to specify the collection that the for loop with operate over.

What we need is a function that finds files whose names match a pattern. We provide those patterns as strings: the character * matches zero or more characters, while ? matches any one character. We can use this to get the names of all the R files we have created so far:

list.files(pattern = "*.R")

##  [1] "00-before-we-start.Rmd"    "01-intro-to-R.docx"       
##  [3] "01-intro-to-R.html"        "01-intro-to-R.Rmd"        
##  [5] "02-starting-with-data.Rmd" "03-data-frames.Rmd"       
##  [7] "04-analyzing-data.Rmd"     "05-func-R-motivation.html"
##  [9] "05-func-R-motivation.Rmd"  "05-func-R.html"           
## [11] "05-func-R.Rmd"             "06-loops-R.html"          
## [13] "06-loops-R.Rmd"            "06-loops-R_files"         
## [15] "07-best-practices.Rmd"     "07-knitr-R.Rmd"           
## [17] "0X-parkingLot.Rmd"         "challenge-questions.Rmd"  
## [19] "LogFunction.R"             "skeleton-lessons.R"

or to get the names of all our CSV data files:

filenames <- list.files(pattern="*.csv", recursive = TRUE)
filenames

##  [1] "../../data/inflammation/inflammation-01.csv"
##  [2] "../../data/inflammation/inflammation-02.csv"
##  [3] "../../data/inflammation/inflammation-03.csv"
##  [4] "../../data/inflammation/inflammation-04.csv"
##  [5] "../../data/inflammation/inflammation-05.csv"
##  [6] "../../data/inflammation/inflammation-06.csv"
##  [7] "../../data/inflammation/inflammation-07.csv"
##  [8] "../../data/inflammation/inflammation-08.csv"
##  [9] "../../data/inflammation/inflammation-09.csv"
## [10] "../../data/inflammation/inflammation-10.csv"
## [11] "../../data/inflammation/inflammation-11.csv"
## [12] "../../data/inflammation/inflammation-12.csv"

As these examples show, list.files() result is a list of strings, which means we can loop over it to do something with each filename in turn. In our case, the “something” we want is our analyze function. Let’s test it by analyzing the first three files in the list:

filenames <- filenames[1:3]

for (f in 1:length(filenames)){
    print (filenames[f])
    analyze(filenames[f], paste("daily-inflammation", f, ".jpg"))
}

## [1] "../../data/inflammation/inflammation-01.csv"

## [1] "../../data/inflammation/inflammation-02.csv"

## [1] "../../data/inflammation/inflammation-03.csv"

Challenges

Write a function called analyze_all that takes a filename pattern as its sole argument and runs analyze for each file whose name matches the pattern.

Key Points

Use for variable in collection to process the elements of a collection one at a time.
The body of a for loop does not have to, but should be indented.
Use length(thing) to determine the length of something that contains other values.
c(value1, value2, value3, â¦) creates a vector
Vectors are indexed and sliced in the same way as strings and arrays.
vectors are mutable (i.e., their values can be changed in place).
Use list.files(pattern) to create a list of files whose names match a pattern.
Use * in a pattern to match zero or more characters.

Next Steps

We have now solved our original problem: we can analyze any number of data files with a single command. More importantly, we have met two of the most important ideas in programming:

Use functions to make code easier to re-use and easier to understand.
Use vectors and arrays to store related values, and loops to repeat operations on them.

Previous:Writing Functions Next: Best Practices for R