Creating Functions

If we only had one data set to analyze, it might be faster to load the file into a spreadsheet and use that to plot some simple statistics. But if we have the same type of data in many files, or find that we are doing the same type of analysis over and over again, it may help to automate this.

In this lesson, we’ll learn how to write a function so that we can repeat several operations with a single command.

Objectives

  • Define a function that takes parameters.
  • Return a value from a function.

Defining a Function

A common statistical problem:

Your data is non-normal, and you need to make it normal for statistical analysis. Often, people turn to the log transform.

But your data has some zeros, which are undefined after a log transform:

        data    logdata
 [1,] 0.0000       -Inf
 [2,] 0.0000       -Inf
 [3,] 0.0000       -Inf
 [4,] 0.0000       -Inf
 [5,] 0.0000       -Inf
 [6,] 0.0000       -Inf
 [7,] 0.0000       -Inf
 [8,] 0.0000       -Inf
 [9,] 0.0000       -Inf
[10,] 0.0000       -Inf
[11,] 0.0000       -Inf
[12,] 0.0000       -Inf
[13,] 0.0029 -5.8430445
[14,] 0.0419 -3.1724695
[15,] 0.1719 -1.7608424
[16,] 0.1868 -1.6777168
[17,] 0.4309 -0.8418792
[18,] 0.4459 -0.8076606
[19,] 0.4809 -0.7320959
[20,] 0.8135 -0.2064094

A common solution is to add one to the data. This keeps the zero values in the data, but the shape of the entire curve changes. The left graph shows the log-transformed data, which strips out the zero values. The right graph shows what happens when we plot log(x+1). The zero value is marked with a red asterisk.

The lowest three values are marked in blue, orange, and green. Note on the left how spread out they are, and that the orange point is closer to the green point than to the blue point. On the right, they are squished close together.

This is because the smallest non-zero value is 0.0029. Adding 1 to this value drastically changes its magnitude, but adding 1 to the maximum value, 58.9842, barely changes it at all.

Maybe adding 1 is too much in this case. Perhaps something smaller? Try 0.01.

That is better, but now orange is midway between blue and green, when blue should be farther away. Add something even smaller: many people like to add half the smallest non-zero value.
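
To see what these different constants do to the spacing, here is a small sketch on a toy vector chosen to mimic the lesson's smallest values (not the real data set):

```r
# Toy vector with a zero and the three smallest nonzero values from the lesson
x <- c(0, 0.0029, 0.0419, 0.1719)

log(x[-1])                   # original log spacing of the nonzero values
log(x + 1)                   # adding 1: the small values get squeezed together
log(x + 0.01)                # adding 0.01: less distortion, but spacing still changes
log(x + min(x[x != 0]) / 2)  # adding half the smallest nonzero value
```

Comparing the output line by line shows that the gaps between the small values shrink more and more as the added constant grows relative to them.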

How to determine the correct small value to add?

McCune and Grace (2002, Chapter 9) suggest this method to figure out a small value that

  1. preserves the original orders of magnitude in the data, and
  2. results in a value of zero when the initial value is zero.

Given

  • Min(x) is the smallest nonzero value in the data
  • Int(x) is a function that truncates x to an integer by dropping digits after the decimal point
  • c = order-of-magnitude constant = Int(log(Min(x)))
  • d = decimal constant = log⁻¹(c)

then the transformation is \(b_{i} = log(x_{i} + d) - c\)
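
To see why a zero maps to zero, here is the arithmetic worked through for a smallest nonzero value of 0.0029. (Note that the code in this lesson implements Int with floor(), which for a negative log like -5.84 rounds down to -6, whereas strictly dropping the digits after the decimal point would give -5.)

```r
# Worked example for Min(x) = 0.0029, using floor() as the lesson's code does
c <- floor(log(0.0029))  # log(0.0029) ≈ -5.84, so c = -6
d <- exp(c)              # d = e^-6 ≈ 0.00248
log(0 + d) - c           # log(e^-6) - (-6) = -6 + 6 = 0
```

So a zero in the data becomes log(d) - c = c - c = 0, exactly as required.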

Try it out: First, get the data that we’ve been working with.

zerodat<-c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.002899, 0.0419, 
0.1719, 0.1868, 0.4309, 0.4459, 0.4809, 0.8135, 0.8976, 0.9305, 
1.0506, 1.7144, 2.2262, 2.2651, 2.303, 2.7367, 2.799, 2.8651, 
2.9446, 3.2168, 3.2858, 3.7499, 3.9239, 4.0544, 4.148, 4.3406, 
4.4707, 5.4644, 5.8323, 8.0209, 8.9348, 9.3409, 9.6532, 10.1077, 
12.8205, 21.3723, 27.0396, 58.9842)


c <- floor(log(min(zerodat[zerodat!=0])))
d <- exp(c)
transf <- log(zerodat+d) - c

Now let’s plot the data to see how it looks:

par(mfrow=c(1,2), mar=c(3,3,3,3))
plot(zerodat, log(zerodat), ylim=c(log(min(zerodat[zerodat!=0])), log(max(zerodat)+d)))
points(zerodat[13],log(zerodat[13]), col="blue", pch=16)
points(zerodat[14],log(zerodat[14]), col="orange", pch=16)
points(zerodat[15],log(zerodat[15]), col="green", pch=16)

plot(zerodat,transf)
points(0,transf[1], col="red", pch=8)
points(zerodat[13],transf[13], col="blue", pch=16)
points(zerodat[14],transf[14], col="orange", pch=16)
points(zerodat[15],transf[15], col="green", pch=16)

So now we have a series of three lines that will take a set of data with zeros and log-transform it so that orders of magnitude are preserved and zero maps to zero:

c <- floor(log(min(zerodat[zerodat!=0])))
d <- exp(c)
transf <- log(zerodat+d) - c

So this set of equations should be applicable to any set of numbers that you might want to log-transform.

In the file “DataWithZeros.xlsx” there are several sheets with data that should be log-transformed but have zeros.

Though you should avoid storing data like this, some researchers receive data this way in such large volumes that it is difficult to "save as" .csv files. The following code is provided to show how to work with data in Excel directly, for these rare circumstances…

It uses a package that doesn't come with base R, called XLConnect. To install it, you can either type in the console:

install.packages("XLConnect")

or you can go to the "Packages" tab, click the "Install" button, and type in the name of the package. Then click Install.

Once a package has been installed on your computer, you must load it into the memory of the current R Environment with the library() command.

library(XLConnect)

DataW0 <- loadWorkbook("../../data/biology/DataWithZeros.xlsx")
zero1 <- readWorksheet(DataW0, sheet="Sheet1")
zero2 <- readWorksheet(DataW0, sheet="Second")
zero3 <- readWorksheet(DataW0, sheet="Sheet3")
zero4 <- readWorksheet(DataW0, sheet="Sheet4")

Now we have four data sets to which we can apply the log-transform algorithm. Our first thought might be to find-replace: look for instances where we used to have zerodat and replace them with the new data names zero1, zero2, etc…

c <- floor(log(min(zerodat[zerodat!=0])))
d <- exp(c)
transf <- log(zerodat+d) - c


c <- floor(log(min(zero1[zero1!=0])))
d <- exp(c)
transf1 <- log(zero1+d) - c

cbind(zero1, transf1)[1:10,]
        x        x
1  9.8073 5.288191
2  3.7413 4.332653
3  0.8434 2.887041
4  6.6711 4.905220
5  0.0000 0.000000
6  0.0000 0.000000
7  0.4784 2.361695
8  0.1167 1.207162
9  1.6320 3.519857
10 0.4518 2.310022

Ok, that seemed to work.

c <- floor(log(min(zero2[zerodat!=0])))
Error in `[.data.frame`(zero2, zerodat != 0): undefined columns selected
d <- exp(c)
transf2 <- log(zero2+d) - c

cbind(zero2, transf2)[1:10,]

Argh. We made a find-replace error. We forgot to replace the second instance of zerodat in the first line, and R threw an error.

This illustrates a problem with this approach: it is very easy to make a mistake in copying. Also, R only threw an error because zerodat has a length of 50 and zero2 has a length of 35, so the indexing was off. Had the lengths been the same, or had there not been any non-zero elements in positions 36:50 of zerodat, this particular error would not have been thrown, but we still would have gotten bogus data.

A better option would be to write the lines of code in a generic way, referencing a generic variable, say myzerodata, and let that variable equal each data object in turn:

myzerodata <- zero1
c <- floor(log(min(myzerodata[myzerodata!=0])))
d <- exp(c)
mytransfdata <- log(myzerodata+d) - c

cbind(myzerodata, mytransfdata)[1:10,]
        x        x
1  9.8073 5.288191
2  3.7413 4.332653
3  0.8434 2.887041
4  6.6711 4.905220
5  0.0000 0.000000
6  0.0000 0.000000
7  0.4784 2.361695
8  0.1167 1.207162
9  1.6320 3.519857
10 0.4518 2.310022
myzerodata <- zero2
c <- floor(log(min(myzerodata[myzerodata!=0])))
d <- exp(c)
mytransfdata <- log(myzerodata+d) - c

cbind(myzerodata, mytransfdata)[1:10,]
           x          x
1     0.0000  0.0000000
2    83.7133  6.4290132
3   725.9106  8.5876133
4   300.9450  7.7073771
5    47.3171  5.8597278
6     9.9216  4.3082625
7     0.2317  0.9977027
8     4.0841  3.4397013
9  5956.7369 10.6923008
10    4.3557  3.5020833

That works much better, because there is less of a chance of a find-replace error.

But, this still takes up three lines of code, and for a data transformation in the middle of an analysis, I think that’s too much.

The beauty of using R for your data analysis is that you can also use it to program, which allows you to write much more efficient code.

Writing a function

To prepare for the next part of the lesson, clear your workspace, either by typing rm(list=ls()) or by pressing the “Clear” button with the broom picture on the Environment Tab.

We can use these three lines as the body of a function that we name logw0:

logw0 <- function (myzerodata){
  #This function log-transforms a set of numbers that may include zeros in a way that preserves the original order of magnitude and maps zero to zero.  It follows McCune and Grace 2002 (Page 69). 
  # myzerodata is a numeric vector
  
  c <- floor(log(min(myzerodata[myzerodata!=0])))
  d <- exp(c)
  mytransfdata <- log(myzerodata+d) - c
  return(mytransfdata)
}

A function in R is a named object that takes arguments as specified in the parentheses and executes the set of commands between the curly brackets { }. Note that the body of the function is indented to aid readability, and the closing bracket lines up vertically with the function name. Always provide enough documentation at the beginning of the function.
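
For comparison, here is a minimal, hypothetical function (not part of the lesson's data analysis) showing the same anatomy: a name, one argument, a documented body, and a return value:

```r
# A minimal example function: converts a proportion to a percentage
to_percent <- function(p) {
  # p is a numeric vector of proportions between 0 and 1
  result <- p * 100
  return(result)
}

to_percent(0.25)  # 25
```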

Now we can see if our function works on our data. Let’s read it in again (since we lost it when we cleared the workspace):

library(XLConnect)

DataW0 <- loadWorkbook("../../data/biology/DataWithZeros.xlsx")
zero1 <- readWorksheet(DataW0, sheet="Sheet1")$x
zero2 <- readWorksheet(DataW0, sheet="Second")$x
zero3 <- readWorksheet(DataW0, sheet="Sheet3")$x
zero4 <- readWorksheet(DataW0, sheet="Sheet4")$x

Now we can call the function in a single line with any of our data sets:

logw0(zero1)
 [1] 5.2881907 4.3326528 2.8870408 4.9052202 0.0000000 0.0000000 2.3616952
 [8] 1.2071624 3.5198570 2.3100219 3.3734838 2.3381326 2.9944718 5.6087665
[15] 3.1341699 4.3742944 3.1207892 0.0000000 4.5543452 3.0052731 2.4726843
[22] 4.1766182 3.4119686 4.6850809 4.0822753 0.0000000 2.2494082 5.4389925
[29] 1.7133231 5.5091916 2.1508703 3.5942033 4.4922043 0.0000000 3.2387705
[36] 5.0192253 4.3953252 4.0202496 3.6847556 0.9824966 5.8104435 2.2182119
[43] 2.0705913 4.5076902 0.0000000 3.5226477 0.0000000 2.3421950 3.8281094
[50] 5.3890551
logw0(zero2)
 [1]  0.0000000  6.4290132  8.5876133  7.7073771  5.8597278  4.3082625
 [7]  0.9977027  3.4397013 10.6923008  3.5020833  6.9229803  9.7580240
[13]  0.0000000  0.0000000  6.0551763  3.0068022  4.0986601  0.0000000
[19]  5.9820645  2.9527134  4.9241361  0.0000000  5.6312989  5.7595554
[25]  6.3002950  6.0506493  0.0000000  6.6811088  7.6578483  0.0000000
[31]  5.6206444  5.8553709  5.4136178  0.0000000  0.0000000

We can also use it in the middle of other commands, just like we used the log() function:

plot(zero3, logw0(zero3))

This is a much cleaner way to work with this log transform, particularly if we have to apply it to a lot of data in our workspace.

Note: Look in the Environment tab. There are no intermediate variables c and d. R executes the lines inside the logw0 function, creates these variables temporarily, and stores them in something called a stack frame that only exists for as long as R needs to execute the function. Then it throws those variables away. This is a great way to keep your environment clean and free from temporary variables.
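
A quick way to convince yourself of this, sketched with a tiny throwaway function (f and tmp are hypothetical names, not part of the lesson):

```r
# A tiny function that creates a local variable
f <- function(x) {
  tmp <- x * 2   # tmp lives only inside the call's stack frame
  return(tmp)
}

f(3)             # 6
exists("tmp")    # FALSE: tmp was discarded when f() returned
```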

You can print a function's definition at any time by typing its name without parentheses:

logw0

If you are writing and using a lot of custom functions, it can be cumbersome to fill up the first half of your analysis script by defining all of them. An easy way around this is to save a function in a separate R script (File –> New File –> R Script); say we name it "LogFunction.R".

Then at the beginning of a script, we can load this function in a single line with the code

source("LogFunction.R")

The source function will execute all commands in that file quietly, so you don’t have to see it working.
