Welcome to Data Carpentry!
2015-01-14 Cornell

We'll use this etherpad during the workshop to take notes, have an informal way of asking questions, and keep track of links. Please fill in your name in the very top right (and feel free to pick a new color for yourself). You can write in this area to take notes, or feel free to ask questions, post comments, etc in the chat area. 

Useful Links:
Homepage: http://emudrak.github.io/2015-01-15-cornell/
Data: http://emudrak.github.io/2015-01-15-cornell/data/biology/
Cheat sheets: http://emudrak.github.io/2015-01-15-cornell/lessons/ref/index.html
 

Good morning carpenters! :)
I'm going to take notes as Jeremia is talking so hopefully no one ever gets too far behind. Feel free to ask questions on the fly in the little chat corner to the right. 

Go to the data link above and download anything that ends with ".csv"
Put up a pink sticky if you're having any problems finding or downloading the csv files


Using Excel
We all use Excel, so how can we use it in the best way possible?
The trouble with Excel is that it's hard to do reproducible science with it.

Open excel

Our goal for this first lesson on excel is to show you how to best store your data in excel so that it is easy to use that data later, either by passing it to analysis scripts or another person (including future you).

Some rules:
- Each cell should be one unit of information
- Each column should represent one variable
- Each row should be one observation
- Always better to store independent data in independent files

open species.csv and plots.csv in excel

Question: Why should you enter zeros rather than just leaving cells blank? Zeros are observations, blank means missing. If you leave zeros blank, you won't know whether those are missing values or zeros. Those are two different things!

How should we indicate nulls?
- -999
- .
- NA
- NULL
- leave things blank (problem: missing vs. not yet)
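Whichever null marker you pick, make sure whatever reads the data later knows the convention. A minimal sketch in R (which we use later in the workshop), with a made-up table, showing how read.csv can be told which markers mean "missing":

```r
# Hypothetical CSV where missing values were entered as -999 or left blank
csv <- "plot,count\n1,0\n2,-999\n3,\n4,5"

# na.strings tells read.csv which markers mean "missing";
# blank fields in numeric columns become NA by default
dat <- read.csv(text = csv, na.strings = c("NA", "-999"))
dat$count   # 0 NA NA 5 -- the zeros survive, both null markers become NA
```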

Some Dos and Don'ts
Don't enter units in cells (e.g. "10 g"); that forces the cell's data type to "general" (text) instead of numeric
Don't merge cells
Don't combine multiple readings in one column, for example [2 male, 3 female] (have a separate column for total males and total females)
Do keep headers short, without spaces, leading numbers, or special characters
Be careful pasting from other programs into excel - especially word! Paste special and make sure formatting isn't carried over

Dates in excel
- There are many ways to format dates in excel
- If you change a date to a number, you'll see an odd number, and it's not clear how that relates to what you've entered.
- Excel stores dates as a count of days, starting from Dec 31st 1899 (day 1 = Jan 1st 1900). The start date CAN change depending on your version of excel (some Mac versions count from 1904). 
- If you're not careful, excel can sometimes export the number rather than the date you were expecting. 
- Some testimonials of how excel dates have been jerks to us: 
       1. If you're working with historical data, you might run into errors (anything before start date gets "1") 
       2. If excel thinks some text is a date, it'll automatically convert this to a date. A common example is with gene names if you're working with genomic data. It'll recognize some gene names as dates and auto convert those to dates and then when you export it you're exporting something totally different than the gene name. 
       3. My personal testimonial: I had data where I had people's birthday's and date of collection for some sample in two different columns. I needed to share my spreadsheet with a collaborator and did so using google spreadsheets. Google has a different way to deal with dates and it totally messed up all of those dates I had entered. When they calculated age at time of collection, some of the people were too young to be in the study and they thought I was doing something sketchy! I wasn't, it was just date issues! Don't let your collaborators think you're an idiot like I did!
- Use ISO format: 2015-01-15 (year-month-day), stored as text.

Challenge! In the surveys-exercise-extract_month.csv dataset, extract year and month and put those in their own individual columns. Once you're done, put up a blue sticky. If you're stuck, put up a pink sticky. How did you guys solve this? 
Method 1: Select all the dates, text to columns under the data tab, make sure Delimited is checked, hit next, don't choose any delimiter, click next, under column data format, highlight Date and change to DMY. This converts them to date format, and our formula in the other cell works (=YEAR())

Download the aphid_data_Bahlai_2014.xlsx file from the data link
- Open up a blank page in a text editor -> we're going to keep track of what we're doing
- Look across the second row. Some cells are filled in, others are derived.
- Let's save a copy as aphid_data_Bahlai_2014_visualize.xlsx so we can muck with it but still keep the original version of the data separate
- Let's copy all of the data and paste special (select values) so that the derived numbers are static
- If we want to do a sort on the data, we can go to Data -> Sort. It should recognize that we have headers on our columns, and we can pick what to sort by. 
- Let's start by sorting males from smallest to largest. This is an easy way to look at outliers/extreme values
- Let's make a pivot table and compare some variables. You can go to the Analyze tab on Windows or the Data tab on Mac (at least my version) and click pivot table. 

Soooo, pivot tables were fun...let's export our data so that we can use the data in other programs (for example, summarizing using R might be easier than using a pivot table). 

You can go to 'Save As' and select a text format, either .csv (which will put commas in between the fields) or tab-delimited (puts tabs in between the fields). Click ok when it pops up a warning

What happens if we're saving as comma delimited, but there are commas in some of our cells? When you save from excel, it will put those cells in quotes, so that it won't split on that comma when read later. 

Always use a csv, txt parser when you're opening up the file in another program. People have thought a lot about the issues that can crop up and have implemented solutions in those parsers. 
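For instance, R's read.csv (one such parser) handles quoted commas for you; a sketch with made-up data:

```r
# A field containing a comma survives because it is quoted --
# exactly what Excel does when it saves such a cell to .csv
csv <- 'id,note\n1,"wet, muddy"\n2,dry'
dat <- read.csv(text = csv)
dat$note   # "wet, muddy" stays one field, not two
```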




Open Refine
Go to: http://openrefine.org/

If mac people are having any issues with where it's saying the program is broken, follow the instructions here to change security settings: https://github.com/OpenRefine/OpenRefine/issues/590

OpenRefine runs locally in your browser at: http://127.0.0.1:3333/ 

Text facet is very helpful to merge similar data -- using text facet -> cluster -> methods -> nearest neighbor -> PPM

Numeric facet

Text transform
value.replace ("~", " ")

You can use multiple facets at the same time.

Under endowment, we have numbers and we also have people that wrote "10 billion", etc. How can we transform these so that we can actually work with it?
1. Get rid of "US $" stuff -> under the 'text transform' for the endowment column:
value.replace("US $", "").replace("US$, "")
2. Let's do another text transform:
toNumber(value.replace(" million", ""))*1000000
---breaking that down:
-------toNumber: will transform this cell to be numeric rather than text
-------value.replace: will take any instance of " million" and replace it with nothing
-------*1000000: multiply the remaining value in the cell by one million. This gives us the actual value in dollars now in a numeric value. 

We can also plot. Under endowment, you can click....shoot. I got behind...How do we pull up a plot?
Facet--scatterplot facet 
Thanks :)

Under the Undo/Redo tab, you'll see a history of the steps you've taken to transform your data. You can extract this history and reapply the same steps to other files. 

What you're doing isn't changing your original data; you must click the export button to save your edited table. The exported table can be opened in excel, R, whatever program you want. 

LUNCH!!! Be back by 1:30!




SQL
Welcome back from lunch. This afternoon we'll be learning a databasing language: SQL.
Make sure you still have the species.csv, plots.csv, and surveys.csv available. 

Open surveys.csv in Excel to look at it. Then close it again...

Install SQLite manager in firefox. Go to course webpage: http://emudrak.github.io/2015-01-15-cornell/
Open the menu on firefox.  It should be in upper right corner as 3 bars.  Click on add-ons, search for SQLite.  Pick SQLite Manager 0.8.1.  Restart Firefox. 

To Open SQLite Manger, look under the tools menubar.  If you can't find the menu listed with "File | Edit | View | History | Bookmarks | Tools | Help|", Right-click on the blue area above the tabs, and show the menu bar by making sure that "Menu bar" is clicked. Then you should be able to look under the "Tools" menu bar. 

*Make a new database*
Database --> New Database.  Call it Mammal_survey.  Save it in the same folder ("DC_Workshop"). 

*Import a File* 
Database --> Import.  Select CSV, use Select File to browse to your file. This time, choose plots.csv and click open.  Make sure it will process the file correctly: choose "First row contains column names" and "Fields separated by comma". 

A new table called plots will be created. Say OK here. 
The Create Table window will appear.  Here is where we can specify the data type (Integer, double, float...) 

If you can't get things imported from csv, download the portal_mammals.sqlite file from the Data page (http://emudrak.github.io/2015-01-15-cornell/data/biology/). Then, in SQLite, go to Database -> Connect Database and click on the portal_mammals.sqlite. 

Under Tables, You will see three lines: plots, species, surveys. Use the triangle to unfold this menu to look at the table headings in each file. 

The species column in the surveys table corresponds to the species_id column in the species table. 

Tab:  Execute SQL.  In the box under "Enter SQL" there will be an example query

SELECT year FROM surveys; 

Click Run SQL.  The window underneath should show a bunch of years.

"SELECT" and "FROM" are command words, and are conventionally capitalized. "year" and "surveys" are columns from our tables, so we leave them lowercase.  This is convention and for readability. 


SELECT year, month, day FROM surveys;

Wildcards will match anything; a "*" serves as our wildcard. 

SELECT * FROM surveys

this will give us the entire table back. 

to filter, use the keyword "DISTINCT"

SELECT DISTINCT species FROM surveys;

This will give only unique values. 

SELECT DISTINCT year, species FROM surveys;

warning: this returns distinct pairs of year and species. So you'll see duplicate years, but only because each appears with a different species. 

We can do a calculation on anything that is a number. 

SELECT year, month, day, wgt/1000.0 FROM surveys;

SQLite Manager highlights derived columns in a brighter green.

SELECT year, month, day, ROUND(wgt/1000.0 ,2) FROM surveys;

The derived column is named according to the formula.  This is annoying to deal with later, so rename it in the same step: 

SELECT year, month, day, ROUND(wgt/1000.0, 2) AS wgtkg FROM surveys;

Challenge: Write a query that returns the year, month, day, speciesID and weight in mg
SELECT year, month, day, species, wgt*1000 AS wgtmg FROM surveys;

Filtering  with keyword WHERE

SELECT * FROM surveys WHERE species="DM";

SELECT * FROM surveys WHERE year >= 2000;

combine selection criteria with AND and OR

SELECT * FROM surveys WHERE (year >= 2000) AND (species = "DM");

SELECT * FROM surveys WHERE (year >= 2000) AND (species = "DM") OR (species="DO") OR (species="DS");

EXERCISE: Write a query that returns The day, month, year, species ID, and weight (in kg) for individuals caught on Plot 1 that weigh more than 75 g

SELECT day, month, year, species, wgt FROM surveys WHERE (plot= 1) AND (wgt> 75);

Click on surveys and look at the structure tab.  You'll see the data type for plot is a number. So you don't need quotes around numerical values.  

Exporting Tables
If we like this table, we can use the Actions button to save the results of this query as a .csv file.  This is useful if your raw data is huge and you only want to use a smaller subset of it in an analysis program.

The IN operator
The chain 
(species = "DM") OR (species="DO") OR (species="DS")
works fine, but we can do this quicker with the IN operator

SELECT * FROM surveys WHERE (year>=2000) AND (species IN ("DM","DO","DS"));

Sorting with the ORDER keyword
SELECT * FROM species ORDER BY taxa ASC;
"ASC" means ascending - the default. If you dont put anything, it assumes this. 
"DESC" means descending

Sort by multiple columns, with listed priority, from left to right in the written query.

SELECT * FROM species ORDER BY taxa ASC, genus ASC, species ASC;


Group records 

SELECT COUNT(*) FROM surveys;

gives row count of the entire table

SELECT COUNT(*), SUM(wgt) FROM surveys;

gives the row count and the total of the wgt column

SELECT ROUND(SUM(wgt)/1000.0, 3) FROM surveys;

SELECT species, COUNT(*) FROM surveys GROUP BY species;

Exercise: return the individuals counted/year and average weight/year

SELECT year, species, COUNT(species), ROUND(AVG(wgt),2) FROM surveys GROUP BY year, species;

Sometimes it helps to separate a command like this by lines to make it more readable

SELECT year, species, COUNT(species), ROUND(AVG(wgt),2) 
FROM surveys 
GROUP BY year, species;


Joining Tables 
We know some of the data in the species table relates to the data in the surveys table

SELECT * 
FROM surveys
JOIN species ON surveys.species=species.species_id;

The nomenclature:  tableName.columnName
This is helpful because sometimes tables have the same column names. 

If we want to select certain columns for this query, we must use this nomenclature: 

SELECT surveys.year, surveys.month, surveys.day, species.genus, species.species
FROM surveys
JOIN species ON surveys.species=species.species_id;
(the species IDs are what link matching rows between the two tables)

SELECT plots.plot_type, AVG(surveys.wgt)
FROM surveys
JOIN plots 
ON surveys.plot=plots.plot_id
GROUP BY plots.plot_type;


RStudio
Note about RStudio: your windows might not be in the exact same orientation as Erika's, but you should see four windows
The console is where code is actually entered and run
The script area is a place where we can record what we want to do. We can run these directly by hitting run (it'll send that line or whatever is highlighted to the console)

shortcut "<-"------alt+"-"

To assign to a variable: weight_kg <- 55
--- weight_kg is the name of our variable
--- 55 is the value we assigned to that object
You can think of a variable just like a variable in math: weight_kg represents whatever you've stored in it. eg: weight_kg*2 = 110

You can use variables to build new variables:
weight_lb <- 2.2*weight_kg

If you now change weight_kg, it won't change weight_lb
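A quick sketch of that point, with the numbers from above:

```r
weight_kg <- 55
weight_lb <- 2.2 * weight_kg   # 121
weight_kg <- 100               # re-assign weight_kg...
weight_lb                      # ...but weight_lb is still 121:
                               # assignment copied the value, not a link
```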

weights <- c(50, 60, 65, 82)
--- The c() stands for combine ("concatenate" is also common). This lets you assign multiple values to one object

We can convert all four of those weights to lbs at once:
weights_lb <- 2.2*weights

You can also store non-numbers:
animals <- c("mouse", "rat", "dog")

How many values are in my variable?
length(animals)
--- length will tell you how many values are assigned to that variable

weights <- c(weights, 90)
--- you can also add on to a vector using the c() command. This takes the variable weights (which is (50, 60, 65, 82)) and adds 90 to the end, resulting in (50, 60, 65, 82, 90)

How do you know whether your variable is a number or character vector?
class(weights)
"numeric"
class(animals)
"character"

R has many different types of variables: 
1. numeric: these are numbers
2. integer: whole numbers
3. logical: TRUE/FALSE
4. character: letters, letters and numbers



_______________________________________________________________________
Friday - 1/16/15
Good morning all!

R

Open up RStudio
File -> New Project -> New Directory -> Empty Project -> Directory name = "DC_Rproject", put that into Documents -> Click create project

In the console, you can now see where your project is located on your computer (it should say something like "~/DC_Erika/DC_Rproject")

Under Files, ask for a new Folder. Call it "data". If you look in your finder, you'll see that new folder in your computer. 

Let's get some data: http://emudrak.github.io/2015-01-15-cornell/data/biology/ . Download surveys.csv, inflammation.zip, and DataWithZeros.xlsx. Put those in your newly made data folder. Put your blue stickies up when you've done that, put pink up if you need some help.

File -> R Script

Let's load survey data into R:
surveys <- read.csv("data/surveys.csv")

You should see in your "Environment" pane something called "surveys"

getwd() to get current directory

surveys

You should see a lot of stuff on the screen. It's a big table, so we can't see all of it in the console. In the "Environment" tab, there's a little grid next to "surveys". Click that and it'll bring up a spreadsheet like thing of the data

head(surveys)
This will list the first several lines of a table

R interprets this table as a "data.frame". In R, that means each column can have a different type of data (numeric, character strings, logical True/Falses, etc). 

str(surveys)
This will show you the name of every column and what type of data it is (int = integer). Do this when you load up a table to make sure R is understanding your data correctly. 

If you're loading data in general, you can use:
read.table("yourfile.txt", sep=",") 
The "sep" thing here is where you signify whether it's comma delimited (","), tab ("\t"), or anything else, just put that in quotes. 

Let's say we want to just look at one column of our data, year:
surveys$year
This will list every value in that column

dim(surveys) -> shows dimensions of a data.frame
nrow(surveys) -> how many rows are in my data.frame?
ncol(surveys) -> how many columns are in my data.frame?
head(surveys) -> shows first few lines of data.frame
tail(surveys) -> shows last few lines of data.frame

names(surveys) -> shows you the column names for surveys
summary(surveys) -> shows you general summary information for each column (min, max, mean, etc)
class(surveys) -> shows you what kind of object your variable is (in this case data.frame)

Challenge!
-What is the class of the object surveys? 
A: class(surveys) 
"data.frame"
-How many rows and how many columns are in this object?
A: nrow(surveys) ncol(surveys), dim(surveys), look in the environment pane, str(surveys)
-How many species have been recorded during these surveys?
A: str(surveys) (shows there are 49 levels for that variable), surveys$species (shows there are 49 levels)
-What is the range of years over which this data has been recorded?
A: summary(surveys$year), range(surveys$year)
-What is the mean weight of animals recorded?
A: summary(surveys$wgt)
Blue stickies up when you're done, red if you need a hand!

animals <- c("mouse", "rat", "dog", "cat")
animals
(This will list all the animals)
What if we want to only look at say the second animal? We can "index" the second value using brackets:
animals[2]
or the fourth:
animals[4]
if you choose a number that's out of range you'll get an NA
animals[6]
If you want to get multiple values you can use the c() function we used yesterday. Say, the 2nd and 4th value:
animals[c(2,4)]
If you want a range, you can use ":" to specify values between two numbers. Say  you want 2,3,4:
animals[2:4]
What if you want everything but the third value? The "-" sign will show everything except the index you give it. Say you want indices 1, 2, and 4 (not 3):
animals[-3]
If you have a data.frame, there are two dimensions: rows and columns. To reference 2 dimensions using indexing, it's always the rows first, a comma, and then the columns. If you only want to look at the first row:
surveys[1,]
If you only want to look at the first column:
surveys[,1]
If you want to look at only the first entry in the second column:
surveys[1,2]
You can use the ":" and "c()" in this as well. First three rows, columns 4-7:
surveys[1:3, 4:7]

Factors in R: when R inputs anything (using read.csv, read.table, etc) that has letters, it'll convert it to a factor. You can think of these like categories. 
What different categories do we have (we call these levels):
levels(surveys$species)

How many levels does my factor have?
nlevels(surveys$species)

Sometimes we call our levels by a number and we don't want R to interpret it as a number. For example, the plots in our data. How can we force R to interpret something as a factor rather than a number?
as.factor(surveys$plot)
This will just print out surveys$plot as converted to factor, but it's not actually going to save it. To save it: 
surveys$plot <- as.factor(surveys$plot)
This is overwriting this column as now recognizing it as a factor. This is only changing the data in R itself. Your actual data file won't be changed. 
A bit of a kink with factors: even if you call your levels a word, R "thinks" about factor levels as numbers. We won't talk about this further, but just keep this in the back of your head for when you're working in R. Sometimes this can crop up when you're doing data analysis in ways you weren't expecting. 
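One place this bites people, sketched below: converting a factor whose labels look like numbers.

```r
f <- factor(c("10", "20", "30"))
as.numeric(f)                  # 1 2 3 -- the internal level codes, not the labels!
as.numeric(as.character(f))    # 10 20 30 -- go via character to recover the labels
```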

Comment character "#"
Anything written after a "#" on a given line will not be interpreted by R. This is a way to document your code. Use them often! Don't trust that future you will remember current-you's logic!

Let's add a  comment to our script about what we're doing:
# Subsetting data frame

subset(surveys, species=="DO")
What is this doing? We are taking a subset of the dataframe surveys, keeping only the rows where species is DO. The "==" is R speak for "is equal to". A single = will not do this (on its own it acts as assignment, like <-) 

What we wrote above will print it out to the console. We want to save that, let's assign it to a variable:
surveys_DO <- subset(surveys, species=="DO")

Can we do a subset of "this OR that"? 
surveys_DO <- subset(surveys, (species=="DO") | (species=="AB"))

What happens when you see a "+" at the beginning of the line? That means you haven't finished your statement. Check your parentheses, quotes, etc. Make sure you've closed everything!

In R, to say "OR", use the "|" symbol (above your enter key, hit shift)
If you want to say "AND", use "&"

What if you want to look at everything BUT some value? You can use "!=". This means "not equal to". So if we want to look at everything but DO:
surveys_notDO <- subset(surveys, species != "DO")

What if we want to look at only some of the columns while subsetting on DO?
DO_weights <- subset(surveys, species == "DO", select=c("sex", "wgt"))
This is saying, we're subsetting from surveys the columns sex and weight for only the species that are labeled DO

How to get function help in R: question mark and then name of function
?subset

This will pop up a help window in a frame in Rstudio. Some helpful things to look at: the arguments descriptions and details. At the bottom there are usually examples of how to use the function, but honestly I find those usually aren't very helpful! 

Little tip about arguments for functions in R: if you specify the argument names, you can input things in any order into the function when you call it. 
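A sketch of that tip using seq() (a built-in function used here just for illustration):

```r
# Positional: arguments must follow the documented order (from, to, by)
seq(2, 10, 2)                    # 2 4 6 8 10

# Named: the same call, with the arguments in any order
seq(by = 2, to = 10, from = 2)   # also 2 4 6 8 10
```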

COFFEEE! Be back by 10:40 please. 

Clear your environment occasionally to make sure you don't accidentally use a stale object.

When we're working with R, what we want to do is build scripts. These scripts contain a list of all commands we need to run to get a result. You should be able to give your data and a script to your colleague and they should be able to run the exact same analysis that you did. 

Challenge: 

2. What is the range (minimum and maximum) weight?
A. min(surveys_complete$wgt) max(surveys_complete$wgt)
3. What is the median weight for the males?
A. temp <- subset(surveys_complete, sex=="M", select=wgt)
median(temp$wgt)
alternative: median(subset(surveys_complete, sex=="M")$wgt)
alternative: median(surveys_complete[surveys_complete$sex=="M", "wgt"])
Blue sticky up when you're done, pink if you need help.


# tapply(columns to calculate on, factors to group on, function) 
Apply statements are a fast way to do the same thing over and over to each row or column in a dataframe, item in a list, etc. tapply specifically will perform any function you specify on the columns you specify, grouping by the factor you specify. 

specieswgts <- tapply(surveys_complete$wgt, surveys_complete$species, mean)

#STOP###

Group the commands that we actually need to build the objects that we are working with : 

#This is all I really need. 

surveys <- read.csv("data/surveys.csv")
surveys$plot <- as.factor(surveys$plot)
surveys$logwgt <- log(surveys$wgt)
surveys_complete <- surveys[complete.cases(surveys), ]
surveys_complete <- droplevels(surveys_complete)
specieswgts <- tapply(surveys_complete$wgt, surveys_complete$species, mean)

barplot(specieswgts, cex.names=0.5, horiz=TRUE, las=1)


jpeg(filename="species_mean_wgts.jpg", width=400, height=500)
barplot(specieswgts, cex.names=0.5, horiz=TRUE, las=1, main="Mean weight per species", xlab="weight in kg", ylab="species")
dev.off()



Challenge:
Calculate the maximum animal weight in each plot

A. tapply(surveys_complete$wgt, surveys_complete$plot, max)

Let's plot!

barplot(specieswgts)
This will show up in your plot section. Barplot will plot each level on the x-axis, and the value on the y-axis. 

On our plot, our labels are too big and they aren't all showing up. 
If we look in 
?barplot

cex.names will alter the size of the labels
barplot(specieswgts, cex.names=0.8)

Still not small enough

barplot(specieswgts, cex.names=0.5)

Something to note about the helpfile: you'll see that some arguments have default values. If you don't specify such an argument, the default is used. If you look at barplot, height does NOT have a default value. If you don't enter anything for height, the function won't know what to do and will throw an error. 

What if we want the bars coming off of the y-axis:
barplot(specieswgts, cex.names=0.5, horiz=TRUE)

Now the labels aren't right side up though. We can rotate those using "las"
barplot(specieswgts, cex.names=0.5, horiz=TRUE, las=1)

When you're plotting, always put a title and axes labels on your graph! 
You can specify a title using the "main" parameter, and the x axis with "xlab", and the y axis with "ylab"

You can save these images from R script using export -> Save Plot as Image. 

However, you can also save things using a command. You can save things to a jpeg:

jpeg(filename="species_mean_wgts.jpg", width=400, height=500)
barplot(specieswgts, cex.names=0.5, horiz=TRUE, las=1, main="Mean weight per species", xlab="weight in kg", ylab="species")
dev.off()

What happens here is that the jpeg command opens a connection to a new file called species_mean_wgts.jpg on your computer. Any command after that which produces an image/plot/bargraph/whatever will output to the file rather than the little window at the bottom of RStudio. You need to close the connection to the file when you're done writing things out, using dev.off(). This stands for "device off" and just means the connection is being closed. Note: if you do something like this:
jpeg("file.jpg")
barplot(specieswgts)
barplot(specieswgts*2)
dev.off()

It will only keep the second graph: you essentially write over your first graph. Either save them as separate files, or specify that you're going to print more than one graph to the file (we won't cover that here). 





Afternoon

We're going to continue working in R, but close out and reopen to refresh the program. 
Unzip inflammation.zip files:
Windows: right click, extract all, when it shows the file name, delete where it says inflammation
On Mac, double click it, but you'll have to move the individual files back up to the data folder, rather than where you're sitting in the inflammation folder. Delete the empty inflammation folder. 

Each row is a patient, each column is a day

Each file has the same format, so we can set up an analysis to streamline analyzing all of these files, rather than analyzing each one by hand individually.

Challenge: read inflammation-01.csv into dat

dat <- read.csv(file = "data/inflammation-01.csv", header=FALSE)

class(dat)
nrow(dat)
ncol(dat)
dat[30,20]

apply() # This will apply a function multiple times to a dataframe

apply(dat, MARGIN=1, mean) # The mean inflammation for each patient over the experiment
This is saying for each row in dat, find the mean. If you want to do that with the columns, you can do:
apply(dat, MARGIN=2, mean) # The mean inflammation over all patients for each day of the experiment 

avg_day_inflam <- apply(dat, MARGIN=2, mean)

plot(avg_day_inflam) # gives us a little scatter plot where the day is along the X, the average is along the Y axis

This plotted little circles, but maybe we want a line:
plot(avg_day_inflam, type="l")

max_day_inflam <- apply(dat, MARGIN=2, max)
min_day_inflam <- apply(dat, MARGIN=2, min)

plot(min_day_inflam, type="l")
plot(max_day_inflam, type="l")

Let's rearrange our script so that it's more organized. If you're behind, copy these into your script:
dat <- read.csv(file = "data/inflammation-01.csv", header=FALSE)

avg_day_inflam <- apply(dat, MARGIN=2, mean)
max_day_inflam <- apply(dat, MARGIN=2, max)
min_day_inflam <- apply(dat, MARGIN=2, min)

plot(avg_day_inflam, type="l")
plot(min_day_inflam, type="l")
plot(max_day_inflam, type="l")

Let's re-run all of this stuff.
What if you want to see all of these plots at once?
par(mfrow=c(1,3))
plot(avg_day_inflam, type="l")
plot(min_day_inflam, type="l")
plot(max_day_inflam, type="l")

Let's reset that par, since we might not always want three graphs right next to each other:
par(mfrow=c(1,1))

How can we save a jpeg image of these plots?
jpeg("daily-inflammation.jpg", height=300, width=600)
par(mfrow=c(1,3))
plot(avg_day_inflam, type="l")
plot(min_day_inflam, type="l")
plot(max_day_inflam, type="l")
par(mfrow=c(1,1))
dev.off()

Run this. Once you have this 3 panel plot saved into your folder put your stickies up!

Great, so we analyzed inflammation-01.csv, but now let's do all of this for inflammation-02.csv

filename <- "data/inflammation-02.csv"
picname <- "daily-inflammation-02.jpg"

analyze <- function(filename, picname) {
    dat <- read.csv(filename, header=FALSE)
    
    avg_day_inflam <- apply(dat, MARGIN=2, mean)
    max_day_inflam <- apply(dat, MARGIN=2, max)
    min_day_inflam <- apply(dat, MARGIN=2, min)
    
    jpeg(picname, height=300, width=600)
    par(mfrow=c(1,3))
    plot(avg_day_inflam, type="l")
    plot(min_day_inflam, type="l")
    plot(max_day_inflam, type="l")
    par(mfrow=c(1,1))
    dev.off()
}

analyze(filename="data/inflammation-03.csv", picname="daily-inflammation-03.jpg")


For Loops:
Generally, for loops do the following:
FOR everything in my collection, do something I want you to do.

for (i in 1:10) {
    sq <- i^2
    print(paste("The square of", i, "is", sq))
}

aside about paste(): paste automatically puts a space wherever you've put a comma in the string of things you want to paste together. You can designate a different separator with the sep argument. Say a dash: 
paste("The square of", i, "is", sq, sep="-") 
will look something like:
The square of-1-is-1

list.files(recursive=TRUE) # This will give a list of all the files that are in all the folders in the current working directory

Maybe you don't want all the files, maybe you only want to look at files that are comma separated: 

filelist <- list.files(pattern="inflammation-.csv", recursive=TRUE) 

for (f in 1:length(filelist)) {
        analyze(filename=filelist[f], picname=paste("daily-inflam",f,".jpg", sep=""))
}


So, the entire bunch of code that we'll need is below: 


analyze <- function(filename, picname) {
    dat <- read.csv(filename, header=FALSE)

    avg_day_inflam <- apply(dat, MARGIN=2, mean)
    max_day_inflam <- apply(dat, MARGIN=2, max)
    min_day_inflam <- apply(dat, MARGIN=2, min)

    jpeg(picname, height=300, width=600)
    par(mfrow=c(1,3))
    plot(avg_day_inflam, type="l")
    plot(min_day_inflam, type="l")
    plot(max_day_inflam, type="l")
    par(mfrow=c(1,1))
    dev.off()
}

filelist <- list.files(pattern="inflammation-", recursive=TRUE)

for (f in 1:length(filelist)){
  analyze(filename=filelist[f], picname=paste("daily-inflam", f, ".jpg", sep=""))
}








############ Command Line Lesson #######################

PATH=$PATH:"C:\Program Files\R\R-3.1.1\bin\x64"




To check this worked, type "Rscript" and you should get a bunch of usage text on the screen

Tools up until now: point & click
Many of the commands we have used can be tied together with the command line


We'll work in "bash" for this. Windows users need to install it through Git (Git Bash); Macs have it already installed.
This is what you use to access computer clusters on campus (if you do that...)

We can move around in our folders in bash just like we do in the Windows file browser

DO THIS: Go to downloads page
(http://emudrak.github.io/2015-01-15-cornell/data/biology/),
 get "filesystem.zip", put it on desktop and unzip it. 

Put up sticky once you have this downloaded and unpacked

whoami

pwd - "print working directory" - where you are "sitting" right now

cd - "change directory"

ls - "listing" lists all the files in the directory right now

ls -F 
shows which is a file and which is a folder. Folders have "/" after them.
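A quick sketch you can try anywhere (the demo names here are made up):

```shell
mkdir -p demo_dir       # make an empty folder
touch demo_file.txt     # make an empty file
ls -F                   # folders get a trailing "/", plain files don't
```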

move into the filesystem folder: 
cd filesystem

Everything you see in the command line should mirror everything you see in the windows explorer when you click through folders (like you probably are used to doing).

Let's look at the users folder: 

cd users

Now the path has changed to say "~/Desktop/filesystem/users"

ls
You should see  "backup" "nelle" "thing"

Move into the "nelle" folder

cd nelle

Bash shortcuts: 

"Tab completion"
You may want to look in the folder "north-pacific-gyre", but long names invite typos. With tab completion, you can start typing a name and press "tab", and the shell will complete the rest of the line if it can.


cd 
will take you all the way to the home directory

full path vs. relative path
full path is being very explicit from home directory 
"c:/users/elm26/Documents/MyAnalysis" 
The home directory is sometimes abbreviated by "~"
a relative path is instructions from where you are sitting right now. 
relative paths make it easier to move a bunch of files in a complicated folder structure between users. You could put the outermost folder on a thumbdrive and give it to your friend, and all the relative paths would still work. 

"Up directory"
cd ..
Will back up one directory from where you are.  "Back up" or move outwards, etc...
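A small sketch tying relative paths and "cd .." together (the folder names here are made up):

```shell
mkdir -p demo/users/nelle   # a throwaway folder structure
cd demo/users/nelle         # relative path: down from where we are sitting
pwd                         # prints the full path, ending in demo/users/nelle
cd ..                       # relative path: back up one level, into users
pwd                         # now ends in demo/users
```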

Discussion of File organization
have different folders for different distinct data types, stages of analysis, etc...

Just like you can make new folders in the graphical windows 

put yourself in the "nelle" directory

make a new folder within this folder, call it thesis.  This command is "mkdir"

mkdir thesis

now look in your graphical windows browser. You should see a new folder named thesis. 

We are going to write some text files for this directory using nano, which is a simple text editor

type
nano draft.txt

you'll get into another window that is the text editing window with the heading "draft.txt" at the top. 

type something in here, then press Ctrl-X to get out of the nano program. It will ask you whether you want to save. 

now you're back in the shell,
type
ls
and you will see the new draft.txt file that you just made

remove a file with 
rm draft.txt
**** WARNING **** This will delete it FOREVER - there is no recycle bin here!


cat will print out the entire contents of the file (head prints just the start)
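For example, with a throwaway file:

```shell
printf "line one\nline two\n" > notes.txt   # make a two-line file
cat notes.txt                               # prints both lines to the screen
```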


make an original draft (draft_v1.txt - or rename your earlier draft.txt with "mv draft.txt draft_v1.txt")

Copy it and name it something else:

cp draft_v1.txt draft_v2.txt

open this new file and change something about it
nano draft_v2.txt

type ls to see both the original file and the new file


back up into the nelle folder
cd ..

move into the molecules folder

cd molecules

ls

you should see a bunch of files

wc cubane.pdb

wc - "word count" will show how many lines, words, and characters are in the file. 

use the wildcard character "*" to represent any number of characters

so then 
wc *.pdb

will print out the word count for any file whose name ends with .pdb. 

wc p*
will print out the word count for anything that starts with "p"
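If you don't have the molecule files handy, you can see the wildcard at work with some made-up files:

```shell
touch a.pdb b.pdb note.txt   # two empty .pdb files and one .txt file
ls *.pdb                     # matches a.pdb and b.pdb, but not note.txt
wc *.pdb                     # word counts for both .pdb files
```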

Right now, everything that we are doing is just printing on the screen and not being saved anywhere. You can redirect this output to another file with the ">" operator

wc *.pdb > word_counts.txt

Now instead of printing on the screen, it will put the output into the file word_counts.txt

### Pipes ####
The power of using the command line is stringing a bunch of commands together. This is done with the "pipe" or "bar" character "|", which is above the enter key, to the right of the "]" key, and you need shift to get to it. 

head cubane.pdb
shows the top part, just like in R

pipe "head" together with "wc": 

head cubane.pdb | wc
this will string the commands together: it will take the head of cubane.pdb and then do the word count of that. 
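Another classic pipe, sketched here with made-up files standing in for the .pdb molecules: sort the line counts to find the shortest file.

```shell
printf "a\n" > one.txt              # a one-line file
printf "a\nb\nc\n" > three.txt      # a three-line file
wc -l *.txt | sort -n | head -n 1   # the file with the fewest lines sorts to the top
```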

For loops in bash


for i in *.pdb
do
echo $i
done

Now we are entering more than one line at once, and the prompt changes from $ to >
the start of the loop is "do" and the end of the loop is "done", similar to the { and } brackets in R

Let's do something more complicated

for i in *.pdb
do
echo $i
wc $i
done


this is useful if you have 20 data files and you want to do the same operation to all of them

ls .. will list the contents of the directory one level up, without moving there


Now go to the homepage, to the Schedule. Under "Workflows and Automating Tasks in the Shell" is a link to the notes that we used 

Get to 6. Command Line programs This has some notes that you will need to use. 

cd into your data folder from this morning 

cd ~/Documents/DC_Erika/DC_Rproject/data   
... or something similar....

make sure you can see all the inflammation files when you type 
ls

A backslash "\" can be used to escape the space in a file name
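For example, with a made-up folder name containing a space:

```shell
mkdir -p "my folder"   # a folder with a space in its name
cd my\ folder          # the backslash escapes the space
pwd
```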

Now follow along on this webpage: http://emudrak.github.io/2015-01-15-cornell/lessons/shell/07-cmdline.html

head -4 inflammation-10.csv | Rscript readings.R --min

########### How to write an R script in Bash ############
be in the data folder

nano session-info.R

Now we have a nano window where we are writing into our file called "session-info.R"
Into this, type 

sessionInfo()

Save using Ctrl-O, and then exit nano with Ctrl-x

ls
you should now see a "session-info.R"

cat session-info.R

Rscript session-info.R


Create a new script called "print-args.R"

nano print-args.R
-------- in nano type...----------
# Comments are put after the pound symbol, just like in R Studio
args <- commandArgs()
cat(args, sep="\n")
--------------------------------

Save using Ctrl-O, and then exit nano with Ctrl-x

now, back in bash, 
ls
you should see the print-args.R

Rscript print-args.R

Rscript print-args.R cat dog mouse

the output includes some extra R startup arguments - let's edit our R script to hide those

nano print-args.R 

-------- in nano edit the file...----------
# Comments are put after the pound symbol, just like in R Studio
args <- commandArgs(trailingOnly=TRUE)
cat(args, sep="\n")
--------------------------------


Save using Ctrl-O, change the file name to "print-args-trailing.R", and then exit nano with Ctrl-X

Rscript print-args-trailing.R otter unicorn monkey

Make a new file in nano

-------- in nano type..----------
main <- function() {
    args <- commandArgs(trailingOnly=TRUE)
    filename <- args[1]
    dat <- read.csv(file=filename, header=FALSE)
    mean_per_patient <- apply(dat, MARGIN=1, mean)
    cat(mean_per_patient, sep="\n")
}

main()


--------------------------------

Save using Ctrl-O, save it as "readings-01.R" and then exit nano with Ctrl-x
".R" is extension for all R scripts

make sure the "readings-01.R" file is in the directory when you type "ls"

Try running this script with the first inflammation file: 

Rscript readings-01.R inflammation-01.csv

now set it up so that that information is written to another file using >

Rscript readings-01.R inflammation-01.csv > inflammation-01-csv.out

You can also do this in a for-loop!

for i in *.csv 
do 
echo $i
done

hmm... this shows too many files. Edit this little script: use 
i*.csv
to show only files that start with "i" and end in ".csv"

for i in i*.csv 
do 
echo $i
done

now do it with the R script we just wrote, to process all the files in a batch:
for i in i*.csv
do
echo $i
Rscript readings-01.R $i > $i.out
done




up and down arrow to show past typed commands
"history" to show the typed commands


mv draft.txt thesis/
(mv moves a file; this puts draft.txt into the thesis folder)
mv draft.txt draft_v1.txt
(with a file name as the target, mv renames the file instead)