Welcome to Data Carpentry!
2015-01-14 Cornell
We'll use this etherpad during the workshop to take notes, have an informal way of asking questions, and keep track of links. Please fill in your name in the very top right (and feel free to pick a new color for yourself). You can write in this area to take notes, or feel free to ask questions, post comments, etc. in the chat area.
Useful Links:
Homepage: http://emudrak.github.io/2015-01-15-cornell/
Data: http://emudrak.github.io/2015-01-15-cornell/data/biology/
Cheat sheets: http://emudrak.github.io/2015-01-15-cornell/lessons/ref/index.html
Good morning carpenters! :)
I'm going to take notes as Jeremia is talking so hopefully no one ever gets too far behind. Feel free to ask questions on the fly in the little chat corner to the right.
Go to the data link above and download anything that ends with ".csv"
Put up a pink sticky if you're having any problems finding or downloading the csv files
Using Excel
We all use Excel, so how can we use it in the best way possible?
The issue with Excel is that it's hard to do reproducible science with it.
Open excel
Our goal for this first lesson on Excel is to show you how to best store your data in Excel so that it is easy to use that data later, either by passing it to analysis scripts or another person (including future you).
Some rules:
- Each cell should be one unit of information
- Each column should represent one variable
- Each row should be one observation
- Always better to store independent data in independent files
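Here's a small, hypothetical example of what those rules look like as an R data frame (made-up values, just to illustrate the layout we're aiming for):
tidy_example <- data.frame(
  date    = c("2015-01-15", "2015-01-15"),   # one variable per column
  plot    = c(1, 2),
  species = c("DM", "DO"),
  weight  = c(42, 0)    # a zero is a real observation; missing values get your agreed null
)
tidy_example    # one observation per row, one value per cell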
open species.csv and plots.csv in excel
Question: Why should you enter zeros rather than just leaving cells blank? Zeros are observations, blank means missing. If you leave zeros blank, you won't know whether those are missing values or zeros. Those are two different things!
How should we indicate nulls?
- -999
- .
- NA
- NULL
- leave things blank (problem: can't tell missing from not-yet-entered or zero)
Some Dos and Don'ts
Don't enter units in data cells; it forces the data type to "general" text rather than a number
Don't merge cells
Don't combine multiple readings in one column, for example [2 male, 3 female] (have a separate column for total males and total females)
Do keep headers short; avoid spaces, names starting with numbers, and special characters
Be careful pasting from other programs into excel - especially word! Paste special and make sure formatting isn't carried over
Dates in excel
- There are many ways to format dates in excel
- If you change a date to a number, you'll see an odd number, and it's not clear how that relates to what you've entered.
- Excel counts up from Dec 31st 1899 by one for each day. This CAN change depending on your version of Excel. (See the small R sketch after this list for turning those serial numbers back into dates.)
- If you're not careful, Excel can sometimes export the number rather than the date you were expecting.
- Some testimonials of how excel dates have been jerks to us:
1. If you're working with historical data, you might run into errors
(anything before start date gets "1")
2. If excel thinks some text is a date, it'll automatically convert
this to a date. A common example is with gene names if you're working
with genomic data. It'll recognize some gene names as dates and auto
convert those to dates and then when you export it you're exporting
something totally different than the gene name.
3. My personal testimonial: I had data where I had people's birthday's
and date of collection for some sample in two different columns. I
needed to share my spreadsheet with a collaborator and did so using
google spreadsheets. Google has a different way to deal with dates and
it totally messed up all of those dates I had entered. When they
calculated age at time of collection, some of the people were too young
to be in the study and they thought I was doing something sketchy! I
wasn't, it was just date issues! Don't let your collaborators think
you're an idiot like I did!
- Use ISO format: 2015-01-15 (year, month, day), and fix the format as text.
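If you do end up with Excel's serial numbers instead of dates, they can usually be converted back. A hedged sketch in R (the serial number below is made up, and the origin assumes Excel's 1900 date system on Windows):
as.Date(42019, origin = "1899-12-30")   # gives "2015-01-15" under the 1900 date system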
Challenge!
In the surveys-exercise-extract_month.csv dataset, extract year and
month and put those in their own individual columns. Once you're done,
put up a blue sticky. If you're stuck, put up a pink sticky. How did you
guys solve this?
Method 1: Select all the dates, text to
columns under the data tab, make sure Delimited is checked, hit next,
don't choose any delimiter, click next, under column data format,
highlight Date and change to DMY. This converts them to date format, and
our formula in the other cell works (=YEAR())
Download the aphid_data_Bahlai_2014.xlsx file from the data link
- Open up a blank page in a text editor -> we're going to keep track of what we're doing
- Look across the second row. Some cells are filled in, others are derived.
- Let's save a copy as aphid_data_Bahlai_2014_visualize.xlsx so we can muck with it but still keep the original version of the data separate
- Let's copy all of the data and paste special (select values) so that the derived numbers are static
- If we want to do a sort on the data, we can go to Data -> Sort. It should recognize that we have headers on our columns, and we can pick what to sort by.
- Let's start by sorting males from smallest to largest. This is an easy way to look at outliers/extreme values
- Let's make a pivot table and compare some variables. You can go to the Analyze tab on Windows or the Data tab on Mac (at least my version) and click pivot table.
Soooo, pivot tables were fun... let's export our data so that we can use the data in other programs (for example, summarizing using R might be easier than using a pivot table).
You can go to 'Save As' and select a text format, either .csv (which will put commas in between the fields) or tab-delimited (puts tabs in between the fields). Click OK when it pops up a warning.
What happens if we're saving as comma delimited, but there are commas in some of our cells? When you save from Excel, it will put those cells in quotes, so that it won't split on that comma when read later.
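A hedged illustration in R of why that quoting matters (the text below is invented):
txt <- 'plot,notes\n1,"shrubby, near fence"\n2,open'
read.csv(text = txt)   # the comma inside the quotes stays within the notes field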
Always use a csv/txt parser when you're opening up the file in another program. People have thought a lot about the issues that can crop up and have implemented solutions in those parsers.
Open Refine
Go to: http://openrefine.org/
If Mac people are having any issues where it's saying the program is broken, follow the instructions here to change security settings: https://github.com/OpenRefine/OpenRefine/issues/590
http://127.0.0.1:3333/
Text facet is very helpful to merge similar data -- using text facet -> cluster -> methods -> nearest neighbor -> PPM
Numeric facet
Text transform
value.replace ("~", " ")
Multiple facets used at the same time.
Under endowment, we have numbers and we also have people that wrote "10 billion", etc. How can we transform these so that we can actually work with it?
1. Get rid of "US $" stuff -> under the 'text transform' for the endowment column:
value.replace("US $", "").replace("US$", "")
2. Let's do another text transform:
toNumber(value.replace(" million", ""))*1000000
---breaking that down:
-------toNumber: will transform this cell to be numeric rather than text
-------value.replace: will take any instance of " million" and replace it with nothing
-------*1000000: multiply the remaining value in the cell by one million. This gives us the actual value in dollars as a numeric value.
We can also plot. Under endowment, you can click....shoot. I got behind...How do we pull up a plot?
Facet--scatterplot facet
Thanks :)
Under the Undo/Redo tab, you'll see a history of the steps you've taken to transform your data. You can extract this history and you can reapply the same steps to other files.
What you're doing isn't changing your original data; you must click the Export button, and that will save your edited table. This table can be opened in Excel, R, whatever program you want.
LUNCH!!! Be back by 1:30!
SQL
Welcome back from lunch. This afternoon we'll be learning a databasing language: SQL.
Make sure you still have the species.csv, plots.csv, and surveys.csv available.
Open surveys.csv in Excel to look at it. Then close it again...
Install SQLite manager in firefox. Go to course webpage: http://emudrak.github.io/2015-01-15-cornell/
Open the menu on Firefox. It should be in the upper right corner as 3 bars. Click on Add-ons, search for SQLite. Pick SQLite Manager 0.8.1. Restart Firefox.
To open SQLite Manager, look under the Tools menu bar. If you can't find the menu listed with "File | Edit | View | History | Bookmarks | Tools | Help", right-click on the blue area above the tabs and show the menu bar by making sure that "Menu bar" is checked. Then you should be able to look under the "Tools" menu bar.
*Make a new database*
DataBase --> New Database. Call it Mammal_survey. Save it in the same folder ("DC_Workshop").
*Import a File*
Database --> Import. Select CSV, use Select File to browse to your file. This time, choose plots.csv and click Open. Make sure it will process the file correctly: choose "First row contains column names" and "Fields separated by: comma".
A new table called plots will be created. Say OK here.
The Create Table window will appear. Here is where we can specify the data type (Integer, double, float...)
If you can't get things imported from csv, download the portal_mammals.sqlite file from the Data page (http://emudrak.github.io/2015-01-15-cornell/data/biology/). Then, in SQLite, go to Database -> Connect Database and click on the portal_mammals.sqlite.
Under Tables, you will see three lines: plots, species, surveys. Use the triangle to unfold this menu to look at the table headings in each file.
The species column in the surveys table holds species IDs that match up with the species table (we'll use this when we join tables later).
Tab: Execute SQL. In the box under "Enter SQL" there will be an example query
SELECT year FROM surveys;
Click Run SQL. The window underneath should show a bunch of years.
"SELECT"
and "FROM" are command words, and are conventionally capitalized.
"year" and "surveys" are columns from our tables, so we leave them
lowercase. This is convention and for readability.
SELECT year, month, day FROM surveys;
Wildcards will match anything; a "*" serves as our wildcard.
SELECT * FROM surveys
this will give us the entire table back.
To return only unique values, use the keyword "DISTINCT"
SELECT DISTINCT species FROM surveys;
This will give only unique values.
SELECT DISTINCT year, species FROM surveys;
warning: this returns distinct pairs of year and species. So you'll have duplicate years, but that's because they are paired with different species.
We can do a calculation on anything that is a number.
SELECT year, month, day, wgt/1000.0 FROM surveys;
SQLite Manager highlights derived columns in a brighter green.
SELECT year, month, day, ROUND(wgt/1000.0 ,2) FROM surveys;
The derived column is named according to the formula. This is annoying to deal with later, so rename it in the same step:
SELECT year, month, day, ROUND(wgt/1000.0, 2) AS wgtkg FROM surveys;
Challenge: Write a query that returns the year, month, day, speciesID and weight in mg
SELECT year, month, day, species, wgt*1000 AS wgtmg FROM surveys;
Filtering with keyword WHERE
SELECT * FROM surveys WHERE species="DM";
SELECT * FROM surveys WHERE year >= 2000;
combine selection criteria with AND and OR
SELECT * FROM surveys WHERE (year >= 2000) AND (species = "DM");
SELECT * FROM surveys WHERE (year >= 2000) AND (species = "DM") OR (species="DO") OR (species="DS");
EXERCISE: Write a query that returns the day, month, year, species ID, and weight (in kg) for individuals caught on Plot 1 that weigh more than 75 g
SELECT day, month, year, species, wgt FROM surveys WHERE (plot = 1) AND (wgt > 75);
Click on surveys and look at the Structure tab. You'll see the data type for plot is a number, so you don't need quotes around numerical values.
Exporting Tables
If we like this table, we can use the Actions button to save the results of this query as a .csv file. This is useful if your raw data is huge and you only want to use a smaller subset of it in an analysis program.
The IN operator
The chain
(species = "DM") OR (species="DO") OR (species="DS")
works fine, but we can do this more quickly with the IN operator
SELECT * FROM surveys WHERE (year>=2000) and (species IN ("DM","DO","DS"))
Sorting with the ORDER keyword
SELECT * FROM species ORDER BY taxa ASC;
"ASC" means ascending - the default. If you dont put anything, it assumes this.
"DESC" means descending
Sort by multiple columns, with listed priority, from left to right in the written query.
SELECT * FROM species ORDER BY taxa ASC, genus ASC, species ASC;
Group records
SELECT COUNT(*) FROM surveys;
gives row count of the entire table
SELECT COUNT(*), SUM(wgt) FROM surveys;
gives the row count and the total of the wgt column
SELECT ROUND(SUM(wgt)/1000.0, 3) FROM surveys;
SELECT species, COUNT(*) FROM surveys GROUP BY species;
Exercise: return the individuals counted/year and average weight/year
SELECT year, species, COUNT(species), ROUND(avg(wgt),2) FROM surveys GROUP BY year, species;
Sometimes it helps to separate a command like this by lines to make it more readable
SELECT year, species, COUNT(species), ROUND(avg(wgt),2)
FROM surveys
GROUP BY year, species;
Joining Tables
We know some of the data in the species table relates to the data in the surveys table
SELECT *
FROM surveys
JOIN species ON surveys.species=species.species_id;
The nomenclature: tableName.columnName
This is helpful because sometimes tables have the same column names.
If we want to select certain columns for this query, we must use this nomenclature:
SELECT surveys.year, surveys.month, surveys.day, species.genus, species.species
FROM surveys
JOIN species ON surveys.species=species.species_id;
(the IDs are used to make sure records correspond between the different tables)
SELECT plots.plot_type, AVG(surveys.wgt)
FROM surveys
JOIN plots
ON surveys.plot=plots.plot_id
GROUP BY plots.plot_type;
RStudio
Note about RStudio: your windows might not be in the exact same orientation as Erika's, but you should see four windows
The console is where code is actually entered and run
The script area is a place where we can record what we want to do. We can run lines directly by hitting Run (it'll send that line or whatever is highlighted to the console).
Shortcut for "<-": Alt + "-"
To assign to a variable: weight_kg <- 55
--- weight_kg is the name of our variable
--- 55 is the value we assigned to that object
You can think of a variable just like a variable in math: weight_kg represents whatever you've stored in it, e.g. weight_kg*2 gives 110.
You can use one variable when creating another:
weight_lb <- 2.2*weight_kg
If you now change weight_kg, it won't change weight_lb
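For example, using the numbers above:
weight_kg <- 55
weight_lb <- 2.2 * weight_kg   # 121
weight_kg <- 100
weight_lb                      # still 121 -- the value was copied when we assigned it, not linked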
weights <- c(50, 60, 65, 82)
--- The c() stands for concatenate. This allows you to enter multiple values assigned to an object
We can convert all four of those weights to lbs at once:
weights_lb <- 2.2*weights
You can also store non-numbers:
animals <- c("mouse", "rat", "dog")
How many values are in my variable?
length(animals)
--- length will tell you how many values are assigned to that variable
weights <- c(weights, 90)
--- you can also add on to a vector using the c() command. This command takes the variable weights (which is (50, 60, 65, 82)) and then adds 90 to the end, resulting in (50, 60, 65, 82, 90)
How do you know whether your variable is a number or character vector?
class(weights)
"numeric"
class(animals)
"character"
R has many different types of variables:
1. numeric: these are numbers
2. integers: whole numbers
3. logical: TRUE/FALSE
4. character: letters, letters and numbers
_______________________________________________________________________
Friday - 1/16/15
Good morning all!
R
Open up RStudio
File -> New Project -> New Directory -> Empty Project -> Directory name = "DC_Rproject", put that into Documents -> Click Create Project
In the console, you can now see where your project is located on your computer (it should say something like "~/DC_Erika/DC_Rproject").
Under Files, ask for a new folder. Call it "data". If you look in your finder, you'll see that new folder on your computer.
Let's get some data: http://emudrak.github.io/2015-01-15-cornell/data/biology/. Download surveys.csv, inflammation.zip, and DataWithZeros.xlsx. Put those in your newly made data folder. Put your blue stickies up when you've done that, put pink up if you need some help.
File -> R Script
Let's load survey data into R:
surveys <- read.csv("data/surveys.csv")
You should see in your "Environment" pane something called "surveys"
getwd() to get current directory
surveys
You should see a lot of stuff on the screen. It's a big table, so we can't see all of it in the console. In the "Environment" tab, there's a little grid next to "surveys". Click that and it'll bring up a spreadsheet-like view of the data.
head(surveys)
This will list the first several lines of a table
R interprets this table as a "data.frame". In R, that means each column can have a different type of data (numeric, character strings, logical TRUE/FALSEs, etc.).
str(surveys)
This will show you the name of every column and what type of data it is (int = integer). Do this when you load up a table to make sure R is understanding your data correctly.
If you're loading data in general, you can use:
read.table("yourfile.txt", sep=",")
The "sep" argument here is where you signify whether it's comma delimited (","), tab delimited ("\t"), or anything else; just put that in quotes.
Let's say we want to just look at one column of our data, year:
surveys$year
This will list every value in that column
dim(surveys) -> shows dimensions of a data.frame
nrow(surveys) -> how many rows are in my data.frame?
ncol(surveys) -> how many columns are in my data.frame?
head(surveys) -> shows first few lines of data.frame
tail(surveys) -> shows last few lines of data.frame
names(surveys) -> shows you the column names for surveys
summary(surveys) -> shows you general summary information for each column (min, max, mean, etc)
class(surveys) -> shows you what kind of object your variable is (in this case data.frame)
Challenge!
-What is the class of the object surveys?
A: class(surveys)
"data.frame"
-How many rows and how many columns are in this object?
A: nrow(surveys) ncol(surveys), dim(surveys), look in the environment pane, str(surveys)
-How many species have been recorded during these surveys?
A: str(surveys) (shows there are 49 levels for that variable), surveys$species (shows there are 49 levels)
-What is the range of years over which this data has been recorded?
A: summary(surveys$year), range(surveys$year)
-What is the mean weight of animals recorded?
A: summary(surveys$wgt)
Blue stickies up when you're done, red if you need a hand!
animals <- c("mouse", "rat", "dog", "cat")
animals
(This will list all the animals)
What if we want to only look at say the second animal? We can "index" the second value using brackets:
animals[2]
or the fourth:
animals[4]
if you choose a number that's out of range you'll get an NA
animals[6]
If you want to get multiple values you can use the c() function we used yesterday. Say, the 2nd and 4th value:
animals[c(2,4)]
If you want a range, you can use ":" to specify values between two numbers. Say you want 2,3,4:
animals[2:4]
What if you want everything but the third value? The "-" sign will show everything but the index you're inputting. Say you want indexes 1, 2, and 4 (not 3):
animals[-3]
If you have a data.frame, there are two dimensions: rows and columns. To reference 2 dimensions using indexing, it's always the rows first, a comma, and then the columns. If you only want to look at the first row:
surveys[1,]
If you only want to look at the first column:
surveys[,1]
If you want to look at only the first entry in the second column:
surveys[1,2]
You can use the ":" and "c()" in this as well. First three rows, columns 4-7:
surveys[1:3, 4:7]
Factors in R: when R inputs anything (using read.csv, read.table, etc.) that has letters, it'll convert it to a factor. You can think of these like categories.
What different categories do we have (we call these levels):
levels(surveys$species)
How many levels does my factor have?
nlevels(surveys$species)
Sometimes we call our levels by a number and we don't want R to interpret it as a number, for example the plots in our data. How can we force R to interpret something as a factor rather than a number?
as.factor(surveys$plot)
This will just print out surveys$plot as converted to factor, but it's not actually going to save it. To save it:
surveys$plot <- as.factor(surveys$plot)
This overwrites the column so that it's now recognized as a factor. This is only changing the data in R itself; your actual data file won't be changed.
A bit of a kink with factors: even if you call your levels a word, R "thinks" about factor levels as numbers. We won't talk about this further, but just keep it in the back of your head when you're working in R. Sometimes this can crop up when you're doing data analysis in ways you weren't expecting.
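One small, hypothetical example of how that can bite you (made-up values):
f <- factor(c(10, 20, 20, 40))
as.numeric(f)                 # 1 2 2 3  -- the internal level codes, not your numbers!
as.numeric(as.character(f))   # 10 20 20 40 -- go through character to get the real values back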
Comment character "#"
Anything written after a "#" on a given line will not be interpreted by R. This is a way to document your code. Use comments often! Don't trust that future-you will remember current-you's logic!
Let's add a comment to our script about what we're doing:
# Subsetting data frame
subset(surveys, species=="DO")
What is this doing? We are taking a subset of the dataframe surveys, where we're only looking at entries where species is DO. The "==" is R-speak for "is equal to". A single = will not do this (it is actually the same as the assignment operator, <-).
What we wrote above will print it out to the console. We want to save that, let's assign it to a variable:
surveys_DO <- subset(surveys, species=="DO")
Can we do a subset of "this OR that"?
surveys_DO <- subset(surveys, (species=="DO") | (species=="AB"))
What happens when you see a "+" at the beginning of the line? That means you haven't finished your statement. Check your parentheses, quotes, etc. Make sure you've closed everything!
In R, to say "OR", use the "|" symbol (above your enter key, hit shift)
If you want to say "AND", use "&"
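For example (a made-up weight cutoff, just to show the "&" syntax):
surveys_DO_heavy <- subset(surveys, (species == "DO") & (wgt > 50))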
What if you want to look at everything BUT some value? You can use "!=". This means "not equal to". So if we want to look at everything but DO:
surveys_notDO <- subset(surveys, species != "DO")
What if we want to look at only some of the columns while subsetting on DO?
DO_weights <- subset(surveys, species == "DO", select=c("sex", "wgt"))
This is saying, we're subsetting from surveys the columns sex and weight for only the species that are labeled DO
How to get function help in R: question mark and then name of function
?subset
This will pop up a help window in a frame in RStudio. Some helpful things to look at: the argument descriptions and details. At the bottom there are usually examples of how to use the function, but honestly I find those usually aren't very helpful!
Little tip about arguments for functions in R: if you specify the argument name, you can input things in any order into the function when you call it.
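A quick sketch of what that means, using read.csv() which we've already seen (the two calls below do the same thing):
surveys <- read.csv(file = "data/surveys.csv", header = TRUE)
surveys <- read.csv(header = TRUE, file = "data/surveys.csv")   # named arguments can go in any order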
COFFEEE! Be back by 10:40 please.
Clean your environment occasionally to make sure you're not accidentally using a stale or mistaken object.
When we're working with R, what we want to do is build scripts. These scripts contain a list of all the commands we need to run to get a result. You should be able to give your data and a script to a colleague, and they should be able to run the exact same analysis that you did.
Challenge:
2. What is the range (minimum and maximum) weight?
A. min(surveys_complete$wgt) max(surveys_complete$wgt)
3. What is the median weight for the males?
A. temp <- subset(surveys_complete, sex=="M", select=wgt)
median(temp$wgt)
Alternative answer to 3: median(subset(surveys_complete, sex=="M")$wgt)
Another alternative to 3: median(surveys_complete[surveys_complete$sex=="M", "wgt"])
Blue sticky up when you're done, pink if you need help.
# tapply(columns to calculate on, factors to group on, function)
Apply statements are a fast way to do the same thing over and over to each row or column in a dataframe, item in a list, etc. tapply specifically will perform any function you specify on the column you specify, grouping by the factor you specify.
specieswgts <- tapply(surveys_complete$wgt, surveys_complete$species, mean)
#STOP###
Group the commands that we actually need to build the objects that we are working with:
#This is all I really need.
surveys <- read.csv("data/surveys.csv")
surveys$plot <- as.factor(surveys$plot)
surveys$logwgt <- log(surveys$wgt)
surveys_complete <- surveys[complete.cases(surveys), ]
surveys_complete <- droplevels(surveys_complete)
specieswgts <- tapply(surveys_complete$wgt, surveys_complete$species, mean)
barplot(specieswgts, cex.names=0.5, horiz=TRUE, las=1)
jpeg(filename="species_mean_wgts.jpg", width=400, height=500)
barplot(specieswgts, cex.names=0.5, horiz=TRUE, las=1, main="Mean weight per species", xlab="weight in kg", ylab="species")
dev.off()
Challenge:
Calculate the maximum animal weight in each plot
A. tapply(surveys_complete$wgt, surveys_complete$plot, max)
Let's plot!
barplot(specieswgts)
This will show up in your plot section. Barplot will plot each level on the x-axis, and the value on the y-axis.
On our plot, our labels are too big and they aren't all showing up.
If we look in
?barplot
cex.names will alter the size of the labels essentially
barplot(specieswgts, cex.names=0.8)
Still not small enough
barplot(specieswgts, cex.names=0.5)
Something to note about the help file: you'll see that the arguments are assigned defaults. If you don't specify an argument, R uses the default for any argument that has one. If you look at barplot, height does NOT have a default value. If you don't enter anything for height, it won't know what to do and will throw an error.
What if we want the bars coming off of the y-axis:
barplot(specieswgts, cex.names=0.5, horiz=TRUE)
Now the labels aren't right side up though. We can rotate those using "las"
barplot(specieswgts, cex.names=0.5, horiz=TRUE, las=1)
When you're plotting, always put a title and axes labels on your graph!
You can specify a title using the "main" parameter, and the x axis with "xlab", and the y axis with "ylab"
You can save these images from the Plots pane using Export -> Save Plot as Image.
However, you can also save things using a command. You can save things to a jpeg:
jpeg(filename="species_mean_wgts.jpg", width=400, height=500)
barplot(specieswgts, cex.names=0.5, horiz=TRUE, las=1, main="Mean weight per species", xlab="weight in kg", ylab="species")
dev.off()
What happens here is that the jpeg command opens a connection to a new file called species_mean_wgts.jpg on your computer. Any command you type after that, if it produces an image/plot/bargraph/whatever, will output to that file rather than the little window at the bottom of RStudio. You need to close the connection to this file when you're done writing things out, using dev.off(). This stands for "device off" and it just means it's closing the connection. Note: if you do something like this:
jpeg("file.jpg")
barplot(specieswgts)
barplot(specieswgts*2)
dev.off()
It will only keep the second graph; you essentially write over your first graph. Either you need to save them as separate files or you need to specify that you're going to print more than one graph to your file (we won't cover that here, but see the sketch below).
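One hedged way around that: base R lets the jpeg filename act as a numbering template, so each new plot goes to its own file (the filenames here are made up):
jpeg(filename = "specieswgts_%02d.jpg", width = 400, height = 500)
barplot(specieswgts)
barplot(specieswgts * 2)
dev.off()   # writes specieswgts_01.jpg and specieswgts_02.jpg, one file per plot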
Afternoon
We're going to continue working in R, but close out and reopen to refresh the program.
Unzip inflammation.zip files:
Windows: right click, extract all, when it shows the file name, delete where it says inflammation
On Mac, double click it, but you'll have to move the individual files back up to the data folder, rather than where you're sitting in the inflammation folder. Delete the empty inflammation folder.
Each row is a patient, each column is a day
Each file has the same format, so we can set up an analysis to streamline analyzing all of these files, rather than analyzing each one by hand individually.
Challenge: read inflammation-01.csv into dat
dat <- read.csv(file = "data/inflammation-01.csv", header=FALSE)
class(dat)
nrow(dat)
ncol(dat)
dat[30,20]
apply() # This will apply a function multiple times to a dataframe
apply(dat, MARGIN=1, mean) # The mean inflammation for each patient over the experiment
This is saying for each row in dat, find the mean. If you want to do that with the columns, you can do:
apply(dat, MARGIN=2, mean) # The mean inflammation over all patients for each day of the experiment
avg_day_inflam <- apply(dat, MARGIN=2, mean)
plot(avg_day_inflam) # gives us a little scatter plot where the day is along the X, the average is along the Y axis
This plotted little circles, but maybe we want a line:
plot(avg_day_inflam, type="l")
max_day_inflam <- apply(dat, MARGIN=2, max)
min_day_inflam <- apply(dat, MARGIN=2, min)
plot(min_day_inflam, type="l")
plot(max_day_inflam, type="l")
Let's rearrange our script so that it's more organized. If you're behind, copy these into your script:
dat <- read.csv(file = "data/inflammation-01.csv", header=FALSE)
avg_day_inflam <- apply(dat, MARGIN=2, mean)
max_day_inflam <- apply(dat, MARGIN=2, max)
min_day_inflam <- apply(dat, MARGIN=2, min)
plot(avg_day_inflam, type="l")
plot(min_day_inflam, type="l")
plot(max_day_inflam, type="l")
Let's re-run all of this stuff.
What if you want to see all of these plots at once?
par(mfrow=c(1,3))
plot(avg_day_inflam, type="l")
plot(min_day_inflam, type="l")
plot(max_day_inflam, type="l")
Let's reset that par, since we might not always want three graphs right next to each other:
par(mfrow=c(1,1))
How can we save a jpeg image of these plots?
jpeg("daily-inflammation.jpg", height=300, width=600)
par(mfrow=c(1,3))
plot(avg_day_inflam, type="l")
plot(min_day_inflam, type="l")
plot(max_day_inflam, type="l")
par(mfrow=c(1,1))
dev.off()
Run this. Once you have this 3 panel plot saved into your folder put your stickies up!
Great, so we analyzed inflammation-01.csv, but now let's do all of this for inflammation-02.csv
filename <- "data/inflammation-02.csv"
picname <- "daily-inflammation-02.jpg"
analyze <- function(filename, picname) {
dat <- read.csv(filename, header=FALSE)
avg_day_inflam <- apply(dat, MARGIN=2, mean)
max_day_inflam <- apply(dat, MARGIN=2, max)
min_day_inflam <- apply(dat, MARGIN=2, min)
jpeg(picname, height=300, width=600)
par(mfrow=c(1,3))
plot(avg_day_inflam, type="l")
plot(min_day_inflam, type="l")
plot(max_day_inflam, type="l")
par(mfrow=c(1,1))
dev.off()
}
analyze(filename="data/inflammation-03.csv", picname="daily-inflammation-03.jpg")
For Loops:
Generally, for loops do the following:
FOR everything in my collection, do something I want you to do.
for (i in 1:10) {
sq <- i^2
print(paste("The square of", i, "is", sq))
}
Aside about paste(): paste automatically puts a space wherever you've put a comma in your string of things you want to paste together. You can designate something other than a space by using sep="" and putting whatever you want in between the "". Say a dash:
paste("The squre of",i,"is",sq, sep="-")
will look something like:
The square of-1-is-1
list.files(recursive=TRUE) # This will give a list of all the files that are in all the folders in the current working directory
Maybe you don't want all the files, maybe you only want to look at files that are comma separated:
filelist <- list.files(pattern="inflammation-.*csv", recursive=TRUE)
for (f in 1:length(filelist)) {
analyze(filename=filelist[f], picname=paste("daily-inflam",f,".jpg", sep=""))
}
So, the entire bunch of code that we'll need is below:
analyze <- function(filename, picname){
dat <- read.csv(filename, header=FALSE)
avg_day_inflam <- apply(dat, MARGIN=2, mean )
max_day_inflam <- apply(dat, MARGIN=2, max)
min_day_inflam <- apply(dat, MARGIN=2, min)
jpeg(picname, height=300, width=600)
par(mfrow=c(1,3))
plot(avg_day_inflam, type="l")
plot(min_day_inflam, type="l")
plot(max_day_inflam, type="l")
par(mfrow=c(1,1))
dev.off()
}
filelist <- list.files(pattern="inflammation-", recursive=TRUE)
for (f in 1:length(filelist)){
analyze(filename=filelist[f], picname=paste("daily-inflam", f, ".jpg", sep=""))
}
############ Command Line Lesson #######################
PATH=$PATH:"C:\Program Files\R\R-3.1.1\bin\x64"
To check this worked, type "Rscript" and you should get the bunch of text on the screen
Tools up until now: point & click
Many of the commands we have used can be tied together with the command line
We'll work in "bash" for this. Windows users need to download it through Git (Git Bash); Macs have it already installed.
This is what you use to access computer clusters on campus (if you do that...)
We can move around in our folders in bash just like we do in our file browser windows
DO THIS: Go to the downloads page (http://emudrak.github.io/2015-01-15-cornell/data/biology/), get "filesystem.zip", put it on the desktop and unzip it.
Put up sticky once you have this downloaded and unpacked
whoami
pwd - "print working directory" - where you are "sitting" right now
cd - "change directory"
ls - "listing" lists all the files in the directory right now
ls -F
shows which is a file and which is a Folder. Folders have "/" after them.
move into the filesystem folder:
cd filesystem
Everything you see in the command line should mirror everything you see in the windows explorer when you click through folders (like you probably are used to doing).
Let's look at the users folder:
cd users
Now the path has changed to say "~/Desktop/filesystem/users"
ls
You should see "backup" "nelle" "thing"
Move into the "nelle" folder
cd nelle
Bash shortcuts:
"Tab completion"
Say you want to look in the folder "north-pacific-gyre", but you are likely to make typos. With tab completion, you can start writing the name and press "tab", and it will complete the rest of the line if it can.
cd
will take you all the way to the home directory
full path vs. relative path
A full path spells out the location explicitly, starting from the root (or home) directory:
"c:/users/elm26/Documents/MyAnalysis"
The home directory is sometimes abbreviated as "~"
A relative path gives directions from where you are sitting right now.
Relative paths make it easier to move a bunch of files in a complicated folder structure between users. You could put the outermost folder on a thumbdrive and give it to your friend, and all the relative paths would still work.
"Up directory"
cd ..
Will back up one directory from where you are. "Back up" or move outwards, etc...
Discussion of File organization
have different folders for different distinct data types, stages of analysis, etc...
Just like in the graphical window, you can make new folders at the command line
put yourself in the "nelle" directory
make a new folder within this folder, call it thesis. This command is "mkdir"
mkdir thesis
now look in your graphical windows browser. You should see a new folder named thesis.
We are going to write some text files for this directory using nano, which is a simple text editor
type
nano draft.txt
you'll get into another window that is the text editing window with the heading "draft.txt" at the top.
Type something in here, then press Ctrl-X to get out of the nano program. It will ask you if you want to save.
Now you're back in the shell,
type
ls
and you will see the new draft.txt file that you just made
remove a file with
rm draft.txt
**** WARNING **** This will delete it FOREVER - there is no recycle bin here!
cat will print out the contents of the file
make an original draft (draft.txt)
Copy it and name it something else
cp draft_v1.txt draft_v2.txt
open this new file and change something about it
nano draft_v2.txt
type ls to see both the original file and the new file
back up into the nelle folder
cd ..
move into the molecules folder
cd molecules
ls
you should see a bunch of files
wc cubane.pdb
wc - "word count" will show the word count- how many lines and words are in teh file.
use the wildcard character "*" to represent any number of characters
so then
wc *.pdb
will print out the word count for anything that ends with .pdb.
wc p*
will print out the word count for anything that starts with "p"
Right now, everything that we are doing is just printing on the screen and not being saved anywhere. You can redirect this output to another file with the ">" command
wc *.pdb > word_counts.txt
Now instead of printing on the screen, it will put the output into the file word_counts.txt
### Pipes ####
The power of using the command line is stringing a bunch of commands together. This is done with the "pipe" or "bar" character "|", which is above the enter key, to the right of the "]" key, and you need shift to get to it.
head cubane.pdb
shows the top part, just like in R
pipe "head" together with "wordcount":
head cubane.pdb | wc
this will string the commands together, so it will take the head of cubane.pdb and then do the word count of that.
For loops in bash
for i in *.pdb
do
echo $i
done
Now we are entering more than one line at once, and the prompt changes from $ to >
the start of the loop is "do" and the end of the loop is "done", similar to the { and } brackets in R
Let's do something more complicated
for i in *.pdb
do
echo $i
wc $i
done
this is useful if you have 20 data files and you want to do the same operation to all of them
ls .. will list the contents of the directory one level up
Now go to the homepage, to the Schedule. Under "Workflows and Automating Tasks in the Shell" is a link to the notes that we used.
Get to "6. Command-Line Programs". This has some notes that you will need to use.
cd into your data folder from this morning
cd ~/Documents/DC_Erika/DC_Rproject/data
... or something similar....
make sure you can see all the inflammation files when you type
ls
back slash "\" can be used to escape the space in a file name
Now follow along on this webpage: http://emudrak.github.io/2015-01-15-cornell/lessons/shell/07-cmdline.html
head -4 inflammation-10.csv | Rscript readings.R --min
########### How to write an R script in Bash ############
be in the data folder
nano session-info.R
Now we have a nano window where we are writing into our file called "session-info.R"
Into this, type
sessionInfo()
Save using Ctrl-O, and then exit nano with Ctrl-x
ls
you should now see a "session-info.R"
cat session-info.R
Rscript session-info.R
Create a new script called "print-args.R"
nano print-args.R
-------- in nano type...----------
# Comments are put after the pound symbol, just like in R Studio
args <- commandArgs()
cat(args, sep="\n")
--------------------------------
Save using Ctrl-O, and then exit nano with Ctrl-x
now, back in bash,
ls
you should see the print-args.R
Rscript print-args.R
Rscript print-args.R cat dog mouse
To hide the weird R stuff, let's edit our R script to hide those lines
nano print-args.R
-------- in nano edit the file...----------
# Comments are put after the pound symbol, just like in R Studio
args <- commandArgs(trailingOnly=TRUE)
cat(args, sep="\n")
--------------------------------
Save using Ctrl-O, change the file name to "print-args-trailing.R", and then exit nano with Ctrl-X
Rscript print-args-trailing.R otter unicorn monkey
Make a new file in nano
-------- in nano type..----------
main<- function(){
args <- commandArgs(trailingOnly=TRUE)
filename <- args[1]
dat <- read.csv(file=filename, header=FALSE)
mean_per_patient <- apply(dat, MARGIN=1, mean)
cat(mean_per_patient, sep="\n")
}
main()
--------------------------------
Save using Ctrl-O, save it as "readings-01.R" and then exit nano with Ctrl-x
".R" is extension for all R scripts
make sure the "readings-01.R" file shows up in the directory when you type "ls"
Try running this script with the first inflammation file:
Rscript readings-01.R inflammation-01.csv
now set it up so that that information is written to another file using >
Rscript readings-01.R inflammation-01.csv > inflammation-01-csv.out
You can also do this in a for-loop!
for i in *.csv
do
echo $i
done
Hmm... this shows too many files. Let's edit this a little... now use
i*.csv
to show anything that starts with "i" and ends in ".csv"
for i in i*.csv
do
echo $i
done
now do it for the R script we just wrote...
To run the script on each file, with output file names built from the input names:
for i in i*.csv
do
echo $i
Rscript readings-01.R $i > $i.out
done
up and down arrow to show past typed commands
"history" to show the typed commands
mv can move a file into a folder, or rename it:
mv draft.txt thesis/
mv draft.txt draft_v1.txt