Biostatistics Lab 2
In this assignment, we will work on Homework 2 in R. That involves getting your data back into R, so the first part of the instructions will overlap with Lab 1.
- Find R. You might try RStudio this time. http://rstudio.org/ RStudio does not eliminate the need to program in R but it does organize help menus, graphics, your code history, and other things nicely.
Deducer provides a graphical user interface which cuts down on your programming requirements but it is not installed on the STC computers. http://www.deducer.org/pmwiki/pmwiki.php?n=Main.DeducerManual
- The next hurdle is to get your data into R. So we need a data set. Let's just use the one from your first homework assignment. I copied the data and the basic story behind the data from Carnegie Mellon's StatLib DASL web site:
Original source: Schmitt, Maribeth C., The Effects on an Elaborated Directed Reading Activity on the Metacomprehension Skills of Third Graders, Ph.D. dissertaion, Purdue University, 1987. The data is at the very bottom of this page.
- Open something like Excel and copy and paste the data into it. You might well have to use the Text to Columns function under Data in order to get the data appropriately into the right columns.
- Save the file as tab delimited text called something like HW02.txt somewhere on the computer.
- Make sure you know where HW02.txt is on your computer.
- Open R
- Change directories in R to the directory where HW01.txt lives. In R on a Windows machine, you can do this by going under File and Change dir... or by typing code at the prompt like:
Double versus single quotes, backslashes versus forward slashes, versus double slashes are all variations that might be needed on different operating systems. And clearly you should replace my directory system names with your own. And the "change dir" option is located under another option (not File) in RStudio and on an Apple machine - you will have to look for it.
- Get your data into R and assign it to a variable by typing something like the following at the prompt:
data <- read.table(file='HW02.txt', header=TRUE)
Tricks: "true" has to be all capitals. The file has a header line if you left the line with Treatment and Response at the top of your file. If you took that out, then the header should be FALSE, which is the default value. We will learn about more complicated objects in R like data frames later.
- Just type
to see the data and make sure that it is all there like it should be.
- We are going to assess the data for normality and for equal variances. To do so, we will need to separate out the treated and the control groups. As before, subset the data to different variables:
treatment <- subset(data[,2], data[,1]=="Treated")
control <-subset(data[,2], data[,1]=="Control")
To create a normal probability plot, you can type
Note which axis the data is on. This is actually the reverse of the way we did it in class in Minitab.
- Look at the help menu for qqnorm and put the data on the x-axis like Minitab does it.
- Do the same for the control group.
- To get a p-value of significance, the default test in R has command Shapiro.test. The Shapiro-Wilk test implemented specifically tests for departures from normality. The Anderson-Darling test used as the default in Minitab applies to other distributions as well - it can tests for departures from log normality, for instance, or from a uniform distribution, etc...
- To conduct Levene's test for equal variances requires installing and loading an external package, lawstat. You can do this either through commands or through pull down menus. With commands, you type:
After you install and load the package, you can conduct a Levene's test. Here, the data for both populations need to be in one array and their assignments into groups needs to be in another array that matches up with the first, entry by entry. The numerical values are in the second column of "data" and their category is in the first column of "data." So we can run Levene's test by typing:
- For the fun of it, let's try some more sophisticated graphics. We are going to try to overlay a normal density over the histogram of the treatment group, say. If you just plot each, you get the plots one after the other:
plot(c(0:100), dnorm(c(0:100), 50, 10), type="l")
(Look up help(plot) to find out more about plotting. The x's are the numbers 0-100 and the y's are the normal density at x where the normal distribution in question has mean 50 and standard deviation 10.)
- To overlay the plots, you have to tell R not to refresh the plot window. Ironically, this is done by setting new=TRUE (to tell the screen that it is a new plot and not to refresh the screen):
plot(c(0:100), dnorm(c(0:100), 50, 10), type="l")
- Now the problem is that one plot overwrote the other plot's lables on the axes. So you want to set both lables to the same thing, both plot scales to the same thing, or to not print them at all. You can read about all the option by reading the help menu for plot and par.
There is another problem that the histogram is plotting frequencies (counts) and not probabilities whereas the normal density is plotting probabilities and not counts. Densities should be used for both so that the area under the curves are both 1. Read up on all the commands but I think the following code works:
hist(treatment, freq=FALSE,xlim=c(0,100), ylim=c(0,0.04),)
plot(c(0:100), dnorm(c(0:100), 50, 10), type="l", xlab="",
ylab="", axes=FALSE, frame=FALSE,
xlim=c(0,100), ylim=c(0,0.04), lwd=2)
- Figure out how to change the plot title and change it something of your choosing.