Biostatistics Lab 3

In this lab, we will learn how to use R to conduct an ANOVA analysis and will simulate data to explore the relationship between balance and test correctness (retaining the correct Type I error) when the equal variance assumption is violated.
  1. Find R. You might try RStudio this time. http://rstudio.org/ RStudio does not eliminate the need to program in R but it does organize help menus, graphics, your code history, and other things nicely.

    Deducer provides a graphical user interface which cuts down on your programming requirements but it is not installed on the STC computers. http://www.deducer.org/pmwiki/pmwiki.php?n=Main.DeducerManual

  2. We will be simulating data for the ANOVA analysis. Here is basic simulation code that generates random normal data from 3 populations with the same mean, variance and sample size. There should only be a significant ANOVA F result 5% of the time. Read through the code and look up any command you do not understand by using the help command in R.
    nx = 20
    ny = 20
    nz = 20
    sigma_x = 1
    sigma_y = 1
    sigma_z = 1
    x <- rnorm(nx, 0, sigma_x)
    y <- rnorm(ny, 0, sigma_y)
    z <- rnorm(nz, 0, sigma_z)
    data <- c(x, y, z)
    pop <- c(sample('x', nx, replace=TRUE),
    sample('y', ny, replace=TRUE),
    sample('z', nz, replace=TRUE))
    
    So the data is in "data" and the population from which each point came is in "pop." An ANOVA is a linear model. There are different ways to see the output, each giving slightly different information. Look at the output from each of the following lines of code:
    lm(data~as.factor(pop))
    anova(lm(data~as.factor(pop)))
    summary(lm(data~as.factor(pop)))
    summary(lm(data~as.factor(pop)))$fstatistic
    summary(lm(data~as.factor(pop)))$fstatistic[1]
    
  3. Run your first simulation. You should get a simulated Type I error of about 5% because all ANOVA assumptions are true and the null hypothesis that there is no difference between the means of the populations is also true. We will count the number of times the simulation gives an F statistic that is larger than it should be at the 5% level. This uses R to find the cut-off level using the quantile function qF - we find the value that puts 5% in the upper tail. Here is the basic code. Again, read through the code and look up any command you do not understand. Comments in R are preceeded with a pound sign.
    ## Put it in a Loop. Count how many F past reject level
    observed_TypeI_error = 0
    loop=5000
    nx = 20  ## to be varied
    ny = 20
    nz = 20  ## to be varied
    sigma_x = 1  ## to be varied
    sigma_y = 1
    sigma_z = 1  ## to be varied
    reject <- qf(0.05, 2, nx+ny+nz-3, lower.tail = FALSE)
    for(i in 1:loop){
    x <- rnorm(nx, 0, sigma_x)
    y <- rnorm(ny, 0, sigma_y)
    z <- rnorm(nz, 0, sigma_z)
    data <- c(x, y, z)
    pop <- c(sample('x', nx, replace=TRUE),
    sample('y', ny, replace=TRUE),
    sample('z', nz, replace=TRUE))
    F = summary(lm(data~as.factor(pop)))$fstatistic[1]
    if(F > reject){observed_TypeI_error = observed_TypeI_error+1}
    }
    observed_TypeI_error = observed_TypeI_error/loop
    observed_TypeI_error
    
    
  4. Now run your own simulation study. (A) Try changing nx to 10 and nz to 50. (B) Then also change sigma_x to 4 and sigma_z to 0.25. (C) Then make the sample sizes all 20 again while leaving the variances unequal. What happens each time? Lack of balance should not be a problem if the variances are equal. If the variances are unequal but the design is balanced there should also be no problem. There should be a problem in case (B). Am I right?
  5. In the example above, the standard deviations were in proportion 4:1:1/4 and the sample sizes where in proportion 10:20:50. Try reversing one of those. What if the standard deviations do not vary quite so much but instead are in proportion 1.5:1:0.8, say?
  6. You can also see if the nonparametric Kruskal Wallis test is a cure. You perform the Kruskal Wallis test by typing something like:
    kruskal.test(data~as.factor(pop))
    
    And, unlike the ANOVA, you can pull off the p-value of the result by typing
    kruskal.test(data~as.factor(pop))$p.value
    
  7. Now, revise the code by counting the number of times the p-value is less than 5% (0.05). Do it for all the cases we considered before. Does a non-parametric test cure problems with unequal variance and lack of balance? It is designed to cure problems with non-normality.