Monday, February 13, 2012

MIT OpenCourseWare statistics class and GNU R

It has been a long time since I worked through any serious statistical problems. To update my skills on my own time, I am participating in the MIT OpenCourseWare undergraduate-level statistics class, but with a twist: I will use GNU R and its packages for the exercises in the book. I have been using GNU R on and off for my own purposes, so this gives me a good opportunity to build skills with GNU R while strengthening my statistical skills.

The first chapter in the course book covers Venn diagrams. While simple Venn diagrams are easy to produce, readable complex ones require some artistic skill or the help of a program designed to make them. GNU R seems to have two popular packages for this task: VennEuler and VennDiagram.

VennEuler is good for getting started with semi-simple diagrams. You will note that the name mentions both Venn and Euler diagrams. The two are similar; the main difference is how a diagram of three or more overlapping circles handles a region that contains no data. BMC Bioinformatics published an example comparing the two kinds of diagram, shown below.

Note that figure A (Euler diagram) minimizes overlap, so the region that holds no shared data does not appear. Figure B (traditional Venn diagram) draws every possible overlap, so it must show a region of zero or null data, because no elements are common to all three sets.
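
To make the difference concrete, here is a minimal base R sketch that counts the elements in every region of three sets. The sets A, B, and C and the helper function region_counts are made up for illustration; they do not come from the course book or from the BMC Bioinformatics figure. A Venn diagram would draw every region in the first printout, including the empty one, while an Euler diagram would draw only the regions in the second.

```r
# Hypothetical example data: no element belongs to all three sets.
sets <- list(
  A = c(1, 2, 3, 4),
  B = c(3, 4, 5, 6),
  C = c(1, 6, 7)
)

# Count the elements in every region of the diagram.
# A region is identified by the exact combination of sets an element belongs to.
region_counts <- function(sets) {
  universe <- sort(unique(unlist(sets)))
  # membership[i, j] is TRUE when element i is in set j
  membership <- sapply(sets, function(s) universe %in% s)
  # Label each element by the sets that contain it, e.g. "A&B"
  labels <- apply(membership, 1, function(row) {
    paste(names(sets)[row], collapse = "&")
  })
  # List every possible combination so that empty regions are reported as 0
  all_regions <- unlist(lapply(seq_along(sets), function(k) {
    combn(names(sets), k, FUN = paste, collapse = "&")
  }))
  counts <- table(factor(labels, levels = all_regions))
  setNames(as.integer(counts), all_regions)
}

counts <- region_counts(sets)

# Regions a Venn diagram draws (all of them, empty or not)
print(counts)

# Regions an Euler diagram draws (only the non-empty ones)
print(counts[counts > 0])
```

With this data the "A&B&C" region is the only empty one, which is the situation shown in the figure above.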

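The following GNU R code for a simple Venn diagram

    library(venneuler)                      # load the package first

    # two sets with a shared region
    v <- venneuler(c(A=200, B=200, "A&B"=100))
    v$labels <- c("Green", "Blue")          # replace the default labels A and B
    plot(v)                                 # draw the diagram before annotating it
    text(.5, .6, "my text here")            # add free text on top of the plot

produces the following Venn diagram.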
While VennEuler is good for simple Venn diagrams, the package VennDiagram gives the user greater control over the diagrams. Here is an example of a complex Venn diagram from BMC Bioinformatics.
The code for this diagram follows:

    library(VennDiagram)

    venn.plot <- venn.diagram(
        # the four sets to compare
        x = list(
            I = c(1:60, 61:105, 106:140, 141:160, 166:175, 176:180, 181:205, 206:220),
            IV = c(531:605, 476:530, 336:375, 376:405, 181:205, 206:220, 166:175, 176:180),
            II = c(61:105, 106:140, 181:205, 206:220, 221:285, 286:335, 336:375, 376:405),
            III = c(406:475, 286:335, 106:140, 141:160, 166:175, 181:205, 336:375, 476:530)
        ),
        # output file for the finished diagram
        filename = "1D-quadruple_Venn.tiff",
        # outline and fill of the four shapes
        col = "black",
        lty = "dotted",
        lwd = 4,
        fill = c("cornflowerblue", "green", "yellow", "darkorchid1"),
        alpha = 0.50,
        # colour and font of the count shown in each region
        label.col = c("orange", "white", "darkorchid4", "white", "white", "white", "white", "white", "darkblue", "white", "white", "white", "white", "darkgreen", "white"),
        cex = 2.5,
        fontfamily = "serif",
        fontface = "bold",
        # colour and font of the set names
        cat.col = c("darkblue", "darkgreen", "orange", "darkorchid4"),
        cat.cex = 2.5,
        cat.fontfamily = "serif"
    )
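
Before plotting, it can help to check the sets themselves. The base R sketch below rebuilds the four sets from the listing above and counts the size of each set, the pairwise overlaps, and the elements shared by all four. The variable names set_sizes, pairwise, and common are my own, chosen only for illustration, and nothing here depends on the VennDiagram package.

```r
# The same four sets that are passed to the diagram code above
x <- list(
  I   = c(1:60, 61:105, 106:140, 141:160, 166:175, 176:180, 181:205, 206:220),
  IV  = c(531:605, 476:530, 336:375, 376:405, 181:205, 206:220, 166:175, 176:180),
  II  = c(61:105, 106:140, 181:205, 206:220, 221:285, 286:335, 336:375, 376:405),
  III = c(406:475, 286:335, 106:140, 141:160, 166:175, 181:205, 336:375, 476:530)
)

# Number of elements in each set
set_sizes <- sapply(x, length)
print(set_sizes)

# Matrix of pairwise overlaps: entry [a, b] is the size of a intersect b
pairwise <- sapply(x, function(a) sapply(x, function(b) length(intersect(a, b))))
print(pairwise)

# Elements that belong to all four sets
common <- Reduce(intersect, x)
print(length(common))
```

Comparing these counts with the numbers printed in the diagram is a quick way to catch a mistyped range.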

After reviewing the sample code and the available documentation for both packages, I found that the VennDiagram package offers a larger set of arguments for fine-grained control of the diagram. As I work through the exercises in the statistics book, I will try a mix of diagrams from each package and post examples of what I learned from each chapter and from using both GNU R packages.
