Si618f08 - week3 qanda - Interspike Codex

Si618f08 - week3 qanda


Q: Any comments on grading? (2008-11-10)

A: I took off a point in HW1 if two lists were not displayed. I took off a point if code was not displayed. For HW1, I took off a maximum of one point even if I found both of these problems. You should look at each other's work. I see numerous exemplary solutions. There are a few things I disagree with. I suggest Manning and Schutze on Information Retrieval for some comments about tf-idf and normalization by dividing document frequency by collection size. I want to reiterate that you need the actual command lines if you want to later repeat this analysis on other data. You should also try to explore if you find problems. One person who dropped the class had trouble retrieving files from the NY Times using curl. I had the same problem, but found it easy to retrieve those files using wget or lynx. I'll demonstrate that in ~/mcq/si618week2/nytimes.

==

Q: Are there any better R tutorials out there? Please? (2007-11-24)

A: Thanks to Lada, we now know about two that have worked for stats students.

the official kickstarting tutorial:

 http://cran.r-project.org/doc/contrib/Lemon-kickstart/index.html

pretty good introductory tutorial: Simple R

 http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf

==

Q: How can I created a sorted bar chart? (2007-11-24)

A: Suppose I have a file containing the following lines of text:


beginning of file named marx


bro

       groucho
       groucho
       groucho
       groucho
       groucho
       groucho
       groucho
       groucho
       groucho
       groucho
       groucho
       groucho
       chico
       chico
       chico
       chico
       chico
       chico
       harpo
       harpo
       harpo
       harpo
       harpo
       harpo
       harpo
       harpo
       harpo
       harpo
       zeppo
       zeppo
       zeppo
       zeppo
       gummo
       gummo
       gummo

end of file named marx


  1. and I read it into R by switching to the directory
  2. where it is and using read.table with header=TRUE
  3. because bro is a header for the column of names

setwd("/Users/mcq/Desktop/") marx<-read.table("marx",header=TRUE)

  1. attach the data so I can work with it

attach(marx)

  1. convert the single column into two columns, one of
  2. labels, one of counts of occurrences of those labels;
  3. by default, the rle() function will name those two
  4. columns values and lengths, respectively

y<-rle(as.vector(bro))

  1. I could just plot these, but the bars would be sorted
  2. alphabetically by the labels in y$values, so I create
  3. a factor based on y$values

y$fac<-factor(y$values)

  1. now i can attach the lattice library to draw a chart,
  2. using the property of a factor that it can be sorted
  3. by another column---you can think of y$fac as a
  4. version of y$values that can be sorted by y$lengths,
  5. and that's exactly what the reorder function does,
  6. reorders y$fac by y$lengths using the sort function

library(lattice) barchart(reorder(y$fac,y$lengths,sort) ~ y$lengths)

==

Examples of lattice graphics by the inventor are at

http://dsarkar.fhcrc.org/lattice/book/figures.html

==

putting some sideways labels on:

midpts <- barplot(1:10, col="grey90", axes=FALSE) axis(2) axis(1, at=midpts, labels=FALSE)

library(gridBase) vps <- baseViewports() pushViewport(vps$inner, vps$figure, vps$plot)

grid.text(c("one", "two", "three", "four", "five",

           "six", "seven", "eight", "nine", "ten"), 
         x=unit(midpts, "native"), y=unit(-1, "lines"),
         just="right", rot=60)

popViewport(3)


==

Paul Murrell explains how to put two plots, one rotated, side by side; note that push.viewport should be replaced by pushViewport and pop.viewport by popViewport

==

myhist <- histogram(rnorm(50))

  1. Make two square regions side by side

push.viewport(viewport(layout=grid.layout(1, 2, respect=TRUE)))

  1. Go to the left region

push.viewport(viewport(layout.pos.col=1))

  1. Draw the histogram in normal orientation

print(myhist, newpage=FALSE) pop.viewport()

  1. Go to the right region then rotate 90 degrees

push.viewport(viewport(layout.pos.col=2), viewport(angle=90))

  1. Draw the histogram (rotated 90 degrees)

print(myhist, newpage=FALSE) pop.viewport(3)

That only works because I made the regions square (so a region is the same size when it is rotated 90 degrees). Here's a slightly more complex example which will work for non-square regions ...

  1. Create a region in the left half of the page

push.viewport(viewport(x=0, width=0.5, just="left"))

  1. Draw the histogram in normal orientation

print(myhist, newpage=FALSE) pop.viewport()

  1. Create a region in the right half of the page
  2. which is rotated 90 degrees

push.viewport(viewport(x=0.75,

                       # Make the rotated width the same as
                       # the height of the page
                       width=grid.convert(unit(1, "npc"), "npc",
                         "y", "dimension", "x", "dimension"),
                       # Make the rotated height half the
                       # width of the page
                       height=grid.convert(unit(0.5, "npc"), "npc",
                         "x", "dimension", "y", "dimension"),
                       angle=90))
  1. Draw the histogram (rotated 90 degrees)

print(myhist, newpage=FALSE) pop.viewport()

> Also, I'm looking for an equivalent to the S-Plus "subplot" command to > insert a kind of "thumbnail" graphic into a bigger one. How is this best > done in R?

You can use par(fig) and/or par(plt) and par(new) to do some things, but these are not terribly helpful for positioning subplots within another plot.

Again, grid viewports can be used to create subregions with a lot of flexibility. There is a package on CRAN called gridBase which allows you to create a region using a grid viewport, then draw a standard plot in that region (with a few restrictions). There is a vignette with the package that tries to explain how it works -- the last example in there embeds some piecharts as subplots within a scatterplot.

=============================================================

Q: Are there any graphics we can do in R on bart without using a graphical interface? (2007-11-21)

A: One of the earliest graphics types in history (discussed in Tufte's Visual Display of Quantitative Information) was the stem-and-leaf plot. Here's an example in R. > library(MASS) > data(VA) > age [1] 69 64 38 63 65 49 69 68 43 70 81 63 63 52 48 61 42 35 63 56 55 67 63 65 46 53 [27] 69 68 43 55 42 64 65 65 55 66 60 67 53 62 67 72 48 68 67 61 60 62 38 50 63 64 [53] 43 34 66 62 52 47 63 68 45 41 66 62 60 66 38 53 37 54 60 48 52 70 50 62 65 58 [79] 62 64 63 58 64 52 35 63 70 51 40 69 36 71 62 60 44 54 66 49 72 68 62 71 70 61 [105] 71 59 67 60 69 57 39 62 50 43 70 66 61 81 58 63 60 62 42 69 63 45 68 39 66 63 [131] 49 64 65 64 67 65 37 > stem(age)

 The decimal point is at the |
 34 | 000
 36 | 000
 38 | 00000
 40 | 00
 42 | 0000000
 44 | 000
 46 | 00
 48 | 000000
 50 | 0000
 52 | 0000000
 54 | 00000
 56 | 00
 58 | 0000
 60 | 00000000000
 62 | 0000000000000000000000
 64 | 00000000000000
 66 | 0000000000000
 68 | 000000000000
 70 | 00000000
 72 | 00
 74 | 
 76 | 
 78 | 
 80 | 00

> # this output is not very compact, we can resize it > stem(age,scale=0.25)

 The decimal point is 1 digit(s) to the right of the |
 3 | 45567788899
 4 | 01222333345567888999
 5 | 0001222233344555678889
 6 | 00000001111222222222233333333333344444445555555666666677777788888899
 7 | 0000011122
 8 | 11

So you should be able to see that there are two 81 year olds, 5 70 year olds, 3 71 year olds, 2 72 year olds, and so on. Clearly, most of the people are in their sixties. The previous plot only told us that there were two people whose age was in the range of 80 to 82, two people in the range of 72 to 74, and so on.

By the way, the VA dataset tells the story of 137 men with lung cancer in a Veteran's Administration cancer treatment trial. The age variable is their age in years when the trial started. The stime variable is the number of days they lived. Here's a stem-and-leaf plot of stime.

> stem(stime,scale=2)

 The decimal point is 1 digit(s) to the right of the |
  0 | 112347778888001223355688899
  2 | 00112445555790011356
  4 | 23458911122234469
  6 | 1323
  8 | 00234770255799
 10 | 003350112778
 12 | 236239
 14 | 034136
 16 | 2247
 18 | 26
 20 | 016
 22 | 811
 24 | 20
 26 | 08
 28 | 37
 30 | 4
 32 | 
 34 | 07
 36 | 8
 38 | 492
 40 | 1
 42 | 
 44 | 
 46 | 7
 48 | 
 50 | 
 52 | 
 54 | 3
 56 | 
 58 | 7
 60 | 
 62 | 
 64 | 
 66 | 
 68 | 
 70 | 
 72 | 
 74 | 
 76 | 
 78 | 
 80 | 
 82 | 
 84 | 
 86 | 
 88 | 
 90 | 
 92 | 
 94 | 
 96 | 
 98 | 19

As you can see, this plot is greatly enlarged by the six individuals who lived more than 400 days. We can make this plot much smaller by restricting the value of the stime variable to values less than 400. This is similar to a SQLite3 statement like

   select stime where stime < 400

> stem(stime[stime<400],scale=2)

 The decimal point is 1 digit(s) to the right of the |
  0 | 112347778888
  1 | 001223355688899
  2 | 0011244555579
  3 | 0011356
  4 | 234589
  5 | 11122234469
  6 | 13
  7 | 23
  8 | 0023477
  9 | 0255799
 10 | 00335
 11 | 0112778
 12 | 236
 13 | 239
 14 | 034
 15 | 136
 16 | 224
 17 | 7
 18 | 26
 19 | 
 20 | 01
 21 | 6
 22 | 8
 23 | 11
 24 | 2
 25 | 0
 26 | 0
 27 | 8
 28 | 37
 29 | 
 30 | 
 31 | 4
 32 | 
 33 | 
 34 | 0
 35 | 7
 36 | 
 37 | 8
 38 | 49
 39 | 2

Note that we could actually include a condition about one of the other variables instead. Unfortunately, this famous trial (Kalbfleisch & Prentice 1980) was not successful, so the only variables that interact significantly with survival time are Karn, which is a score gauging how impaired a person's life is by an illness, where 100 means unimpaired and 0 means totally impaired, and cell, where there are four values representing the presence of four types of cancerous cells. If you limit the plot to one of those four cell types as below, you see a different picture of survival in days, depending on which cell type the person has.

> stem(stime[cell========1],scale=10)

 The decimal point is 1 digit(s) to the right of the |
  0 | 118
  1 | 015
  2 | 55
  3 | 03
  4 | 24
  5 | 
  6 | 
  7 | 2
  8 | 27
  9 | 
 10 | 0
 11 | 0128
 12 | 6
 13 | 
 14 | 4
 15 | 
 16 | 
 17 | 
 18 | 
 19 | 
 20 | 1
 21 | 
 22 | 8
 23 | 1
 24 | 2
 25 | 
 26 | 
 27 | 
 28 | 3
 29 | 
 30 | 
 31 | 4
 32 | 
 33 | 
 34 | 
 35 | 7
 36 | 
 37 | 
 38 | 9
 39 | 
 40 | 
 41 | 1
 42 | 
 43 | 
 44 | 
 45 | 
 46 | 7
 47 | 
 48 | 
 49 | 
 50 | 
 51 | 
 52 | 
 53 | 
 54 | 
 55 | 
 56 | 
 57 | 
 58 | 7
 59 | 
 60 | 
 61 | 
 62 | 
 63 | 
 64 | 
 65 | 
 66 | 
 67 | 
 68 | 
 69 | 
 70 | 
 71 | 
 72 | 
 73 | 
 74 | 
 75 | 
 76 | 
 77 | 
 78 | 
 79 | 
 80 | 
 81 | 
 82 | 
 83 | 
 84 | 
 85 | 
 86 | 
 87 | 
 88 | 
 89 | 
 90 | 
 91 | 
 92 | 
 93 | 
 94 | 
 95 | 
 96 | 
 97 | 
 98 | 
 99 | 19

> stem(stime[cell========2],scale=4)

 The decimal point is 1 digit(s) to the right of the |
  0 | 24778
  1 | 033688
  2 | 0011245579
  3 | 01
  4 | 
  5 | 1124469
  6 | 13
  7 | 
  8 | 07
  9 | 5799
 10 | 3
 11 | 7
 12 | 23
 13 | 9
 14 | 
 15 | 13
 16 | 
 17 | 
 18 | 
 19 | 
 20 | 
 21 | 
 22 | 
 23 | 
 24 | 
 25 | 
 26 | 
 27 | 
 28 | 7
 29 | 
 30 | 
 31 | 
 32 | 
 33 | 
 34 | 
 35 | 
 36 | 
 37 | 
 38 | 4
 39 | 2

> stem(stime[cell========3],scale=3)

 The decimal point is 1 digit(s) to the right of the |
  0 | 3788
  1 | 289
  2 | 4
  3 | 156
  4 | 58
  5 | 12
  6 | 
  7 | 3
  8 | 034
  9 | 025
 10 | 
 11 | 7
 12 | 
 13 | 2
 14 | 0
 15 | 
 16 | 2
 17 | 
 18 | 6

> stem(stime[cell========4],scale=6)

 The decimal point is 1 digit(s) to the right of the |
  1 | 259
  2 | 
  3 | 
  4 | 39
  5 | 23
  6 | 
  7 | 
  8 | 
  9 | 
 10 | 035
 11 | 1
 12 | 
 13 | 3
 14 | 3
 15 | 6
 16 | 24
 17 | 7
 18 | 2
 19 | 
 20 | 0
 21 | 6
 22 | 
 23 | 1
 24 | 
 25 | 0
 26 | 0
 27 | 8
 28 | 
 29 | 
 30 | 
 31 | 
 32 | 
 33 | 
 34 | 0
 35 | 
 36 | 
 37 | 8
 38 | 
 39 | 
 40 | 
 41 | 
 42 | 
 43 | 
 44 | 
 45 | 
 46 | 
 47 | 
 48 | 
 49 | 
 50 | 
 51 | 
 52 | 
 53 | 
 54 | 
 55 | 3

Without any statistical analysis you can tell from these four plots that there is some kind of relationship between the cell type you have and the number of days you might survive.

==

Q: Is there any way I can do

    sort dat | uniq -c

from within R? (2007-11-21)

A: A tip on Nabble suggests

   x<-sort(dat)
   y<-rle(x)
   barplot(y$lengths,names=y$values)

as an alternative in R. Like uniq, rle expects something already sorted. It outputs two columns, lengths and values. The lengths are the counts and the values are the things being counted, so you can barplot the lengths with the values as labels for each bar.

==

Q: How can I obtain back to back histograms in R? (2007-11-20)

A: A very, very simple version is illustrated in the book, R Graphics by Paul Murrell. Just enter this code into a file called figure330 and say source("figure330") at an R prompt.

       groups <- c("cows", "sheep", "horses", 
                   "elephants", "giraffes")
       males <- sample(1:10, 5)
       females <- sample(1:10, 5)
       
       par(mar=c(0.5, 5, 0.5, 1))
       
       plot.new()
       plot.window(xlim=c(-10, 10), ylim=c(-1.5, 5.5))
       
       ticks <- seq(-10, 10, 5)
       y <- 1:5
       h <- 0.2
       
       lines(rep(0, 2), c(-1.5, 5.5), col="grey")
       segments(-10, y, 10, y, lty="dotted")
       rect(-males, y-h, 0, y+h, col="dark grey")
       rect(0, y-h, females, y+h, col="light grey")
       mtext(groups, at=y, adj=1, side=2, las=2)
       par(cex.axis=0.5, mex=0.5)
       axis(1, at=ticks, labels=abs(ticks), pos=0)
       
       tw <- 1.5*strwidth("females")
       rect(-tw, -1-h, 0, -1+h, col="dark grey")
       rect(0, -1-h, tw, -1+h, col="light grey")
       text(0, -1, "males", pos=2)
       text(0, -1, "females", pos=4)
       
       box("inner", col="grey")

==

Q: Once SI Computing globally installs RSQLite, then what do I do? (2007-11-20)

A: Say

   library(RSQLite)

and both RSQLite and any packages on which it depends, such as DBI, will be loaded.

==

Q: How can I tell where packages are from within R? (2007-11-20)

A: Say

   .find.package("MASS")

at an R prompt. This will return the location of that package. You can then tell which of your local packages are being used and which global packages are being used.

==

Q: Why does yours work and mine doesn't? (2007-11-20)

A: It appears that I reflexively set an environment variable called R_LIBS when I installed RSQLite. You probably don't have that set. The most convenient way to set this is to say

  export R_LIBS=/u/Students/myuniqname/lib/R/

in your .bashrc file. Then you can install any package locally in your ~/lib/R directory and have access to it from within R.

==

Q: I can't use SQLite3 from within R. Why? (2007-11-19)

A: Probably because I forgot to ask SI Computing to install RSQLite or something. Sorry if that was the case. It turns out that you can install a copy in your directory by doing the following.

 mkdir -p ~/lib/R
 wget http://cran.r-project.org/src/contrib/RSQLite_0.6-4.tar.gz
 wget http://cran.r-project.org/src/contrib/DBI_0.2-4.tar.gz
 R CMD INSTALL -l ~/lib/R DBI_0.2-4.tar.gz 
 R CMD INSTALL -l ~/lib/R RSQLite_0.6-4.tar.gz 

Note that R depends on DBI to work with any database, so it has to be installed before RSQLite.