Si618f08 - week3 qanda
Q: Any comments on grading? (2008-11-10)
A: I took off a point in HW1 if two lists were not displayed. I took off a point if code was not displayed. For HW1, I took off a maximum of one point even if I found both of these problems. You should look at each other's work. I see numerous exemplary solutions. There are a few things I disagree with. I suggest Manning and Schutze on Information Retrieval for some comments about tf-idf and normalization by dividing document frequency by collection size. I want to reiterate that you need the actual command lines if you want to later repeat this analysis on other data. You should also try to explore if you find problems. One person who dropped the class had trouble retrieving files from the NY Times using curl. I had the same problem, but found it easy to retrieve those files using wget or lynx. I'll demonstrate that in ~/mcq/si618week2/nytimes.
==
Q: Are there any better R tutorials out there? Please? (2007-11-24)
A: Thanks to Lada, we now know about two that have worked for stats students.
the official kickstarting tutorial:
http://cran.r-project.org/doc/contrib/Lemon-kickstart/index.html
pretty good introductory tutorial: Simple R
http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf
==
Q: How can I created a sorted bar chart? (2007-11-24)
A: Suppose I have a file containing the following lines of text:
beginning of file named marx
bro
groucho
groucho
groucho
groucho
groucho
groucho
groucho
groucho
groucho
groucho
groucho
groucho
chico
chico
chico
chico
chico
chico
harpo
harpo
harpo
harpo
harpo
harpo
harpo
harpo
harpo
harpo
zeppo
zeppo
zeppo
zeppo
gummo
gummo
gummo
end of file named marx
- and I read it into R by switching to the directory
- where it is and using read.table with header=TRUE
- because bro is a header for the column of names
setwd("/Users/mcq/Desktop/") marx<-read.table("marx",header=TRUE)
- attach the data so I can work with it
attach(marx)
- convert the single column into two columns, one of
- labels, one of counts of occurrences of those labels;
- by default, the rle() function will name those two
- columns values and lengths, respectively
y<-rle(as.vector(bro))
- I could just plot these, but the bars would be sorted
- alphabetically by the labels in y$values, so I create
- a factor based on y$values
y$fac<-factor(y$values)
- now i can attach the lattice library to draw a chart,
- using the property of a factor that it can be sorted
- by another column---you can think of y$fac as a
- version of y$values that can be sorted by y$lengths,
- and that's exactly what the reorder function does,
- reorders y$fac by y$lengths using the sort function
library(lattice) barchart(reorder(y$fac,y$lengths,sort) ~ y$lengths)
==
Examples of lattice graphics by the inventor are at
http://dsarkar.fhcrc.org/lattice/book/figures.html
==
putting some sideways labels on:
midpts <- barplot(1:10, col="grey90", axes=FALSE) axis(2) axis(1, at=midpts, labels=FALSE)
library(gridBase) vps <- baseViewports() pushViewport(vps$inner, vps$figure, vps$plot)
grid.text(c("one", "two", "three", "four", "five",
"six", "seven", "eight", "nine", "ten"),
x=unit(midpts, "native"), y=unit(-1, "lines"),
just="right", rot=60)
popViewport(3)
==
Paul Murrell explains how to put two plots, one rotated, side by side; note that push.viewport should be replaced by pushViewport and pop.viewport by popViewport
==
myhist <- histogram(rnorm(50))
- Make two square regions side by side
push.viewport(viewport(layout=grid.layout(1, 2, respect=TRUE)))
- Go to the left region
push.viewport(viewport(layout.pos.col=1))
- Draw the histogram in normal orientation
print(myhist, newpage=FALSE) pop.viewport()
- Go to the right region then rotate 90 degrees
push.viewport(viewport(layout.pos.col=2), viewport(angle=90))
- Draw the histogram (rotated 90 degrees)
print(myhist, newpage=FALSE) pop.viewport(3)
That only works because I made the regions square (so a region is the same size when it is rotated 90 degrees). Here's a slightly more complex example which will work for non-square regions ...
- Create a region in the left half of the page
push.viewport(viewport(x=0, width=0.5, just="left"))
- Draw the histogram in normal orientation
print(myhist, newpage=FALSE) pop.viewport()
- Create a region in the right half of the page
- which is rotated 90 degrees
push.viewport(viewport(x=0.75,
# Make the rotated width the same as
# the height of the page
width=grid.convert(unit(1, "npc"), "npc",
"y", "dimension", "x", "dimension"),
# Make the rotated height half the
# width of the page
height=grid.convert(unit(0.5, "npc"), "npc",
"x", "dimension", "y", "dimension"),
angle=90))
- Draw the histogram (rotated 90 degrees)
print(myhist, newpage=FALSE) pop.viewport()
> Also, I'm looking for an equivalent to the S-Plus "subplot" command to > insert a kind of "thumbnail" graphic into a bigger one. How is this best > done in R?
You can use par(fig) and/or par(plt) and par(new) to do some things, but these are not terribly helpful for positioning subplots within another plot.
Again, grid viewports can be used to create subregions with a lot of flexibility. There is a package on CRAN called gridBase which allows you to create a region using a grid viewport, then draw a standard plot in that region (with a few restrictions). There is a vignette with the package that tries to explain how it works -- the last example in there embeds some piecharts as subplots within a scatterplot.
=============================================================
Q: Are there any graphics we can do in R on bart without using a graphical interface? (2007-11-21)
A: One of the earliest graphics types in history (discussed in Tufte's Visual Display of Quantitative Information) was the stem-and-leaf plot. Here's an example in R. > library(MASS) > data(VA) > age [1] 69 64 38 63 65 49 69 68 43 70 81 63 63 52 48 61 42 35 63 56 55 67 63 65 46 53 [27] 69 68 43 55 42 64 65 65 55 66 60 67 53 62 67 72 48 68 67 61 60 62 38 50 63 64 [53] 43 34 66 62 52 47 63 68 45 41 66 62 60 66 38 53 37 54 60 48 52 70 50 62 65 58 [79] 62 64 63 58 64 52 35 63 70 51 40 69 36 71 62 60 44 54 66 49 72 68 62 71 70 61 [105] 71 59 67 60 69 57 39 62 50 43 70 66 61 81 58 63 60 62 42 69 63 45 68 39 66 63 [131] 49 64 65 64 67 65 37 > stem(age)
The decimal point is at the |
34 | 000 36 | 000 38 | 00000 40 | 00 42 | 0000000 44 | 000 46 | 00 48 | 000000 50 | 0000 52 | 0000000 54 | 00000 56 | 00 58 | 0000 60 | 00000000000 62 | 0000000000000000000000 64 | 00000000000000 66 | 0000000000000 68 | 000000000000 70 | 00000000 72 | 00 74 | 76 | 78 | 80 | 00
> # this output is not very compact, we can resize it > stem(age,scale=0.25)
The decimal point is 1 digit(s) to the right of the |
3 | 45567788899 4 | 01222333345567888999 5 | 0001222233344555678889 6 | 00000001111222222222233333333333344444445555555666666677777788888899 7 | 0000011122 8 | 11
So you should be able to see that there are two 81 year olds, 5 70 year olds, 3 71 year olds, 2 72 year olds, and so on. Clearly, most of the people are in their sixties. The previous plot only told us that there were two people whose age was in the range of 80 to 82, two people in the range of 72 to 74, and so on.
By the way, the VA dataset tells the story of 137 men with lung cancer in a Veteran's Administration cancer treatment trial. The age variable is their age in years when the trial started. The stime variable is the number of days they lived. Here's a stem-and-leaf plot of stime.
> stem(stime,scale=2)
The decimal point is 1 digit(s) to the right of the |
0 | 112347778888001223355688899 2 | 00112445555790011356 4 | 23458911122234469 6 | 1323 8 | 00234770255799 10 | 003350112778 12 | 236239 14 | 034136 16 | 2247 18 | 26 20 | 016 22 | 811 24 | 20 26 | 08 28 | 37 30 | 4 32 | 34 | 07 36 | 8 38 | 492 40 | 1 42 | 44 | 46 | 7 48 | 50 | 52 | 54 | 3 56 | 58 | 7 60 | 62 | 64 | 66 | 68 | 70 | 72 | 74 | 76 | 78 | 80 | 82 | 84 | 86 | 88 | 90 | 92 | 94 | 96 | 98 | 19
As you can see, this plot is greatly enlarged by the six individuals who lived more than 400 days. We can make this plot much smaller by restricting the value of the stime variable to values less than 400. This is similar to a SQLite3 statement like
select stime where stime < 400
> stem(stime[stime<400],scale=2)
The decimal point is 1 digit(s) to the right of the |
0 | 112347778888 1 | 001223355688899 2 | 0011244555579 3 | 0011356 4 | 234589 5 | 11122234469 6 | 13 7 | 23 8 | 0023477 9 | 0255799 10 | 00335 11 | 0112778 12 | 236 13 | 239 14 | 034 15 | 136 16 | 224 17 | 7 18 | 26 19 | 20 | 01 21 | 6 22 | 8 23 | 11 24 | 2 25 | 0 26 | 0 27 | 8 28 | 37 29 | 30 | 31 | 4 32 | 33 | 34 | 0 35 | 7 36 | 37 | 8 38 | 49 39 | 2
Note that we could actually include a condition about one of the other variables instead. Unfortunately, this famous trial (Kalbfleisch & Prentice 1980) was not successful, so the only variables that interact significantly with survival time are Karn, which is a score gauging how impaired a person's life is by an illness, where 100 means unimpaired and 0 means totally impaired, and cell, where there are four values representing the presence of four types of cancerous cells. If you limit the plot to one of those four cell types as below, you see a different picture of survival in days, depending on which cell type the person has.
> stem(stime[cell========1],scale=10)
The decimal point is 1 digit(s) to the right of the |
0 | 118 1 | 015 2 | 55 3 | 03 4 | 24 5 | 6 | 7 | 2 8 | 27 9 | 10 | 0 11 | 0128 12 | 6 13 | 14 | 4 15 | 16 | 17 | 18 | 19 | 20 | 1 21 | 22 | 8 23 | 1 24 | 2 25 | 26 | 27 | 28 | 3 29 | 30 | 31 | 4 32 | 33 | 34 | 35 | 7 36 | 37 | 38 | 9 39 | 40 | 41 | 1 42 | 43 | 44 | 45 | 46 | 7 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 7 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 19
> stem(stime[cell========2],scale=4)
The decimal point is 1 digit(s) to the right of the |
0 | 24778 1 | 033688 2 | 0011245579 3 | 01 4 | 5 | 1124469 6 | 13 7 | 8 | 07 9 | 5799 10 | 3 11 | 7 12 | 23 13 | 9 14 | 15 | 13 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 7 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 4 39 | 2
> stem(stime[cell========3],scale=3)
The decimal point is 1 digit(s) to the right of the |
0 | 3788 1 | 289 2 | 4 3 | 156 4 | 58 5 | 12 6 | 7 | 3 8 | 034 9 | 025 10 | 11 | 7 12 | 13 | 2 14 | 0 15 | 16 | 2 17 | 18 | 6
> stem(stime[cell========4],scale=6)
The decimal point is 1 digit(s) to the right of the |
1 | 259 2 | 3 | 4 | 39 5 | 23 6 | 7 | 8 | 9 | 10 | 035 11 | 1 12 | 13 | 3 14 | 3 15 | 6 16 | 24 17 | 7 18 | 2 19 | 20 | 0 21 | 6 22 | 23 | 1 24 | 25 | 0 26 | 0 27 | 8 28 | 29 | 30 | 31 | 32 | 33 | 34 | 0 35 | 36 | 37 | 8 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 3
Without any statistical analysis you can tell from these four plots that there is some kind of relationship between the cell type you have and the number of days you might survive.
==
Q: Is there any way I can do
sort dat | uniq -c
from within R? (2007-11-21)
A: A tip on Nabble suggests
x<-sort(dat) y<-rle(x) barplot(y$lengths,names=y$values)
as an alternative in R. Like uniq, rle expects something already sorted. It outputs two columns, lengths and values. The lengths are the counts and the values are the things being counted, so you can barplot the lengths with the values as labels for each bar.
==
Q: How can I obtain back to back histograms in R? (2007-11-20)
A: A very, very simple version is illustrated in the book, R Graphics by Paul Murrell. Just enter this code into a file called figure330 and say source("figure330") at an R prompt.
groups <- c("cows", "sheep", "horses",
"elephants", "giraffes")
males <- sample(1:10, 5)
females <- sample(1:10, 5)
par(mar=c(0.5, 5, 0.5, 1))
plot.new()
plot.window(xlim=c(-10, 10), ylim=c(-1.5, 5.5))
ticks <- seq(-10, 10, 5)
y <- 1:5
h <- 0.2
lines(rep(0, 2), c(-1.5, 5.5), col="grey")
segments(-10, y, 10, y, lty="dotted")
rect(-males, y-h, 0, y+h, col="dark grey")
rect(0, y-h, females, y+h, col="light grey")
mtext(groups, at=y, adj=1, side=2, las=2)
par(cex.axis=0.5, mex=0.5)
axis(1, at=ticks, labels=abs(ticks), pos=0)
tw <- 1.5*strwidth("females")
rect(-tw, -1-h, 0, -1+h, col="dark grey")
rect(0, -1-h, tw, -1+h, col="light grey")
text(0, -1, "males", pos=2)
text(0, -1, "females", pos=4)
box("inner", col="grey")
==
Q: Once SI Computing globally installs RSQLite, then what do I do? (2007-11-20)
A: Say
library(RSQLite)
and both RSQLite and any packages on which it depends, such as DBI, will be loaded.
==
Q: How can I tell where packages are from within R? (2007-11-20)
A: Say
.find.package("MASS")
at an R prompt. This will return the location of that package. You can then tell which of your local packages are being used and which global packages are being used.
==
Q: Why does yours work and mine doesn't? (2007-11-20)
A: It appears that I reflexively set an environment variable called R_LIBS when I installed RSQLite. You probably don't have that set. The most convenient way to set this is to say
export R_LIBS=/u/Students/myuniqname/lib/R/
in your .bashrc file. Then you can install any package locally in your ~/lib/R directory and have access to it from within R.
==
Q: I can't use SQLite3 from within R. Why? (2007-11-19)
A: Probably because I forgot to ask SI Computing to install RSQLite or something. Sorry if that was the case. It turns out that you can install a copy in your directory by doing the following.
mkdir -p ~/lib/R wget http://cran.r-project.org/src/contrib/RSQLite_0.6-4.tar.gz wget http://cran.r-project.org/src/contrib/DBI_0.2-4.tar.gz R CMD INSTALL -l ~/lib/R DBI_0.2-4.tar.gz R CMD INSTALL -l ~/lib/R RSQLite_0.6-4.tar.gz
Note that R depends on DBI to work with any database, so it has to be installed before RSQLite.
