SI 618 HW 6: Homework Assignment - Interspike Codex

A Study of Document Clustering

An effort to understand how clustering can reveal patterns of similarity and dissimilarity within large sets of data.

Abstract

Create a matrix showing the dissimilarity between documents, based on the terms appearing in those documents.

Use the matrix to create a dendrogram in an attempt to reveal the degree of clustering and/or dissimilarity in the data.

Data Used

The data used started with the 243 web pages fetched from Herman Miller. Instead of using the original phrases extracted by Monty Lingua from the converted text, the text was cleaned using the Perl script from homework 5 and these cleaned files were passed through the Monty Lingua extract_lemmatized.py Python script.

Diary of Actions

In order to get workable data, I first extract phrases from files cleaned using my Perl script from homework 5.

I start with an empty database and populate its tables from two delimited data files: one is an index of the documents originally fetched, the other holds the phrases extracted by Monty Lingua.

Since my goal is to create a file containing document IDs and term IDs, I need IDs for both, so I create tables with autoincrement columns and then populate them from the data files.

First I create a table with an autoincrement ID column:

create table doc_index 
  (id integer primary key autoincrement, label text, url text);

I set the separator, create another table, and import the document labels and URLs:

.separator \t
create table fetchdocsindex (label text, url text);
.import fetchdocsindex.txt fetchdocsindex

In order to get the documents into the table with the auto ID, I use an insert statement whose values come from a select. Note the null placement that allows the autonumber to do its thing:

insert into doc_index select null, label, url from fetchdocsindex;
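The null placeholder can be tried out on a throwaway database. The following is a minimal sketch using Python's built-in sqlite3 module with an in-memory database; the sample rows and the function name `load_with_auto_ids` are made up for illustration and are not taken from the real fetchdocsindex.txt.

```python
import sqlite3

def load_with_auto_ids(rows):
    """Copy (label, url) rows into a table whose id column is filled in by SQLite."""
    con = sqlite3.connect(":memory:")
    con.execute("create table doc_index "
                "(id integer primary key autoincrement, label text, url text)")
    con.execute("create table fetchdocsindex (label text, url text)")
    con.executemany("insert into fetchdocsindex values (?, ?)", rows)
    # NULL in the id position tells SQLite to generate the next id itself.
    con.execute("insert into doc_index select null, label, url from fetchdocsindex")
    result = con.execute("select id, label from doc_index order by id").fetchall()
    con.close()
    return result

# Hypothetical sample rows, for illustration only.
sample = [("doc_a", "http://example.com/a"), ("doc_b", "http://example.com/b")]
print(load_with_auto_ids(sample))
```

Each imported row ends up with its own integer id, which is what the later ID-pair file relies on.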

I then create a table to hold the phrases, import the file generated by extract_lemmatized.py, and populate a table with an auto ID field as above:

create table lemmatized_cleaned (label text, type text, phrase text);
.import lemmatized_cleaned.txt lemmatized_cleaned
create table phrase_index (id integer primary key autoincrement, phrase text);
insert into phrase_index 
   select distinct null, phrase from lemmatized_cleaned;
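The `select distinct null, phrase` form works because SQLite treats NULL values as equal when it looks for duplicate rows, so the distinct effectively applies to the phrase alone. A minimal sketch on an in-memory database, with made-up phrases and a hypothetical helper name `build_phrase_index`:

```python
import sqlite3

def build_phrase_index(phrases):
    """Build phrase_index from raw phrases, giving one id per distinct phrase."""
    con = sqlite3.connect(":memory:")
    con.execute("create table lemmatized_cleaned (label text, type text, phrase text)")
    # The label and type values are placeholders; only the phrase matters here.
    con.executemany("insert into lemmatized_cleaned values ('doc', 'x', ?)",
                    [(p,) for p in phrases])
    con.execute("create table phrase_index "
                "(id integer primary key autoincrement, phrase text)")
    con.execute("insert into phrase_index "
                "select distinct null, phrase from lemmatized_cleaned")
    rows = con.execute("select id, phrase from phrase_index order by id").fetchall()
    con.close()
    return rows

print(build_phrase_index(["chair", "desk", "chair"]))
```

Note that an empty phrase in the input would also receive an id of its own, so it may be worth checking for one at this stage.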

Now that the tables are populated with data, I can write a SQL statement that joins them and outputs ID values for documents and terms. I try it:

select distinct d.id, p.id 
  from doc_index d, phrase_index p, lemmatized_cleaned l 
  where d.label = l.label and l.phrase = p.phrase;

After waiting over a half hour, I decide to take a different approach.
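The diary does not establish why the three-way join was so slow. One plausible cause is that none of the text columns used in the join conditions had an index, which can force the database to scan whole tables repeatedly. The sketch below shows how such indexes could be added; the index names are invented, the sample rows are made up, and whether this would have rescued the original query is an assumption that was not tested.

```python
import sqlite3

def joined_ids(con):
    """Run the three-way join from the diary and return sorted (doc id, phrase id) pairs."""
    return con.execute(
        "select distinct d.id, p.id "
        "from doc_index d, phrase_index p, lemmatized_cleaned l "
        "where d.label = l.label and l.phrase = p.phrase "
        "order by d.id, p.id").fetchall()

def add_join_indexes(con):
    """Index the columns used in the join conditions (index names are hypothetical)."""
    con.execute("create index if not exists idx_doc_label on doc_index(label)")
    con.execute("create index if not exists idx_phrase_text on phrase_index(phrase)")
    con.execute("create index if not exists idx_lem_label on lemmatized_cleaned(label)")

con = sqlite3.connect(":memory:")
con.executescript("""
create table doc_index (id integer primary key autoincrement, label text, url text);
create table phrase_index (id integer primary key autoincrement, phrase text);
create table lemmatized_cleaned (label text, type text, phrase text);
insert into doc_index (label, url) values ('doc_a', 'u1');
insert into doc_index (label, url) values ('doc_b', 'u2');
insert into phrase_index (phrase) values ('chair');
insert into phrase_index (phrase) values ('desk');
insert into lemmatized_cleaned values ('doc_a', 'NP', 'chair');
insert into lemmatized_cleaned values ('doc_a', 'NP', 'chair');
insert into lemmatized_cleaned values ('doc_b', 'NP', 'desk');
""")
before = joined_ids(con)
add_join_indexes(con)
after = joined_ids(con)
print(before, after)
```

Indexes change only how the rows are found, not which rows come back, so the result of the join is the same before and after.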

I create and populate a table, hoping the reduced join will improve performance:

create table doc_phrases as 
   select 
     d.id id, d.label doc_label, l.phrase phrase 
   from 
     doc_index d, lemmatized_cleaned l 
   where 
     d.label = l.label;

Then I try again, writing the tab delimited IDs to an output file:

.separator \t
.output docid-termid.txt
select d.id, p.id 
  from 
   doc_phrases d, phrase_index p 
  where 
   d.phrase = p.phrase;

This is much faster and produces a file with 21,455 lines.

Upon inspection, I find a fair number of duplicates and realize I should use a distinct in my query to eliminate them:

select 
  distinct d.id, p.id 
from 
  doc_phrases d, phrase_index p 
where 
  d.phrase = p.phrase;

This works fine and I end up with a file containing 18,040 lines.

Calling dissim

Before I pass my document and term IDs to the dissim program, I verify my counts:

select count(*) from doc_index;
243
select count(*) from phrase_index;
7104

Here is my call to dissim that produces a file containing a dissimilarity matrix:

./dissim -r 243 -c 7104 < docid-termid.txt > dissim_matrix.txt
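The write-up does not say which measure the dissim program computes. To give a rough idea of what a document-by-document dissimilarity matrix built from ID pairs can look like, here is a sketch that assumes Jaccard dissimilarity between the sets of term IDs of two documents. That choice, the function names, and the sample pairs are all assumptions made for illustration; the real dissim may work differently.

```python
def read_pairs(lines):
    """Parse tab-delimited 'doc_id<TAB>term_id' lines into {doc_id: set of term ids}."""
    terms_by_doc = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc_id, term_id = (int(x) for x in line.split("\t"))
        terms_by_doc.setdefault(doc_id, set()).add(term_id)
    return terms_by_doc

def jaccard_dissimilarity(a, b):
    """1 - |intersection| / |union|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def dissimilarity_matrix(terms_by_doc, n_docs):
    """Square matrix for document ids 1..n_docs; documents with no terms get an empty set."""
    sets = [terms_by_doc.get(i, set()) for i in range(1, n_docs + 1)]
    return [[jaccard_dissimilarity(a, b) for b in sets] for a in sets]

# Made-up pairs: documents 1 and 2 share term 2, document 3 shares nothing.
pairs = ["1\t1", "1\t2", "2\t2", "2\t3", "3\t4"]
matrix = dissimilarity_matrix(read_pairs(pairs), 3)
for row in matrix:
    print(["%.2f" % value for value in row])
```

Under this assumed measure, 0 means two documents use exactly the same terms and 1 means they share none.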

Using R, I enter the following from an example provided in class (Note the colorFct.R file was fetched from here):

setwd("H:/public-master/projects/si618f08/week6/hw")
source("my.colorFct.R")
rawmatr <- read.table("dissim_matrix.txt",header=FALSE)
matr <- as.matrix(rawmatr)
scalematr <- t(scale(t(matr)))
hr <- hclust(as.dist(1-cor(t(scalematr),method="pearson")),method="complete")
library(lattice)
library(stats)
as.dendrogram(hr)
plot(hr)
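The key step in the R commands is the distance handed to hclust: one minus the Pearson correlation between rows of the matrix. The sketch below spells that calculation out in plain Python on a tiny made-up matrix; the function names are hypothetical and it is meant only to show what the expression computes, not to replace the R code.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant row has no defined correlation")
    return cov / sqrt(var_x * var_y)

def correlation_distance(rows):
    """Matrix of 1 - correlation between every pair of rows."""
    return [[1.0 - pearson(a, b) for b in rows] for a in rows]

# Made-up rows: the second rises with the first, the third falls as it rises.
rows = [[1, 2, 3], [2, 4, 6], [3, 2, 1]]
distances = correlation_distance(rows)
for row in distances:
    print(["%.2f" % value for value in row])
```

The distance ranges from 0 (rows rise and fall together) to 2 (rows move in exactly opposite directions). Because the Pearson correlation of two rows does not change when each row is centered and divided by its own standard deviation, the row-scaling step should not alter these distances.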

The First Dendrogram

The first dendrogram is a splintered mess and doesn't tell me much. Perhaps the short lines and fractal-like nature mean the documents aren't similar?

Figure 1

The First Heatmap

I continue in R, using the following to produce a heatmap:

hr <- hclust(as.dist(1-cor(t(scalematr),method="spearman")),method="complete")
heatmap(matr,Rowv=as.dendrogram(hr), 
  Colv=as.dendrogram(hr), 
  col=my.colorFct(),
  scale="row")

The resulting heatmap doesn't seem to distinguish much...and the clustering around the axis is curious.

Figure 2

A Different Approach

At this point, not being very happy with the dendrogram and heatmap produced, I study the class notes and the professor's Q&A file, and come up with something else to try.

It is suggested that terms appearing in only one document be filtered out. Hoping that this might improve the clustering, I give it a shot.

To shorten processing time, I output the data I want to a file, then import it as table:

.output phrase_doc_counts.txt
.separator \t
select p.id, p.phrase, count(d.doc_label) 
  from 
    phrase_index p, doc_phrases d 
  where 
    p.phrase = d.phrase group by p.phrase;
create table phrase_doc_counts 
  (phrase_id integer, phrase text, count integer);
.import phrase_doc_counts.txt phrase_doc_counts
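One thing worth checking here: `count(d.doc_label)` counts rows in doc_phrases, not documents. If a phrase occurs several times in the same document (the duplicate ID pairs seen earlier suggest this can happen), a later filter on the count may keep phrases that appear in only one document. The sketch below compares the two counts on made-up rows; `count(distinct ...)` is offered as a possible alternative, not as what was actually run.

```python
import sqlite3

def phrase_counts(rows):
    """Return (phrase, row count, distinct document count) for each phrase."""
    con = sqlite3.connect(":memory:")
    con.execute("create table doc_phrases (id integer, doc_label text, phrase text)")
    con.executemany("insert into doc_phrases values (?, ?, ?)", rows)
    result = con.execute(
        "select phrase, count(doc_label), count(distinct doc_label) "
        "from doc_phrases group by phrase order by phrase").fetchall()
    con.close()
    return result

# Made-up rows: 'chair' occurs four times in one document, 'desk' once in each of two.
sample = [
    (1, "doc_a", "chair"), (1, "doc_a", "chair"),
    (1, "doc_a", "chair"), (1, "doc_a", "chair"),
    (2, "doc_b", "desk"), (3, "doc_c", "desk"),
]
print(phrase_counts(sample))
```

In this example a `count > 3` filter on the row count would keep 'chair', which occurs in a single document, and drop 'desk', which occurs in two.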

Then I regenerate the data file for dissim:

.output docid-termid_02.txt
select 
  distinct d.id, p.id 
from 
  phrase_doc_counts pdc, doc_phrases d, phrase_index p 
where 
  pdc.count > 3 
and pdc.phrase = p.phrase 
and pdc.phrase = d.phrase;

The dissimilarity matrix is then regenerated:

./dissim -r 243 -c 7103 < docid-termid_02.txt > dissim_matrix2.txt

Note: The column count is one less because I eliminated a phrase that was empty.
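Dropping a phrase does not renumber the remaining phrase IDs, so the largest term ID in the pairs file could still be 7104 even though `-c 7103` is passed. Whether that matters depends on how dissim treats IDs, which the write-up does not say. The sketch below assumes IDs are expected to be 1-based and no larger than the `-r` and `-c` values; the function name and sample lines are hypothetical.

```python
def check_pairs(lines, n_rows, n_cols):
    """Return a list of problems found in tab-delimited 'doc_id<TAB>term_id' lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            problems.append("line %d: expected 2 tab-separated fields" % number)
            continue
        try:
            doc_id, term_id = int(fields[0]), int(fields[1])
        except ValueError:
            problems.append("line %d: non-integer id" % number)
            continue
        # Assumption: ids start at 1 and must not exceed the declared sizes.
        if not 1 <= doc_id <= n_rows:
            problems.append("line %d: doc id %d outside 1..%d" % (number, doc_id, n_rows))
        if not 1 <= term_id <= n_cols:
            problems.append("line %d: term id %d outside 1..%d" % (number, term_id, n_cols))
    return problems

# Made-up lines: the second one uses a term id above the declared column count.
print(check_pairs(["1\t10\n", "2\t7104\n"], n_rows=243, n_cols=7103))
```

In real use the lines would come from opening the pairs file, for example `check_pairs(open("docid-termid_02.txt"), 243, 7103)`.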

Massaging the commands I used earlier in R to use the newer matrix data file, I end up with a dendrogram and heatmap not unlike the first pass.

At this point, I just want a dendrogram that looks less like a fractal and a heatmap that is more interesting than the heating element of an electric stove.

I change the correlation method for the dendrogram to use "Spearman" instead of "Pearson" and the dendrogram looks a little better.
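It may help to see what the switch changes. Spearman correlation is the Pearson correlation computed on the ranks of the values rather than on the values themselves, so it rewards any consistently increasing relationship, not just a straight-line one. A minimal sketch in plain Python with made-up numbers and hypothetical function names:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up data: y always rises with x, but not along a straight line.
x = [1, 2, 3, 4]
y = [1, 8, 27, 64]
print("pearson: %.3f  spearman: %.3f" % (pearson(x, y), spearman(x, y)))
```

Since a dissimilarity matrix tends to contain many tied and skewed values, the two methods can produce noticeably different trees from the same data, which by itself does not say which tree is more meaningful.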

Figure 3

So I try using both Spearman and Pearson to generate the heatmap and end up with something that looks different (no grouping on the axis).

Figure 4


Summary

The dendrogram in Figure 3, created using "Spearman", appears to show better clustering than the original generated using "Pearson". However, I'm not entirely certain whether the difference really means "better clustering". I take the fact that it looks different, with longer and fewer legs, to be a positive change.

Figure 3 is cut as follows:

Figure 3, Cut

Based on the even clustering, I decide to cut the dendrogram near the top. But honestly, it isn't clear to me why to do this or how it makes the results meaningful. Since the numbers are too compacted to read, it's difficult to look up the documents they identify in order to ascertain whether the clusters are meaningful.
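One way to think about the cut: with complete linkage, the height at which two branches join is the largest distance between any two members of the merged group. Cutting at height h therefore yields groups in which every pair of documents is at most h apart. The sketch below implements a small complete-linkage clustering in plain Python, stops merging at a chosen height, and maps the resulting groups back to labels. The distance matrix, the labels, and the function names are all made up; in R, the cutree function performs a comparable cut on an hclust object.

```python
def cut_complete_linkage(dist, cut_height):
    """Merge closest clusters (complete linkage) until the next merge would exceed cut_height."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                height = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        if height > cut_height:
            break  # every remaining merge lies above the cut
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return sorted(sorted(c) for c in clusters)

def label_clusters(clusters, labels):
    """Replace row numbers with readable labels."""
    return [[labels[i] for i in cluster] for cluster in clusters]

# Made-up distances: rows 0 and 1 are close, rows 2 and 3 are close, the pairs are far apart.
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
labels = ["doc_a", "doc_b", "doc_c", "doc_d"]  # hypothetical document labels
print(label_clusters(cut_complete_linkage(dist, 0.5), labels))
```

Printing labels per group, instead of reading row numbers off a crowded plot, is one way to judge whether the groups make sense.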

If the heatmap is being interpreted correctly, these documents are either very similar or extremely different (I'm guessing "extremely different").

While this assignment was interesting, I'm not sure my data is conducive to yielding a meaningful dendrogram and heatmap. However, my guess is that I'm doing something incorrectly, and if I can find what that is, or how to interpret my results more accurately, this exercise would be more meaningful.

Shortcomings

My biggest shortcoming for this assignment is an overall lack of understanding about how dendrograms are created, what they mean, and how to use them to learn something meaningful about large data sets.

Being able to go through the motions, repeating what I see in examples, doesn't mean that I've internalized this strategy, and that is the most frustrating part of this assignment. I sense this method could be very useful given a better understanding of the overall analytical principles involved.

An alternative approach to this assignment would have been to extract only the noun phrases instead of using all the phrases identified by Monty Lingua. That might have resulted in better clustering and a cleaner-reading dendrogram.

It would also be helpful to do more to associate the numbers with terms and/or documents. Some learning about applying that text to the dendrogram and heatmap would be useful.

Professor Critique

Requested that the raw data be made available. TODO.