Data Retrieval and Analysis - Homework Assignments - Interspike Codex

SI 618 Fall 2008 - Data Retrieval and Analysis

Data Retrieval and Analysis

A class covering the basics of collecting and analyzing large data sets. Total fun with Perl, Python, shell scripting, and Linux commands to munge, scrub, and study that data.

Homework Assignments

Homework 1: In this homework I study relationships among lists of noun phrases. The lists are generated from the text of a subset of web pages, documents, and PDF files gathered by a spider that targets university web pages dealing with institutional diversity.

Homework 2: In this homework I repeat the work of homework 1, but use a SQLite3 database to make searching, joining, and selecting the data easier. The goal is to generate two ordered lists of the noun phrases that appear most frequently between audiences.

Homework 3: In this homework I compare the number of documents containing a phrase that appears in two distinct sets. I use SQLite to identify the phrases and their document counts, and R to visualize the results as a back-to-back histogram. See the assignment description for more detail.
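As a rough illustration of the counting step, here is a minimal sketch using Python's built-in sqlite3 module. The table name phrase_occurrence, its columns, and the set labels 'A' and 'B' are all hypothetical; the actual schema used in the homework is not shown here.

```python
import sqlite3


def doc_counts_by_set(rows):
    """Count distinct documents per phrase in each of two sets.

    rows: iterable of (doc_id, doc_set, phrase) tuples, where doc_set
    is 'A' or 'B'. The schema and labels are assumptions for this sketch.
    Returns (phrase, docs_a, docs_b) for phrases found in both sets.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE phrase_occurrence (doc_id TEXT, doc_set TEXT, phrase TEXT)"
    )
    conn.executemany("INSERT INTO phrase_occurrence VALUES (?, ?, ?)", rows)
    query = """
        SELECT phrase, docs_a, docs_b
        FROM (
            SELECT phrase,
                   COUNT(DISTINCT CASE WHEN doc_set = 'A' THEN doc_id END) AS docs_a,
                   COUNT(DISTINCT CASE WHEN doc_set = 'B' THEN doc_id END) AS docs_b
            FROM phrase_occurrence
            GROUP BY phrase
        )
        WHERE docs_a > 0 AND docs_b > 0
        ORDER BY docs_a + docs_b DESC, phrase
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


sample_rows = [
    ("d1", "A", "campus climate"),
    ("d1", "A", "campus climate"),  # repeated phrase in the same document
    ("d2", "A", "campus climate"),
    ("d3", "B", "campus climate"),
    ("d1", "A", "student body"),
    ("d3", "B", "student body"),
    ("d4", "B", "student body"),
    ("d4", "B", "faculty senate"),  # appears in only one set, so it is dropped
]
counts = doc_counts_by_set(sample_rows)
```

Each returned row holds the two document counts for one phrase, which is the shape of data a back-to-back histogram needs: one bar per side for each phrase.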

Homework 4: In this homework I analyze web server logs and identify relationships based on four different attributes. I try to follow the general approach defined in the assignment description, but I use different log files.

Homework 5: In this homework I build on earlier work by doing some advanced cleanup of text extracted from fetched web pages. I use Perl to loop through plain text files and scrub them of unwanted content before they are passed to MontyLingua. See the assignment description from the instructor.
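To give a feel for this kind of scrubbing, here is a minimal sketch in Python rather than the Perl used for the homework. The specific rules (replace non-printable and non-ASCII characters, collapse whitespace, drop lines with fewer than three words) are assumptions chosen for illustration, not the rules the assignment required.

```python
import re


def scrub_text(raw, min_words=3):
    """Return raw text with noisy characters and very short lines removed.

    min_words is a hypothetical threshold for dropping menu-like residue.
    """
    cleaned = []
    for line in raw.splitlines():
        # Replace anything outside printable ASCII with a space (assumed rule).
        line = re.sub(r"[^\x20-\x7E]", " ", line)
        # Collapse runs of whitespace and trim the ends.
        line = re.sub(r"\s+", " ", line).strip()
        # Skip lines too short to hold a useful noun phrase.
        if len(line.split()) < min_words:
            continue
        cleaned.append(line)
    return "\n".join(cleaned)


sample = (
    "Home\n"
    "\n"
    "The  university\tvalues   diversity.\n"
    "12\n"
    "Office of\u00a0Institutional Equity"
)
scrubbed = scrub_text(sample)
```

The output keeps one cleaned sentence or fragment per line, which is a convenient form to hand to a phrase extractor.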

Homework 6: In this homework I cluster the documents compared in homework 3. I use a program called "dissim" to generate a proximity matrix and R to generate a dendrogram of the result. I comment on the dendrogram and say where I might cut it and why that cut would give a meaningful result.
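For readers unfamiliar with proximity matrices, here is a minimal Python sketch of one. It does not reproduce "dissim"; it assumes each document is represented by its set of noun phrases and uses Jaccard distance as a stand-in measure. The function names and the sample documents are hypothetical.

```python
def jaccard_distance(a, b):
    """Distance between two sets: 0.0 for identical, 1.0 for disjoint."""
    union = a | b
    if not union:
        return 0.0  # treat two empty sets as identical
    return 1.0 - len(a & b) / len(union)


def proximity_matrix(doc_phrases):
    """Build a symmetric distance matrix over documents.

    doc_phrases: dict mapping a document name to its set of phrases.
    Returns (names, matrix) with rows and columns in sorted name order.
    """
    names = sorted(doc_phrases)
    matrix = [
        [jaccard_distance(doc_phrases[row], doc_phrases[col]) for col in names]
        for row in names
    ]
    return names, matrix


docs = {
    "d1": {"campus climate", "student body"},
    "d2": {"student body", "faculty senate"},
    "d3": {"financial aid"},
}
names, matrix = proximity_matrix(docs)
```

A matrix like this is the usual input to hierarchical clustering, for example R's hclust after conversion with as.dist, and the heights at which documents merge in the resulting dendrogram suggest where a cut would separate natural groups.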