SI 618 HW 1: Homework Assignment 1 - Interspike Codex

A Study of Phrases

Extraction and analysis of noun phrases (NP) in documents.

Abstract

The general idea of this exercise is to use Perl, Python, and shell scripts to fetch and analyze text from the web, then extract meaningful phrases from it.

By identifying repeated nouns and phrases that appear in collections of text, we can make assessments and educated guesses about matters of importance to the producers of those documents. By comparing dissimilar collections and phrase lists, tensions and areas of key interest can emerge.

Web pages are fetched, analyzed, and converted to meaningful text. This text is then analyzed for nouns and noun phrases. These phrases and words are extracted into a list. After some scrubbing, analysis reveals people, places, and things of importance in the document corpus.

Data Used

The data starts with a collection of URLs, organized by university, that deal with institutional diversity. Each line in these files is pipe-delimited and begins with letter and number codes identifying author and genre, followed by a URL.

Two different data sets are considered, each with a different audience and genre.

AC5 Data

The authors are students, other individuals, and the academic schools and departments. These are the people doing the learning and teaching, the research and exploration.

The genre covers documents about doing something, e.g., meeting minutes, events, conferences, etc.

BD5 Data

The authors consist of university-wide administrators, offices, academic affairs, newspapers, human resources, etc. In short, they are the administration or governing body at an institution.

The genre covers documents about doing something, e.g., meeting minutes, events, conferences, etc. This is the same genre as AC5.

Diary of Actions

Overview c/o Lecture Slides

Following the flow outlined in this image (borrowed from lecture slides), the actions below show how to go from a set of page URLs to a list of prominent noun phrases generated from the text contained in those pages.

How many URLs are there in one file, linkcollections/cmich.txt, from Central Michigan University?

cat linkcollections/cmich.txt | wc -l
69

What is in that file?

head -n3 linkcollections/cmich.txt
danieldz|d|4|http://www.lib.cmich.edu/departments/reference/diversity/links.htm
danieldz|d|4|http://www.diversity.cmich.edu/
danieldz|d|4|http://www.lib.cmich.edu/departments/reference/diversity/

Extract to a file all lines labeled with author code "b" or "d" and genre code "5". The pattern is quoted so the shell does not treat the brackets as a glob:

cat linkcollections/*.txt | grep -i '|[bd]|5|' > bd5_urls.txt
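
The same selection can be sketched in Python, which makes the field positions explicit. This is a minimal sketch, assuming every line has the four pipe-delimited fields seen in the cmich.txt sample and that the second and third fields are the author and genre codes. The function name select_urls and the two example.edu lines are made up for illustration and are not part of the course scripts.

```python
def select_urls(lines, author_codes=("b", "d"), genre_code="5"):
    """Return the URLs whose author and genre codes match the request."""
    selected = []
    for line in lines:
        # Split into at most four fields so a '|' inside the URL survives.
        parts = line.strip().split("|", 3)
        if len(parts) != 4:
            continue  # skip malformed lines
        _collector, author, genre, url = parts
        if author.lower() in author_codes and genre == genre_code:
            selected.append(url)
    return selected


sample = [
    "danieldz|d|4|http://www.diversity.cmich.edu/",
    "danieldz|d|5|http://example.edu/minutes.htm",   # made-up line
    "danieldz|a|5|http://example.edu/students.htm",  # made-up line
]
print(select_urls(sample))
```

Unlike the grep command, this returns only the URL field, so it combines the filtering step with the cut step that follows.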

Call fetchdocs.pl to retrieve the URL target pages and put them in a folder called fetcheddocs/.

cut -d "|" -f 4 bd5_urls.txt \
 | ~/si618f08/divers/fetchconvert/fetchdocs.pl fetcheddocs/

Inspect the fetched docs to identify file types using droidfilelist.pl and calldroid.pl.

# Input: a directory, Output: droidfilelist.xml
droidfilelist.pl fetcheddocs
# Input: droidfilelist.xml, Output: droidoutput.xml
calldroid.pl 

The first few lines of droidfilelist.xml look like this:

<FileCollection>
 <IdentificationFile Name="fetcheddocs/american-001" />
 <IdentificationFile Name="fetcheddocs/american-002" />
 <IdentificationFile Name="fetcheddocs/american-003" />
 <IdentificationFile Name="fetcheddocs/american-004" />

The first few lines of droidoutput.xml:

<?xml version="1.0" encoding="UTF-8"?>
<FileCollection xmlns="http://www.nationalarchives.gov.uk/pronom/FileCollection">
 <DROIDVersion>V1.1</DROIDVersion>
 <SignatureFileVersion>12</SignatureFileVersion>
 <DateCreated>2008-10-28T21:26:53</DateCreated>
 <IdentificationFile IdentQuality="Positive" >
   <FilePath>fetcheddocs/american-001</FilePath>
   <FileFormatHit>
     <Status>Positive (Specific Format)</Status>
     <Name>Hypertext Markup Language</Name>
     <Version>4.0</Version>
     <PUID>fmt/99</PUID>
     <IdentificationWarning>Possible file extension mismatch
     </IdentificationWarning>
   </FileFormatHit>
 </IdentificationFile>
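
To see which formats DROID identified without reading the XML by hand, the output can be summarized with a short script. The sketch below uses Python's standard xml.etree.ElementTree and assumes the element names and namespace shown in the excerpt above. The function name list_formats and the trimmed sample string are illustrative, not part of the droid scripts.

```python
import xml.etree.ElementTree as ET

# Namespace copied from the droidoutput.xml excerpt.
NS = {"droid": "http://www.nationalarchives.gov.uk/pronom/FileCollection"}


def list_formats(xml_text):
    """Return (file path, format name) pairs from DROID output."""
    root = ET.fromstring(xml_text)
    results = []
    for ident in root.findall("droid:IdentificationFile", NS):
        path = ident.findtext("droid:FilePath", default="", namespaces=NS)
        name = ident.findtext(
            "droid:FileFormatHit/droid:Name", default="unknown", namespaces=NS
        )
        results.append((path, name))
    return results


# Trimmed sample modeled on the excerpt, closed so that it parses.
sample_xml = """<FileCollection xmlns="http://www.nationalarchives.gov.uk/pronom/FileCollection">
 <IdentificationFile IdentQuality="Positive">
   <FilePath>fetcheddocs/american-001</FilePath>
   <FileFormatHit>
     <Name>Hypertext Markup Language</Name>
   </FileFormatHit>
 </IdentificationFile>
</FileCollection>"""

for path, name in list_formats(sample_xml):
    print(path, name, sep="\t")
```

Files with no format hit would be reported as "unknown" here, which is a choice made for this sketch.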

Call convertdocs.pl, which takes the pages in fetcheddocs/ and turns them into cleaned-up text using the meta information provided by the droid scripts above:

convertdocs.pl
# How many did we get?
ls converted/ | wc -l
1153

The converted documents are processed by Monty Lingua and the results written to a file.

python extract_lemmatised.py converted > lemmatized_coverted.txt

Results

The results produced by Monty Lingua are found to contain many lines that aren't really noun phrases. We'll clean those up later. The results are 6.6 MB worth of text.

ls -lh lemmatized_coverted.txt 
-rw-r--r--  1 mhains users 6.6M 2008-10-28 22:05 lemmatized_coverted.txt

A few lines from the top of the lemmatized text look like this:

american-006    VX      identify
american-006    NX      and leverage difference
american-006    VX      create
american-006    NX      internal environment
american-006    NX      that

We're interested in the NX (noun phrase) lines. Use grep, cut, sort, and uniq to get counts and save them to a result file.

cat lemmatized_coverted.txt | grep NX | cut -f 3 | sort | uniq -c > result1
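
The counting step can also be sketched in Python with collections.Counter. This assumes the lemmatized file is tab-delimited with three fields (document id, tag, phrase), as the cut -f 3 call implies. The function name count_noun_phrases is illustrative. One difference from the pipeline: the sketch compares the tag column exactly, whereas grep NX would also keep any line whose phrase text happens to contain "NX".

```python
from collections import Counter


def count_noun_phrases(lines):
    """Count phrases on lines whose tag column is exactly 'NX'."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3 and fields[1] == "NX":
            counts[fields[2].strip()] += 1
    return counts


# Lines taken from the sample above, with tabs written explicitly.
sample_lines = [
    "american-006\tVX\tidentify",
    "american-006\tNX\tand leverage difference",
    "american-006\tVX\tcreate",
    "american-006\tNX\tinternal environment",
    "american-006\tNX\tthat",
]

counts = count_noun_phrases(sample_lines)
for phrase in sorted(counts):
    print(counts[phrase], phrase)
```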

The result file contains counts and noun phrases. Some of the good data looks like this:

10 appreciation
1 apprenticeship , project , and technology
21 approach
3 Approach
1 approach and style
2 appropri-
3 appropriate body   
1 appropriate civil comment

The garbage lines need to be scrubbed out. I wrote a simple scrubresult.pl script that scrubs the lines and returns the remaining counts and phrases, tab-delimited, for analysis.

scrubresult.pl result1 > phraseset1.txt
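
The scrubresult.pl source is not reproduced here, so the following Python sketch shows only one plausible set of scrubbing rules and should not be read as the script's actual logic. The assumed rules are: lower-case each phrase, collapse extra whitespace, keep only phrases made of letters and single spaces, and sum the counts of phrases that become identical. The names scrub and CLEAN_PHRASE are hypothetical.

```python
import re
from collections import Counter

# Assumed rule: a clean phrase is lower-case words separated by single spaces.
CLEAN_PHRASE = re.compile(r"^[a-z]+( [a-z]+)*$")


def scrub(lines, min_count=1):
    """Return 'count<TAB>phrase' lines for phrases that pass the assumed rules."""
    totals = Counter()
    for line in lines:
        parts = line.strip().split(None, 1)  # uniq -c output: count, then phrase
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        count = int(parts[0])
        phrase = " ".join(parts[1].lower().split())
        if CLEAN_PHRASE.match(phrase):
            totals[phrase] += count
    return [f"{n}\t{p}" for p, n in sorted(totals.items()) if n >= min_count]


# Lines taken from the result1 sample above.
sample_result = [
    "10 appreciation",
    "1 apprenticeship , project , and technology",
    "21 approach",
    "3 Approach",
    "2 appropri-",
    "3 appropriate body   ",
]
print("\n".join(scrub(sample_result)))
```

Under these rules "approach" and "Approach" merge into a single count of 24, and the comma-laden and hyphen-truncated lines are dropped. Phrases containing digits are dropped too, so dates would be lost with this particular pattern.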

And finally, a sample of the results we want to analyze:

   31	graduate school
   86	higher education
   50	higher education
   30	high school
   31	human resource
   40	international student
   33	its name
   77	minority student
   30	mor barak
   155	more information
   46	moved permanently
   34	multicultural affair
   31	nc state
   48	north carolina
   71	other student
   56	our campus
   39	our community
   31	our student
   48	park library building
   51	penn state
   30	question or comment

Summary

Visualizations were produced by Many Eyes. Initial analysis revealed that the words student, diversity, and university appeared disproportionately often, so they were removed.
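
One way to remove the dominant words before visualization is sketched below in Python. It assumes the removal happens at the word level and that the input is the tab-delimited count and phrase format of phraseset1.txt. The names word_weights and DOMINANT are illustrative, and this is not necessarily how the removal was actually done.

```python
from collections import Counter

# Words named above as disproportionately frequent.
DOMINANT = {"student", "diversity", "university"}


def word_weights(phrase_lines, excluded=DOMINANT):
    """Split phrases into words and weight each word by its phrase count."""
    weights = Counter()
    for line in phrase_lines:
        count, _, phrase = line.strip().partition("\t")
        if not count.isdigit():
            continue  # skip lines without a leading count
        for word in phrase.split():
            if word not in excluded:
                weights[word] += int(count)
    return weights


# Lines taken from the phrase sample above.
sample_phrases = [
    "77\tminority student",
    "40\tinternational student",
    "86\thigher education",
    "50\thigher education",
]
weights = word_weights(sample_phrases)
for word, weight in weights.most_common():
    print(weight, word)
```

Splitting phrases into words mirrors what a word cloud does anyway, so the weights here approximate what the visualization would emphasize once the dominant words are gone.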

Figure 1 - BD5

Generated from 66,984 noun phrases written by university offices and administration when doing something about or related to institutional diversity.

Figure 2 - AC5

Generated from 19,172 noun phrases from students and faculty when doing something about or related to institutional diversity.

Both Wordle visualizations were produced with similar settings.

Similarities

To find similarities between the two Wordle word lists, words were grouped into three lists: person, place, or thing. A statement of focus was formulated by studying the words in the largest list for each collection.

Both are interested in engaging and educating others regarding such things as rights and culture through events, teaching, and recognizing opportunities.

The places in both lists were similar and related to universities and communities surrounding them.

Differences

BD5 was generated from noun phrases authored by university offices and administration about doing something related to institutional diversity. The emergent focus is on engaging and educating the community and people: opportunities, programs, discussions, workshops, examples, all things related to engagement.

In contrast, the AC5 sample from faculty and students focuses more on individuals, issues, and information. Nouns such as participant, member, person, student, women, and present appear here but rarely in the BD5 set.

Shortcomings

This particular exercise was found to be lossy in that information was lost at each step. For example, the Many Eyes Wordle seemed not to respect multi-word phrases. Though some data was forfeited, enough remained to provide a starting point for investigation.

A better understanding of the authors and genres could be achieved through more rigorous evaluation of the data. More time could be spent studying interesting parallels, similarities, and differences between disparate or similar audiences. Verb phrases could be analyzed in the same way as noun phrases.

Another approach would be to find important noun phrases by university and map the overlaps between them. Similarities and differences could then be compared across university locations, along with assumptions about geographical, cultural, or societal factors based on location.
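
A per-university overlap could be sketched as follows in Python. It assumes the university can be read from the document id prefix (for example "american" in "american-006") and that the input is the tab-delimited lemmatized output. The function names are illustrative, and the "cmich-001" lines are hypothetical, made up only to show an overlap.

```python
from collections import defaultdict
from itertools import combinations


def phrases_by_university(lines):
    """Group NX phrases into one set per university prefix."""
    groups = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3 and fields[1] == "NX":
            university = fields[0].rsplit("-", 1)[0]  # drop the numeric suffix
            groups[university].add(fields[2].strip().lower())
    return groups


def pairwise_overlap(groups):
    """Return the shared phrases for every pair of universities."""
    return {
        (a, b): groups[a] & groups[b]
        for a, b in combinations(sorted(groups), 2)
    }


sample_docs = [
    "american-006\tNX\tinternal environment",
    "american-006\tVX\tcreate",
    "cmich-001\tNX\tinternal environment",   # hypothetical line
    "cmich-001\tNX\tpark library building",  # hypothetical line
]
overlaps = pairwise_overlap(phrases_by_university(sample_docs))
for pair, shared in overlaps.items():
    print(pair, sorted(shared))
```

Sets ignore frequency, so this shows only which phrases are shared. Weighting by count would be a natural next step.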

Possible Application

Noun phrases for specific author types and genre labels could be used to improve natural language search engines. Search strings could be phrased more as a sentence.

The meta data associated with each collection could add further insight when one collection is compared to others. Visualizations could reveal which universities focus more on matters of institutional diversity. Size of pages or word count could also be considered. Rigor would have to be applied to harvest all available URLs for such a comparison to have respectable value.

Professor Critique

It would be useful to include links to the ordered noun lists sorted from most to least.

Explore further the notion of meta data and what is lost during conversion and processing.

Provide more comment or detail about cleaning up the output. Dates may have significance so stripping numbers entirely might miss something.

Intended output of the assignment was two word lists showing the most prominent words first.