SI 618 HW 1: Homework Assignment 1
A Study of Phrases
Extraction and analysis of noun phrases (NP) in documents.
Abstract
The general idea of this exercise is to use Perl, Python, and shell scripts to fetch and analyze text from the web, then extract meaningful phrases from it.
By identifying repeated nouns and phrases that appear in collections of text, we can make assessments and educated guesses about matters of importance to the producers of those documents. By comparing dissimilar collections and phrase lists, tensions and areas of key interest can emerge.
Web pages are fetched, analyzed, and converted to meaningful text. This text is then analyzed for nouns and noun phrases. These phrases and words are extracted into a list. After some scrubbing, analysis reveals people, places, and things of importance in the document corpus.
Data Used
The data used starts with a collection of URLs, organized by university, that deal with institutional diversity. Each line in these files begins with letters and numbers identifying author and genre followed by a URL.
Two different data sets are considered, each with a different audience and genre.
AC5 Data
The authors consist of students and other individuals and the academic schools and departments. These are the people doing the learning and teaching, the research and exploration.
The genre covers documents that record doing something about diversity, e.g., meeting minutes, events, conferences, etc.
BD5 Data
The authors consist of university-wide administrators, offices, academic affairs, newspapers, human resources, etc. In short, this is the administration or governing body of an institution.
The genre covers documents that record doing something about diversity, e.g., meeting minutes, events, conferences, etc. This is the same genre as AC5.
Diary of Actions
Following the flow outlined in this image (borrowed from lecture slides), the actions below show how to go from a set of page URLs to a list of prominent noun phrases generated from the text contained in those pages.
How many URLs are there in one file, linkcollections/cmich.txt, from Central Michigan University?
cat linkcollections/cmich.txt | wc -l
69
What is in that file?
head -n3 cmich.txt
danieldz|d|4|http://www.lib.cmich.edu/departments/reference/diversity/links.htm
danieldz|d|4|http://www.diversity.cmich.edu/
danieldz|d|4|http://www.lib.cmich.edu/departments/reference/diversity/
Extract to a file all lines labeled with author "b" or "d" and genre "5":
cat linkcollections/*.txt | grep -i '|[bd]|5' > bd5_urls.txt
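The same selection can be sketched in Python by splitting each line on its pipe delimiters instead of pattern-matching the whole line. This is only an illustration, not part of the original pipeline: the field names (collector, author, genre, URL) are assumptions inferred from the sample lines above, and the two example.edu lines are hypothetical.

```python
def select_urls(lines, authors=("b", "d"), genres=("5",)):
    """Return the URLs of lines whose author and genre codes match."""
    selected = []
    for line in lines:
        # Split on the first three pipes only, so a "|" inside a URL survives.
        parts = line.strip().split("|", 3)
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        _collector, author, genre, url = parts  # assumed field meanings
        if author.lower() in authors and genre in genres:
            selected.append(url)
    return selected


sample = [
    "danieldz|d|4|http://www.diversity.cmich.edu/",
    "danieldz|d|5|http://example.edu/minutes.htm",  # hypothetical line
    "danieldz|a|5|http://example.edu/club.htm",     # hypothetical line
]
print(select_urls(sample))
```

Unlike the grep pattern, which matches anywhere in the line, this checks the author and genre fields specifically, and it returns the URL directly rather than the whole line.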
Call fetchdocs.pl to retrieve the pages the URLs point to and put them in a folder called fetcheddocs/.
cut -d "|" -f 4 bd5_urls.txt \
  | ~/si618f08/divers/fetchconvert/fetchdocs.pl fetcheddocs/
Inspect the fetched docs to identify file types using droidfilelist.pl and calldroid.pl.
# Input: a directory, Output: droidfilelist.xml
droidfilelist.pl fetcheddocs
# Input: droidfilelist.xml, Output: droidoutput.xml
calldroid.pl
The first few lines of droidfilelist.xml look like this:
<FileCollection>
  <IdentificationFile Name="fetcheddocs/american-001" />
  <IdentificationFile Name="fetcheddocs/american-002" />
  <IdentificationFile Name="fetcheddocs/american-003" />
  <IdentificationFile Name="fetcheddocs/american-004" />
The first few lines of droidoutput.xml:
<?xml version="1.0" encoding="UTF-8"?>
<FileCollection xmlns="http://www.nationalarchives.gov.uk/pronom/FileCollection">
  <DROIDVersion>V1.1</DROIDVersion>
  <SignatureFileVersion>12</SignatureFileVersion>
  <DateCreated>2008-10-28T21:26:53</DateCreated>
  <IdentificationFile IdentQuality="Positive" >
    <FilePath>fetcheddocs/american-001</FilePath>
    <FileFormatHit>
      <Status>Positive (Specific Format)</Status>
      <Name>Hypertext Markup Language</Name>
      <Version>4.0</Version>
      <PUID>fmt/99</PUID>
      <IdentificationWarning>Possible file extension mismatch</IdentificationWarning>
    </FileFormatHit>
  </IdentificationFile>
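To get a feel for which file types were fetched, the DROID output can be tallied by format name. The sketch below is an illustration only and assumes the whole file follows the structure of the excerpt above. The sample is a trimmed copy of that excerpt plus one hypothetical entry with no format hit.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace taken from the xmlns attribute in the excerpt above.
NS = {"fc": "http://www.nationalarchives.gov.uk/pronom/FileCollection"}


def tally_formats(root):
    """Count IdentificationFile entries by the format name DROID reported."""
    counts = Counter()
    for ident in root.findall("fc:IdentificationFile", NS):
        name = ident.findtext(
            "fc:FileFormatHit/fc:Name", default="(unidentified)", namespaces=NS
        )
        counts[name] += 1
    return counts


SAMPLE = """<FileCollection xmlns="http://www.nationalarchives.gov.uk/pronom/FileCollection">
  <IdentificationFile IdentQuality="Positive">
    <FilePath>fetcheddocs/american-001</FilePath>
    <FileFormatHit>
      <Name>Hypertext Markup Language</Name>
      <PUID>fmt/99</PUID>
    </FileFormatHit>
  </IdentificationFile>
  <!-- hypothetical entry with no format hit -->
  <IdentificationFile>
    <FilePath>fetcheddocs/american-002</FilePath>
  </IdentificationFile>
</FileCollection>"""

format_counts = tally_formats(ET.fromstring(SAMPLE))
for name, n in format_counts.most_common():
    print(n, name)
```

For the real file, `ET.parse("droidoutput.xml").getroot()` would supply the root element in place of the in-memory sample.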
Call convertdocs.pl, which takes the pages in fetcheddocs/ and turns them into cleaned-up text using the format information produced by the droid scripts above:
convertdocs.pl
# How many did we get?
ls converted/ | wc -l
1153
The converted documents are processed by Monty Lingua and the results written to a file.
python extract_lemmatised.py converted > lemmatized_coverted.txt
Results
The results produced by Monty Lingua are found to contain many lines that aren't really noun phrases. We'll clean those up later. The results are 6.6 MB worth of text.
ls -lh lemmatized_coverted.txt
-rw-r--r-- 1 mhains users 6.6M 2008-10-28 22:05 lemmatized_coverted.txt
A few lines from the top of the lemmatized text look like this:
american-006 VX identify
american-006 NX and leverage difference
american-006 VX create
american-006 NX internal environment
american-006 NX that
We're interested in the NX (noun phrases). Use cut, sort and uniq to get counts and save them to a result file.
cat lemmatized_coverted.txt | grep NX | cut -f 3 | sort | uniq -c > result1
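For comparison, the counting step can be sketched in Python. This is an illustration, not the method used above. It assumes each line holds three tab-separated fields (document id, tag, phrase), which is what `cut -f 3` implies, and the sample lines are the ones shown earlier.

```python
from collections import Counter


def count_noun_phrases(lines):
    """Count phrases on lines whose second field is the NX tag."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Assumed layout: document id, tag, phrase.
        if len(fields) >= 3 and fields[1] == "NX":
            counts[fields[2]] += 1
    return counts


sample = [
    "american-006\tVX\tidentify",
    "american-006\tNX\tand leverage difference",
    "american-006\tVX\tcreate",
    "american-006\tNX\tinternal environment",
    "american-006\tNX\tthat",
]
np_counts = count_noun_phrases(sample)
for phrase, n in sorted(np_counts.items()):
    print(n, phrase)
```

Checking the tag field directly avoids one difference from `grep NX`, which would also keep any line where "NX" happens to appear in the document id or the phrase itself.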
The result file contains counts and noun phrases. Some of the good data looks like this:
10 appreciation
1 apprenticeship , project , and technology
21 approach
3 Approach
1 approach and style
2 appropri-
3 appropriate body
1 appropriate civil comment
The garbage lines need to be scrubbed out. I wrote a simple script, scrubresult.pl, that scrubs the lines and returns the remaining counts and phrases, tab-delimited, for analysis.
scrubresult.pl result1 > phraseset1.txt
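scrubresult.pl itself is not reproduced here, so the following Python sketch only illustrates the kind of scrubbing such a script might do. Every rule in it is an assumption: phrases are lowercased so that "Approach" and "approach" merge, and any phrase containing punctuation, digits, or a trailing hyphen is dropped. Note that dropping digits also discards dates.

```python
import re
from collections import Counter

# Assumed rule: keep only phrases made of lowercase words separated by spaces.
WORDS_ONLY = re.compile(r"^[a-z]+(?: [a-z]+)*$")


def scrub(lines, min_count=1):
    """Turn `uniq -c` output into cleaned, tab-delimited count/phrase rows."""
    merged = Counter()
    for line in lines:
        parts = line.split(None, 1)  # leading count, then the phrase
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        count, phrase = int(parts[0]), parts[1].strip().lower()
        if WORDS_ONLY.match(phrase):
            merged[phrase] += count
    return [f"{n}\t{p}" for p, n in sorted(merged.items()) if n >= min_count]


sample = [
    "     10 appreciation",
    "      1 apprenticeship , project , and technology",
    "     21 approach",
    "      3 Approach",
    "      1 approach and style",
    "      2 appropri-",
    "      3 appropriate body",
    "      1 appropriate civil comment",
]
for row in scrub(sample):
    print(row)
```

The optional `min_count` argument is a hypothetical convenience for keeping only frequent phrases.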
And finally, a sample of the results we want to analyze:
31 graduate school
86 higher education
50 higher education
30 high school
31 human resource
40 international student
33 its name
77 minority student
30 mor barak
155 more information
46 moved permanently
34 multicultural affair
31 nc state
48 north carolina
71 other student
56 our campus
39 our community
31 our student
48 park library building
51 penn state
30 question or comment
Summary
Visualizations were produced by Many Eyes. Initial analysis revealed a disproportionate number of occurrences of the words student, diversity, and university, which were then removed.
Generated from 66,984 noun phrases from university offices and administration (BD5) when doing something about or related to institutional diversity.
Generated from 19,172 noun phrases from students and faculty (AC5) when doing something about or related to institutional diversity.
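How the dominant words were removed is not described, so the Python sketch below shows just one possible reading: the three words are dropped wherever they occur as whole tokens, the remaining text is kept, and the result is ranked from most to least frequent. The input rows are taken from the scrubbed sample shown earlier.

```python
from collections import Counter

DOMINANT = {"student", "diversity", "university"}


def rank_without_dominant(rows, dominant=DOMINANT):
    """Drop dominant words from each phrase and rank what remains by count."""
    totals = Counter()
    for row in rows:
        count, phrase = row.rstrip("\n").split("\t", 1)
        kept = " ".join(w for w in phrase.split() if w not in dominant)
        if kept:  # a phrase made only of dominant words disappears entirely
            totals[kept] += int(count)
    return totals.most_common()  # highest counts first


rows = [
    "155\tmore information",
    "86\thigher education",
    "77\tminority student",
    "71\tother student",
    "31\tour student",
]
ranked = rank_without_dominant(rows)
for phrase, n in ranked:
    print(n, phrase)
```

Matching is exact, so a plural such as "students" would survive unless the lemmatizer had already reduced it to "student".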
Both Wordle visualizations were produced with similar settings.
Similarities
To find similarities between the two Wordle lists, words were grouped into three lists: person, place, or thing. A statement of focus was formulated by studying the words in the largest list for each collection.
Both are interested in engaging and educating others regarding such things as rights and culture through events, teaching, and recognizing opportunities.
The places in both lists were similar and related to universities and communities surrounding them.
Differences
BD5 was generated from noun phrases authored by university offices and administration when doing something about or related to institutional diversity. The emergent focus is on engaging and educating the community and people. Opportunities, programs, discussions, workshops, examples: all are things related to engagement.
In contrast, the AC5 sample from faculty and students focuses more on individuals, issues, and information. Nouns such as participant, member, person, student, women, and present appear here but very little in the BD5 set.
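The comparison above was done by reading the word lists. As a rough illustration of how it might be automated, the Python sketch below compares each word's share of its own list, which matters because the two collections differ greatly in size. The function names, the factor of 2, and all counts are hypothetical. Only the words themselves come from the discussion above.

```python
def relative_share(counts):
    """Convert raw counts to each word's share of the list total."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def contrast(a, b, factor=2.0):
    """Split words into shared, over-represented in a, over-represented in b."""
    share_a, share_b = relative_share(a), relative_share(b)
    shared, more_a, more_b = [], [], []
    for word in sorted(set(share_a) | set(share_b)):
        pa, pb = share_a.get(word, 0.0), share_b.get(word, 0.0)
        if pa > pb and pa >= factor * pb:
            more_a.append(word)
        elif pb > pa and pb >= factor * pa:
            more_b.append(word)
        else:
            shared.append(word)
    return shared, more_a, more_b


# Hypothetical counts, for illustration only.
bd5 = {"program": 40, "workshop": 30, "event": 20, "member": 10}
ac5 = {"participant": 25, "member": 45, "event": 20, "program": 10}

shared, more_bd5, more_ac5 = contrast(bd5, ac5)
print("shared:", shared)
print("more in BD5:", more_bd5)
print("more in AC5:", more_ac5)
```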
Shortcomings
This particular exercise was found to be lossy in that information was lost at each step. For example, the Many Eyes Wordle seemed not to respect multiple-word phrases. Though some data was forfeited, there was enough to provide a starting point for investigation.
A better understanding of the authors and genres could be achieved through more rigorous evaluation of the data. More time could be spent studying interesting parallels, similarities, and differences between disparate or similar audiences. Verb phrases could be analyzed in the same way as noun phrases.
Another approach would be to find the important noun phrases for each university and map where they overlap. Similarities and differences could then be compared against university location, along with assumptions about geographical, cultural, or societal factors based on that location.
Possible Application
Noun phrases for specific author types and genre labels could be used to improve natural language search engines. Search strings could be phrased more as a sentence.
The metadata associated with each document offers further grounds for comparing it with others in the collection. Visualizations could reveal which universities focus more on matters of institutional diversity. Page size or word count could also be considered. All available URLs would have to be harvested rigorously for such a comparison to have respectable value.
Professor Critique
It would be useful to include links to the ordered noun lists sorted from most to least.
Explore further the notion of meta data and what is lost during conversion and processing.
Provide more comment or detail about cleaning up the output. Dates may have significance so stripping numbers entirely might miss something.
Intended output of the assignment was two word lists showing the most prominent words first.
