Note that the last part of this assignment has to be done using a lab machine.
Create a new directory in your home directory. Let's say you call it my-a6.
Make this your current directory (by running the unix command: cd my-a6). Now run the unix commands given below. These commands make symbolic links from your directory to a handouts directory containing a bunch of data, including the AI abstracts that you evaluated for relevance judgements in assignment 5. The handouts directory is set up so that you can read the files in it but can't overwrite or change them, and a symbolic link to these files gives you the same access: you can read their contents but you can't change them. This both keeps you from accidentally overwriting the files and saves disk space, since we won't have everyone making copies of the big data files. There is just one copy of each, and everyone reads that copy. Here is how to make the necessary symbolic links:
You also need to make a copy of one of the files so you can modify it. Do
You are going to run the program on two different sizes of text data. The first has about 10 AIT documents and the second has about 5000.
First edit the foarc file to tell it which input and output files to use. For the first time through, set FilesFile to short-files.d, KwinvFile to short-kw-inv.d, and DocsFile to short-docs.d. You have to keep the tab in between the two fields on each line. If you use the emacs editor to do this, it should work. The three edited lines should look like:
If you get a segmentation fault message it probably means you did not edit foarc correctly.
Note: the program is set up to ignore terms that occur only one time in the collection.
After the program finishes, you should have an output file, short-kw-inv.d, in my-a6.
The second time through, change FilesFile to long-files.d, KwinvFile to long-kw-inv.d, and DocsFile to long-docs.d, and run the program again. It will take a long time to finish; be patient.
Now run the program without using a stop list, on the small data set. To do this, create an empty file called empty.wrd.
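One way to create the empty stop list from the shell is sketched below. Any method that leaves a zero-length file named empty.wrd in my-a6 should work just as well.

```shell
# Create empty.wrd, or truncate it if it already exists,
# so that it contains no stop words at all.
: > empty.wrd

# Check the size in bytes; an empty file reports 0.
wc -c < empty.wrd
```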
Now, in foarc, change FilesFile back to short-files.d and KwinvFile to short-kw-nostop-inv.d, and change StopFile to empty.wrd. Run the program again.
Question 2: How is short-kw-nostop-inv.d different from short-kw-inv.d, both quantitatively and qualitatively?
For example, the entry
commit 2 5 4 1 10 1 1 7
means that the term commit occurred in two documents in the entire collection. Its total frequency (number of occurrences) in the collection was 5. In one document, it occurred four times; that document was document 10. In one document, it occurred one time; that document was document 7.
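The reading of the commit entry can be reproduced mechanically. The following is only a sketch: it assumes that, after the term, the document count, and the total frequency, the remaining fields repeat as groups of (within-document frequency, number of documents with that frequency, then that many document numbers). That grouping is inferred from the single example above, not from the program's documentation, and the function name decode_entry is a placeholder.

```shell
# Decode one inverted-file entry into readable statements.
# Assumed layout (inferred from the commit example only):
#   term  doc_count  total_freq  [freq  n_docs  doc_id ...] ...
decode_entry() {
  printf '%s\n' "$1" | awk '{
    printf "%s: %d documents, %d occurrences in total\n", $1, $2, $3
    i = 4
    while (i < NF) {
      freq = $i; n = $(i + 1)      # within-document frequency, number of documents
      for (j = 1; j <= n; j++)
        printf "  %d time(s) in document %d\n", freq, $(i + 1 + j)
      i += 2 + n                   # skip to the start of the next group
    }
  }'
}

decode_entry "commit 2 5 4 1 10 1 1 7"
```

If an entry in your file does not decode sensibly this way, the assumed grouping is wrong for that entry and should be rechecked by hand.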
Question 3: Describe, in the same manner as above, the information associated with the term intellig for short-kw-inv.d.
Question 4: What are the five most frequent non-stopword terms in short-files? (Hint: use the unix sort command on short-kw-inv.d.)
Recall that the Zipfian distribution is an effect seen when the data is ordered by its rank. First we have to see how often each term occurs. The term that occurs most frequently is assigned rank 1. The term that occurs second most frequently is assigned rank 2, etc. Thus if frog occurs 100 times, and this is the most frequent word, it is assigned rank 1. If toad occurs 96 times, and this is second most frequent, it is assigned rank 2. Say three terms remain, and they all occur 2 times each. They are tied, but we deal with ties arbitrarily, assigning them increasing ranks, 3, 4, and 5.
Rather than writing a program, we can convert the inverted file into this ranked form using a pipeline of unix commands. It took me a little while to figure out how to make this work, so I am giving you the sequence of commands here for converting short-kw-inv.d:
cat short-kw-inv.d | cut -f 1,3 | sort -k 2,2 -nr | \
awk '{s += 1; print s "\t" $2 "\t" $1}' > short-zipf.txt
We are going to use short-zipf.txt as input to the visualization program,
and it requires tab-separated input.
Question 5: What does each stage of the pipeline do? Hint: you can produce intermediate temporary results to see what each step does, e.g.,
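One way to do this is sketched below: save the output of each stage to its own file and look at the first few lines of each. The function name show_stages and the stage1.tmp and stage2.tmp file names are placeholders, not part of the assignment. Only the first two stages are shown; the awk stage can be inspected the same way.

```shell
# Run the first two pipeline stages on the file named in $1,
# saving each stage's output so it can be inspected separately.
# show_stages and the stage*.tmp names are placeholders.
show_stages() {
  cat "$1" | cut -f 1,3 > stage1.tmp          # output of the cut stage
  sort -k 2,2 -nr stage1.tmp > stage2.tmp     # output of the sort stage
  head -n 5 stage1.tmp stage2.tmp             # first lines of each file
}

# Usage, once short-kw-inv.d exists in the current directory:
if [ -f short-kw-inv.d ]; then show_stages short-kw-inv.d; fi
```

Remove the temporary files when you are done with them.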
Run the same command on long-kw-inv.d but put the results in long-zipf.txt. Do the same for short-kw-nostop-inv.d, putting the results in short-nostop-zipf.txt.
Now run the visualization program. You have to use the SIMS machines for this as this is the only place it resides. The program is located at:
Question 6: Use the visualization to answer this question: How many terms occur 10 times in short-zipf.txt? Which terms occur 15 times?
Question 7: Use the tabs on the X-axis and Y-axis to change the input to the scatter plot. Set Y to Column 1 and X to Column 2. Does this change the shape of the curve very much? Now change Y back to Column 2 and set X to Column 3. Does alphabetical order of terms correlate with word frequency?
Now load in the data for short-nostop-zipf.txt.
Question 8: How is this graph different from or similar to the graph for short-zipf.txt? Why?
Now load in the data for long-zipf.txt.
Play around with the sliders to see different subsets of the data. Note that these are special sliders that let you focus on a subset of the data. Moving both arrows close together on an axis greatly reduces the number of data points you see.
Question 9: Adjust the sliders so you can answer this: About how many terms have frequencies between 480 and 500?
Question 10: (This is the important question.) Compare the fully-expanded initial view of long-zipf with the fully-expanded initial view of short-zipf. Discuss the similarities/differences. Why does this occur?
Question 11: (This is the other important question.) Think about tf*idf term weighting. Discuss how these graphs show why we divide the frequency with which a term occurs in an individual document (tf) by the number of documents it occurs in (df). Why is this a good strategy for ranking documents?
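As a small numeric illustration of the division described in Question 11, the sketch below computes tf divided by df for two cases. All of the numbers are invented for illustration, the function name weight is a placeholder, and the sketch uses the plain tf/df form from the question rather than the logarithmic idf that many formulations of tf*idf use.

```shell
# Weight of a term in one document, computed as tf / df.
# weight is a placeholder name; all numbers below are invented.
weight() {   # usage: weight TF DF
  awk -v tf="$1" -v df="$2" 'BEGIN { printf "%.3f\n", tf / df }'
}

weight 4 2      # term occurs 4 times here and appears in only 2 documents
weight 4 4000   # term occurs 4 times here and appears in 4000 documents
```

The same within-document frequency gives a much smaller weight when the term appears in many documents.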
Question 12: Turn in screenshots of the visualization showing short-zipf and long-zipf (separate images are fine).
Last modified Oct 28, 1998 MAH