
HW#5 --- last modified November 22 2022 17:34:49.

Solution set.

Due date: Dec 5

Files to be submitted:
  Hw5.zip

Purpose: To gain experience with knowledge representation. To understand the computations a probabilistic agent might do. To gain experience with learning algorithms.

Related Course Outcomes:

CLO7 -- Implement at least one machine learning algorithm.

CLO8 -- Students should be able to describe: (a) the frame problem, (b) possible representations for time and for beliefs.

Specification:

This homework consists of the written questions below, whose answers should be placed in a file Hw5.pdf inside your submitted Hw5.zip folder, as well as the coding exercise that follows them.

  1. Pick your favorite sport and your four favorite teams in that sport. Express an imaginary playoff between these four teams, concluding with a sole winner, using the event calculus described in class. I.e., in addition to the standard event calculus predicates, choose some predicates whose arguments are events, fluents, and times that might be useful for this situation. Give what should hold at the start of the series (the Start event) as well as additional statements for your knowledge base besides the generic successor-state axioms (the latter are in the book).
  2. Consider the following partially explored Wumpus World (ignore how we ended up with a world explored in this fashion):

     .  .  .  .
     .  .  .  .
     S  X  B  .
     X  B  .  .

     Here . indicates an unexplored square; X indicates explored but nothing detected; B indicates a breeze was felt on that square; S indicates a stench was smelt. For each square on the frontier, use the method from the Nov 28 lecture to get a probability that the given square is safe. Which square should a rational agent who must find the gold search next?
  3. Consider the following table of data concerning whether or not to show up to class the Monday before Thanksgiving:
    Vacation Booked | Friends Going | Drank Coffee | Go to Class
          Y         |      DNM      |     DNM      |      N
          N         |       Y       |       Y      |      Y
          N         |       Y       |       N      |      Y
          N         |       N       |       Y      |      Y
          N         |       N       |       N      |      N
    In the above, DNM means does not matter for this case. Work out by hand (show work) the decision tree that our decision tree learning algorithm would compute for the above training set.

For the coding part of the assignment, I would like you to code a simple hierarchical clustering algorithm on Hash2Vec embeddings of words to see to what degree they capture word meanings. We will go over hierarchical clustering in class, but I will describe the minimum needed to understand Hash2Vec embeddings as part of this homework. First up, given a sentence, a 5-gram from the sentence is a string of 5 consecutive words (lower-cased, with punctuation removed), where the sentence is padded at each end with two blank tokens (written _) so that each of its words is the middle word of exactly one 5-gram. For example, the sentence "Dick gave the book to Jane." has the following 5-grams:

_ _ dick gave the
_ dick gave the book
dick gave the book to
gave the book to jane
the book to jane _
book to jane _ _
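
As a concrete reading of the definition above, here is a minimal sketch of 5-gram extraction in Python; the regex used to split words is my assumption, since the spec only says lower-cased without punctuation:

import re

def five_grams(sentence, pad="_"):
    # Lower-case, strip punctuation, and pad with two blank tokens on
    # each side so every word is the middle word of exactly one 5-gram.
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    padded = [pad, pad] + words + [pad, pad]
    return [padded[i:i + 5] for i in range(len(padded) - 4)]

# Reproduces the six 5-grams listed above:
for gram in five_grams("Dick gave the book to Jane."):
    print(" ".join(gram))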

A 5-gram captures something of the meaning of its middle word. I.e., the fact that book is the middle word of "gave the book to jane" says something about the meaning of book. To capture something about the meaning of all the 5-grams that a word is the middle word of across a collection of text, we make use of Hash2Vec embeddings. To keep things simple, we will assume our Hash2Vec vectors have 10 dimensions. Initially, the Hash2Vec vector of each word has all 10 components set to 0. We then scan over some corpus of text and, for each 5-gram in the corpus, update the Hash2Vec vector of its middle word using two hash functions. The first hash function, h1, maps a string to a number between 0 and 9 (a component index), and the second hash function, h2, maps a string to `\pm 1`. Given a 5-gram (word1, word2, word3, word4, word5), to update the vector for word3 we add h2(word1 concatenated with "-2") to the h1(word1 concatenated with "-2") component of the vector, then add h2(word2 concatenated with "-1") to the h1(word2 concatenated with "-1") component, then add h2(word4 concatenated with "+1") to the h1(word4 concatenated with "+1") component, and finally add h2(word5 concatenated with "+2") to the h1(word5 concatenated with "+2") component. So, for example, the 5-gram (gave the book to jane) would adjust the vector of book by adding h2("gave-2") to the h1("gave-2") component of its vector, h2("the-1") to the h1("the-1") component, h2("to+1") to the h1("to+1") component, and h2("jane+2") to the h1("jane+2") component.
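
To make the update rule concrete, here is a minimal sketch. The assignment does not fix the two hash functions, so deriving h1 and h2 from hashlib.md5 (with a prefix to make the two independent) is an assumption:

import hashlib

DIM = 10  # each Hash2Vec vector has 10 components

def _digest(s):
    # Stable string hash; Python's built-in hash() is salted per run.
    return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)

def h1(s):
    return _digest(s) % DIM  # component index in 0..9

def h2(s):
    return 1 if _digest("sign:" + s) % 2 == 0 else -1  # maps to +1 or -1

def update(vectors, gram):
    # Update the Hash2Vec vector of the middle word of one 5-gram.
    w1, w2, w3, w4, w5 = gram
    vec = vectors.setdefault(w3, [0] * DIM)
    for word, offset in ((w1, "-2"), (w2, "-1"), (w4, "+1"), (w5, "+2")):
        if word == "_":
            continue  # whether padding tokens contribute is unspecified; skipping them is an assumption
        key = word + offset  # e.g. "gave-2", "jane+2"
        vec[h1(key)] += h2(key)

Using a stable hash like md5 rather than Python's built-in hash() keeps the vectors reproducible between runs.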

Your program will be run from the command line using the syntax:

python3 word_clusterer.py some_folder num_clusters

for example,

python3 word_clusterer.py wiki_files 5
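
A minimal sketch of the command-line wiring this implies, reusing the five_grams and update helpers sketched above (the crude sentence splitting is a placeholder assumption):

import os
import sys

def main():
    # Invocation: python3 word_clusterer.py some_folder num_clusters
    folder, num_clusters = sys.argv[1], int(sys.argv[2])
    vectors = {}
    for name in os.listdir(folder):
        if name.endswith(".txt"):
            with open(os.path.join(folder, name), encoding="utf-8") as f:
                text = f.read()
            for sentence in text.split("."):  # crude sentence split
                for gram in five_grams(sentence):  # sketched earlier
                    update(vectors, gram)          # sketched earlier
    # ... cluster `vectors` into num_clusters clusters and print one
    # "ClusterK: word1,word2,..." line per cluster (see sketch below) ...

if __name__ == "__main__":
    main()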

On this input, your program should scan each .txt file in the specified folder and generate a dictionary (you are not allowed to use libraries like scikit-learn, etc.) mapping words to their Hash2Vec vectors. Next, using hierarchical clustering with the `L_2` distance between vectors, it should compute the specified number of top-level clusters and output them as comma-separated lists, one cluster per line:

Cluster1: word1_in_cluster1,word2_in_cluster1, ...
Cluster2: word1_in_cluster2,word2_in_cluster2, ...
...

For example,

Cluster1: canada,argentina,wales
Cluster2: has,was,founded,created
Cluster3: a,the
Cluster4: of,in,on,over
Cluster5: or,but,although
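
For the clustering step, here is a minimal sketch of a naive bottom-up (agglomerative) approach. The assignment does not pin down a linkage criterion, so single linkage (the distance between two clusters taken as the distance between their closest members) is an assumption here:

import math

def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cluster(vectors, num_clusters):
    # Start with one cluster per word; repeatedly merge the two closest
    # clusters until only num_clusters remain. Single linkage assumed.
    clusters = [[w] for w in vectors]
    while len(clusters) > num_clusters:
        best_i, best_j, best_d = 0, 1, float("inf")
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(l2(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                if d < best_d:
                    best_i, best_j, best_d = i, j, d
        clusters[best_i].extend(clusters.pop(best_j))
    return clusters

# Hypothetical usage:
# for k, c in enumerate(cluster(vectors, num_clusters), 1):
#     print("Cluster" + str(k) + ": " + ",".join(c))

This brute-force version rescans all cluster pairs on every merge, which is slow for large vocabularies but is enough to illustrate the idea.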

To see how well this kind of clustering works, I'd like you to make a folder with ten text files, each using text from a different Wikipedia page. Run your clustering algorithm on it for several different choices of the number of clusters. See whether you can identify any unifying pattern for the words that end up in the same cluster. Write up your results in a file Experiments.pdf that you also include with your homework.

Point Breakdown

Written exercises (2pts each) -- 6pts
In code, can see where the program uses the command-line arguments to determine a folder and number of clusters, reads in the data, and computes Hash2Vec vectors (0.5pts each) -- 1.5pts
Hierarchical clustering algorithm implemented correctly -- 1.5pts
Experiment write-up -- 1pt
Total -- 10pts