Chris Pollett
HW#5 --- last modified November 22 2022 17:34:49.

Due date: Dec 5

Files to be submitted:

Purpose: To gain experience with knowledge representation. To understand the computations a probabilistic agent might do. To gain experience with learning algorithms.

Related Course Outcomes:
CLO7 -- Implement at least one machine learning algorithm.
CLO8 -- Students should be able to describe: (a) the frame problem, (b) possible representations for time and for beliefs.

Specification: This homework consists of the questions below, whose answers should be in Hw5.pdf in your submitted Hw5.zip folder, as well as the coding exercise beneath them.
For the coding part of the assignment, I would like you to code a simple hierarchical clustering algorithm on Hash2Vec embeddings of words to see to what degree they capture word meanings. We will go over hierarchical clustering in class, but will describe the minimum needed to understand Hash2Vec embeddings as part of this homework.

First up, given a sentence, a 5-gram from the sentence is a string of 5 consecutive words (lower-cased, without punctuation). For example, the sentence "Dick gave the book to Jane." has the following 5-grams:

_ _ dick gave the
_ dick gave the book
dick gave the book to
gave the book to jane
the book to jane _
book to jane _ _

A 5-gram captures something of the meaning of the middle word in the 5-gram. I.e., that book is the middle word of "gave the book to jane" says something about the meaning of book. To capture something about the meaning of all the 5-grams a word is the middle word of, for a collection of text, we make use of Hash2Vec embeddings. To keep things simple, we will assume our Hash2Vec vectors have 10 dimensions. Initially, the Hash2Vec vector of each word has all 10 of these components set to 0. We then scan over some corpus of text; for each 5-gram in the corpus, we update the Hash2Vec vector of its middle word using two hash functions. The first hash function, h1, maps a string to a number between 0 and 9 (an index into the components), and the second hash function, h2, maps a string to `\pm 1`. Given a 5-gram (word1, word2, word3, word4, word5), to update the vector for word3 we would add h2(word1 concatenated with "-2") to the h1(word1 concatenated with "-2") component of the vector, then add h2(word2 concatenated with "-1") to the h1(word2 concatenated with "-1") component of the vector, then add h2(word4 concatenated with "+1") to the h1(word4 concatenated with "+1") component of the vector, and finally add h2(word5 concatenated with "+2") to the h1(word5 concatenated with "+2") component of the vector.
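To make the update rule above concrete, here is a minimal sketch in Python. The particular choice of h1 and h2 (built from md5 digests) and the decision to skip "_" padding slots are assumptions for illustration; any hash functions with the stated ranges would do.

```python
import hashlib

DIM = 10  # number of components in each Hash2Vec vector

def h1(s):
    """Map a string to a component index 0..9 (one possible choice: md5)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % DIM

def h2(s):
    """Map a string to +1 or -1 (here: parity of a salted md5 digest)."""
    return 1 if int(hashlib.md5(("sign" + s).encode()).hexdigest(), 16) % 2 == 0 else -1

def five_grams(words):
    """Yield the 5-grams of a word list, padding the ends with "_"."""
    padded = ["_", "_"] + words + ["_", "_"]
    for i in range(len(words)):
        yield tuple(padded[i:i + 5])

def update(vectors, five_gram):
    """Apply one 5-gram's update to the vector of its middle word."""
    w1, w2, w3, w4, w5 = five_gram
    vec = vectors.setdefault(w3, [0] * DIM)
    for word, offset in [(w1, "-2"), (w2, "-1"), (w4, "+1"), (w5, "+2")]:
        if word == "_":  # assumption: padding slots contribute nothing
            continue
        key = word + offset  # e.g. "gave-2" or "to+1"
        vec[h1(key)] += h2(key)

vectors = {}
for gram in five_grams("dick gave the book to jane".split()):
    update(vectors, gram)
print(vectors["book"])
```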
So for example, the 5-gram (gave the book to jane) would adjust the vector of book by adding h2("gave-2") to the h1("gave-2") component of its vector, by adding h2("the-1") to the h1("the-1") component, by adding h2("to+1") to the h1("to+1") component, and by adding h2("jane+2") to the h1("jane+2") component.

Your program will be run from the command line using the syntax:

python3 word_clusterer.py some_folder num_clusters

for example,

python3 word_clusterer.py wiki_files 5

On this input your program should scan each .txt file in the specified folder and generate a dictionary (you are not allowed to use libraries like scikit-learn, etc.) mapping words to their Hash2Vec vectors. Next, it should compute, using hierarchical clustering (using the `L_2` distance between vectors), the top specified number of clusters and output them as comma-separated lists, one cluster per line:

Cluster1: word1_in_cluster1,word2_in_cluster1, ...
Cluster2: word1_in_cluster2,word2_in_cluster2, ...
...

For example,

Cluster1: canada,argentina,wales
Cluster2: has,was,founded,created
Cluster3: a,the
Cluster4: of,in,on,over
Cluster5: or,but,although

To see how well this kind of clustering works, I'd like you to make a folder with ten text files, each using text from a different Wikipedia page. Run your clustering algorithm on it for different choices of the number of clusters. Observe whether you can identify any unifying pattern for the words that end up in the same cluster. Write up your results in a file Experiments.pdf that you also include with your homework.

Point Breakdown
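A minimal sketch of the clustering step: bottom-up (agglomerative) clustering under `L_2` distance, starting with each word in its own cluster and repeatedly merging the two closest clusters until the requested number remain. The single-linkage merge rule here is an assumption for illustration; other linkage choices (complete, average) fit the assignment equally well.

```python
import math

def l2(u, v):
    """Euclidean (L_2) distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cluster(vectors, num_clusters):
    """Agglomerative clustering of {word: vector} down to num_clusters groups.

    Single-linkage (distance between clusters = closest pair of members)
    is an assumed choice; swap in another linkage if preferred.
    """
    clusters = [[w] for w in vectors]
    while len(clusters) > num_clusters:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(l2(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # merge the closest pair
        del clusters[j]
    return clusters

words = {"a": [0, 0], "b": [0.1, 0], "c": [5, 5], "d": [5.1, 5]}
for k, group in enumerate(cluster(words, 2), start=1):
    print("Cluster%d: %s" % (k, ",".join(group)))
```

The final loop prints the clusters in the comma-separated, one-per-line format the assignment asks for.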