Chris Pollett > Old Classes > CS256

HW#3 --- last modified Monday, 23-Oct-2017 21:06:35 PDT.

Solution set.

Due date: Nov 3

Files to be submitted:

Purpose: To build and train a multi-layer neural network in TensorFlow.

Related Course Outcomes:

The main course outcomes covered by this assignment are:

CLO4 -- Be able to select neural network layer types to build a network suitable for various learning tasks such as object classification, object detection, language processing, planning, policy selection, etc.

CLO5 -- Be able to select an appropriate regularization technique for a given learning task.

CLO6 -- Be able to code and train a multi-layer neural network with a library such as Caffe, Theano, or TensorFlow.

CLO7 -- Be able to measure the performance of a model, determine if more data is needed, as well as how to tune the model.


For this homework, I want you to build and train a simple multi-layer neural network in TensorFlow. Your neural network will operate on strings of length 40 over the alphabet `{A, B, C, D}`. Since it is close to Halloween, we can imagine these letters are coming from chemicals in some alien's genetic code (or if you prefer vampires, you can think of this in terms of the vampire genetics from the Blood+ anime). The goal is for the network to classify each string as one of NONSTICK, 12-STICKY, 34-STICKY, 56-STICKY, 78-STICKY, or STICK_PALINDROME.

To understand what these classes mean, let's first define stickiness. The letter A sticks with the letter C (and vice-versa) and the letter B sticks with the letter D (and vice-versa). Given two strings `u` and `v`, `u` sticks with `v` if `len(u) = len(v)` and for all `i < len(u)`, the letter `u[i]` sticks with `v[i]`. For example, `AABDC` sticks with `CCDBA`. As an alien geneticist will tell you, if an alien chromosome has a lot of regions which are sticky, it can help protect the chromosome from mutations. This might be especially important for alien sexual chromosomes.

GENECo has developed a tool which splits alien genetic info into strings which are precisely 40 characters long. Given a string `w`, let `w^R` denote the string written backwards (in reverse). A 40 character string is a stick palindrome if it can be written as the concatenation of two strings `vw` such that `v` sticks with `w^R`. A 40 character string is `k`-sticky if it can be written as a concatenation of three strings `uvw` such that `len(u)=k` and `u` sticks with `w^R`.

After training, your network should output NONSTICK on any string which is not even `1`-sticky; otherwise, it should output `k(k+1)`-STICKY if it is either `k`-sticky or `k+1`-sticky, and it is not also in the next class in our list of classes above. Finally, it should output STICK_PALINDROME only if the string is a stick palindrome.
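The definitions above can be turned into a small labeling routine. What follows is a minimal sketch rather than a reference solution: the names `STICKS_WITH`, `CLASSES`, `sticks`, `stickiness`, and `label` are hypothetical, and the sketch assumes one reading of the class rules, namely that a string which is 7-sticky or more, but not a stick palindrome, is reported as 78-STICKY.

```python
# Hypothetical helper names; the letter pairing comes from the definition above.
STICKS_WITH = {"A": "C", "C": "A", "B": "D", "D": "B"}

CLASSES = ["NONSTICK", "12-STICKY", "34-STICKY", "56-STICKY",
           "78-STICKY", "STICK_PALINDROME"]


def sticks(u, v):
    """Return True if u and v have equal length and stick letter by letter."""
    return len(u) == len(v) and all(STICKS_WITH[a] == b for a, b in zip(u, v))


def stickiness(s):
    """Return the largest k such that s is k-sticky (0 if not even 1-sticky)."""
    k = 0
    # s is k-sticky exactly when s[i] sticks with s[-1 - i] for every i < k.
    while k < len(s) // 2 and STICKS_WITH[s[k]] == s[-1 - k]:
        k += 1
    return k


def label(s):
    """Map an even-length snippet to one of the six class names."""
    k = stickiness(s)
    if k == 0:
        return "NONSTICK"
    if k == len(s) // 2:
        return "STICK_PALINDROME"
    # Assumption: any k >= 7 short of a stick palindrome counts as 78-STICKY.
    return CLASSES[min((k + 1) // 2, 4)]
```

For instance, `sticks("AABDC", "CCDBA")` is true, and `label("A" * 20 + "C" * 20)` gives STICK_PALINDROME because every character in the first half sticks with its mirror image in the second half.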

To generate data for your neural net, I want you to write a program. This program's command line signature should look like:

python num_snippets mutation_rate from_ends output_file

Here num_snippets is the number of gene snippets to generate. To understand mutation_rate imagine we start with a string that is a stick palindrome. Then mutation_rate is a float between 0 and 1 representing the odds that a character gets mutated to a random other character, and from_ends is the distance from either the start or end of the string to apply the mutation rate to. Characters further than this distance from either end are mutated with probability 1. Finally, output_file is a file to write the data generated by your program to.
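The generation procedure described above might be sketched as follows. This is an illustration under stated assumptions, not the required implementation: the function names and the `seed` parameter are hypothetical, each snippet is assumed to start from a freshly drawn random stick palindrome, and "within `from_ends` of either end" is read as positions `i < from_ends` or `i >= 40 - from_ends`.

```python
import random

ALPHABET = "ABCD"
STICKS_WITH = {"A": "C", "C": "A", "B": "D", "D": "B"}


def random_stick_palindrome(rng, length=40):
    """Draw a random first half, then append its reversed sticky partner."""
    half = [rng.choice(ALPHABET) for _ in range(length // 2)]
    return "".join(half) + "".join(STICKS_WITH[c] for c in reversed(half))


def mutate(snippet, mutation_rate, from_ends, rng):
    """Mutate end characters at mutation_rate and all others with probability 1."""
    n = len(snippet)
    out = []
    for i, c in enumerate(snippet):
        near_end = i < from_ends or i >= n - from_ends
        p = mutation_rate if near_end else 1.0
        if rng.random() < p:
            # A mutation always changes the character to a different letter.
            c = rng.choice([x for x in ALPHABET if x != c])
        out.append(c)
    return "".join(out)


def generate(num_snippets, mutation_rate, from_ends, seed=None):
    """Return a list of num_snippets mutated 40-character snippets."""
    rng = random.Random(seed)
    return [mutate(random_stick_palindrome(rng), mutation_rate, from_ends, rng)
            for _ in range(num_snippets)]
```

With a mutation rate of 0, the first and last `from_ends` characters are left untouched, so every generated snippet is at least `from_ends`-sticky under this reading.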

As a concrete example, you might run with the values:

python 3 .1 3 test_data.txt

This might write to the file test_data.txt the lines:


To train and test your neural net, I want you to write a second program using Python and the Tensorflow library. This program's command line signature should look like:

python mode model_file data_folder

Here mode is one of: train, in which case your program will use all the data for training; 5fold, in which case your program will do 5-fold cross-validation training and testing; or test, in which case your program will use the provided model and the data folder to do testing only. model_file is a file to read the trained weights of your program from or to write them to. In the case of either train or 5fold mode, your program should write a model to this file. For 5fold, as you really generate five models during this process, just write the last one. For test, the neural net in your program should use the weights in model_file when performing testing. The data_folder is the name of a folder that your program will look for .txt files in. For each .txt file, it will first check that, excluding whitespace, each line is 40 characters. If so, it will use the file; if not, it will skip it. Your program should train on all valid files in the data_folder (you don't need to train on files in sub folders).
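The file validation rule above could be sketched like this. The function name `load_snippets` is hypothetical, and the sketch assumes that lines consisting only of whitespace are ignored rather than treated as invalid, which the description does not settle.

```python
from pathlib import Path


def load_snippets(data_folder, length=40):
    """Collect snippets from every valid .txt file directly inside data_folder."""
    snippets = []
    # glob("*.txt") does not descend into sub folders.
    for path in sorted(Path(data_folder).glob("*.txt")):
        if not path.is_file():
            continue
        # Strip all whitespace from each line before measuring its length.
        lines = ["".join(line.split()) for line in path.read_text().splitlines()]
        lines = [line for line in lines if line]  # assumption: skip blank lines
        if lines and all(len(line) == length for line in lines):
            snippets.extend(lines)
        # Otherwise the whole file is skipped.
    return snippets
```

Reading everything into one list matches the simplifying assumption, stated in the requirements, that all the data in a folder fits in memory.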

Here are some additional requirements on this program:

  1. The neural net trained/used in testing by your program should be implemented in Tensorflow.
  2. It should consist of five dense layers, the last of which is a softmax layer.
  3. Hidden layers should use rectified linear units.
  4. Models should be trained using stochastic gradient descent, but no momentum. The mini-batch size should be a configurable constant set at the top of your program. Your program when it reads a string should determine its label before giving it to Tensorflow for training.
  5. You can assume all the data in a folder can be read into memory, so you can read everything in first. As your program processes data, it should write messages to the console indicating how much progress it has made at least every thousand items processed (make sure 1000 is some multiple of the mini-batch size). It should also write a message "Processing complete!" followed by the total number of items trained on and the total number of items tested on.
  6. If the mode is either 5fold or test your program should output the overall accuracy over the test data, the four rates from the confusion matrix, the total runtime for training, and the total runtime for testing.
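Requirements 2 through 4 could be met by a network along the following lines. This is only a sketch under several assumptions: it uses the `tf.keras` API from later TensorFlow releases rather than whatever version the course used, the hidden layer widths in `HIDDEN_UNITS`, the learning rate, and the one-hot input encoding are illustrative choices the assignment does not specify, and labels are assumed to be integer class indices from 0 to 5.

```python
import numpy as np
import tensorflow as tf

BATCH_SIZE = 100  # configurable mini-batch size; 1000 is a multiple of it
ALPHABET = "ABCD"
NUM_CLASSES = 6
HIDDEN_UNITS = [128, 64, 32, 16]  # assumption: widths are not fixed by the assignment


def one_hot(snippet):
    """Encode a snippet as a flat vector with one 4-way indicator per character."""
    vec = np.zeros((len(snippet), len(ALPHABET)), dtype=np.float32)
    for i, c in enumerate(snippet):
        vec[i, ALPHABET.index(c)] = 1.0
    return vec.reshape(-1)


def build_model(input_dim=160, learning_rate=0.01):
    """Four ReLU dense layers followed by a softmax dense layer (five in total)."""
    layers = [tf.keras.Input(shape=(input_dim,))]
    layers += [tf.keras.layers.Dense(n, activation="relu") for n in HIDDEN_UNITS]
    layers.append(tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"))
    model = tf.keras.Sequential(layers)
    model.compile(
        # Plain stochastic gradient descent with momentum switched off.
        optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate, momentum=0.0),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Training on encoded data `x` with integer labels `y` would then be a call such as `build_model().fit(x, y, batch_size=BATCH_SIZE, epochs=1)`. The progress messages and timing output asked for in items 5 and 6 are not covered by this sketch.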

For the last part of your homework I want you to conduct some experiments. To do this I want you to create four training folders and three test folders. The first three training folders should consist of six files each. Each file should consist only of data that mostly matches one of the six classes that your net is supposed to recognize. For example, if you want to generate data only for the class 12-STICKY you might run:

python 2500 0 1 out1.txt
python 2500 0 2 out2.txt
and then concatenate the two files. Note that even though the first line above guarantees that all items are at least 1-sticky, some of them might be 2- or 3-sticky just by random chance, based on how the generator works. For the first folder, the files should each contain 5000 training examples; for the second folder, they should each contain 10000 training examples; and for the third folder, the files should each contain 20000 training examples. The last training folder should consist of a single file of 60000 items, all generated using the same method as you used for the NONSTICK class.

The three test folders should consist of files with 5000 items generated in much the same way as the first training folder, except that rather than using a mutation rate of 0, you use, respectively, 0.2, 0.4, and 0.6 as the mutation rate.

You don't need to submit the data folders with your homework but you should include with your homework a shell script that generates the data using

Given the above data sets I want you to design and conduct experiments which follow the guidelines for experiments from the Oct 11 Lecture and which answer the following questions:

  1. What is the effect of training set size on how good the trained model is? How does this compare to the number of weights that your model has?
  2. What is the difference between training on data chosen completely at random versus on well chosen examples? What is the difference in choosing the testing data at random versus data that is likely to be equally representative of the classes that we are training for?
  3. What is the difference in accuracy in using cross-validation for testing versus using separate test data?

Put your experiment write-ups in Hw3.pdf, which you should include in your file.
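For the cross-validation side of the third question, the 5-fold procedure can be sketched as below. The names `k_fold_splits` and `accuracy` are hypothetical, and shuffling with a fixed seed before splitting is an assumption rather than something the assignment prescribes.

```python
import random


def k_fold_splits(num_items, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs; each item is tested exactly once."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at offset i.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)
```

Each of the five rounds trains a fresh model on `train` and evaluates it on `test`; pooling the predictions from all five test folds gives the overall cross-validation accuracy that can be compared against accuracy on a separate test folder.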

Point Breakdown

Both programs make use of their command line arguments as described. 1pt
The generator program generates strings as described and outputs them to the desired file. 1pt
The training program writes a model as described on training data, and this model can be reloaded for use in testing. 1pt
Items 1-6 above for the training program (1/2pt each). 3pts
The shell script generates the data sets requested. 0.5pt
There is a Hw3.pdf file and the experiment designs therein meet the Oct 11 guidelines. 0.5pt
Write-ups of the three experiments (1pt each). 3pts