CS 286 Syllabus

San Jose State University
Department of Computer Science
CS 286, Topics in Machine Learning, Fall 2017

Course and Contact Information
- Instructor: Mark Stamp
- Office Location: MH 216
- Telephone: 408-924-5094
- Email: mark.stamp@sjsu.edu
- Office hours: Tuesday & Thursday, 1:15 - 2:00pm
- Class Days/Times: Tuesday & Thursday, noon - 1:15pm
- Classroom: MH 233
- Prerequisites: CS 149

Course Description
- Topics in machine learning. The following machine learning techniques are covered in detail: Hidden Markov Models (HMM), Profile Hidden Markov Models (PHMM), Principal Component Analysis (PCA), Support Vector Machines (SVM), and clustering. Illustrative applications of each of these major topics are provided, with most of the applications drawn from the field of information security. In addition, the course will include an overview of each of the following topics: k-Nearest Neighbor, Neural Networks, Boosting/AdaBoost, Random Forests, Linear Discriminant Analysis, Naive Bayes, Regression Analysis, Conditional Random Fields, and Data Analysis. Prerequisite: CS 149.
Learning Outcomes
- The focus of this course will be machine learning, with illustrative applications drawn primarily from information security. After completing this course students should have a working knowledge of a wide variety of machine learning topics, and have a good understanding of the application of such techniques.
Required Texts/Readings
- The primary text will be a manuscript written by your instructor. This manuscript, titled Machine Learning with Applications in Information Security, covers several machine learning techniques in detail, and includes a large number of illustrative applications. Many of the applications are from information security, including a variety of topics related to malware, intrusion detection (IDS), spam, and cryptanalysis, among others. The manuscript will be published as a textbook by Chapman Hall/CRC in September of 2017.
- Additional relevant material:
  - PowerPoint slides at http://www.cs.sjsu.edu/~stamp/ML/powerpoint
  - Current semester lecture videos are available at http://www.cs.sjsu.edu/~stamp/ML/lectures/CS286_Fall17/. If you are asked to log in to access the videos, both the username and password are "infosec". Note: The instructor hereby gives students permission to record his lectures (audio and/or video). At least with respect to this class, your instructor has nothing to hide.
  - Class-related discussion will be posted on Piazza at https://piazza.com/class/j6nmd2xw4gn4je. You are strongly encouraged to participate by asking questions, as well as by responding to questions that other students ask. At the start of the semester, you should receive an email asking you to join this discussion group; if you do not, contact your instructor via email.
- The applications parts of this course are essentially self-contained, but for additional background information on the security-related topics, the following resources are recommended.
  - Computer Viruses and Malware, John Aycock, Springer 2006. Many of the applications we discuss are related to malware. Aycock's book is easy to read and, in spite of being fairly old, provides a good foundation for malware research.
  - Information Security: Principles and Practice, Mark Stamp, Wiley 2011. If you have not taken CS 265, you should do so. You can refer to this fine book if you have questions about security-related topics during this course.
  - Open Malware (at http://www.offensivecomputing.net/) includes a large collection of samples of live malware.
  - VX Heavens (at http://vx.netlux.org/) is a source for "hacker" type of information on viruses. Malware samples are also available.
  - Journal of Computer Virology and Hacking Techniques (at http://www.springer.com/computer/journal/11416) is a journal for malware-specific research papers. There are also several good conferences that focus on malware and/or machine learning applications in information security.
  - Recent masters project reports (at http://www.cs.sjsu.edu/~stamp/cv/mss.html#masters). Most of these projects involve applications of machine learning to malware or other topics in information security.
Course Requirements and Assignments
- SJSU classes are designed such that in order to be successful, it is expected that students will spend a minimum of forty-five hours for each unit of credit (normally three hours per unit per week), including preparing for class, participating in course activities, completing assignments, and so on. More details about student workload can be found in University Policy S12-3 at http://www.sjsu.edu/senate/docs/S12-3.pdf.
- Schedule
  - Week 1 --- Introduction and overview
  - Week 2 --- Hidden Markov Models
  - Week 3 --- Data Analysis
  - Week 4 --- Applications of Hidden Markov Models
  - Week 5 --- Profile Hidden Markov Models
  - Week 6 --- Applications of Profile Hidden Markov Models
  - Week 7 --- Principal Component Analysis
  - Week 8 --- Applications of Principal Component Analysis
  - Week 9 --- Support Vector Machines
  - Week 10 --- Applications of Support Vector Machines
  - Week 11 --- Clustering
  - Week 12 --- Clustering Applications
  - Week 13 --- k-Nearest Neighbor, Neural Networks, Boosting/AdaBoost, Random Forests
  - Week 14 --- Linear Discriminant Analysis, Naive Bayes, Regression Analysis, Conditional Random Fields
  - Week 15 --- Project presentations
- Homework is due typewritten (include source code, but not executable files) by class starting time on the due date. Each assigned problem requires a solution and an explanation and work detailing how you arrived at your solution. Cite any outside sources used to solve a problem. When grading an assignment, I may ask for additional information. Note that a subset of the assigned problems will typically be graded.
  
  Homework must be submitted via email before the start of class on the due date. Be sure to have an extra copy of your homework with you in class, and be prepared to discuss your solutions. Your written solutions must be in a pdf file. Submit any source code or other attachments in separate files (i.e., no code in the solution itself). You must provide enough discussion of your solution so that the grader can understand your solution, and so that the grader can be sure that you understand your solution. Put your written solution and any relevant source code in a folder named "yourlastname". Then zip your homework folder and submit the file yourlastname.zip via email to cs286.fall17@gmail.com. The subject line of your email must be of the form:
```
     CS286HMK assignmentnumber yourlastname last4digitofyourstudentnumber 
```
  The subject line must consist of the four identifiers listed. There is no space within an identifier and each identifier is separated by a space.
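
  As an illustration only, the subject-line format above can be checked with a short script before you send your email. This is a minimal sketch: the helper names `make_subject` and `is_valid_subject` are hypothetical, and the sketch assumes that the assignment number is written in digits and that the last identifier is exactly four digits.

```python
import re

# Four identifiers separated by single spaces, with no space inside an identifier.
# Assumption: assignment number is numeric and the student-number suffix is 4 digits.
SUBJECT_PATTERN = re.compile(r"CS286HMK \d+ \S+ \d{4}")


def make_subject(assignment_number: int, last_name: str, student_number: str) -> str:
    """Build a subject line of the form described in the syllabus."""
    name = "".join(last_name.split())  # drop any spaces inside the last name
    last4 = student_number[-4:]        # last four digits of the student number
    return f"CS286HMK {assignment_number} {name} {last4}"


def is_valid_subject(subject: str) -> bool:
    """Return True if the whole subject line matches the four-identifier format."""
    return SUBJECT_PATTERN.fullmatch(subject) is not None


if __name__ == "__main__":
    subject = make_subject(3, "Stamp", "012345678")
    print(subject, is_valid_subject(subject))
```

  A subject line with a space inside an identifier, or with extra spaces between identifiers, is rejected by this check.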
  - Assignment 0: Due Tuesday, August 29
    Read A Revealing Introduction to Hidden Markov Models (at https://www.cs.sjsu.edu/~stamp/RUA/HMM.pdf). Answer the following questions. For this assignment, turn in a hardcopy of your solutions at the start of class.
    1. Briefly (1 paragraph) summarize how an HMM is trained.
    2. How is a trained HMM used to score a sequence?
    3. Very briefly explain how an HMM and dynamic programming differ.
    4. Why is it necessary to scale when training an HMM?
  - Assignment 1: Due Thursday, September 7
    Chapter 2, problems 1, 2, 3, 10. For problem 10 you must use HMM code that you have written entirely on your own, using the algorithms given in your textbook.
  - Assignment 2: Due Thursday, September 14
    Chapter 2, problems 11, 14, and 15. For these problems you must use your own HMM code.
  - Assignment 3: Due Thursday, September 21
    Chapter 8, problems 1, 6, 7, 8, 9, 10.
  - Assignment 4: Due Thursday, September 28
    Chapter 3, problems 3, 4, 5a, 7, 11.
  - Assignment 5: Due Thursday, October 12
    Chapter 4, problems 3, 4, 9, 10, 11, 13, 17.
  - Assignment 6: Due Thursday, October 26
    Chapter 5, problems 1, 4, 6, 9, 10, 12, 15.
  - Assignment 7: Due Thursday, November 16
    Chapter 6, problems 4, 5, 6, 7, 8, 13, 15, 16.
  - Assignment 8: Due ~~Tuesday, December 5~~ Thursday, December 7
    Chapter 7, problems 1, 2, 3, 4, 7, 12, 15.
  - Assignment 9: Due Varies
    You are required to attend at least one of the student defenses listed here (at http://www.cs.sjsu.edu/~stamp/defenses/fall17.html). If you attend more than one, you will receive extra credit towards the homework score.
- NOTE that University policy F69-24 at http://www.sjsu.edu/senate/docs/F69-24.pdf states that "Students should attend all meetings of their classes, not only because they are responsible for material discussed therein, but because active participation is frequently essential to insure maximum benefit for all members of the class. Attendance per se shall not be used as a criterion for grading."
Grading Policy
- Test 1, 100 points. Date: Tuesday, October 31.
- Homework, quizzes, class participation and other work as assigned, 100 points. A subset of the assigned problems will be graded.
- Machine Learning Project, 100 points. You must obtain approval for your project (via email) by Monday, September 18. A written project report is due Tuesday, November 28. Note that a written report is required, and oral presentations will begin on (or shortly after) the report due date.
- Final, 100 points. Date: Wednesday, December 13 at 9:45 am. The official finals schedule is here: http://info.sjsu.edu/static/policies/final-exam-schedule-fall.html
- Semester grade will be computed as a weighted average of the major scores listed above.
- No make-up tests or quizzes will be given and no late homework or project (or other work) will be accepted.
- Grading Scale:
  
  | Percentage | Grade |
  |---|---|
  | 92 and above | A |
  | 90 - 91 | A- |
  | 88 - 89 | B+ |
  | 82 - 87 | B |
  | 80 - 81 | B- |
  | 78 - 79 | C+ |
  | 72 - 77 | C |
  | 70 - 71 | C- |
  | 68 - 69 | D+ |
  | 62 - 67 | D |
  | 60 - 61 | D- |
  | 59 and below | F |
- Note that "All students have the right, within a reasonable time, to know their academic scores, to review their grade-dependent work, and to be provided with explanations for the determination of their course grades." See University Policy F13-1 at http://www.sjsu.edu/senate/docs/F13-1.pdf for more details.
Guest Lectures
- Juan Miguel Pino, Facebook, Inc.
  - Date: Wednesday, November 8
  - Time: 7:00pm
  - Location: MH 520
  - Topic: Machine Translation @ Facebook
  - Abstract:
    Every day on Facebook, people interact with content in languages other than those they understand. We break those language barriers by supporting translation for more than 45 languages and over 2000 translation directions, serving more than 4B translations per day. We recently improved translation quality by shifting to deep learning techniques. We will go over the models served in production at Facebook and we will give an overview of current and future work.
- TBD
  - Date: Thursday, November 9
  - Time: noon (usual class time)
  - Location: MH 233
  - Topic: Identity Resolution
  - Abstract:
    Identity resolution is the process of uncovering records that are co-referent to the same real-world individual. It plays an important role in a wide variety of tasks, including fraud detection, marketing, relationship discovery, and customer service.
    
    Identity resolution has been the topic of extensive research. We can broadly categorize the existing resolution approaches as deterministic and probabilistic. A deterministic approach produces the same resolution results and is generally dependent on a set of domain-specific rules. A probabilistic approach relies on calculating various probabilities of key matches and combining them to make a determination of matches. A probabilistic approach can employ machine learning algorithms to learn weights, thresholds, or other parameters to improve its accuracy and recall rate. Both of these approaches have their applications. A probabilistic approach may be fine when achieving high accuracy is not critical, for example, if you want to target a marketing campaign to individuals based on their unified identities; in that case, the cost of lower accuracy in your resolution algorithm is minimal. A deterministic approach, by contrast, may be more suitable if you want to identify banking transactions of individuals across two different banks.
    
    Identity resolution poses three primary challenges: (1) the keys identifying the records may not match exactly, because of intentional or unintentional errors, or may not be present in all records; (2) the identity of a person can change over time (for example, a person might change his or her name upon marriage); and (3) the large data size makes pairwise comparison impossible to complete within a reasonable amount of time.
    
    In this presentation, we plan to cover a deterministic solution for identity resolution that addresses the three problems stated above. We have tested our solution on over 200 million unique identities with 1 trillion records. We implemented the solution using Spark/Hadoop on 80 nodes.
- TBD
  - Date: TBD
  - Time: TBD
  - Location: TBD
  - Topic: TBA
  - Abstract: TBA
Classroom Protocol
- Keys to success: Do the homework, complete a good project, and attend class
- Wireless laptop is required. Your laptop must remain closed (preferably in your backpack and, in any case, not on your desk) until I inform you that it is needed for a particular activity
- Cheating will not be tolerated, but working together is encouraged
- Students must be respectful of the instructor and other students. For example,
  - No disruptive or annoying talking
  - Turn off cell phones
  - Class begins on time
  - Class is not over until I say it's over
- Valid picture ID required at all times
- The last day to drop is Wednesday, September 6, and the last day to add is Wednesday, September 13
University Policies
- The Office of Graduate and Undergraduate Programs maintains university-wide policy information relevant to all courses, such as academic integrity, accommodations, etc. You may find all syllabus-related University policies and resources listed on GUP’s Syllabus Information web page at http://www.sjsu.edu/gup/syllabusinfo/
