Stamp's Master's Students' Defenses: Spring 2019

Who	When	Where	Title
Samanvitha Basole	May 6 @ 12:30pm	MH 322	Multifamily Malware Models
Mayuri Wadkar	May 8 @ 1:30pm	MH 221	Measuring Malware Evolution using Support Vector Machines
Nivedhitha Ramarathnam Krishna	May 9 @ 11:30am	MH 221	Classifying Classic Ciphers Using Machine Learning
Ashraf Saber	May 16 @ 1:00pm	MH 320	Intrusion Detection and CAN Vehicle Networks
Akriti Sethi	May 15 @ 12:30pm	MH 221	Classifying Malware Models
Anukriti Sinha	May 10 @ 10:00am	MH 210	Emulation vs Instrumentation for Android Malware Detection
Parth Jain	May 16 @ noon	MH 210	Machine Learning vs Deep Learning for Malware Detection
Tazmina Sharmin	May 13 @ noon	MH 322	Deep Learning for Image Spam Detection
Preethi Sundaravaradhan	May 13 @ 11:00am	MH 221	Smartphone Gesture-Based Authentication

Multifamily Malware Models

by Samanvitha Basole

When training a machine learning model, there is likely to be a tradeoff between the accuracy of the model and the generality of the dataset. For example, previous research has shown that if we train a model to detect one specific malware family, we typically obtain stronger results, as compared to the case where we train a single model on multiple diverse families. However, during the detection phase, it would be more efficient to use a single model to detect multiple families, rather than having to score each sample against multiple models. In this research, we conduct extensive experiments to quantify the relationship between the generality of the training dataset and the accuracy of various machine learning models, within the context of the malware detection problem.

Measuring Malware Evolution using Support Vector Machines

by Mayuri Wadkar

Malware is software that is designed to do harm to computer systems. As with other software, malware typically evolves over time as developers add new features and fix bugs. Thus, malware samples from the same family from different time periods can exhibit significantly different behavior. For example, differences between malware samples within a family can originate from various code modifications designed to evade signature-based detection, or changes that are made to alter the functionality of the malware. In this research, we apply feature ranking based on linear support vector machine (SVM) weights to identify, quantify, and track changes within malware families over time. We analyze numerous malware families over extended periods of time. We show that it is possible to detect evolutionary changes within malware families using quantifiable and automated machine learning techniques.
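
As an illustration of feature ranking by linear SVM weights, here is a minimal Python sketch. It is not the implementation used in this research. It assumes a toy linear SVM trained by batch subgradient descent on the hinge loss (no bias term), and the feature values, labels, and function names (train_linear_svm, rank_features) are all hypothetical. Features with a larger absolute weight contribute more to the decision boundary, so comparing rankings from models trained on different time windows is one plausible way to observe change.

```python
def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Toy linear SVM: batch subgradient descent on the L2-regularized hinge loss."""
    dim = len(samples[0])
    n = len(samples)
    w = [0.0] * dim
    for _ in range(epochs):
        # Gradient of the regularization term.
        grad = [lam * wj for wj in w]
        for x, y in zip(samples, labels):
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            if margin < 1.0:
                # Hinge loss is active for this sample.
                for j in range(dim):
                    grad[j] -= y * x[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def rank_features(weights):
    """Return feature indices ordered from largest to smallest absolute weight."""
    return sorted(range(len(weights)), key=lambda j: abs(weights[j]), reverse=True)


# Hypothetical data: feature 1 separates the two classes, feature 0 is
# uncorrelated with the label, and feature 2 is constant.
samples = [[1, 2, 0], [-1, 2, 0], [1, -2, 0], [-1, -2, 0]]
labels = [1, 1, -1, -1]

weights = train_linear_svm(samples, labels)
ranking = rank_features(weights)
print("weights:", weights)
print("ranking:", ranking)
```

In this toy case, the separating feature receives the only nonzero weight and is therefore ranked first.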

Classifying Classic Ciphers Using Machine Learning

by Nivedhitha Ramarathnam Krishna

We consider the problem of identifying the classic cipher that was used to generate a given ciphertext message. We assume that the plaintext is English and we restrict our attention to ciphertext consisting only of alphabetic characters. Among the classic ciphers considered are the simple substitution, Vigenère, Playfair, and column transposition ciphers. The problem of classification is approached in two ways. The first method uses support vector machines (SVM) trained directly on ciphertext to classify the ciphers. In the second approach, we train hidden Markov models (HMM) on each ciphertext message, then use these trained HMMs as features for classifiers. Under this second approach, we compare two classification strategies, namely, convolutional neural networks (CNN) and SVMs. For the CNN classifier, we convert the trained HMMs into images. Extensive experimental results are provided for each of the classification techniques under consideration.
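
To make the setting concrete, the following Python sketch encrypts the same English plaintext with two of the ciphers named above and computes one ciphertext statistic. The index of coincidence is used here only as an illustrative assumption; it is not claimed to be a feature used in this research. The keys and function names are likewise hypothetical. A simple substitution only relabels letters, so it preserves this statistic exactly, whereas a Vigenère cipher typically flattens the letter distribution.

```python
import string
from collections import Counter

ALPHABET = string.ascii_uppercase


def letters_only(text):
    """Keep alphabetic characters only, as uppercase."""
    return "".join(ch for ch in text.upper() if ch in ALPHABET)


def simple_substitution(plaintext, key):
    """Encrypt with a simple substitution; key is a permutation of A-Z."""
    return plaintext.translate(str.maketrans(ALPHABET, key))


def vigenere(text, keyword, decrypt=False):
    """Encrypt (or decrypt) with a Vigenère cipher using the given keyword."""
    sign = -1 if decrypt else 1
    out = []
    for i, ch in enumerate(text):
        shift = ALPHABET.index(keyword[i % len(keyword)])
        out.append(ALPHABET[(ALPHABET.index(ch) + sign * shift) % 26])
    return "".join(out)


def index_of_coincidence(text):
    """Probability that two letters drawn without replacement are equal."""
    n = len(text)
    counts = Counter(text)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


plaintext = letters_only(
    "We assume that the plaintext is English and we restrict our "
    "attention to ciphertext consisting only of alphabetic characters"
)
substitution_ct = simple_substitution(plaintext, "QWERTYUIOPASDFGHJKLZXCVBNM")
vigenere_ct = vigenere(plaintext, "LEMON")

print("plaintext IC:   ", index_of_coincidence(plaintext))
print("substitution IC:", index_of_coincidence(substitution_ct))
print("Vigenere IC:    ", index_of_coincidence(vigenere_ct))
```

A classifier would see many such ciphertexts per cipher; this sketch only shows that different ciphers can leave different statistical traces.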

Intrusion Detection and CAN Vehicle Networks

by Ashraf Saber

In this paper, we consider intrusion detection systems (IDS) in the context of a controller area network (CAN), which is also known as the CAN bus. We provide a discussion of various IDS topics, including masquerade detection, and we include a selective survey of previous research on IDS for CAN. We also discuss background topics and relevant practical issues, such as data collection on the CAN bus. Finally, we present experimental results where we have applied a variety of machine learning techniques to both real and simulated CAN data. Our experiments show that machine learning models can be used to effectively determine the status of a vehicle, as well as to detect masquerading behavior based on CAN data.

Classifying Malware Models

by Akriti Sethi

Automatically classifying similar malware families is a challenging problem. In this research, we attempt to classify malware families by applying machine learning to trained machine learning models. Specifically, we train hidden Markov models (HMM) for each malware family in our dataset. The resulting models are then compared in two ways. First, we treat the HMM matrices as images and experiment with convolutional neural networks (CNN) for image classification. Second, we apply support vector machines (SVM) to classify the HMMs. We analyze the results and discuss the relative advantages and disadvantages of these approaches.
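
One way to picture "HMM matrices as images" is sketched below in Python. The layout is an assumption made for illustration: the abstract does not specify how the matrices are arranged or scaled. Here each hidden state contributes one image row, formed by placing its transition probabilities next to its emission probabilities and scaling each probability to a grayscale value in 0..255. The function name hmm_to_image and the example matrices are hypothetical.

```python
def hmm_to_image(transition, emission):
    """Build a grayscale image (list of pixel rows) from HMM matrices.

    transition: N x N row-stochastic matrix (state transition probabilities)
    emission:   N x M row-stochastic matrix (observation probabilities)
    """
    if len(transition) != len(emission):
        raise ValueError("both matrices need one row per hidden state")
    image = []
    for a_row, b_row in zip(transition, emission):
        # Concatenate the two rows, then map probabilities to pixel values.
        image.append([round(p * 255) for p in list(a_row) + list(b_row)])
    return image


# Hypothetical 2-state HMM with 3 observation symbols.
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.1, 0.4, 0.5],
     [0.7, 0.2, 0.1]]

image = hmm_to_image(A, B)
for row in image:
    print(row)
```

The resulting 2 x 5 grid could be passed to an image classifier, or flattened into a feature vector for an SVM.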

Emulation vs Instrumentation for Android Malware Detection

by Anukriti Sinha

In resource-constrained devices, malware detection is typically based on offline analysis using emulation. Previous work has claimed that such emulation fails for a significant percentage of Android malware, because well-designed malware detects that the code is being emulated. An alternative to emulation is malware analysis based on code that is executed on an actual Android device. In this research, we collect features from a corpus of Android malware using both emulation and on-phone instrumentation. We train machine learning models based on emulated features and also train models based on features collected via instrumentation. We find that the differences between these two cases are negligible. We conclude that it is uncommon for Android malware to implement evasive strategies based on emulation detection.

Machine Learning vs Deep Learning for Malware Detection

by Parth Jain

It is often claimed that an advantage of deep learning is that such models can continue to learn as more data is made available for training. In contrast, for other forms of machine learning it is claimed that the models "saturate," in the sense that no additional learning can occur beyond some point, regardless of the amount of additional data or computing power that is brought to bear on the problem. In this research, we compare the accuracy of deep learning to other forms of machine learning for malware detection, as a function of the training dataset size. We experiment with a wide variety of hyperparameters for our deep learning models, and we compare these models to results obtained using k-nearest neighbors. In our experiments, we use a subset of a large and diverse malware dataset that was collected as part of a recent research project.
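
The kind of experiment described above, accuracy as a function of training set size, can be sketched for k-nearest neighbors as follows. This is a minimal Python illustration under stated assumptions: the two-dimensional points, the labels, and the function names are hypothetical, and the real experiments use malware features rather than toy data.

```python
from collections import Counter


def knn_predict(train, point, k=1):
    """Classify point by majority vote among its k nearest training samples."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], point)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy_by_training_size(pool, test, sizes, k=1):
    """Train on the first `size` samples of pool and score on test, per size."""
    results = {}
    for size in sizes:
        train = pool[:size]
        correct = sum(
            1 for features, label in test
            if knn_predict(train, features, k) == label
        )
        results[size] = correct / len(test)
    return results


# Hypothetical (features, label) pairs; label 0 = benign, 1 = malware.
pool = [([0.0, 0.0], 0), ([1.0, 1.0], 1), ([0.6, 0.6], 0), ([0.9, 0.7], 1)]
test = [([0.55, 0.6], 0), ([0.1, 0.0], 0), ([1.0, 0.9], 1), ([0.85, 0.75], 1)]

curve = accuracy_by_training_size(pool, test, sizes=[2, 4])
print(curve)
```

Plotting such a curve for each model type over increasing sizes is one way to check whether a model's accuracy levels off as more data is added.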

Deep Learning for Image Spam Detection

by Tazmina Sharmin

Spam can be defined as unsolicited bulk email. In an effort to evade text-based spam filters, spammers can embed their spam text in an image, which is referred to as image spam. In this research, we consider the problem of image spam detection, based on image analysis. We apply various machine learning and deep learning techniques to real-world image spam datasets, and to a challenge image spam-like dataset. We obtain results comparable to previous work for the real-world datasets, while our deep learning approach yields the best results to date for the challenge dataset.

Smartphone Gesture-Based Authentication

by Preethi Sundaravaradhan

In this research, we consider the problem of authentication on a smartphone based on gestures, that is, movements of the phone. We collect accelerometer data from a number of subjects and analyze this data using a variety of machine learning techniques, including support vector machines (SVM) and convolutional neural networks (CNN). We determine both the fraud rate (i.e., the false accept rate) and the insult rate (i.e., the false reject rate) in each case.
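
The two error rates can be computed from classifier scores as in the short Python sketch below. The scores, the threshold, and the function name are hypothetical; the sketch assumes that a higher score means the classifier is more confident the gesture belongs to the legitimate user.

```python
def fraud_and_insult_rates(genuine_scores, impostor_scores, threshold):
    """Return (fraud rate, insult rate) at the given acceptance threshold.

    fraud rate  = false accept rate: impostor attempts that are accepted
    insult rate = false reject rate: genuine attempts that are rejected
    """
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(genuine_scores))


# Hypothetical scores for attempts by the legitimate user and by others.
genuine = [0.9, 0.8, 0.75, 0.4]
impostor = [0.1, 0.3, 0.6, 0.2]

fraud, insult = fraud_and_insult_rates(genuine, impostor, threshold=0.5)
print("fraud rate:", fraud, "insult rate:", insult)
```

Raising the threshold lowers the fraud rate at the cost of a higher insult rate, so the two are usually reported together.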