Stamp's Master's Students' Defenses: Fall 2019

Who	When	Where	Title
Mugdha Jain	December 16 @ 10:00am	DH 243	Image-Based Malware Classification with Convolutional Neural Networks and Extreme Learning Machines
Snehal Bichkar	December 16 @ 11:00am	DH 243	Hot Fusion vs Cold Fusion for Malware Detection

Image-Based Malware Classification with Convolutional Neural Networks and Extreme Learning Machines

by Mugdha Jain

Research in the field of malware classification often relies on machine learning models that are trained on high-level features, such as opcodes, function calls, and control flow graphs. Extracting such features is costly, since disassembly or code execution is generally required. In this research, we conduct experiments to train and evaluate machine learning models for malware classification, based on features that can be obtained without disassembly or execution of code. Specifically, we visualize malware samples as images and employ image analysis techniques. In this context, we focus on two machine learning models, namely, Convolutional Neural Networks (CNN) and Extreme Learning Machines (ELM). Surprisingly, we find that ELMs can yield results comparable to those of CNNs, yet ELMs are far more efficient to train.
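
As a rough illustration of how a binary can be treated as an image without disassembly or execution, the sketch below reshapes raw bytes into rows of grayscale pixel values. The function name `bytes_to_image`, the row width, and the zero padding are assumptions made for this example; the abstract does not say how the images were actually constructed.

```python
from typing import List


def bytes_to_image(data: bytes, width: int = 16) -> List[List[int]]:
    """Reshape raw bytes into rows of `width` grayscale pixels (0-255).

    The last row is zero-padded so that every row has the same length.
    The width and the padding value are illustrative assumptions.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])   # each byte becomes one pixel
        row.extend([0] * (width - len(row)))    # pad the final, shorter row
        rows.append(row)
    return rows


# Hypothetical 10-byte "sample"; a real input would be the contents of a file.
sample = bytes([0x4D, 0x5A, 0x90, 0x00, 0x03, 0x00, 0x00, 0x00, 0x04, 0xFF])
image = bytes_to_image(sample, width=4)
```

The resulting grid of pixel values could then be handed to an image classifier such as a CNN or an ELM.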

Hot Fusion vs Cold Fusion for Malware Detection

by Snehal Bichkar

A fundamental problem in malware research is malware detection, that is, distinguishing malware samples from benign samples. This problem becomes more challenging when we consider multiple malware families. A typical approach to this multi-family detection problem is to train a machine learning model for each malware family and score each sample against all models. The resulting scores are then used for classification. We refer to this approach as "cold fusion," since we combine previously trained models, and no retraining of these base models is required when additional malware families are considered. An alternative approach is to train a single model on samples from multiple malware families. We refer to this latter approach as "hot fusion," since we must completely retrain the model whenever an additional family is included in our training set. In this research, we compare hot fusion and cold fusion, in terms of both accuracy and efficiency, as a function of the number of malware families considered. We use opcode-based features and employ a variety of machine learning techniques.
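
The difference between the two approaches can be sketched with a toy model. In the example below, an opcode-frequency profile scored by cosine similarity stands in for the machine learning techniques used in the research, which the abstract does not name. The family names, opcode sequences, and helper functions are all hypothetical.

```python
from collections import Counter
from math import sqrt
from typing import Dict, List

Profile = Dict[str, float]


def train_profile(samples: List[List[str]]) -> Profile:
    """Toy 'model': relative opcode frequencies pooled over the samples."""
    counts = Counter(op for sample in samples for op in sample)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}


def score(profile: Profile, sample: List[str]) -> float:
    """Cosine similarity between a sample's opcode frequencies and a profile."""
    freq = train_profile([sample])
    dot = sum(v * profile.get(op, 0.0) for op, v in freq.items())
    norm = sqrt(sum(v * v for v in freq.values())) * sqrt(
        sum(v * v for v in profile.values())
    )
    return dot / norm if norm else 0.0


def pooled(families: Dict[str, List[List[str]]]) -> List[List[str]]:
    """Flatten the per-family samples into one training set."""
    return [s for samples in families.values() for s in samples]


# Hypothetical opcode sequences, invented for illustration only.
families = {
    "family_a": [["mov", "push", "call", "mov"], ["mov", "call", "push"]],
    "family_b": [["xor", "jmp", "xor", "add"], ["xor", "add", "jmp"]],
}

# Cold fusion: one independently trained model per family.
cold_models = {name: train_profile(s) for name, s in families.items()}

# Hot fusion: a single model trained on all families together.
hot_model = train_profile(pooled(families))

# Adding a family: cold fusion trains only the new model,
# while hot fusion must be retrained on the entire pooled set.
families["family_c"] = [["sub", "cmp", "jne", "sub"]]
cold_models["family_c"] = train_profile(families["family_c"])
hot_model = train_profile(pooled(families))

# Score one sample against every cold model and against the hot model.
test_sample = ["mov", "call", "push", "mov"]
cold_scores = {name: score(m, test_sample) for name, m in cold_models.items()}
hot_score = score(hot_model, test_sample)
```

In the cold fusion case, the vector of per-family scores would be passed to a further classification step; how those scores are combined is left open here, as the abstract does not specify it.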