Mugdha Jain | Image-Based Malware Classification with Convolutional Neural Networks and Extreme Learning Machines
Snehal Bichkar | Hot Fusion vs Cold Fusion for Malware Detection
Research in the field of malware classification often relies on machine learning
models that are trained on high-level features, such as opcodes, function calls,
and control flow graphs. Extracting such features is costly, since disassembly
or code execution is generally required. In this research, we conduct experiments
to train and evaluate machine learning models for malware classification, based on
features that can be obtained without disassembly or execution of code. Specifically,
we visualize malware samples as images and employ image analysis techniques. In this
context, we focus on two machine learning models, namely, Convolutional Neural Networks
(CNN) and Extreme Learning Machines (ELM). Surprisingly, we find that ELMs can
yield comparable results to CNNs, yet ELMs are far more efficient to train.
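
As a rough illustration of the pipeline described above, the following sketch maps raw bytes to a fixed-size grayscale image and trains a basic ELM, that is, a random and untrained hidden layer followed by output weights solved in closed form with a pseudo-inverse. The names `bytes_to_image` and `ELM`, the image dimensions, the hidden-layer size, and the tanh activation are all assumptions chosen for illustration; they are not details taken from this research.

```python
import numpy as np


def bytes_to_image(data: bytes, width: int = 32, height: int = 32) -> np.ndarray:
    """Map raw bytes to a height x width grayscale image with values in [0, 1]."""
    buf = np.frombuffer(data, dtype=np.uint8)
    n = width * height
    # Truncate or zero-pad so every sample has the same shape (an assumption;
    # other resizing schemes are possible).
    buf = buf[:n]
    buf = np.pad(buf, (0, n - buf.size))
    return buf.reshape(height, width).astype(np.float64) / 255.0


class ELM:
    """Minimal single-hidden-layer ELM classifier (illustrative only)."""

    def __init__(self, n_hidden: int = 64, seed: int = 0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X: np.ndarray) -> np.ndarray:
        # Hidden weights are random and are never updated during training.
        return np.tanh(X @ self.W + self.b)

    def fit(self, X: np.ndarray, y: np.ndarray) -> "ELM":
        n_classes = int(y.max()) + 1
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        T = np.eye(n_classes)[y]  # one-hot targets; y must hold integer labels
        # The only "training" step: least-squares output weights.
        self.beta = np.linalg.pinv(H) @ T
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        return np.argmax(self._hidden(X) @ self.beta, axis=1)
```

To classify images with this sketch, each image would be flattened first, for example with `bytes_to_image(data).ravel()`. Because fitting reduces to a single pseudo-inverse rather than iterative backpropagation, training cost is low, which is consistent with the efficiency observation above.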

A fundamental problem in malware research is malware detection, that is,
distinguishing malware samples from benign samples. This problem becomes more
challenging when we consider multiple malware families. A typical approach to this
multi-family detection problem is to train a machine learning model for each
malware family and score each sample against all models. The resulting scores
are then used for classification. We refer to this approach as "cold fusion,"
since we combine previously trained models; no retraining of these base
models is required when additional malware families are considered. An alternative
approach is to train a single model on samples from multiple malware families.
We refer to this latter approach as "hot fusion," since we must completely retrain
the model whenever an additional family is included in our training set. In this
research, we compare hot fusion and cold fusion—in terms of both accuracy and
efficiency—as a function of the number of malware families considered.
We use opcode-based features and employ a variety of machine learning
techniques.
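
The structural difference between the two approaches can be sketched as follows. This is a minimal illustration under stated assumptions: the per-family scorer (negative distance to the family mean) and the joint model (a least-squares linear classifier) are simple stand-ins, not the techniques used in this research, and the class names `FamilyScorer`, `ColdFusion`, and `HotFusion` are hypothetical. Feature vectors are assumed to be fixed-length numeric vectors, such as opcode frequency counts.

```python
import numpy as np


class FamilyScorer:
    """Stand-in per-family model: score is negative distance to the family mean."""

    def fit(self, X: np.ndarray) -> "FamilyScorer":
        self.mean = X.mean(axis=0)
        return self

    def score(self, X: np.ndarray) -> np.ndarray:
        return -np.linalg.norm(X - self.mean, axis=1)


class ColdFusion:
    """One independently trained model per family; scores are combined at the end."""

    def __init__(self):
        self.models = {}

    def add_family(self, name: str, X: np.ndarray) -> None:
        # Only the new family's model is trained; existing models are untouched.
        self.models[name] = FamilyScorer().fit(X)

    def predict(self, X: np.ndarray) -> list:
        names = list(self.models)
        scores = np.column_stack([self.models[n].score(X) for n in names])
        return [names[i] for i in scores.argmax(axis=1)]


class HotFusion:
    """A single model trained jointly on all families."""

    def fit(self, data: dict) -> "HotFusion":
        # data maps family name -> feature matrix; every call retrains from scratch.
        self.names = list(data)
        X = np.vstack([data[n] for n in self.names])
        y = np.concatenate(
            [np.full(len(data[n]), i) for i, n in enumerate(self.names)]
        )
        Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
        T = np.eye(len(self.names))[y]  # one-hot targets
        self.W, *_ = np.linalg.lstsq(Xb, T, rcond=None)
        return self

    def predict(self, X: np.ndarray) -> list:
        Xb = np.hstack([X, np.ones((len(X), 1))])
        return [self.names[i] for i in (Xb @ self.W).argmax(axis=1)]
```

In this sketch, adding a family to `ColdFusion` costs one call to `add_family` on the new family's samples alone, whereas `HotFusion` requires calling `fit` again on the full dataset, so its retraining cost grows with the number of families considered.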