| Christofer Washington Berruz Chungata | Concept Drift Detection and Adaptive Retraining of Malware Classification Models | ||
| Jhanvi Lotwala | TBD | ||
| Nathan Durrant | TBD |
Concept drift refers to changes over time in the
statistical properties of data, as compared to the data
that was used to train a learning model.
Machine learning models for malware detection or classification
are particularly susceptible to performance
degradation caused by concept drift, as
attackers constantly modify existing malware.
In this paper, we analyze two machine learning-based
approaches to automated
concept drift detection: a novel approach based on
One-Class Support Vector Machines (OCSVM)
and a previously studied technique
based on Minibatch K-Means (MK-Means). For
comparison, we also consider
Maximum Mean Discrepancy (MMD),
a statistical technique for detecting changes in multidimensional data.
We conduct an extensive series of experiments
comparing the effectiveness of four learning
models, namely, Multilayer Perceptron (MLP),
Random Forest (RF), Support Vector Machines (SVM),
and eXtreme Gradient Boosting (XGB).
For each of these models, we consider three distinct scenarios:
a static scenario where no model retraining occurs,
a periodic scenario where models are retrained
at regular intervals irrespective of concept drift,
and a drift-aware scenario where
models are retrained only when concept drift is detected.
Under the drift-aware scenario,
we analyze the tradeoff between accuracy and training efficiency
using Pareto Front analysis.
We find that drift-aware retraining based on any of the three
concept drift detection techniques
achieves classification accuracy comparable to
periodic retraining, while substantially reducing
the number of models that must be retrained.
In addition, drift-aware retraining based on
our OCSVM technique generally outperforms
the MK-Means and MMD approaches.
Overall, these results provide strong evidence that we are able to accurately
detect concept drift in malware classification models.
Furthermore, our concept drift detection techniques
are efficient and practical, and the process of updating
learning models can easily be fully automated.
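
To make the drift-aware scenario described above concrete, the following is a minimal sketch of how an MMD-based check could trigger retraining. It is not the implementation used in the paper. The function names (`rbf_kernel`, `mmd_squared`, `drift_aware_retraining`, `sample`), the Gaussian kernel, the `gamma` value, the threshold, and the synthetic two-dimensional data are all assumptions chosen for illustration. Replacing the reference sample with the drifted batch stands in for retraining a classifier on recent data.

```python
import math
import random


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs, ys, gamma=0.5):
    """Biased estimate of the squared MMD between samples xs and ys.

    MMD^2 = mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)
    """
    k_xx = sum(rbf_kernel(a, b, gamma) for a in xs for b in xs) / (len(xs) ** 2)
    k_yy = sum(rbf_kernel(a, b, gamma) for a in ys for b in ys) / (len(ys) ** 2)
    k_xy = sum(rbf_kernel(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2.0 * k_xy


def drift_aware_retraining(reference, batches, threshold, gamma=0.5):
    """Return the indices of the batches at which a retrain would be triggered.

    Hypothetical policy: when MMD^2 between the reference sample and a new
    batch exceeds the threshold, record a retrain and adopt that batch as
    the new reference (a stand-in for retraining on recent data).
    """
    retrain_at = []
    for i, batch in enumerate(batches):
        if mmd_squared(reference, batch, gamma) > threshold:
            retrain_at.append(i)
            reference = batch
    return retrain_at


def sample(mean, n, rng, dim=2):
    """Draw n synthetic feature vectors from an isotropic Gaussian (toy data)."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


# Toy stream: two batches from the training distribution, then a shift.
rng = random.Random(0)
reference = sample(0.0, 60, rng)
batches = [
    sample(0.0, 60, rng),
    sample(0.0, 60, rng),
    sample(3.0, 60, rng),  # drift begins here
    sample(3.0, 60, rng),
]
retrain_at = drift_aware_retraining(reference, batches, threshold=0.2)
print(retrain_at)
```

In this toy stream, a periodic policy would retrain at all four batches, whereas the drift-aware policy should retrain only at the batch where the distribution shifts. In practice the threshold would need to be calibrated, for example with a permutation test on held-out data.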
TBD
TBD