Aniket Mishra | Cluster Analysis for Concept Drift Detection in Malware
Jonathan Jiang | Multimodal Techniques for Malware Classification
Grace Li | The Art of Detecting AI-Generated Art |
This research addresses concept drift in malware detection,
that is, gradual or sudden changes in malware properties that lower detection accuracy. We propose and analyze a clustering-based approach to detect and adapt to concept drift. Using a subset of the KronoDroid dataset, we segment the data into temporal batches and analyze each batch with MiniBatch K-Means clustering. The silhouette coefficient is used to evaluate clustering quality and to identify drift by detecting significant changes in cluster patterns. We experiment with three scenarios: static models, periodic retraining, and drift-aware retraining.
In each case, we consider four supervised classifiers, namely,
Linear SVM, Random Forest, MLP neural networks, and XGBoost.
Experimental results show that drift-aware retraining guided by silhouette-score thresholds improves classification accuracy compared with static models or periodic retraining. This
provides strong evidence that our clustering-based approach is effective at detecting
concept drift, and it points to an automated path to improved malware detection.
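The following is a minimal sketch of the silhouette-guided drift detection and retraining loop described above, assuming scikit-learn and feature matrices already extracted from the KronoDroid temporal batches; the cluster count, the change threshold, and the train_fn helper are hypothetical placeholders rather than the exact settings used in the study.

    # Sketch: silhouette-based drift detection over temporal batches (assumptions above).
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.metrics import silhouette_score

    def batch_silhouette(X, n_clusters=2, seed=0):
        """Cluster one temporal batch and return its silhouette coefficient."""
        labels = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed).fit_predict(X)
        return silhouette_score(X, labels)

    def drift_aware_retraining(batches, train_fn, threshold=0.1):
        """Retrain only when the silhouette coefficient shifts by more than `threshold`.

        `batches` is a list of (X, y) arrays in temporal order; `train_fn(X, y)`
        returns a fitted classifier (e.g., linear SVM, Random Forest, MLP, XGBoost).
        """
        X0, y0 = batches[0]
        model = train_fn(X0, y0)                 # initial model, as in the static scenario
        prev_sil = batch_silhouette(X0)
        for X, y in batches[1:]:
            sil = batch_silhouette(X)
            if abs(sil - prev_sil) > threshold:  # significant change in cluster patterns
                model = train_fn(X, y)           # drift detected: retrain on the new batch
                prev_sil = sil
            yield model, sil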
Malware continues to be a significant threat to computer systems and networks.
This research utilizes structured information from PE files and
employs a multimodal machine learning approach to differentiate between
malware types. The proposed multimodal approach considers a variety of features
derived from PE headers and the malware body. We then train
several types of learning models independently on header features and body features,
and we combine the outputs of these models to obtain multimodal models.
We compare these multimodal models to models trained only on the PE header
and models trained only on the body, and we also compare them to models
trained on the entire file. We consider SVM, LSTM, and CNN models,
and combinations thereof in the multimodal cases. We find that the multimodal
approach yields a slight, but meaningful, improvement in accuracy.
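Below is a minimal late-fusion sketch of the header/body combination described above, assuming scikit-learn and pre-extracted PE-header and body feature matrices; the SVM base models and probability averaging are illustrative stand-ins, since the exact fusion rule and feature extraction are not specified here.

    # Sketch: train header and body models independently, then fuse their outputs.
    from sklearn.svm import SVC

    def train_multimodal(X_header, X_body, y):
        """Fit one model on PE-header features and one on body features."""
        header_model = SVC(probability=True).fit(X_header, y)
        body_model = SVC(probability=True).fit(X_body, y)
        return header_model, body_model

    def predict_multimodal(models, X_header, X_body):
        """Combine the two models by averaging their class probabilities."""
        header_model, body_model = models
        probs = (header_model.predict_proba(X_header)
                 + body_model.predict_proba(X_body)) / 2.0
        return probs.argmax(axis=1)              # fused multimodal prediction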
In this honors project, we first construct a large dataset of
human-generated art and AI-generated art, each comprising samples from
three different styles. We then consider the problem of distinguishing
the human-generated from the AI-generated art, both as a binary classification
problem and as a multiclass problem in which each sample is classified according to its type. We attain
high accuracies using various features and learning techniques. We
consider directions for future work on this research topic.