Rachel Gonsalves | Data Poisoning Attacks on HMMs
Wei-Chung Huang | Image Robust Hashing for Malware Detection
Samuel Kim | PE Headers for Malware Classification
Aditya Raghavan | Boosted HMMs for Malware Detection
Anish Singh Shekhawat | Analysis of Encrypted Malicious Traffic
Supraja Suresh | Analyzing Android Adware
With ever-increasing volumes of data in use,
machine learning systems involving minimal human oversight are crucial for
classification and analysis tasks. Machine learning algorithms
used for such purposes have revolutionized the way we sort, classify,
and analyze data. The accuracy of any machine learning algorithm depends
heavily on the data it is trained on. In some circumstances,
an attacker can attempt to poison the training data to subvert a
machine learning system. In this research, we analyze the effects of
training data poisoning attacks on hidden Markov models (HMMs),
in the context of malware classification. We find that HMMs are
surprisingly sensitive to such attacks.
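The poisoning scenario described above can be illustrated with a minimal sketch. The abstract does not say how the training data is poisoned, so the approach here (replacing a fraction of the clean training sequences with sequences from another source) is an assumption, and the function name `poison_training_set` and its parameters are hypothetical.

```python
import random


def poison_training_set(clean, contaminant, fraction, seed=0):
    """Return a copy of `clean` with a fraction of its sequences replaced.

    Assumption: poisoning is modeled as swapping whole training sequences
    for sequences drawn from `contaminant`. The source does not specify
    the actual attack, so this is only one plausible setup.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    count = round(fraction * len(clean))
    if count > 0 and not contaminant:
        raise ValueError("contaminant must not be empty when fraction > 0")
    rng = random.Random(seed)  # fixed seed keeps experiments repeatable
    poisoned = list(clean)
    # Pick distinct positions to overwrite so exactly `count` entries change.
    for index in rng.sample(range(len(poisoned)), count):
        poisoned[index] = rng.choice(contaminant)
    return poisoned
```

Training one model on the clean set and another on the poisoned set, then scoring both on the same held-out samples, would give a rough measure of how sensitive the model is at each poisoning level.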
Robust hashing is a technique that has been successfully used to detect similarity in images. In this research, we consider a novel robust-hashing-inspired approach for detecting malware families. Specifically, we treat each executable file as a two-dimensional image and use robust hashing techniques to determine whether a given executable belongs to a particular family. The robust hashing stage comprises two steps, namely feature extraction and compression, while the classification phase is based on machine learning. We compare our robust hashing approach to other machine learning based malware classification techniques.
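The two robust hashing steps named above, feature extraction and compression, can be sketched as follows. The specific choices here (a fixed image width, block means as features, median thresholding as compression) are assumptions made for illustration and are not taken from the abstract. The Hamming distance helper is a simple stand-in for comparing hashes, not the machine learning classifier the abstract refers to.

```python
import math


def bytes_to_image(data, width=16):
    """Lay the bytes out row by row as a grayscale image, zero-padding the last row."""
    if not data:
        return [[0] * width]
    rows = math.ceil(len(data) / width)
    padded = data + bytes(rows * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(rows)]


def block_means(image, grid=4):
    """Feature extraction: mean intensity of each cell in a grid x grid partition."""
    height, width = len(image), len(image[0])
    means = []
    for gr in range(grid):
        r0, r1 = gr * height // grid, (gr + 1) * height // grid
        for gc in range(grid):
            c0, c1 = gc * width // grid, (gc + 1) * width // grid
            cells = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            # An empty cell (image smaller than the grid) contributes 0.0.
            means.append(sum(cells) / len(cells) if cells else 0.0)
    return means


def compress_to_bits(features):
    """Compression: one bit per feature, set when the feature exceeds the median."""
    ordered = sorted(features)
    median = ordered[len(ordered) // 2]
    bits = 0
    for value in features:
        bits = (bits << 1) | (1 if value > median else 0)
    return bits


def robust_hash(data, width=16, grid=4):
    """Hash a byte string by viewing it as an image (hypothetical pipeline)."""
    return compress_to_bits(block_means(bytes_to_image(data, width), grid))


def hamming_distance(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")
```

Because each bit summarizes a whole block, a small change to the file tends to leave most bits unchanged, which is the property that makes such a hash useful for measuring similarity.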
Recent research indicates that effective malware detection can be based
on analyzing portable executable (PE) file headers. Such research typically
relies on prior knowledge of the header to extract relevant features. However,
it is also possible to consider the entire header as a whole,
and use this directly to determine whether the file is malware.
In this research, we collect a large and diverse malware dataset.
We then analyze the effectiveness of various machine learning techniques
based on PE headers to classify the malware samples.
We compare the accuracy and efficiency of each technique considered.
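Treating the header as a whole, rather than parsing individual fields, can be sketched as below. The 1024-byte length, the scaling to [0, 1], and the function names are assumptions for illustration; the abstract does not state how the header bytes are encoded. The only format fact relied on is that PE files begin with the two-byte `MZ` signature.

```python
def has_mz_magic(data):
    """Check for the 'MZ' signature that opens every PE file."""
    return data[:2] == b"MZ"


def header_features(data, length=1024):
    """Return the first `length` bytes scaled to [0, 1].

    Assumption: the raw leading bytes stand in for "the entire header",
    with zero-padding when the file is shorter than `length`.
    """
    head = data[:length]
    padded = head + bytes(length - len(head))
    return [byte / 255.0 for byte in padded]
```

A fixed-length vector of this kind can be fed to most standard classifiers without any knowledge of the header layout.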
Digital security is an important issue today, and efficient malware detection is at the forefront of research into building secure digital systems. As with many other fields, malware detection has recently seen a dramatic increase in the application of machine learning algorithms. One machine learning technique that has found widespread application in pattern matching in general, and malware detection in particular, is the hidden Markov model (HMM). Since HMM training relies on a hill-climbing technique, we can often significantly improve a model by training multiple times with different initial values and keeping the best result. In contrast, boosting is a general technique for combining weaker models to yield a stronger model. In this research, we compare boosted HMMs (using AdaBoost) to HMMs trained using multiple random restarts, in the context of the malware detection problem. These techniques are applied to a variety of challenging malware datasets, and we analyze and compare the results in terms of effectiveness and efficiency.
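The boosting idea described above can be sketched with a minimal AdaBoost over threshold classifiers. This is an illustration under stated assumptions, not the authors' implementation: each feature column is assumed to hold the score that one trained model (for example, one HMM) assigns to a sample, labels are +1 and -1, and the names `stump`, `train_adaboost`, and `predict` are hypothetical.

```python
import math


def stump(row, col, threshold, polarity):
    """Weak classifier: vote `polarity` when the score reaches the threshold."""
    return polarity if row[col] >= threshold else -polarity


def train_adaboost(features, labels, rounds=5):
    """Return a list of (alpha, col, threshold, polarity) weak classifiers."""
    n = len(labels)
    weights = [1.0 / n] * n  # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for col in range(len(features[0])):
            for threshold in sorted({row[col] for row in features}):
                for polarity in (1, -1):
                    error = sum(
                        w for w, row, y in zip(weights, features, labels)
                        if stump(row, col, threshold, polarity) != y
                    )
                    if best is None or error < best[0]:
                        best = (error, col, threshold, polarity)
        error, col, threshold, polarity = best
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, col, threshold, polarity))
        # Increase the weight of misclassified samples, then renormalize.
        weights = [
            w * math.exp(-alpha * y * stump(row, col, threshold, polarity))
            for w, row, y in zip(weights, features, labels)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, row):
    """Weighted vote of all weak classifiers in the ensemble."""
    score = sum(a * stump(row, c, t, p) for a, c, t, p in ensemble)
    return 1 if score >= 0 else -1
```

The random-restart alternative would instead keep only the single best-scoring model, so the comparison is between one strong model and a weighted vote of several weaker ones.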
In recent years there has been a dramatic increase in the number of malware attacks that
use encrypted HTTP traffic for propagation and communication. Due to the volume
of legitimate encrypted data, it can be difficult to filter encrypted malicious traffic from the vast background of benign traffic. Antivirus software and firewalls typically do not have access to encryption keys, so they cannot inspect the contents of this traffic, which poses a serious challenge. Hence, detection techniques are needed that do not require decrypting the traffic. In this research, we apply a variety of machine learning techniques to the problem of distinguishing malicious from benign encrypted HTTP traffic. We show that we can obtain high accuracy with practical systems.
Most Android smartphone apps are free; to generate revenue, app developers embed ad libraries so that advertisements are displayed while the app is in use. Billions of dollars are lost annually due to ad fraud on Android devices. In this research, we propose a machine learning based scheme to detect Android adware. We consider both static and dynamic features, and combinations thereof. Specifically, we collect static features from the manifest file, while our dynamic features are derived from network traffic. Using these features, we develop and analyze a tiered approach, where we initially classify Android applications into broad categories (e.g., adware, malware, benign) and then further classify each application into a more specific family. We employ a variety of machine learning techniques, including neural networks, random forests, AdaBoost, and support vector machines.
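The tiered approach described above can be sketched as a two-stage dispatch. The function name `tiered_classify` and the idea of passing classifiers in as plain callables are assumptions for illustration; the abstract does not describe how the two tiers are wired together.

```python
def tiered_classify(sample, coarse, fine_by_category):
    """Classify a sample into a broad category, then into a family.

    `coarse` maps a sample to a broad category (e.g. "adware").
    `fine_by_category` maps each category to a family-level classifier.
    Categories without a second-tier classifier yield a family of None.
    """
    category = coarse(sample)
    fine = fine_by_category.get(category)
    family = fine(sample) if fine is not None else None
    return category, family
```

For example, with a hypothetical `coarse` model that returns "adware" and a second-tier adware model that returns a family name, the function returns both labels, while a sample classified as "benign" would stop after the first tier.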