Samanvitha Basole | Multifamily Malware Models
Mayuri Wadkar | Measuring Malware Evolution using Support Vector Machines
Nivedhitha Ramarathnam Krishna | Classifying Classic Ciphers Using Machine Learning
Ashraf Saber | Intrusion Detection and CAN Vehicle Networks
Akriti Sethi | Classifying Malware Models
Anukriti Sinha | Emulation vs Instrumentation for Android Malware Detection
Parth Jain | Machine Learning vs Deep Learning for Malware Detection
Tazmina Sharmin | Deep Learning for Image Spam Detection
Preethi Sundaravaradhan | Smartphone Gesture-Based Authentication
When training a machine learning model, there is likely to be
a tradeoff between the accuracy of the model and the generality
of the dataset. For example, previous research has shown that
if we train a model to detect one specific malware family,
we typically obtain stronger results than when
we train a single model on multiple diverse families.
However, during the detection phase,
it would be more efficient to use a single
model to detect multiple families, rather than having
to score each sample against multiple models. In this research,
we conduct extensive experiments to quantify the relationship
between the generality of the training dataset and the accuracy
of various machine learning models, within the context
of the malware detection problem.
Malware is software that is designed to do harm to computer
systems. As with other software, malware typically evolves
over time as developers add new features and fix bugs.
Thus, malware samples from
the same family from different time periods can exhibit
significantly different behavior. For example,
differences between malware
samples within a family can originate from various code
modifications designed to evade signature-based detection,
or changes that are made to alter the functionality of the
malware. In this research, we apply feature ranking based
on linear support vector machine (SVM) weights to identify,
quantify, and track changes within malware families over time.
We analyze numerous malware families over extended periods
of time. We show that it is possible to detect evolutionary
changes within malware families using quantifiable and
automated machine learning techniques.
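As a rough illustration of feature ranking by linear SVM weights, the following minimal sketch trains a linear SVM with a simple Pegasos-style subgradient method and sorts the features by the magnitude of their weights. The feature names, the toy data, and the training routine are all assumptions made for illustration; they are not the features, data, or SVM implementation used in this research.

```python
def train_linear_svm(X, y, lam=0.01, epochs=50):
    """Train a linear SVM by subgradient descent on the regularized hinge loss.

    X is a list of feature vectors and y a list of labels in {+1, -1}.
    """
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # Shrink the weights (regularization term).
            w = [(1.0 - eta * lam) * wj for wj in w]
            # Hinge loss is active only when the margin is violated.
            if margin < 1.0:
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b


def rank_features(weights, names):
    """Return (name, weight) pairs sorted by decreasing |weight|."""
    order = sorted(range(len(weights)), key=lambda j: abs(weights[j]), reverse=True)
    return [(names[j], weights[j]) for j in order]


# Hypothetical standardized features: column 0 separates the classes
# strongly, column 1 is uncorrelated with the label, column 2 is weak.
names = ["call_count", "nop_count", "xor_count"]
X = [
    [1.0, 0.1, 0.3],
    [1.5, -0.1, 0.3],
    [2.0, 0.1, 0.3],
    [-1.0, 0.1, -0.3],
    [-1.5, -0.1, -0.3],
    [-2.0, 0.1, -0.3],
]
y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X, y)
ranking = rank_features(w, names)
for name, weight in ranking:
    print(f"{name:12s} {weight:+.4f}")
```

Repeating such a ranking on samples from successive time windows and comparing the resulting lists is one plausible way to surface changes within a family.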
We consider the problem of identifying the classic cipher that was
used to generate a given ciphertext message. We assume that the
plaintext is English and we restrict our attention to ciphertext
consisting only of alphabetic characters. Among the classic ciphers
considered are the simple substitution, Vigenère,
Playfair, and column transposition ciphers. We approach the
classification problem in two ways. The first method uses
support vector machines (SVM) trained directly on ciphertext to
classify the ciphers. In the second approach, we train hidden Markov
models (HMM) on each ciphertext message, then use these trained HMMs
as features for classifiers. Under this second approach, we compare
two classification strategies, namely, convolutional neural networks
(CNN) and SVMs. For the CNN classifier, we convert the trained HMMs
into images. Extensive experimental results are provided for each of
the classification techniques under consideration.
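To make the setup concrete, the sketch below generates labeled ciphertext for two of the ciphers named above, simple substitution and Vigenère, from uppercase alphabetic plaintext. The keys, the sample plaintext, and the integer encoding of the ciphertext are hypothetical choices for illustration, not the data or feature representation used in this research.

```python
import string

ALPHABET = string.ascii_uppercase


def simple_substitution(plaintext, key):
    """Encrypt with a simple substitution; key is a permutation of A-Z."""
    return plaintext.translate(str.maketrans(ALPHABET, key))


def vigenere(plaintext, keyword):
    """Encrypt with the Vigenere cipher using an uppercase keyword."""
    out = []
    for i, ch in enumerate(plaintext):
        shift = ord(keyword[i % len(keyword)]) - ord("A")
        out.append(ALPHABET[(ord(ch) - ord("A") + shift) % 26])
    return "".join(out)


def encode(ciphertext):
    """Map letters to integers 0..25, one possible numeric input for a classifier."""
    return [ord(ch) - ord("A") for ch in ciphertext]


# Hypothetical plaintext and keys, chosen only for this example.
plaintext = "ATTACKATDAWN"
dataset = [
    (simple_substitution(plaintext, "QWERTYUIOPASDFGHJKLZXCVBNM"), "substitution"),
    (vigenere(plaintext, "LEMON"), "vigenere"),
]
for ciphertext, label in dataset:
    print(label, ciphertext, encode(ciphertext)[:4])
```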
In this paper, we consider intrusion detection systems (IDS) in the
context of a controller area network (CAN), which is also known as
the CAN bus. We provide a discussion of various IDS topics,
including masquerade detection, and we include a selective survey of
previous research involving IDS on the CAN bus. We also discuss
background topics and relevant practical issues, such as data
collection on the CAN bus. Finally, we present experimental results
where we have applied a variety of machine learning techniques to
both real and simulated CAN data. Our experiments show that
machine learning models can be used to effectively determine
the status of a vehicle, as well as to detect masquerading
behavior based on CAN data.
Automatically classifying similar malware families is a challenging
problem. In this research, we attempt to classify malware families
by applying machine learning to trained machine learning models.
Specifically, we train hidden Markov models (HMM) for each malware
family in our dataset. The resulting models are then compared in two
ways. First, we treat the HMM matrices as images and experiment with
convolutional neural networks (CNN) for image classification.
Second, we apply support vector machines (SVM) to classify the HMMs.
We analyze the results and discuss the relative advantages and
disadvantages of these approaches.
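The two ways of comparing models can be sketched as follows. An HMM is commonly described by an initial state distribution, a state transition matrix, and an observation probability matrix, each row-stochastic. In this sketch, which assumes hypothetical names (`pi`, `A`, `B`) and toy values, the parameters are flattened into one feature vector for an SVM, and the observation matrix is scaled to 8-bit grayscale pixels for a CNN. The exact conversion used in this research may differ.

```python
def hmm_to_feature_vector(pi, A, B):
    """Concatenate all HMM parameters into a single flat feature vector."""
    vec = list(pi)
    for row in A:
        vec.extend(row)
    for row in B:
        vec.extend(row)
    return vec


def hmm_to_image(B):
    """Scale a probability matrix to grayscale pixel values in 0..255."""
    return [[int(round(255 * p)) for p in row] for row in B]


# Toy HMM with 2 hidden states and 3 observation symbols (values are made up).
pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.8, 0.1, 0.1],
     [0.2, 0.2, 0.6]]

features = hmm_to_feature_vector(pi, A, B)
image = hmm_to_image(B)
print(len(features), image)
```

Note that the hidden states of independently trained HMMs can come out in different orders, so some consistent ordering of rows would likely be needed before comparing models; this sketch ignores that step.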
On resource-constrained devices, malware detection is typically
based on offline analysis using emulation. Previous work has
claimed that such emulation fails for a significant percentage
of Android malware, because well-designed malware detects that the
code is being emulated. An alternative to emulation is malware
analysis based on code that is executed on an actual Android
device. In this research, we collect features from a corpus of
Android malware using both emulation and on-phone instrumentation.
We train machine learning models based on emulated features and also
train models based on features collected via instrumentation. We
find that the differences between these two cases are negligible.
We conclude that it is uncommon for Android malware to implement
evasive strategies based on emulation detection.
It is often claimed that an advantage of deep learning is
that such models can continue to learn as more data
is made available for training. In contrast,
for other forms of machine learning it is claimed that the
models "saturate," in the sense that no additional learning can
occur beyond some point, regardless of the amount of additional
data or computing power that is brought to bear on the problem.
In this research, we compare the accuracy of deep learning
to other forms of machine learning for malware
detection, as a function of the training dataset size. We
experiment with a wide variety of hyperparameters for our
deep learning models, and we compare these models to
results obtained using k-nearest neighbors.
In our experiments, we use a subset of a large and
diverse malware dataset that was collected as part of
a recent research project.
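The idea of measuring accuracy as a function of training set size can be sketched with a small k-nearest neighbors classifier. The synthetic two-dimensional data, the class centers, the value of k, and the training set sizes below are assumptions chosen for illustration and have no connection to the malware dataset used in this research.

```python
import random
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k nearest training samples."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fraction of test samples classified correctly."""
    correct = sum(knn_predict(train, x, k) == label for x, label in test)
    return correct / len(test)


def learning_curve(train, test, sizes, k=3):
    """Accuracy on the test set when training on the first n samples, for each n."""
    return [(n, accuracy(train[:n], test, k)) for n in sizes]


def make_samples(rng, n):
    """Synthetic two-class data: label 0 centered at 0.0, label 1 at 1.5."""
    samples = []
    for i in range(n):
        label = i % 2
        center = 0.0 if label == 0 else 1.5
        samples.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return samples


rng = random.Random(0)  # fixed seed so the run is repeatable
train_set = make_samples(rng, 200)
test_set = make_samples(rng, 100)
curve = learning_curve(train_set, test_set, sizes=[10, 50, 100, 200])
for n, acc in curve:
    print(f"{n:4d} training samples -> accuracy {acc:.2f}")
```

Running the same loop for a deep learning model and plotting both curves against the training set size would show whether either model stops improving as more data is added.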
Spam can be defined as unsolicited bulk email. In an
effort to evade text-based spam filters, spammers can
embed their spam text in an image, which is referred to as
image spam. In this research, we consider
the problem of image spam detection, based on image analysis.
We apply various machine learning and deep learning techniques
to real-world image spam datasets, and to a challenge
image spam-like dataset. We obtain results comparable to
previous work for the real-world datasets, while our
deep learning approach yields the best results to date
for the challenge dataset.
In this research, we consider the problem of authentication on a
smartphone based on gestures, that is, movements of the phone.
We collect accelerometer data from a number of subjects and
analyze this data using a variety of machine learning techniques,
including support vector machines (SVM) and convolutional neural
networks (CNN). We determine both the fraud rate
(i.e., the false accept rate) and the insult rate
(i.e., the false reject rate) in each case.
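For reference, the two error rates can be computed from classifier scores as in the minimal sketch below. The scores, the labels, the threshold, and the convention that a higher score means "more likely the genuine user" are assumptions made for this example.

```python
def fraud_and_insult_rates(scores, is_genuine, threshold):
    """Return (fraud rate, insult rate) at a given acceptance threshold.

    An attempt is accepted when its score is at least the threshold.
    Fraud rate: fraction of impostor attempts that are accepted.
    Insult rate: fraction of genuine attempts that are rejected.
    """
    genuine = [s for s, g in zip(scores, is_genuine) if g]
    impostor = [s for s, g in zip(scores, is_genuine) if not g]
    fraud_rate = sum(s >= threshold for s in impostor) / len(impostor)
    insult_rate = sum(s < threshold for s in genuine) / len(genuine)
    return fraud_rate, insult_rate


# Made-up scores: the first four attempts are genuine, the last four are impostors.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.6, 0.1]
is_genuine = [True, True, True, True, False, False, False, False]

fraud, insult = fraud_and_insult_rates(scores, is_genuine, threshold=0.5)
print(f"fraud rate = {fraud:.2f}, insult rate = {insult:.2f}")
```

Raising the threshold lowers the fraud rate at the cost of a higher insult rate, so the two are usually reported together.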