Software
Uniqueness: How and Why
Puneet Mishra and Mark Stamp
Department of Computer Science
San Jose State University
San Jose, California
ABSTRACT: In this paper we consider some of the possible security implications of unique software. We are interested in the case where each instance of a piece of software is functionally equivalent, but the internal structure is as distinct as possible. The potential security benefits of such software include improved resistance to viruses and an increased difficulty of certain types of attacks based on reverse engineering the code. The reverse engineering problem is particularly relevant in the field of digital rights management (DRM) and, more generally, whenever we must hide a secret in software. We discuss secret hiding and other possible security applications for unique code and we briefly consider the use of this type of uniqueness as a software watermark. We then present a partial taxonomy (in summary form) of several types of code transformations and we briefly discuss the problem of automatically generating unique instances of code using these transformations. We also mention several research topics and future work.
Keywords: software uniqueness, digital rights management, DRM, virus protection, digital watermark, key hiding
1. Introduction
In this paper we discuss code transformations that can be applied to arbitrary C [1] or C++ [2] source code to yield distinct instances of the input code. We require that the resulting distinct copies are functionally equivalent to the original code. Our goal is to make instances of the transformed code as unique as possible. If instances of the code are sufficiently distinct, we expect that the work required to reverse engineer N instances will be linear in N.
To illustrate the potential usefulness of unique software, consider the following simple example. Suppose we have software that is used to authenticate a user. Regardless of the authentication method employed, the result is ultimately a 1-bit answer, say, 1 for “success” and 0 for “failure.” An attacker can reverse engineer this authentication software and determine where the software makes this 1-bit comparison. Then by simply setting this comparison bit to “1”, the attacker can gain access regardless of the authentication information supplied.
Now if each instance of this authentication software is identical, then an identical 1-bit attack can be applied to each instance (i.e., set the comparison bit to 1). On the other hand, if each instance of the software were distinct (though functionally equivalent), an attack on one instance of the software would only apply to that particular instance. An attack on another instance of the software would require another reverse engineering effort.
Perhaps after reverse engineering multiple instances of the software the attacker could gain enough insight into the transformation process to develop a general attack that would succeed against any instance of transformed software. However, this is almost certainly a far greater challenge than reverse engineering a single instance of the software. Provided that this is the case, we have increased the work required by an attacker to break an arbitrary instance of the software. Of course, the precise level of difficulty depends on the way that the uniqueness occurs.
The approach being advocated here is to make each instance of a particular piece of software unique. Note that the goal is not to prevent an individual instance of the software from being compromised. Instead, the goal is to have each instance of the software be a distinct reverse engineering problem for an attacker. In order for this approach to succeed, we propose to develop a large collection of elementary transformations and apply these in sufficiently random combinations so that each instance of code is unique with a high probability. Furthermore, the uniqueness must be variable enough so that diagnosing the uniqueness process from a number of reverse engineered instances of the code is extremely difficult. Since we are applying transformations to source code, we must also insure that the transformations will survive an optimizing compiler.
We have selected
the C language as the language in which these transformations will be
applied. One reason for selecting C is that it is inherently procedural.
The use of C is not a significant restriction, since similar transformations
could be developed for other programming languages. However, from a
security standpoint, it would be preferable to develop these transformations
in assembly code [3]. For simplicity and clarity, we have chosen a high-level
language. In the future, we intend to extend this work to the assembly
language case.
Plagiarism
detection tools provide one possible method for testing the robustness
of uniqueness transformations. We could apply such a test to several
unique instances of a particular piece of code. Another interesting
code comparison technique compares programmer’s style and uses this
as the basis for identifying programs written by a particular individual
[4]. There are many other possible tests for program similarity.
Program equivalence
is known to be an NP-hard problem. However, we are claiming that each
of our transformations produces the same output for any given input
as the parent code would produce. Thus, we only need to verify that
the individual transformations result in code that is functionally equivalent
to the original code. If this is the case then the program equivalence
issue can be avoided.
2. Potential Applications
In this section
we discuss some of the possible applications of program uniqueness.
The listed applications are not intended to be exhaustive. In fact,
we view uniqueness as an extremely general tool that has potential applications
in many areas of computer security. (See, for example, [5] for a discussion
of similar ideas in the context of operating systems.)
2.1 Key Protection
A good candidate
application for uniqueness is software that stores a key or other secret
information. In a typical (non-unique) scenario if an attacker manages
to extract the key from one instance of the software, the same attack
can be used to extract the key from any other instance of the software.
However, if each copy of the software derives from distinct source code─which
results in significantly different executable code─it would make an attack on any instance
of the software nearly as difficult as the original attack.
2.2 Digital Rights Management
Digital Rights
Management (DRM) is an important technology being promoted by major
software companies such as Microsoft®. Software uniqueness has been touted
as a useful ingredient in software-based DRM [6].
Recently, several
DRM systems have failed dramatically when exposed to attacks (see [6]
for a discussion of failed DRM products from Adobe® and Microsoft®, among others). None of these failed
systems employed any uniqueness. Consequently, hackers who were able
to reverse engineer a single instance of the software were assured that
any attack they developed would apply universally to all instances of
that particular product.
If, on the
other hand, a DRM system employs software uniqueness, it is highly unlikely
that a single reverse engineering effort will be sufficient to break
all instances of the software. Consequently, it is unlikely that an
attacker who breaks one instance of such DRM software will have broken
the entire system. Furthermore, it is likely that the work factor required
to break the entire system will be dramatically higher than in the non-unique
case. Of course, this same argument holds for many other software systems.
But the implications in the DRM field are perhaps more readily apparent.
Any DRM system designed to enforce copyright that is prone to systemic
failure─as
opposed to sporadic losses due to attack─is not likely to be trusted to protect
valuable digital content.
As an aside,
we mention the Trusted Computing Platform Association (TCPA) [7], which
is an effort by several leading IT companies to develop a hardware-based
DRM solution. Presently, there is a rancorous debate over the privacy
and fair use implications of such technology [8]. If DRM cannot be made
more robust in software, it seems likely that the TCPA effort will eventually
succeed since many powerful economic forces are pushing the technology;
see [6] and [9] for much more information on this and related DRM issues.
2.3 Software Watermarks
Unique code
could also serve as a watermark to distinguish versions of the code.
In principle, we could create a unique version of a piece of software
based on some key. Then if a pirated binary version of the code were
discovered, we could read the watermark to determine the source. In
our scheme, the uniqueness is inserted at the source code level but
we would like to read the watermark from the object code. Consequently,
there are issues to be resolved. We are currently actively investigating
this possible use of software uniqueness.
Another potentially
useful watermarking idea is to insert arbitrary values at certain points
in unique instances of code. These unique values could serve as a runtime
watermark. For example, at various points we could compute a hash of
the watermark values, which would result in a unique identification
key for a program. Of course, a similar thing can be done in the non-unique
case. But in the unique case the identifier would be more robust, since
each instance of the software would require a distinct watermark removal
attack.
2.4 Code Obfuscation
Code obfuscation
is a technique used to make the reverse engineering of software more
difficult [10]. For example, this technique could be used to make it
more difficult for competitors to easily gaining an insight into an
application. Obfuscation could also be employed in an effort to prevent
malicious users from easily understanding or modifying an application.
Note, however, that obfuscation does not imply uniqueness, since a non-unique
piece of software could be highly obfuscated. On the other hand, our
usage of uniqueness is intended to provide a degree of obfuscation and
therefore can be viewed as yet another obfuscation technique. Also,
when applying uniqueness transformations, it would be reasonable to
further obfuscate each unique versions of software by applying obfuscating
transformations, such as those discussed in [10].
2.5 Anti-Piracy
The software
industry has been battling piracy for many years. It is estimated that
in 1998 the U.S. software industry lost over $2.9 billion in sales within
the U.S. and $11 billion internationally due to software theft [11].
This number has undoubtedly grown in the last few years making the stakes
even higher in the field of anti-piracy technologies.
Tamper resistant
software (TRS) [12,13,14] is one idea that has emerged from research
in anti-piracy. A TRS system employs a variety of techniques that cause
an application to crash or otherwise misbehave if it detects that someone
is trying to analyze the code while it is executing. Uniqueness can
potentially improve the robustness of such systems by making attacks
on the TRS system itself more difficult to duplicate.
2.6 Virus Protection
A computer
virus─analogous
to a biological virus─is a malicious program that creates
havoc on the host computer and replicates itself in order to attach
itself to other programs. The goal of the virus is to infect other hosts.
A computer
virus can display polymorphism [15], i.e., it can evolve as it spreads.
Typically, anti-virus software scans for viruses by using an elementary
pattern matching approach. Polymorphic viruses attempt to avoid detection
by changing their pattern when they replicate. Thus, as in a natural
biological system, the “genetic diversity” of the virus population
increases. Software uniqueness offers the potential to employ genetic
diversity to defend against viruses. A virus will usually attach itself
to a specific position in the code it is attacking. If the software
under attack has no genetic diversity (as is nearly universally the
case today) then the virus will be able to attack every instance of
the software in precisely the same manner. However, if the software
under attack displays genetic diversity (i.e., uniqueness), then the
virus cannot simply attach itself to a specific point in the software
since its desired entry point will appear at a different point─and
in a different form─in each instance of the software. In
this situation, the virus has a much more difficult task to infect a
particular piece of software. It is likely that a large fraction of
instances of unique software will have a natural immunity to any given
virus. The analogy to biological systems is substantial and bears further
investigation.
3. TRANSFORMATIONS
In this section we present a brief list of transformations that can be applied to C code to obtain unique versions of code. These transformations are merely a representative sample of the possible transformations and are not intended to provide an exhaustive list. The transformations have been classified into categories based on their functional similarity and complexity. For each of these categories we list several examples of representative transformations.
III. COMPILER BASED
Compilers are
a critical element in the implementation of source code transformations.
Many potential source code transformations would be removed in the compilation
process. Such transformations would have no effect on the resulting
binary code. This would, of course, defeat the purpose of the code transformations.
Ideally, we would write our own compiler that only performs optimizations
that do not affect our desired transformations. Such a compiler would
be the logical location to insert the transformations as well.
IV. FLOW AND STRUCTURE
V. Miscellaneous
For example, insertion of operating
system calls at various points in the code could be used to add uniqueness.
The negative aspect of this technique is that it would make the resulting
code platform dependent.
4. GENERATING UNIQUE CODE
In this section we discuss
the issues that must be confronted when trying to automatically generate
distinct instances of code. A complete implementation of just the limited
set of code transformations discussed above is a non-trivial task. The
simpler of these transformations (for example, the insertion of dummy
code) do provide us with uniqueness. But is not clear which transformations
(or combination of transformations) are most likely to cause the greatest
difficulty for a motivated attacker. We discuss this further in the
next section.
We are currently developing a tool that will implement a substantial subset of the transformations listed above. Our tool is essentially a parser [16] that looks for particular patterns in the code and inserts a transformation when an appropriate location is found.
4. CONCLUSION
The ideas of software uniqueness discussed in this paper appear to have significant potential utility in many applications, including key hiding software, DRM systems and software watermarking. However, it appears to be difficult to quantify the strength of this approach based solely on theoretical considerations. Consequently, the automated implementation that we are developing will be valuable in making such an assessment. When this tool is completed, we expect that we will be able to produce substantial empirical data to further bolster our claims of the security benefits of unique software. Various methods can then be employed to measure the variability in the resulting unique code and realistic testing can be conducted to determine the difficulty of attacking systems based on such code.
Mark Stamp
has spent well over a decade working in security. He can neither confirm
nor deny that he spent seven years as a cryptanalyst at the National
Security Agency. However, he can confirm that he recently spent two
years developing a DRM product for MediaSnap, Inc., a small Silicon
Valley startup company. Dr. Stamp’s current research interests include
security, networks, algorithms and DRM.
Puneet Mishra received his Bachelor of Engineering, Computer Technology degree from India and attended the National Center for Software Technology, Mumbai, India. His primary area of interest is in the field of computer security.