Software Uniqueness: How and Why

Puneet Mishra and Mark Stamp

Department of Computer Science

San Jose State University

San Jose, California

ABSTRACT: In this paper we consider some of the possible security implications of unique software. We are interested in the case where each instance of a piece of software is functionally equivalent, but the internal structure is as distinct as possible. The potential security benefits of such software include improved resistance to viruses and an increased difficulty of certain types of attacks based on reverse engineering the code. The reverse engineering problem is particularly relevant in the field of digital rights management (DRM) and, more generally, whenever we must hide a secret in software. We discuss secret hiding and other possible security applications for unique code and we briefly consider the use of this type of uniqueness as a software watermark. We then present a partial taxonomy (in summary form) of several types of code transformations and we briefly discuss the problem of automatically generating unique instances of code using these transformations. We also mention several research topics and future work.

Keywords: software uniqueness, digital rights management, DRM, virus protection, digital watermark, key hiding

1. Introduction

In this paper we discuss code transformations that can be applied to arbitrary C [1] or C++ [2] source code to yield distinct instances of the input code. We require that the resulting distinct copies are functionally equivalent to the original code. Our goal is to make instances of the transformed code as unique as possible. If instances of the code are sufficiently distinct, we expect that the work required to reverse engineer N instances will be linear in N.

To illustrate the potential usefulness of unique software, consider the following simple example. Suppose we have software that is used to authenticate a user. Regardless of the authentication method employed, the result is ultimately a 1-bit answer, say, 1 for “success” and 0 for “failure.” An attacker can reverse engineer this authentication software and determine where the software makes this 1-bit comparison. Then by simply setting this comparison bit to “1”, the attacker can gain access regardless of the authentication information supplied.

Now if each instance of this authentication software is identical, then an identical 1-bit attack can be applied to each instance (i.e., set the comparison bit to 1). On the other hand, if each instance of the software were distinct (though functionally equivalent), an attack on one instance of the software would only apply to that particular instance. An attack on another instance of the software would require another reverse engineering effort.

Perhaps after reverse engineering multiple instances of the software the attacker could gain enough insight into the transformation process to develop a general attack that would succeed against any instance of transformed software. However, this is almost certainly a far greater challenge than reverse engineering a single instance of the software. Provided that this is the case, we have increased the work required by an attacker to break an arbitrary instance of the software. Of course, the precise level of difficulty depends on the way that the uniqueness occurs.

The approach being advocated here is to make each instance of a particular piece of software unique. Note that the goal is not to prevent an individual instance of the software from being compromised. Instead, the goal is to have each instance of the software be a distinct reverse engineering problem for an attacker. In order for this approach to succeed, we propose to develop a large collection of elementary transformations and apply these in sufficiently random combinations so that each instance of code is unique with a high probability. Furthermore, the uniqueness must be variable enough so that diagnosing the uniqueness process from a number of reverse engineered instances of the code is extremely difficult. Since we are applying transformations to source code, we must also insure that the transformations will survive an optimizing compiler.

We have selected the C language as the language in which these transformations will be applied. One reason for selecting C is that it is inherently procedural. The use of C is not a significant restriction, since similar transformations could be developed for other programming languages. However, from a security standpoint, it would be preferable to develop these transformations in assembly code [3]. For simplicity and clarity, we have chosen a high-level language. In the future, we intend to extend this work to the assembly language case.

Plagiarism detection tools provide one possible method for testing the robustness of uniqueness transformations. We could apply such a test to several unique instances of a particular piece of code. Another interesting code comparison technique compares programmer’s style and uses this as the basis for identifying programs written by a particular individual [4]. There are many other possible tests for program similarity.

Program equivalence is known to be an NP-hard problem. However, we are claiming that each of our transformations produces the same output for any given input as the parent code would produce. Thus, we only need to verify that the individual transformations result in code that is functionally equivalent to the original code. If this is the case then the program equivalence issue can be avoided.

2. Potential Applications

In this section we discuss some of the possible applications of program uniqueness. The listed applications are not intended to be exhaustive. In fact, we view uniqueness as an extremely general tool that has potential applications in many areas of computer security. (See, for example, [5] for a discussion of similar ideas in the context of operating systems.)

2.1 Key Protection

A good candidate application for uniqueness is software that stores a key or other secret information. In a typical (non-unique) scenario if an attacker manages to extract the key from one instance of the software, the same attack can be used to extract the key from any other instance of the software. However, if each copy of the software derives from distinct source code─which results in significantly different executable code─it would make an attack on any instance of the software nearly as difficult as the original attack.

2.2 Digital Rights Management

Digital Rights Management (DRM) is an important technology being promoted by major software companies such as Microsoft®. Software uniqueness has been touted as a useful ingredient in software-based DRM [6].

Recently, several DRM systems have failed dramatically when exposed to attacks (see [6] for a discussion of failed DRM products from Adobe® and Microsoft®, among others). None of these failed systems employed any uniqueness. Consequently, hackers who were able to reverse engineer a single instance of the software were assured that any attack they developed would apply universally to all instances of that particular product.

If, on the other hand, a DRM system employs software uniqueness, it is highly unlikely that a single reverse engineering effort will be sufficient to break all instances of the software. Consequently, it is unlikely that an attacker who breaks one instance of such DRM software will have broken the entire system. Furthermore, it is likely that the work factor required to break the entire system will be dramatically higher than in the non-unique case. Of course, this same argument holds for many other software systems. But the implications in the DRM field are perhaps more readily apparent. Any DRM system designed to enforce copyright that is prone to systemic failure─as opposed to sporadic losses due to attack─is not likely to be trusted to protect valuable digital content.

As an aside, we mention the Trusted Computing Platform Association (TCPA) [7], which is an effort by several leading IT companies to develop a hardware-based DRM solution. Presently, there is a rancorous debate over the privacy and fair use implications of such technology [8]. If DRM cannot be made more robust in software, it seems likely that the TCPA effort will eventually succeed since many powerful economic forces are pushing the technology; see [6] and [9] for much more information on this and related DRM issues.

2.3 Software Watermarks

Unique code could also serve as a watermark to distinguish versions of the code. In principle, we could create a unique version of a piece of software based on some key. Then if a pirated binary version of the code were discovered, we could read the watermark to determine the source. In our scheme, the uniqueness is inserted at the source code level but we would like to read the watermark from the object code. Consequently, there are issues to be resolved. We are currently actively investigating this possible use of software uniqueness.

Another potentially useful watermarking idea is to insert arbitrary values at certain points in unique instances of code. These unique values could serve as a runtime watermark. For example, at various points we could compute a hash of the watermark values, which would result in a unique identification key for a program. Of course, a similar thing can be done in the non-unique case. But in the unique case the identifier would be more robust, since each instance of the software would require a distinct watermark removal attack.

2.4 Code Obfuscation

Code obfuscation is a technique used to make the reverse engineering of software more difficult [10]. For example, this technique could be used to make it more difficult for competitors to easily gaining an insight into an application. Obfuscation could also be employed in an effort to prevent malicious users from easily understanding or modifying an application. Note, however, that obfuscation does not imply uniqueness, since a non-unique piece of software could be highly obfuscated. On the other hand, our usage of uniqueness is intended to provide a degree of obfuscation and therefore can be viewed as yet another obfuscation technique. Also, when applying uniqueness transformations, it would be reasonable to further obfuscate each unique versions of software by applying obfuscating transformations, such as those discussed in [10].

2.5 Anti-Piracy

The software industry has been battling piracy for many years. It is estimated that in 1998 the U.S. software industry lost over $2.9 billion in sales within the U.S. and $11 billion internationally due to software theft [11]. This number has undoubtedly grown in the last few years making the stakes even higher in the field of anti-piracy technologies.

Tamper resistant software (TRS) [12,13,14] is one idea that has emerged from research in anti-piracy. A TRS system employs a variety of techniques that cause an application to crash or otherwise misbehave if it detects that someone is trying to analyze the code while it is executing. Uniqueness can potentially improve the robustness of such systems by making attacks on the TRS system itself more difficult to duplicate.

2.6 Virus Protection

A computer virus─analogous to a biological virus─is a malicious program that creates havoc on the host computer and replicates itself in order to attach itself to other programs. The goal of the virus is to infect other hosts.

A computer virus can display polymorphism [15], i.e., it can evolve as it spreads. Typically, anti-virus software scans for viruses by using an elementary pattern matching approach. Polymorphic viruses attempt to avoid detection by changing their pattern when they replicate. Thus, as in a natural biological system, the “genetic diversity” of the virus population increases. Software uniqueness offers the potential to employ genetic diversity to defend against viruses. A virus will usually attach itself to a specific position in the code it is attacking. If the software under attack has no genetic diversity (as is nearly universally the case today) then the virus will be able to attack every instance of the software in precisely the same manner. However, if the software under attack displays genetic diversity (i.e., uniqueness), then the virus cannot simply attach itself to a specific point in the software since its desired entry point will appear at a different point─and in a different form─in each instance of the software. In this situation, the virus has a much more difficult task to infect a particular piece of software. It is likely that a large fraction of instances of unique software will have a natural immunity to any given virus. The analogy to biological systems is substantial and bears further investigation.

3. TRANSFORMATIONS

In this section we present a brief list of transformations that can be applied to C code to obtain unique versions of code. These transformations are merely a representative sample of the possible transformations and are not intended to provide an exhaustive list. The transformations have been classified into categories based on their functional similarity and complexity. For each of these categories we list several examples of representative transformations.

I. Code Reorganization

Equivalent mathematical transformations: Many mathematical functions can be easily transformed into another form using some elementary operations. This technique is quite relevant for applications involving cryptography, for example, since such applications typically contain many mathematical functions.
Inline a function or vice-versa
Add goto labels in place of structured code
Partition variables into two or more variables
Add redundant code
Add dummy arguments to function calls
Reorder the code

II. Memory Based Transformations

Allocate more memory than required: The following code snippets are equivalent.
Change data types

III. COMPILER BASED

Compilers are a critical element in the implementation of source code transformations. Many potential source code transformations would be removed in the compilation process. Such transformations would have no effect on the resulting binary code. This would, of course, defeat the purpose of the code transformations. Ideally, we would write our own compiler that only performs optimizations that do not affect our desired transformations. Such a compiler would be the logical location to insert the transformations as well.

IV. FLOW AND STRUCTURE

Change algorithms and flow: The programmer determines the algorithms and flow of the code. If we can alter algorithms and flow control structures in the code we can achieve a level of uniqueness. This process requires a careful mapping of the flow and parameters from one algorithmic technique to another. Consequently, this approach might be relatively difficult to automate.
Alter data structures

V. Miscellaneous

For example, insertion of operating system calls at various points in the code could be used to add uniqueness. The negative aspect of this technique is that it would make the resulting code platform dependent.

4. GENERATING UNIQUE CODE

In this section we discuss the issues that must be confronted when trying to automatically generate distinct instances of code. A complete implementation of just the limited set of code transformations discussed above is a non-trivial task. The simpler of these transformations (for example, the insertion of dummy code) do provide us with uniqueness. But is not clear which transformations (or combination of transformations) are most likely to cause the greatest difficulty for a motivated attacker. We discuss this further in the next section.

We are currently developing a tool that will implement a substantial subset of the transformations listed above. Our tool is essentially a parser [16] that looks for particular patterns in the code and inserts a transformation when an appropriate location is found.

4. CONCLUSION

The ideas of software uniqueness discussed in this paper appear to have significant potential utility in many applications, including key hiding software, DRM systems and software watermarking. However, it appears to be difficult to quantify the strength of this approach based solely on theoretical considerations. Consequently, the automated implementation that we are developing will be valuable in making such an assessment. When this tool is completed, we expect that we will be able to produce substantial empirical data to further bolster our claims of the security benefits of unique software. Various methods can then be employed to measure the variability in the resulting unique code and realistic testing can be conducted to determine the difficulty of attacking systems based on such code.

References

Kernighan, B., D.M. Ritchie, The C Programming Language, Second Edition; Prentice Hall
B. Stroustrup: The C++ Programming Language, 3rd edition; Addison Wesley Longman, Reading, MA. 1997. ISBN 0-201-88954-4.
Singh, A., W. Triebel, The 8086 Assembly Language, Prentice Hall
Ivan Krsul Eugene H. Spafford , Authorship Analysis: Identifying The Author of a Program The COAST Project Department of Computer Sciences Purdue University West Lafayette, IN 47907–1398
Cohen, F.B., Operating System Protection Through Program Evolution at http://www.all.net/books/IP/evolve.html
Stamp, M., Digital Rights Management: The Technology Behind the Hype, to appear in Journal of Electronic Commerce Research
Anderson, R.J., TCPA / Palladium Frequently Asked Questions, Version 1.0; at http://www.cl.cam.ac.uk/~rja14/tcpa-faq.html
Infoanarchy, Copy Prevention, at http://www.infoanarchy.org/?op=search;topic=CopyPrevention
Stamp, M., Risks of digital rights management, Inside Risks 147, Communications of the ACM, Vol. 45, No. 9, September 2002, p. 120 http://www.csl.sri.com/users/neumann/insiderisks.html#147
Collberg, C., C. Thomborson, D. Low, A Taxonomy of Obfuscating Transformations, Technical Report #148 Department of Computer Science, The University of Auckland
Microsoft Research, Cryptography, Piracy http://research.microsoft.com/crypto/piracy.aspx
AccessTicket Systems Inc., What is tamper resistant coding technology” at http://www.accessticket.com/eng-trc5.html
Cloakware, Inc., Cloakware Makes Software Tamper-Resistant, at http://www.hhcmag.com/backissues/550.html
An Approach to the Objective and Quantitative Evaluation of Tamper-Resistant Software at http://www.ecip.tohoku.ac.jp/~shizuya/abstracts/GMMS 00.html
Dark Angel, Advanced Polymorphism Primer; at http://vx.netlux.org/lib/vda01.html
Aho, A., J. Ulman and R. Sethi, Compilers Principles, Technique and Tools, Addison Wesley.

Biographies

Mark Stamp has spent well over a decade working in security. He can neither confirm nor deny that he spent seven years as a cryptanalyst at the National Security Agency. However, he can confirm that he recently spent two years developing a DRM product for MediaSnap, Inc., a small Silicon Valley startup company. Dr. Stamp’s current research interests include security, networks, algorithms and DRM.

Puneet Mishra received his Bachelor of Engineering, Computer Technology degree from India and attended the National Center for Software Technology, Mumbai, India. His primary area of interest is in the field of computer security.