Java Security Research - Language Based Security

SPADE: Security and Privacy Aware Development Environment

Many well-known security attacks on software systems are the result of design and/or coding errors. Privacy may be compromised through the inadvertent disclosure of sensitive information. Often security or privacy is compromised because of the complexity of the programming models and their limitations in common development environments, particularly C/C++ and assembly language. The result is vulnerability to buffer overflows, stack overflows, C string formatting flaws, failures to perform authorization tests on sensitive operations, failures to validate user input, etc. The enforcement of type safety in programming languages such as Java eliminates some of these attacks, but not all. In fact, there are automated code generation tools, including compilers, that generate insecure code.

Many common security and privacy flaws can be discovered through static analysis of software artifacts, which include source code, object code, deployment descriptors, and other metadata that describe the program and its intended operation or security characteristics. The past few years have seen an increase in the number of tools and services that scan source code for likely vulnerabilities. The level of user effort needed to use these tools and services varies considerably. The biggest near-term challenge is improving both the accuracy and the usability of these tools, each of which contributes to programmer productivity.

The State of Static Analysis

Static analysis of software technology has been researched for decades. The effectiveness of static analysis has often been a trade-off between ease-of-use and the completeness of the analysis. For example, typical compiler analyses require only source code. But to find many of the security flaws, modelling information is also required. Our experience is that if the quantity of modelling information to be manually entered by the user is too great, then the tools will not be used. Thus, we examine both the current usage model and analysis approaches.

Usage Model

Performing an analysis requires a model for the analysis tool, which means more than just the source or object code. The more comprehensive the tool, the more elaborate the model descriptions required to enable the analysis. This proves to be a significant hurdle for the usability of the tool. Part of the struggle is to automate as many parts of the model construction as possible. A recent example of this trend is the use of a compiler's intermediate results to qualify data types used by the CQUAL tool. Other information may be extracted from the operational environment, such as middleware configuration and deployment files. The environments being analyzed for security flaws and risks are complex and have interactions among multiple components. The trend is to use a combination of tools for constructing the models and performing analyses, since the combination is more likely to be successful than any single standalone tool.

An analysis results in a set of positives, or potential security flaws, being identified. Because static analysis is conservative, some of these may be false positives: reported flaws that do not actually exist, but that the analysis could not prove to be safe. An overly high false positive rate makes static analysis tools unusable, since manual review of the results requires too much time and reduces the user's confidence in the correctness of the tool. For some reported cases, it is too difficult for the user to determine manually whether the positive is false. Much of the current work aims at reducing the false positive rate through improvements in algorithms, combinations of tools, and improvements in model generation. Reducing the false positive rate is essential for improving usability and encouraging widespread use of security analysis tools.

Analysis Approaches

There are a variety of static analysis techniques, though our project is currently focusing on the use of control flow and data flow analyses. Typically, both intra- and inter-procedural analyses are used for performance optimization (e.g., compilers) as well as software specification and verification. We use both intra- and inter-procedural analysis in our projects.

The traditional "closed-world" analysis assumes that there is a single or limited number of program entry points, and that all of the code is available for the analysis. Conversely, "open-world" analysis assumes that only software fragments (e.g., libraries, software components) are available for analysis. For security analysis, we believe that being able to perform open-world analyses is essential.

Current Activities in IBM Research

IBM Research has a long history in programming languages and analysis. More recently, within the Network Security and Privacy department, we have been developing tools to enable analysis of large middleware and applications, the Java runtime itself, parts of the Linux kernel, as well as other large software products.

The initial focus for our static analysis work was Java code. Generally speaking, analysis of Java code has been far more scalable than previous research on C/C++ analysis. The basic static analysis framework has also been used for a variety of projects, including mutability analysis to determine when and where data values are reachable and are / can be modified. This is essential when protecting information from inadvertent modification or disclosure. For example, researchers at Princeton University reported that an early version of the HotJava browser inadvertently would disclose the cryptographic "private key ring" in the browser. This would allow a downloaded Applet to retrieve the private key used to identify a person, digitally sign documents, etc. A static analysis can discover that a path and data flow exist that allow the disclosure of the private key ring without an appropriate authorization test.
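The kind of flaw a mutability analysis looks for can be illustrated with a small sketch. The classes below are hypothetical and are not the HotJava code; the boolean `callerAuthorized` parameter is only a stand-in for a real authorization test. The first class hands out its internal array, so a data flow exists from the private key to any caller; the second performs a check and returns a defensive copy.

```java
// Hypothetical key holders; all names here are illustrative assumptions.
final class LeakyKeyRing {
    private final byte[] privateKey;

    LeakyKeyRing(byte[] key) {
        this.privateKey = key.clone();
    }

    // Flaw: returns the internal array itself, with no authorization test.
    // Any caller can read the key and even overwrite it.
    byte[] getKey() {
        return privateKey;
    }
}

final class GuardedKeyRing {
    private final byte[] privateKey;

    GuardedKeyRing(byte[] key) {
        this.privateKey = key.clone();
    }

    // An authorization test precedes access (the flag is a stand-in for a
    // real check), and a defensive copy prevents modification of the key.
    byte[] getKey(boolean callerAuthorized) {
        if (!callerAuthorized) {
            throw new SecurityException("caller may not read the private key");
        }
        return privateKey.clone();
    }
}

final class KeyRingDemo {
    public static void main(String[] args) {
        LeakyKeyRing leaky = new LeakyKeyRing(new byte[] {1, 2, 3});
        leaky.getKey()[0] = 99;                       // caller mutates internal state
        System.out.println(leaky.getKey()[0]);        // prints 99

        GuardedKeyRing guarded = new GuardedKeyRing(new byte[] {1, 2, 3});
        guarded.getKey(true)[0] = 99;                 // only the copy changes
        System.out.println(guarded.getKey(true)[0]);  // prints 1
    }
}
```

A mutability analysis would report that `privateKey` in `LeakyKeyRing` is reachable and modifiable from outside the class, while in `GuardedKeyRing` it is not.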

Similarly, control flow analyses can discover paths that lead to security-sensitive "native" method access (e.g., non-Java C/C++ library code) without appropriate authorization tests (complete mediation). A toolkit called "Back Orifice" (a bad pun on MS' Back Office application suite) was written to exploit a convoluted path through Netscape's network libraries that allowed a rogue Applet to convert a Web browser into a network file server. Even with a manual inspection of the exploit and the Netscape source code, it was not obvious how the exploit succeeded.
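A complete mediation check of this kind can be sketched as a reachability search over a call graph. The sketch below is a simplification under stated assumptions, not the project's actual algorithm: the call graph is given as a plain map, method names are hypothetical, and a method listed in `checked` is assumed to perform its authorization test before calling anything else. It requires Java 9 or later for `List.of` and `Map.of`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class MediationCheck {
    // Returns a call path from entry to sink that never passes through a
    // method in `checked`, or an empty list if every path is mediated.
    static List<String> unmediatedPath(Map<String, List<String>> callGraph,
                                       String entry, String sink,
                                       Set<String> checked) {
        Deque<List<String>> work = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        work.push(List.of(entry));
        while (!work.isEmpty()) {
            List<String> path = work.pop();
            String method = path.get(path.size() - 1);
            if (checked.contains(method)) continue;  // callees are guarded
            if (method.equals(sink)) return path;    // reached sink unchecked
            if (!visited.add(method)) continue;      // already explored
            for (String callee : callGraph.getOrDefault(method, List.of())) {
                List<String> next = new ArrayList<>(path);
                next.add(callee);
                work.push(next);
            }
        }
        return List.of();
    }

    public static void main(String[] args) {
        // Hypothetical call graph: one guarded route and one convoluted,
        // unguarded route to the same native method.
        Map<String, List<String>> callGraph = Map.of(
            "Applet.run", List.of("Net.open", "Util.helper"),
            "Net.open", List.of("Native.connect"),
            "Util.helper", List.of("Net.rawOpen"),
            "Net.rawOpen", List.of("Native.connect"));

        // prints [Applet.run, Util.helper, Net.rawOpen, Native.connect]
        System.out.println(unmediatedPath(
            callGraph, "Applet.run", "Native.connect", Set.of("Net.open")));

        // prints [] once Net.rawOpen also performs the check
        System.out.println(unmediatedPath(
            callGraph, "Applet.run", "Native.connect",
            Set.of("Net.open", "Net.rawOpen")));
    }
}
```

Returning the offending path, rather than a yes/no answer, matters for usability: it shows the reviewer exactly which sequence of calls bypasses the authorization test.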

In addition to reducing the conservativeness of the analyses performed, we focus on making the results easier to use. One of the important features of our work is to directly link the analysis results back to the source code, regardless of whether we started from source code or object code. This enables the user to understand the identified problems in the context in which they occurred. This includes reporting paths through the code or data structures that can lead to the identified issues. A navigable presentation has a dramatic effect on the usability and usefulness of the result.

Many security and privacy failures are due to violations of "Best Practices" for a programming model, including Java 2 Enterprise Edition (J2EE) and Microsoft's .NET. J2EE is the primary programming model for Java-based server-side applications (e.g., Servlets, JSPs, Struts, EJBs). These programming models have rules that can be unintentionally violated, even by experts. In addition, there are "rules of thumb" for implementing code in these programming models. Violations of the programming model and/or rules of thumb can have disastrous results.

We are applying our static analysis techniques to identify code that violates the programming model. Correcting application code so that it conforms to the identified "best practices" rules will not only improve the performance and correctness of the applications, but will also close security and privacy holes. For example, the identification and removal of an unnecessary synchronized database or messaging operation will improve overall throughput. The same improvement removes an opportunity for an adversary to exploit that operation in a denial of service attack, where the adversary submits many seemingly reasonable service requests that would be serialized through the database or messaging operation.
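A minimal sketch of an unnecessary synchronized operation follows. The class names are hypothetical, and an in-memory immutable map stands in for the database or messaging operation described above; it is an assumption made only to keep the example self-contained. It requires Java 9 or later for `Map.of`.

```java
import java.util.Map;

// Hypothetical request handlers; names are illustrative assumptions.
final class SerializedHandler {
    private final Map<String, String> catalog = Map.of("sku-1", "widget");

    // Every request takes the same lock, although the lookup only reads
    // immutable data. Concurrent requests queue up behind one another,
    // which an adversary can exploit by flooding the service.
    synchronized String handle(String sku) {
        return catalog.getOrDefault(sku, "unknown");
    }
}

final class ConcurrentHandler {
    private final Map<String, String> catalog = Map.of("sku-1", "widget");

    // Same result without the lock: Map.of returns an immutable map, so
    // concurrent reads are safe and requests no longer serialize.
    String handle(String sku) {
        return catalog.getOrDefault(sku, "unknown");
    }
}

final class HandlerDemo {
    public static void main(String[] args) {
        System.out.println(new SerializedHandler().handle("sku-1")); // prints widget
        System.out.println(new ConcurrentHandler().handle("sku-1")); // prints widget
    }
}
```

Removing the lock is only safe because the guarded state is immutable; an analysis that recommends such a change has to establish that fact first.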

Security analysis of C/C++ code remains a critical issue for the IT industry. Newer static analysis techniques and tools are enabling researchers to verify type safety of "legacy" C/C++ code. Once type safety has been determined, stronger analyses that are currently possible with Java will be enabled for some aspects of C/C++ code. Code injection will enable monitoring of the C/C++ code that cannot be verified statically. Examples of this work are applied to the Linux Security Modules (LSM) as part of an effort by IBM to improve Linux security.

Future Directions

Future directions aim to improve the usability of analysis techniques, the breadth of problems to which they can be applied, and the effectiveness of the techniques themselves, and to use Best Practices to drive techniques that can automatically recommend how to restructure code to improve its security and/or remove an identified privacy problem. Our early efforts in this area are SABER and SWORD4J.

First, there is a movement to automate the construction of the models required to do the analyses, along with a layering of analysis techniques, to make such analysis easier to perform and use. Ultimately, the user wants to provide only the code to be analyzed and get responses indicating real flaws, not false warnings. Exploration into combining analyses and maintaining dependencies to reduce recomputation is still rather immature.

Second, the various kinds of analyses must be collected into a coherent set to gain confidence in the system as a whole. We need to verify that buffer overflows and printf-style format string vulnerabilities cannot occur, and that proper authorization takes place along all paths, before a high degree of confidence in the code is possible. Thus, the myriad analysis approaches must be collected into a coherent tool. Ultimately, any redundant effort must be eliminated to improve usability (e.g., use of a common model across the various tools) and performance (avoiding recomputation of the same results).

Third, improvements to the analysis techniques themselves will have obvious effects on this approach. Fine-grained control and data flow analyses have a well-deserved reputation for being slow, requiring substantial RAM, and being too "conservative". To make these tools practical for everyday use and to incorporate them into regular development, the recognized limitations need to be addressed. This includes developing algorithms for incremental analysis based on previously analyzed libraries and components. For example, the Java runtime, middleware code, and the C/C++ runtime libraries could be pre-analyzed for their essential control flow, data flow, and security properties. Application code can then be analyzed relative to the previously computed results. Parts of the control and data flows that are nonessential to the analysis being performed could be omitted from the application code analysis. This results in less computation and a smaller memory footprint.

Even with faster analyses, the current large-scale analyses are based on a path-insensitive analysis algorithm. Though sufficient for many analyses, it is conservative. Stronger analyses based on path-sensitive algorithms are desirable to reduce the conservativeness, resulting in fewer false positives. Since the analysis results are currently targeted at people performing security or software engineering tasks, false positives are time-consuming to track down and refute. They also reduce the user's confidence in the robustness of the tool. Making path-sensitive analysis efficient in time and space is going to be a challenge.
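The difference between the two approaches can be seen in a small, hypothetical example with correlated branches. The method names and the simple HTML escaping are assumptions for illustration only. A path-insensitive analysis merges the two outcomes of the first `if` and concludes that `value` may still be unsanitized at the sink, so it reports a flaw. A path-sensitive analysis notices that the sink is only reached when `untrusted` is true, in which case `value` was always escaped.

```java
final class PathSensitivityDemo {
    // Minimal HTML escaping, sufficient for this illustration only.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static String render(String input, boolean untrusted) {
        String value = input;              // potentially tainted
        if (untrusted) {
            value = escape(input);         // sanitized on this path
        }
        // Merge point: a path-insensitive analysis records
        // "value may be tainted" here, forgetting which branch ran.
        StringBuilder page = new StringBuilder("<p>");
        if (untrusted) {
            page.append(value);            // sink: flagged, yet always sanitized
        }
        return page.append("</p>").toString();
    }

    public static void main(String[] args) {
        System.out.println(render("<b>", true));   // prints <p>&lt;b&gt;</p>
        System.out.println(render("<b>", false));  // prints <p></p>
    }
}
```

Refuting this report by hand is easy in a ten-line method, but the same pattern spread across several procedures is the kind of false positive that is time-consuming to track down.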

Ultimately, the confidence we have in an analysis of code for security and/or privacy issues is constrained by the models constructed and the verification techniques employed. As the theoretical and practical techniques progress, we will be able to expand the scope of what we are able to analyze and increase our confidence level. One area of research is the interaction between components written and modelled using separate tools, such as Java code calling C/C++ "native" methods, interactions of Perl scripts with the underlying operating system, operating system calls into assembly language-based device drivers, etc. Part of the research is in the integration of component specifications (through formal methods or via automated tools) with verification of how components interact with one another.

The initial successes in the use of static analysis for security are promising. There is an intuition that the same techniques can be applied to issues in the privacy arena. The use of static analysis for security beyond the traditional type-safety issues has gained traction both in academia and in the security world at large. Part of this is motivated by a shift in the programming languages community to broaden their focus to investigate non-compilation / optimization issues. Security is perceived to be an interesting area of research. In addition, the current state of programming languages technologies appears to be maturing to the point that the algorithms can start to be applied to real-world security issues.