David F. Bacon
2012
Joshua Auerbach, David F. Bacon, Ioana Burcea, Perry Cheng, Stephen J. Fink, Rodric Rabbah, Sunil Shukla
Proceedings of the 49th Annual Design Automation Conference, pp. 271--276, ACM, 2012
Heterogeneous systems show a lot of promise for extracting high-performance by combining the benefits of conventional architectures with specialized accelerators in the form of graphics processors (GPUs) and reconfigurable hardware (FPGAs). Extracting this performance often entails programming in disparate languages and models, making it hard for a programmer to work equally well on all aspects of an application. Further, relatively little attention is paid to co-execution---the problem of orchestrating program execution using multiple distinct computational elements that work seamlessly together.
We present Liquid Metal, a comprehensive compiler and runtime system for a new programming language called Lime. Our work enables the use of a single language for programming heterogeneous computing platforms, and the seamless co-execution of the resultant programs on CPUs and accelerators that include GPUs and FPGAs.
We have developed a number of Lime applications, and successfully compiled some of these for co-execution on various GPU and FPGA enabled architectures. Our experience so far leads us to believe the Liquid Metal approach is promising and can make the computational power of heterogeneous architectures more easily accessible to mainstream programmers.
David F. Bacon, Perry Cheng, Sunil Shukla
The Second Workshop on the Intersections of Computer Architecture and Reconfigurable Logic, 2012
Programmers are turning to diverse architectures such as reconfigurable hardware (FPGAs) to achieve performance. But such systems are far more complex to use than conventional CPUs. The continued exponential increase in transistors, combined with the desire to implement ever more sophisticated algorithms, makes it imperative that such systems be programmed at much higher levels of abstraction. One fundamental high-level language features is automatic memory management in the form of garbage collection.
We present the first implementation of a complete garbage collector in hardware (as opposed to previous ``hardware-assist'' techniques), using an FPGA and its on-chip memory. Using a completely concurrent snapshot algorithm, it provides single-cycle access to the heap, and never stalls the mutator for even a single cycle.
We have synthesized the collector to hardware and show that it never consumes more than 1\% of the logic resources of a high-end FPGA. For comparison we also implemented explicit (malloc/free) memory management, and show that our collector is between 4\% to 17\% slower than malloc, with comparable energy consumption. Surprisingly, in hardware real-time collection is superior to stop-the-world collection on every performance axis, and even for stressful micro-benchmarks can achieve 100\% MMU with heaps as small as 1.01 to 1.4 times the absolute minimum.
This reprises work previously published in PLDI [2].
Christophe Dubach, Perry Cheng, Rodric Rabbah, David F. Bacon, Stephen J. Fink
Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, pp. 1--11, ACM, 2012
Languages such as OpenCL and CUDA offer a standard interface for general-purpose programming of GPUs. However, with these languages, programmers must explicitly manage numerous low- level details involving communication and synchronization. This burden makes programming GPUs difficult and error-prone, ren- dering these powerful devices inaccessible to most programmers.
We desire a higher-level programming model that makes GPUs more accessible while also effectively exploiting their computa- tional power. This paper presents features of Lime, a new Java- compatible language targeting heterogeneous systems, that allow an optimizing compiler to generate high quality GPU code. The key insight is that the language type system enforces isolation and immutability invariants that allow the compiler to optimize for a GPU without heroic compiler analysis.
We evaluate the performance of the resulting code, comparing against an implementation on the Java Virtual Machine (with JIT) and against hand-tuned native OpenCL code. The compiler attains GPU speedups relative to JVM/JIT between 12x and 431x, while achieving between 75\% and 140\% of the performance of native OpenCL code.
David F. Bacon, Yiling Chen, Ian Kash, David Parkes, Malvika Rao, Manu Sridharan
Proceedings of the Eleventh International Conference on Autonomous Agents and Multi-agent Systems, 2012
We consider a setting in which a worker and a manager may each have information about the likely completion time of a task, and the worker also affects the completion time by choosing a level of effort. The task itself may further be composed of a set of subtasks, and the worker can also decide how many of these subtasks to split out into an explicit prediction task. In addition, the worker can learn about the likely completion time of a task as work on subtasks completes. We characterize a family of scoring rules for the worker and manager that provide three properties: information is truthfully reported; best effort is exerted by the worker in completing tasks as quickly as possible; and collusion is not possible. We also study the factors influencing when a worker will split a task into subtasks, each forming a separate prediction target.
Best paper award
David F. Bacon, Perry Cheng, Sunil Shukla
Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, pp. 23--34, ACM, 2012
Programmers are turning to radical architectures such as reconfigurable hardware (FPGAs) to achieve performance. But such systems, programmed at a very low level in languages with impoverished abstractions, are orders of magnitude more complex to use than conventional CPUs. The continued exponential increase in transistors, combined with the desire to implement ever more sophisticated algorithms, makes it imperative that such systems be programmed at much higher levels of abstraction. One of the fundamental high-level language features is automatic memory management in the form of garbage collection.
We present the first implementation of a complete garbage collector in hardware (as opposed to previous ``hardware-assist'' techniques), using an FPGA and its on-chip memory. Using a completely concurrent snapshot algorithm, it provides single-cycle access to the heap, and never stalls the mutator for even a single cycle, achieving a deterministic mutator utilization (MMU) of 100\%.
We have synthesized the collector to hardware and show that it never consumes more than 1\% of the logic resources of a high-end FPGA. For comparison we also implemented explicit (malloc/free) memory management, and show that real-time collection is about 4\% to 17\% slower than malloc, with comparable energy consumption. Surprisingly, in hardware real-time collection is superior to stop-the-world collection on every performance axis, and even for stressful micro-benchmarks can achieve 100\% MMU with heaps as small as 1.01 to 1.4 times the absolute minimum.
2011
David F. Bacon
Proceedings of the Seventh ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, pp. 1--2, ACM, 2011
Since their invention over 40 years ago, virtual machines have been used to virtualize one or more von Neumann processors and their associated peripherals. System virtual machines provide the illusion that the user has their own instance of a physical machine with a given instruction set architecture (ISA). Process virtual machines provide the illusion of running on a synthetic architecture independent of the underlying ISA, generally for the purpose of supporting a high-level language.
To continue the historical trend of exponential increase in computational power in the face of limits on clock frequency scaling, we must find ways to harness the inherent parallelism of billions of transistors. I contend that multi-core chips are a fatally flawed approach - instead, maximum performance will be achieved by using heterogeneous chips and systems that combine customized and customizable computational substrates that achieve very high performance by closely matching the computational and communications structures of the application at hand.
Such chips might look like a mashup of a conventional multicore, a GPU, an FPGA, some ASICs, and a DSP. But programming them with current technologies would be nightmarishly complex, portability would be lost, and innovation between chip generations would be severely limited.
The answer (of course) is virtualization, and at both the device level and the language level.
In this talk I will illustrate some challenges and potential solutions in the context of IBM's Liquid Metal project, in which we are designing a new high-level language (Lime) and compiler/runtime technology to virtualize the underlying computational devices by providing a uniform semantic model.
I will also discuss problems (and opportunities) that this raises at the operating system and data center levels, particularly with computational elements like FPGAs for which ``context switching'' is currently either extremely expensive or simply impossible.
Joshua Auerbach, David F. Bacon, Perry Cheng, Rodric Rabbah, Sunil Shukla
Proceedings of the 48th Design Automation Conference, pp. 890--894, ACM, 2011
Lime is a new Java-compatible and object-oriented language designed to make programming of reconflgurable hardware significantly more accessible to skilled software developers. Lime programs may run either in software (via Java bytecodes) or in hardware (via behavioral and logic synthesis). This paper illustrates the salient synthesis-oriented features of the language using a photo-mosaic algorithm with inherent bit, pipeline, and data parallelism. The result is a virtual machine abstraction that extends across a heterogeneous architecture comprising a CPU, FPGA, and other computational structures.
2010
David F. Bacon, Jyh-Herng Chow, Dz-Ching R. Ju, Kalyan Muthukumar, Vivek Sarkar
CASCON First Decade High Impact Papers, pp. 146--158, ACM, 2010
It has been observed that memory access performance can be improved by restructuring data declarations, using simple transformations such as array dimension padding and inter-array padding (array alignment) to reduce the number of misses in the cache and TLB (translation lookaside buffer). These transformations can be applied to both static and dynamic array variables. In this paper, we provide a padding algorithm for selecting appropriate padding amounts, which takes into account various cache and TLB effects collectively within a single framework. In addition to reducing the number of misses, we identify the importance of reducing the impact of cache miss jamming by spreading cache misses more uniformly across loop iterations.
We translate undesirable cache and TLB behaviors into a set of constraints on padding amounts and propose a heuristic algorithm of polynomial time complexity to find the padding amounts to satisfy these constraints. The goal of the padding algorithm is to select padding amounts so that there are no set conflicts and no offset conflicts in the cache and TLB, for a given loop. In practice, this algorithm can efficiently find small padding amounts to satisfy these constraints.
David F. Bacon, Eric Bokelberg, Yiling Chen, Ian A. Kash, David C. Parkes, Malvika Rao, Manu Sridharan
Proceedings of the FSE/SDP Workshop on Future of Software Engineering Research, pp. 7--12, ACM, 2010
Software construction has typically drawn on engineering metaphors like building bridges or cathedrals, which emphasize architecture, specification, central planning, and determinism. Approaches to correctness have drawn on metaphors from mathematics, like formal proofs. However, these approaches have failed to scale to modern software systems, and the problem keeps getting worse.
We believe that the time has come to completely re-imagine the creation of complex software, drawing on systems in which behavior is decentralized, self-regulating, non-deterministic, and emergent---like economies.
In this paper we describe our vision for, and prelimary work on, the creation of software economies for both open systems and internal corporate development, and our plans to deploy these ideas within one of the largest developer communities at IBM.
Joshua Auerbach, David F. Bacon, Perry Cheng, Rodric Rabbah
Proceedings of the ACM International Conference on Object-Oriented Programming Systems, Languages, and Applications, pp. 89--108, ACM, 2010
The halt in clock frequency scaling has forced architects and language designers to look elsewhere for continued improvements in performance. We believe that extracting maximum performance will require compilation to highly heterogeneous architectures that include reconfigurable hardware.
We present a new language, Lime, which is designed to be executable across a broad range of architectures, from FPGAs to conventional CPUs. We present the language as a whole, focusing on its novel features for limiting side-effects and integration of the streaming paradigm into an object- oriented language. We conclude with some initial results demonstrating applications running either on a CPU or co- executing on a CPU and an FPGA.
Joshua Auerbach, David F. Bacon, Perry Cheng, Rodric Rabbah
Technical Report RC25004, IBM, 2010
version 2.0
2009
Jia Zou, Joshua Auerbach, David F. Bacon, Edward A. Lee
Proceedings of the ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, pp. 31--40, ACM, 2009
The Flexotask system claims to enable implementation of both real-time applications and real-time schedulers in a Java Virtual Machine using an actors-like model. The PTIDES model is an actors-like model that claims to deliver precise control over end-to-end latencies in a complex real-time system. The present work jointly investigates both claims by (1) implementing several PTIDES-based schedulers as Flexotask scheduler plugins, and (2) using the resulting system to implement a new reactive control program for a simulation of the JAviator. We present results from the realistic JAviator control application and also from synthetic benchmarks designed to shed light on the differences between the several PTIDES schedulers we implemented.
Andrei Hagiescu, Weng-Fai Wong, David F. Bacon, Rodric Rabbah
Proceedings of the 46th Annual Design Automation Conference, pp. 282--287, ACM, 2009
Stream processing represents an important class of applications that spans telecommunications, multimedia and the Internet. The implementation of streaming programs in FPGAs has attracted significant attention because of their inherent parallelism and high performance requirements. Languages, tools, and even custom hardware for streaming have been proposed, some of which are commercially available.
There are several significant challenges to realizing streaming applications directly in hardware (FPGAs). Since FPGAs have finite resources, there are often many non-trivial tradeoffs between processing throughput and overall latency. In this paper, we describe an algorithm that computes refinements of stream graphs into designs that optimize processing throughput subject to user-specified area and latency constraints.
Dan Tsafrir, Robert W. Wisniewski, David F. Bacon, Bjarne Stroustrup
Proceedings of the 24th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, pp. 425--444, ACM, 2009
Generic classes can be used to improve performance by allowing compile-time polymorphism. But the applicability of compile-time polymorphism is narrower than that of runtime polymorphism, and it might bloat the object code. We advocate a programming principle whereby a generic class should be implemented in a way that minimizes the dependencies between its members (nested types, methods) and its generic type parameters. Conforming to this principle (1) reduces the bloat and (2) gives rise to a previously unconceived manner of using the language that expands the applicability of compile-time polymorphism to a wider range of problems. Our contribution is thus a programming technique that generates faster and smaller programs. We apply our ideas to GCC's STL containers and iterators, and we demonstrate notable speedups and reduction in object code size (real application runs 1.2x to 2.1x faster and STL code is 1x to 25x smaller). We conclude that standard generic APIs (like STL) should be amended to reflect the proposed principle in the interest of efficiency and compactness. Such modifications will not break old code, simply increase flexibility. Our findings apply to languages like C++, C#, and D, which realize generic programming through multiple instantiations.
Harald R"ock, Joshua Auerbach, Christoph M. Kirsch, David F. Bacon
Proceedings of the Seventh International Workshop on Java Technologies for Real-Time and Embedded Systems, pp. 70--79, ACM, 2009
Large real-time software systems such as real-time Java virtual machines often use barrier protocols, which work for a dynamically varying number of threads without using centralized locking. Such barrier protocols, however, still suffer from priority inversion similar to centralized locking. We introduce gang priority management as a generic solution for avoiding unbounded priority inversion in barrier protocols. Our approach is either kernel-assisted (for efficiency) or library-based (for portability) but involves cooperation from the protocol designer (for generality). We implemented gang priority management in the Linux kernel and rewrote the garbage collection safe-point barrier protocol in IBM's WebSphere Real Time Java Virtual Machine to exploit it. We run experiments on an 8-way SMP machine in a multi-user and multi-process environment, and show that by avoiding unbounded priority inversion, the maximum latency to reach a barrier point is reduced by a factor of 5.3 and the application jitter is reduced by a factor of 1.5.
David F. Bacon, Yiling Chen, David Parkes, Malvika Rao
Companion to the 24th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, pp. 973--980, ACM, 2009
Software correctness has bedeviled the field of computer science since its inception. Software complexity has increased far more quickly than our ability to control it, reaching sizes that are many orders of magnitude beyond the reach of formal or automated verification techniques.
We propose a new paradigm for evaluating ``correctness'' based on a rich market ecosystem in which coalitions of users bid for features and fixes. Developers, testers, bug reporters, and analysts share in the rewards for responding to those bids. In fact, we suggest that the entire software development process can be driven by a disintermediated market-based mechanism driven by the desires of users and the capabilities of developers.
The abstract, unspecifiable, and unknowable notion of absolute correctness is then replaced by quantifiable notions of correctness demand (the sum of bids for bugs) and correctness potential (the sum of the available profit for fixing those bugs). We then sketch the components of a market design intended to identify bugs, elicit demand for fixing bugs, and source workers for fixing bugs. The ultimate goal is to achieve a more appropriate notion of correctness, in which market forces drive software towards a correctness equilibrium in which all bugs for which there is enough value, and with low enough cost to fix, are fixed.
Joshua Auerbach, David F. Bacon, Daniel Iercan, Christoph M. Kirsch, V. T. Rajan, Harald R"ock, Rainer Trummer
ACM Transactions on Embedded Computing Systems 8(2), 1--48, ACM, 2009
Exotasks are a novel Java programming construct that achieve three important goals. They achieve low latency while allowing the fullest use of Java language features, compared to previous attempts to restrict the Java language for use in the submillisecond domain. They support pluggable schedulers, allowing easy implementation of new scheduling paradigms in a real-time Java system. They can achieve deterministic timing, even in the presence of other Java threads, and across changes of hardware and software platform. To achieve these goals, the program is divided into tasks with private heaps. Tasks may be strongly isolated, communicating only with each other and guaranteeing determinism, or weakly isolated, allowing some communication with the rest of the Java application. Scheduling of the tasks' execution, garbage collection, and value passing is accomplished by the pluggable scheduler. Schedulers that we have written employ logical execution time (LET) in association with strong isolation to achieve time portability. We have also built a quad-rotor model helicopter, the JAviator, which we use to evaluate our implementation of Exotasks in an experimental embedded version of IBM's J9 real-time virtual machine. Our experiments show that we are able to maintain very low scheduling jitter and deterministic behavior in the face of variations in both software load and hardware platform. We also show that Exotasks perform nearly as well as Eventrons on a benchmark audio application.
2008
Amir Hormati, Manjunath Kudlur, Scott Mahlke, David F. Bacon, Rodric Rabbah
Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems, pp. 41--50, ACM, 2008
In this paper, we introduce Optimus: an optimizing synthesis compiler for streaming applications. Optimus compiles programs written in a high level streaming language to either software or hardware implementations. The compiler uses a hierarchical compilation strategy that separates concerns between macro- and micro-functional requirements. Macro-functional concerns address how components (modules) are assembled to implement larger more complex applications. Micro-functional issues deal with synthesis issues of the module internals. Optimus thus allows software developers who lack deep hardware design expertise to transparently leverage the advantages of hardware customization without crossing the semantic gap between high level languages and hardware description languages. Optimus generates streaming hardware that achieves on average 40x speedup over our baseline embedded processor for a fraction of the energy. Additionally, our results show that streaming-specific optimizations can further improve performance by 255\% and reduce the area requirements by 16\% in average. These designs are competitive with Handel-C implementations for some of the same benchmarks.
Shan Shan Huang, Amir Hormati, David F. Bacon, Rodric Rabbah
Proceedings of the 22nd European Conference on Object-Oriented Programming, pp. 76--103, Springer-Verlag, 2008
The paradigm shift in processor design from monolithic processors to multicore has renewed interest in programming models that facilitate parallelism. While multicores are here today, the future is likely to witness architectures that use reconfigurable fabrics (FPGAs) as coprocessors. FPGAs provide an unmatched ability to tailor their circuitry per application, leading to better performance at lower power. Unfortunately, the skills required to program FPGAs are beyond the expertise of skilled software programmers. This paper shows how to bridge the gap between programming software vs. hardware. We introduce Lime, a new Object-Oriented language that can be compiled for the JVM or into a synthesizable hardware description language. Lime extends Java with features that provide a way to carry OO concepts into efficient hardware. We detail an end-to-end system from the language down to hardware synthesis and demonstrate a Lime program running on both a conventional processor and in an FPGA.
Joshua Auerbach, David F. Bacon, Rachid Guerraoui, Jesper Honig Spring, Jan Vitek
Proceedings of the ACM SIGPLAN-SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, pp. 1--11, ACM, 2008
The disadvantages of unconstrained shared-memory multi-threading in Java, especially with regard to latency and determinism in realtime systems, have given rise to a variety of language extensions that place restrictions on how threads allocate, share, and communicate memory, leading to order-of-magnitude reductions in latency and jitter. However, each model makes different trade-offs with respect to expressiveness, efficiency, enforcement, and latency, and no one model is best for all applications.
In this paper we present Flexible Task Graphs (Flexotasks), a single system that allows different isolation policies and mechanisms to be combined in an orthogonal manner, subsuming four previously proposed models as well as making it possible to use new combinations best suited to the needs of particular applications. We evaluate our implementation on top of the IBM Web-Sphere Real Time Java virtual machine using both a microbenchmark and a 30 KLOC avionics collision detector. We show that Flexotasks are capable of executing periodic threads at 10 KHz with a standard deviation of 1.2us and that it achieves significantly better performance than RTSJ's scoped memory constructs while remaining impervious to interference from global garbage collection.
Doug Lea, David F. Bacon, David Grove
Proceedings of the 2008 SIGPLAN Workshop on Programming Language Curriculum, pp. 87--92, ACM
Programs encounter increasingly complex and fragile mappings to computing platforms, resulting in performance characteristics that are often mysterious to students, practitioners, and even researchers. We discuss some steps toward an experimental methodology that demands and provides a deep understanding of complete systems, the necessary instrumentation and tools to support such a methodology, and a curriculum that teaches the methodology and tools as a fundamental part of the discipline.
Bill McCloskey, David F. Bacon, Perry Cheng, David Grove
Technical Report RC24504, 2008
Existing real-time garbage collectors are either unable to scale to large multiprocessors, or unable to meet hard real-time require- ments even with specialized hardware support.
These limitations are rapidly becoming unacceptable: hardware improvements have brought multi-gigabyte heaps and ubiquitous multi-core parallelism; applications have increasingly stringent real-time requirements; and non-embedded, highly parallel server applications form an increasing percentage of real-time workloads.
We present Staccato, an algorithm that supports both hard- and soft-real-time garbage collection on stock hardware and both real- time and stock operating systems. The algorithm is incremental, concurrent, and parallel. Defragmentation can be performed on a per-object basis using a lock-free algorithm which requires no atomic mutator operations in the common case. On a real-time operating system it is capable of meeting hard real-time bounds.
We have implemented Staccato in IBM's J9 virtual machine and present an evaluation on IBM's real-time variant of Linux. Staccato is able to achieve maximum pause times of 753us and an MMU of 85\% over a 5ms time window, out-performing both IBM's Metronome-based product and Azul's soft real-time, hardware- assisted collector.
Joshua Auerbach, David F. Bacon, Perry Cheng, David Grove, Ben Biron, Charlie Gracie, Bill McCloskey, Aleksandar Micic, Ryan Sciampacone
Proceedings of the 8th ACM International Conference on Embedded Software, pp. 245--254, ACM, 2008
Real-time Garbage Collection (RTGC) has recently advanced to the point where it is being used in production for financial trading, military command-and-control, and telecommunications. However, among potential users of RTGC, there is enormous diversity in both application requirements and deployment environments.
Previously described RTGCs tend to work well in a narrow band of possible environments, leading to fragile systems and limiting adoption of real-time garbage collection technology.
This paper introduces a collector scheduling methodology called tax-and-spend and the collector design revisions needed to support it. Tax-and-spend provides a general mechanism which works well across a variety of application, machine, and operating system configurations. Tax-and-spend subsumes the predominant pre-existing RTGC scheduling techniques. It allows different policies to be applied in different contexts depending on the needs of the application. Virtual machines can co-exist compositionally on a single machine.
We describe the implementation of our system, Metronome-TS, as an extension of the Metronome collector in IBM's Real-time J9 virtual machine product, and we evaluate it running on an 8-way SMP blade with a real-time Linux kernel. Compared to the state-of-the-art Metronome system on which it is based, implemented in the identical infrastructure, it achieves almost 3x shorter latencies, comparable utilization at a 2.5x shorter time window, and mean throughput improvements of 10-20\%.
2007
Adam Dingle, David F. Bacon
Unpublished, 2007
While many forms of memory management have been proposed, the only ones to achieve widespread adoption have been explicit deallocation and garbage collection. This leaves programmers requiring explicit control of memory behavior unable to obtain the software engineering and security benefits of programming in a safe language. We present a new approach to memory management called alias counting which combines type-based object ownership and run-time reference counting of references by non-owning (aliasing) pointers. When an owning pointer is destroyed, if the target object's reference count is zero, the object is destroyed; otherwise a run-time error occurs. We have implemented alias counting in a safe C#-like language called Gel. The Gel compiler includes an effective reference count elimination optimization that takes advantage of ownership semantics. Our anecdotal experience in writing the Gel compiler and some smaller benchmarks in Gel is that the annotation requirements and extra programming effort are small, and the type system catches most ownership errors at compile time. Quantitatively, we compare microbenchmarks written in Gel, Java, C#, and C++, and versions of the Gel compiler written in both Gel and C#. In our benchmarks, Gel programs require at most 40\% more memory than the corresponding C++ programs, while C# and Java generally require at least twice as much memory and often considerably more. Run-time performance varies considerably across all languages, with no clear winner, but Gel outperforms either C++ or Java/C# in all but one benchmark. Our reference count optimization typically eliminates 90\% of all reference counting operations, speeding Gel programs by 20-40\%.
Unpublished
Daniel Frampton, David F. Bacon, Perry Cheng, David Grove
Proceedings of the 21st European Conference on Object-Oriented Programming, pp. 101--125, Springer, 2007
While real-time garbage collection is now available in production virtual machines, the lack of generational capability means applications with high allocation rates are subject to reduced throughput and high space overheads.
Since frequent allocation is often correlated with a high-level, object-oriented style of programming, this can force builders of real-time systems to compromise on software engineering.
We have developed a fully incremental, real-time generational collector based on a tri-partite nursery, which partitions the nursery into regions that are being allocated, collected, and promoted. Nursery collections are incremental, and can occur within any phase of a mature collection.
We present the design, mathematical model, and implementation of our collector in IBM's production Real-time Java virtual machine, and show both analytically and experimentally that the collector achieves real-time bounds comparable to a non-generational Metronome-style collector, while cutting memory consumption and total execution times by as much as 44\% and 24\% respectively.
Ben L. Titzer, Joshua Auerbach, David F. Bacon, Jens Palsberg
Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 352--362, ACM, 2007
Embedded systems pose unique challenges to Java application developers and virtual machine designers. Chief among these challenges is the memory footprint of both the virtual machine and the applications that run within it. With the rapidly increasing set of features provided by the Java language, virtual machine designers are often forced to build custom implementations that make various tradeoffs between the footprint of the virtual machine and the subset of the Java language and class libraries that are supported. In this paper, we present the ExoVM, a system in which an application is initialized in a fully featured virtual machine, and then the code, data, and virtual machine features necessary to execute it are packaged into a binary image. Key to this process is feature analysis, a technique for computing the reachable code and data of a Java program and its implementation inside the VM simultaneously. The ExoVM reduces the need to develop customized embedded virtual machines by reusing a single VM infrastructure and automatically eliding the implementation of unused Java features on a per-program basis. We present a constraint-based instantiation of the analysis technique, an implementation in IBM's J9 Java VM, experiments evaluating our technique for the EEMBC benchmark suite, and some discussion of the individual costs of some of Java's features. Our evaluation shows that our system can reduce the non-heap memory allocation of the virtual machine by as much as 75\%. We discuss VM and language design decisions that our work shows are important in targeting embedded systems, supporting the long-term goal of a common VM infrastructure spanning from motes to large servers.
Joshua Auerbach, David F. Bacon, Daniel T. Iercan, Christoph M. Kirsch, V. T. Rajan, Harald R"ock, Rainer Trummer
Proceedings of the ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, pp. 51--62, ACM, 2007
Existing programming methodologies for real-time systems suffer from a low level of abstraction and non-determinism in both the timing and the functional domains. As a result, real-time systems are difficult to test and must be re-certified every time changes are made to either the software or hardware environment. Exotasks are a novel Java programming construct that achievedeterministic timing, even in the presence of other Java threads, and across changes of hardware and software platform. They are deterministic functional data-flow tasks written in Java, combined with an orthogonal scheduling policy based on the logical execution time (LET) model. We have built a quad-rotor model helicopter, the JAviator, which we use as a testbed for this work. We evaluate our implementation of exotasks in IBM's J9 real-time virtual machine using actual flights of the helicopter. Our experiments show that we are able to maintain deterministic behavior in the face of variations in both software load and hardware platform.
Martin T. Vechev, Eran Yahav, David F. Bacon, Noam Rinetzky
Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 456--467, ACM, 2007
Concurrent garbage collectors are notoriously hard to design, implement, and verify. We present a framework for the automatic exploration of a space of concurrent mark-and-sweep collectors. In our framework, the designer specifies a set of ``building blocks'' from which algorithms can be constructed. These blocks reflect the designer's insights about the coordination between the collector and the mutator. Given a set of building blocks, our framework automatically explores a space of algorithms, using model checking with abstraction to verify algorithms in the space.
We capture the intuition behind some common mark-and-sweep algorithms using a set of building blocks. We utilize our framework to automatically explore a space of more than 1,600,000 algorithms built from these blocks, and derive over 100 correct fine-grained algorithms with various space, synchronization, and precision tradeoffs.
David F. Bacon
Queue 5(1), 40--49, ACM, 2007
Traditional computer science deals with the computation of correct results. Realtime systems interact with the physical world, so they have a second correctness criterion: they have to compute the correct result within a bounded amount of time. Simply building functionally correct software is hard enough. When timing is added to the requirements, the cost and complexity of building the software increase enormously.
Joshua Auerbach, David F. Bacon, Florian B"omers, Perry Cheng
Proceedings of the International Computer Music Conference, 2007
Automatic memory management via garbage collection is the key to the safety, portability, and high productivity of modern programming languages like Java. However, until now no truly real-time garbage collector has existed for Java. As a result, the extreme real-time requirements of interactive music synthesis and processing have made it impossible to build such systems in Java.
We have developed a hard real-time garbage collector, called Metronome, around which IBM has built a production real-time Java virtual machine running on a real-time variant of Redhat Enterprise Linux. In this paper we demonstrate the real-time capabilities of our virtual machine with a MIDI music synthesizer that we built entirely in standard Java using the full object-oriented feaures of the language.
We show that even with the addition of another thread allocating 8 MB/second of data, our garbage collector is able to sustain error-free 44 KHz music synthesis down to buffer sizes of 1ms, achieving keyboard-to-speaker latencies of 5.0 +/- 0.75ms, comparable to a Kurzweil K2000R synthesizer (3.9 +/- 1ms) and suitable for high-fidelity interactive performance.
David F. Bacon, Perry Cheng, David Grove
Companion to the 22nd ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, pp. 854--855, ACM, 2007
Debugging the timing behavior of real-time systems is notoriously difficult, and with a new generation of complex systems consisting of tens of millions of lines of code, the difficulty is increasing enormously. We have developed TuningFork, a tool especially designed for visualization and analysis of large-scale real-time systems. TuningFork is capable of recording high-frequency events at sub-microsecond resolution with minimal perturbation. Users can visualize system activity online in real-time and interactively explore the data. Data can be gathered from multiple layers and/or components and synthesized into visualizations that illuminate whole system interactions. Interactive exploration of hypothesis is naturally supported by direct manipulation to quickly build up complex visualizations.
Joshua Auerbach, David F. Bacon, Bob Blainey, Perry Cheng, Michael Dawson, Mike Fulton, David Grove, Darren Hart, Mark Stoodley
Proceedings of the Seventh ACM/IEEE International Conference on Embedded Software, pp. 249--258, ACM, 2007
The emergence of standards for programming real-time systems in Java has encouraged many developers to consider its use for systems previously only built using C, Ada, or assembly language. However, the RTSJ standard in isolation leaves many important problems unaddressed, and suffers from some serious problems in usability and safety.
As a result, the use of Java for real-time programming has continued to be viewed as risky and adoption has been slow.
In this paper we provide a description of IBM's new real-time Java virtual machine product, which combines Metronome real-time garbage collection, ahead-of-time compilation, and a complete implementation of the RTSJ standard, running on top of a custom real-time multiprocessor Linux kernel.
We will describe the implementation of each of these components, including how they interacted both positively and negatively, and the extensions to previous work required to move it from research prototype to a system implementing the complete semantics of the Java language. The system has been adopted for hard real-time development of naval weapons systems and soft real-time telecommunications servers. We present measurements showing that the system is able to provide sub-millisecond worst-case garbage collection latencies, 50 microsecond Linux scheduling accuracy, and eliminate non-determinism due to JIT compilation.
2006
David F. Bacon, Xiaowei Shen
IBM Journal of Research and Development 50(2/3), 209--221, IBM Corp., 2006
As processor speeds continue to increase at a much higher rate than memory speeds, memory latencies may soon approach a thousand processor cycles. As a result, the flat memory model that was made practical by deeply pipelined superscalar processors with multilevel caches will no longer be tenable. The most common approach to this problem is multithreading; however, multithreading requires either abundant independent applications or well-parallelized monolithic applications, and neither is easy to come by. We present high-level programming constructs called braids and fibers. The programming constructs facilitate the creation of programs that are partially ordered, in which the partial orders can be used to support adaptive responses to memory access latencies. Braiding is simpler than parallelizing, while yielding many of the same benefits. We show how the programming constructs can be effectively supported with simple instruction set architecture extensions and microarchitectural enhancements. We have developed braided versions of a number of important algorithms. The braided code is easy to understand at the source level and can be translated into highly efficient instructions using our architecture extensions.
Martin T. Vechev, Eran Yahav, David F. Bacon
Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 341--353, ACM, 2006
Constructing correct concurrent garbage collection algorithms is notoriously hard. Numerous such algorithms have been proposed, implemented, and deployed - and yet the relationship among them in terms of speed and precision is poorly understood, and the validation of one algorithm does not carry over to others.As programs with low latency requirements written in garbagecollected languages become part of society's mission-critical infrastructure, it is imperative that we raise the level of confidence in the correctness of the underlying system, and that we understand the trade-offs inherent in our algorithmic choice.In this paper we present correctness-preserving transformations that can be applied to an initial abstract concurrent garbage collection algorithm which is simpler, more precise, and easier to prove correct than algorithms used in practice--but also more expensive and with less concurrency. We then show how both pre-existing and new algorithms can be synthesized from the abstract algorithm by a series of our transformations. We relate the algorithms formally using a new definition of precision, and informally with respect to overhead and concurrency.This provides many insights about the nature of concurrent collection, allows the direct synthesis of new and useful algorithms, reduces the burden of proof to a single simple algorithm, and lays the groundwork for the automated synthesis of correct concurrent collectors.
David F. Bacon, Perry Cheng, Daniel Frampton, David Grove, Matthias Hauswirth, V. T. Rajan
Proceedings of the 15th International Conference on Compiler Construction, pp. 96-100, Springer, 2006
TuningFork is an online, scriptable data visualization and analysis tool that supports the development and continuous monitoring of real-time systems. While TuningFork was originally designed and tested for use with a particular real-time Java Virtual Machine, the architecture has been designed from the ground up for extensibility by leveraging the Eclipse plug-in architecture. This allows different client programs to design custom data formats, new visualization and analysis components, and new export formats. The TuningFork views allow the visualization of data from time scales of microseconds to minutes, enabling rapid understanding and analysis of system behavior.
Daniel Spoonhower, Joshua Auerbach, David F. Bacon, Perry Cheng, David Grove
Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 283--294, ACM, 2006
While real-time garbage collection has achieved worst-case latencies on the order of a millisecond, this technology is approaching its practical limits. For tasks requiring extremely low latency, and especially periodic tasks with frequencies above 1 KHz, Java programmers must currently resort to the NoHeapRealtimeThread construct of the Real-Time Specification for Java. This technique requires expensive run-time checks, can result in unpredictable low-level exceptions, and inhibits communication with the rest of the garbage-collected application. We present Eventrons, a programming construct that can arbitrarily preempt the garbage collector, yet guarantees safety and allows its data to be visible to the garbage-collected heap. Eventrons are a strict subset of Java, and require no run-time memory access checks. Safety is enforced using a data-sensitive analysis and simple run-time support with extremely low overhead. We have implemented Eventrons in IBM's J9 Java virtual machine, and present experimental results in which we ran Eventrons at frequencies up to 22 KHz (a 45 us period). Across 10 million periods, 99.997\% of the executions ran within 10 us of their deadline, compared to 99.999\% of the executions of the equivalent program written in C.
2005
David F. Bacon, Perry Cheng, David Grove, Martin T. Vechev
Proceedings of the ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, pp. 183--192, ACM, 2005
Real-time garbage collection has been shown to be feasible, but for programs with high allocation rates, the utilization achievable is not sufficient for some systems.Since a high allocation rate is often correlated with a more high-level, abstract programming style, the ability to provide good real-time performance for such programs will help continue to raise the level of abstraction at which real-time systems can be programmed.We have developed techniques that allow generational collection to be used despite the problems caused by variance in program behavior over the short time scales in which a nursery can be collected. Syncopation allows such behavior to be detected by the scheduler in time for allocation to by-pass the nursery and allow real-time bounds to be met.We have provided an analysis of the costs of both generational and non-generational techniques, which allow the trade-offs to be evaluated quantitatively. We have also provided measurements of application behavior which show that while syncopation is necessary, the need for it is rare enough that generational collection can provide major improvements in real-time utilization. An additional technique, arraylet pre-tenuring, often significantly improves generational behavior.
Harel Paz, Erez Petrank, David F. Bacon, Elliot K. Kolodner, V. T. Rajan
Proceedings of the 14th International Conference on Compiler Construction, pp. 156--171, Springer-Verlag, 2005
A reference-counting garbage collector cannot reclaim unreachable cyclic structures of objects. Therefore, reference-counting collectors either use a backup tracing collector infrequently, or employ a cycle collector to reclaim cyclic structures. We propose a new concurrent cycle collector, i.e., one that runs concurrently with the program threads, imposing negligible pauses (of around 1ms) on a multiprocessor.
Our new collector combines the state-of-the-art cycle collector [5] with the sliding-views collectors [20, 2]. The use of sliding views for cycle collection yields two advantages. First, it drastically reduces the number of cycle candidates, which in turn, drastically reduces the work required to record and trace these candidates. Therefore, a large improvement in cycle collection efficiency is obtained. Second, it eliminates the theoretical termination problem that appeared in the previous concurrent cycle collector. There, a rare race may delay the reclamation of an unreachable cyclic structure forever. The sliding-views cycle collector guarantees reclamation of all unreachable cyclic structures.
The proposed collector was implemented on the Jikes RVM and we provide measurements including a comparison between the use of backup tracing and the use of cycle collection with reference counting. To the best of our knowledge, such a comparison has not been reported before.
Martin T. Vechev, David F. Bacon, Perry Cheng, David Grove
Proceedings of the 19th European Conference on Object-Oriented Programming, pp. 577--601, Springer-Verlag, 2005
There are many algorithms for concurrent garbage collection, but they are complex to describe, verify, and implement. This has resulted in a poor understanding of the relationships between the algorithms, and has precluded systematic study and comparative evaluation. We present a single high-level, abstract concurrent garbage collection algorithm, and show how existing snapshot and incremental update collectors, can be derived from the abstract algorithm by reducing precision. We also derive a new hybrid algorithm that reduces floating garbage while terminating quickly. We have implemented a concurrent collector framework and the resulting algorithms in IBM's J9 Java virtual machine product and compared their performance in terms of space, time, and incrementality. The results show that incremental update algorithms sometimes reduce memory requirements (on 3 of 5 benchmarks) but they also sometimes take longer due to recomputation in the termination phase (on 4 of 5 benchmarks). Our new hybrid algorithm has memory requirements similar to the incremental update collectors while avoiding recomputation in the termination phase.
David F. Bacon, Perry Cheng, David Grove, Michael Hind, V. T. Rajan, Eran Yahav, Matthias Hauswirth, Christoph M. Kirsch, Daniel Spoonhower, Martin T. Vechev
Proceedings of the Fifth ACM International Conference on Embedded Software, pp. 68--78, ACM, 2005
Real-time systems have reached a level of complexity beyond the scaling capability of the low-level or restricted languages traditionally used for real-time programming.While Metronome garbage collection has made it practical to use Java to implement real-time systems, many challenges remain for the construction of complex real-time systems, some specific to the use of Java and others simply due to the change in scale of such systems.The goal of our current research is the creation of a comprehensive Java-based programming environment and methodology for the creation of complex real-time systems. Our goals include construction of a provably correct real-time garbage collector capable of providing worst case latencies of 100 us, capable of scaling from sensor nodes up to large multiprocessors; specialized programming constructs that retain the safety and simplicity of Java, and yet provide sub-microsecond latencies; the extension of Java's ``write once, run anywhere'' principle from functional correctness to timing behavior; on-line analysis and visualization that aids in the understanding of complex behaviors; and a principled probabilistic analysis methodology for bounding the behavior of the resulting systems.While much remains to be done, this paper describes the progress we have made towards these goals.
2004
Martin T. Vechev, David F. Bacon
Proceedings of the Fourth International Symposium on Memory Management, pp. 13--24, ACM, 2004
Concurrent garbage collectors require write barriers to preserve consistency, but these barriers impose significant direct and indirect costs. While there has been a lot of work on optimizing write barriers, we present the first study of their elision in a concurrent collector. We show conditions under which write barriers are redundant, and describe how these conditions can be applied to both incremental update or snapshot-at-the-beginning barriers. We then evaluate the potential for write barrier elimination with a trace-based limit study, which shows that a significant percentage of write barriers are redundant. On average, 54\% of incremental barriers and 83\% of snapshot barriers are unnecessary.
Sunil Soman, Chandra Krintz, David F. Bacon
Proceedings of the Fourth International Symposium on Memory Management, pp. 49--60, ACM, 2004
Much prior work has shown that the performance enabled by garbage collection (GC) systems is highly dependent upon the behavior of the application as well as on the available resources. That is, no single GC enables the best performance for all programs and all heap sizes. To address this limitation, we present the design, implementation, and empirical evaluation of a novel Java Virtual Machine (JVM) extension that facilitates dynamic switching between a number of very different and popular garbage collectors. We also show how to exploit this functionality using annotation-guided GC selection and evaluate the system using a large number of benchmarks. In addition, we implement and evaluate a simple heuristic to investigate the efficacy of switching automatically. Our results show that, on average, our annotation-guided system introduces less than 4\% overhead and improves performance by 24\% over the worst-performing GC (across heap sizes) and by 7\% over always using the popular Generational/Mark-Sweep hybrid.
David F. Bacon, Ravi Konuru, Chet Murthy, Mauricio J. Serrano
Twenty Years of the ACM SIGPLAN Conference on Programming Language
Design and Implementation (1979--1999): A Selection, pp. 583--595, ACM, 2004
David F. Bacon, Perry Cheng, V. T. Rajan
Proceedings of the 19th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, pp. 50--68, ACM, 2004
Tracing and reference counting are uniformly viewed as being fundamentally different approaches to garbage collection that possess very distinct performance properties. We have implemented high-performance collectors of both types, and in the process observed that the more we optimized them, the more similarly they behaved - that they seem to share some deep structure.
We present a formulation of the two algorithms that shows that they are in fact duals of each other. Intuitively, the difference is that tracing operates on live objects, or ``matter'', while reference counting operates on dead objects, or ``anti-matter''. For every operation performed by the tracing collector, there is a precisely corresponding anti-operation performed by the reference counting collector.
Using this framework, we show that all high-performance collectors (for example, deferred reference counting and generational collection) are in fact hybrids of tracing and reference counting. We develop a uniform cost-model for the collectors to quantify the trade-offs that result from choosing different hybridizations of tracing and reference counting. This allows the correct scheme to be selected based on system performance requirements and the expected properties of the target application.
David F. Bacon, Xiaowei Shen
First Watson Conference on Interaction between Architecture, Circuits, and Compilers, 2004
As processor speeds continue to increase at a much higher exponential rate than DRAM speeds, memory latencies will soon exceed 1000 cycles.
With such non-uniform access times, the flat memory model that was made practical by deeply pipelined superscalar processors with multi-level cache memories will no longer be tenable due to the inexorable effects of Amdahl's Law. The most common approach to this problem is hardware multi-threading, but multi-threading requires either abundant independent applications, or well-parallelized monolithic applications, and neither is easy to come by.
We present braids and fibers, high-level programming constructs which facilitate the creation of programs that are partially ordered. These partial orders can be used to respond adaptively to hardware latencies. We show how these constructs can be effectively supported with very simple and inexpensive instruction set and micro-architectural extensions.
Braiding is much simpler than parallelizing, but yields many of the same benefits. We have developed braided versions of a number of important algorithms, including quicksort and the mark phase of a garbage collector. The braided code is easy to understand at the source level, yet can be translated into highly efficient code using our hardware extensions.
David F. Bacon
Third International Workshop on Language Runtimes, 2004
Virtual machine technology will continue to increase in importance, eventually spanning from motes to supercomputers. In this position paper, I describe my personal vision for the virtual machine of the future: a virtual machine that dynamically modifies all portions of itself, rather than just the compiled code, and that dynamically re- configures the underlying hardware as well.
David F. Bacon, Perry Cheng, David Grove
Proceedings of the Fourth ACM International Conference on Embedded Software, pp. 125--136, ACM, 2004
Security concerns on embedded devices like cellular phones make Java an extremely attractive technology for providing third-party and user-downloadable functionality. However, garbage collectors have typically required several times the maximum live data set size (which is the minimum possible heap size) in order to run well. In addition, the size of the virtual machine (ROM) image and the size of the collector's data structures (metadata) have not been a concern for server- or workstation-oriented collectors.We have implemented two different collectors specifically designed to operate well on small embedded devices. We have also developed a number of algorithmic improvements and compression techniques that allow us to eliminate almost all of the per-object overhead that the virtual machine and the garbage collector require. We describe these optimizations and present measurements of the Java embedded benchmarks (EEMBC) of our implementations on both an IA32 laptop and an ARM-based PDA.For applications with low to moderate allocation rates, our optimized collector running on the ARM is able to achieve 85\% of peak performance with only 1.05 to 1.3 times the absolute minimum heap size. For applications with high allocation rates, the collector achieves 85\% of peak performance with 1.75 to 2.5 times the minimum heap size. The collector code takes up 40 KB of ROM, and collector metadata overhead has been almost completely eliminated, consuming only 0.4\% of the heap.
2003
David F. Bacon
Concurrency---Practice and Experience 15(3--5), 185--206, 2003
Object-oriented programming languages have always distinguished between ``primitive'' and ``user- defined'' data types, and in the case of languages like C++ and Java, the primitives are not even treated as objects, further fragmenting the programming model. The distinction is especially problematic when a particular programming community requires primitive-level support for a new data type, as for complex, intervals, fixed-point numbers, and so on.
We present Kava, a design for a backward-compatible version of Java that solves the problem of programmable lightweight objects in a much more aggressive and uniform manner than previous proposals. In Kava, there are no primitive types; instead, object-oriented programming is provided down to the level of single bits, and types such as int can be explicitly programmed within the language. While the language maintains a uniform object reference semantics, efficiency is obtained by making heavy use of unboxing and semantic expansion.
We describe Kava as a dialect of the Java language, show how it can be used to define various primitive types, describe how it can be translated into Java, and compare it to other approaches to lightweight objects
David F. Bacon, Clement R. Attanasio, V. T. Rajan, Stephen Smith, Han B. Lee
Technical Report, 2003
While garbage collection via reference counting has many appealing properties,
it has always suffered from some major limitations. We present new algorithms
for reference counting that address these limitations by (1) improving on
techniques for avoiding reference counting of stack variables, and (2)
reclaiming cyclic garbage.
These algorithms are designed to operate with concurrent mutators in
multiprocessor run-time systems. We present both proofs of correctness and an
experimental evaluation in the context of the Jalape\~no Java virtual machine.
We present measurements comparing our collector against a non-concurrent
but parallel load-balancing mark-and-sweep collector (that we also implemented
in Jalape\~no), and evaluate the classical tradeoff between response time and
throughput.
When processor or memory resources are limited, the Recycler usually runs at
about 90\% of the speed of the mark-and-sweep collector.
However, with an extra processor to run collection and with a moderate amount
of memory headroom, the Recycler is able to operate without ever blocking the
mutators and achieves a maximum measured mutator delay of only 2.6 milliseconds
for our benchmarks. End-to-end execution time is usually within 5\%.
An extended journal-style paper describing the Recycler, combining and extending material from the PLDI'01 and ECOOP'01 papers
David F. Bacon, Perry Cheng, V. T. Rajan
Proceedings of the OTM 2003 Workshops: On The Move to Meaningful Internet Systems, pp. 466-478, Springer
With the wide-spread adoption of Java, there is significant interest in using the language for programming real-time systems. The community has generally viewed a truly real-time garbage collector as being impossible to build, and has instead focused its efforts on adding manual memory management mechanisms to Java. Unfortunately, these mechanisms are an awkward fit for the language: they introduce significant run-time overhead, introduce run-time memory access exceptions, and greatly complicate the development of library code. In recent work we have shown that it is possible to build a real-time collector for Java with highly regular CPU utilization and greatly reduced memory footprint. The system currently achieves 6 ms pause times with 50\% CPU utilization (MMU) and virtually no ``tail'' in the distribution. We show how this work can be incorporated into a general real-time framework, and extended to systems with higher task frequencies. We argue that the community should focus more effort on such a simple, orthogonal solution that is true to the spirit of the Java language.
David F. Bacon, Perry Cheng, V. T. Rajan
Proceedings of the 30th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 285--298, ACM, 2003
Now that the use of garbage collection in languages like Java is becoming widely accepted due to the safety and software engineering benefits it provides, there is significant interest in applying garbage collection to hard real-time systems. Past approaches have generally suffered from one of two major flaws: either they were not provably real-time, or they imposed large space overheads to meet the real-time bounds. We present a mostly non-moving, dynamically defragmenting collector that overcomes both of these limitations: by avoiding copying in most cases, space requirements are kept low; and by fully incrementalizing the collector we are able to meet real-time bounds. We implemented our algorithm in the Jikes RVM and show that at real-time resolution we are able to obtain mutator utilization rates of 45\% with only 1.6--2.5 times the actual space required by the application, a factor of 4 improvement in utilization over the best previously published results. Defragmentation causes no more than 4\% of the traced data to be copied.
John Corwin, David F. Bacon, David Grove, Chet Murthy
Proceedings of the 18th Annual ACM SIGPLAN Conference on Object-Oriented Programing, Systems, Languages, and Applications, pp. 241--254, ACM, 2003
While Java provides many software engineering benefits, it lacks a coherent module system and instead provides only packages (which are primarily a name space mechanism) and classloaders (which are very low-level). As a result, large Java applications suffer from unexpected interactions between independent components, require complex CLASSPATH definitions, and are often extremely complex to install and maintain. We have implemented a module system for Java called MJ that is implemented with class loaders, but provides a much higher-level interface. High-level properties can be specified in a module definition and are enforced by the module system as new modules are loaded. To experimentally validate the ability of MJ to properly handle the complex module inter-relationships found in large Java server systems, we replaced the classloader mechanisms of Apache Tomcat 4.1.18 [27] with 30 MJ modules. The modified Tomcat is functionally identical to the original, but requires no CLASSPATH definitions, and will operate correctly even if user code loads a different version of a module used by Tomcat, such as the Xerces XML parser [31]. Furthermore, by making a small change to the Java core libraries enabled by MJ, we obtained a 30\% performance improvement in a servlet microbenchmark.
David F. Bacon, Perry Cheng, V. T. Rajan
Proceedings of the ACM SIGPLAN Conference on Languages, Compilers, and Tools for Embedded Systems, pp. 81--92, ACM, 2003
Now that the use of garbage collection in languages like Java is becoming widely accepted due to the safety and software engineering benefits it provides, there is significant interest in applying garbage collection to hard real-time systems. Past approaches have generally suffered from one of two major flaws: either they were not provably real-time, or they imposed large space overheads to meet the real-time bounds.Our previous work [3] presented the Metronome, a mostly non-copying real-time collector. The Metronome achieves worst-case pause times of 6 milliseconds while maintaining consistent mutator CPU utilization rates of 50\% with only 1.5-2.1 times the maximum heap space required by the application, which is comparable with space requirements for stop-the-world collectors.However, that algorithm assumed a constant collection rate, ignored program-dependent characteristics, and lacked a precise specification for when to trigger collection or how much defragmentation to perform. This paper refines the model by taking into account program properties such as pointer density, average object size, and locality of object size. This allows us to bound both the time for collection and consequently the space overhead required much more tightly. We show experimentally that most parameters usually are not subject to large variation, indicating that a small number of parameters will be sufficient to predict the time and space requirements accurately.Our previous work also did not present the details of our approach to avoiding and undoing fragmentation. In this paper we present a more detailed analysis of fragmentation than in previous work, and show how our collector is able to bound fragmentation to acceptable limits.
2002
David F. Bacon, Stephen J. Fink, David Grove
Proceedings of the Sixteenth European Conference on Object-Oriented Programming, pp. 111--132, Springer-Verlag, 2002
While many object-oriented languages impose space overhead of only one word per object to support features like virtual method dispatch, Java's richer functionality has led to implementations that require two or three header words per object. This space overhead increases memory usage and attendant garbage collection costs, reduces cache locality, and constrains programmers who might naturally solve a problem by using large numbers of small objects.In this paper, we show that with careful engineering, a high-performance virtual machine can instantiate most Java objects with only a single-word object header. The single header word provides fast access to the virtual method table, allowing for quick method invocation. The implementation represents other per-object data (lock state, hash code, and garbage collection flags) using heuristic compression techniques. The heuristic retains two-word headers, containing thin lock state, only for objects that have synchronized methods.We describe the implementation of various object models in the IBM Jikes Research Virtual Machine, by introducing a pluggable object model abstraction into the virtual machine implementation. We compare an object model with a two-word header with three different object models with single-word headers. Experimental results show that the object header compression techniques give a mean space savings of 7\%, with savings of up to 21\%. Compared to the two-word headers, the compressed space-encodings result in application speedups ranging from -1.5\% to +2.2\%. Performance on synthetic micro-benchmarks ranges from +23\% due to benefits from reduced object size, to -12\% on a stress test of virtual method invocation.
2001
Clement R. Attanasio, David F. Bacon, Anthony Cocchi, Stephen Smith
Proceedings of the 14th International Conference on Languages and Compilers for Parallel Computing, pp. 177--192, Springer-Verlag, 2001
While uniprocessor garbage collection is relatively well understood, experience with collectors for large multiprocessor servers is limited and it is unknown which techniques best scale with large memories and large numbers of processors. In order to explore these issues we designed a modular garbage collection framework in the IBM Jalape\~no Java virtual machine and implemented five different parallel garbage collectors: non-generational and generational versions of mark-and-sweep and semi-space copying collectors, as well as a hybrid of the two. We describe the optimizations necessary to achieve good performance across all of the collectors, including load balancing, fast synchronization, and inter-processor sharing of free lists. We then quantitatively compare the different collectors to find their asymptotic performance both with respect to how fast they can run applications as well as how little memory they can run them in. All of our collectors scale linearly up to sixteen processors. The least memory is usually required by the hybrid mark-sweep collector that uses a copying collector for its nursery, although sometimes the non-generational mark-sweep collector requires less memory. The fastest execution is more application-dependent. Our only application with a large working set performed best using the mark-sweep collector; with one exception, the rest of the applications ran fastest with one of the generational collectors.
David F. Bacon, V. T. Rajan
Proceedings of the Fifteenth European Conference on Object-Oriented Programming, pp. 207--235, Springer-Verlag, 2001
Automatic storage reclamation via reference counting has important advantages, but has always suffered from a major weakness due to its inability to reclaim cyclic data structures.We describe a novel cycle collection algorithm that is both concurrent -- it is capable of collecting garbage even in the presence of simultaneous mutation -- and localized--it never needs to perform a global search of the entire data space. We describe our algorithm in detail and present a proof of correctness.We have implemented our algorithm in the Jalapeño Java virtual machine as part of the Recycler, a concurrent multiprocessor reference counting garbage collector that achieves maximum mutator pause times of only 6 milliseconds. We present measurements of the behavior of the cycle collection algorithm over a set of eight benchmarks that demonstrate the effectiveness of the algorithm at finding garbage cycles, handling concurrent mutation, and eliminating global tracing.
David F. Bacon, Clement R. Attanasio, Han B. Lee, V. T. Rajan, Stephen Smith
Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 92--103, ACM, 2001
The deployment of Java as a concurrent programming language has created a critical need for high-performance, concurrent, and incremental multiprocessor garbage collection. We present the Recycler, a fully concurrent pure reference counting garbage collector that we have implemented in the Jalapeño Java virtual machine running on shared memory multiprocessors.
While a variety of multiprocessor collectors have been proposed and some have been implemented, experimental data is limited and there is little quantitative basis for comparison between different algorithms. We present measurements of the Recycler and compare it against a non-concurrent but parallel load-balancing mark-and-sweep collector (that we also implemented in Jalapeño), and evaluate the classical tradeoff between response time and throughput.
When processor or memory resources are limited, the Recycler runs at about 90\% of the speed of the mark-and-sweep collector. However, with an extra processor to run collection and with a moderate amount of memory headroom, the Recycler is able to operate without ever blocking the mutators and achieves a maximum measured mutator delay of only 2.6 milliseconds for our benchmarks. End-to-end execution time is usually within 5\%.
David F. Bacon
Proceedings of the Joint ACM Java Grande/ISCOPE Conference, pp. 68--77, ACM, 2001
Object-oriented programming languages have always distinguished between ``primitive'' and ``user-defined'' data types, and in the case of languages like C++ and Java, the primitives are not even treated as objects, further fragmenting the programming model. The distinction is especially problematic when a particular programming community requires primitive-level support for a new data type, as for complex, intervals, fixed-pointed numbers, and so on.
We present Kava, a design for a backward-compatible version of Java that solves the problem of programmable lightweight objects in a much more aggressive and uniform manner than previous proposals. In Kava, there are no primitive types; instead, object-oriented programming is provided down to the level of single bits, and types such as int can be explicitly programmed within the language. While the language maintains a uniform object reference semantics, efficiency is obtained by making heavy use of unboxing and semantic expansion.
We describe Kava as a dialect of the Java language, show how it can be used to define various primitive types, describe how it can be translated into Java, and compare it to other approaches to lightweight objects.
2000
David F. Bacon, Robert E. Strom, Ashis Tarafdar
Proceedings of the 15th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, pp. 382--400, ACM, 2000
We introduce Guava, a dialect of Java whose rules statically guarantee that parallel threads access shared data only through synchronized methods. Our dialect distinguishes three categories of classes: (1) monitors, which may be referenced from multiple threads, but whose methods are accessed serially; (2) values, which cannot be referenced and therefore are never shared; and (3) objects, which can have multiple references but only from within one thread, and therefore do not need to be synchronized. Guava circumvents the problems associated with today's Java memory model, which must define behavior when concurrent threads access shared memory without synchronization.We present an overview of the syntax and the semantic rules of Guava. We discuss how implementations of Guava can exploit these rules to re-enable compiler optimizations inhibited by standard Java. We discuss how compilers for certain multiprocessor architectures can automatically generate certain programming idioms, such as double-check reads, as optimizations of serialized monitors.
1998
David F. Bacon, Ravi Konuru, Chet Murthy, Mauricio J. Serrano
Proceedings of the 1998 ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 258--268, ACM
Language-supported synchronization is a source of serious performance problems in many Java programs. Even single-threaded applications may spend up to half their time performing useless synchronization due to the thread-safe nature of the Java libraries. We solve this performance problem with a new algorithm that allows lock and unlock operations to be performed with only a few machine instructions in the most common cases. Our locks only require a partial word per object, and were implemented without increasing object size. We present measurements from our implementation in the JDK 1.1.2 for AIX, demonstrating speedups of up to a factor of 5 in micro-benchmarks and up to a factor of 1.7 in real programs.
1997
David F. Bacon, Ravi Konuru, Chet Murthy
Unpublished, 1997
This document describes the implementation derived from that proposal that we have built on top of IBM's version of the JDK 1.1.2, which contains the TRL garbage collector and JIT compiler. The TRL system also includes monitor optimizations which are themselves a substantial improvement over the original Sun implementation.
The Watson monitor implementation offers significant improvements over the TRL monitor implementation. In our micro-benchmarks consisting of monitor-intensive loops, speedups up to a factor of six have been obtained. Scalability is much improved, with performance being linear in the number of locks.
The speedups are achieved without any sacrifices in space, due to our (patent pending) method for compressing hash codes to two bits in all but a few instances.
David Francis Bacon
Ph.D. Thesis, University of California, Berkeley, 1997
In this dissertation, we show how a relatively simple and extremely fast
interprocedural optimization algorithm can be used to optimize many of the
expensive features of statically typed, object-oriented languages --- in
particular, C++ and Java.
We present a new program analysis algorithm, Rapid Type Analysis, and show
that it is fast both in theory and in practice, and significantly
out-performs other ``fast'' algorithms for virtual function call
resolution.
We present optimization algorithms for the resolution of virtual function
calls, conversion of virtual inheritance to direct inheritance, elimination
of dynamic casts and dynamic type checks, and removal of object
synchronization. These algorithms are all presented within a common
framework that allows them to be driven by the information collected by
Rapid Type Analysis, or by some other type analysis algorithm.
Collectively, the optimizations in this dissertation free the programmer
from having to sacrifice modularity and extensibility for performance.
Instead, the programmer can freely make use of the most powerful features
of object-oriented programming, since the optimizer will remove unnecessary
extensibility from the program.
Report No. UCB/CSD-98-1017
David F. Bacon
Technical Report, 1997
Language-supported synchronization is a source of serious performance problems in Java programs. Even for single threaded programs the overhead of synchronization in compiled Java can be as high as 45\%. I address this problem with a new language-level locking algorithm suitable for both uniprocessor and multiprocessor environments. On a Pentium uniprocessor, in the most common case the lock-and-unlock overhead for a Java syncrhonized method is a mere 6 machine cycles when a synchronous thread scheduler is used or 15 machine cycles when an asynchronous thread scheduler is used.
1996
David F. Bacon, Peter F. Sweeney
Proceedings of the Eleventh ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, pp. 324--341, ACM, 1996
Virtual functions make code easier for programmers to reuse but also make it harder for compilers to analyze. We investigate the ability of three static analysis algorithms to improve C++ programs by resolving virtual function calls, thereby reducing compiled code size and reducing program complexity so as to improve both human and automated program understanding and analysis. In measurements of seven programs of significant size (5000 to 20000 lines of code each) we found that on average the most precise of the three algorithms resolved 71\% of the virtual function calls and reduced compiled code size by 25\%. This algorithm is very fast: it analyzes 3300 source lines per second on an 80 MHz PowerPC 601. Because of its accuracy and speed, this algorithm is an excellent candidate for inclusion in production C++ compilers.
1995
David Francis Bacon
Masters Thesis, 1995
The hype accompanying the headlong rush towards massively parallel
supercomputing has been accompanied by numerous ``real world'' studies
showing how real applications can be dramatically sped up on machines with
thousands of processors achieving hundreds of gigaflops. We worked for a
year with a group of astrophysicists to vectorize and parallelize their
X-ray pulsar simulation with the understanding that in return for their
investment of time we would give them an improved version of the code that
they could maintain and develop independently. We found that current
optimizing compiler technology is greatly limited in its effectiveness by
the narrow scope across which optimizations can usually be performed. We
also found that when an application is ``scaled up'' to fit on a massively
parallel machine, the axis along which it is scaled may not be appropriate
to the requirements of the users. This leads us to an interesting
generalization of Amdahl's Law, and a better understanding of application
scalability.
1994
David F. Bacon, Jyh-Herng Chow, Dz-Ching R. Ju, Kalyan Muthukumar, Vivek Sarkar
Proceedings of the 1994 Conference of the Centre for Advanced Studies on Collaborative Research, IBM Press
It has been observed that memory access performance can be improved by restructuring data declarations, using simple transformations such as array dimension padding and inter-array padding (array alignment) to reduce the number of misses in the cache and TLB (translation lookaside buffer). These transformations can be applied to both static and dynamic array variables. In this paper, we provide a padding algorithm for selecting appropriate padding amounts, which takes into account various cache and TLB effects collectively within a single framework. In addition to reducing the number of misses, we identify the importance of reducing the impact of cache miss jamming by spreading cache misses more uniformly across loop iterations.We translate undesirable cache and TLB behaviors into a set of constraints on padding amounts and propose a heuristic algorithm of polynomial time complexity to find the padding amounts to satisfy these constraints. The goal of the padding algorithm is to select padding amounts so that there are no set conflicts and no offset conflicts in the cache and TLB, for a given loop. In practice, this algorithm can efficiently find small padding amounts to satisfy these constraints.
David F. Bacon, Susan L. Graham, Oliver J. Sharp
Computing Surveys 26(4), 345--420, ACM, 1994
In the last three decades a large number of compiler transformations for optimizing programs have been implemented. Most optimizations for uniprocessors reduce the number of instructions executed by the program using transformations based on the analysis of scalar quantities and data-flow techniques. In contrast, optimizations for high-performance superscalar, vector, and parallel processors maximize parallelism and memory locality with transformations that rely on tracking the properties of arrays using loop dependence analysis.This survey is a comprehensive overview of the important high-level program restructuring techniques for imperative languages, such as C and Fortran. Transformations for both sequential and various types of parallel architectures are covered in depth. We describe the purpose of each transformation, explain how to determine if it is legal, and give an example of its application.Programmers wishing to enhance the performance of their code can use this survey to improve their understanding of the optimizations that compilers can perform, or as a reference for techniques to be applied manually. Students can obtain an overview of optimizing compiler technology. Compiler writers can use this survey as a reference for most of the important optimizations developed to date, and as bibliographic reference for the details of each optimization. Readers are expected to be familiar with modern computer architecture and basic program compilation techniques.
1993
David F. Bacon, Susan L. Graham, Oliver J. Sharp
Technical Report UCB/CSD-93-781, 1993
In the last three decades a large number of compiler transformations for optimizing programs have been implemented. Most optimizations for uniprocessors reduce the number of instructions executed by the program, and analyze the properties of scalar quantities using flow analysis techniques. In contrast, optimizations for high-performance vector and parallel processors maximize parallelism and memory locality, mostly by tracking the properties of arrays using loop dependence analysis. In this survey we give an overview of the important high-level program restructuring techniques for imperative languages such as C and Fortran, and to describe how and when they should be applied on high-performance uniprocessors and on vector and multiprocessor machines. The basic issues involved in optimization are discussed, and the compiler analysis required for the transformations is described in some detail. A basic familiarity with modern computer architecture and program compilation is assumed.
1992
Robert E. Strom, David F. Bacon, Arthur P. Goldberg, Andy Lowry, Bill Silvermann, Daniel Yellin, Jim Russell, Shaula Yemini
Technical Report, 1992
1991
J. S. Auerbach, D. F. Bacon, A. P. Goldberg, G. S. Goldszmidt, M. T. Kennedy, A. R. Lowry, J. R. Russell, W. Silverman, R. E. Strom, D. M. Yellin, S. A. Yemini
Proceedings of the 1991 Conference of the Centre for Advanced Studies on Collaborative Research, pp. 173--196, IBM Press
This paper presents a strategy for simplifying the programming of heterogeneous distributed multiapplications. A multiapplication is a distributed system or a collection of autonomous applications which occasionally interact, such as a bank and its customers. Our approach is based on designing a machine-and language-independent process model, and integrating the abstract primitives for process creation, connection, and communication into programming languages. Our goal is to make it easy for non-experts to write multiapplications and multiapplication components, and to produce software which is portable across different machine and operating system environments. The complexity of managing address spaces, name spaces, buffers, communications connections, and recovery in such environments is hidden inside the implementation of a small number of high-level language constructs.We discuss our process model, and two language efforts based on this model: Hermes, a secure, representation-independent language designed explicitly around the process model, and Concert-C, a minimal set of extensions to C to support the process model while allowing reuse of existing C code. We discuss the status of the Hermes and Concert-C prototype implementations.
Hermes: A Language for Distributed Computing
Robert E. Strom, David F. Bacon, Arthur Goldberg, Andy Lowry, Daniel Yellin, Shaula Alexander Yemini
Prentice-Hall, 1991
David F. Bacon
Proceedings of the Tenth Symposium on Reliable Distributed Systems, pp. 21--30, IEEE Computer Society Press, 1991
File system operation in a transparently fault-tolerant system that uses checkpointing and message logging is discussed. Logging messages to disk is one of the primary performance costs of such systems. The author has measured the file system operations performed on large timesharing systems running Unix in terms of the level of concurrency (number of consecutive operations that do not change the state of the file system). By performing much of the data analysis online within a modified Unix kernel, statistics were collected over a long period of time with a substantial variation in system load. Using this data, it is demonstrated that a technique called null logging can reduce the number of messages logged to disk by a factor of 10 to 25, depending on the workload. This reduces the overhead of the fault-tolerance mechanism and allows a large fraction of file system operations to commit instantaneously.
David F. Bacon, Robert E. Strom
Proceedings of the Third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 155--166, ACM, 1991
We present a transparent program transformation which converts
a sequential execution of S1;S2 by a process in a multiprocess environment
into an optimistic parallel execution of S1 and S2.
Such a transformation is valuable in the case where S1 and S2
cannot be parallelized by static analysis either because S2
reads a value from S1 or because S1 and S2 each interact
with an external process. The optimistic transformation works
under a weaker set of conditions: (1) if the value S2 reads from
S1 can usually, but not always, be correctly
guessed ahead of time, and (2) if S1 and S2 interact with
an external process, conflicts which violate the ordering of
S1 and S2 are possible but rare. Practical applications
of this approach include executing the likely outcome of
a test in parallel with making the test, and converting
sequences of calls into streams of asynchronous sends.
We analyze the problem using the framework of guarded computations,
in which each computation is tagged with the set of guesses on
which it depends. We present an algorithm for managing communications,
thread creation, committing, and aborting in the transformed program.
We contrast our approach with related work.
David F. Bacon, Seth Copen Goldstein
Proceedings of the 1991 ACM/ONR Workshop on Parallel and Distributed Debugging, pp. 194--206, ACM
Shared-memory parallel programs can be highly non-deterministic due to the unpredictable order in which shared references are satisfied. However, deterministic execution is extremely important for debugging and can also be used for fault-tolerance and other replay-based algorithms. We present a hardware/software design that allows the order of memory references in a parallel program to be logged efficiently by recording a subset of the cache traffic between memory and the CPUs. This log can then be used along with hardware and software control to replay execution.
Simulation of several parallel programs shows that our device records no more than 1.17 MB/second for an application exhibiting fine-grained sharing behavior on a 16-way multi-processor consisting of 12 MIP CPUs. In addition, no probe effect or performance degradation is introduced. This represents several orders of magnitude improvement in both performance and log size over purely software-based methods proposed previously.
1990
Robert E. Strom, David F. Bacon, Arthur Goldberg, Andy Lowry, Daniel Yellin, Shaula Alexander Yemini
Technical Report, 1990
Arthur P. Goldberg, David F. Bacon, Ajei Gopal, Kong Li, Robert E. Strom
Proceedings of the USENIX Mach Workshop, pp. 169-183, 1990
We have built a software layer on top of Mach 2.5 that recovers multi-task Mach applications from fail-stop failures. The layer implements Optimistic Recovery (OR), a mechanism for transparent recovery from failing tasks and processors, based on asynchronous checkpointing and logging of inter-task messages. OR recovers from failure by restoring a checkpoint and replaying logged messages.
The current prototype supports message communication via sends and receives, simple port operations, and task interactions through the environment manager.
This paper discusses the issues Mach raised for this implementation, the structure of the OR layer, the design of future enhancements, and comparisons with other recovery techniques.
Also available as IBM Research Report RC 16242
David F. Bacon, Andy Lowry
Proceedings of the Summer USENIX Technical Conference, pp. 39--49, Usenix Association, 1990
We present our implementation of a portable run-time system for Hermes, a very high level language containing integrated constructs similar to those provided at the system and library level in systems like Mach and SQL.
The Hermes features we will focus upon in this paper are lightweight concurrent processes, the use of ports for typed communication between processes, a typed object store and its implementation via the Unix filesystem, and relational data structures for high-level associative queries.
Alexander Dupuy, Jed Schwartz, Yechiam Yemini, David F. Bacon
Communications of the ACM 33(10), 63--74, ACM, 1990
David F. Bacon
Proceedings of the Fourth ACM SIGOPS European Workshop, pp. 1--4, ACM, 1990
We are investigating transparent optimistic solutions to problems in distributed systems such as recovery [6], replication [3], parallelization [2], and concurrent competing alternatives [4]. By a transparentsolution to such a problem we mean that a program is transformed automatically, and that the behavior of the program is equivalent to a possible behavior of the untransformed program; in addition, the programmer and the end-user need not be aware of the transformation.
Transparent solutions to such problems are relatively straightforward if synchronization is relied upon, but performance of such methods is generally poor or the implementation is too expensive, and they do not scale. Our approach is to use optimistic methods in which we guess that synchronization is unnecessary, and verify this asynchronously while the program continues execution. We track inter-process dependencies and log non-deterministic
events so that we can roll back a computation that depends upon an incorrect guess. Where virtual memory virtualizes the space of a process, we virtualize time, ``paging in'' a previous process state when a ``time fault'' (or incorrect guess) occurs.
1989
A. Dupuy, J. Schwartz, Y. Yemini, D. F. Bacon
Proceedings of the 21st Winter Simulation Conference, pp. 1058--1064, ACM, 1989
This paper describes Nest, a graphical environment for distributed networked systems simulation and rapid-prototyping. Nest users can develop and test distributed systems and protocols (from crude models to actual system code) within simulated network scenarios. Nest represents an environment-based approach to simulation. Users view Nest as an extension of their standard Unix environment. Nest offers the generality of language-based simulation techniques and the efficiencies of model-based techniques. Users interact with Nest through standardized graphical interfaces. Nest permits the users to modify and reconfigure the simulation during execution. Thus, it is possible to study the dynamic response of a distributed system to failures or burst-loads. Nest is organized as a simulation server, responsible for execution of complex simulation scenarios, and client monitors responsible for simulation control. The client/server model permits distribution of Nest over a network environment. This permits migration of simulations to powerful remote computational servers as well as development of a shared multi-site simulation/integration testbed. Nest is portable and extensible. It has been ported to virtually all Unix variants and distributed since 1987 to over 150 sites worldwide. It has been used in scores of studies ranging from communication protocols, to distributed databases and operating systems as well as distributed manufacturing systems.
David F. Bacon, Robert E. Strom
Technical Report RC 14518, 1989
Hermes is a very high level language for distributed systems which provides many of the functions typically provided by an operating system. We describe the language constructs of Hermes that provide these functions and demonstrate how they simplify the construction of systems and improve performance. In particular, we present what we believe to be the cheapest possible implementation of capabilties. This is possible because Hermes is a secure language, so security is enforced at compile-time rather than at run-time.
Security also makes it feasible to run multi-user systems within a single address space, allowing us to perform a context switch in as little as eight instructions. This in turn allows us to eliminate traditional distinctions between procedures, lightweight processes, and heavyweight processes. Hermes subsumes all three mechanisms with a uniform process model that has the concurrency and protection of heavyweight processes but the performance of procedures (when used on a single machine) and which hides distribution by automatically marshaling parameters and sending messages (when used across machines).
1988
David F. Bacon, Alexander Dupuy, Jed Schwartz, Yechiam Yemini
Proceedings of the Winter USENIX Technical Conference, pp. 71--77, Usenix Association, Berkeley, California, 1988
This paper describes Nest, a testbed which provides a simulated network environment for developing and analyzing distributed systems and algorithms. Nest has a number of interesting features, including a transparent implementation of lightweight processes under UNIX, and a distributed monitoring facility with a graphical user interface.
Robert E. Strom, Shaula Alexander Yemini, David F. Bacon
Proceedings of the Twenty-First Annual Hawaii International Conference on Systems Sciences, pp. 215--221, IEEE Computer Society Press, 1988
A design is presented for the storage component of a self-recovering distributed operating system. This component consists of an object manager, which maintains objects on main memory and on the disk, and a recovery layer, which incorporates a collection of highly optimized algorithms based on optimistic recovery. With optimistic recovery it is possible for a machine (or collection of machines) to present a fault-free interface to programs running on it (or them), making all data appear to be persistent. The optimizations presented make it possible to do this at a cost no higher than that of transaction systems.
Robert E. Strom, David F. Bacon, Shaula Alexander Yemini
Digest of Papers of The Eighteenth Annual International Symposium on Fault-Tolerant Computing, pp. 44--49, IEEE Computer Society Press, 1988
The authors introduce two enhancements to optimistic recovery which allow messages to be logged without performing any I/O to stable storage. The first permits messages to be instantaneously logged in volatile storage, as in the sender-based message logging technique of D.B. Johnson and W. Zwaenepoel (1987), but without their restriction of single-fault-tolerance. The second permits message data and/or message arrival orders not to be logged in circumstances where this information can be reconstructed in other ways. They show that the combination of these two optimizations yields a transparent n-fault-tolerant system which logs to stable storage only those messages received from the outside world and a very small number of additional messages.
1987
Robert E. Strom, Shaula Alexander Yemini, David F. Bacon
IBM Programming Language Technology ITL, pp. 187--195, IBM Coporation, 1987
This paper presents a proposal to incorporate a structured datatype corresponding to programs into the type system of strongly typed high level langauges. Our proposal achieves the flexibility of LISP-like program manipulation facilities (that is, the ability to dynamically construct programs and immediately invoke them) while preserving the advantages of static type checking for program reliability, early error detection, and optimization.
We present the details of our proposal, and show how it can serve as a unifying framework for introducing such diverse program generation and program reuse techniques as polymorphism, generics, and inheritance into any typed language.
Improving Distributed Protocols by Decoupling Recovery from Concurrency Control
Shaula Alexander Yemini, Robert E. Strom, David F. Bacon
Technical Report RC 13326, 1987
Robert E. Strom, Shaula Alexander Yemini, David F. Bacon
The International Conference on Parallel Processing and Applications, pp. 475--483, North-Holland, 1987
We present a layered design for a highly available distributed operating system which automatically recovers to a consistent system-wide state following a crash. Our design is based on Optimistic Recovery [SY85], and introduces a set of powerful optimizations for its implementation.
Our system simplifies the programmer's and the end user's view of the machine by creating a virtual machine in which all storage is persistent. The end user no longer needs to issue ``save''s to make data recoverable in the event of a system crash. The application programmer need no longer be concerned with the difference between persistent storage and volatile storage.