Performance Modeling and Analysis - History
Performance modeling and analysis has been, and continues to be, of great practical and theoretical importance in research labs for the design, development, and optimization of computer and communication systems and applications. This work spans a broad spectrum of research activities: more empirical methods (ranging from experimental tweaking of simple existing models to building and experimenting with prototype implementations), simulation, and more sophisticated mathematical methods. IBM Research has a long and rich history in this area of research. A small subset of examples includes: the Time-Sharing Computer Model, Token Ring Local Area Networks, Product Form Networks and the RESQ Package, Computer Scheduling, the S/390 Sysplex, Traffic Management, and Parallel Scheduling. For more information, refer to S.S. Lavenberg and M.S. Squillante, "Performance Evaluation in Industry: A Personal Perspective," in "Performance Evaluation -- Origins and Directions", G. Haring, C. Lindemann, M. Reiser (eds.), Springer-Verlag, 1999, and the references cited therein.
A Simple Queueing Network Model of a Time Sharing System
One of the earliest documented successful applications of analytical computer performance modeling in industry is due to Lassettre and Scherr (1972) at IBM in the late 1960s, during the development of IBM's OS/360 Time Sharing Option (TSO). The machine repairman model, originally developed in the operations research literature to model a single repairman servicing machines that break down, was used to represent OS/360 TSO running with a single memory partition and supporting multiple terminal-attached users by time sharing the execution of their programs. With a single memory partition, only one user's program could be resident in memory, and the system was time shared by swapping programs into and out of memory. The machine repairman model has a very simple analytical solution, which was used to estimate the number of users the system could support without exceeding a specified average response time. The model was used in conjunction with measurements. Average program execution time was estimated from measurements on a TSO system, with a second computer system simulating the users by running benchmark scripts and generating exponentially distributed user think times. The measured average response time was compared with the mean response time computed from the model under the assumption of exponentially distributed execution times with mean equal to the measured average. It is interesting to note that a substantial difference between the measured and predicted response times was usually due to bugs in the operating system. Once the bugs were removed, the measured and predicted results tracked closely. In approximately 75% of the test cases, the prediction error was less than 10%, with a maximum error of 24%. This accuracy was surprising given the simplicity of the model's assumptions.
In particular, program execution times were assumed to be exponentially distributed, although measurements showed they were not, and programs were assumed to be executed first-come-first-served, rather than via time sharing. Unknown at the time, this queueing model is an example of a product form queueing network. Product form queueing network results of the 1970s showed the invariance of performance measures, including mean response time, when a first-come-first-served queue with exponential service times is replaced by a processor sharing queue with general service times. A processor sharing queue with general service times would have been a more realistic model of the TSO system, but the model's predictions would have been the same, thus helping to explain the model's accuracy.
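As a concrete illustration, the machine repairman model's "very simple analytical solution" can be sketched in a few lines using its standard birth-death (M/M/1//N) form; the parameter values below are hypothetical, not taken from the original TSO study.

```python
def repairman_response_time(n_users, mean_service, mean_think):
    """Mean response time in the machine repairman (M/M/1//N) model.

    n_users:      number of terminal users
    mean_service: mean program execution (service) time
    mean_think:   mean user think time
    """
    rho = mean_service / mean_think
    # Birth-death solution: p_n is proportional to N!/(N-n)! * rho^n,
    # where n is the number of users whose programs are queued or running.
    weights = []
    w = 1.0
    for n in range(n_users + 1):
        weights.append(w)
        w *= (n_users - n) * rho
    p0 = weights[0] / sum(weights)
    throughput = (1.0 - p0) / mean_service  # utilization law
    # Little's law over the full think/wait/execute cycle: N = X * (R + Z).
    return n_users / throughput - mean_think

# Hypothetical sizing question of the kind the TSO study answered: how many
# users keep mean response time under 2 s, given a 0.5 s mean execution
# time and a 10 s mean think time?
n = 1
while repairman_response_time(n + 1, 0.5, 10.0) <= 2.0:
    n += 1
```

With a single user there is no queueing, so the mean response time equals the mean execution time; as the population grows past saturation, response time grows roughly linearly in the number of users.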
The Performance of Token Ring Local Area Networks
One of the most influential analytical performance modeling studies in IBM was done by Bux (1981). This study compared the analytically derived delay-throughput characteristics of local area networks based on ring and bus topologies. Included were the token ring, a ring network with access controlled by a single circulating token that was being built at IBM's Research Lab in Zurich, and the CSMA-CD (carrier sense multiple access with collision detection) bus that was the basis of Ethernet, originally developed by Xerox in the 1970s. The study primarily used existing analytical queueing results, modifying them as required to capture the essential characteristics of the networks being modeled. For example, the key to analyzing token ring performance was recognizing that a token ring functioned like a single server that served multiple queues by round robin polling. An elegant discrete time queueing analysis of such a polling system had appeared in Konheim and Meister (1974). (The discrete time results were converted to continuous time by letting the discrete time interval approach zero.) The study showed that the delay-throughput characteristics of the token ring and CSMA-CD bus were comparable at low transmission speeds, e.g. 1 Mb/sec, but the token ring was superior at higher speeds, e.g. 10 Mb/sec. (The key parameter affecting the relative performance of the token ring and CSMA-CD bus is the ratio of propagation delay to packet transmission time, with higher ratios favoring the token ring.) While many factors influenced IBM's decision to develop a token ring local area network product, the performance of the token ring as demonstrated in this study was a key factor.
Product Form Networks and The Research Queueing (RESQ) Package
The discovery of product form queueing networks and their properties, together with the development of efficient computational algorithms for solving them, was a breakthrough in analytical performance modeling. Within the computer science literature, the classic paper that first defined product form networks and gave their properties is Baskett, Chandy, Muntz and Palacios-Gomez (1975). Researchers at IBM developed the main computational algorithms for solving product form networks: first the convolution algorithm, Reiser and Kobayashi (1975), and later the Mean Value Analysis (MVA) algorithm, Reiser and Lavenberg (1980). They incorporated these algorithms into performance modeling software packages to make them available to performance modeling practitioners. The first such package was QNET4, software for specifying and solving closed multichain product form queueing networks. QNET4 provided a textual interface for the user to specify the queues and routing chains and their associated parameter values. Performance measures were computed using the convolution algorithm. Shortly after QNET4 was developed, it was integrated into a more general performance modeling package, the Research Queueing (RESQ) package.
RESQ allowed a user to specify and solve product form networks (initially using the convolution algorithm; the MVA algorithm was added later), but it also allowed a user to specify more general "extended queueing networks" and use discrete event simulation to estimate performance measures. The then-recently developed regenerative method for estimating confidence intervals and controlling simulation run length was incorporated into RESQ. One of the key extensions was the inclusion of passive queues, which provide a convenient way to model simultaneous resource possession. QNET4's textual user interface was extended in a natural way to allow specification of extended queueing networks. The modeling level of abstraction provided by extended queueing networks and its implementation in RESQ proved very useful. It allowed the rapid development of simulation models without the programming required with a simulation programming language. It helped guard against the pitfall of developing overly detailed simulation models by forcing a higher level of abstraction. It included modern statistical simulation techniques and made them easy to use. It also helped bridge the gap between analytical modeling and simulation modeling by incorporating both product form networks and extended queueing networks. RESQ was a major success in IBM. Developed by researchers, it began to be widely used in product development groups in IBM in the late 1970s to model computer systems and subsystems as well as local and wide area computer networks. It was enhanced over time with additional computational algorithms and statistical methods, a graphical user interface, simulation animation and other features, and its widespread use in IBM continued into the 1990s. It was also made available for use in research and teaching at universities.
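The MVA algorithm itself is remarkably compact. The following is a sketch of exact single-class MVA for a closed product-form network; RESQ solved the more general multichain case, but the single-class recursion below illustrates the idea, and the service demands used are purely illustrative.

```python
def mva(demands, n_jobs, think_time=0.0):
    """Exact single-class Mean Value Analysis for a closed product-form
    network of queueing centers.

    demands:    per-center service demands D_k (visit ratio x service time)
    n_jobs:     closed-network population
    think_time: delay (infinite-server) time between cycles
    Returns (system throughput, per-center mean response times).
    """
    q = [0.0] * len(demands)  # mean queue lengths at population 0
    x = 0.0
    r = list(demands)
    for n in range(1, n_jobs + 1):
        # Arrival theorem: an arriving job sees the queue lengths of the
        # same network with one fewer job.
        r = [d * (1.0 + qk) for d, qk in zip(demands, q)]
        x = n / (think_time + sum(r))  # Little's law over the network
        q = [x * rk for rk in r]       # Little's law at each center
    return x, r
```

For example, with hypothetical demands of 0.3 and 0.2 at two centers, a single job sees no queueing (throughput 1/0.5 = 2), while at a population of 20 the throughput approaches the bottleneck bound 1/0.3.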
Computer System Scheduling
The performance modeling and related stochastic literature over the past five decades is rich with studies of scheduling optimization problems. This includes optimal scheduling results for minimizing a weighted sum of the per-class mean response times, as well as for achieving a given vector of per-class mean response times, in a single queue or a queueing network, with or without side constraints (i.e., a per-class performance constraint that must be satisfied in addition to the global objective function). These results have essentially established that, in many cases, the space of achievable performance measures is a polymatroid, or extended polymatroid, whose vertices correspond to the performance of the system under all possible fixed priority rules. Furthermore, the optimal or desired performance vector is a vertex or an interior point of this performance polytope, and the scheduling strategies that satisfy these classes of objective functions are some form or mixture of fixed priority policies (dynamic priority policies are considered below).
As computer technology advanced and the complexity of computer systems and applications continued to grow, new customer and/or user requirements arose that were not fully addressed by previous classes of scheduling objective functions. For this reason, research studies at IBM investigated specific scheduling optimization problems to consider the needs of certain IBM computer platforms. Two such studies became the basis for the processor scheduling algorithms in the Application System/400 (AS/400) and System/390 (S/390) computer systems.
Basis for the AS/400 Processor Scheduler:
Much of the previous scheduling research considered scheduling strategies for minimizing a weighted sum of the per-class mean response times. An important additional objective is to maintain a low variance of response times for each class. Two related research studies at IBM that investigated this problem resulted in the concepts of Delay Cost Scheduling, due to Franaszek and Nelson (1995), and Time-Function Scheduling, due to Fong and Squillante (1995). These two studies considered different forms of objective functions that are based on minimizing a weighted sum of per-class second moment measures of response time, and they used different approaches to establish certain structural properties for the respective optimal solutions.
One scheduling strategy with the structural properties for a particular instance of the scheduling objectives considered by Fong and Squillante is based on the use of general time-based functions to obtain effective and flexible control over the allocation of resources. This scheduling strategy is in part a generalization of the linear time-dependent priority discipline due to Kleinrock (1964), in which the priority of each job increases (linearly) according to a per-class function of some measure of time, and the job with the highest instantaneous priority value in the queue is selected for execution at each scheduling epoch. In its most general setting, the time parameter for each per-class function can include any measure of the time spent waiting for a resource or set of resources (response mode), any measure of the time spent using a resource or set of resources (usage mode), and any combination of such modes, as developed by Fong, Hough and Squillante (1997). An adaptive feedback mechanism is used together with mathematical control formulas to adjust these per-class time functions, as well as to migrate each job to a different class upon the occurrence of certain events or upon the job exceeding some criteria, in order to satisfy the scheduling objective function across all and/or varying workloads. These control formulas can also be used to obtain a scheduling strategy that realizes the optimal or desired performance vector which is an (interior) point in the performance space, while providing better per-class response time variance properties. A number of important scheduling issues can be easily accommodated in the per-class time functions and/or addressed by the use of these time functions to control resource scheduling decisions; examples include priority inversion and processor-cache affinities. The theoretical properties of time-function scheduling can be further exploited to obtain very efficient implementations.
The above scheduling strategy is a fundamental aspect of the dynamic priority scheduling algorithms employed in AS/400 systems. Based on the control formulas mentioned above and a general set of assumptions regarding workloads and performance objectives, this scheduling technology has proven to be a great success with AS/400 customers, providing efficient and effective control over resource management to address various customer requirements across numerous diverse workloads, without any perceived increase in complexity for the system administrator and users.
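The core selection rule underlying such time functions can be sketched very simply. The following is a minimal illustration in the spirit of Kleinrock's linear time-dependent priority discipline cited above (time-function scheduling generalizes the linear functions and adapts them via feedback); the classes, growth rates, and jobs are hypothetical.

```python
def select_next(waiting, growth_rates, now):
    """Pick the waiting job with the highest instantaneous priority.

    waiting:      list of (job_id, class_id, arrival_time) tuples
    growth_rates: hypothetical per-class rates b_k at which priority
                  grows with waiting time
    A job's priority increases linearly with its time in queue:
    b_k * (now - arrival_time).
    """
    return max(waiting, key=lambda job: growth_rates[job[1]] * (now - job[2]))

# Hypothetical two-class example: class 1 gains priority four times faster.
jobs = [("A", 0, 0.0), ("B", 1, 5.0)]
rates = {0: 1.0, 1: 4.0}
```

At time 6, job A (priority 6) still beats the later-arriving job B (priority 4), but by time 10 job B (priority 20) has overtaken it: no job waits forever, yet the faster-growing class is favored.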
Basis for the S/390 Processor Scheduler:
Instead of minimizing a weighted sum of the per-class mean response times, it can be more natural to associate a mean response time goal with each class and to consider the performance of the class relative to its goal. The corresponding objective function then is to minimize a vector of the per-class ratios of the response time mean to the response time goal. This problem is studied within the context of a multi-class M/GI/1 queue, with and without feedback, by Bhattacharya et al. (1993,1995) where adaptive scheduling strategies are presented and shown to lexicographically minimize the vector of per-class performance ratios (exactly or approximately). The results also apply to other systems including certain multi-class Jackson networks and multi-class M/GI/c queues.
Consider a $K$-class system at a scheduling decision time epoch in which the mean response time for class $i$ realized over the previous scheduling time interval(s) is $x_i$ and the specified mean response time goal for this class is $g_i$. A fixed scheduling policy that gives priority to jobs of class $i$ over jobs of class $j$ if $x_i/g_i \geq x_j/g_j$ is then used for the next scheduling time interval. In other words, priority is given to the class that has received the worst performance, relative to its goal, over the previous time interval(s). The time intervals between scheduling decision epochs, in which priorities are updated, can be arbitrary provided that they are bounded above by a finite measure. This scheduling strategy is proven optimal in the sense that it converges asymptotically in the number of time intervals to the optimal solution which lexicographically minimizes the vector of ratios $x_i/g_i$ arranged in non-increasing order $x_1/g_1 \geq x_2/g_2 \geq \cdots \geq x_K/g_K$ (with probability 1).
The above scheduling strategy is an important aspect of the processor scheduling algorithms employed in S/390 systems, where $x_i$ and $g_i$ can be functions of performance metrics other than just mean response time -- in particular, response time percentiles and/or velocity goals can be used together with or instead of mean response time goals. This S/390 concept is commonly referred to as goal-oriented scheduling. It is a key component of the workload management system provided on S/390 computers, which has been a great success for IBM in the control and management of mainframes and mainframe clusters.
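The per-interval priority update described above is simple to state in code. A minimal sketch, restricted to mean response time goals (the class names, observed means, and goals below are hypothetical):

```python
def priority_order(observed_means, goals):
    """Fixed priority order for the next scheduling interval under the
    goal-oriented rule: classes sorted by performance ratio x_i / g_i,
    with the class farthest behind its goal (largest ratio) first.
    """
    return sorted(observed_means,
                  key=lambda c: observed_means[c] / goals[c],
                  reverse=True)

# Hypothetical classes: batch is running at 1.5x its goal while
# interactive is at 0.9x, so batch gets priority over the next interval.
observed = {"interactive": 0.9, "batch": 30.0}
goals = {"interactive": 1.0, "batch": 20.0}
```

Repeating this update at each decision epoch is what drives the asymptotic convergence to the lexicographically minimal vector of ratios described above.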
Cluster Architectures and S/390 Parallel Sysplex
As commercial workloads expand and evolve, clusters of multiple computers are required to support high transaction processing rates and high availability in large-scale commercial computing environments, which include on-line transaction processing systems and parallel database systems. The principal cluster architectures proposed to support these scalable commercial application environments are the shared-nothing (or partitioned), the shared-disk, the virtual shared-disk, and the Parallel Sysplex models. The shared-nothing architecture consists of partitioning the database and disks among the nodes, where either function shipping (i.e., a remote function call executed by the remote node, with the results, if any, returned to the local node) or I/O shipping (i.e., a remote I/O request to fetch the required data from the remote node) is used when a local transaction needs to access data located at a remote node in the cluster. Advantages of this architecture include a higher local buffer hit ratio and no need for global concurrency control, whereas the main disadvantages center on the various additional costs for remote requests to access non-local databases and on load balancing and availability problems. The shared-disk architecture essentially consists of having all nodes directly access the disks on which shared data is located, where each node has a local database buffer cache and a global concurrency control protocol is used to maintain consistency among the local caches as well as the database. This architecture has advantages with respect to load balancing and availability, but it can suffer from additional overhead to acquire and release global locks, as well as large overhead, latency and increased I/O activity for hot shared data (so-called ping-ponging).
The virtual shared-disk architecture is functionally identical to the shared-nothing architecture with I/O shipping, while providing the view of a shared-disk model to the database (i.e., the partitioned disks are transparent to the database). The Parallel Sysplex architecture consists of the shared-disk model together with a shared coupling facility that provides a shared database buffer and highly-efficient support for locking, cache coherency and general-purpose queueing.
Various aspects of each of these principal cluster architecture alternatives have been analyzed with performance models, many of which have been formulated by decomposing the original problem into more tractable parts. The solutions of these hierarchical models have often involved a combination of different methods, including analytical methods, mathematical optimization and simulation. The results of this analysis of the principal cluster architectures by researchers at IBM, such as King et al. (1997), and elsewhere influenced the design of the IBM S/390 Parallel Sysplex architecture, and were used by IBM to demonstrate the key advantages of this design over alternative cluster architectures. In particular, the Parallel Sysplex architecture provides the benefits of shared-disk architectures and exploits the coupling facility services to obtain very efficient intertransaction concurrency control, buffer cache coherency control, shared buffer management, and shared job queues. This results in transaction rate scaling that is close to linear in the number of nodes, high shared buffer hit ratios that can reduce the I/O rate per node, and excellent dynamic load balancing even in systems with heterogeneous nodes.
The Parallel Sysplex technology provides the fundamental infrastructure for IBM's large-scale enterprise server environments, and the above performance modeling studies played a key role in its design.
Network Traffic Management
The allocation and management of shared network resources among different classes of traffic streams with a wide range of performance requirements and traffic characteristics in high-speed packet-switched network architectures, such as ATM, is more complex than in traditional networks. An important aspect of this traffic management problem is to characterize the effective bandwidth requirement of both individual connections and the aggregate bandwidth usage of multiple connections statistically multiplexed on a given network link, as a function of their statistical properties and the desired level of service. These metrics can then be used for efficient bandwidth management and traffic control, in order to achieve high utilization of network resources while maintaining the desired level of service for all connections. Guerin et al. (1991) proposed a methodology for computing the effective bandwidth requirement of individual and multiplexed connections based on a combination of two complementary approaches: a fluid model to estimate the bandwidth requirement when the impact of individual connection characteristics is critical, and the stationary bit rate distribution to estimate the bandwidth requirement when the effect of statistical multiplexing is significant. While the fluid model and its extension to capture the impact of multiplexing can be used to obtain an exact computation of the bandwidth requirement, the computational complexity involved is too high in general, and particularly for real-time network traffic management. Guerin et al. therefore used the proposed methodology to develop a computationally simple approximate expression for the effective bandwidth requirement of individual and multiplexed connections, which is shown to have sufficiently good accuracy across the range of possible connection characteristics in comparison with both exact computations and simulation results.
This approximation has proven quite successful as one of the key mechanisms used for traffic management in IBM's Networking BroadBand Services (NBBS) architecture, as well as in similar offerings from other companies.
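The stationary bit-rate component of such an approximation reduces to a Gaussian estimate of the aggregate bandwidth. The sketch below assumes the commonly cited form C = m + a*sigma with a = sqrt(-2 ln(eps) - ln(2 pi)), where eps is the allowed overflow probability; this is a hedged illustration of one of the two complementary approaches (the fluid-model component is omitted), and the source rates are hypothetical.

```python
from math import log, pi, sqrt

def stationary_equivalent_capacity(mean_rates, rate_variances, eps):
    """Gaussian (stationary bit-rate) estimate of the aggregate bandwidth
    needed by statistically multiplexed connections, assuming the form
    C = m + a * sigma with a = sqrt(-2 ln(eps) - ln(2 pi)), where eps is
    the target overflow probability.  A sketch of only one of the two
    complementary approaches; the fluid-model component is omitted.
    """
    m = sum(mean_rates)                # aggregate mean bit rate
    sigma = sqrt(sum(rate_variances))  # std dev of the aggregate rate
    a = sqrt(-2.0 * log(eps) - log(2.0 * pi))
    return m + a * sigma

# Hypothetical example: ten sources, each with mean rate 1 Mb/s and rate
# variance 0.25, multiplexed with target overflow probability 1e-6.
cap = stationary_equivalent_capacity([1.0] * 10, [0.25] * 10, 1e-6)
```

Note that the required capacity exceeds the aggregate mean rate by a safety margin that shrinks as the allowed overflow probability grows, which is the statistical multiplexing gain the methodology is designed to capture.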
Parallel Scheduling
A significant body of the performance modeling research literature has focused on various aspects of the parallel computer scheduling problem -- i.e., the allocation of computing resources among the parallel jobs submitted for execution. Several classes of scheduling strategies have been proposed for such computing environments, each differing in the way the parallel resources are shared among the jobs. This includes the class of space-sharing strategies that share the processors in space by partitioning them among different parallel jobs, the class of time-sharing strategies that share the processors by rotating them among a set of jobs in time, and the class of gang-scheduling strategies that combine both space-sharing and time-sharing. The numerous performance modeling and optimization studies in this area by researchers at IBM and elsewhere have played a fundamental and important role in the design, development and implementation of different forms of space-sharing and gang-scheduling strategies in commercial parallel supercomputing systems. In fact, the IBM SP family of parallel computers supports various forms of space-sharing and gang-scheduling; support for space-sharing and/or gang-scheduling is also included in many other commercial parallel supercomputers, such as the Cray T3E, the Intel Paragon, the Meiko CS-2 and the SGI Origin.
Last updated on March 28, 2013