Dr. C. Mohan

contact information

IBM Fellow
Almaden Research Center, San Jose, CA 95120, USA
  +1-408-927-1733

Abstracts of Some of C. Mohan's Papers and Patents

Last updated on 10 February 2010

Another bibliography of almost all of Mohan's papers is maintained at the University of Trier by Michael Ley. It includes some papers that are not listed here and it excludes patents that are listed here.

With each entry in the following list, information is included about where the algorithms described in the corresponding paper/patent have been implemented. IBM's patent licensing policies are described elsewhere. For most of the papers and patents listed here, links to online versions are provided, but an ACM account might be needed to access some of the online papers. Other papers are available from C. Mohan at:

IBM Almaden Research Center
K01/B1
650 Harry Road
San Jose, CA 95120
USA
Phone: +1 408 927 1733, mohan@almaden.ibm.com
http://www.almaden.ibm.com/u/mohan/
Assistant (Lucinda Rios): +1 408 927 1830, lucinda@almaden.ibm.com
  1. [MoLi83] Mohan, C., Lindsay, B. Efficient Commit Protocols for the Tree of Processes Model of Distributed Transactions, Proc. 2nd ACM SIGACT/SIGOPS Symposium on Principles of Distributed Computing, Montreal, Canada, August 1983. Reprinted in ACM/SIGOPS Operating Systems Review, July 1985. Reprinted in Tutorial: Recent Advances in Distributed Data Base Management, C. Mohan (Ed.), IEEE Computer Society Press, 1984. Also available as IBM Research Report RJ3881, IBM Research - Almaden, June 1983. Citations ResearchIndex

    This paper describes two efficient distributed transaction commit protocols, the Presumed Abort (PA) and Presumed Commit (PC) protocols, which have been implemented in the distributed data base system R*. PA and PC are extensions of the well-known two-phase (2P) commit protocol. PA is optimized for read-only transactions and a class of multi-site update transactions, and PC is optimized for other classes of multi-site update transactions. The optimizations result in reduced inter-site message traffic and log writes, and, consequently, a better response time for such transactions. We derive the new protocols in a step-wise fashion by modifying the 2P protocol.

    PA is now part of the ISO-OSI, X/Open XA, Transaction Internet Protocol (TIP) and OMG OTS standards for distributed transaction processing. It is also part of the IBM SNA LU6.2 and DRDA standards. It has been implemented in IBM's R*, DB2, OS/400 and QuickSilver, Tandem's TMF, DEC's VAX/VMS, Transarc's Encina Product Suite, CMU's Camelot, Unix System Laboratories' TUXEDO, Microsoft's DTC, Informix and University of Wisconsin's Shore.
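
    To make the presumed-abort idea concrete, here is a minimal Python sketch (all names are invented for illustration; the actual R* protocol handles timeouts, recovery and subordinate logging in far more detail). It captures the key asymmetry: only the commit decision is force-logged and acknowledged, while an abort is logged without forcing and without collecting acknowledgements, because an inquiring subordinate that finds no information about a transaction presumes abort.

        class Subordinate:
            def __init__(self):
                self.state = {}                  # txn -> "prepared"/"committed"/"aborted"

            def prepare(self, txn):
                self.state[txn] = "prepared"     # a real subordinate forces a prepare record
                return "yes"                     # vote

            def decide(self, txn, outcome):
                self.state[txn] = outcome

            def inquire(self, txn):
                # Recovery path: no information about txn => presume abort.
                return self.state.get(txn, "aborted")

        class Coordinator:
            def __init__(self, subs):
                self.subs = subs
                self.log = []                    # (record, forced?) pairs

            def run(self, txn):
                votes = [s.prepare(txn) for s in self.subs]
                if all(v == "yes" for v in votes):
                    self.log.append((("commit", txn), True))   # forced log write
                    for s in self.subs:
                        s.decide(txn, "committed")             # acks collected
                    return "committed"
                self.log.append((("abort", txn), False))       # non-forced, no acks needed
                for s in self.subs:
                    s.decide(txn, "aborted")
                return "aborted"

        subs = [Subordinate(), Subordinate()]
        print(Coordinator(subs).run("T1"))       # committed
        print(subs[0].inquire("T99"))            # aborted (presumed: no record of T99)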

  2. [LHMWY84] Lindsay, B., Haas, L., Mohan, C., Wilms, P., Yost, R. Computation and Communication in R*: A Distributed Database Manager, ACM Transactions on Computer Systems, Vol. 2, No. 1, February 1984. Reprinted in Tutorial: Recent Advances in Distributed Data Base Management, C. Mohan (Ed.), IEEE Computer Society Press, 1984. Also available as IBM Research Report RJ3740, IBM Research - Almaden, January 1983. Citations

    This article presents and discusses the computation and communication model used by R*, a prototype distributed database management system. An R* computation consists of a tree of processes connected by virtual circuit communication paths. The process management and communication protocols used by R* enable the system to provide reliable, distributed transactions while maintaining adequate levels of performance. Of particular interest is the use of processes in R* to retain user context from one transaction to another, in order to improve the system performance and recovery characteristics.

  3. [LMHDL85] Lohman, G., Mohan, C., Haas, L., Daniels, D., Lindsay, B., Selinger, P., Wilms, P. Query Processing in R*, In Query Processing in Database Systems, W. Kim, D. Reiner, and D. Batory (Eds.), Springer-Verlag, 1985. Also available as IBM Research Report RJ4272, IBM Research - Almaden, April 1984. Citations

    This chapter describes how statements in the SQL language are processed by the R* distributed relational database management system. R* is an experimental adaptation of System R to the distributed environment. The R* prototype is currently operational on multiple machines running the MVS operating system, and is undergoing evaluation. The R* system is a confederation of autonomous, locally administered databases that may be geographically dispersed, yet which appear to the user as a single database. Naming conventions permit R* to access tables at remote sites without resorting to a centralized or replicated catalog, and without the user having to specify either the current location of a table or the communication commands required to access it. SQL data definition statements affecting remote sites are interpreted through a distributed recursive call mechanism. Tables may be moved physically to other databases without affecting existing SQL statements. SQL data manipulation statements are compiled at each site having a table referenced in the statement, coordinated by the site at which the statement originated. As part of compilation, the distributed optimization process chooses the best place and the best way to access tables and join them together. Optimization uses dynamic programming and careful pruning to minimize total estimated execution cost at all sites, which is a linear combination of CPU, I/O, and communications (both per-message and per-byte) costs.

  4. [LHMPW86] Lindsay, B., Haas, L., Mohan, C., Pirahesh, H., Wilms, P. A Snapshot Differential Refresh Algorithm, Proc. ACM SIGMOD International Conference on Management of Data, Washington, D.C., May 1986. Also available as IBM Research Report RJ4992, IBM Research - Almaden, January 1986. Citations

    This article presents an algorithm to refresh the contents of data base snapshots. A data base snapshot is a read-only table whose contents are extracted from other tables in the data base. The snapshot contents can be periodically refreshed to reflect the current state of the data base. Snapshots are useful in many applications as a cost effective substitute for replicated data in a distributed data base system.

    When the snapshot contents are a simple restriction and projection of a single base table, differential refresh techniques can reduce the message and update costs of the snapshot refresh operation. The algorithm presented annotates the base table to detect the changes which must be applied to the snapshot table during snapshot refresh. The cost of maintaining the base table annotations is minimal and the amount of data transmitted during snapshot refresh is close to optimal in most circumstances.

    This algorithm was implemented in the R* distributed data base management system.
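
    The following toy sketch (invented names; the paper's algorithm is more careful about the cost of maintaining annotations) illustrates the differential-refresh idea: each base-table row carries an annotation recording when it last changed, deletions leave tombstones, and a refresh ships only the changes made since the previous refresh.

        import itertools

        _tick = itertools.count(1)
        base = {}            # key -> (value, tick of last modification)
        tombstones = {}      # key -> tick of deletion

        def upsert(key, value):
            base[key] = (value, next(_tick))
            tombstones.pop(key, None)

        def delete(key):
            if key in base:
                del base[key]
                tombstones[key] = next(_tick)

        def refresh(snapshot, since):
            """Apply to the snapshot only the changes newer than tick 'since'."""
            for key, (value, t) in base.items():
                if t > since:
                    snapshot[key] = value
            for key, t in tombstones.items():
                if t > since:
                    snapshot.pop(key, None)
            ticks = [t for _, t in base.values()] + list(tombstones.values())
            return max(ticks, default=since)     # high-water mark for the next refresh

        snap = {}
        upsert("a", 1)
        upsert("b", 2)
        since = refresh(snap, 0)                 # full first refresh
        upsert("b", 3)
        delete("a")
        since = refresh(snap, since)             # ships only b's update and a's delete
        print(snap)                              # {'b': 3}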

  5. [LiMP86] Lindsay, B., Mohan, C., Pirahesh, H. Method for Reserving Space Needed for "Rollback" Actions, IBM Technical Disclosure Bulletin, Vol. 29, No. 6, November 1986.

    A method is disclosed for reserving space needed to perform "rollback" of actions which free space. Space freed by a transaction must not be consumed by other transactions while the freeing transaction may still need that space to undo the action which freed it.

    This method is implemented in DB2/2 and DB2/6000.
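
    A minimal sketch of the reservation idea (a hypothetical page-level model; the disclosure's actual bookkeeping differs): space freed by a delete stays earmarked for the deleting transaction, so other transactions cannot consume it until that transaction commits.

        class Page:
            def __init__(self, capacity):
                self.free = capacity          # unallocated bytes, including reserved ones
                self.reserved = {}            # txn -> bytes reserved for its rollback

            def delete_record(self, txn, size):
                self.free += size
                self.reserved[txn] = self.reserved.get(txn, 0) + size

            def allocate(self, txn, size):
                # A transaction may use unreserved space plus its own reservation.
                others = sum(v for t, v in self.reserved.items() if t != txn)
                if size > self.free - others:
                    raise RuntimeError("would consume space reserved for rollback")
                self.reserved[txn] = max(0, self.reserved.get(txn, 0) - size)
                self.free -= size

            def commit(self, txn):
                self.reserved.pop(txn, None)  # reservation released at commit

        p = Page(capacity=0)
        p.delete_record("T1", 100)            # T1 frees 100 bytes, held for its undo
        try:
            p.allocate("T2", 50)              # T2 cannot steal T1's reserved space
        except RuntimeError as e:
            print(e)
        p.allocate("T1", 100)                 # T1 itself may reuse it (e.g., during undo)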

  6. [MoLO86] Mohan, C., Lindsay, B., Obermarck, R. Transaction Management in the R* Distributed Data Base Management System, ACM Transactions on Database Systems, Vol. 11, No. 4, December 1986. Reprinted in Readings in Database Systems, Third Edition, M. Stonebraker, J. Hellerstein (Eds.), Morgan Kaufmann Publishers, 1998. Also available as IBM Research Report RJ5037, IBM Research - Almaden, February 1986. Citations

    This paper deals with the transaction management aspects of the R* distributed data base system. It concentrates primarily on the description of the R* commit protocols, Presumed Abort (PA) and Presumed Commit (PC). PA and PC are extensions of the well-known two-phase commit protocol. PA is optimized for read-only transactions and a class of multi-site update transactions, and PC is optimized for other classes of multi-site update transactions. The optimizations result in reduced inter-site message traffic and log writes, and, consequently, a better response time. The paper also discusses R*'s approach towards distributed deadlock detection and resolution.

    PA is now part of the ISO-OSI and X/Open standards for distributed transaction processing. It is also part of the IBM SNA LU6.2 and DRDA standards. It has been implemented in IBM's R*, DB2 V3 and QuickSilver, Tandem's TMF, DEC's VAX/VMS, Transarc's Encina Product Suite, CMU's Camelot and Unix System Laboratories' TUXEDO.

  7. [RoMo89] Rothermel, K., Mohan, C. ARIES/NT: A Recovery Method Based on Write-Ahead Logging for Nested Transactions, Proc. 15th International Conference on Very Large Data Bases, Amsterdam, August 1989. A longer version of this paper is available as IBM Research Report RJ6650, IBM Almaden Research Center, January 1989. This paper received the 10 Year Best Impact Paper Award at VLDB99. Abstract Citations DBLP ACM ResearchIndex

    In this paper, we present a simple and efficient recovery method for nested transactions, called ARIES/NT (Algorithm for Recovery and Isolation Exploiting Semantics for Nested Transactions), that uses write-ahead logging and supports semantically-rich modes of locking and operation logging. ARIES/NT applies to a very general model of nested transactions, which includes partial rollbacks of subtransactions, upward and downward inheritance of locks, and concurrent execution of ancestor and descendant subtransactions. The adopted system architecture also encompasses aspects of distributed data base management. ARIES/NT is an extension of the ARIES recovery and concurrency control method which was originally developed for the single-level transaction model by Mohan et al. and which has been implemented, to varying degrees, in IBM's OS/2 Extended Edition Database Manager, DB2, Workstation Data Save Facility/VM, Starburst and QuickSilver, in Transarc's Encina Product Suite, and in the University of Wisconsin's EXODUS and Gamma data base machine.

  8. [MHWC90] Mohan, C., Haderle, D., Wang, Y., Cheng, J. Single Table Access Using Multiple Indexes: Optimization, Execution, and Concurrency Control Techniques, Proc. 2nd International Conference on Extending Database Technology, Venice, March 1990. A longer version of this paper is available as IBM Research Report RJ7341, IBM Research - Almaden, March 1990; Revised May 1990. Citations DBLP ResearchIndex

    Many data base management systems' query optimizers choose at most one index for accessing the records of a table in a given query, even though many indexes may exist on the table. In spite of the fact that there are some systems which use multiple indexes, very little has been published about the concurrency control or query optimization implications (e.g., deciding how many indexes to use) of using multiple indexes. This paper addresses these issues and presents solutions to the associated problems. Techniques are presented for the efficient handling of record ID lists, elimination of some locking, and determination of how many and which indexes to use. The techniques are adaptive in the sense that the execution strategies may be modified at run-time (e.g., not use some indexes which were to have been used), if the assumptions made at optimization-time (e.g., about selectivities) turn out to be wrong. Opportunities for exploiting parallelism are also identified.

    A subset of our ideas has been implemented in DB2 V2R2.
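
    As a rough illustration of the record-ID (RID) list handling mentioned above (not DB2's actual implementation, which uses far more compact structures), ANDing sorted RID lists from several indexes intersects them for conjunctive predicates, and ORing merges them for disjunctive predicates:

        def rid_and(a, b):
            """Intersect two sorted RID lists (conjunctive predicates)."""
            out, i, j = [], 0, 0
            while i < len(a) and j < len(b):
                if a[i] == b[j]:
                    out.append(a[i]); i += 1; j += 1
                elif a[i] < b[j]:
                    i += 1
                else:
                    j += 1
            return out

        def rid_or(a, b):
            """Merge two sorted RID lists without duplicates (disjunctive predicates)."""
            out, i, j = [], 0, 0
            while i < len(a) or j < len(b):
                if j == len(b) or (i < len(a) and a[i] < b[j]):
                    v = a[i]; i += 1
                elif i == len(a) or b[j] < a[i]:
                    v = b[j]; j += 1
                else:
                    v = a[i]; i += 1; j += 1
                if not out or out[-1] != v:
                    out.append(v)
            return out

        # RIDs matched by an index on salary and an index on dept (hypothetical):
        print(rid_and([3, 8, 12, 40], [8, 12, 19]))   # [8, 12]
        print(rid_or([3, 8, 12], [8, 19]))            # [3, 8, 12, 19]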

  9. [MoNP90] Mohan, C., Narang, I., Palmer, J. A Case Study of Problems in Migrating to Distributed Computing: Page Recovery Using Multiple Logs in the Shared Disks Environment, IBM Research Report RJ7343, IBM Research - Almaden, March 1990. Citations ResearchIndex

    Migrating to a distributed computing environment from an existing single system environment generally poses some special problems. This paper presents a case study of problems relating to migrating an existing data base management system (DBMS) to an environment in which all the disks containing the data bases are shared amongst multiple instances of the DBMS. Any DBMS instance in the complex may update any data and each instance has its own log and buffer pool. We describe a simple technique to perform data base recovery correctly in such an environment. In a single system DBMS like DB2, the log component assigns a monotonically increasing value called the log sequence number (LSN) for each log record that is written. The LSN is typically the logical address of the log record in the sequential log file. The DBMS stores in the header of each DB page the LSN of the log record describing the most recent update to that page. This is required for proper recovery after a system failure. In the shared disks environment, we illustrate the problems that would be caused if each system assigned LSNs independently of the other systems. We describe a technique to solve this problem without requiring migration of existing data, a realtime merged log, or communication amongst the systems.

  10. [LeMo90] Levine, F., Mohan, C. Method for Concurrent Record Access, Insertion, Deletion and Alteration Using an Index Tree, United States Patent 4,914,569, IBM, April 1990. Taiwan Patent NI-34575, February 1990. Canada Patent 1,285,072, June 1991. Korea Patent 0,052,225, June 1992. Republic of China Patent 0,027,768, July 1994.

    A method for fetching key record data in a group of record keys according to at least a portion of a key record through an index tree is provided. The index tree provides concurrent accesses of record keys by different transactions. The index tree includes a root node connected to at least one level of nodes, each node having a key record reference to one or more nodes in a next successive level and having bottom nodes that provide access to the key data. The method consists of the steps of (1) traversing across said nodes from said root node by using said key record portion until a bottom node is reached; (2) limiting all but read accesses to the node being traversed and a previously accessed node, to other concurrent transactions; (3) identifying said key record in said bottom node; (4) limiting all but read accesses to said key record; (5) removing all access limitations to traversed nodes; (6) fetching key record data; and (7) removing the access limitation to the key record after the record data has been fetched. Further, methods for inserting and deleting record keys are provided. Additionally, a method for changing the index tree structure while allowing concurrent accesses to take place is provided.

    This method has been implemented in DB2/2 and DB2/6000.
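
    A much-simplified sketch of step (2), the discipline of limiting access to the node being traversed and the previously accessed node: this is latch coupling ("crabbing") during the root-to-leaf descent. The classes and names below are invented, and a plain mutex stands in where the patent distinguishes read from write access.

        import threading

        class Node:
            def __init__(self, keys, children=None, leaf=False):
                self.keys = keys              # separator keys (internal) or key records (leaf)
                self.children = children or []
                self.leaf = leaf
                self.latch = threading.Lock() # short-duration physical latch

        def descend(root, key):
            node = root
            node.latch.acquire()
            while not node.leaf:
                # Pick the child subtree that may contain the key.
                idx = sum(1 for k in node.keys if k <= key)
                child = node.children[idx]
                child.latch.acquire()         # latch the child before releasing the parent
                node.latch.release()          # "crab" down one level
                node = child
            return node                       # caller still holds the leaf latch

        leaf1 = Node(keys=[("a", 1), ("c", 3)], leaf=True)
        leaf2 = Node(keys=[("m", 13), ("q", 17)], leaf=True)
        root = Node(keys=["m"], children=[leaf1, leaf2])

        leaf = descend(root, "c")
        record = dict(leaf.keys).get("c")     # fetch under the leaf latch
        leaf.latch.release()                  # latch dropped once the record is read
        print(record)                         # 3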

  11. [PMCLS90] Pirahesh, H., Mohan, C., Cheng, J., Liu, T.S., Selinger, P. Parallelism in Relational Data Base Systems: Architectural Issues and Design Approaches, Proc. 2nd International Symposium on Databases in Parallel and Distributed Systems, Dublin, July 1990. Reprinted in Query Processing in Parallel Relational Database Systems, H. Lu, B.-C. Ooi, K.-L. Tan (Eds.), IEEE Computer Society Press, 1994. A longer version of this paper is available as IBM Research Report RJ7724, IBM Research - Almaden, October 1990. Citations

    With current systems, some important complex queries may take days to complete because of (1) the volume of data to be processed and (2) limited aggregate resources. Introducing parallelism addresses the first problem; cheaper but powerful computing resources address the second. According to a survey by Brodie (presented at the ACM-SIGMOD International Conference on Management of Data, Chicago, May 1988), only 10% of computerized data is in data bases. This is an argument for moving both a greater variety and a greater volume of data into data base systems. We conjecture that the primary reason for this low percentage is that data base management systems (DBMSs) still need to provide far greater functionality and improved performance compared to a combination of application programs and file systems. This paper addresses the issues and solutions relating to intra-query parallelism in a relational DBMS supporting SQL. Instead of focusing only on a few algorithms for a subset of the problems, we provide a broad framework for the study of the numerous issues that need to be addressed in supporting parallelism efficiently and flexibly. We also discuss the impact that parallelization of complex queries has on short transactions which have stringent response time constraints. The pros and cons of the shared nothing, shared disks and shared everything architectures for parallelism are enumerated. The impact of parallelism on a number of components of an industrial-strength DBMS is pointed out. The different stages of query processing during which parallelism may be gainfully employed are identified. The interactions between parallelism and the traditional systems' pipelining technique are analyzed. Finally, the performance implications of parallelizing a specific complex query are studied. This gives us a range of sample points for different parameters of a parallel system architecture, namely, I/O and communication bandwidth as a function of aggregate MIPS.

  12. [Moha90a] Mohan, C. Commit_LSN: A Novel and Simple Method for Reducing Locking and Latching in Transaction Processing Systems, Proc. 16th International Conference on Very Large Data Bases, Brisbane, August 1990. Also available as IBM Research Report RJ7344, IBM Research - Almaden, February 1990. A slightly revised version appears in Performance of Concurrency Control Mechanisms in Centralized Database Systems, V. Kumar (Ed.), Prentice Hall, 1995. Citations DBLP ACM ResearchIndex

    This paper presents a novel and simple method, called Commit_LSN, for determining if a piece of data is in the committed state in a transaction processing system. This method is a much cheaper alternative to the locking approach used by the prior art for this purpose. The method takes advantage of the concept of a log sequence number (LSN). In many systems, an LSN is recorded in each page of the data base to relate the state of the page to the log of update actions for that page. Our method uses information about the LSN of the first log record (call it Commit_LSN) of the oldest update transaction still executing in the system to infer that all the updates in pages with page_LSN less than Commit_LSN have been committed. This reduces locking and latching. In addition, the method may also increase the level of concurrency that could be supported. The Commit_LSN method makes it possible to use fine-granularity locking without unduly penalizing transactions which read numerous records. It also benefits update transactions by reducing the cost of fine-granularity locking when contention is not present for data on a page. We discuss in detail many applications of this method and illustrate its potential benefits for various environments. In order to apply the Commit_LSN method, extensions are also proposed for those systems in which (1) LSNs are not associated with pages (AS/400, SQL/DS, System R), (2) LSNs are used only partially (IMS), and/or (3) not all objects' changes are logged (AS/400, SQL/DS, System R).

    Commit_LSN has been implemented in DB2 V3 and MQSeries for MVS/ESA (Message Queue Manager/ESA).
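
    The core Commit_LSN test is small enough to sketch directly (illustrative names; a real system tracks update transactions only and handles latching and lock-manager details): if a page's LSN is older than the first log record of the oldest still-active update transaction, every update on the page is already committed, so the reader may skip record locking.

        class TransactionTable:
            def __init__(self):
                self.active = {}              # txn -> LSN of its first log record

            def begin(self, txn, first_lsn):
                self.active[txn] = first_lsn

            def end(self, txn):
                self.active.pop(txn, None)

            def commit_lsn(self):
                # First log record of the oldest update transaction still executing.
                return min(self.active.values(), default=float("inf"))

        class LockManager:
            def lock(self, rid):
                pass                          # placeholder; a real one would block on conflict

        def read_record(page_lsn, rid, tt, lock_manager):
            if page_lsn < tt.commit_lsn():
                return "read without locking"   # all updates on the page are committed
            lock_manager.lock(rid)              # fall back to normal record locking
            return "read under lock"

        tt = TransactionTable()
        tt.begin("T1", first_lsn=500)
        print(read_record(page_lsn=320, rid=("P7", 2), tt=tt, lock_manager=LockManager()))
        print(read_record(page_lsn=610, rid=("P7", 3), tt=tt, lock_manager=LockManager()))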

  13. [Mohan90b] Mohan, C. ARIES/KVL: A Key-Value Locking Method for Concurrency Control of Multiaction Transactions Operating on B-Tree Indexes, Proc. 16th International Conference on Very Large Data Bases, Brisbane, August 1990. A different version of this paper is available as IBM Research Report RJ7008, IBM Almaden Research Center, September 1989. Citations DBLP ACM ResearchIndex

    This paper presents a method, called ARIES/KVL (Algorithm for Recovery and Isolation Exploiting Semantics using Key-Value Locking), for concurrency control in B+ tree indexes. A transaction may perform any number of nonindex and index operations, including range scans. Both serializable (repeatable read) and, optionally, nonserializable (cursor stability) executions of transactions are supported. The concurrent executions permitted by the locking protocols are such that correct logging and recovery are made possible. ARIES/KVL supports very high concurrency during tree traversals, structure modifications, and other operations. Unlike in System R, when one transaction is waiting for a lock on a key value in a page, reads and modifications of that page by other transactions are allowed. Further, transactions that are rolling back will never get into deadlocks. ARIES/KVL, by also using, for key value locking, the IX and SIX lock modes that were intended originally for table level locking, is able to better exploit the semantics of the operations to improve concurrency, compared to the System R index protocols. These techniques are also applicable to the concurrency control of the classical links-based storage and access structures which are beginning to appear in modern systems.

    Some of the ARIES/KVL techniques are implemented in SQL/DS and the VM Shared File System.
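
    The next-key intuition behind key-value locking can be sketched as follows (a deliberate simplification: the real ARIES/KVL mode table, its use of IX and SIX modes, and instant versus commit-duration locks are much richer). A serializable range scan locks the key just past its range, so an insert into the scanned range must request a lock the scanner already holds:

        import bisect

        keys = [10, 20, 30, 40]               # keys currently in the (unique) index
        INFINITY = "EOF"                      # lock name guarding the end of the index

        def next_key(k):
            i = bisect.bisect_right(keys, k)
            return keys[i] if i < len(keys) else INFINITY

        def range_scan_locks(low, high):
            """Keys a serializable scan of [low, high] locks, incl. the next key."""
            locked = [k for k in keys if low <= k <= high]
            locked.append(next_key(high))     # next-key lock blocks phantoms
            return locked

        def insert_locks(k):
            """Locks an insert of key k requests: the next key and k itself."""
            return [next_key(k), k]

        print(range_scan_locks(15, 30))       # [20, 30, 40] -- 40 is the next-key lock
        print(insert_locks(25))               # [30, 25] -- conflicts with the scan above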

  14. [PiMo91] Pirahesh, H., Mohan, C. Evolution of Relational DBMSs Toward Object Support: A Practical Viewpoint, Proc. Datenbanksysteme in Buro, Technik und Wissenschaft, Kaiserslautern, March 1991. Also available as IBM Research Report RJ8324, IBM Research - Almaden, September 1991. Citations ResearchIndex

    Object-oriented data base management systems (OODBMSs) have been the focus of intense research and commercial development activities in the last few years. Yet, many problems remain to be solved. In this paper, we discuss some aspects of supporting the object-oriented paradigm. In particular, we deal with type and collection hierarchies, complex objects, constraints and field-level types. We argue in favor of modifying a relational data base management system in order to support OO features rather than building an OODBMS from scratch. In so doing, we have tried to benefit from the implementation and user experiences gained from the different implementations of the relational model. We also discuss some issues that are not currently being given as much importance by the OODBMS developers as they deserve.

  15. [MoPi91] Mohan, C., Pirahesh, H. ARIES-RRH: Restricted Repeating of History in the ARIES Transaction Recovery Method, Proc. 7th International Conference on Data Engineering, Kobe, April 1991. Also available as IBM Research Report RJ7342, IBM Research - Almaden, July 1990. Citations DBLP ResearchIndex

    This paper presents a method, called ARIES-RRH (Algorithm for Recovery and Isolation Exploiting Semantics with Restricted Repeating of History), which is a modified version of the ARIES transaction recovery and concurrency control method introduced by Mohan et al. in the IBM Research Report RJ6649 and implemented, to varying degrees, in IBM's OS/2 Extended Edition Database Manager, DB2 V2, Workstation Data Save Facility/VM, Starburst and QuickSilver, in Transarc's Encina Product Suite, and in the University of Wisconsin's EXODUS and Gamma data base machine. ARIES redoes, during restart after a system failure, all updates which had been logged to stable storage but whose effects on the data base pages had not yet been reflected in nonvolatile storage before the failure. This repeating history paradigm of ARIES includes redoing the updates of even the transactions which are to be rolled back later in the undo pass of restart. The latter may lead to some wasted work being done. It was pointed out in the ARIES paper that repeating history was required to support fine-granularity (i.e., less than page-granularity) locking. This paper further analyzes this paradigm and proposes more efficient handling of redos, especially when the smallest granularity of locking is not less than a page, by incorporating the paradigm of selective redo from DB2 V1. Even with fine-granularity locking, it is not always the case that all the unapplied but logged changes need to be redone. ARIES-RRH, which incorporates these changes, still retains all the good properties of ARIES: avoiding undo of undos, single-pass media recovery, nested top actions, etc. In this paper, we also explain the fundamentals behind why DB2 V1's selective redo works, in spite of failures during restart recovery.

  16. [CHHIM91] Cheng, J., Haderle, D., Hedges, R., Iyer, B., Messinger, T., Mohan, C., Wang, Y. An Efficient Hybrid Join Algorithm: A DB2 Prototype, Proc. 7th International Conference on Data Engineering, Kobe, April 1991. A longer version of this paper is available as IBM Research Report RJ7884, IBM Almaden Research Center, December 1990.

    Many commercial relational data base systems provide two join methods: 1) Nested Loop join and 2) Sort Merge join. Nested Loop join exploits indexes on the inner table's join column. Sort Merge join benefits from the efficiency of bulk sequential disk accesses. Both provide good performance when selected correctly by the query optimizer. However, if incorrectly selected, Nested Loop join's cost could be prohibitive because of the large number of synchronous I/Os against the inner table's data and index pages, especially when the index is non-clustered. Sort Merge join suffers from the inability to fully apply the join predicate before sorting both the relations being joined. The time to access all rows of both the relations from disk and to sort them after applying local predicates could be quite large. Our new method, called Hybrid join, first sorts the outer table on the join column. Then, the outer is joined with the index on the join column of the inner. The inner tuple is represented by its surrogate, the equivalent of its physical disk address, which is carried in the index. The partial join result is sorted on the surrogate and then the inner table is accessed sequentially to complete the join result. Local predicate filtering can also be applied before the access of the inner relation through index AND/ORing. Hybrid join takes advantage of join predicate filtering through the index, efficient disk accesses by employing sequential access, and local predicate filtering through index AND/ORing. We discuss details of the Hybrid join algorithm, its prototype implementation in DB2, and its modeling and validation via measurements. We also discuss the parallel execution of the Hybrid join algorithm and an efficient algorithm for the inner table index scan.

    Hybrid join has been implemented in DB2 V2R3.
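
    A toy rendering of the Hybrid join steps described above, with Python dictionaries standing in for the inner table and its join-column index (RIDs become dictionary keys); the data and names are invented for illustration:

        from collections import defaultdict

        outer = [("ord3", "cust2"), ("ord1", "cust1"), ("ord2", "cust2")]
        inner = {0: ("cust1", "Ann"), 1: ("cust2", "Bob")}      # RID -> inner row

        # Index on the inner table's join column: join value -> RIDs.
        index = defaultdict(list)
        for rid, (custno, _) in inner.items():
            index[custno].append(rid)

        def hybrid_join(outer, index, inner):
            partial = []
            for row in sorted(outer, key=lambda r: r[1]):        # 1. sort outer on join col
                for rid in index.get(row[1], []):                # 2. probe the inner index
                    partial.append((row, rid))
            partial.sort(key=lambda p: p[1])                     # 3. sort on RID (surrogate)
            return [(row, inner[rid]) for row, rid in partial]   # 4. sequential inner fetch

        for pair in hybrid_join(outer, index, inner):
            print(pair)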

  17. [MoNa91] Mohan, C., Narang, I. Recovery and Coherency-Control Protocols for Fast Intersystem Page Transfer and Fine-Granularity Locking in a Shared Disks Transaction Environment, Proc. 17th International Conference on Very Large Data Bases, Barcelona, September 1991. A longer version of this paper is available as IBM Research Report RJ8017, IBM Almaden Research Center, March 1991. Citations DBLP ResearchIndex

    This paper proposes schemes for fast page transfer between transaction system instances in a shared disks (SD) environment where all the sharing instances can read and modify the same data. Fast page transfer improves transaction response time and concurrency because one or more disk I/Os are avoided while transferring a page from a system which modified it to another system which needs it. The proposed methods work with the steal and no-force buffer management policies, and fine-granularity (e.g., record) locking. For each of the page-transfer schemes, unlike most of the papers in the literature, we present both recovery and coherency-control protocols in a comprehensive fashion. Updates can be made to a page by several systems before the page is written to disk. Many subtleties involved in correctly recovering such a page in the face of single system or complex-wide failures are also discussed. Assuming that each system maintains its own log, some methods require a merged log for restart recovery while others don't. Techniques for enhancing data availability when one or more systems have failed are also presented. Our proposals should also apply to distributed, recoverable file systems and distributed virtual memory in the SD environment, and to the currently popular client-server object-oriented DBMS environments where the clients cache data.

  18. [SPAM91] Schreier, U., Pirahesh, H., Agrawal, R., Mohan, C. Alert: An Architecture for Transforming a Passive DBMS into an Active DBMS, Proc. 17th International Conference on Very Large Data Bases, Barcelona, September 1991. Also available as IBM Research Report RJ8306, IBM Research - Almaden, August 1991.

    Alert is an extension architecture designed for transforming a passive SQL DBMS into an active DBMS. The salient features of the design of Alert are reusing, to the extent possible, the passive DBMS technology, and making minimal changes to the language and implementation of the passive DBMS. Alert provides a layered architecture that allows the semantics of a variety of production rule languages to be supported on top. Rules may be specified on user-defined as well as built-in operations. Both synchronous and asynchronous event monitoring are possible. This paper presents the design of Alert and its implementation in the Starburst extensible DBMS.

  19. [MoNS91] Mohan, C., Narang, I., Silen, S. Solutions to Hot Spot Problems in a Shared Disks Transaction Environment, Proc. 4th International Workshop on High Performance Transaction Systems, Asilomar, September 1991. Also available as IBM Research Report RJ8281, IBM Research - Almaden, August 1991. Citations ResearchIndex

    In this paper, we consider the problems arising from very frequent update accesses to some data (hot spots) in the shared disks transaction environment. We present many techniques for increasing concurrency. These techniques take advantage of the semantics of the operations being performed on such data. A few of the features of our techniques include storing some user data in the global lock manager, centralized updating of data on disk by using log records generated by different systems and avoiding locking completely for certain types of operations. We also address the implications of different types of failures and recovery from them.

  20. [Moha92a] Mohan, C. Interactions Between Query Optimization and Concurrency Control, Proc. 2nd International Workshop on Research Issues on Data Engineering: Transaction and Query Processing, Tempe, February 1992. Also available as IBM Research Report RJ8681, IBM Research - Almaden, March 1992.

    In this paper, we argue the importance of and need for taking into consideration concurrency control related issues in making query optimization and query processing decisions. Such considerations are very important not only for attaining good performance, but also for assuring the correctness of the results returned to the users under certain circumstances. Some of the topics that we deal with include degrees of consistency or isolation levels (repeatable read, cursor stability, ...), lock escalation, blocking of results and use of multiple indexes for a single table access (i.e., index AND/ORing). We identify some of the pieces of information relating to locking that must be available to the optimizer for it to make intelligent decisions. We also identify some situations in which locking can be avoided by taking advantage of the isolation level of the query being executed.

  21. [Moha92b] Mohan, C. Less Optimism About Optimistic Concurrency Control, Proc. 2nd International Workshop on Research Issues on Data Engineering: Transaction and Query Processing, Tempe, February 1992. Also available as IBM Research Report RJ8686, IBM Research - Almaden, March 1992. ResearchIndex

    This paper attempts to document some of the shortcomings of the optimistic concurrency control (OCC) approach in supporting all the features expected in a full-function DBMS. Surprisingly, in spite of OCC having been around for a long time and its performance having been studied in various contexts, no complete system design, let alone a full-blown implementation, exists, as far as the author knows. The problems with OCC relate to support for access paths (indexes, hash-based storage), partial rollbacks, nested transactions, fine-granularity (e.g., record-level) conflict checking, different isolation levels, distributed transactions, and so on. The goal of this paper is to increase awareness of the implementation aspects of OCC amongst researchers and to initiate a debate about the practical utility of OCC.

  22. [MHLPS92] Mohan, C., Haderle, D., Lindsay, B., Pirahesh, H., Schwarz, P. ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992. Reprinted in Readings in Database Systems, Third Edition, M. Stonebraker, J. Hellerstein (Eds.), Morgan Kaufmann Publishers, 1998. Also available as IBM Research Report RJ6649, IBM Research - Almaden, January 1989; Revised November 1990. Citations DBLP ACM ResearchIndex

    In this paper, we present a simple and efficient method, called ARIES (Algorithm for Recovery and Isolation Exploiting Semantics), which supports partial rollbacks of transactions, fine-granularity (e.g., record) locking and recovery using write-ahead logging (WAL). We introduce the paradigm of repeating history to redo all missing updates before performing the rollbacks of the loser transactions during restart after a system failure. ARIES uses a log sequence number in each page to correlate the state of a page with respect to logged updates of that page. All updates of a transaction are logged, including those performed during rollbacks. By appropriate chaining of the log records written during rollbacks to those written during forward progress, a bounded amount of logging is ensured during rollbacks even in the face of repeated failures during restart or of nested rollbacks. We deal with a variety of features that are very important in building and operating an industrial-strength transaction processing system. ARIES supports fuzzy checkpoints, selective and deferred restart, fuzzy image copies, media recovery, and high concurrency lock modes (e.g., increment/decrement) which exploit the semantics of the operations and which require the ability to perform operation logging. ARIES is flexible with respect to the kinds of buffer management policies that can be implemented. It supports varying length objects efficiently. By enabling parallelism during restart, page-oriented redo and logical undo, it enhances concurrency and performance. We show why some of the System R paradigms for logging and recovery, which were based on the shadow page technique, need to be changed in the context of WAL. We compare ARIES to the WAL-based recovery methods of DB2, IMS and Tandem systems. ARIES is applicable not only to data base management systems but also to persistent object-oriented languages, recoverable file systems and transaction-based operating systems.

    ARIES has been implemented, to varying degrees, in IBM's OS/2 Extended Edition Database Manager, DB2/2, DB2/6000, DB2, AdStar Distributed Storage Manager (ADSM), MQSeries for MVS/ESA (Message Queue Manager/ESA), Workstation Data Save Facility/VM, Starburst and QuickSilver, in Transarc's Encina, and in the University of Wisconsin's EXODUS, Gamma and SHORE.
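
    The pass structure of ARIES restart can be sketched compactly (a skeleton only: real ARIES maintains transaction and dirty-page tables, checkpoints, and undo-next pointers in CLRs). Analysis identifies loser transactions; redo repeats history for every logged update missing from a page, including the losers' updates; and undo rolls the losers back while writing compensation log records (CLRs):

        log = [
            {"lsn": 1, "txn": "T1", "page": "P1", "op": "set", "val": "a", "undo": None},
            {"lsn": 2, "txn": "T2", "page": "P2", "op": "set", "val": "b", "undo": None},
            {"lsn": 3, "txn": "T1", "op": "commit"},
        ]
        pages = {"P1": {"page_lsn": 0, "val": None}, "P2": {"page_lsn": 0, "val": None}}

        # Analysis pass: find loser transactions (no commit record on the log).
        committed = {r["txn"] for r in log if r["op"] == "commit"}
        losers = {r["txn"] for r in log if r["op"] == "set"} - committed

        # Redo pass: repeat history for every logged update not yet on the page,
        # including the losers' updates (they are undone later).
        for r in log:
            if r["op"] == "set" and pages[r["page"]]["page_lsn"] < r["lsn"]:
                pages[r["page"]]["val"] = r["val"]
                pages[r["page"]]["page_lsn"] = r["lsn"]

        # Undo pass: roll back losers in reverse LSN order, logging CLRs so that
        # repeated failures never undo an undo.
        next_lsn = max(r["lsn"] for r in log) + 1
        for r in reversed([r for r in log if r["op"] == "set" and r["txn"] in losers]):
            log.append({"lsn": next_lsn, "txn": r["txn"], "page": r["page"],
                        "op": "clr", "val": r["undo"]})
            pages[r["page"]]["val"] = r["undo"]
            pages[r["page"]]["page_lsn"] = next_lsn
            next_lsn += 1

        print(pages)   # P1 keeps 'a' (T1 committed); P2 is back to None (T2 undone)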

  23. [MoNa92a] Mohan, C., Narang, I. Efficient Locking and Caching of Data in the Multisystem Shared Disks Transaction Environment, Proc. 3rd International Conference on Extending Database Technology, Vienna, March 1992. Also available as IBM Research Report RJ8301, IBM Almaden Research Center, August 1991. Abstract Citations DBLP ResearchIndex

    This paper describes a technique for use when multiple instances of a data base management system (DBMS), each with its own cache (buffer pool), can directly read and modify any data stored on a set of shared disks. Global locking and coherency control protocols are necessary in this context for assuring transaction consistency and for maintaining coherency of the data cached in the multiple caches. The coordination amongst the systems is performed by a set of local lock managers (LLMs) and a global lock manager (GLM). This typically involves sending messages. We describe a technique, called LP locking, which saves locking calls when the granularity of locking by transactions is the same as the granularity of caching by the cache manager. The savings are gained by making the LLMs hide from the GLM the distinction between a transaction lock, called the L lock, and a cache-ownership lock, called the P lock, for the same object. The L and P locks for an object, though distinct at an LLM, are known as a single lock at the GLM. An LLM can grant an L or P lock request on an object locally if the combined lock mode of the L and P locks already held on that object by that LLM is equal to or higher than the requested mode. Such optimizations save messages between the LLMs and the GLM. Our ideas apply also to the client-server environment which has become very popular in the OODBMS area and to the distributed shared memory environment.
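
    A small sketch of the message-saving rule (invented classes and a simplified two-mode lock scale; real lock-mode lattices and conflict handling are omitted): because the GLM sees the L and P locks on an object as one lock, the LLM can grant locally any request already covered by the combined mode it holds:

        STRENGTH = {"none": 0, "S": 1, "X": 2}

        class GLM:
            def __init__(self):
                self.holders = {}             # object -> {llm name: mode} (conflicts elided)

            def grant(self, llm, obj, mode):
                self.holders.setdefault(obj, {})[llm] = mode

        class LLM:
            def __init__(self, glm, name):
                self.glm, self.name = glm, name
                self.l_mode = {}              # object -> transaction (L) lock mode
                self.p_mode = {}              # object -> cache-ownership (P) lock mode
                self.messages = 0             # round trips to the GLM

            def _combined(self, obj):
                return max(STRENGTH.get(self.l_mode.get(obj), 0),
                           STRENGTH.get(self.p_mode.get(obj), 0))

            def request(self, obj, kind, mode):
                held = {"L": self.l_mode, "P": self.p_mode}[kind]
                if STRENGTH[mode] > self._combined(obj):
                    self.messages += 1        # combined mode insufficient: ask the GLM
                    self.glm.grant(self.name, obj, mode)
                held[obj] = mode              # otherwise granted locally, no message

        llm = LLM(GLM(), "node1")
        llm.request("page7", "P", "X")        # cache the page for update: goes to GLM
        llm.request("page7", "L", "X")        # record lock covered by the P lock: local
        print(llm.messages)                   # 1 -- the second request saved a message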

  24. [Moha92c] Mohan, C. Supporting Very Large Tables, Proc. 7th Brazilian Symposium on Database Systems, Porto Alegre, May 1992. Also available as IBM Research Report RJ8687, IBM Almaden Research Center, March 1992.

    There are many real-life data base management problems which haven't received as much attention as they deserve from the data base research community. A significant number of these problems relate to supporting the efficient and flexible storage, maintenance and manipulation of large volumes of data (e.g., >100 gigabytes of data in a single table). High availability is also an important consideration. While a classical DBMS like IMS Fast Path exhibits some of these desired features, the currently-popular relational DBMSs have been very slow in providing such support. To make it possible for relational DBMSs to be deployed for managing many large enterprises' operational data and to permit ad hoc querying and knowledge mining, these features are very crucial. We discuss some of the issues involved in improving the availability and efficient accessibility of partitioned tables via parallelism, fine-granularity locking, transient versioning and partition independence. We outline some solutions that have been proposed. These solutions relate to algorithms for index building, utilities for fuzzy dumps, recovery and reorganization, buffer management, transient versioning, concurrency control and record management. Algorithms of this nature are extremely important to produce industrial-strength DBMSs.

  25. [MoLe92] Mohan, C., Levine, F. ARIES/IM: An Efficient and High Concurrency Index Management Method Using Write-Ahead Logging, Proc. ACM SIGMOD International Conference on Management of Data, San Diego, June 1992. A longer version of this paper is available as IBM Research Report RJ6846, IBM Almaden Research Center, August 1989. Citations DBLP ACM ResearchIndex

    Even though concurrency in search structures (e.g., B+ tree indexes) has been discussed frequently in the literature, the problem of providing recovery from transaction and system failures when transactions consist of multiple search structure operations has received very little attention. This paper attempts to provide a comprehensive treatment of index management in transaction systems. We present a method, called ARIES/IM (Algorithm for Recovery and Isolation Exploiting Semantics for Index Management), for controlling concurrency and logging changes to index data stored in B+ trees. ARIES/IM is based on the ARIES transaction recovery and concurrency control method which was introduced by Mohan et al. and which has been implemented, to varying degrees, in IBM's OS/2 Extended Edition Database Manager, DB2/2, DB2/6000, DB2, AdStar Distributed Storage Manager (ADSM), MQSeries for MVS/ESA (Message Queue Manager/ESA), Workstation Data Save Facility/VM, Starburst and QuickSilver, in Transarc's Encina Product Suite, and in the University of Wisconsin's EXODUS extensible DBMS and Gamma data base machine. ARIES/IM supports transaction semantics for locking (repeatable read or degree 3 consistency) and uses write-ahead logging (WAL) for recovery. A transaction may consist of any number of index and nonindex operations. ARIES/IM supports very high concurrency by (1) not locking the index data per se (i.e., keys), (2) locking the underlying record data in data pages only (e.g., at the record level), (3) not acquiring commit duration locks on index pages even during index structure modification operations (SMOs) like page splits and page deletions, (4) allowing retrievals, inserts, and deletes to go on concurrently with even an SMO, and (5) optionally, supporting degree 2 consistency of locking (Cursor stability). During restart, any necessary redos of the index changes are always performed in a page-oriented fashion (i.e., without traversing the index tree) and, during normal processing and restart, undos are performed in a page-oriented fashion, whenever possible.

    A subset of ARIES/IM has been implemented in the OS/2 Extended Edition Database Manager, DB2/2 and DB2/6000. Since the locking ideas of ARIES/IM have general applicability, some of those ideas have also been incorporated in SQL/DS and the VM Shared File System even though those systems are based on System R which uses the shadow-page technique for recovery.

  26. [MoNa92b] Mohan, C., Narang, I. Algorithms for Creating Indexes for Very Large Tables Without Quiescing Updates, Proc. ACM SIGMOD International Conference on Management of Data, San Diego, June 1992. A longer version of this paper is available as IBM Research Report RJ8016, IBM Almaden Research Center, March 1991. Citations DBLP ACM ResearchIndex

    As relational DBMSs become more and more popular and as organizations grow, the sizes of individual tables are increasing dramatically. Unfortunately, current DBMSs do not allow updates to be performed on a table while an index (e.g., a B+ tree) is being built for that table, thereby decreasing the systems' availability. This paper describes two algorithms in order to relax this restriction. Our emphasis has been to maximize concurrency, minimize overheads and cover all aspects of the problem. Builds of both unique and nonunique indexes are handled correctly. We also describe techniques for making the index-build operation restartable, without loss of all work, in case a system failure were to interrupt the completion of the creation of the index. In this connection, we also present algorithms for making a long sort operation restartable. These include algorithms for the sort and merge phases of sorting.

  27. [MoPL92] Mohan, C., Pirahesh, H., Lorie, R. Efficient and Flexible Methods for Transient Versioning of Records to Avoid Locking by Read-Only Transactions, Proc. ACM SIGMOD International Conference on Management of Data, San Diego, June 1992. Also available as IBM Research Report RJ8683, IBM Research - Almaden, March 1992. Citations DBLP ACM ResearchIndex

    We present efficient and flexible methods which permit read-only transactions that do not mind reading a possibly slightly old, but still consistent, version of the data base to execute without acquiring locks. This approach avoids the undesirable interferences between such queries and the typically shorter update transactions that cause unnecessary and costly delays. Indexed access by such queries is also supported, unlike by the earlier methods. Old versions of records are maintained only in a transient fashion. Our methods are characterized by their flexibility (number of versions maintained and the timing of version switches, supporting partial rollbacks, and different recovery and buffering methods) and their efficiency (logging, garbage collection, version selection, and incremental, record-level versioning). Distributed data base environments are also supported, including commit protocols with the read-only optimization. We also describe efficient methods for garbage collecting unneeded older versions.
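
    A compact sketch of the transient-versioning idea (hypothetical structures; the paper's methods differ in versioning granularity, recovery and buffering): updaters push prior versions, tagged with a commit sequence number, onto a short chain; a read-only query reads, without any locks, the newest version committed at or before its snapshot point; and garbage collection trims versions no query can still see:

        records = {}        # key -> list of (commit_seq, value), newest first

        def update(key, value, commit_seq):
            records.setdefault(key, []).insert(0, (commit_seq, value))

        def snapshot_read(key, snapshot_seq):
            """Lock-free read: newest version committed at or before the snapshot."""
            for commit_seq, value in records.get(key, []):
                if commit_seq <= snapshot_seq:
                    return value
            return None

        def garbage_collect(oldest_snapshot_seq):
            """Drop versions no snapshot can still see (keep one visible version)."""
            for versions in records.values():
                for i, (commit_seq, _) in enumerate(versions):
                    if commit_seq <= oldest_snapshot_seq:
                        del versions[i + 1:]
                        break

        update("acct", 100, commit_seq=5)
        update("acct", 175, commit_seq=9)
        print(snapshot_read("acct", snapshot_seq=7))   # 100: query began before seq 9
        garbage_collect(oldest_snapshot_seq=9)
        print(records["acct"])                         # [(9, 175)]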

  28. [MoNa92c] Mohan, C., Narang, I. Data Base Recovery in Shared Disks and Client-Server Architectures, Proc. 12th International Conference on Distributed Computing Systems, Yokohama, June 1992. Also available as IBM Research Report RJ8685, IBM Research - Almaden, March 1992. Citations ResearchIndex

    This paper presents solutions for the problem of performing recovery correctly in shared disks (SD) and client-server (CS) architectures. In SD, all the disks containing the data bases are shared amongst multiple instances of the DBMS. Any DBMS instance in the complex may directly access and update any data. Each instance maintains its own log and buffer pool. The local logs are later merged for media recovery purposes and, possibly, for restart recovery purposes. In CS, the server manages the disk version of the data base. The clients, after obtaining data base pages from the server, cache them in their buffer pools. Clients perform their updates on the cached pages and produce log records. The log records are buffered locally in virtual storage and later sent to the single log at the server. In write-ahead logging (WAL) systems, a monotonically increasing value called the log sequence number (LSN) is associated with each log record. Every data base page contains the LSN of the log record describing the most recent update to that page. This is required for proper recovery after a system failure. We describe a technique with some valuable features (e.g., avoiding reading empty pages and supporting the Commit_LSN optimization) for generating monotonically increasing LSNs in SD and CS architectures without using synchronized clocks.
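
    The monotonicity requirement discussed above can be illustrated with one simple rule (a simplification, not the paper's full technique): an instance updating a page must assign an LSN greater than both its own last LSN and the page's current LSN, which keeps per-page LSNs increasing without synchronized clocks or a merged log:

        class DBMSInstance:
            def __init__(self, name):
                self.name = name
                self.last_lsn = 0             # high-water mark of this instance's log
                self.log = []

            def update_page(self, page, new_val):
                # LSN must exceed both our own last LSN and the page's current LSN,
                # which may have been assigned by another instance's log.
                lsn = max(self.last_lsn, page["page_lsn"]) + 1
                self.last_lsn = lsn
                self.log.append((lsn, page["id"], new_val))
                page["page_lsn"] = lsn
                page["val"] = new_val

        page = {"id": "P1", "page_lsn": 0, "val": None}
        a, b = DBMSInstance("A"), DBMSInstance("B")
        a.update_page(page, "x")              # A assigns LSN 1
        a.update_page(page, "y")              # A assigns LSN 2
        b.update_page(page, "z")              # B must jump past 2: assigns LSN 3
        print(page["page_lsn"], b.log)        # 3 [(3, 'P1', 'z')]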

  29. [LeMo92] Levine, F., Mohan, C. Method and Apparatus for Concurrent Modification of an Index Tree in a Transaction Processing System Utilizing Selective Indication of Structural Modification Operations, United States Patent 5,123,104, IBM, June 1992. Sri Lanka Patent 0,010,014, November 1989. Taiwan Patent NI-40987, December 1990. Republic of China Patent 0,022,452, February 1993. Philippines Patent 0,027,313, May 1993. Korea Patent 0,063,350, July 1993. Thailand Patent 0,003,973, September 1994.

    A method and apparatus for concurrent modifications of an index tree in a transaction processing system. The index tree includes at least one root node having a key record reference to one or more nodes in a next lower ordered level and at least one bottom node providing access to key records. Transactions including a structure modification operation are performed by traversing the index tree to the selected node and then setting an indication of the pendency of a structure modification operation. Concurrent key record inserts or deletes are permitted throughout the index tree where no indication of a pending structure modification operation is present and are delayed where a pending structure modification operation is indicated. Similarly, transactions which include a key record delete may require a structure modification operation in the event the transaction does not reach a new point of consistency and must be undone. Therefore, an indication of each key record delete which has not yet reached a new point of consistency is set, and concurrent key record inserts or deletes are also delayed until the possibility of a structure modification operation has passed.

    This method has been implemented in DB2/2 and DB2/6000.

  30. [MoOT92] Mohan, C., Obermarck, R., Treiber, K. Concurrently Applying Redo Records to Backup Database in a Log Sequence Using Single Queue Server per Queue at a Time, United States Patent 5,170,480, IBM, December 1992. Japan Patent 1,868,704, September 1994.

    Change processing of a replica database is accomplished by separating redo records obtained from the transaction log of a primary database into respective queues. The redo records are separated such that all transaction records for a unit of transfer (page) of the primary database are placed on the same queue in log sequence. Each queue is linked exclusively to one of a plurality of parallel queue servers. Each queue server applies to the replica database the redo records in the queues which it exclusively serves. The replica database is thereby made consistent with the primary database by a lock-free updating mechanism which services the pages of the replica database in parallel.

    This method has been implemented as part of the Remote Site Recovery (RSR) feature of IMS/ESA V5.
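
    The patent's queueing discipline can be sketched as follows (a toy version with threads; the hash-based routing is invented and merely ensures all records for a page land, in log sequence, on the same queue): because each queue is drained by exactly one server, pages are applied in parallel without locks:

        import queue
        import threading

        NUM_QUEUES = 4
        queues = [queue.Queue() for _ in range(NUM_QUEUES)]
        replica = {}                          # page -> value (the backup database)

        redo_log = [(1, "P1", "a"), (2, "P2", "b"), (3, "P1", "c"), (4, "P3", "d")]

        # Separate redo records: all records for a page go to the same queue,
        # in log sequence.
        for lsn, page, val in redo_log:
            queues[hash(page) % NUM_QUEUES].put((lsn, page, val))

        def queue_server(q):
            while True:
                try:
                    lsn, page, val = q.get_nowait()
                except queue.Empty:
                    return
                replica[page] = val           # no lock: this server owns these pages

        servers = [threading.Thread(target=queue_server, args=(q,)) for q in queues]
        for s in servers:
            s.start()
        for s in servers:
            s.join()
        print(replica)                        # {'P1': 'c', 'P2': 'b', 'P3': 'd'}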

  31. [MoTO93] Mohan, C., Treiber, K., Obermarck, R. Algorithms for the Management of Remote Backup Data Bases for Disaster Recovery, Proc. 9th International Conference on Data Engineering, Vienna, April 1993. A longer version of this paper is available as IBM Research Report RJ7885, IBM Almaden Research Center, December 1990; Revised June 1991. Citations DBLP ResearchIndex

    For high availability in disaster recovery situations, it is desirable to reflect data base changes made by a transaction processing system continuously on a (replicated) data base maintained at a remote site. To be able to make the backup system take over as soon as possible and to keep the resource consumption on the backup system low, we describe a general method for exploiting parallelism in the processing of the log records received at the backup system. As much as possible, this method tries to take advantage of the existing code for performing restart recovery after a failure of the transaction system. The techniques described here are very general in that they could also be used in the context of restart and media recovery of the primary system to improve the time it takes to complete such processing. We also propose techniques for (1) checkpointing the state of the backup system so that recovery can be performed quickly in case the backup system fails and (2) allowing new transaction activity to begin even as the backup is taking over the role of the primary when the old primary fails. Our approach is general enough to accommodate even the ARIES-type recovery and concurrency control methods which support high concurrency and high efficiency via write-ahead logging, nested transactions, operation logging and semantically-rich modes of locking. We also discuss some problems relating to distributed transactions, the shared disks (data sharing) transaction environment, and combining executions of 1-Safe and 2-Safe transactions in a single system. We propose some possible solutions for dealing with these problems.

    The parallel redo method for the backup has been implemented as part of the Remote Site Recovery (RSR) feature of IMS/ESA V5.

  32. [Moha93a] Mohan, C. ARIES/LHS: A Concurrency Control and Recovery Method Using Write-Ahead Logging for Linear Hashing with Separators, Proc. 9th International Conference on Data Engineering, Vienna, April 1993. A longer version of this paper is available as IBM Research Report RJ8682, IBM Almaden Research Center, March 1992. Citations DBLP ResearchIndex

    Larson has proposed a dynamic hashing algorithm called Linear Hashing with Separators (LHS) that, given a unique primary key value, uses a table in memory to allow the retrieval of the corresponding record in the file in one page access to secondary storage. Larson considers LHS to be the first practical method offering one-access retrieval for large dynamic files. He did not discuss the impact of concurrent operations by different users, some of whom are reading the file while others are performing operations like inserts, deletes, updates, file expansions or file contractions which can cause relocations of records. We present a method, called ARIES/LHS (Algorithm for Recovery and Isolation Exploiting Semantics for Linear Hashing with Separators), for controlling such concurrent operations with fine-granularity (e.g., record) locking, while guaranteeing serializability. ARIES/LHS prevents rolling back transactions from getting involved in deadlocks. It also includes recovery techniques for handling transaction and system failures, while allowing multiple operations in each transaction.

  33. [SBCM93] Samaras, G., Britton, K., Citron, A., Mohan, C. Two-Phase Commit Optimizations and Tradeoffs in the Commercial Environment, Proc. 9th International Conference on Data Engineering, Vienna, April 1993. Citations DBLP ResearchIndex

    An atomic commit protocol can ensure that all participants in a distributed transaction reach consistent states, whether or not system or network failures occur. The atomic commit protocol used in industry and academia is the well-known two-phase commit (2PC) protocol, which has been the subject of considerable work and technical literature for some years. Much of the literature focuses on improving performance in failure cases by providing a nonblocking 2PC that streamlines recovery processing at the expense of extra processing in the normal case. We focus on improving performance in the normal case based on two assumptions: first, that networks and systems are becoming increasingly reliable, and second, that the need to support high-volume transactions requires a streamlined protocol for the normal case. In this paper, various optimizations are presented and analyzed in terms of reliability, savings in log writes, network traffic and reduction in resource lock time. The paper's unique contributions include the disclosure of a number of new optimizations, the analysis of the optimizations, and a thorough comparison of the different approaches.

  34. [MoNa93a] Mohan, C., Narang, I. An Efficient and Flexible Method for Archiving a Data Base, Proc. ACM SIGMOD International Conference on Management of Data, Washington, D.C., May 1993. A corrected version of this paper is available as IBM Research Report RJ9733, IBM Research - Almaden, March 1993.

    We describe an efficient method for supporting incremental and full archiving of data bases (e.g., individual files). Customers archive their data bases quite frequently to minimize the duration of data outage. Because of the growing sizes of data bases and the ever increasing need for high availability of data, the efficiency of the archive copy utility is very important. The method presented here minimizes interference with concurrent transactions by not acquiring any locks on the data being copied. It significantly reduces disk I/Os by not keeping on data pages any extra tracking information in connection with archiving. These features make the archive copy operation more efficient in terms of resource consumption compared to other methods. The method is also flexible in that it optionally supports direct copying of data from disks, bypassing the DBMS's buffer pool. This reduces buffer pool pollution and processing overheads, and allows the utility to take advantage of device geometries for efficiently retrieving data. We also describe extensions to the method to accommodate the multisystem shared disks transaction environment. The method gracefully tolerates system failures during the archive copy operation.

  35. [Moha93b] Mohan, C. IBM's Relational DBMS Products: Features and Technologies, Proc. ACM SIGMOD International Conference on Management of Data, Washington, D.C., May 1993.

    This paper very briefly summarizes the features and technologies implemented in the IBM relational DBMS products. The topics covered include record and index management, concurrency control and recovery methods, commit protocols, query optimization and execution techniques, high availability and support for parallelism and distributed data. Some indications of likely future product directions are also given.

  36. [PiMC93] Pirahesh, H., Mohan, C., Cheng, J. Sequential and Parallel Algorithms for Unified Execution of Outer Join and Subqueries, IBM Research Report, IBM Research - Almaden, June 1993.

    The outer join operation is being introduced in major relational DBMSs, and is already proposed as part of the emerging SQL2 standard. The outer join operation plays an important role in the handling of complex object queries, path expressions in object-oriented query languages, and universal and existential subqueries in relational languages. Thus, outer join is an important primitive. As with regular joins, good performance of outer join is critical in query processing. In this paper, we introduce a unified algorithm for handling outer joins and subqueries. We first introduce a general execution algorithm for an outer join of any complexity, as required by the SQL2 standard, where the outer join predicate can be a full WHERE clause of SQL, including conjuncts, disjuncts, subqueries, etc. Typically, outer join queries are much simpler, e.g., involving equijoins. For this category of queries we introduce, in considerable detail, a much more efficient algorithm, called the specialized algorithm. One important property of these algorithms is that they allow several existing efficient join algorithms to be extended to support outer join. We also present methods for parallelizing the execution of an outer join. We then refine these methods for handling SQL subqueries.

  37. [Moha93c] Mohan, C. A Cost-Effective Method for Providing Improved Data Availability During DBMS Restart Recovery After a Failure, Proc. 19th International Conference on Very Large Data Bases, Dublin, August 1993. Also available as IBM Research Report RJ8114, IBM Research - Almaden, May 1991.

    We present a cost-effective method for improving data availability during restart recovery of a data base management system (DBMS) after a failure. The method achieves its objective by enabling the processing of new transactions to begin even before restart recovery is completed by exploiting the Commit_LSN concept. It supports fine-granularity (e.g., record) locking with semantically-rich lock modes and operation logging, partial rollbacks, write-ahead logging, and the steal and no-force buffer management policies. The overhead imposed by this method during normal transaction processing is insignificant. We require very few changes to an existing DBMS in order to support our method. Our method can be implemented with different degrees of sophistication depending on the existing features of a DBMS.

  38. [CHHIM93] Cheng, J., Haderle, D., Hedges R., Iyer, B., Mohan, C., Wang, Y. A Hybrid Technique for Joining Tables, United States Patent 5,241,648, IBM, August 1993.

    Tables in a relational data base management system are joined in a process requiring, first, the existence of an index on the join column of the inner table and, second, ordering on the join column of the outer table. First, the index on the inner table's join column is scanned for rows of the inner table having join column values matching such values of rows in the outer table. This is done in a single pass through the outer table. Next, a temporary work table containing the identifiers of inner table rows having join column values matching those of the outer table is produced by concatenating the row identifiers to their matching outer table rows. Following this, the temporary work table is ordered by the identifiers. Last, the identifier list of inner table rows is used to retrieve the corresponding rows of the inner table. All predicates local to the inner table are applied to the retrieved rows, and those satisfying these local predicates are combined with their matching outer table rows and returned to the user.

    This join method has been implemented in DB2 V2R3.
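
    The steps above translate naturally into code. The following is a minimal, illustrative sketch of the hybrid join, not the DB2 implementation; the structures and names (inner_index, fetch_inner_row) are hypothetical stand-ins for an index scan and a disk fetch.

    ```python
    def hybrid_join(outer_rows, inner_index, fetch_inner_row, inner_predicate):
        """outer_rows: iterable of (join_key, payload), ordered on join_key.
        inner_index: dict mapping join_key -> list of inner-row identifiers (RIDs).
        fetch_inner_row: function RID -> inner row (simulating a disk fetch).
        inner_predicate: filter applied to fetched inner rows."""
        # Phase 1: one pass over the sorted outer table, probing the inner
        # index and concatenating matching RIDs to their outer rows.
        work_table = []
        for key, outer_payload in outer_rows:
            for rid in inner_index.get(key, []):
                work_table.append((rid, key, outer_payload))

        # Phase 2: order the work table by RID so inner-table pages are
        # visited in storage order rather than at random.
        work_table.sort(key=lambda entry: entry[0])

        # Phase 3: fetch each inner row, apply its local predicates, and
        # combine survivors with their matching outer rows.
        for rid, key, outer_payload in work_table:
            inner_row = fetch_inner_row(rid)
            if inner_predicate(inner_row):
                yield (key, outer_payload, inner_row)
    ```

    Sorting the work table by row identifier is what distinguishes the hybrid method: inner-table pages are fetched in storage order instead of randomly, once per qualifying row.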

  39. [Moha93d] Mohan, C. Transaction Processing System and Method With Reduced Locking, United States Patent 5,247,672, IBM, September 1993. Japan Patent 1,938,731, June 1995.

    Apparatus and method for reading data pages in a transaction processing system without locking the pages are disclosed. The system maintains a Global_Committed_LSN identifying the oldest uncommitted transaction accessing any of the data, and Object_Committed_LSNs identifying the oldest uncommitted transactions accessing particular files, tables and indexes. Each data page includes a Page_LSN identifying the last transaction to have updated the page. To read a page, a transaction first latches the page, and compares the page's Page_LSN with the Global_Committed_LSN, or with the page's respective Object_Committed_LSN. If the Page_LSN is older than the Committed_LSN with which it was compared, then the transaction reads the page without locking it, since there can be no uncommitted transaction in progress which might have updated the page's data. However, if the Page_LSN is younger than the Committed_LSN, the page is locked before being read.

    This method has been implemented in DB2 V3 and MQSeries for MVS/ESA (Message Queue Manager/ESA).
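
    A minimal sketch of the comparison described above, with hypothetical names (Page, acquire_share_lock); the real implementation also chooses between the global and per-object Committed_LSNs and handles latch/lock interactions more carefully.

    ```python
    from threading import Lock

    class Page:
        def __init__(self, page_id, page_lsn, data):
            self.id, self.page_lsn, self.data = page_id, page_lsn, data
            self.latch = Lock()              # short-duration physical latch

    def read_page(page, committed_lsn, acquire_share_lock):
        # committed_lsn identifies the oldest uncommitted transaction
        # (Global_Committed_LSN or the object's Object_Committed_LSN).
        with page.latch:
            if page.page_lsn < committed_lsn:
                # Every update to this page predates all uncommitted
                # transactions, so its contents are committed: no lock needed.
                return page.data
        # The page may carry uncommitted updates: lock it before reading.
        with acquire_share_lock(page.id):
            return page.data
    ```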

  40. [LLMP93] Lee, T., Lyle, R., Mohan, C., Peterson, A. Using Commit Bits to Reduce Locking in Transaction Processing Systems, IBM Technical Disclosure Bulletin, Vol. 36, No. 9A, September 1993.

    Disclosed is a new way to determine if data is in the committed state without locking through the use of commit bits. Under certain conditions, transactions can easily determine that the data is committed simply by checking the value of a bit.

    This method has been implemented in DB2/390 V3.
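
    The disclosure reduces to a one-bit test ahead of the normal locking path; a tiny hypothetical sketch:

    ```python
    def read_record(record, acquire_share_lock):
        # If the record's commit bit is set, its data is known to be in the
        # committed state and can be read without any locking.
        if record.commit_bit:
            return record.value
        with acquire_share_lock(record.id):  # otherwise lock, then read
            return record.value
    ```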

  41. [Moha93e] Mohan, C. A Survey of DBMS Research Issues in Supporting Very Large Tables, Proc. 4th International Conference on Foundations of Data Organization and Algorithms, Evanston, October 1993.

    A number of interesting problems arise in supporting the efficient and flexible storage, maintenance and manipulation of large volumes of data (e.g., >100 gigabytes of data in a single table). Very large tables are becoming common. Typically, high availability is an important requirement for such data. The currently popular relational DBMSs have been slow in providing the needed support. Such support is crucial if RDBMSs are to be deployed to manage many large enterprises' operational data and to handle complex queries efficiently. We discuss some of the issues involved in improving the availability and efficient accessibility of partitioned tables via parallelism, fine-granularity locking, transient versioning and partition independence. We outline some solutions that have been proposed. These solutions relate to algorithms for index building, utilities for fuzzy backups, incremental recovery and reorganization, buffer management, transient versioning, concurrency control and record management.

  42. [BCMS93] Britton, K., Citron, A., Mohan, C., Samaras, G. Method of Excluding Inactive Nodes from Two-Phase Commit Operations in a Distributed Transaction Processing System, United States Patent 5,258,982, IBM, November 1993. Japan Patent 1,938,748, June 1995.

    A method of reducing the number of messages required for sync point (commit or backout) operations by leaving out nodes that have not participated in the corresponding transaction. A two-phase sync point protocol is used in a distributed transaction processing network to commit or back out transactions in the network. In response to the beginning of sync point operations on a transaction x, each node determines whether each of its partner nodes stated, during the sync point operation for transaction x-1, that the partner could be left out of sync point operations for transaction x. If a partner node did state that it could be left out, the present node determines whether the partner node was included by the present node during the present transaction x. If the partner node was not included during the present transaction, the present node excludes the partner node from the present sync point operations.

    It is useful for an overall understanding of the invention to summarize the general rules first. It is not okay for a node to be left_out of sync_point operations if a partner node is also left_out; this can create a permanent lockout. Therefore, if a child node tells its parent that it is okay for the parent to leave_out the child, then the parent cannot also tell the child that it is okay to leave_out the parent.

    This method is now part of the SNA LU6.2 support for the presumed abort commit protocol. It has been implemented in DB2 V3.
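
    A sketch of the exclusion decision under the rules above, using hypothetical partner-state fields; in the real protocol this state is carried on the sync point messages themselves.

    ```python
    def partners_to_message(partners, txn):
        """Return the partners that must receive sync point messages for
        transaction txn; the rest can safely be left out."""
        included = []
        for p in partners:
            leave_out = (p.said_leave_out_ok_in_previous_syncpoint
                         and not p.participated_in(txn))
            if not leave_out:
                included.append(p)
        return included
    ```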

  43. [MoNa94a] Mohan, C., Narang, I. Non-Blocking Serialization for Caching Data in a Shared Cache, United States Patent 5,276,835, IBM, January 1994.

    A method of controlling entry of a block of data is used with a high-speed cache which is shared by a plurality of independently-operating computer systems in a multi-system data sharing complex. Each computer system has access both to the high-speed cache and to lower-speed, upper-level storage for obtaining and storing data. Management logic in the high-speed cache assures that the block of data entered into the cache will not be overwritten by an earlier version of the block of data obtained from the upper-level storage.

    This method is part of the S/390 Parallel Sysplex Coupling Facility. It is exploited by DB2 V4R1.

  44. [MoNT94] Mohan, C., Narang, I., Teng, J. Method for Managing Database Recovery from Failure of a Shared Store in a System Including a Plurality of Transaction-Based Systems of the Write-Ahead Logging Type, United States Patent 5,280,611, IBM, January 1994.

    In a multi-system data sharing complex, a database system can write updated pages to a shared electronic store for a fast write. Other database systems can obtain pages written to the shared store for further modification without the pages first being written to stable storage. However, pages are eventually written to the stable storage in a castout process. Recovery of a database from failure of the shared store is bounded by determination of a recovery boundary which, when applied to the union of database system transaction logs, establishes a point in front of which are found log records of modifications to pages which were in the shared store when it failed. These log records are applied to page versions obtained from stable storage to recover from failure of the shared store.

    This method is part of the S/390 Parallel Sysplex Coupling Facility and DB2 V4R1.
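
    One way to picture the recovery boundary computation is the following sketch over hypothetical log and page structures (the patent's castout bookkeeping is omitted):

    ```python
    import heapq

    def recover_shared_store(system_logs, store_page_ids, stable_pages,
                             first_dirty_lsn):
        # The boundary is the oldest point at which some page lost with the
        # shared store was first modified without being written to disk.
        boundary = min(first_dirty_lsn[pid] for pid in store_page_ids)
        # Merge the per-system logs into one LSN-ordered stream and redo
        # lost updates against the page versions on stable storage.
        for rec in heapq.merge(*system_logs, key=lambda r: r.lsn):
            if rec.lsn < boundary or rec.page_id not in store_page_ids:
                continue
            page = stable_pages[rec.page_id]
            if page.lsn < rec.lsn:           # standard LSN-based redo test
                page.apply(rec.redo)
                page.lsn = rec.lsn
    ```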

  45. [LoMP94] Lorie, R., Mohan, C., Pirahesh, H. Multiple Version Database Concurrency Control System, United States Patent 5,280,612, IBM, January 1994.

    An improved concurrency control system for application to a distributed concurrent transaction and query processing system using multi-version database records to overcome delays arising from lock conflicts. Read-only queries are afforded a consistent "stable state" of the database during the life of the query. Updating transactions requiring locks can proceed without waiting for the termination of long queries. At least two database versions are necessary, although availability of more versions permits long read-only queries to phase out over time without forcing new queries to use aged "stable-state" data and without roll-back. Read-only queries can be terminated and converted to locking transactions to permit an update of the "stable-state" database version before the queries would normally terminate. A novel record key structure having a plurality of substructures corresponding to the several database versions is used to access database records. Rapid selection of the proper record version and efficient version tracking and updating are effected using several bit-mapped transaction index tables.

  46. [MoNa94b] Mohan, C., Narang, I. Non-Blocking Serialization for Removing Data from a Shared Cache, United States Patent 5,287,473, IBM, February 1994.

    A high-speed cache is shared by a plurality of independently-operating data systems in a multi-system data sharing complex. Each data system has access both to the high-speed cache and the lower-speed, secondary storage for obtaining and sharing data. Management logic in the high-speed cache assures that a block of data obtained from the cache for entry into the secondary storage will be consistent with the version of the block of data in the shared cache.

    This method is part of the S/390 Parallel Sysplex Coupling Facility. It is exploited by DB2 V4R1.

  47. [MoHa94] Mohan, C., Haderle, D. Algorithms for Flexible Space Management in Transaction Systems Supporting Fine-Granularity Locking, Proc. 4th International Conference on Extending Database Technology, Cambridge, March 1994. A longer version of this paper is available as IBM Research Report RJ9732, IBM Research - Almaden, March 1994.

    We present several methods which relate to space management in a transaction system supporting fine-granularity (e.g., record) locking. These methods enable varying-length records to be supported efficiently by permitting garbage collection to be performed within a page without the moved records having to be locked or the movements having to be logged. We present methods to do the following: (1) When a transaction releases space, efficiently prevent that space from being consumed by other transactions until that transaction terminates, while allowing the same transaction to reuse the space it freed. (2) Under the correct circumstances, avoid reading a totally empty deallocated page from disk during page reallocation. (3) Correctly update and log changes to free space inventory pages (FSIPs) for recovery. (4) Reduce locking during a table scan by a transaction using the isolation level of cursor stability. Our methods improve concurrency and space utilization, and provide I/O and CPU savings. Our space reservation and FSIP logging methods have been implemented in DB2 V3 in preparation for DB2's support of record locking.
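
    Method (1) can be sketched as a per-page reservation map keyed by transaction identifier; the structure below is hypothetical (DB2's actual bookkeeping lives partly on FSIPs).

    ```python
    class PageSpace:
        """Space freed by a transaction stays reserved for that transaction
        until it terminates, but the freeing transaction may reuse it."""
        def __init__(self, free_bytes):
            self.unreserved = free_bytes
            self.reserved = {}               # txn_id -> bytes it freed

        def release(self, txn_id, nbytes):   # e.g. after a record delete
            self.reserved[txn_id] = self.reserved.get(txn_id, 0) + nbytes

        def usable_by(self, txn_id):
            # A transaction sees the unreserved space plus only its own
            # reservations; other transactions' freed space is off limits.
            return self.unreserved + self.reserved.get(txn_id, 0)

        def on_terminate(self, txn_id):
            # Reserved space becomes generally available at termination.
            self.unreserved += self.reserved.pop(txn_id, 0)
    ```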

  48. [MoDi94] Mohan, C., Dievendorff, R. Recent Work on Distributed Commit Protocols, and Recoverable Messaging and Queuing, Data Engineering Bulletin Special Issue on TP Monitors and Distributed Transaction Management, Vol. 17, No. 1, March 1994.

    This paper briefly summarizes some recent IBM work in the areas of distributed commit protocols, and recoverable messaging and queuing. We discuss the original Presumed Nothing commit protocol of SNA LU 6.2 and the current industry standard Presumed Abort (PA) protocol which we originally developed in IBM's R* project. We also discuss Generalized Presumed Abort (GPA) which resulted from the integration of PA into LU 6.2. GPA has been implemented in DB2 V3. We provide a brief introduction to the Message Queue Interface (MQI), an architected application programming interface, and Message Queue Manager (MQM) MVS/ESA, one of the IBM MQSeries products that implements MQI. Some internal design features of MQM are also described.

  49. [MPTW94] Mohan, C., Pirahesh, H., Tang, G., Wang, Y. Parallelism in Relational Database Management Systems, IBM Systems Journal, Vol. 33, No. 2, 1994. Reprinted in Japanese in Nikkei Electronics, No. 618 and 619, 1994.

    In order to provide realtime responses to complex queries involving large volumes of data, it has become necessary to exploit parallelism in query processing. This paper addresses the issues and solutions relating to intra-query parallelism in a relational DBMS. We provide a broad framework for the study of the numerous issues that need to be addressed in supporting parallelism efficiently and flexibly. The different stages of query processing during which parallelism may be gainfully employed are identified. Parallelism can be exploited for both CPU and I/O activities. As a first step in this direction, I/O parallelism has been introduced in DB2 V3R1. We describe many aspects of the DB2 implementation. These include compile time as well as run time decisions, especially regarding the degree of parallelism. At compile time, the best sequential plan produced by the preexisting optimizer logic is made parallel by the new "post-optimizer" logic in DB2 V3. If necessary, the degree of parallelism so determined is adjusted at run time based on host variable values and resource availability.

  50. [HLMPS94] Haderle, D., Lindsay, B., Mohan, C., Pirahesh, H., Schwarz, P. Method for Managing Subpage Concurrency Control and Partial Transaction Rollback in a Transaction-Oriented System of the Write-Ahead Logging Type, United Kingdom Patent 0,295,424, IBM, April 1994. France Patent 0,295,424, April 1994. Germany Patent 3,889,254,508, April 1994.

    This invention relates to a method for returning a log-based, transaction-oriented system to a transaction consistent state following a failure. More particularly, this invention relates to methods which permit the effects of a single transaction to be partially or fully annulled during normal system operation. In this regard, transactions are delimited at the user level by BEGIN, COMMIT, or ABORT primitives. With respect to the transaction log, UNDO and REDO information from the log is used to define and control both system recovery and partial or complete transaction rollback. More particularly, the method of the invention relates to transaction-oriented systems of the type which support concurrent execution of multiple transactions, and further of the type permitting fine-grained concurrency control mechanisms and consequent overlapping of transaction execution.

    This method (ARIES) has been implemented in OS/2 Extended Edition Data Base Manager, DB2 V3 and V4, DB2/2, DB2/6000, Workstation Data Save Facility/VM, ADSTAR Distributed Storage Manager (ADSM), MQSeries for MVS/ESA (Message Queue Manager/ESA), the Starburst extensible DBMS, the QuickSilver distributed operating system, Transarc's Encina Product Suite, the University of Wisconsin's Gamma and EXODUS DBMSs, and the SHORE persistent object system.
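
    A deliberately tiny sketch of the redo/undo structure described above, in the spirit of ARIES; it omits checkpoints, the dirty pages table, and compensation log records, all of which the real method requires.

    ```python
    def restart(log, pages):
        # Analysis: transactions without a commit record are losers.
        winners = {r.txn for r in log if r.kind == "commit"}
        losers = {r.txn for r in log if r.kind == "update"} - winners

        # Redo: repeat history, reapplying every update (even losers')
        # whose LSN is newer than the LSN already on the page.
        for r in log:
            if r.kind == "update" and pages[r.page_id].lsn < r.lsn:
                pages[r.page_id].apply(r.redo)
                pages[r.page_id].lsn = r.lsn

        # Undo: roll back losers in reverse LSN order; a full implementation
        # writes compensation log records as each update is undone.
        for r in reversed([r for r in log
                           if r.kind == "update" and r.txn in losers]):
            pages[r.page_id].apply(r.undo)
    ```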

  51. [MoNa94c] Mohan, C., Narang, I. ARIES/CSA: A Method for Database Recovery in Client-Server Architectures, Proc. ACM SIGMOD International Conference on Management of Data, Minneapolis, May 1994. Also available as IBM Research Report RJ9742, IBM Research - Almaden, March 1994.

    This paper presents an algorithm, called ARIES/CSA (Algorithm for Recovery and Isolation Exploiting Semantics for Client-Server Architectures), for performing recovery correctly in client-server (CS) architectures. In CS, the server manages the disk version of the database. The clients, after obtaining database pages from the server, cache them in their buffer pools. Clients perform their updates on the cached pages and produce log records. The log records are buffered locally in virtual storage and later sent to the single log at the server. ARIES/CSA supports write-ahead logging (WAL), fine-granularity (e.g., record) locking, partial rollbacks and flexible buffer management policies like steal and no-force. It does not require that the clocks on the clients and the server be synchronized. Checkpointing by the server and the clients allows for flexible and easier recovery.

  52. [MoNa94d] Mohan, C., Narang, I. Fast Intersystem Page Transfer in a Data Sharing Environment with Record Locking, United States Patent 5,327,556, IBM, July 1994.

    A fast technique for transferring units of data between transaction systems in a shared disk environment. The owning system, having updated the page, generates a version number for the page which is stored with a lock possessed by the owning system. When a requesting system seeks a record on the page, its lock request elicits an indication that a more recent version of the page is required in its local memory. The buffer management component of the DBMS, with assistance from the lock manager, triggers a memory-to-memory transfer of the page from the owning DBMS to the requesting DBMS using a low-overhead communication protocol. The page is transferred without disk I/O and without log I/O for the updates made to the page.

  53. [Moha94] Mohan, C. Method for Providing Data Availability in a Transaction-Oriented System During Restart After a Failure, United States Patent 5,333,303, IBM, July 1994.

    Enhanced data availability occurs in a write-ahead logging, transaction-oriented database system by permitting new transactions to acquire access to data while restart recovery operations are proceeding. The invention permits new transactions to acquire access to data during restart recovery UNDO processing on the condition that the last update to the data occurred before a commit point measured by the earliest-commencing transaction with uncommitted updates which was still executing when a system failure initiated restart recovery operations. During REDO processing, a transaction is permitted access to data which, in addition to meeting the commit point condition, is not in a data structure subject to the REDO processing.

  54. [AKAEG94] Alonso, G., Kamath, M., Agrawal, D., El Abbadi, A., Guenthoer, R., Mohan, C. Failure Handling in Large Scale Workflow Management Systems, IBM Research Report RJ9913, IBM Almaden Research Center, November 1994.

    Workflow management deals with the coordinated execution of business processes. These processes are often of long duration, weeks or even months; are very large, involving many agents and distributed resources; and are critical to the enterprise. As organizations adopt workflow technology, they become increasingly dependent on the system to carry on their daily business activities. Hence, a workflow management system (WFMS) must provide availability, reliability and scalability to cope with environments where the range of potential failures is very broad, from system failures to semantic failures. In this paper, in the context of IBM's Exotica project, we propose a new architecture to address the availability and scalability issues, and to deal with system failures such as site and communication failures. We also discuss how features of advanced transaction models can be incorporated into workflow modeling to handle semantic failures. These solutions enhance the capabilities of workflow systems and make them more suitable for large enterprises.

  55. [MoNa95] Mohan, C., Narang, I. Locking and Latching Techniques for Transaction Processing Systems Supporting the Shared Disks Architecture, Research Report, IBM Almaden Research Center, February 1995.

    In several commercial database management systems, the shared disks (SD) architecture has been chosen to provide increased processing capacity and availability of data. In SD, any instance of a DBMS executing on a node of a cluster of loosely coupled processors can access all the data in the database. To coordinate accesses to the same data from the different systems, these instances use global locks. In addition, they implement a buffer coherency control protocol so that the current version of data is available locally when needed. It is important to reduce significant overheads associated with global locking and buffer coherency while supporting high concurrency. In this paper, we describe locking and latching techniques which would be used for synchronizing accesses to hot spot shared data structures in a typical relational DBMS in the SD environment. In particular, we describe synchronization techniques for operations involving B+-tree indexes and for updating space map pages. Many of these techniques have been implemented in Version 4 of DB2 for MVS/ESA. Our techniques are also applicable to the client-server context with client caching of data, as is done in many OODBMSs.

  56. [RaSM95] Rane, S., Seshadri, S., Mohan, C. Concurrency Control and Recovery Algorithms for hcC-trees, IBM Research Report, IBM Research - Almaden, February 1995.

    The hcC-tree is an index structure for multiple sets (on a common attribute), which is useful in object-oriented as well as relational databases. This paper focuses on the problem of designing concurrency control and recovery algorithms for the hcC-tree. This work is based on ARIES/KVL and ARIES/IM, which provide ACID properties for transactions containing multiple index operations on a B+-tree. hcC-trees, though based on B+-trees, are significantly different from them. In this paper, we propose an algorithm that provides high concurrency while still retaining the performance advantages of an hcC-tree.

  57. [Moha95a] Mohan, C. Disk Read-Write Optimizations and Data Integrity in Transaction Systems Using Write-Ahead Logging, Proc. 11th International Conference on Data Engineering, Taipei, March 1995. Also available as IBM Research Report RJ9741, IBM Research - Almaden, March 1994.

    We discuss several disk read-write optimizations that are implemented in different transaction systems and disk hardware to improve performance. These include: (1) When multiple sectors are written to disk, the sectors may be written out of sequence (SCSI disk interfaces do this). (2) Avoiding initializing pages on disk when a file is extended. (3) Not accessing individual pages during a mass delete operation (e.g., dropping an index from a file which contains multiple indexes). (4) Permitting a previously deallocated page to be reallocated without the need to read the deallocated version of the page from disk during its reallocation. (5) Purging of file pages from the buffer pool during a file erase operation (e.g., a table drop). (6) Avoiding logging for bulk operations like index create.

    We consider a system which implements the above optimizations and in which a page consists of multiple disk sectors and recovery is based on write-ahead logging using a log sequence number on every page. For such a system, we present a simple method for guaranteeing the detection of the partial disk write of a page. Detecting partial writes is very important not only to ensure data integrity from the users' viewpoint but also to make the transaction system software work correctly. Once a partial write is detected, it is easy to recover such a page using media recovery techniques. Our method imposes minimal CPU and space overheads. It has been implemented in DB2/6000 and AdStar Distributed Storage Manager (ADSM).

  58. [MAGK95] Mohan, C., Alonso, G., Guenthoer, R., Kamath, M. Exotica: A Research Perspective on Workflow Management Systems, Data Engineering Bulletin Special Issue on Infrastructure for Business Process Management, Vol. 18, No. 1, March 1995.

    In this paper we present the Exotica research project, currently in progress at the IBM Almaden Research Center. One of the goals of the project is to bring together industrial trends and research issues in the workflow area. It is for this reason that we have focused on a particular commercial product, FlowMark, IBM's workflow product. However, our results are easily generalized to general workflow management systems since FlowMark's model is similar to that proposed by the Workflow Management Coalition. In particular, the paper contains a high-level overview of our research in six specific areas that are not product specific.

  59. [JMMNT95] Josten, J., Masatani, T., Mohan, C., Narang, I., Teng, J. Efficient Data Base Access Using a Shared Electronic Store in a Multi-System Environment with Shared Disks, United States Patent 5,408,653, IBM, April 1995.

    A computer-implemented method is disclosed for minimizing the time needed to access current data in a database which may be stored wholly in a DASD-oriented external storage subsystem, or partly in DASD and partly in a high-speed electronic store, while maintaining coherency of the data with respect to multiple user systems.

  60. [AGKAE95] Alonso, G., Guenthoer, R., Kamath, M., Agrawal, D., El Abbadi, A., Mohan, C. Exotica/FMDC: Handling Disconnected Clients in a Workflow Management System, Proc. 3rd International Conference on Cooperative Information Systems, Vienna, May 1995.

    Current computer and network technologies allow organizations to decentralize resources in ways not foreseeable a few years ago. However, cooperative work in a decentralized environment requires tools to hide the complexity generated by heterogeneous and distributed systems. Workflow Management Systems (WFMS) are a first generation of products that attempt to manage the execution of business processes by large numbers of users distributed over a wide area and using heterogeneous resources. They are a very promising venue for collaborative systems but, in most cases, the autonomy of the users is greatly restricted due to architectural and design considerations. This is a severe restriction, especially when considering the emergence of mobile computing, and the increase in use of laptops and small computers in which connectivity is only occasional. In this paper, we discuss how disconnected workflow clients can be supported while preserving the correctness of the overall execution and allowing coordinated interaction between the different users. Disconnected clients provide a great deal of flexibility to a workflow management system and enhance its resilience to failures.

  61. [Moha95b] Mohan, C. A Method and Means for Detecting Partial Page Writes and Avoiding Initializing New Pages on DASD in a Transaction Management System Environment, United States Patent 5,418,940, IBM, May 1995.

    A method for detecting partial page writes in pages spanning multiple sectors of a sector-organized, multiple-track storage facility in a page-oriented, log-based transaction management system. During a page write from a buffer to storage, a status bit is embedded at the end of each page sector and a status byte in the last page sector; the status byte is complemented, and each status bit is swapped with its counterpart in the status byte as the page is written out to storage. During a page read from storage into the buffer, the status bit values of each page are swapped back with their byte counterparts, and a partial write is detected as a mismatch of the bits in the status byte. Page recovery involves recreating a page from the log upon detection of either a partial sector write or a partial page write, by redoing all accessing events on the log between a predetermined point and the end of the log, including unconditionally redoing all format-page events logged in that interval. A partial page write error is also detected when a page is allocated in the buffer while avoiding a page read from storage.

    This method has been implemented in DB2 Client Server (DB2/6000, DB2/2, ...) and AdStar Distributed Storage Manager (ADSM).

  62. [Moha95c] Mohan, C. Concurrency Control and Recovery Methods for B+-Tree Indexes: ARIES/KVL and ARIES/IM, Performance of Concurrency Control Mechanisms in Centralized Database Systems, V. Kumar (Ed.), Prentice Hall, 1995. Also available as IBM Research Report RJ9715, IBM Almaden Research Center, March 1994.

    This paper first describes the problems associated with concurrency control and recovery in B+ tree indexes and then presents two methods, called ARIES/KVL (Algorithm for Recovery and Isolation Exploiting Semantics using Key-Value Locking) and ARIES/IM (Algorithm for Recovery and Isolation Exploiting Semantics for Index Management), for solving those problems. These methods allow a transaction to perform any number of nonindex and index operations, including range scans. The concurrent executions permitted by them are such that serializability is guaranteed, and correct logging and recovery based on write-ahead logging are made possible. Compared to the index locking methods of System R and DB2, these methods support higher levels of concurrency during tree traversals, structure modifications, and other operations. By locking individual keys, rather than key values, ARIES/IM is able to support more concurrency than ARIES/KVL. Our methods have been implemented in the DB2 family of products, and in SQL/DS and the VM/ESA Shared File System.
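
    To convey the flavor of key-value locking, here is a simplified sketch of an insert using next-key locking; the actual ARIES/KVL and ARIES/IM protocols use particular lock modes and durations (instant, commit) that this sketch glosses over, and the interface shown is hypothetical.

    ```python
    import bisect

    def insert_key(sorted_keys, new_key, lock):
        """sorted_keys: in-order key list standing in for a B+-tree leaf scan.
        lock(resource, mode): hypothetical commit-duration lock request."""
        pos = bisect.bisect_right(sorted_keys, new_key)
        # Locking the next key (or an end-of-index sentinel) prevents a
        # concurrent range scan from serializing around the insert and
        # missing it, i.e., it closes the phantom window.
        next_key = sorted_keys[pos] if pos < len(sorted_keys) else "+inf"
        lock(next_key, "X")
        lock(new_key, "X")
        sorted_keys.insert(pos, new_key)
    ```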

  63. [AMGAE95] Alonso, G., Mohan, C., Guenthoer, R., Agrawal, D., El Abbadi, A., Kamath, M. Exotica/FMQM: A Persistent Message-Based Architecture for Distributed Workflow Management, Proc. IFIP WG8.1 Working Conference on Information Systems for Decentralized Organizations, Trondheim, August 1995. Also available as IBM Research Report RJ9912, IBM Research - Almaden, November 1994.

    In the past few years there has been an increasing interest in workflow applications as a way of supporting complex business processes in modern corporations. Given the nature of the environment and the technology involved, workflow applications are inherently distributed and pose many interesting challenges to the system designer. In most cases, a client/server architecture is used in which knowledge about the processes being executed is centralized in one node to facilitate monitoring, auditing, and to simplify synchronization. In this paper, we propose a novel distributed architecture, Exotica/FMQM, for workflow systems in which the need for such a centralized database is eliminated. Instead, we use persistent messages as the means to store the information relevant to the execution of a business process. Our approach is to completely distribute the execution of a process so individual nodes are independent. The advantages of this approach are increased resilience to failures and greater scalability and flexibility of the system configuration.

  64. [BCGHJ95] Bhide, A., Copeland, G., Goyal, A., Hsiao, H.-I, Jhingran, A., Mohan, C. Asynchronous Replica Management in Shared Nothing Architectures, United States Patent 5,440,727, IBM, August 1995.

    In a partitioned database system of the Shared Nothing type, one or more secondary replicas of each partition are maintained by spooling (i.e., asynchronously sending) modified (usually called dirty) pages from the primary replica to the secondary replica(s) rather than by using a synchronous page update or by sending log entries instead of entire pages. A Write-ahead Log protocol is used so that a dirty page is not forced to non-volatile storage until a log record of the modification is created and written to non-volatile storage. Replica updating does not delay the committing of transactions because replica updating is done asynchronously with respect to transaction processing. Since dirty pages are sent rather than only log entries, disk accesses and processing at the secondary replica(s) arising from the maintaining of the replicas are minimized as well. Only one centrally accessible log is maintained for all replicas of the same partition.

  65. [MAAEG95] Mohan, C., Agrawal, D., Alonso, G., El Abbadi, A., Guenthoer, R., Kamath, M. Exotica: A Project on Advanced Transaction Management and Workflow Systems, SIGOIS Bulletin Special Issue on "Business Process Management Systems: Concepts, Methods and Technology", Vol. 16, No. 1, August 1995.

    This paper presents an overview of the Exotica project currently in progress at the IBM Almaden Research Center. The project aims at exploring several research areas from advanced transaction management concepts to client/server architectures and mobile computing within the context of business processes and workflow management. The ultimate goal is to incorporate these ideas into IBM's products and prototypes. The project involves IBM groups in Almaden (U.S.A.), Hursley (U.K.), Boeblingen (Germany), and Vienna (Austria). In this paper, we briefly describe two IBM products, FlowMark, a workflow management system, and MQSeries, a messaging system, as the environments in which we are focusing our research. We also discuss some of our results in the areas of availability, replication, distribution, and advanced transaction models, as well as describe our future research directions.

  66. [MAGKR95] Mohan, C., Alonso, G., Guenthoer, R., Kamath, M., Reinwald, B. An Overview of the Exotica Research Project on Workflow Management Systems, Proc. 6th International Workshop on High Performance Transaction Systems, Asilomar, September 1995.

    In this paper, we present the Exotica research project currently in progress at the IBM Almaden Research Center. One of the goals of the project is to bring together industrial trends and research issues in the workflow area. It is for this reason that we have focused on a particular commercial product, FlowMark, IBM's workflow product. However, our results are easily generalized to other WFMSs since FlowMark's model is similar to that proposed by the Workflow Management Coalition.
     
  67. [DiMo95a] Dievendorff, D., Mohan, C. System and Method for Storing Persistent and Non-Persistent Queued Data and for Recovering the Persistent Data Responsive to a System Restart, United States Patent 5,452,430, IBM, September 1995.

    A data processing system for the storage of persistent and non-persistent data in a queue, and a method for the storage, on a single queue, of data which is required to survive a system failure (persistent data) and data which is not (non-persistent data), are disclosed. The method involves receiving persistent and non-persistent data to be stored in a queue, then marking the data in time sequence order, before storing the persistent data in a first set of data pages and the non-persistent data in a second set of data pages. Upon a request for removal of data from the queue, both the first and second sets of pages are checked and the data is removed in time sequence order. A log is preferably created to enable recovery in the event of failure and restart of the queue. When persistent data is received into or removed from the queue, log entries are made of changes to the persistent data only. Before the data is received, a table in space map pages is created indicating which pages available in storage are free, which are allocated for persistent data, and which are allocated for non-persistent data. After data is received or removed, the table is updated. In the event of a failure and restart of the queue, the space map page table is scanned and updated to indicate that all pages containing non-persistent data are free.

    This method has been implemented in Message Queue Manager/ESA (MQSeries).
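
    A compact sketch of the scheme, with hypothetical structures; the real implementation works in terms of pages, space map tables, and a write-ahead log rather than Python lists.

    ```python
    class RecoverableQueue:
        """Persistent and non-persistent messages live on separate page sets
        but share one sequence, so removal stays in arrival order; only
        persistent operations are logged."""
        def __init__(self, log):
            self.log, self.seq = log, 0
            self.persistent, self.transient = [], []   # the two page sets

        def put(self, msg, persistent):
            self.seq += 1
            entry = (self.seq, msg)
            if persistent:
                self.log.append(("put", entry))  # WAL for persistent only
                self.persistent.append(entry)
            else:
                self.transient.append(entry)

        def get(self):
            # Check both page sets and remove the oldest entry overall.
            candidates = [q for q in (self.persistent, self.transient) if q]
            if not candidates:
                return None
            oldest = min(candidates, key=lambda q: q[0][0])
            seq, msg = oldest.pop(0)
            if oldest is self.persistent:
                self.log.append(("get", seq))
            return msg

        def restart(self):
            # Pages holding non-persistent data are simply freed on restart.
            self.transient.clear()
    ```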

  68. [SBCM95] Samaras, G., Britton, K., Citron, A., Mohan, C. Two-Phase Commit Optimizations in a Commercial Distributed Environment, Distributed and Parallel Databases - An International Journal, Vol. 3, No. 4, October 1995.

    An atomic commit protocol can ensure that all participants in a distributed transaction reach consistent states, whether or not system or network failures occur. The atomic commit protocol used in industry and academia is the well-known two-phase commit (2PC) protocol, which has been the subject of considerable work and technical literature for some years.

    Much of the literature focuses on improving performance in failure cases by providing a non-blocking 2PC that streamlines recovery processing at the expense of extra processing in the normal case. We focus on improving performance in the normal case based on two assumptions: first, that networks and systems are becoming increasingly reliable, and second, that the need to support high-volume transactions requires a streamlined protocol for the normal case.

    In this paper, various optimizations are presented and analyzed in terms of reliability, savings in log writes and network traffic, and reduction in resource lock time. The paper's unique contributions include the description of some optimizations not described elsewhere in the literature and a systematic comparison of the optimizations and the environments where they cause the most benefit. Furthermore, it analyzes the feasibility and performance of several optimization combinations, identifying situations where they are effective.

  69. [MoNT95] Mohan, C., Narang, I., Teng, J. Partial Page Write Detection for a Shared Cache Using a Bit Pattern Written at the Beginning and End of Each Page, United States Patent 5,455,942, IBM, October 1995.

    Disk check bits refer to bit patterns stored in particular bytes of a page which are used to detect errors in writing the page to storage. Every time a page is obtained from storage, changed from the version retained in storage, and written back to storage, the check bit pattern on the changed page is altered to be different from the bit pattern on the storage page. This is because the changed page overwrites the stored page. The invention provides a method for managing the check bits in a multi-DBMS system employing a high-speed shared electronic store as a store-in cache for all pages obtained from disk storage. When a page is first obtained from disk storage by a DBMS and changed, check bit information for the page is maintained in a directory of the storing cache which indicates what the patterns are for the version of the page in the disk storage. All pages which are modified are stored in the store-in cache and are only returned to disk storage from the cache. Therefore, when a page is to be written to disk storage, the DBMS writing the page to storage processes the check bits on the page itself, changing them as required based on the check bit information stored in the directory for the page.

    This method is part of the S/390 Parallel Sysplex Coupling Facility. It is exploited by DB2/MVS V4.
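
    The essence of the check-bit test can be sketched as follows (hypothetical page layout; the patent additionally keeps the expected pattern in the shared cache's directory so any DBMS can fix up the bits before a disk write):

    ```python
    def write_page_to_disk(page, disk):
        # Flip the check bit on every write, and stamp it at both ends of
        # the page image; a torn write leaves the two ends disagreeing.
        page.check_bit ^= 1
        image = bytes([page.check_bit]) + page.body + bytes([page.check_bit])
        disk[page.id] = image

    def partial_write_detected(image):
        # On read, mismatched first/last check bits reveal a partial write.
        return image[0] != image[-1]
    ```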

  70. [HaMo95] Haderle, D., Mohan, C. Method for Managing Logging and Locking of Page Free Space Information in a Transaction Processing System, United States Patent 5,455,944, IBM, October 1995.

    Database files containing records include pages called free space inventory pages (FSIPs) describing free space information relating to data pages. In a transaction processing system, the invention provides correct sequences for logging of updates to FSIPs when the updates are required by updates or UNDOs to data records. If, during an operation to insert a data record into a data page, the FSIP containing free space information for the page indicates that the page is empty and there are no uncommitted deletes to the page, page I/O is avoided by formatting the page directly in a data buffer pool without reading the page from disk. During a cursor stability-level table scan with data record-level locking, excessive I/O and some record locking are avoided by using space reservation fields on an FSIP to ensure that there is no space reserved on the data page for a later undo of uncommitted data record deletes from the page.

    This method has been implemented in DB2/MVS V3.

  71. [MoNa95a] Mohan, C., Narang, I. Method and Means for Archiving Modifiable Pages in a Log Based Transaction Management System, United States Patent 5,455,946, IBM, October 1995.

    A method and means for archiving files of modifiable pages in a log-based, phased-commit transaction management system (TMS) in which pages that have been modified since the last full or incremental backup require, during the copy operation, no modifications to the pages themselves but merely to a common status page. This is accomplished by management of a pair of global log sequence numbers. Comparison between a first number (ICBU_LSN) and each data page's LSN as the page is modified permits the common status page to be updated to correctly reflect the changed status. Subsequent modifications to the same page do not require amendment of the status page. The status page indicia are reset as part of the backup procedure and are used for ascertaining the page copy set for incremental copying. The ICBU_LSN assumes one of two values as a function of the copy operation and another value for processing page modifications after the copy operation. A second number (ICRF_LSN) is used in the restoration of a file after the file has been partially restored by a page merge, in page number order, from full and incremental copies. In this case, the ICRF_LSN defines the point in the log from which redo is performed after the most recent copy was made.

    This method has been implemented in the ADSTAR Distributed Storage Manager (ADSM) and in DB2/MVS V4.
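
    A sketch of the ICBU_LSN comparison on the update path, with hypothetical structures; the patent's handling of the two ICBU_LSN values around a copy operation is simplified away.

    ```python
    def before_page_update(page, status_page, icbu_lsn):
        # A page whose LSN still predates ICBU_LSN is being modified for the
        # first time since the last backup: record that fact once on the
        # common status page. Later updates to the page need no extra work.
        if page.lsn < icbu_lsn:
            status_page.changed_since_backup.add(page.id)

    def incremental_copy_set(status_page):
        # The status page, not the data pages, tells the utility what to copy.
        return sorted(status_page.changed_since_backup)
    ```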

  72. [DiMo95b] Dievendorff, D., Mohan, C. Fault-Tolerant Transaction-Oriented Data Processing, United States Patent 5,465,328, IBM, November 1995.

    In transaction processing systems, it is known for resource-updating operations within a transaction to be backed out at the request of an application program following detection of error conditions during processing of the transaction. If the error condition is very likely to recur, it may be undesirable for the operation request to be presented to the application exactly as before. A transaction-oriented data processing system and a method of transaction-oriented data processing are provided in which operation requests or data packets may be marked to be excluded from the effects of application-requested backouts.

    This method has been implemented in Message Queue Manager/ESA (MQSeries).

  73. [AAEKG96] Alonso, G., Agrawal, D., El Abbadi, A., Kamath, M., Guenthoer, R., Mohan, C. Advanced Transaction Models in Workflow Contexts, Proc. 12th International Conference on Data Engineering, New Orleans, February 1996. Also available as IBM Research Report RJ9970, IBM Research - Almaden, July 1995.

    In recent years, numerous transaction models have been proposed to address the problems posed by advanced database applications. A few of these models have been implemented as prototypes but almost none are being used in a commercial product. In this paper, we make the case that such models are too centered around databases to be useful in real environments. Many of the new applications are heterogeneous, both in the supporting platforms and tools involved, and distributed over a wide geographic area. They raise a variety of issues that are not addressed at all by transaction models, which may explain the lack of success of the latter. These same issues, however, are the basis for many existing workflow systems, which are having considerable success as commercial products in spite of not having a solid theoretical foundation. We explore some of these issues and show that, in many aspects, workflow models are a superset of transaction models and have the added advantage of incorporating many ideas that to this date have remained outside the scope of traditional transaction processing.

  74. [ReMo96] Reinwald, B., Mohan, C. Structured Workflow Management with Lotus Notes Release 4, Proc. 41st IEEE Computer Society International Conference (CompCon 96), Santa Clara, pp. 451-457, February 1996.

    In this paper we describe the design and implementation of workflow management applications on top of Lotus Notes Release 4. We elaborate on various design issues for Notes workflow applications and introduce Notes Release 4's native workflow concepts, such as agents, events, macros, LotusScript, OLE2 capabilities, and doclinks, which make Notes a powerful workflow tool. The idea of the paper is to use the Workflow Reference Model of the Workflow Management Coalition to define structured workflows, and to execute these workflows by exploiting Notes Release 4's native workflow concepts.

  75. [EFMNN96] Elko, D., Frey, J., Mohan, C., Narang, I., Nick, J., Strickland, J., Swanson, M. Multiple Processor System having Software for Selecting Shared Cache Entries of an Associated Castout Class for Transfer to a DASD with One I/O Operation, United States Patent 5,493,668, IBM, February 1996.

    A high-speed cache is shared by a plurality of independently-operating data systems in a multi-system data sharing complex. Each data system has access both to the high-speed cache and the lower-speed, secondary storage for obtaining and storing data. Management logic in the high-speed cache assures that a block of data obtained from the cache for entry into the secondary storage will be consistent with the version of the block of data in the shared cache, with non-blocking serialization allowing access to a changed version in the cache while castout is being performed. Castout classes are provided to facilitate efficient movement from the shared cache to DASD.

    This method is part of the S/390 Parallel Sysplex Coupling Facility. It is exploited by DB2/MVS V4.

  76. [NaIM96] Narang, I., Iyer, B., Mohan, C. A Method to Off-Load Host-Based DBMS Predicate Evaluation to a Disk Controller, United States Patent 5,495,601, IBM, February 1996.

    A method is disclosed for a database system to off-load, to a disk controller, the extraction of committed data. The system first picks a Commit_LSN value and ensures that all data modified prior to the Commit_LSN value is processed in accordance with the DBMS policy of whether or not to reduce disk I/Os for the modified pages cached in the system. If the policy is not to do disk I/Os for such pages, then the system places the identifiers of those pages in an ignore list. Otherwise, the system writes those pages to disk and empties the ignore list. Afterwards, the system forwards the ignore list and the Commit_LSN, along with information regarding the data to be processed, to the controller. The controller performs the off-load function by reading from disk every page identified by the system except those in the ignore list and determining, for each page, whether the page's Page_LSN value is less than the Commit_LSN. If it is, the controller processes the page and adds any qualifying data from that page to a defined answer set. Otherwise, the controller adds the Page_ID for that page to a defined exception list. The controller then passes the answer set and the exception list to the system. The system processes the pages identified in the exception list and those in the ignore list, and consolidates these answers with the answer set returned by the controller for presentation to the user.
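
    The controller side of the protocol is simple enough to sketch directly (hypothetical interfaces):

    ```python
    def controller_scan(disk_pages, page_ids, ignore_list, commit_lsn, predicate):
        """Pages with Page_LSN older than Commit_LSN are filtered in the
        controller; newer pages are returned as exceptions for the host
        DBMS to process itself."""
        answers, exceptions = [], []
        for pid in page_ids:
            if pid in ignore_list:
                continue              # host processes these from its cache
            page = disk_pages[pid]
            if page.lsn < commit_lsn:
                # Only committed data can be on this page: evaluate the
                # predicate here and ship back just the qualifying rows.
                answers.extend(row for row in page.rows if predicate(row))
            else:
                exceptions.append(pid)  # may hold uncommitted data
        return answers, exceptions
    ```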

  77. [KAGM96] Kamath, M., Alonso, G., Guenthoer, R., Mohan, C. Providing High Availability in Very Large Workflow Management Systems, Proc. 5th International Conference on Extending Database Technology, Avignon, March 1996, pp. 427-442. Also available as IBM Research Report RJ9967, IBM Almaden Research Center, July 1995.

    Workflow management systems (WFMSs) support the modeling, coordinated execution and monitoring of business processes within an organization. In particular, very large workflow management systems are used in organizations where the number of users may be in the tens of thousands, the number of process instances in the hundreds of thousands, and the number of sites in the thousands, all distributed over wide geographic areas. In these environments, failure of the WFMS or the underlying workflow database which stores the meta-information about the processes is not tolerable, and hence continuous availability is a key aspect of the system. This paper addresses the problem of providing high availability in workflow management systems by proposing a backup technique which ensures that execution of a process instance can be resumed at any point in time in the event of failures. An essential characteristic of our backup scheme is that it allows the user to define different availability levels in order to avoid high costs for maintaining backups for all process instances. The backup scheme to support the different availability levels is implemented using the workflow semantics, which we believe will (i) make it independent of the underlying workflow database, thus permitting the use of heterogeneous databases as primary and backup, and (ii) reduce overheads, especially when compared to backup schemes provided by database systems.

  78. [AGKAE96] Alonso, G., Guenthoer, R., Kamath, M., Agrawal, D., El Abbadi, A., Mohan, C. Exotica/FMDC: A Workflow Management System for Mobile and Disconnected Clients, Distributed and Parallel Databases - An International Journal, Vol. 4, No. 3, pp. 229-247, July 1996.

    Workflow Management Systems (WFMSs) automate the execution of business processes in environments encompassing large numbers of users distributed over a wide geographic area and using heterogeneous resources. Current implementations allow the definition and controlled execution of complex and long-lived business processes as the basis for an enterprise-wide collaborative system but, in most cases, the autonomy of the users is greatly restricted due to architectural and design considerations. In particular, existing systems are built around a centralized server. As a result, users need to maintain an uninterrupted connection with the server to perform the different tasks assigned to them. This is a severe restriction, especially when considering the emergence of mobile computing, and the increase in use of laptops and small computers which are connected to the network only occasionally and which will, undoubtedly, be the tool of choice for many users. This paper addresses the problem of supporting disconnected workflow clients in large workflow management systems while still preserving the correctness of the overall execution and allowing coordinated interactions between the different users regardless of their location.

  79. [EFIMN96] Elko, D., Frey, J., Isenberg, J., Mohan, C., Narang, I., Nick, J., Strickland, J., Swanson, M. Sysplex Shared Data Coherency Method, United States Patent 5,537,574, IBM, July 1996.

    A method for controlling coherence of data elements sharable among a plurality of independently-operating CPCs (central processing complexes) in a multi-system complex (called a parallel sysplex) which contains sysplex DASDs (direct access storage devices) and a high-speed SES (shared electronic storage) facility. Sysplex shared data elements are stored in the sysplex DASD under a unique sysplex data element name, which is used for sysplex coherency control. Any CPC may copy any sysplex data element into a local cache buffer (LCB) in the CPC's main storage, where it has an associated sysplex validity bit. The copying CPC executes a sysplex coherence registration command which requests a SES processor to verify that the data element name already exists in the SES cache, and to store the name of the data element in a SES cache entry if found in the SES cache. Importantly, the registration command communicates to SES the CPC location of the validity bit for the LCB containing that data element copy. Each time another copy of the data element is stored in any CPC LCB, a registration command is executed to store the location of that copy's CPC validity bit into a local cache register (LCR) associated with its data element name. In this manner, each LCR accumulates all CPC locations for all LCB validity bits for all valid copies of the associated data element in the sysplex - for maintaining data coherency throughout the sysplex.

    This method is part of the S/390 Parallel Sysplex Coupling Facility. It is exploited by DB2/MVS V4.

  80. [ChMo96a] Choy, D., Mohan, C. Multi-Tiered Indexing Method for Partitioned Data, United States Patent 5,551,027, IBM, August 1996.

    A multi-tier indexing method is disclosed for a partitioned table in a parallel or distributed database system. A Local Index is created and maintained for each partition of the table, and a Coarse Global Index is created and maintained. The Coarse Global Index identifies the indexed partition(s) by partition identifiers (PIDs) and associates the individual Index Key Values with their target partitions so that an access request with a highly partition-selective search predicate on the Index Key can be quickly and easily directed to the target partition(s) for processing. An index maintenance locking protocol is also disclosed which handles the insertion and deletion of index entries and assures the consistency between the Local Index entries and the Global Index entries during concurrent index accesses by different transactions. The locking protocol limits locking to the inserted or deleted key, the key following it, and possibly the key preceding it, allowing high concurrency among simultaneous Readers, Inserters, and Deleters. This method enhances the efficiency of complex query evaluation and index maintenance and attains a high throughput for transaction processing.
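
    In sketch form, a highly selective probe touches only the partitions named by the coarse global index (hypothetical structures):

    ```python
    def two_tier_lookup(coarse_global_index, local_indexes, key):
        rids = []
        # The coarse global index maps a key value to the partition(s)
        # that may contain it, identified by PIDs.
        for pid in coarse_global_index.get(key, ()):
            # Only the local indexes of the target partitions are probed.
            rids.extend(local_indexes[pid].get(key, ()))
        return rids
    ```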

  81. [ChMP96a] Cheng, J., Mohan, C., Pirahesh, H. Program Storage Device and Computer Program Product for Outer Join Operations Using Responsibility Regions Assigned to Inner Tables in a Relational Database, United States Patent 5,551,031, IBM, August 1996.

    A computer database system utilizes a method for performing a right outer join of database tables without sorting the inner table (T2). The processing of each tuple in the outer table (T1) includes the preservation in the join output of all tuples in T2 which are in its responsibility region. The initialization step of the process preserves in the join output all of the tuples in T2 which have column set values less than the lowest column set value in T1, i.e. the first tuple in T1, since T1 is sorted or accessed using a sorted index. The responsibility region for tuples in T1, other than the last tuple, is defined as those tuples which have column set values less than the column set value for the next tuple in T1 and greater than or equal to the column set value for the current T1 tuple. The last tuple in T1 must preserve all of the tuples in T2 which have not already been preserved, i.e. all tuples greater than or equal to its column set value. If T1 has duplicate values for the column set value, only the last one preserves the associated T2 tuple. Additional methods for parallel execution of the outer join methods and methods for applying the outer join methods to subqueries (i.e., an All (or universal) Right Join (ARJOIN) and an Existential Right Join (ERJOIN)) are described.
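
    A simplified sketch of the responsibility-region idea for an equijoin (hypothetical interfaces; the parallel execution and ARJOIN/ERJOIN variants are not shown):

    ```python
    def right_outer_join(t1, t2, key1, key2):
        """t1 (outer) is processed in sorted order; t2 (inner) is never
        sorted. Each t1 tuple preserves the unmatched t2 tuples in its
        region, so every t2 tuple reaches the output exactly once."""
        t1 = sorted(t1, key=key1)
        out = []
        # Initialization: preserve t2 tuples below the lowest t1 key value.
        for r2 in t2:
            if not t1 or key2(r2) < key1(t1[0]):
                out.append((None, r2))             # null-extended
        for i, r1 in enumerate(t1):
            lo = key1(r1)
            nxt = key1(t1[i + 1]) if i + 1 < len(t1) else None
            last_of_key = nxt is None or nxt != lo  # dups defer to the last
            for r2 in t2:
                k = key2(r2)
                if k == lo:
                    out.append((r1, r2))           # matching pair
                elif last_of_key and k > lo and (nxt is None or k < nxt):
                    out.append((None, r2))         # preserved in this region
        return out
    ```

    This nested-loop form is quadratic and purely illustrative; the point is the region bookkeeping that lets T2 stay unsorted while still being fully preserved.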

  82. [MoNa96] Mohan, C., Narang, I. Method for Non-Hierarchical Lock Management in a Multi-System Shared Data Environment, United States Patent 5,551,046, IBM, August 1996.

    In a combination of multiple concurrently-executing database management systems which share data storage resources, efficient lock processing for shared data is implemented by hiding from a global lock manager the distinction between transaction-interest and cache-interest locks that are processed at the DBMS level. The local lock manager of a DBMS, in response to a request for either type of lock, may issue a request to the global lock manager for a system-level lock without disclosing to the global lock manager the type of lock requested at the local lock manager. After receipt of the system-level lock, the local lock manager can grant either transaction or cache interest locks locally on a data resource if the combined mode of locally-held locks on that data resource is greater than or equal to the requested mode.
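
    A simplified sketch of the local-grant test; the mode table below is hypothetical and coarser than a real lattice, which would compute the supremum of the held modes rather than testing them one at a time.

    ```python
    # Which requested modes each held mode is strong enough to cover.
    COVERS = {
        "X":  {"X", "S", "IX", "IS"},
        "S":  {"S", "IS"},
        "IX": {"IX", "IS"},
        "IS": {"IS"},
    }

    def grant_locally(held_modes, requested):
        """If some lock this DBMS already holds on the resource covers the
        requested mode, grant locally; otherwise ask the global lock
        manager. The global lock manager never learns whether the request
        expresses transaction interest or cache interest."""
        return any(requested in COVERS[m] for m in held_modes)
    ```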

  83. [ChMo96b] Choy, D., Mohan, C. Locking Protocols for Two-Tier Indexing of Partitioned Data, Proc. International Workshop on Advanced Transaction Models and Architectures, Goa, August-September 1996.

    In a parallel or a distributed database management system, a relation is often horizontally partitioned across multiple nodes. To index a partitioned relation, usually either a global index is maintained for the entire relation or, alternatively, individual local indexes are maintained, one for each partition of the relation. The former is costly to maintain because of remote updates and is also costly to use for complex queries, whereas the latter wastes computing resources on highly selective database searches, such as those done in transaction processing, and is therefore not a scalable solution.

    In this paper, a two-tier index is proposed, consisting of local indexes and a coarse global index. This index method is suitable for transaction processing as well as for query processing. Not only is it more efficient to maintain and use than the conventional methods, but it also exploits parallelism, and it is more versatile, more scalable, and easier for a non-partitioned DBMS to migrate to. To maintain consistency between the local indexes and the coarse global index, efficient locking protocols are presented, which are designed to allow a high level of concurrency.


  84. [MBCS96] Mohan, C., Britton, K., Citron, A., Samaras, G. Generalized Presumed Abort: Marrying Presumed Abort and SNA's LU 6.2 Commit Protocols, Proc. International Workshop on Advanced Transaction Models and Architectures, Goa, August-September 1996. Also available as IBM Research Report RJ8684, IBM Research - Almaden, March 1992.

    In this paper, we first describe R*'s Presumed Abort (PA) commit protocol and SNA's LU 6.2 commit protocol (also called Presumed Nothing (PN)). PA has been widely adopted by different vendors and academic researchers. It is now part of the ISO-OSI and X/Open distributed transaction processing standards. We point out the differences between PA and PN in terms of their features and the underlying models of distributed computations. PA was developed in the context of a distributed data base management system with a restricted model of distributed computation, in which there are permanent master-slave relationships amongst the processes of the computation. PN was developed in the context of a more general model of distributed computation, with peer-to-peer relationships amongst the processes. We describe how the two protocols may be merged to benefit from the optimizations present in PA and PN, and to support the more general distributed computation model of PN. This merged protocol is called Generalized Presumed Abort (GPA). GPA has been designed so that it can coexist with the existing PN protocol, in the sense that the same transaction may execute in some sites using GPA and in other sites using PN.

    GPA is now part of the IBM LU6.2 and DRDA architectures. It has been implemented in DB2 V3.

  85. [ChMP96b] Cheng, J., Mohan, C., Pirahesh, H. Outer Join Operations Using Responsibility Regions Assigned to Inner Tables in a Relational Database, United States Patent 5,557,791, IBM, September 1996.

    A computer database system utilizes a method for performing a right outer join of database tables without sorting the inner table (T2). The processing of each tuple in the outer table (T1) includes the preservation in the join output of all tuples in T2 which are in its responsibility region. The initialization step of the process preserves in the join output all of the tuples in T2 which have column set values less than the lowest column set value in T1, i.e., that of the first tuple in T1, since T1 is sorted or accessed using a sorted index. The responsibility region for each tuple in T1, other than the last tuple, is defined as those tuples which have column set values less than the column set value of the next tuple in T1 and greater than or equal to the column set value of the current T1 tuple. The last tuple in T1 must preserve all of the tuples in T2 which have not already been preserved in the join output, i.e., all tuples greater than or equal to its column set value. If T1 has duplicate values for the column set value, only the last one preserves the associated T2 tuple. Additional methods for parallel execution of the outer join methods and methods for applying the outer join methods to subqueries (i.e., an All (or universal) Right Join (ARJOIN) and an Existential Right Join (ERJOIN)) are described.

  86. [AlRM96] Alonso, G., Reinwald, B., Mohan, C. Distributed Data Management in Workflow Environments, Proc. 7th International Workshop on Research Issues on Data Engineering: Transaction and Query Processing, Birmingham, April 1997.

    Most existing workflow management systems (WFMSs) are based on a client/server architecture. This architecture simplifies the overall design, but it does not match the distributed nature of workflow applications and imposes severe limitations in terms of scalability and reliability. Moreover, workflow engines are not very sophisticated in terms of data management, overlooking the fact that workflow is, to a great extent, dataflow. In this paper, we propose a novel architecture to address the issue of data management in a WFMS. This architecture is based on a fully distributed workflow engine for control flow, plus a set of loosely synchronized replicated databases for dataflow. The resulting system offers greater robustness and reliability as well as much better data handling capabilities than existing approaches. To better illustrate this novel architecture and its implications, two commercial systems are employed in this paper: FlowMark, as the workflow engine, and the replication capabilities of Lotus Notes, as the support system for distributed data management.

  87. [KoMH96] Kornacker, M., Mohan, C., Hellerstein, J. Concurrency and Recovery in Generalized Search Trees, Proc. ACM SIGMOD International Conference on Management of Data, Tucson, May 1997.

    This paper presents general algorithms for concurrency control in tree-based access methods as well as a recovery protocol and a mechanism for ensuring repeatable read. The algorithms are developed in the context of the Generalized Search Tree (GiST) data structure, an index structure supporting an extensible set of queries and data types. Although developed in a GiST context, the algorithms are generally applicable to many tree-based access methods. The concurrency control protocol is based on an extension of the link technique originally developed for B-trees, and completely avoids holding node locks during I/Os. Repeatable read isolation is achieved with a novel combination of predicate locks and two-phase locking of data records. To our knowledge, this is the first time that isolation issues have been addressed outside the context of B-trees. A discussion of the fundamental structural differences between B-trees and more general tree structures like GiSTs explains why the algorithms developed here deviate from their B-tree counterparts. An implementation of GiSTs emulating B-trees in DB2/Common Server is underway.  
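
    The link technique that the paper extends can be sketched in miniature: each node carries a right-sibling pointer and a high key, so a traversal that lands on a node split during an I/O recovers by moving right rather than holding node locks across the I/O. The structure below is an illustrative simplification, not the GiST protocol itself.

      class Node:
          def __init__(self, keys, children=None, right=None, high_key=None):
              self.keys = keys          # sorted keys (separators if internal)
              self.children = children  # None for leaf nodes
              self.right = right        # right sibling: the "link"
              self.high_key = high_key  # upper bound of keys on this node

      def search(node, key):
          while node is not None:
              # Key beyond the high key means the node was split after we
              # obtained its address: chase the right link, no restart.
              if node.high_key is not None and key > node.high_key:
                  node = node.right
                  continue
              if node.children is None:            # leaf
                  return key in node.keys
              for i, sep in enumerate(node.keys):  # descend internal node
                  if key <= sep:
                      node = node.children[i]
                      break
              else:
                  node = node.children[-1]
          return False

      leaf1 = Node([1, 2], high_key=2)
      leaf2 = Node([3, 5])
      leaf1.right = leaf2
      root = Node([2], children=[leaf1, leaf2])
      assert search(root, 5)       # normal descent
      assert search(leaf1, 3)      # lands on a "split" node, moves right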

  88. Alonso, G., Agrawal, D., El Abbadi, A., Mohan, C. Functionalities and Limitations of Current Workflow Management Systems, Research Report, IBM Research - Almaden, 1997.

    Workflow systems hold the promise of facilitating the everyday operation of many enterprises and work environments. As a result, many commercial workflow management systems have been developed. These systems, although useful, do not scale well, have limited fault-tolerance, and are inflexible in terms of interoperating with other workflow systems. In this paper, we discuss the limitations of contemporary workflow management systems, and then elaborate on various directions for research and potential future extensions to the design and modeling of workflow management systems.

  89. [JMNT97] Josten, J., Mohan, C., Narang, I., Teng, J. DB2's Use of the Coupling Facility for Data Sharing, IBM Systems Journal, Volume 36, Number 2, 1997.

    We examine the problems encountered in extending DB2/MVS, an industrial-strength relational data base management system originally designed for a single-system environment, to support the multi-system shared data architecture. The multi-system data sharing function was delivered in DB2 V4R1. DB2 data sharing requires an IBM S/390 Parallel Sysplex environment because DB2's use of the coupling facility technology plays a central role in delivering highly efficient and scalable data sharing functions. We call this the shared data architecture, since the coupling facility is the distinctive feature employed in it.

  90. Alonso, G., Mohan, C. Workflow Management: The Next Generation of Distributed Processing Tools, Chapter 2 in Advanced Transaction Models and Architectures, S. Jajodia, L. Kerschberg (Eds.), Kluwer Academic Publishers, 1997.

    Workflow management systems have attracted a great deal of attention due to their ability to integrate heterogeneous, distributed applications into coherent business processing environments. In spite of their limitations, existing products are enjoying considerable success, but it would be a mistake not to try to see beyond current systems and applications. In today's computer environments, the trend towards using many small computers instead of a few big ones has revived the old dream of distributed computing. There is, however, a significant lack of tools for implementing, operating and maintaining such systems. In particular, there are no good programming paradigms for parallel architectures in which the basic building blocks are stand-alone systems. Workflow management provides this key functionality, suggesting its potential as a crucial component of any distributed environment. This chapter describes this functionality in detail and provides some insight into how it can be applied in environments other than business processing.
     
  91. Mohan, C. Recent Trends in Workflow Management Products, Standards and Research, Proc. NATO Advanced Study Institute (ASI) on Workflow Management Systems and Interoperability, Istanbul, August 1997. NATO ASI Series, Series F: Computer and Systems Sciences, Volume 164, Springer Verlag, 1998.

    In the last few years, workflow management (WFM) has been the focus of intense activity in terms of products, standards and research work worldwide. Work in many areas of computer science impacts workflow management. Many workflow companies and research groups are in existence now. Several conferences and workshops are being held regularly. In this paper, I briefly summarize the recent trends in WFM products, standards and research. I address technical as well as business trends.
     
  92. Mohan, C. Workflow Management in the Internet Age, Advances in Databases and Information Systems, W. Litwin, T. Morzy, G. Vossen (Eds.), Lecture Notes in Computer Science Volume 1475, pp. 26-34, Springer Verlag. Proc. 2nd East-European Symposium on Advances in Databases and Information Systems (ADBIS'98), Poznan, Poland, September 1998.
  93. [Moha99a] Mohan, C. Repeating History Beyond ARIES, Invited paper on receiving the 10-Year Best Impact Paper Award, Proc. 25th International Conference on Very Large Data Bases, Edinburgh, September 1999.

    In this paper, I first describe the background behind the development of the original ARIES recovery method, and its significant impact on the commercial world and the research community. Next, I provide a brief introduction to the various concurrency control and recovery methods in the ARIES family of algorithms. Subsequently, I discuss some of the recent developments affecting the transaction management area and what these mean for the future. In ARIES, the concept of repeating history turned out to be an important paradigm. As I examine where transaction management is headed in the world of the internet, I observe history repeating itself, in the sense that requirements that used to be considered significant in the mainframe world (e.g., performance, availability and reliability) are now becoming important requirements of the broader information technology community as well.
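
    The repeating-history paradigm can be illustrated with a deliberately minimal sketch (no CLRs, checkpoints, or page-oriented undo, so this is far from ARIES itself): restart redo reapplies every logged update, including those of loser transactions, and only then does the undo pass roll the losers back.

      def restart(log, db, losers):
          """log: records as dicts with lsn, page, txn, before, after, in
          LSN order. db: {page: {"lsn": ..., "value": ...}}."""
          for rec in log:                          # redo pass: repeat history
              page = db[rec["page"]]
              if page["lsn"] < rec["lsn"]:         # update missing from page
                  page["value"] = rec["after"]
                  page["lsn"] = rec["lsn"]
          for rec in reversed(log):                # undo pass: losers only
              if rec["txn"] in losers:
                  db[rec["page"]]["value"] = rec["before"]
          return db

      log = [{"lsn": 1, "page": "P1", "txn": "T1", "before": 0, "after": 7},
             {"lsn": 2, "page": "P1", "txn": "T2", "before": 7, "after": 9}]
      db = {"P1": {"lsn": 0, "value": 0}}          # crash lost both updates
      assert restart(log, db, losers={"T2"})["P1"]["value"] == 7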
     
  94. Bouganim, L., Fabret, F., Mohan, C., Valduriez, P. Dynamic Scheduling of Complex Distributed Queries, Proc. 15èmes Journées Bases de Données Avancées (BDA'99), Bordeaux, October 1999. Also available as INRIA Technical Report No. RR-3677, INRIA Rocquencourt, April 1999.
  95. Beyer, K., Cochrane, R., Pirahesh, H., Sidle, R., Shanmugasundaram, J., Mohan, C., Salem, K. Immediate Propagate Deferred Apply for Incremental Maintenance of Materialized Views, IBM Research Report RJ10173, December 1999.
  96. Mohan, C. System and Method for Performing Record Deletions Using Index Scans, United States Patent 6,009,425, IBM, December 1999.

    A data manager of a relational database management system (RDBMS) receives a command, such as a DELETE. As the RDBMS processes this command, an index manager looks at cursor control block information about the scan position (page number, logical key position within the page, and log sequence number (LSN) of the page) and immediately accesses the corresponding leaf page to do the key deletion. If the page's log sequence number has not changed since the scan was positioned on it, the index manager knows precisely where the key is and deletes it right away. Even if the page's log sequence number has changed, the index manager checks to see if the key is still on the same page. Only if the key is no longer on that page will the index manager traverse the tree from the root to locate the key. Using this same information, together with a return_current flag in the cursor control block, the method can easily determine the next key in the sequence. The return_current flag is set when the current key is deleted. If the leaf page's log sequence number has not changed since the deletion and the return_current flag is set, the method of the invention establishes that the key which now resides in the same logical position as the previously deleted key is the next key, and may be returned as such. If the LSN is the same and the return_current flag is not set, then the next key is returned following locking. If the LSN has changed but the key is still bounded by the page, the next key is searched for and located, a lock is asserted (if necessary), and the key is returned. Only if the page's log sequence number has changed and the previous key is no longer bounded by the page will the index manager traverse the index tree to locate the next key, lock the key (if necessary), and return it.
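
    The fast-path decision can be sketched as follows; the data structures are illustrative stand-ins (a real index would traverse root to leaf in the fallback), while the LSN comparison mirrors the abstract.

      from dataclasses import dataclass

      @dataclass
      class LeafPage:
          lsn: int
          keys: list

      @dataclass
      class Cursor:            # remembered scan position
          page_no: int
          slot: int
          lsn: int
          key: int

      def locate_key(pages, cursor):
          """Return (page_no, slot) of cursor.key, avoiding a root-to-leaf
          traversal whenever the remembered position is still valid."""
          page = pages[cursor.page_no]
          if page.lsn == cursor.lsn:
              return cursor.page_no, cursor.slot   # page unchanged: fast path
          if cursor.key in page.keys:              # page changed, key still here
              return cursor.page_no, page.keys.index(cursor.key)
          for no, p in pages.items():              # stand-in for a root-to-leaf
              if cursor.key in p.keys:             # traversal of the index tree
                  return no, p.keys.index(cursor.key)
          return None

      pages = {7: LeafPage(lsn=42, keys=[10, 20, 30])}
      cur = Cursor(page_no=7, slot=1, lsn=42, key=20)
      assert locate_key(pages, cur) == (7, 1)      # no traversal needed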
     
  97. Bouganim, L., Fabret, F., Mohan, C., Valduriez, P. Dynamic Query Scheduling in Data Integration Systems, Proc. 16th International Conference on Data Engineering, San Diego, March 2000.

    Execution plans produced by traditional query optimizers for data integration queries may yield poor performance for several reasons. The cost estimates may be inaccurate, the memory available at run-time may be insufficient, or the data delivery rate can be unpredictable. In this paper, we address the problem of unpredictable data arrival rates. We propose to dynamically schedule queries in order to deal with irregular data delivery rates and gracefully adapt to the available memory. Our approach performs careful step-by-step scheduling of several query fragments and processes these fragments based on data arrivals. We describe a performance evaluation that shows significant performance gains in several configurations.
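
    One way to picture data-arrival-driven scheduling is the toy sketch below, which is not the paper's algorithm: fragments become runnable only when all of their inputs have buffered data, so a slow source delays just the fragments that depend on it.

      import collections

      def schedule(fragments, arrivals):
          """fragments: {name: set of input sources}. arrivals: iterable of
          (source, batch) events. Runs a fragment whenever every one of its
          inputs has at least one buffered batch."""
          buffers = collections.defaultdict(list)
          for source, batch in arrivals:
              buffers[source].append(batch)
              for name, inputs in fragments.items():
                  if all(buffers[src] for src in inputs):
                      batches = {src: buffers[src].pop(0) for src in inputs}
                      print(f"running {name} on {batches}")

      # The join waits for its slower input instead of stalling the plan.
      schedule({"join": {"R", "S"}}, [("R", [1, 2]), ("S", [3])])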
     
  98. Bouganim, L., Fabret, F., Mohan, C., Valduriez, P. A Dynamic Query Processing Architecture for Data Integration Systems, Database Engineering, Vol. 23, No. 2, June 2000.

    Execution plans produced by traditional query optimizers for data integration queries may yield poor performance for several reasons. The cost estimates may be inaccurate, the memory available at run-time may be insufficient, or the data delivery rate can be unpredictable. All these problems have led database researchers and implementers to resort to dynamic strategies to correct or adapt the static query execution plan (QEP). In this paper, we identify the different basic techniques that must be integrated in a dynamic query engine. Following on our recent work [6] on the problem of unpredictable data arrival rates, we propose a dynamic query processing architecture which includes three dynamic layers: the dynamic query optimizer, the scheduler and the query evaluator. A three-layer dynamic architecture significantly reduces the overheads of the dynamic strategies.
     
  99. [AHAEM00] Alonso, G., Hagen, C., Agrawal, D., El Abbadi, A., Mohan, C. Enhancing the Fault Tolerance of Workflow Management Systems, IEEE Concurrency, Vol. 8, No. 3, pp 74-81, July-September 2000. Citations

    Today's commercial workflow systems, although useful, do not scale well, have limited fault tolerance, and don't interoperate well with other workflow systems. The authors discuss current research directions and potential future extensions that might enable workflow services to meet the needs of mission-critical applications.
     
  100. [MBWSZ00] Mohan, C., Barber, R., Watts, S., Somani, A., Zaharioudakis, M. Evolution of Groupware for Business Applications: A Database Perspective on Lotus Domino/Notes, Proc. 26th International Conference on Very Large Databases, Cairo, Egypt, September 2000. DBLP

    In this paper, we first introduce the database aspects of the groupware product Lotus Domino/Notes and then describe, in more detail, many of the logging and recovery enhancements that were introduced in R5. We discuss briefly some of the changes that had to be made to the ARIES recovery method to accommodate the unique storage management characteristics of Notes. We also outline some of the on-going logging and locking work in the Dominotes project at IBM Research - Almaden.

  101. Barber, R., Herbert, D., Mohan, C., Somani, A., Watts, S., Zaharioudakis, M. Data Recovery in a Transactional Database Using Write-Ahead Logging and File Caching, United States Patent 6,173,292, IBM, January 2001.
     
    This method has been implemented in Lotus Domino/Notes R5.
  102. Mohan, C. Panel - Application Servers: Born-Again TP Monitors for the Web?, Proc. ACM SIGMOD International Conference on Management of Data, Santa Barbara, May 2001.

  103. [NMBS01] Narang, I., Mohan, C., Brannon, K., Subramanian, M. Coordinated Backup and Recovery between Database Management Systems and File Systems, IBM Research Report RJ10231, IBM Research - Almaden, February 2002.

    We consider a network of computers consisting of file servers and a Database Management System (DBMS) where a linkage is maintained, with referential integrity, between data in the DBMS and files in the file servers which are external to the DBMS. We present algorithms for performing backup and recovery of the DBMS data in a coordinated fashion with the files on the file servers. When a file is associated (linked) with a record in the DBMS, certain constraints are applied to support referential integrity, access control, and coordinated backup and recovery as if the file were stored in the DBMS. Backup of a referenced file is initiated when the file is linked. The file backup is performed asynchronously to the linking process so that the linking transaction is not delayed. In a typical scenario, when a database backup operation starts, all unfinished file backups are ensured to be completed before the database backup is declared successful. When a database is recovered to a state which includes references to files in one or more file servers, the DBMS ensures that the referenced files are also restored to their correct state in those file servers. However, since database backup and recovery are critical for database availability, the presence of an unavailable file server is tolerated during the database backup and recovery operations. Our algorithms for coordinated backup and recovery have been implemented in the IBM DB2/DataLinks product. The DataLinks concept is also part of the ISO SQL/MED standard.
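
    The coordination rule can be sketched as follows. The threading structure and names are illustrative, not the DB2/DataLinks implementation: file backups start asynchronously at link time, and the database backup completes only after every pending file backup has finished.

      import threading

      class Coordinator:
          def __init__(self):
              self.pending = []         # in-flight asynchronous file backups

          def link_file(self, file_name):
              # Linking a file schedules its backup without delaying the
              # linking transaction.
              t = threading.Thread(target=self._backup_file, args=(file_name,))
              self.pending.append(t)
              t.start()

          def _backup_file(self, file_name):
              pass                      # stub: copy the file to backup media

          def database_backup(self):
              # Declared successful only once all unfinished file backups
              # have completed.
              for t in self.pending:
                  t.join()
              self.pending.clear()
              return "backup complete"

      c = Coordinator()
      c.link_file("drawing.dxf")        # returns immediately
      assert c.database_backup() == "backup complete"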

  104. [LKMPW01] Luo, Q., Krishnamurthy, S., Mohan, C., Pirahesh, H., Woo, H., Lindsay, B., Naughton, J. Middle-tier Database Caching for e-Business, Proc. ACM SIGMOD International Conference on Management of Data, Madison, June 2002.

    Scaling up to the enormous and growing Internet population with unpredictable usage patterns, e-commerce applications face severe challenges in cost and manageability, especially for database servers that are deployed as those applications' back-ends in a multi-tier configuration. Middle-tier database caching is one solution to this problem. In this paper, we present a simple extension to the existing federated features in DB2 UDB, which enables a "regular" DB2 instance to become a DBCache without any application modification. On deployment of a DBCache at an application server, arbitrary SQL statements generated from the unchanged application, hitherto intended for a back-end database server, can be answered: at the cache, at the back-end database server, or at both locations in a distributed manner. The factors that determine the distribution of workload include the SQL statement type, the cache content, the application requirement on data freshness, and cost-based optimization at the cache. We have developed a research prototype of DBCache, and conducted an extensive set of experiments with an e-commerce benchmark to show the benefits of this approach and illustrate tradeoffs in caching considerations.
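
    The routing decision described above can be caricatured in a few lines; the rules below are illustrative guesses at the decision factors named in the abstract, not DB2's actual logic.

      def route(stmt_type, cached_tables, referenced_tables, freshness_ok):
          """Return 'cache', 'backend', or 'both' (distributed execution)."""
          if stmt_type != "SELECT":
              return "backend"          # updates go to the database of record
          local = referenced_tables & cached_tables
          if not local or not freshness_ok:
              return "backend"
          # Fully cached and fresh enough: answer locally; partially
          # cached: split the work between cache and back end.
          return "cache" if local == referenced_tables else "both"

      assert route("SELECT", {"item"}, {"item"}, True) == "cache"
      assert route("SELECT", {"item"}, {"item", "orders"}, True) == "both"
      assert route("UPDATE", {"item"}, {"item"}, True) == "backend"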

  105. [BBHM02] Bhattacharya, S., Brannon, K., Hsiao, H.-I, Mohan, C., Narang, I., Subramanian, M. Coordinating Backup/Recovery and Data Consistency Between Database and File Systems, Proc. ACM SIGMOD International Conference on Management of Data, Madison, June 2002.

    Managing a combined store consisting of database data and file data in a robust and consistent manner is a challenge of interest for database systems and content management systems. In such a hybrid system, images, videos, engineering drawings, etc. are stored as files on a file server while meta-data referencing/indexing such files is created and stored in a relational database to take advantage of efficient search. In this paper, we describe solutions for two potentially problematic aspects of such a data management system: backup/recovery and data consistency. We present algorithms for performing backup and recovery of the DBMS data in a coordinated fashion with the files on the file servers. Our algorithms for coordinated backup and recovery have been implemented in the IBM DB2/DataLinks product. We also propose an efficient solution to the problem of maintaining consistency between the content of the file and the associated meta-data stored in the DBMS from a reader's point of view without holding long duration locks on meta-data tables. In the model, an object is directly accessed and edited in-place through normal file system APIs using a reference obtained via an SQL query on the database. To relate file modifications to meta-data updates, the user issues an update through the DBMS, and commits both file and meta-data updates together.

  106. [Moha02a] Mohan, C. An Efficient Method for Performing Record Deletions and Updates Using Index Scans, Proc. 28th International Conference on Very Large Databases, Hong Kong, August 2002.

    We present a method for efficiently performing deletions and updates of records when the records to be deleted or updated are chosen by a range scan on an index. The traditional method involves numerous unnecessary lock calls and traversals of the index from root to leaves, especially when the qualifying records' keys span more than one leaf page of the index. Customers have suffered performance losses from these inefficiencies and have complained about them. Our goal was to minimize the number of interactions with the lock manager, and the number of page fixes, comparison operations and, possibly, I/Os. Some of our improvements come from increased synergy between the query planning and data manager components of a DBMS. Our patented method has been implemented in DB2 V7 to address specific customer requirements and to improve performance on the TPC-H benchmark.

    This method has been implemented in DB2 V7.

  107. [Moha02b] Mohan, C. Dynamic e-Business: Trends in Web Services, Proc. 3rd VLDB Workshop on Technologies for E-Services (TES2002), Hong Kong, August 2002.

    In the last couple of years, the concept of a web service (WS) has emerged as an important paradigm for general application integration in the internet environment. More particularly, WS is viewed as an important vehicle for the creation of dynamic e-business applications and as a means for the J2EE and .NET worlds to come together. Several companies, including Microsoft, have been collaborating in proposing new WS standards. The World Wide Web Consortium has been the forum for many WS-related standardization activities. Many traditional concepts like business process management, security, directory services, routing and transactions are being extended for WS. This extended abstract traces some of the trends in the WS arena. After the TES2002 workshop is over, more information can be found in the presentation material at http://www.almaden.ibm.com/u/mohan/WebServices_TES2002_Slides.pdf

  108. Cabrera, L.F., Mohan, C., Narang, I. Method and Means for Backup and Restoration of a Database System Linked to a System for Filing Data, United States Patent 6,453,325, IBM, September 2002. 
  109. Altinel, M., Bornhoevd, C., Krishnamurthy, S., Mohan, C., Pirahesh, H., Reinwald, B. DBCache: Middle-tier Database Caching for Highly Scalable e-Business Architectures, Demo Description, Proc. ACM SIGMOD International Conference on Management of Data, November 2002.
  110. [ABKM03] Altinel, M., Bornhoevd, C., Krishnamurthy, S., Mohan, C., Pirahesh, H., Reinwald, B. Cache Tables: Paving the Way for an Adaptive Database Cache, Proc. 29th International Conference on Very Large Data Bases, Berlin, Germany, September 2003.

    We present an adaptive database cache system for improving the response time, throughput and scalability of transactional web applications. Our solution supports caching both at the edge of content-delivery networks and in the middle tier of an enterprise application infrastructure. We cache subsets of relations of backend databases in regular DB2 UDB instances at each of a set of middle-tier nodes. We show how these caches can be maintained with almost zero administration overhead: their contents can be defined statically or be faulted in on demand. We exploit the characteristics of transactional web applications: simple equality predicates, a high volume of short transactions and 3-4 way joins. Our system is built with a set of new technologies that complement existing features of DB2 UDB: dynamic materialized views, "Janus" (two-headed) query plans, cache constraints and a caching daemon.

  111. Bornhoevd, C., Altinel, M., Mohan, C., Pirahesh, H., Reinwald, B. Adaptive Database Caching with DBCache, IEEE Data Engineering Bulletin, Volume 27, Number 2, 2004.
  112. Doraiswamy, S., Altinel, M., Shrinivas, L., Palmer, S., Parr, F., Reinwald, B., Mohan, C. Reweaving the Tapestry: Integrating Database and Messaging Systems in the Wake of New Middleware Technologies, Data Management in a Connected World, Theo Härder, Wolfgang Lehner (Eds.), Essays Dedicated to H. Wedekind on the Occasion of His 70th Birthday, Lecture Notes in Computer Science 3551, Springer 2005.
  113. Rao, J., Pirahesh, H., Mohan, C., Lohman, G. Compiled Query Execution Engine using JVM, Proc. 22nd International Conference on Data Engineering, Atlanta, April 2006.