Next Generation Systems and Cloud - ESPAS 2013
Second International Workshop on
Extreme Scale Parallel Architectures and Systems (ESPAS 2013)
To be held in conjunction with the 8th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC 2013)
January 23, 2013 | Berlin, Germany
The 2nd International Workshop on Extreme Scale Parallel Architectures and Systems (ESPAS) will bring together researchers working on Experimental Infrastructures for Exascale Research and Development. Work in this area includes investigation of experimental components and systems for extreme-scale, simulation methods and tools targeting extreme scale, benchmarking, workload generation tools, and other related experimental systems and methods.
The workshop encourages discussion of disruptive approaches to address the challenges of research and development for systems that do not yet exist. Another aspect of equal importance is the creation of a palette of scientific methods and experimental infrastructures (in software and/or hardware) to evaluate novel ideas (technologies, algorithms, systems). Some of the pressing issues that need to be addressed in this context are scalability of experiments, validation/extrapolation of scientific results, and characterization of expected workloads and their synthetic generation. Given the high cost of ownership and the limited access to top-end parallel systems, it is important to pursue experimental architectures and systems composed of off-the-shelf components, configured, modified, or enhanced in such a way that they can provide insight into aspects of exascale systems.
Topics of interest include, but are not limited to:
- Testbed design and evaluation
- Experimental clusters/systems targeting extreme scale
- Workload generation and benchmarking
- Analytical modelling and simulation of systems
- Techniques for extrapolation of experimental results to extreme-scale
- Validation of projection/extrapolation techniques
- Suitability/adaptability of commercial, off-the-shelf components (COTS)
- Cost, energy, performance and resilience
- Methodologies and tools
Registration Information
Participants of ESPAS 2013 will need to register through the official HiPEAC 2013 registration site.
Workshop Program
- This year's program consisted of one keynote talk and two invited talks.
10:00-10:05: Welcome
Kostas Katrinis, Exascale Systems, IBM Research - Ireland
10:05-11:00: Session I
- Addressing the System Software Challenges for Converged Simulation and Analysis on Extreme-Scale Systems, Ron Brightwell, Sandia National Laboratories, Albuquerque, NM.
Abstract: Achieving the next three orders of magnitude performance increase to move from petascale to exascale computing will require significant advancements in several fundamental areas. A recent report from the US Department of Energy (DOE) details several challenges and potential approaches for operating system and runtime system software for such extreme-scale systems. Many of these challenges, such as lowering energy use and increasing resilience, will likely require advancements in hardware as well as system software, increasing the risk associated with software-only approaches. The convergence of traditional extreme-scale computing modeling and simulation applications with large-scale data analytics applications is also driving the need for more sophisticated and robust system software. However, the challenges associated with coupling simulation and analysis are likely less susceptible to hardware enhancements. In this talk, I will summarize the report of the DOE OS/R Technical Council and describe several ongoing research projects at Sandia National Laboratories that are exploring enhancements to operating system, runtime system, and interconnect software to enable performance and scalability on extreme-scale systems. I will also discuss system software capabilities for application composability that are required to enable more advanced simulation and analysis workflows.
11:00-11:30: BREAK
11:30-12:55: Session II
- Models for fault-tolerance at very large scale, Yves Robert, Laboratoire de l'Informatique du Parallelisme, Ecole Normale Superieure de Lyon, France.
Abstract: This talk will introduce models for fault-tolerance techniques at very large scale. It will cover coordinated or hierarchical checkpoint protocols, fault prediction, and replication. The models will be instantiated with realistic scenarios for HPC applications on current petascale and future exascale platforms.
- On determining a viable path to resilience at exascale, Frank Mueller, Department of Computer Science, North Carolina State University, Raleigh, NC.
Abstract: Exascale computing is projected to feature billion-core parallelism. At such large processor counts, faults will become more commonplace. Current techniques to tolerate faults focus on reactive schemes for recovery and generally rely on a simple checkpoint/restart mechanism. Yet, they have a number of shortcomings. (1) They do not scale and require complete job restarts. (2) Projections indicate that the mean-time-between-failures is approaching the overhead required for checkpointing. (3) Existing approaches are application-centric, which increases the burden on application programmers and reduces portability.
We discuss a number of techniques that address these problems, along with their level of maturity (or lack thereof): (a) Scalable network overlays track node failures and recoveries, lifting part of the burden from programmers. (b) Mechanisms for on-the-fly recovery without a need to restart compute jobs conserve large-scale resources, in contrast to today's techniques. (c) An approach for proactive fault tolerance that complements reactive schemes further reduces resource requirements. (d) Redundant computing allows forward computation in the presence of failures. (e) Minimal API support for fault tolerance increases portability without requiring vendors to implement extensive functionality.
We discuss the advantages of process-level virtualization and integration into MPI message passing. These and further advances provide scalability, transparent recovery, portability and reduced checkpoint frequencies in large-scale clusters. We also discuss shortcomings in standardization, existing software stacks at HPC centers and challenges in fault tolerance for exascale computing.
12:55-13:00: Concluding Remarks
- Kostas Katrinis, Exascale Systems, IBM Research - Ireland
Organizers
Program Co-Chairs
- Shoukat Ali, Exascale Systems, IBM Research - Ireland
- Kostas Katrinis, Exascale Systems, IBM Research - Ireland
Steering Committee
- Shoukat Ali, Exascale Systems, IBM Research - Ireland
- Kostas Katrinis, Exascale Systems, IBM Research - Ireland
- Kurt Ferreira, Sandia National Laboratories, USA.
- Rolf Riesen, Exascale Systems, IBM Research - Ireland
- Georgios Theodoropoulos, Exascale Systems, IBM Research - Ireland
Program Committee
- Deepak Ajwani, University College Cork, Ireland.
- David Bader, Georgia Institute of Technology, USA.
- Alexey Lastovetsky, University College Dublin, Ireland.
- Muthucumaru Maheswaran, McGill University, Canada.
- Viktor Prasanna, University of Southern California, USA.