Hiroshi Inoue

Contact Information

Ph.D., Research Staff Member
IBM Research - Tokyo


Professional Associations

ACM SIGPLAN  |  IEEE Computer Society  |  Information Processing Society of Japan (IPSJ)

"Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors"
Hiroshi Inoue and Toshio Nakatani
2010 IEEE International Symposium on Workload Characterization (IISWC 2010). Atlanta, Georgia, USA. pp. 209-218. December 2-4, 2010.

Full text [PDF]: IISWC2010_inoue.pdf
Slides [PDF]: IISWC2010_inoue_slides.pdf


Many modern high-performance processors support multiple hardware threads in the form of multiple cores and SMT (Simultaneous Multi-Threading). Hence, achieving good performance scalability of programs with respect to the number of cores (core scalability) and the number of SMT threads in one core (SMT scalability) is critical. To identify a way to achieve higher performance on multi-core SMT processors, this paper compares the performance scalability of two parallelization models (using multiple processes and using multiple threads in one process) on two types of hardware parallelism (core scalability and SMT scalability). We tested standard Java benchmarks and a real-world server program written in PHP on two platforms, Sun's UltraSPARC T1 (Niagara) processor and Intel's Xeon (Nehalem) processor. We show that the multi-thread model achieves better SMT scalability than the multi-process model by reducing the number of cache misses and DTLB misses. However, both models achieve roughly equal core scalability. We show that the multi-thread model generates up to 7.4 times more DTLB misses than the multi-process model when multiple cores are used. To take advantage of both models, we implemented a memory allocator for a PHP runtime that reduces DTLB misses on multi-core SMT processors. The allocator is aware of the core that is running each software thread and allocates memory blocks from the same memory pages for each processor core. When using all of the hardware threads on a Niagara, the core-aware allocator reduces the DTLB misses by 46.7% compared to the default allocator, and it improves the performance by 3.0%.
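The core-aware allocation idea can be sketched in C. This is a hypothetical illustration, not the paper's actual PHP-runtime implementation: it keeps one page-sized arena per core (found via the Linux-specific `sched_getcpu()`), so that small blocks handed to threads running on the same core come from the same page and can share DTLB entries. The names (`arena_t`, `core_alloc`) and the bump-pointer strategy are assumptions for the sketch; the real allocator would also need freeing, locking for preemption, and per-SMT-thread contention handling.

```c
#define _GNU_SOURCE          /* for sched_getcpu() on glibc */
#include <sched.h>
#include <stdlib.h>
#include <stddef.h>

#define PAGE_SIZE 4096
#define MAX_CORES 64

/* One bump-pointer arena per core: blocks allocated on a core
 * come from that core's current page. */
typedef struct {
    char   *page;   /* current page backing this core's arena */
    size_t  used;   /* bytes already handed out from the page */
} arena_t;

static arena_t arenas[MAX_CORES];

/* Allocate 'size' bytes from the arena of the core the caller
 * is currently running on (illustrative: no free, no locking). */
static void *core_alloc(size_t size)
{
    int cpu = sched_getcpu();
    if (cpu < 0 || cpu >= MAX_CORES)
        cpu = 0;                        /* fall back to a shared arena */
    arena_t *a = &arenas[cpu];

    size = (size + 15) & ~(size_t)15;   /* 16-byte alignment */
    if (size > PAGE_SIZE)
        return NULL;                    /* large blocks out of scope here */

    if (a->page == NULL || a->used + size > PAGE_SIZE) {
        /* Start a fresh page-aligned page for this core. */
        if (posix_memalign((void **)&a->page, PAGE_SIZE, PAGE_SIZE) != 0)
            return NULL;
        a->used = 0;
    }
    void *p = a->page + a->used;
    a->used += size;
    return p;
}
```

Because consecutive allocations on one core land in one page, the working set of each core touches fewer distinct pages, which is the mechanism behind the DTLB-miss reduction the paper reports.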

Copyright (c) 2010 by IEEE. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.