Performance auto-tuning is drawing more attention as multi-core processors become more complex. Current petascale machines contain hundreds of thousands of cores, and it is very difficult to reach the best performance on them by manually optimizing how algorithms execute. This is why performance auto-tuning is becoming a very important area of research. Efforts to design and build exascale machines are actively under way. These machines will run billions of concurrent threads on hundreds of millions of cores, which will make performance monitoring and optimization a more challenging and, at the same time, more interesting problem.
Current auto-tuning efforts focus on optimizing the execution of algorithms at the micro level, on the expectation that these gains will aggregate into better performance across thousands of CPUs with tens of thousands of cores. Samuel Williams, for example, tested several in-core and out-of-core automated source code optimizations on stencil algorithms. In this research he, together with other researchers, built auto-tuners for leading HPC architectures such as the Cell processor, GPGPUs, Sun Niagara, Power6, and Xeon processors. I’m impressed by the relatively large number of architectures he and his team tested this algorithm on.
However, after reading his and other related papers, I had two questions. Does auto-tuning at the level of each core or microprocessor guarantee, by default, the best performance for the whole system? And aren’t there run-time parameters that auto-tuning should consider, instead of focusing only on compile-time tuning? For example, memory latency varies at run time with the resource scheduling policies and with changes in the workload.
Auto-tuning should be done collaboratively across all layers of the system: operating systems, programming models and frameworks, run-time libraries, and applications. The situation is relatively simple today, since most multi-/many-core architectures are managed by run-time libraries and operating systems are not yet seriously involved in managing multi-core processors. For example, an NVIDIA GPGPU is managed by the CUDA run-time environment, transparently to the operating system. It might be better to keep it this way, since GPGPUs do not have direct access to system-wide resources such as the host system’s memory and I/O devices. However, as these architectures evolve, they will need access to the system’s resources, and operating systems will play a bigger role in managing hundreds of cores. Have a look at this posting to understand more about the concerns of performance auto-tuning.
Auto-tuning should also focus on the run-time parameters that affect the performance of these automatically tuned applications. It is becoming very difficult to predict the exact behavior of the system and, consequently, to estimate accurately the different latencies that affect performance. For example, memory latency and bandwidth are not determined by compile-time parameters alone. They are also affected by thread affinity, thread scheduling, and other run-time system parameters such as page size and the TLB.
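To make the idea of tuning at run time a bit more concrete, here is a minimal sketch of an empirical tuner in Python. It does not come from any of the papers mentioned above. It simply times a kernel on the running system for each candidate parameter value and keeps the fastest one. The names `autotune` and `make_blocked_sum`, and the block-size parameter itself, are assumptions I made up for illustration; a real tuner would sweep parameters such as thread affinity or page size instead.

```python
import time


def autotune(kernel, candidates, repeats=3, timer=time.perf_counter):
    """Time kernel(param) for every candidate; return (best_param, best_time).

    The best of `repeats` timings is kept per candidate to damp run-time noise.
    `timer` is injectable so the search logic can be checked with a fake clock.
    """
    best_param, best_time = None, float("inf")
    for param in candidates:
        samples = []
        for _ in range(repeats):
            start = timer()
            kernel(param)
            samples.append(timer() - start)
        elapsed = min(samples)
        if elapsed < best_time:
            best_param, best_time = param, elapsed
    return best_param, best_time


def make_blocked_sum(data):
    """Hypothetical kernel: sum `data` in chunks of `block` elements."""
    def kernel(block):
        total = 0
        for i in range(0, len(data), block):
            total += sum(data[i:i + block])
        return total
    return kernel


if __name__ == "__main__":
    kernel = make_blocked_sum(list(range(200_000)))
    block, seconds = autotune(kernel, [64, 1024, 16384])
    print(f"fastest block size on this run: {block} ({seconds:.6f} s)")
```

The winning value can differ from one run to the next, depending on what else the machine is doing, and that is exactly the effect a compile-time-only tuner cannot see.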
I think run-time performance auto-tuning deserves more attention for large systems. At first it may look as if the limited control that some microprocessors give to developers makes the best run-time parameterization very difficult or impossible to achieve. However, I see some leading architectures giving control back to developers, sometimes indirectly. For example, the streaming features of GPGPUs open up the possibility of optimizing the size, timing, and number of streams based on the memory performance observed at run time. The zero-copy feature introduced in the NVIDIA GTX-295 GPUs also makes run-time performance optimization possible. I post more details about the auto-tuning possibilities on these architectures.
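As a rough illustration of tuning the number of streams from run-time measurements, the sketch below keeps doubling the stream count while the measured time improves and stops at the first regression. Everything in it is an assumption on my part: `tune_stream_count` and `modeled_transfer_time` are hypothetical names, and the cost model (work divided across streams plus a fixed overhead per stream) is a made-up stand-in for actually timing transfers on a GPU.

```python
def tune_stream_count(measure, start=1, max_streams=32):
    """Double the stream count while measure(n) keeps improving.

    `measure(n)` must return the observed time for n streams.
    Returns (best_stream_count, best_time).
    """
    best_n = start
    best_t = measure(best_n)
    while best_n * 2 <= max_streams:
        n = best_n * 2
        t = measure(n)
        if t >= best_t:
            break  # more streams no longer help; keep the previous setting
        best_n, best_t = n, t
    return best_n, best_t


def modeled_transfer_time(total_work=64.0, per_stream_overhead=1.0):
    """Made-up cost model standing in for a real timed transfer."""
    def measure(n_streams):
        return total_work / n_streams + per_stream_overhead * n_streams
    return measure


if __name__ == "__main__":
    n, t = tune_stream_count(modeled_transfer_time())
    print(f"chosen stream count: {n} (modeled time {t})")
```

In a real setting `measure` would launch the transfers and kernels with the given number of streams and return the elapsed time, and the search could be repeated whenever the workload changes.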