Multi-core and many-core processors are here to stay. They are the microprocessor manufacturers’ response to the scalability walls of single-core designs. Software communities explored parallel programming models heavily in the 1980s and 1990s, but those efforts targeted coarser-grained systems: mainly clusters, parallel machines, and symmetric multiprocessors (SMP). Multi-core and many-core architectures have brought some classical problems back to the surface, such as memory latency, data synchronization, and thread management. They have also introduced new problems that come with massively parallel systems and very fine-grained threading models, such as managing thousands of concurrent threads and handling inter-thread communication and data sharing. In this post I would like to pinpoint some of these challenges, drawing on my research programming experience with multi-core architectures.
Maintaining the current growth rate of processing power requires microprocessor designers to put more processing cores on each chip. Two consequences need to be considered as cores are added. First, core counts will rise quickly so that processing power keeps doubling roughly every 18 months and Moore’s law stays in effect. Within five years we can therefore expect many-core processors with tens or even hundreds of processing cores on the same chip. Second, as the number of cores increases, each core will have a simpler design and perform simpler tasks, yet each is expected to be faster than current single-core processors. Power and heat management will impose these design constraints on microprocessor manufacturers. Such designs increase the processor’s overall speed while keeping power consumption and heat dissipation reasonable.
As a result, parallelism will become finer grained. Developers will parallelize their applications at a finer level to take full advantage of multi-core and many-core advances. This finer granularity will increase contention among threads for shared resources, which can be memory locations (that is, data) or I/O devices, and interdependencies among threads will grow. In addition, as cores get simpler and faster, more data will move back and forth between the processor and the system’s main memory, while memory latency relative to processor speed keeps increasing. Hardware-based techniques for hiding this latency, such as branch prediction and built-in cache replacement algorithms, may not be effective enough for parallel applications. Software-based cache management and execution scheduling are now vital to fully utilize multi-core processors. Finally, the programming complexity of multi-core processors and the inherent complexity of parallel applications call for tools that reduce some of this complexity.
Memory Latency Wall
As processors and programs become more parallel, they become more data hungry. At the same time, the number of processor cycles needed to access the system’s main memory has grown from a few cycles in 1980 to almost a thousand cycles today. Moreover, the amount of cache per core will continue to shrink, which will make the memory latency problem worse if the cache is not managed properly. Although DRAM-based memory still has great potential for performance gains, the aggregate cycle rate of processors will continue to grow faster. According to some estimates, the processor-to-memory performance gap is expected to grow by 50% per year. The good news is that the memory latency problem can be addressed with efficient software-based scheduling of memory accesses. Multi-core processors are now returning some control to the software developer to manage each core’s cache. Such explicit cache management capabilities give programmers more room to maneuver around the processor-to-memory performance gap.
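One common way to schedule memory accesses in software is loop tiling (blocking): the computation is reordered so that a small block of data is reused many times before the program moves on. The sketch below is only an illustration of that idea and is not taken from any particular processor’s cache-management API. The function names and the `tile` parameter are hypothetical, and pure Python will not show the real cache effect; the point is the access order, which would be the same in C or CUDA.

```python
def matmul_naive(a, b):
    """Reference triple loop: walks all of b for every row of a."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same result, but computed block by block.

    Each (i0, k0, j0) iteration touches only a tile-sized piece of a, b
    and c, so on real hardware that working set can stay in a core's cache.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, m, tile):
            for j0 in range(0, p, tile):
                # Work inside one block; min() handles ragged edges.
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, m)):
                        a_ik = a[i][k]
                        for j in range(j0, min(j0 + tile, p)):
                            c[i][j] += a_ik * b[k][j]
    return c
```

In a real implementation the tile size would be chosen so that the three blocks fit in the cache or local store available to one core.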
Using current synchronization mechanisms to coordinate access by tens or hundreds of threads to a single resource can put the application’s performance at risk. The whole system or application may suffer from deadlock or starvation because of weak synchronization mechanisms. In the worst case, current synchronization mechanisms serialize the application wherever it needs access to shared resources. As the number of hardware threads in multi-core processors and the degree of parallelism in applications increase, the resulting performance loss grows as well. For example, a parallel shared-counter algorithm requires each participating thread to lock the counter before incrementing it. In the worst case, each thread has to wait for n-1 other threads before it can update the counter, where n is the number of threads. When the cores are on the same die, data synchronization can be made much more efficient by communicating through the available on-chip facilities, such as the core interconnect and shared cache. Current synchronization techniques instead write and read shared data through the system’s main memory, which makes matters worse: they add memory latency on top of the delay of the synchronization algorithm itself.
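The shared-counter example can be sketched as follows. The first function is the lock-per-increment version described above; the second is a commonly used alternative, not from the original text, in which each thread counts privately and the partial results are combined once at the end. Both function names are hypothetical, and because of CPython’s global interpreter lock this sketch shows the structure of the two approaches, not their real multi-core timings.

```python
import threading


def count_with_shared_lock(num_threads, increments):
    """Every increment takes the same lock, so updates are serialized."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:  # up to num_threads - 1 threads may be waiting here
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


def count_with_local_partials(num_threads, increments):
    """Each thread counts privately; results are combined once at the end."""
    partials = [0] * num_threads

    def worker(index):
        local = 0
        for _ in range(increments):
            local += 1  # no shared state touched inside the loop
        partials[index] = local  # one write per thread, to its own slot

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)
```

Both functions return `num_threads * increments`; the difference is that the second one touches shared state once per thread instead of once per increment.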
Parallel computing is inherently complex, mainly because of the difficulty of design and the intricacy of resource sharing and synchronization. The presence of multi-core processors at every scale, from embedded systems to supercomputers, has made adapting applications to these new hardware platforms a critical issue. As multi-core processors gain more cores and parallelism becomes finer grained, complexity will increase as well. Instead of designing a parallel application with 10 or 20 concurrent threads, developers may be dealing with an application that runs hundreds or thousands of threads on the same machine. A solution is needed that reduces this programming complexity while still scaling well with the number of working threads.
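The text above does not name a specific solution. One family of approaches is task-based programming, where the developer describes independent pieces of work and a runtime maps them onto a fixed pool of worker threads. The sketch below assumes Python’s standard `concurrent.futures` module as a stand-in for such a runtime; the helper names `chunked` and `parallel_sum_of_squares` and their parameters are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(data, size):
    """Split data into consecutive slices of at most `size` items."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum_of_squares(data, chunk_size=1000, max_workers=4):
    """Describe the work as tasks; the pool decides which thread runs each.

    The caller never creates, joins, or synchronizes threads directly, and
    the number of tasks can grow without changing this code.
    """
    def work(chunk):
        return sum(x * x for x in chunk)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in task order; summing combines them.
        return sum(pool.map(work, chunked(data, chunk_size)))
```

For example, `parallel_sum_of_squares(list(range(10)), chunk_size=3, max_workers=2)` splits the input into four tasks and returns 285, the same value a sequential loop would produce.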
These challenges are the main motivation for most multi-core researchers, architects, and developers. All microprocessor manufacturers want faster processors without increasing programming complexity and without losing developers’ ability to make the best use of their new architectures. That is why microprocessor manufacturers are now aggressively involved in programming models. Intel, for example, created Intel’s Parallel Studio (Open CT framework included) for its general-purpose multi-core microprocessors as well as specialized ones, such as the Larrabee GPGPU. Likewise, ATI built the ATI Stream framework and NVIDIA built the CUDA framework to help developers get the most out of these new microprocessors without getting into the nitty-gritty architectural details of these advanced GPGPUs.