I’m out of energy to write a full post right now, so I decided to post photos of the slides from this panel discussion instead. The panelists were:
- Richard Vuduc, Georgia Tech
- Wu-chun Feng, Virginia Tech
- Charles Moore, AMD
June 14, 2011
June 2, 2011
May 31, 2011
If you are taking computer architecture classes, studying electronics, or doing research related to microprocessors, you may have heard warnings about the end of Moore’s law: microprocessors will soon be unable to keep doubling their performance every 18 months, because of physical limits on making transistors smaller while keeping their power consumption reasonable.
Microprocessors have depended on three main factors to keep Moore’s law in effect: (1) reducing transistor size, so that more transistors fit in the same area, allowing more sophisticated execution logic and more cache; (2) increasing clock frequency, to execute more instructions per second; and (3) the economics of manufacturing, to keep each new generation of microprocessors affordable. Right now it is difficult to cram in more transistors because of the current limits of lithography. Also, as transistors get smaller and operate at higher frequencies, their power consumption grows faster than their performance. Finally, manufacturing costs are rising astronomically from one generation to the next.
I think Moore’s law may not survive as it stands right now, but the pattern may keep going through different means. Here is my stab at it:
- Reconsidering the execution pipelines to shorten the latency of each instruction.
- Increasing the number of cores, which increases the overall throughput. This is possible by making better use of the total number of transistors that fit on one chip, and it becomes workable because the pipelines would be shallower.
- Homogeneity of the instruction set and heterogeneity in its implementations. For example, a multi-core processor may have 32 cores that share the same arithmetic and logical instructions, while only two or four of them implement the system control instructions, such as the protected mode and interrupt handling instructions. Applications may not need radical rewriting in this case. In fact, we could automate the migration of traditional multi-threaded applications to this new heterogeneous architecture.
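To make the last idea a bit more concrete, here is a minimal Python sketch of how a scheduler might pick cores on such a chip. Everything in it is an assumption for illustration: the `Core` class, the `build_chip` and `eligible_cores` helpers, and the choice of 4 system-capable cores out of 32 are hypothetical and do not describe any real processor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Core:
    core_id: int
    # True only for the few cores that also implement system control
    # instructions (protected mode, interrupt handling, ...).
    supports_system_instructions: bool


def build_chip(total_cores=32, system_cores=4):
    """Build a hypothetical chip: every core shares the ALU instruction set,
    but only the first `system_cores` cores implement system instructions."""
    return [Core(i, i < system_cores) for i in range(total_cores)]


def eligible_cores(chip, needs_system_instructions):
    """Return the cores a thread may run on, given what it needs."""
    if needs_system_instructions:
        return [c for c in chip if c.supports_system_instructions]
    return list(chip)


chip = build_chip()
print(len(eligible_cores(chip, needs_system_instructions=True)))
print(len(eligible_cores(chip, needs_system_instructions=False)))
```

An ordinary compute thread can land on any of the 32 cores, while a thread that touches system control instructions is restricted to the few cores that implement them. This is why existing multi-threaded applications might migrate without radical rewriting.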
October 18, 2010
I’m sharing my latest poster about the high performance computing (HPC) arena. I wanted to introduce HPC to both undergraduate and new graduate students. It contains interesting pointers to HPC-related topics. I’m organizing it into these main sections:
You can get a high resolution version here. A snapshot with a fair resolution is below.
May 20, 2010
OpenCL was initially developed by Apple, which holds the trademark rights, in collaboration with technical teams at AMD, IBM, Intel, Nvidia, Ericsson, Nokia, Texas Instruments, and Motorola. Apple submitted this initial proposal to the Khronos Group. On June 16, 2008, the Khronos Compute Working Group was formed with representatives from CPU, GPU, embedded-processor, and software companies. This group worked for five months to finish the technical details. On December 8, 2008, OpenCL was released.
If you’ve never heard of OpenCL, you need to stop whatever you’re doing and read ahead. We all know that multi-threaded applications have not been as abundant as we had hoped. For those precious few applications that are multi-core aware, few leverage the full potential of two cores. That is why OpenCL was developed, to standardize parallel programming and execution.
The OpenCL architecture shares a range of computational interfaces with two competitors, Nvidia’s Compute Unified Device Architecture (CUDA) and Microsoft’s DirectCompute.
What is OpenCL?
OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, Cell processors, DSPs, and other processors. It includes a language for writing kernels (functions that execute on OpenCL devices, not operating system kernels), plus APIs that are used to define and then control the platforms.
The Khronos Group hopes that OpenCL will do for multi-core what OpenGL did for graphics and what OpenAL is beginning to do for audio, and that is exactly what OpenCL has achieved. OpenCL has improved speed for a wide spectrum of applications, from gaming and entertainment to scientific and medical software.
The following link points to a video that shows how much OpenCL can speed up the execution of an application.
How does OpenCL work?
OpenCL includes a language for writing compute kernels and APIs for creating and managing these kernels. A runtime compiler compiles the kernels on the fly, during host application execution, for the targeted device. This enables the host application to take advantage of all the compute devices in the system.
One of OpenCL’s strengths is that this model does not specify exactly what hardware constitutes a compute device. Thus, a compute device may be a GPU, or a CPU.
OpenCL sees today’s heterogeneous world through the lens of an abstract, hierarchical platform model. In this model, a host coordinates execution, transferring data to and from an array of Compute Devices. Each Compute Device is composed of an array of Compute Units, and each Compute Unit is composed of an array of Processing Elements.
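The hierarchy described above can be sketched as a few nested data structures. This is only an illustration in Python, not the OpenCL API: the class names mirror the terms of the platform model, and the device names and counts (16 compute units of 32 processing elements, 4 compute units of 1) are made-up values.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ComputeUnit:
    processing_elements: int  # number of processing elements in this unit


@dataclass
class ComputeDevice:
    name: str
    compute_units: List[ComputeUnit] = field(default_factory=list)

    def total_processing_elements(self) -> int:
        return sum(cu.processing_elements for cu in self.compute_units)


@dataclass
class Host:
    # The host coordinates execution across all of its compute devices.
    devices: List[ComputeDevice] = field(default_factory=list)

    def total_processing_elements(self) -> int:
        return sum(d.total_processing_elements() for d in self.devices)


# Hypothetical devices, chosen only to show the shape of the hierarchy.
gpu = ComputeDevice("hypothetical GPU", [ComputeUnit(32) for _ in range(16)])
cpu = ComputeDevice("hypothetical CPU", [ComputeUnit(1) for _ in range(4)])
host = Host([gpu, cpu])

for device in host.devices:
    print(device.name, len(device.compute_units), device.total_processing_elements())
```

The point of the abstraction is that the host code walks the same three levels whether the device underneath is a GPU or a CPU.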
The platform layer API gives the developer access to routines that query for the number and types of devices in the system. The developer can then select and initialize the compute devices needed to run their workload properly. It is at this layer that compute contexts and work-queues for job submission and data transfer requests are created.
The runtime API allows the developer to queue up compute kernels for execution and is responsible for managing the compute and memory resources in the OpenCL system.
OpenCL Memory Model
OpenCL defines four memory spaces: private, local, constant and global.
Private memory is memory that can only be used by a single work-item. This is similar to the registers in a single compute unit or a single CPU core.
Local memory is memory that can be used by the work-items in a work-group. This is similar to the local data share that is available on the current generation of AMD GPUs.
Constant memory is memory that can be used to store constant data for read-only access by all of the compute units in the device during the execution of a kernel. The host processor is responsible for allocating and initializing the memory objects that reside in this memory space. This is similar to the constant caches that are available on AMD GPUs.
Global memory is memory that can be used by all the compute units on the device. This is similar to the off-chip GPU memory that is available on AMD GPUs.
Executable code in OpenCL is organized around three basic concepts: kernels, programs, and the queues into which applications submit kernels for execution.
A compute kernel is the basic unit of executable code and can be thought of as similar to a C function. Each executing instance of a kernel is called a work-item, and each work-item has a unique ID.
Execution of such kernels can proceed either in-order or out-of-order depending on the parameters passed to the system when queuing up the kernel for execution. Events are provided so that the developer can check on the status of outstanding kernel execution requests and other runtime requests.
In terms of organization, the execution domain of a kernel is defined by an N-dimensional computation domain. This lets the system know how large of a problem the user would like the kernel to be applied to.
Each element in the execution domain is a work-item and OpenCL provides the ability to group together work-items into work-groups for synchronization and communication purposes.
A program is a collection of kernels and other functions.
Applications queue kernel execution instances: kernels are submitted to a queue in order and are then executed either in order or out of order.
Since OpenCL is meant to target not only GPUs but also other accelerators, such as multi-core CPUs, flexibility is given in the type of compute kernel that is specified. Compute kernels can be thought of either as data-parallel, which is well-matched to the architecture of GPUs, or task-parallel, which is well-matched to the architecture of CPUs.
Data parallelism focuses on distributing the data across different parallel computing nodes.
To achieve data parallelism in OpenCL:
1. Define an N-dimensional computation domain.
2. Group work-items together into work-groups.
3. Execute multiple work-groups in parallel.
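The three steps above can be sketched in plain Python. This is a serial emulation rather than OpenCL itself: the helper name `enumerate_work_items` is hypothetical, and the 8 x 4 domain with 4 x 2 work-groups is an arbitrary example. The sketch assumes each global size is a multiple of the matching local size, as OpenCL 1.x requires.

```python
from itertools import product


def enumerate_work_items(global_size, local_size):
    """Yield (group_id, local_id, global_id) for every work-item of an
    N-dimensional domain that is split into equally sized work-groups."""
    if len(global_size) != len(local_size):
        raise ValueError("global and local sizes need the same dimensions")
    for g, l in zip(global_size, local_size):
        if g % l != 0:
            raise ValueError("global size must be a multiple of local size")

    groups_per_dim = tuple(g // l for g, l in zip(global_size, local_size))
    for group_id in product(*(range(n) for n in groups_per_dim)):
        for local_id in product(*(range(l) for l in local_size)):
            # global id = group id * work-group size + local id, per dimension
            global_id = tuple(
                gi * l + li for gi, l, li in zip(group_id, local_size, local_id)
            )
            yield group_id, local_id, global_id


# Step 1: an 8 x 4 domain. Step 2: work-groups of 4 x 2 work-items.
items = list(enumerate_work_items((8, 4), (4, 2)))
# Step 3 would run the work-groups in parallel; here they are just listed.
print(len(items), len({group for group, _, _ in items}))
```

Each work-item ends up with a unique global ID, and the work-items that share a group ID are the ones that can synchronize and communicate with each other.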
Task parallelism focuses on distributing execution processes (threads) across different parallel computing nodes.
In OpenCL, this can be achieved by synchronizing work-items within a work-group.
How to submit work to the computing devices in the system?
The following is a simple vector addition kernel written in OpenCL. You can see that the kernel specifies three memory objects: two inputs, a and b, and a single output, c. These are arrays of data that reside in the global memory space. In this example, the work-item executing this kernel gets its unique work-item ID and uses it to complete its part of the vector addition by reading the appropriate values from a and b and storing their sum into c.
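The original listing is not reproduced here, so what follows is only a sketch of what such a kernel conventionally looks like. The OpenCL C source is held in a Python string, and a small serial Python emulation stands in for the device. The kernel name `vec_add` and the helpers `vec_add_work_item` and `run_1d` are hypothetical names chosen for this illustration.

```python
# OpenCL C source kept as a string, the way online compilation expects it.
# This is the conventional shape of a vector-add kernel, not the post's
# original listing; the kernel name vec_add is an assumption.
program_source = """
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    int gid = get_global_id(0);   /* unique work-item ID */
    c[gid] = a[gid] + b[gid];
}
"""


def vec_add_work_item(gid, a, b, c):
    """Python stand-in for the work done by one work-item of the kernel."""
    c[gid] = a[gid] + b[gid]


def run_1d(kernel, global_size, *buffers):
    """Emulate a 1-D launch serially: one call per work-item ID."""
    for gid in range(global_size):
        kernel(gid, *buffers)


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
c = [0.0] * len(a)
run_1d(vec_add_work_item, len(a), a, b, c)
print(c)  # [11.0, 22.0, 33.0, 44.0]
```

On a real device the loop in `run_1d` disappears: the runtime launches one work-item per element and each one handles only its own index.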
Since, in this example, we will be using online compilation, the above code will be stored in a character array named program_source.
To complement the compute kernel code, the following is the code run on the host processor to:
Future of OpenCL
It is really hard to tell whether OpenCL will last, but I think the future lies with OpenCL because it is an open standard that is not restricted to one vendor or to specific hardware. Another reason is that AMD is going to release a new processor called Fusion. Fusion is AMD’s forthcoming CPU + GPU product on one hybrid silicon chip.
This processor would be perfect for OpenCL, as OpenCL does not care what type of processor is available, as long as it can be used.
March 28, 2010
Performance auto-tuning is gaining more attention as multi-core processors become more complex. Current petascale machines contain hundreds of thousands of cores. It is very difficult to reach the best performance on these machines using only manual optimization of algorithm execution, so performance auto-tuning is becoming a very important area of research. Efforts to design and build exascale machines are actively under way. These machines will run billions of threads concurrently on hundreds of millions of cores. Performance monitoring and optimization will become a more challenging and, at the same time, more interesting problem.
Current auto-tuning efforts focus on optimizing the execution of algorithms at the micro level, with the gains aggregating into better performance across thousands of CPUs with tens of thousands of cores. Samuel Williams, for example, tested several in-core and out-of-core automated source code optimizations on stencil algorithms. In his research he, among other researchers, built auto-tuners for leading HPC architectures such as the Cell processor, GPGPUs, Sun Niagara, Power6, and Xeon processors. I’m impressed by the relatively large number of architectures he and his team tested this algorithm on.
However, after reading his and other related papers, I had two questions. Does auto-tuning at the level of each core or microprocessor guarantee, by default, the best performance for the whole system? And aren’t there run-time parameters that should be considered in auto-tuning, instead of focusing only on compile-time auto-tuning? For example, memory latency varies at run time with the resource scheduling policies and with changes in workload.
Auto-tuning should be done collaboratively across all layers of the system, including operating systems, programming models and frameworks, run-time libraries, and applications. Today the picture is relatively simple, since most multi-/many-core architectures are managed by run-time libraries, and operating systems are not yet seriously in the game of managing multi-core processors. For example, NVIDIA GPGPUs are managed by the CUDA run-time environment, transparently to the operating system. It might be better to keep it this way, since GPGPUs do not have direct access to system-wide resources such as the host system’s memory and I/O devices. However, as these architectures evolve, they will need access to the system’s resources, and operating systems will play a bigger role in managing hundreds of cores. Have a look at this posting to understand more about the concerns of performance auto-tuning.
Auto-tuning should also focus on the run-time parameters that affect the performance of these automatically tuned applications. It is becoming very difficult to predict the exact system behavior and, consequently, to estimate accurately the different latencies that affect performance. For example, memory latency and bandwidth are not affected by compile-time parameters only. They are also affected by thread affinity, thread scheduling, and other run-time system parameters such as the page size and TLB behavior.
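As a rough illustration of what run-time auto-tuning means, here is a minimal Python sketch that times a piece of work under several candidate parameter values on the machine it is actually running on and keeps the fastest. It is not taken from any of the papers mentioned above. The `autotune` and `blocked_sum` functions and the candidate block sizes are hypothetical, and a real tuner would search a much larger space (thread affinity, page size, and so on).

```python
import time


def autotune(candidates, run, repeats=3, clock=time.perf_counter):
    """Time run(candidate) for each candidate and return
    (best_candidate, timings), using the best of `repeats` measurements."""
    timings = {}
    for candidate in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = clock()
            run(candidate)
            best = min(best, clock() - start)
        timings[candidate] = best
    return min(timings, key=timings.get), timings


DATA = list(range(100_000))


def blocked_sum(block):
    """Toy workload: sum DATA in chunks of `block` elements."""
    total = 0
    for i in range(0, len(DATA), block):
        total += sum(DATA[i:i + block])
    return total


# Which block size wins depends on the machine and its current load,
# which is exactly why the measurement happens at run time.
best_block, timings = autotune([64, 1024, 16384], blocked_sum)
print(best_block, timings)
```

Taking the best of several repeats is a simple way to dampen noise from other workloads; the `clock` parameter is there so the measurement source can be swapped out.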
I think run-time performance auto-tuning deserves more attention for large systems. It may look at first as if the limited control given to developers in some microprocessors makes achieving the best run-time parameterization very difficult or impossible. However, I see some leading architectures giving control back to developers, sometimes indirectly. For example, the streaming features inside GPGPUs open up room to optimize the size, timing, and number of streams based on run-time memory performance. The zero-copy feature introduced in the NVIDIA GTX-295 GPUs also makes run-time performance optimization possible. I will post more details about the auto-tuning possibilities on these architectures.
March 5, 2010
Nvidia Fermi is the codename of Nvidia’s new GPU architecture. Nvidia announced this architecture in the second half of 2009 as a game-changing architecture.
Competition & Long Term Strategy
Nvidia is facing tough competition from its two main rivals, Intel and AMD. Both produce their own CPUs and GPUs, while Nvidia produces only GPUs. Nvidia tried to ease itself into a new market, the chipset market, by releasing custom Nvidia chipsets that incorporated a low-end Nvidia GPU as an alternative to Intel’s Graphics Media Accelerator. These chipsets showed superior graphics performance compared to Intel’s solution. Several companies included these chipsets in their laptops to provide consumers with a better GPU experience in the low-end graphics market. Several companies also built this chipset into what is called the Hybrid SLI architecture. Basically, the Hybrid SLI architecture allows a laptop to have two GPUs on board: a weak low-end one that drains very little battery power and a strong high-end one. The user can switch dynamically between the two based on their preferences. Nvidia also released a chipset for the new Atom processor, which is widely used in current netbooks. Intel did not like this competition and felt threatened by Nvidia. Intel therefore did not give Nvidia the right to release chipsets for its new Core i architecture, and it also sold the Atom processor bundled with its own chipset for less than the processor alone, thus driving Nvidia out of that market entirely.
With Nvidia locked out of the CPU and chipset markets, it had only the GPU market to compete in. Five main markets (seismology, supercomputing, university research workstations, defense, and finance) can together represent about 1 billion dollars in turnover, so Nvidia had to find a way to compete better. Nvidia saw a great opportunity in using the GPU’s large number of processor cores for general computing applications. It saw this as a new, untapped, and very promising market that could allow Nvidia to widen its market share and revenues.
Nvidia started to research the use of GPUs for high performance computing applications such as protein folding, stock options pricing, SQL queries, and MRI reconstruction. Nvidia released its G80-based cards in 2006 to address these applications. This was followed by the GT200 architecture in 2008/2009, which was built on the G80 architecture but provided better performance. While these architectures targeted what is called GPGPU, or general purpose GPU, they were somewhat limited in the sense that they targeted only specific applications and not all applications. The GPGPU model had several drawbacks: it required the programmer to possess intimate knowledge of graphics APIs and GPU architecture; problems had to be expressed in terms of vertex coordinates, textures, and shader programs, which greatly increased program complexity; basic programming features such as random reads and writes to memory were not supported, which greatly restricted the programming model; and finally, the lack of double precision support meant that some scientific applications could not be run on the GPU.
Nvidia got around this by introducing two new technologies: the G80 unified graphics and compute architecture, and CUDA, a combined software and hardware architecture that allows the GPU to be programmed with a variety of high-level programming languages such as C and C++. Therefore, instead of using graphics APIs to model and program problems, the programmer can write C programs with CUDA extensions and target a general purpose, massively parallel processor. This way of GPU programming is commonly known as “GPU Computing”. It allowed for broader application support and programming language support.
Nvidia took what it had learned from the G80 and GT200 architectures to build a GPU with a strong emphasis on a better GPU Computing experience, while at the same time giving a superior graphics experience for normal GPU use. Nvidia based its Fermi architecture on these two goals and regarded them as its long-term strategy.
The Fermi architecture
Things needed to be changed
To allow Fermi to truly support “GPU Computing”, some changes to the architecture had to be made. These changes can be summarized as follows:
General Overview of the Fermi Architecture
3 billion transistors is a huge number; compared with its closest competitor, which has just over 2 billion transistors, it shows how big the Nvidia Fermi will be. To fit this many transistors, Nvidia had to move from the 65 nm and 55 nm fabrication processes of the GT200 generation to a 40 nm process. This allowed Nvidia to put this huge number of transistors on a die without compromising on size and flexibility, but it also resulted in a very long delay in shipping the chip. Because the fabrication process was relatively new and each chip carries a huge number of transistors, the yield of every wafer turned out to be very small, even smaller than expected. This hindered any hopes of mass producing the chip for large-scale retail.
With Fermi, Nvidia aimed for a truly scalable architecture. Nvidia grouped every 4 SMs (Streaming Multiprocessors) into something called a Graphics Processing Cluster, or GPC. Each GPC is in a sense a GPU on its own, which allows Nvidia to scale GPU cards up or down by increasing or decreasing the number of GPCs. Scalability can also be achieved by changing the number of SMs per GPC. Each GPC has its own rasterization engine, which serves the 4 SMs that the GPC contains.
The SM Architecture
Each SM contains 32 stream processors, or CUDA cores. This is 4x the number of CUDA cores per SM in the previous GT200 architecture. Each SM contains the following:
Each of the SM’s 32 CUDA cores contains a fully pipelined ALU and FPU. A CUDA core can perform 1 integer or floating point instruction per clock per thread in single precision (SP) mode. Double precision (DP) mode has improved hugely: DP instructions now take only 2 times as long as SP ones, compared to 8 times as long in previous architectures. Instructions can also be mixed, for example FP + INT, FP + FP, SFU + FP, and more. However, while DP instructions are running, nothing else can run.
Fermi also uses the new IEEE 754-2008 standard for floating point arithmetic instead of the now obsolete IEEE 754-1985 one, which Nvidia used in previous architectures. Under the older standard, Nvidia handled one frequently used sequence of operations, multiplying two numbers and adding the result to a third number, with a special instruction called MAD. MAD stands for Multiply-Add, and it allowed both operations to be performed in a single clock. The MAD instruction performs the multiplication with truncation, followed by an addition with rounding to the nearest even. While this was acceptable for graphics applications, it did not meet the GPU Computing requirement of very accurate results. Therefore, with the adoption of the new standard, Nvidia introduced a new instruction called FMA, or Fused Multiply-Add, which supports both 32-bit single precision and 64-bit double precision floating point numbers. FMA improves upon MAD by retaining the full precision of the intermediate product, without truncating it, and rounding only once after the addition. This allows precise mathematical calculations to be run on the GPU.
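The difference between rounding the product first and rounding only once can be shown with ordinary double precision numbers. The Python sketch below is an illustration on the CPU, not GPU code. The unfused version rounds the product to the nearest double (rather than truncating it as MAD does), and the fused version is emulated with exact rational arithmetic followed by a single rounding. The function names and input values are chosen for this example only.

```python
from fractions import Fraction


def multiply_then_add(a, b, c):
    """Unfused: a*b is rounded to a double, then the sum is rounded again."""
    return a * b + c


def fused_multiply_add(a, b, c):
    """Fused (emulated): compute a*b + c exactly with rationals,
    then round once when converting back to a double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


# The exact product is 1 - 2**-60, which is too close to 1.0 to be
# represented as a double, so the unfused product rounds to exactly 1.0.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

unfused = multiply_then_add(a, b, c)
fused = fused_multiply_add(a, b, c)
print(unfused)  # 0.0
print(fused)    # -8.673617379884035e-19, i.e. -(2**-60)
```

The unfused result loses the answer completely, while the single rounding keeps it. This is the kind of cancellation that matters in scientific codes.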
CUDA is a blend of hardware and software that allows Nvidia GPUs to be programmed with a wide range of programming languages. A CUDA program calls parallel kernels. Each kernel executes in parallel across a set of parallel threads. The GPU first instantiates a kernel on a grid of parallel thread blocks, where each thread within a thread block executes an instance of the kernel.
Thread blocks are groups of concurrently executing threads that can cooperate among themselves through shared memory. Each thread within a thread block has its own per-thread private local memory, while the thread block has its per-block shared memory. This per-block shared memory helps with inter-thread communication, data sharing, and result sharing between the threads of the same thread block. At the grid level, there is also a per-application-context global memory. A grid is an array of blocks that execute the same kernel.
This hierarchy allows the GPU to execute one or more kernel grids, a streaming multiprocessor (SM) to execute one or more thread blocks and CUDA cores and other execution units in the SM to execute threads. Nvidia groups 32 threads in something called a warp. Each SM has two warp schedulers and two instruction dispatch units. This allows for two warps to be issued and executed in parallel on each SM.
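To see what grouping 32 threads into a warp means for a thread block, here is a small Python sketch that splits the thread indices of one block into warps. It only illustrates the arithmetic. The helper name `split_into_warps` and the block sizes are hypothetical, and how the hardware actually numbers threads inside a multi-dimensional block is not covered.

```python
WARP_SIZE = 32  # threads per warp, as stated above


def split_into_warps(threads_per_block):
    """Return one range of thread indices per warp of a thread block.
    The last warp is only partially filled when the block size is not
    a multiple of WARP_SIZE."""
    return [
        range(start, min(start + WARP_SIZE, threads_per_block))
        for start in range(0, threads_per_block, WARP_SIZE)
    ]


warps = split_into_warps(100)
print(len(warps))       # 4
print(len(warps[-1]))   # 4 threads in the last, partially filled warp
```

A block of 100 threads still occupies 4 warps, so the last warp runs mostly empty. This is why block sizes that are multiples of 32 are usually preferred.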
The execution takes place as follows. The dual warp schedulers inside each SM choose 2 warps for execution, and one instruction from each warp is issued to a group of 16 cores, 16 load/store units, or 4 SFUs.
The streaming multiprocessor’s special function units (SFU) are used to execute special computations such as sine, cosine, reciprocal and square root. The SFU is decoupled from the dispatch unit. This decoupling means that the dispatch unit can still dispatch instructions for other execution units while the SFU is busy.
One of the biggest selling points of the new Fermi architecture is its implementation of a true cache. As stated earlier, previous GPU architectures did not have a true L1 cache. Instead, these architectures used something called “Shared Memory”. This was fine for graphics needs, but since Nvidia is aiming to improve its GPU computing market share, it needed to implement a true L1 cache, which some GPU computing applications often need. Nvidia included 64 KB of memory per SM that is configurable as shared memory and L1 cache. To handle both graphics and GPU computing needs at the same time, this 64 KB memory lets the programmer explicitly state how much should act as shared memory and how much as L1 cache. The current options are either 16 KB of L1 cache and 48 KB of shared memory, or vice versa. This allows Fermi to keep supporting existing applications that make use of shared memory, while at the same time allowing new applications to be written to make use of the new L1 cache.
For a long time there had been a huge gap between geometry and shader performance. From the GeForce FX to the GT200, shader performance increased by a factor of 150, while geometry performance only tripled. This was a huge problem that would bottleneck the GPU’s performance. It happened because the hardware that handles a key part of the setup engine had not been parallelized. Nvidia’s solution was to introduce something called the PolyMorph (geometry) Engine. The engine handles a host of pre-rasterization stages, such as vertex and hull shaders, tessellation, and domain and geometry shaders. Each SM contains its own dedicated PolyMorph Engine, which helps overcome bottlenecks by parallelizing the different units inside the PolyMorph Engine.
The SM also contains 4 separate texturing units. These units are responsible for rotating and resizing a bitmap so that it can be placed onto an arbitrary plane of a given 3D object as a texture.
Fermi Memory Hierarchy
In addition to the configurable 64 KB memory contained in each SM, Fermi contains a unified L2 cache and DRAM. The 768 KB unified L2 cache services all load, store, and texture requests. Fermi also contains 6 memory controllers. This large number of memory controllers allows Fermi to support up to 6 GB of GDDR5 memory. There can be several memory configurations, supporting 1.5 GB, 3 GB, or 6 GB, according to the field the card will run in. It is important to mention that all types of memory, from registers to caches to DRAM, are ECC protected.
The Unified Address Space
Fermi unified the address space between the three different memory spaces (thread private local, block shared, and global). In the previous architecture, load and store operations were specific to each type of memory space. This posed a problem for GPU computing applications, as it made the programmer’s task of managing these different memory spaces, each with its own type of instruction, more complex if not impossible. In the unified address space, Fermi puts all three address ranges into a single, continuous address space, and therefore uses a single set of instructions to access all of these memory spaces. The unified address space uses 40-bit addressing, which allows a terabyte of memory to be addressed, with support for 64-bit addressing if needed.
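The terabyte figure follows directly from the address width, and a couple of lines of Python are enough to check it. The variable names are only for this illustration; the 6 GB value is the largest memory configuration mentioned earlier in this post.

```python
ADDRESS_BITS = 40

# Each distinct 40-bit address names one byte.
addressable_bytes = 2 ** ADDRESS_BITS
print(addressable_bytes)             # 1099511627776
print(addressable_bytes // 2 ** 30)  # 1024 GB, i.e. one terabyte

# Smallest address width that can cover the largest 6 GB configuration.
largest_config_bytes = 6 * 2 ** 30
bits_needed = (largest_config_bytes - 1).bit_length()
print(bits_needed)                   # 33
```

So 40 bits leave plenty of headroom above the 6 GB of physical memory, which is room that a single address space shared by local, shared, and global memory can use.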
The GigaThread Scheduler
The Nvidia Fermi architecture makes use of two levels of thread scheduling, each with a different scope. At the chip level there is a global scheduler, which schedules thread blocks onto the various SMs. This global scheduler is called the GigaThread thread scheduler. At a lower level, inside each SM, there are two warp schedulers that schedule the individual threads of a warp / thread block. The GigaThread scheduler handles a huge number of threads in real time and also offers other improvements, such as faster context switching between GPU computing applications and graphics applications, concurrent kernel execution, and improved thread block scheduling.
ROP stands for raster operator. The raster operators are the last step of the graphics pipeline, which writes the textured / shaded pixels to the frame buffer. ROPs handle several chores towards the end of the graphics pipeline, such as anti-aliasing, Z and colour compression, and of course the writing of pixels to the output buffer. Fermi contains 48 ROPs, which are placed in a circle surrounding the L2 cache.
Nvidia Nexus is a development environment designed by Nvidia to facilitate programming massively parallel CUDA C, OpenCL, and DirectCompute applications for Nvidia Fermi cards. The environment is designed to be part of the Microsoft Visual Studio IDE. Nexus allows GPU source code to be written and debugged in an easy way, similar to the way normal CPU applications are developed. It also makes it possible to develop co-processing applications that make use of both the CPU and the GPU.
Presentation Slides and Report