Ever since developers found that raising the clock frequency of a processor to increase its performance was no longer efficient (or no longer possible at all), research has shifted towards increasing the number of processors on the die and improving the communication between them, which started the multiprocessor “movement”.
The next trend was to design multiprocessors composed of processors that differ in architecture (heterogeneous multiprocessors), as we have seen in previous presentations/blog posts on the IBM Cell processor, rather than homogeneous multiprocessors.
The latest trend is to offload certain applications or instructions from the CPU to the GPU, creating the GPGPU “movement”.
Why GPGPUs?
GPUs have a huge advantage over the CPU which is described in the following chart:
GPUs are optimized for floating-point operations as well as for parallelism. A typical multiprocessor consists of 4 cores (as with most of the Intel i-Series processors), while a GPU is composed of tens of processors. This is because CPUs could be considered memory-based processors, with much of the chip spent on caches and control logic, while GPUs could be called ALU-based. This allows the GPU to perform more operations in parallel, resulting in high GFLOPS (FLOPS = floating-point operations per second) compared to the CPU.

Basic structure of a typical CPU (left) and GPU (right) (Source: SuperComputing Tutorial 2007, Introduction by David Luebke)
What is CUDA?
CUDA is the missing link between the developer and the GPU. It was developed by NVIDIA and is implemented in all NVIDIA GPUs starting with the G80 series. Before there were programming architectures dedicated to the GPU, a programmer had to choose between dealing with the complex graphics APIs of the GPU or “tricking” it by passing in a texture that contained the data or the instructions and then receiving the result in the form of a texture, which typically creates a lot of overhead.
CUDA processors are programmed in CUDA C, which is basically C/C++ with some CUDA extensions that are explained later on. In early versions of CUDA the GPU had to be programmed in C, while the CPU could be programmed in either C or C++. This matters when writing code, since the developer must know at all times whether the code is compiled for the CPU or the GPU. Starting with CUDA 3.0, more C++ features were enabled for code compiled for the GPU.
CUDA Structure and Terminology
Thread : The smallest unit executing an instruction.
Block : Contains several threads.
Warp : A group of threads of the same block that are physically executed in parallel, all running the same instruction.
Grid : Contains several thread blocks.
Kernel : A function that runs on the GPU and is executed by all threads of a grid.
Device : The GPU.
Host : The CPU.
Addressing
Threads and blocks need unique IDs so that they can be addressed in code. This is important, since threads and blocks are the main components when it comes to writing efficient parallel code. Within each block a thread is uniquely identified by its Thread ID, an integer between 0 and n-1, where n is the total number of threads in the block. In the more complex case of 2D or 3D blocks, the Thread ID is calculated by a function of the thread's indices rather than being read directly. Inside a 2D block of dimensions Dimx by Dimy, threads are positioned at (0,0), (0,1), …, (Dimx-1, Dimy-1), as shown in the previous figure. The position is available through a built-in variable of type uint3, but a position such as (3,2) is not itself a Thread ID; the Thread ID is the single integer computed from it.
The function for getting the Thread ID in a 2D Block is:
x + y * Dimx,
where ‘x’ and ‘y’ are the x and y indices of the thread and Dimx the x dimension of the block (in the case of the figure Dimx would be equal to 5).
In a 3D block the same applies with the difference of dealing with one more dimension which results in the function:
x + y * Dimx + z * Dimx * Dimy,
where x, y and z are the indices (same as for the 2D block) and Dimx and Dimy are the x, y dimensions of the block.
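To make the two formulas concrete, here is a minimal sketch. The names linear_thread_id, record_ids and thread_table are made up for this illustration and are not part of CUDA. Each thread of a single block writes its (x, y) position into a table at the slot given by its linear Thread ID.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Linear ID of a thread inside its block: x + y*Dimx + z*Dimx*Dimy.
// For 1D and 2D blocks the unused indices are 0, so one formula covers all cases.
__device__ unsigned int linear_thread_id()
{
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}

// Each thread stores its (x, y) position at the slot given by its linear ID.
__global__ void record_ids(uint2 *out)
{
    out[linear_thread_id()] = make_uint2(threadIdx.x, threadIdx.y);
}

// Hypothetical host helper: runs one dimx-by-dimy block and returns the table.
std::vector<uint2> thread_table(unsigned int dimx, unsigned int dimy)
{
    std::vector<uint2> host(dimx * dimy);
    uint2 *dev = nullptr;
    cudaMalloc(&dev, host.size() * sizeof(uint2));
    record_ids<<<1, dim3(dimx, dimy)>>>(dev);
    cudaMemcpy(host.data(), dev, host.size() * sizeof(uint2), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

For a 5 x 5 block, the thread at position (3,2) has the Thread ID 3 + 2 * 5 = 13, so thread_table(5, 5)[13] holds the pair (3,2).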
When addressing a block within a grid, the same rules apply as for addressing a thread in a block. In the CUDA versions discussed here, however, grids are limited to two dimensions.
Memory Model
Each thread has a private set of registers, which are the fastest storage available to it (accessible without extra clock cycles), and a private local memory. Despite its name, local memory resides in device memory and is therefore much slower than the registers. Each block has a shared memory, which can be written to and read from by all threads in the same block (hence the name). The shared memory is the best and fastest way for threads to communicate. Space in the shared memory is expensive though, due to its relatively small size, so only variables that are needed by all the threads of the block should be stored in it. Each grid has access to the global memory, which is accessible from all blocks in the grid (and therefore from all threads in those blocks). On a higher level each processor has its own cache, and the whole device (the GPU) has a DRAM.
Of course, the higher the level of a memory, the bigger its size, and the cost of accessing it (in clock cycles, for reads and, where allowed, writes) grows along with it. This is due to the increased distance between the unit accessing the memory and the memory itself, and to the latency of the memory.
Execution
It is always most beneficial for the developer to run a block on a single processor. Consider an example where a 5 x 5 block is to be executed while 25 processors are idle (this is just an example; it is almost impossible to have such a large number of idle processors), so that theoretically each thread could be run by its own processor. Since the program is optimised for parallelism (otherwise it would not make much sense to run it on a GPU, and even less sense to split it among 25 threads), there is a lot of data that must be shared between the threads. Since the threads would run on different processors, this data could not be kept in the shared memory of the block and would have to go to a higher-level memory (the DRAM in the worst case), and each thread would have to access that memory to retrieve the data needed for its operation. As a simple comparison: issuing an access to either the shared memory or the global memory costs about 4 clock cycles, but an access to the global memory incurs about 400 to 600 additional clock cycles of memory latency.
Hence, dividing a block among several processors would in most cases shorten the pure computation time, BUT the time needed to fetch data from memory and the resulting idle time would lead to much worse overall performance than running the block on a single processor.
What actually happens is that the processor in charge of a block divides its threads into warps, and the threads of a warp are executed in parallel. Each warp in the block is given a share of the execution time until the whole kernel has been executed.
CUDA Extensions
There are four main extensions (excluding custom libraries) made to C/C++ that make up CUDA C.
1. Function Type Qualifiers
These qualifiers are written when declaring a function and decide whether the function is called from and executed on the device or the host. There are three qualifiers:
- __device__: the function is called from and executed on the device
- __global__: the function is called from the host and executed on the device
- __host__: the function is called from and executed on the host
A function with a CUDA qualifier has this form:
__device__ return_type function_name (parameters),
where of course “__device__” can be replaced by any of the function qualifiers.
The two qualifiers __host__ and __device__ can be combined to create a function that is compiled for both the host and the device.
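As a small sketch of how the function qualifiers fit together (the function names square, square_plus_one, apply and run_on_one_value are invented for this example): a kernel is declared with __global__ and launched from the host, and it may call __device__ functions.

```cuda
#include <cuda_runtime.h>

// __host__ __device__: compiled twice, callable from CPU code and from GPU code.
__host__ __device__ float square(float v) { return v * v; }

// __device__: runs on the GPU and can only be called from GPU code.
__device__ float square_plus_one(float v) { return square(v) + 1.0f; }

// __global__: a kernel, launched from the host and executed on the device.
// Kernels must return void.
__global__ void apply(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = square_plus_one(data[i]);
}

// __host__ (also the default when no qualifier is written): ordinary CPU function.
__host__ float run_on_one_value(float v)
{
    float *dev = nullptr;
    cudaMalloc(&dev, sizeof(float));
    cudaMemcpy(dev, &v, sizeof(float), cudaMemcpyHostToDevice);
    apply<<<1, 32>>>(dev, 1);
    cudaMemcpy(&v, dev, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return v;
}
```

Calling run_on_one_value(3.0f) computes 3 * 3 + 1 on the GPU, while square(2.0f) can also be called directly on the CPU.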
2. Variable Type Qualifiers
Similar to the function qualifiers, the variable qualifiers decide the lifetime of a variable and in which memory it is stored.
- __device__: the variable has the lifetime of the application and is stored in the global memory, which makes it accessible from all blocks within the grid
- __shared__: the variable has the lifetime of the block and is stored in the shared memory, hence accessible from all threads within the block
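The following sketch shows both variable qualifiers in one kernel. The names block_sums, sum_in_block and block_sum are assumptions made for this example. Every block adds up the numbers 0 to 63 in its own shared memory and publishes the total in a __device__ array in global memory.

```cuda
#include <cuda_runtime.h>

// __device__ variable: lives in global memory for the lifetime of the application.
__device__ int block_sums[4];

// Each block of 64 threads adds up the numbers 0..63 in shared memory.
__global__ void sum_in_block()
{
    // __shared__ variable: one copy per block, visible to all threads of that block.
    __shared__ int partial[64];
    partial[threadIdx.x] = threadIdx.x;
    __syncthreads();                       // wait until every thread has written

    if (threadIdx.x == 0) {
        int total = 0;
        for (int i = 0; i < 64; ++i) total += partial[i];
        block_sums[blockIdx.x] = total;    // publish the result in global memory
    }
}

// Hypothetical host helper: launches 4 blocks and returns the sum stored by `block`.
int block_sum(int block)
{
    int host[4] = {0, 0, 0, 0};
    sum_in_block<<<4, 64>>>();
    cudaMemcpyFromSymbol(host, block_sums, sizeof(host));
    return host[block];
}
```

Each of the four blocks has its own copy of partial, but all of them write into the same block_sums array.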
3. Execution Configuration
When calling a __global__ function, the dimensions of the grid and of the blocks in which the function is to be executed must be specified. This is called the execution configuration; it is written between the function name and the argument list:
<<<GridDimension, BlockDimension, Ns, S>>>
GridDimension: dimension of the grid (number of blocks).
BlockDimension: dimension of each block (number of threads).
Ns (optional): number of bytes of shared memory to allocate dynamically per block for this call.
S (optional): the stream associated with this call.
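Here is a sketch of a kernel launch that uses all four parts of the execution configuration. The kernel reverse_in_block and the helper reverse_groups are hypothetical names; the example assumes that the number of elements is a multiple of the block size.

```cuda
#include <cuda_runtime.h>

// Kernel that uses dynamically sized shared memory; its size is the third
// launch parameter (Ns), given in bytes.
__global__ void reverse_in_block(int *data)
{
    extern __shared__ int tile[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];
    __syncthreads();
    data[i] = tile[blockDim.x - 1 - threadIdx.x];
}

// Hypothetical host helper: reverses each group of `threads` values in place.
// Assumes n is a multiple of `threads`.
void reverse_groups(int *host, int n, int threads)
{
    int *dev = nullptr;
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMalloc(&dev, n * sizeof(int));
    cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);

    dim3 grid(n / threads);                         // GridDimension
    dim3 block(threads);                            // BlockDimension
    size_t shared_bytes = threads * sizeof(int);    // Ns
    reverse_in_block<<<grid, block, shared_bytes, stream>>>(dev);  // S = stream

    cudaStreamSynchronize(stream);
    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    cudaStreamDestroy(stream);
}
```

With the values {1, 2, 3, 4} and two threads per block, the first block reverses {1, 2} and the second block reverses {3, 4}.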
4. Built-in Variables
There are four variables that have been introduced to extend the C/C++ language. All of them are mainly associated with thread/block addressing:
- gridDim -> specifies the dimensions of the grid. Type: dim3
- blockDim -> specifies the dimensions of the block. Type: dim3
- blockIdx -> the index of the block within the grid. Type: uint3
- threadIdx -> the index of the thread within the block. Type: uint3
Compiler
CUDA C code is compiled with NVCC, NVIDIA's compiler driver, which is based on the PathScale Open64 compiler. Open64 is an open-source compiler released under the GNU license. PathScale Open64 has been further developed by the company PathScale specifically for x86-64 and Itanium processors, which is the main reason why it is optimised for parallelism.
Two important features of the compiler are the #pragma unroll directive and the -use_fast_math option, both of which can result in better performance.
#pragma unroll x: when written directly before a loop, the loop is unrolled depending on the optional number ‘x’ following the #pragma unroll. There are three cases:
- x = 1: the loop is not unrolled.
- x = n, where n > 1: the loop is unrolled n times.
- no x: the loop is unrolled completely (this requires the trip count to be known at compile time).
-use_fast_math is useful whenever the developer cares more about the performance of the code than about its accuracy. It speeds up mathematical calculations at the cost of less accurate results by replacing standard math functions with faster intrinsic versions. These intrinsic functions start with a double underscore: __mathfunction(), for example __sinf() in place of sinf().
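Both features can be tried out in a few lines. This is only a sketch: the kernel names sum4 and sine_both_ways and the helper sine_difference are made up, and how much speed is actually gained depends on the device and the compiler version.

```cuda
#include <cuda_runtime.h>

// Sums four values per thread. The trip count is a compile-time constant,
// so the compiler can unroll the loop completely.
__global__ void sum4(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float total = 0.0f;
    #pragma unroll
    for (int k = 0; k < 4; ++k)
        total += in[4 * i + k];
    out[i] = total;
}

// sinf() is the accurate library function; __sinf() is the faster, less
// accurate intrinsic that -use_fast_math substitutes for it.
__global__ void sine_both_ways(float x, float *accurate, float *fast)
{
    *accurate = sinf(x);
    *fast     = __sinf(x);
}

// Hypothetical host helper: absolute difference between the two results.
float sine_difference(float x)
{
    float *dev = nullptr, host[2];
    cudaMalloc(&dev, 2 * sizeof(float));
    sine_both_ways<<<1, 1>>>(x, dev, dev + 1);
    cudaMemcpy(host, dev, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    float d = host[0] - host[1];
    return d < 0.0f ? -d : d;
}
```

For small arguments the two sine results differ only slightly, which is the accuracy that is traded for speed.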
Built-in Types
The built-in types are:
char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4, int1, uint1, int2, uint2, int3, uint3, int4, uint4, long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4, float1, float2, float3, float4,
where the first part indicates the base type as we know it (from C, C++ or Java), such as char, int or float, and the number at the end stands for the number of components. So, for instance, the type float4 is a structure with 4 members, which are the actual values (the actual floats). Each of those floats is accessible through the members x, y, z and w.
As an example, we declare a variable of type float4:
float4 variable_name = make_float4 (x, y, z, w);
To access x, y, z or w we write variable_name.x, variable_name.y, variable_name.z or variable_name.w, where variable_name should of course be replaced by the actual name of the variable.
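A short sketch of constructing a float4 and reading its members (sum_components and float4_demo are hypothetical helper names):

```cuda
#include <cuda_runtime.h>

// Adds the four components of a float4; usable on host and device.
__host__ __device__ float sum_components(float4 v)
{
    return v.x + v.y + v.z + v.w;
}

// Hypothetical helper showing construction and component access.
float float4_demo()
{
    float4 v = make_float4(1.0f, 2.0f, 3.0f, 4.0f);
    v.w = 10.0f;                 // components are ordinary struct members
    return sum_components(v);    // 1 + 2 + 3 + 10
}
```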
Example Code (serial -> parallel)
The following is an example of simple code in CUDA C. The function takes as input an integer n, a float a and two float pointers x and y, and stores in each cell y[i] the value a*x[i] + y[i]. Also shown is how this function is called.
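The original figure is not reproduced here, so the following is a reconstruction from the description rather than the exact code from the slides. The name saxpy_serial is an assumption.

```cuda
// Serial saxpy: y[i] = a * x[i] + y[i] for every i in [0, n).
void saxpy_serial(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The function is called like any other C function, for example saxpy_serial(n, 2.0f, x, y); with x and y pointing to arrays of n floats.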
To convert this code to parallel we need to work on two things. First, we need to divide the work among the threads, so that each thread executes one of the y[i] = a*x[i] + y[i] operations. To do so, each thread needs to be uniquely addressed and given a unique element to work on. The second thing is the function call: we need to add the execution configuration (mentioned earlier) to the call to specify how many blocks and threads to allocate for this function. The following figure shows the code implemented in parallel.
In the parallel implementation of saxpy (Scalar Alpha X Plus Y), int i serves as the ID of the thread and also corresponds to the location in the arrays to be read from. So, thread i reads the i-th position of arrays x and y. As for the execution configuration, the function is given nblocks for the number of blocks, where nblocks is an integer depending on n, and the number of threads within each block is fixed at 256.
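Since the figure is missing, here is a sketch of what the parallel version could look like. The kernel name saxpy_parallel and the host wrapper saxpy_on_gpu are assumptions; the wrapper also does the copying between host and device, which the description above leaves out.

```cuda
#include <cuda_runtime.h>

// One thread per element: thread i computes y[i] = a * x[i] + y[i].
__global__ void saxpy_parallel(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // the last block may have more threads than elements
        y[i] = a * x[i] + y[i];
}

// Hypothetical host wrapper: copies the arrays to the device, launches the
// kernel with 256 threads per block, and copies y back.
void saxpy_on_gpu(int n, float a, const float *x, float *y)
{
    float *d_x = nullptr, *d_y = nullptr;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

    int nblocks = (n + 255) / 256;     // round up so every element gets a thread
    saxpy_parallel<<<nblocks, 256>>>(n, a, d_x, d_y);

    cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
}
```

Rounding nblocks up means that the last block can contain threads with i >= n, which is why the kernel checks i < n before touching the arrays.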
Optimisation
To execute a single thread, its instructions and data are read from memory (global or shared, depending on the instruction), the actual execution happens, and the result is written back to memory. This means that there is constant reading from and writing to memory. This is why the throughput depends heavily on the bandwidth between the threads and the memories, and also on the bandwidth between the CPU and the GPU. Another important aspect that always limits performance is memory latency, and as discussed before, the latency increases with the level of the memory (global memory has a higher latency than shared memory); this is why it is smart to avoid accessing higher-level memories whenever possible. As said before, issuing an access to the shared or the global memory takes 4 clock cycles, while a global memory access costs 400 to 600 additional clock cycles of latency. Sometimes it is even more beneficial to recompute data rather than to store and reload it, because of this latency.
Another thing that hurts the total performance of the code are “if, while, do, for and switch” statements, because they can make the threads of a warp take different execution paths. A diverged warp is no longer executed in parallel; its paths must be serialised, and after serialisation the diverged threads must be synchronised again, which adds more instructions to be executed. Possible remedies are unrolling (as discussed earlier) and using branch predication instead of ‘if’ and ‘switch’ statements whenever possible.
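The difference between a divergent branch and a branch-free formulation can be sketched as follows. The kernel and helper names are invented for this illustration, and a compiler may well turn the first kernel into the second form on its own, so this shows the idea rather than a guaranteed speed-up.

```cuda
#include <cuda_runtime.h>

// Divergent: even and odd threads of the same warp take different branches,
// so in principle the warp executes both branches one after the other.
__global__ void scale_divergent(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) data[i] *= 2.0f;
    else            data[i] *= 3.0f;
}

// Branch-free: every thread executes the same instructions and only the
// factor differs.
__global__ void scale_uniform(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float factor = (i % 2 == 0) ? 2.0f : 3.0f;
    data[i] *= factor;
}

// Hypothetical host helper: applies one of the kernels to 64 ones and
// returns element `idx` of the result.
float scaled_value(bool divergent, int idx)
{
    float host[64];
    for (int i = 0; i < 64; ++i) host[i] = 1.0f;
    float *dev = nullptr;
    cudaMalloc(&dev, sizeof(host));
    cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);
    if (divergent) scale_divergent<<<2, 32>>>(dev);
    else           scale_uniform<<<2, 32>>>(dev);
    cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host[idx];
}
```

Both kernels produce the same values; they differ only in how the threads of a warp get there.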
Coalesced Memory
Another important thing to take care of while writing code is coalesced versus uncoalesced memory access. A memory access is coalesced when each thread reads the address that corresponds to it. So if the base address for a certain block is n, then thread i inside this block must access the address n + i * type_of_read, where type_of_read is the size in bytes of the element being read and, on the G80 generation, must be 4, 8 or 16. Any other scheme results in an uncoalesced access. The following figure shows coalesced accesses.
Both scenarios are examples of coalesced accesses; the right part of the figure shows a coalesced access in which some threads do not participate, which results in a (relatively) insignificant loss of performance.
An example of uncoalesced access is shown in the next figure.
In the left example Thread 3 and Thread 4 read from the wrong addresses, while in the right example all threads are shifted by 4 bytes: since the base address is 128, Thread 0 should be reading from address 128.
To have a concrete and strong argument for why coalesced access outperforms uncoalesced access, we look at the results of reading 12M floats in three ways:
1. coalesced: 356 us
2. coalesced (some threads do not participate): 357 us -> that is why the decrease in performance was labelled as “(relatively) insignificant”
3. uncoalesced: 3494 us
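A comparison of this kind can be reproduced with a sketch like the following. The kernels copy_coalesced and copy_strided and the helper time_copy_ms are hypothetical and are not the code behind the numbers above; the measured times depend heavily on the GPU generation.

```cuda
#include <cuda_runtime.h>

// Coalesced: thread i reads element i, so neighbouring threads touch
// neighbouring 4-byte words.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Uncoalesced: neighbouring threads read addresses `stride` elements apart.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[((long long)i * stride) % n];
}

// Hypothetical host helper: times one copy of n floats with CUDA events and
// returns the elapsed time in milliseconds. stride == 1 selects the coalesced kernel.
float time_copy_ms(int n, int stride)
{
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int nblocks = (n + 255) / 256;
    cudaEventRecord(start);
    if (stride == 1) copy_coalesced<<<nblocks, 256>>>(in, out, n);
    else             copy_strided<<<nblocks, 256>>>(in, out, n, stride);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return ms;
}
```

Comparing time_copy_ms(n, 1) with time_copy_ms(n, 32) for a large n gives a rough impression of the cost of uncoalesced reads on a given device.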
Banks
The shared memory is the fastest memory in a CUDA processor after the registers of the threads. This is a result of dividing the shared memory into banks, which allows threads to access the shared memory simultaneously. A problem that arises from banks is the bank conflict, which occurs when two or more threads try to access the same bank at the same time, for instance when accessing elements whose size is not 32 bits.
The first cause of bank conflicts is obvious: a bank can only serve a single thread at a time, so when two or more threads try to gain access, one thread is served and the rest are serialised. As for the second cause, consider the following: an array ELEMENTS is stored in the shared memory, where each element is 8 bits in size, which means that 4 elements are stored in one 32-bit bank word. Assume thread i is accessing ELEMENTS[i], which is stored in bank number j, and thread i+1 is accessing ELEMENTS[i+1]. With 32-bit elements, ELEMENTS[i+1] would be stored in bank number j+1. But since the elements are only 8 bits in size, ELEMENTS[i+1] is typically stored in the same bank j as ELEMENTS[i]. So in the end both thread i and thread i+1 try to access the same bank, although for different elements.
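The mapping from an array element to a bank is plain arithmetic, which the following sketch spells out. It assumes 16 banks that are each 32 bits (4 bytes) wide, as on the G80 generation; other GPU generations use different numbers. The function names are made up for this illustration.

```cuda
#include <cuda_runtime.h>

// Assumption: 16 banks, each 32 bits (4 bytes) wide.
enum { kNumBanks = 16, kBankBytes = 4 };

// Bank that holds the byte at `byte_offset` from the start of shared memory.
__host__ __device__ int bank_of(int byte_offset)
{
    return (byte_offset / kBankBytes) % kNumBanks;
}

// Bank touched by thread `tid` when it reads element `tid` of an array
// whose elements are `element_bytes` wide.
__host__ __device__ int bank_for_thread(int tid, int element_bytes)
{
    return bank_of(tid * element_bytes);
}
```

With 1-byte elements, threads 0 to 3 all land in bank 0 and thread 4 is the first to reach bank 1; with 4-byte elements every thread of a group of 16 gets a bank of its own.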
CUDA Libraries
CUBLAS: CUDA accelerated Basic Linear Algebra Subprograms
CUFFT: CUDA Fast Fourier Transform
MAGMA: Matrix Algebra on GPU Multicore Architectures
CULA: implementation of LAPACK interface
CUDA Tools
CUDA-GDB Debugger
CUDA Memory Checker
CUDA Visual Profiler
NEXUS – NVIDIA IDE
More than C/C++
In order to allow a wider spectrum of developers to code in CUDA, the range of languages that can be compiled to CUDA has been broadened. There exist converters from Fortran to CUDA (NOAA F2C-ACC), Python to CUDA (PyCUDA) and Java to CUDA (jaCUDA). Unfortunately, code generated using these converters is not fully optimised, which means that some manual optimisation is still needed to obtain optimised CUDA code.






May 7, 2010 at 2:39 pm
Hi Majed,
I am confused about the optimisation part. For the i7 architecture I mentioned that there are 3 memory levels: L1, L2 and L3. Level 3 is the only shared (global) memory, with a latency of about 40 cycles; levels 1 and 2 are private caches, but there is a copy of each one in level 3, so if there is a miss in level 3 there is no need to search levels 1 and 2.
How does this compare to the global memory of CUDA?
What I understand now is that the global memory has a higher latency than the shared memory, which is why it is smart to avoid accessing higher-level memories whenever possible. But does CUDA contain any type of private memory?
May 9, 2010 at 2:19 pm
Well, first there is a big difference between a GPU and a CPU in terms of processors. While the global memory in the i-series processors is accessed by 4 processors, the global memory in a GPU is accessed by tens or hundreds of processors. This, imo, results in the higher latency for accessing the global memory in the GPU.
As for private memory, each thread has its own registers as well as a local memory (also per thread). The next levels are the shared memory per block, the global memory per grid and, further on, the cache per processor and finally the DRAM of the whole device.
You also have to keep in mind that shared != global, since this might be causing the confusion. In CUDA, shared memory is the per-block memory shared by all threads of that block. The global memory is the memory accessible from all blocks and grids.
Hope that cleared your confusion.
May 10, 2010 at 5:58 pm
Hi Majid, hope you are fine. I really like your presentation. I just wanted to know the difference between CUDA, OpenMP and OpenCL. And what are the advantages and disadvantages of CUDA?
May 11, 2010 at 2:07 pm
Although I did not read thoroughly about OpenMP the difference is clear. CUDA is designed to (mainly) program the GPU and to offload some of the instructions from the CPU to the GPU, while OpenMP is only targeted at parallelizing code written on the CPU. So, simply:
CUDA -> GPU
OpenMP -> CPU
OpenCL adopts the same idea as CUDA; both are targeted at making use of the GPU. Again, I did not thoroughly inspect OpenCL, yet there is one major difference between the two: CUDA is an architecture while OpenCL is a framework. If this statement does not make much sense:
CUDA can only be used with NVIDIA GPUs, starting from G80xx, which all consist of CUDA cores. So, you get a framework that is optimized to function on a certain GPU.
On the other hand, there is no GPU that is designed using OpenCL cores, since such a thing does not exist.
So, CUDA is optimized to work ONLY on NVIDIA GPUs, which could be considered an advantage as well as a limitation -> disadvantage, while OpenCL (and I am assuming here, so please correct me if I am wrong) runs on any GPU that uses recent technologies.
A statistic showing how fast CUDA runs certain programs compared to OpenCL, imo, cannot exist. And even if it did, it would not mean that one is better than the other, since the efficiency of a program running on either depends in the end on the code itself and on whether or not it is optimized to the maximum (if such a thing exists; code being optimized to the maximum, that is). At the end of the day it comes down to how you write your code, which leads us to probably the biggest disadvantage of CUDA (OpenCL probably suffers from the same).
As Dr. Mohamed said: “It’s easy to write CUDA, but it’s hard to optimize it.”
I know I tend to write a lot, so to sum it up:
Differences:
1. OpenMP Vs CUDA:
OpenMP: targeted at CPUs
CUDA: targeted at GPUs
2. OpenCL Vs CUDA:
OpenCL: Software only (runs on any GPU)
CUDA: Software and hardware (optimized for NVIDIA GPUs)
Advantages
- More optimization since it comes in a package (SW + HW)
Disadvantages
- Limited to NVIDIA GPUs
- Easy to code, hard to optimize
May 15, 2010 at 12:47 pm
Hello Majed,
I wanted to ask how I can manage selecting between coalesced and uncoalesced memory while I am writing the code?
Another thing: on the last slide of the presentation you mentioned something related to MD5. Can you elaborate more?
Thanks in advance
ismail
May 16, 2010 at 12:23 pm
I had mentioned MD5 as a way of showing the computational power of CUDA. You can read further about MD5 at wikipedia:
http://en.wikipedia.org/wiki/MD5#Security
MD5 is a hashing function that was used in cryptography until it was proven to be weak against numerous attacks in December 2008.
So, using CUDA MD5 was brute forced (which means trying all possible hash-values, which is the simplest and usually dumbest and longest way of cracking a hash-value) in only 8 seconds!
About coalesced and uncoalesced: there are several ways to solve the problem of uncoalesced memory access. The first is to use “cudaMallocPitch()” when allocating memory for 2D arrays, because it pads the allocation so that every row starts at a properly aligned address.
This approach is simple and very easy compared to the other approach I will briefly explain. The drawback is that it reserves memory that will never be used, so basically it is redundant. The same padding idea applies to single elements: if you are creating an array of type float3 (which takes 3 * 4 bytes per element, hence having the tendency to cause uncoalesced access), you can pad every element to 16 bytes, so that every fourth float in memory is reserved without being used. You store your first float3 in the first 3 floats, then you skip one float and store the next float3 in the next 3 floats and skip one, … and so on.
The other approach would be to arrange the reads of the threads to be in line (hence creating a coalesced access), yet without skipping any memory.
This is a bit more complicated. Basically, every thread reads from the in-line memory location and then you change the writing position. I know this sounds very confusing, but it is really hard to explain without charts and sample code.
So this would be an uncoalesced read code sample:
int index = blockIdx.x * blockDim.x + threadIdx.x;
float3 a = d_in[index];
d_in is the array of type float3, so reading d_in[index] results in an uncoalesced access: each float3 takes 12 bytes (three floats), so each thread reads 12 bytes, the next thread reads the following 12 bytes, and so on, and the addresses read by neighbouring threads are 12 bytes apart instead of 4.
The following code solves this problem (g_in is the same array, accessed through a float pointer):
int index = 3 * blockIdx.x * blockDim.x + threadIdx.x;
__shared__ float s_data[256*3];
s_data[threadIdx.x] = g_in[index];
s_data[threadIdx.x+256] = g_in[index+256];
s_data[threadIdx.x+512] = g_in[index+512];
__syncthreads();
float3 a = ((float3*)s_data)[threadIdx.x];
First everything is stored in the float array ‘s_data’ in the shared memory.
The basic idea is that each thread stores data from g_in at s_data[threadIdx.x], reading from g_in[index], where index is not the usual “blockIdx.x * blockDim.x + threadIdx.x” but “3 * blockIdx.x * blockDim.x + threadIdx.x”, since each block of 256 threads handles 3 * 256 floats. The read and the write are then repeated for the next two operations with offsets of 256 and 512 floats.
I know you might still be confused, but it’s tricky. And this is not a generic algorithm to use whenever you like; it is something you could use just for this particular example. This is why it is much more complicated, but it saves the redundancy caused by “cudaMallocPitch()”.
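To see the fragment in context, here is a sketch of a complete kernel built around it. The kernel name add_two_float3, the output array g_out, the "+ 2" operation and the host helper add_two are assumptions made for this example; it only works with exactly 256 threads per block and a float count that is a multiple of 3 * 256.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Reads an array of float3 through shared memory. g_in and g_out point to the
// float3 data viewed as plain floats. Assumes exactly 256 threads per block.
__global__ void add_two_float3(const float *g_in, float *g_out)
{
    int index = 3 * blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ float s_data[256 * 3];

    // Three coalesced reads: consecutive threads read consecutive floats.
    s_data[threadIdx.x]       = g_in[index];
    s_data[threadIdx.x + 256] = g_in[index + 256];
    s_data[threadIdx.x + 512] = g_in[index + 512];
    __syncthreads();

    // Each thread now picks up its own float3 from shared memory.
    float3 a = ((float3 *)s_data)[threadIdx.x];
    a.x += 2.0f; a.y += 2.0f; a.z += 2.0f;

    // Write back the same way, again with coalesced accesses.
    ((float3 *)s_data)[threadIdx.x] = a;
    __syncthreads();
    g_out[index]       = s_data[threadIdx.x];
    g_out[index + 256] = s_data[threadIdx.x + 256];
    g_out[index + 512] = s_data[threadIdx.x + 512];
}

// Hypothetical host helper: v holds float3 values as consecutive floats;
// its size must be a multiple of 3 * 256.
void add_two(std::vector<float> &v)
{
    float *d_in = nullptr, *d_out = nullptr;
    size_t bytes = v.size() * sizeof(float);
    unsigned int nblocks = (unsigned int)(v.size() / (3 * 256));
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, v.data(), bytes, cudaMemcpyHostToDevice);
    add_two_float3<<<nblocks, 256>>>(d_in, d_out);
    cudaMemcpy(v.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}
```

The global memory is only ever touched with consecutive threads accessing consecutive floats; the float3 view exists only in shared memory.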
May 15, 2010 at 8:03 pm
Hey Majid,
I have a question regarding the compiling of the CUDA applications. If you write a CUDA application, do you then specify the specific card model it will run on or is it compiled as a general CUDA application which could run on any CUDA capable GPU ?
My main concern is that if you handle some tasks in your CUDA code manually, and these edits are specific to this card’s specifications, then if you try to run the same application on a card with a different specification the application will not work or it would be buggy.
Thanks,
Ahmed Labib
May 29, 2010 at 7:51 am
Dear Labib,
That’s a very interesting question. The most important thing is to have a CUDA capable GPU, as you said. You need to know the specs of the GPU (or device) you are working with to know how to configure your function calls (remember the execution configuration?). This is essential to optimize your code.
If you exceed the maximum for any of these values, for the number of blocks for example, the kernel launch fails with an error; the value is not set to the maximum automatically.
As for being a bit more professional, you could get to know the device you’re working with through some CUDA functions, which mainly are:
cudaGetDeviceCount();
which returns the number of CUDA capable GPUs at hand.
cudaSetDevice()
which specifies, which device to use, if you have access to more than one device.
cudaGetDeviceProperties()
which returns the properties of the accessed GPU. The properties are:
char name[256];
size_t totalGlobalMem;
size_t sharedMemPerBlock;
int regsPerBlock;
int warpSize;
size_t memPitch;
int maxThreadsPerBlock;
int maxThreadsDim[3];
int maxGridSize[3];
size_t totalConstMem;
int major;
int minor;
int clockRate;
size_t textureAlignment;
So, if you want to write CUDA code that runs on any CUDA capable GPU regardless of its properties, you should use the fields of the properties structure for configurations and such.
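Put together, querying the devices could look like this sketch. The helper names print_devices and safe_block_size are made up; the CUDA calls and the cudaDeviceProp fields are the ones listed above.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Prints a few properties of every CUDA capable device and returns how many
// devices were found.
int print_devices()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return 0;
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s\n", dev, prop.name);
        printf("  global memory:         %zu bytes\n", prop.totalGlobalMem);
        printf("  shared memory / block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("  warp size:             %d\n", prop.warpSize);
        printf("  max threads / block:   %d\n", prop.maxThreadsPerBlock);
        printf("  compute capability:    %d.%d\n", prop.major, prop.minor);
    }
    return count;
}

// Selects `device` and returns a block size it accepts: the requested size,
// capped at the device's maxThreadsPerBlock.
int safe_block_size(int device, int requested)
{
    cudaDeviceProp prop;
    cudaSetDevice(device);
    cudaGetDeviceProperties(&prop, device);
    return requested < prop.maxThreadsPerBlock ? requested : prop.maxThreadsPerBlock;
}
```

A program that takes its block size from safe_block_size() instead of a hard-coded number adapts to whichever card it happens to run on.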
May 21, 2010 at 4:53 am
Hi Majid,
In your paper you mention the turn from increasing GHz to increasing the number of cores on the chip to GPGPUs. What do you think would be the newer ‘trend’?
Best regards,
Amasis
May 27, 2010 at 5:14 am
One of the biggest problems in GPGPUs still, is the communication between the CPU and the GPGPU, which is considered the biggest bottleneck in the system. That is because at the end of the day the CPU gives the instructions to the GPGPU, which could be considered acting as a slave for the CPU. The GPGPU then replies back to the CPU through the same bottlenecked connection.
An approach that might solve some of this might be similar to the “monolithic” approach in multi-core processors, which means putting the largest possible number of processors on the same die/chip in order to maximize the communication between them.
Similarly, if this approach were adopted for GPGPUs, meaning a processor with a built-in GPGPU on the same die/chip, I believe it would enhance the performance a LOT!
In some cases it is faster for the CPU to do some operations, even though the GPGPU would do it much faster, just to save communication time.
In a “monolithic” approach (I don’t really know if the term “monolithic” could be applied in this case, since it is used for CPUs) the communication would be much faster, allowing the CPU to offload any FP operations to the GPGPU without caring about the bottleneck.
May 21, 2010 at 7:01 am
Hey Majid
Nice blog post, I wish I had attended the presentation.
But do you have any concrete example, that shows how CUDA outperforms a CPU?
May 27, 2010 at 5:26 am
There is a statistic showing this, but I don’t know how to upload it here. So, I will just write the numbers.
The numbers labeled ‘*’ are the numbers taken from a CPU with a Tesla GPU, and the numbers labeled ‘#’ are for the same system using CUDA.
Computational Chemistry:
*4.6 Days*
#27 Minutes#
Neurological Modeling:
*2.7 Days*
#30 Minutes#
Cell Phone RF Simulation:
*8 Hours*
#13 Minutes#
The numbers are taken from the SuperComputing Tutorial in 2009 from the introduction by David Luebke.
May 23, 2010 at 7:27 am
Hey Majid,
This is really a great blog =)…
In your blog, you mentioned briefly towards the end, that converters DO exist to convert your favorite language into CUDA. Why is it, that manual optimization is still needed?
Peace,
Mostafa Magdi
May 27, 2010 at 5:33 am
The manual optimization is needed because in CUDA C there are extensions that help the developer optimize his code. Examples would be things that cannot be automated, like the coalesced memory example. A converter would not know how to generate CUDA code that keeps memory accesses coalesced without using cudaMallocPitch(), which, as discussed before, allocates redundant memory.
December 29, 2010 at 11:44 pm
Hello Majid,
This is the most helpful tutorial on CUDA I have found. Thank you so much!
1. I’m new to HPC, GPGPU, and CUDA. I’m looking for books to learn more about both fields. Do you have any favorites?
2. What do you think are the most interesting or urgent problems which can be resolved using CUDA? I’m exploring CUDA as a possibility for a thesis project.
Thanks,
-sheldon