Monday, September 12, 2022

Parallel Programming on an NVIDIA GPU


This article is the first of a two-part series that presents two distinctly different approaches to parallel programming. In the two articles, I use different approaches to solve the same problem: finding the best-fitting line (or regression line) for a set of points.

The two different approaches to parallel programming presented in this and the following Insights article use these technologies:

  • Single-instruction multiple-thread (SIMT) programming, as provided on the Nvidia® family of graphics processing units (GPUs) (this article). In SIMT programming, a single instruction is executed concurrently on hundreds of microprocessors on a graphics card.
  • Single-instruction multiple-data (SIMD) programming, as provided on x64 processors from Intel® and AMD® (the follow-on article). In SIMD programming, a single instruction operates on wide registers that can contain vectors of numbers simultaneously.

The focus of this article is my attempt to exercise my computer's Nvidia card using the GPU Computing Toolkit that Nvidia provides. The follow-on article, Parallel Programming on a CPU with AVX-512 | Physics Forums, uses Intel AVX-512 assembly instructions and includes comparison times of the results from both programs.

Introduction

Although I installed the Nvidia GPU Computing Toolkit and associated samples about five years ago, I didn't do much with it, writing only a small number of simple programs. More recently, I decided to take another look at GPU programming, using the newer graphics card on my more powerful computer.

After I downloaded a newer version of the GPU Computing Toolkit (ver. 10.0; newer versions exist) and got everything set up, I proceeded to build some of the many samples provided in this toolkit. I then decided to try my hand at putting together the example program that is described in this article.

Regression line calculations

Given a set of points ##(X_i, Y_i)##, where i ranges from 1 to N, the slope m and y-intercept b of the regression line can be found from the following formulas.

$$m = \frac{N \left(\sum_{i = 1}^N X_iY_i \right) - \left(\sum_{i = 1}^N X_i \right) \left(\sum_{i = 1}^N Y_i \right)}{N\left(\sum_{i = 1}^N X_i^2\right) - \left(\sum_{i = 1}^N X_i\right)^2}$$

$$b = \frac{\sum_{i = 1}^N Y_i - m\sum_{i = 1}^N X_i}{N}$$

As you can see from these formulas, there are a lot of calculations that must be performed. The program must calculate the sum of the x coordinates and the sum of the y coordinates. It must also calculate the sum of the squares of the x coordinates and the sum of the term-by-term xy products. For the latter two sums, it is convenient to create two new vectors. The first vector consists of the term-by-term products of the x coordinates with themselves. The second consists of the term-by-term products of the x- and y-coordinates.

Disclaimer

Although the program I'm presenting here uses the GPU to calculate the term-by-term vector products ##<X_i Y_i>## and ##<X_i^2>##, it does not use the GPU to calculate the four sums. It turns out that summation on the GPU is a more complicated problem to tackle. Although the NVidia Toolkit provides samples of summing the elements of a vector, those examples are considered advanced topics. For that reason, my code does not use the GPU to calculate these sums.
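For completeness, here is a minimal host-side sketch (not the exact code from my program) of how the four sums and the regression coefficients could be computed once the GPU has filled h_XY and h_XX:

// Minimal host-side sketch: accumulate the four sums, then apply the formulas.
// h_X and h_Y hold the input points; h_XY and h_XX hold the per-element
// products X_i*Y_i and X_i*X_i computed on the GPU.
double sumX = 0.0, sumY = 0.0, sumXY = 0.0, sumXX = 0.0;
for (int i = 0; i < numElements; i++)
{
    sumX  += h_X[i];
    sumY  += h_Y[i];
    sumXY += h_XY[i];
    sumXX += h_XX[i];
}

double N = (double)numElements;
double m = (N * sumXY - sumX * sumY) / (N * sumXX - sumX * sumX);
double b = (sumY - m * sumX) / N;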

Basic terms

  • thread – the basic unit of computation. Each core is capable of running one thread. Threads are organized into blocks, which can be one-dimensional, two-dimensional, or three-dimensional. For this reason, a thread index, threadIdx, can have one, two, or three components, depending on how the blocks are laid out. The three components are threadIdx.x, threadIdx.y, and threadIdx.z. Thread index components that aren't used default to 0.
  • block – a collection of threads. Because blocks can be arranged in one-dimensional, two-dimensional, or three-dimensional layouts, an individual block can be identified by one or more block indexes: blockIdx.x, blockIdx.y, or blockIdx.z. Block index components that aren't used also default to 0. All vectors in my program are one-dimensional, so threadIdx.y, threadIdx.z, blockIdx.y, and blockIdx.z aren't relevant. A program working with image data would likely use a two-dimensional grid, and would therefore use some of these other built-in variables, as the sketch after this list illustrates.
  • grid – a collection of blocks, with the blocks containing threads.
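Block and grid shapes are expressed in code with the CUDA dim3 type. The following short sketch is for illustration only (the kernel names are placeholders, and this code is not part of my program); it shows a one-dimensional configuration like the one used later in this article, and a two-dimensional configuration of the kind an image-processing program might use.

// Illustration only: launch configurations expressed with the CUDA dim3 type.
// One-dimensional: 1,024 blocks of 256 threads, matching this article's program.
dim3 threads1D(256);
dim3 blocks1D(1024);

// Two-dimensional: a 512 x 512 image tiled into 16 x 16 thread blocks.
// A kernel launched this way would use threadIdx.x/y and blockIdx.x/y.
dim3 threads2D(16, 16);
dim3 blocks2D(512 / 16, 512 / 16);

// Hypothetical launches (someKernel1D and someKernel2D are placeholders):
//   someKernel1D<<<blocks1D, threads1D>>>( ... );
//   someKernel2D<<<blocks2D, threads2D>>>( ... );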

The CUDA kernel

To access the NVidia GPU architecture, a programmer uses the API provided in the NVidia GPU Computing Toolkit to write a CUDA (Compute Unified Device Architecture) program. Such a program will contain at least one CUDA kernel, NVidia's term for a per-thread function that runs on one of the GPU's cores. For the kernel shown below, each core will multiply the i-th elements of two vectors and store the result in the corresponding element of a third vector. Since the NVidia GPU I'm running has 1,024 cores, it should make short work of multiplying the elements of two vectors of fairly high dimension.

__global__ void
vectorMult(double *C, const double * A, const double * B, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements)
    {
        C[i] = A[i] * B[i];
    }
}

When this kernel runs, each of potentially 1,024 cores multiplies a pair of values from the two input vectors and stores the product as an output value in the third vector. Figure 1 shows these actions.


Fig. 1 GPU vector multiplication

vectorMult kernel details

The __global__ keyword is a CUDA extension that indicates a function is a kernel, intended to run on the GPU. A kernel function's return type must be void.

The function header indicates that this function takes four parameters. In order, these parameters are:

  • a pointer to the output vector,
  • a pointer to the first input vector,
  • a pointer to the second input vector,
  • the number of elements in each of the three vectors.

The variable i establishes the connection between a particular thread and the corresponding index of the two input vectors and the output vector. For this connection, the program has to identify a particular block within the grid, as well as the thread within that block. For my program, each input vector contains 262,144 double values. This number happens to be 512 X 512, or ##2^{18}##. If I choose a block size of 256 (meaning 256 threads per block), there will be 1,024 blocks in the grid. A block size of 256 in a one-dimensional arrangement means that blockDim.x is 256.

As an example, in the code sample above, for thread 2 in block 3, blockDim.x is 256, blockIdx.x is 3, and threadIdx.x is 2. In Figure 2, block 3 and thread 2 are shown. The index for the input and output vectors is calculated as 256 * 3 + 2, or 770, so thread 2 of block 3 is the thread with index 770 in the grid. Keep in mind that a GPU can perform hundreds of these kinds of per-thread computations concurrently.


Fig. 2 One dimensional grid with a block and threads

The CUDA program

To use the capabilities of a GPU, a program must perform the following steps.

  1. Allocate memory on the host.
  2. Initialize input vector data.
  3. Allocate memory on the device.
  4. Copy data from the host to the device.
  5. Set up parameters for a call to the kernel, including the number of blocks in the grid and the number of threads per block, as well as the parameters of the kernel function itself.
  6. Call the kernel.
  7. Copy output data from the device back to the host.
  8. Free device memory and host memory.

Each of these steps is discussed in the sections that follow, and the condensed sketch below shows how they fit together.
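This is not the complete program from the article; error checking and the second product vector (h_XX / d_XX) are omitted, and it assumes the vectorMult kernel shown earlier is defined in the same file.

#include <stdlib.h>
#include <cuda_runtime.h>

// Minimal sketch of the eight steps for one product vector (error checks omitted).
int main(void)
{
    int numElements = 512 * 512;
    size_t size = numElements * sizeof(double);

    // 1. Allocate host memory, and 2. initialize the input vectors with
    //    points on the line y = 1.0x + 0.5.
    double *h_X = (double *)malloc(size);
    double *h_Y = (double *)malloc(size);
    double *h_XY = (double *)malloc(size);
    for (int i = 0; i < numElements; i++)
    {
        h_X[i] = (double)i;
        h_Y[i] = h_X[i] + 0.5;
    }

    // 3. Allocate device memory.
    double *d_X = NULL, *d_Y = NULL, *d_XY = NULL;
    cudaMalloc((void **)&d_X, size);
    cudaMalloc((void **)&d_Y, size);
    cudaMalloc((void **)&d_XY, size);

    // 4. Copy the input data from the host to the device.
    cudaMemcpy(d_X, h_X, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_Y, h_Y, size, cudaMemcpyHostToDevice);

    // 5. Set up the launch parameters, and 6. call the kernel.
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorMult<<<blocksPerGrid, threadsPerBlock>>>(d_XY, d_X, d_Y, numElements);

    // 7. Copy the output data from the device back to the host.
    cudaMemcpy(h_XY, d_XY, size, cudaMemcpyDeviceToHost);

    // ... compute the sums and the regression coefficients on the host ...

    // 8. Free device memory and host memory.
    cudaFree(d_X);
    cudaFree(d_Y);
    cudaFree(d_XY);
    free(h_X);
    free(h_Y);
    free(h_XY);

    return 0;
}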

Allocate host memory

The program allocates memory on the host by using the C standard library function, malloc. This should be familiar to anyone with experience in C programming. Throughout the code, an h_ prefix denotes a variable in host memory; a d_ prefix denotes a variable in device memory on the GPU.

int numElements = 512 * 512; // = 262144, or 2^18
size_t size = numElements * sizeof(double);

double *h_X = (double *)malloc(size);
double *h_Y = (double *)malloc(size);
double *h_XY = (double *)malloc(size);
double *h_XX = (double *)malloc(size);
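Although the excerpt above doesn't show it, it's worth verifying that the allocations succeeded before going further. A minimal check (assuming <stdio.h> and <stdlib.h> are included) might look like this:

// Minimal sketch: bail out if any of the host allocations failed.
if (h_X == NULL || h_Y == NULL || h_XY == NULL || h_XX == NULL)
{
    fprintf(stderr, "Failed to allocate host vectors!\n");
    exit(EXIT_FAILURE);
}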

Initialize input data

For the sake of simplicity, as well as a check on program accuracy, the data in the input vectors are points that all lie on the line ##y = 1.0x + 0.5##. The loop below initializes the h_X and h_Y vectors with the coordinates of these points. The loop is "unrolled," with each iteration producing four of the points on the line.

const double slope = 1.0;
const double y_int = 0.5;

// Fill the host-side vectors h_X and h_Y with data points.
for (int i = 0; i < numElements; i += 4)
{
    h_X[i] = slope * i;
    h_Y[i] = h_X[i] + y_int;
    h_X[i + 1] = slope * (i + 1);
    h_Y[i + 1] = h_X[i + 1] + y_int;
    h_X[i + 2] = slope * (i + 2);
    h_Y[i + 2] = h_X[i + 2] + y_int;
    h_X[i + 3] = slope * (i + 3);
    h_Y[i + 3] = h_X[i + 3] + y_int;
}

Allocate device memory

The program allocates memory on the device by using the CUDA cudaMalloc function. The first parameter of cudaMalloc is a pointer to a pointer to the type of element in the vector, cast as type void**. The second parameter is the number of bytes to allocate. The cudaMalloc function returns cudaSuccess if memory was successfully allocated. Any other return value indicates that an error occurred.

double *d_X = NULL;
err = cudaMalloc((void **)&d_X, size);

The code for allocating memory for d_Y is nearly identical.
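When the returned value isn't cudaSuccess, the runtime function cudaGetErrorString converts the error code into a readable message. A minimal sketch of such a check (assuming err is declared as cudaError_t, as in the toolkit samples) is shown below; my program's actual error handling may differ.

// Minimal sketch: report a failed device allocation and stop.
if (err != cudaSuccess)
{
    fprintf(stderr, "Failed to allocate device vector d_X (error: %s)\n",
            cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}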

Copy data from host to device

To copy data from host (CPU) to device (GPU) or from device to host, use the CUDA cudaMemcpy function. The first parameter is a pointer to the destination vector, and the second parameter is a pointer to the source vector. The third parameter is the number of bytes to copy, and the fourth parameter indicates whether the copy is from host to host, host to device, device to host, or device to device.

err = cudaMemcpy(d_X, h_X, size, cudaMemcpyHostToDevice);

This function returns cudaSuccess if the data was successfully copied. Similar code copies the values in h_Y to d_Y on the device.

Set up parameters for the call to the kernel

Before calling a kernel, the program must determine the number of threads per block and the number of blocks in a grid. In the following example, the number of blocks per grid is effectively the number of elements divided by 256. The slightly more complicated calculation guards against vector sizes that are not a multiple of the block size.

int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;

Call the kernel

The NVidia Toolkit includes its own compiler that extends C or C++ syntax for calling kernels. The syntax uses a pair of triple angle brackets (<<< >>>) that follow the name of the kernel. These angle brackets contain two parameters: the number of blocks in the grid, and the number of threads per block. A third parameter is optional.

The parameters of the vectorMult kernel shown in the following example are, respectively, the destination vector's address, the addresses of the two input vectors, and the number of elements of each vector. This kernel call passes control to the GPU. The CUDA system software handles all the details involved in scheduling the individual threads running on the processors of the GPU.

vectorMult<<<blocksPerGrid, threadsPerBlock>>>(d_XY, d_X, d_Y, numElements);
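One point to be aware of: the kernel launch itself doesn't return an error code, so launch failures have to be checked separately. A minimal sketch of one common way to do this with the runtime API (this check is my addition, not part of the listing above):

// Minimal sketch: check that the kernel launch succeeded, then wait for it to finish.
cudaError_t launchErr = cudaGetLastError();
if (launchErr != cudaSuccess)
{
    fprintf(stderr, "vectorMult launch failed: %s\n", cudaGetErrorString(launchErr));
    exit(EXIT_FAILURE);
}
cudaDeviceSynchronize();   // block the host until the kernel has completed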

Copy output data back to the host

Copying data from the device back to the host is the reverse of copying data from the host to the device. As before, the first parameter is a pointer to the destination vector, and the second parameter is a pointer to the source vector. The fourth parameter, cudaMemcpyDeviceToHost, indicates that data will be copied from the device (GPU) to the host. If the memory copy is successful, cudaMemcpy returns cudaSuccess. The code below copies the values in d_XY on the device to the vector h_XY in host memory. Similar code copies the values in d_XX on the device to the vector h_XX in host memory.

err = cudaMemcpy(h_XY, d_XY, size, cudaMemcpyDeviceToHost);

Free device and host memory

The first line below frees the memory allocated for the d_X vector, using the cudaFree function. As is the case with many of the CUDA functions, it returns a value that indicates whether the operation was successful. Any value other than cudaSuccess indicates that an error occurred. To free the host vector, use the C memory deallocation function, free. Similar code deallocates the memory used by the other device and host vectors.

err = cudaFree(d_X);

free(h_X);

Program output

The last four lines show that the computed values for the slope and y-intercept agree with those that were used to generate the points on the line, which confirms that the calculations are correct.

[Linear regression of 262144 points]
CUDA kernel launch with 1024 blocks of 256 threads
Sum of x: 34359541760.000000
Sum of y: 34359672832.000000
Sum of xy: 6004765143567284.000000
Sum of x^2: 6004747963793408.000000
Processed 262144 points
Expected value of m: 1.000000
Computed value of m: 1.0000000000
Expected value of b: 0.500000
Computed value of b: 0.5000000000

Full code

For the sake of brevity, I haven't included the complete source code here in this article. If your curiosity is piqued, you can find the source code for this article, RegressionLine2.cu, here: https://github.com/Mark44-AVX/CUDA-vs.-AVX-512.
