Why GPUs are better at machine learning than CPUs


In the previous chapters, we presented some aspects of parallel computing. In this section, we will cover the last aspect of this module by talking about GPUs. We will explain why GPUs are far better suited than CPUs for parallel computing, and consequently for big data tasks. We will also present CUDA in the upcoming paragraphs.

GPU is an acronym for Graphics Processing Unit, the component of the computer responsible for graphics rendering. The CPU (Central Processing Unit) is the heart of the computer: it is responsible for almost all the computation in most computer tasks. For instance, a tab in your browser is a CPU process.

The CPU is used for parallel computing, but only in a moderate way, mainly because of the small number of cores it contains (a typical consumer CPU has up to 16 cores). It is suited for heavy tasks because each core has a large computing capability. Most of the parallel computing executed by common apps is done on the CPU.

But the GPU supports much more massive parallel computing (a single mainstream GPU like the GeForce RTX 2080 Ti contains 4352 cores). This matters because, as a data scientist, you will have to perform a huge number of small computations, and in that case it is better to have a large number of weak processors than a small number of strong processing units. That is the main advantage a GPU has over a CPU.

[Figure: CPU vs GPU (CUDA)]


This is the main reason why many deep learning/machine learning frameworks are built with support for GPU computing (CUDA, for instance). When instructions are executed in parallel on a GPU, highly parallel workloads can run tens to hundreds of times faster than on a CPU. This architecture is materialised through the concept of Streaming Multiprocessors (SMs): groups of cores executing the same instruction in parallel on different data. GPU cores run at a slower clock rate and have less cache than CPU cores.

You can find a list of Nvidia GPUs here, with their generation, core counts, and more.

GPU Computing

In this field, there are two main projects:

  • CUDA (Compute Unified Device Architecture) is a parallel computing platform and runtime API developed by Nvidia as a C/C++ language extension. It is an interface that gives developers access to the full power of the GPU's cores to execute tasks there. It is proprietary to Nvidia, and there are bindings for other languages, such as PyCUDA for Python.

[Nvidia website]

  • OpenCL (Open Computing Language) is an open interface for parallel computing across heterogeneous hardware; its kernel language also resembles C/C++.

Example code for matrix addition with CUDA [Credit: ]

/* File:
 * Purpose:  Implement matrix addition on a gpu using cuda
 * Compile:  nvcc [-g] [-G] -arch=sm_21 -o mat_add 
 * Run:      ./mat_add <m> <n>
 *              m is the number of rows
 *              n is the number of columns
 * Input:    The matrices A and B
 * Output:   Result of matrix addition.  
 * Notes:
 * 1.  CUDA is installed on all of the machines in HR 530, HR 235,
 *     and LS G12
 * 2.  If you get something like "nvcc: command not found" when you
 *     try to compile your program, type the following command:
 *           $ export PATH=/usr/local/cuda/bin:$PATH
 *     (As usual the "$" is the shell prompt:  just type the rest 
 *     of the line.)
 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Kernel:   Mat_add
 * Purpose:  Implement matrix addition
 * In args:  A, B, m, n
 * Out arg:  C
 */
__global__ void Mat_add(float A[], float B[], float C[], int m, int n) {
   /* blockDim.x = threads_per_block                            */
   /* First block gets first threads_per_block components.      */
   /* Second block gets next threads_per_block components, etc. */
   int my_ij = blockDim.x * blockIdx.x + threadIdx.x;

   /* The test shouldn't be necessary */
   if (blockIdx.x < m && threadIdx.x < n) 
      C[my_ij] = A[my_ij] + B[my_ij];
}  /* Mat_add */

/* Function:  Read_matrix
 * Purpose:   Read an m x n matrix from stdin
 * In args:   m, n
 * Out arg:   A
 */
void Read_matrix(float A[], int m, int n) {
   int i, j;

   for (i = 0; i < m; i++)
      for (j = 0; j < n; j++)
         scanf("%f", &A[i*n+j]);
}  /* Read_matrix */

/* Function:  Print_matrix
 * Purpose:   Print an m x n matrix to stdout
 * In args:   title, A, m, n
 */
void Print_matrix(char title[], float A[], int m, int n) {
   int i, j;

   printf("%s\n", title);
   for (i = 0; i < m; i++) {
      for (j = 0; j < n; j++)
         printf("%.1f ", A[i*n+j]);
      printf("\n");
   }
}  /* Print_matrix */

/* Host code */
int main(int argc, char* argv[]) {
   int m, n;
   float *h_A, *h_B, *h_C;
   float *d_A, *d_B, *d_C;
   size_t size;

   /* Get size of matrices */
   if (argc != 3) {
      fprintf(stderr, "usage: %s <row count> <col count>\n", argv[0]);
      exit(1);
   }
   m = strtol(argv[1], NULL, 10);
   n = strtol(argv[2], NULL, 10);
   printf("m = %d, n = %d\n", m, n);
   size = m*n*sizeof(float);

   h_A = (float*) malloc(size);
   h_B = (float*) malloc(size);
   h_C = (float*) malloc(size);
   printf("Enter the matrices A and B\n");
   Read_matrix(h_A, m, n);
   Read_matrix(h_B, m, n);

   Print_matrix("A =", h_A, m, n);
   Print_matrix("B =", h_B, m, n);

   /* Allocate matrices in device memory */
   cudaMalloc(&d_A, size);
   cudaMalloc(&d_B, size);
   cudaMalloc(&d_C, size);

   /* Copy matrices from host memory to device memory */
   cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
   cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

   /* Invoke kernel using m thread blocks, each of    */
   /* which contains n threads                        */
   Mat_add<<<m, n>>>(d_A, d_B, d_C, m, n);

   /* Wait for the kernel to complete */
   cudaDeviceSynchronize();

   /* Copy result from device memory to host memory */
   cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

   Print_matrix("The sum is: ", h_C, m, n);

   /* Free device memory */
   cudaFree(d_A);
   cudaFree(d_B);
   cudaFree(d_C);

   /* Free host memory */
   free(h_A);
   free(h_B);
   free(h_C);

   return 0;
}  /* main */
