The Role of GPUs in Deep Learning

GPUs, or Graphics Processing Units, are important pieces of hardware originally designed for rendering computer graphics, primarily for games and movies. However, in recent years, GPUs have gained recognition for significantly enhancing the speed of computational processes involving neural networks.

GPUs now play a pivotal role in the artificial intelligence revolution, predominantly driving rapid advancements in deep learning, computer vision, and large language models, among others.

In this article, we will delve into the utilization of GPUs to expedite neural network training using PyTorch, one of the most widely used deep learning libraries.

Note: An NVIDIA GPU-equipped machine is required to follow the instructions in this article.

Inroduction to GPUs with PyTorch

PyTorch is an open-source, simple, and powerful machine-learning framework based on Python. It is used to develop and train neural networks by performing tensor computations like automatic differentiation using the Graphics Processing Units.

PyTorch employs the CUDA library to configure and leverage NVIDIA GPUs. CUDA is a GPU computing toolkit developed by Nvidia, designed to expedite compute-intensive operations by parallelizing them across multiple GPUs. PyTorch offers support for CUDA through the torch.cuda library.

Utilising GPUs in Torch via the CUDA Package

The CUDA library in PyTorch is instrumental in detecting, activating, and harnessing the power of GPUs. Let’s delve into some functionalities using PyTorch.

Verifying GPU Availability

Before using the GPUs, we can check if they are configured and ready to use. The following code returns a boolean indicating whether GPU is configured and available for use on the machine.

import torch

The number of GPUs present on the machine and the device in use can be identified as follows:


This output indicates that there is a single GPU available, and it is identified by the device number 0.

Initialize the Device

The active device can be initialized and stored in a variable for future use, such as loading models and tensors into it. This step is necessary if GPUs are available because CPUs are automatically detected and configured by PyTorch.

The torch.device function can be used to select the device.

>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> device

With the device variable, we can now create and move tensors into it.

Creating and Moving tensors to the GPU

The models and datasets are represented as PyTorch tensors, which must be initialized on, or transferred to, the GPU prior to training the model. This can be accomplished in several ways, as outlined below:

  1. Creating Tensors Directly on the GPU

Tensors can be directly created on the desired device, such as the GPU, by specifying the device parameter. By default, tensors are created on the CPU. You can determine the device where the tensor is stored by accessing the device parameter of the tensor.

x = torch.tensor([1, 2, 3])
print("Device: ", x.device)
tensor([1, 2, 3]) Device: cpu

Now, let’s generate the tensors directly on the device.

y = torch.tensor([4, 5, 6], device=device)
print("Device: ", y.device)
tensor([4, 5, 6], device='cuda:0') Device: cuda:0

Lastly, the device number where the tensors are stored can be retrieved using the get_device() method.


In the output above, -1 represents the CPU, while 0 represents GPU number 0.

  1. Transferring Tensors Using the to() Method

Tensors can be transferred from the CPU to the device using the to() method, which is supported by PyTorch tensors.

Check out our hands-on, practical guide to learning Git, with best-practices, industry-accepted standards, and included cheat sheet. Stop Googling Git commands and actually learn it!

x = torch.tensor([1, 2, 3])
x =
print("Device: ", x.device)
Device: cuda:0

When multiple GPUs are available, tensors can be transferred to specific GPUs by passing the device number as a parameter.

For instance, cuda:0 is for the first GPU, cuda:1 for the second GPU, and so on.

x = torch.tensor([8, 9, 10])
x ="cuda:0")

Attempting to transfer to a GPU that is not available or to an incorrect GPU number will result in a CUDA error.

  1. Transferring Tensors Using the cuda() Method

Below is an example of creating a sample tensor and transferring it to the GPU using the cuda() method, which is supported by PyTorch tensors.

tensor = torch.rand((100, 30)) tensor = tensor.cuda()
device(type='cuda', index=0)

Now let’s explore techniques to load the tensors into multiple GPUs through parallelisation i.e. one of the most important features responsible for high computational speeds in GPUs.

Multi-GPU Distributed Training

Distributed training involves deploying both the model and the dataset across multiple GPUs, thereby dramatically accelerating the training process via the capability of parallelization. We will cover some of the distributed training classes offered by PyTorch in the following sections.

CPUs vs GPUs for intense computational activities
Source: NVIDIA


DataParallel is an effective way for conducting multi-GPU training of models on a single machine. It achieves data parallelization at the module level by dividing the input across the designated devices via chunking, and then propagating it through the model by replicating the inputs on all devices.

Let’s create and initialise a basic LinearRegression model class prior to wrapping it within the DataParallel class.

import torch.nn as nn class LinearRegression(nn.Module): def __init__(self, input_size, output_size): super(LinearRegression, self).__init__() self.linear = nn.Linear(input_size, output_size) def forward(self, x): return self.linear(x) model = LinearRegression(2, 5)
LinearRegression( (linear): Linear(in_features=2, out_features=5, bias=True) )

Now, let’s wrap the model to execute data parallelization across multiple GPUs. This can be achieved by utilizing the nn.DataParallel class and passing the model along with the device list as parameters.

model = nn.DataParallel(model, device_ids=[0])
DataParallel( (module): LinearRegression( (linear): Linear(in_features=2, out_features=5, bias=True) ) )

In the above code, we have passed the model along with the list of device ids as parameters. Now we can proceed by directly loading the model on to device and perform model training as required.

model = input_data = 

DistributedDataParallel (DDP)

The DistributedDataParallel class from PyTorch supports training across multiple GPU training on multiple machines. The DistributedDataParallel class is recommended over the DataParallel class, as it manages single machine scenarios by default and exhibits superior speed compared to the DataParallel wrapper.

The DistributedDataParallel module operates on the principle of data parallelism. Here, the model and data are duplicated across multiple processes, and each process conducts training on a data subset.

Setting up the DistributedDataParallel class entails initializing the distributed environment and subsequently wrapping the model with the DDP object.

import torch
import torch.nn as nn

class LinearRegression(nn.Module): def __init__(self, input_size, output_size): super(LinearRegression, self).__init__() self.linear = nn.Linear(input_size, output_size) def forward(self, x): return self.linear(x) model = LinearRegression(2, 5)

torch.distributed.init_process_group(backend='nccl') model = nn.parallel.DistributedDataParallelCPU(model) 

This provides a basic wrapper to load the model for multi-GPU training across multiple nodes.


In this article, we’ve explored various methods to leverage NVIDIA GPUs using the CUDA library in the PyTorch ML library. These strategies help us harness the power of robust GPUs, accelerating the model training process by a factor of ten compared to traditional CPUs in deep learning applications. This significant reduction in training time expedites a broad array of compute-intensive tasks.