How to build PyTorch with LMS support for CUDA 10.2 on Pop!_OS 20.04.

Aldrian Kwan
5 min read · Jun 15, 2020
[Image: PyTorch logo]
“PyTorch, the PyTorch logo, and any related marks are trademarks of Facebook, Inc.”

First time writing on Medium. I wanted to document the process while troubleshooting this problem, so I took it as an opportunity to write a Medium post. Let’s do this!

Background

At the time of writing, I was learning super-resolution with deep learning using fast.ai. My PC has 8 GB of VRAM, but my model produced OOM (out-of-memory) errors while training. The solution was simple: just use a Compute Engine instance from GCP and call it a day.

While using the GCP Compute Engine instance, I noticed that training the model always consumed almost all of the VRAM available on that instance, yet never produced any OOM errors. That’s when I saw that the Python process was also consuming around 20 GB of system RAM. I googled away and found that IBM’s Large Model Support (LMS) for PyTorch would allow my model to use system RAM as well.

As a self-proclaimed system-tinkering hobbyist, I wanted that kind of “unified memory” (not a valid term) on my own machine to take full advantage of my PC. Besides, remembering to keep a GCP instance switched off when not in use is quite a hassle for me. That, plus not having access to a credit card, is a turn-off (I used a Visa-capable debit card, but I needed to continuously remind myself to keep funds on that card, which is also a turn-off).

I was not aware of any other solution for PyTorch specifically, so here it is: how to let PyTorch use the underutilized RAM on your PC via IBM’s LMS patch.

Pre-requisites

  1. A Debian-based Linux distribution:
    I use Pop!_OS 20.04 because it conveniently ships with the NVIDIA drivers already installed for you.
  2. This guide assumes you are using zsh.
  3. CUDA toolkit and cuDNN (specifically for Pop!_OS 20.04):
    If not using Pop!_OS, install CUDA directly by following Nvidia’s official documentation. My system driver version is 440.82 with CUDA 10.2.89.
  4. An Anaconda environment:
    Remember to create and activate an environment. If you are using pip instead, set up a Python virtual environment and make sure to install all the necessary dependencies correctly.
  5. PyTorch, torchvision, and IBM’s LMS for PyTorch:
    I am using PyTorch 1.5.0, which requires at least Python 3.6, so keep track of the Python version required for your environment. I am using Python 3.7.7 in my environment.
  6. Take note of your GCC version and PyTorch’s GCC requirements. In my case, Pop!_OS 20.04 comes with GCC 9.3, and PyTorch 1.5.0 (CUDA 10.2) cannot be compiled with anything above GCC 8 by default. If you cannot determine which GCC version comes with your system, just continue; we will come back to this later.
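The version requirements in items 5 and 6 can be sketched as a quick pre-build sanity check. This is a minimal illustration, not part of the build itself; the version numbers are the ones discussed above (Python ≥ 3.6 for PyTorch 1.5.0):

```python
# Pre-build sanity check sketch: PyTorch 1.5.0 needs Python >= 3.6
# (my environment uses 3.7.7), checked against the interpreter version.
import sys

MIN_PYTHON = (3, 6)

def python_ok(version_info=sys.version_info) -> bool:
    """True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON

print(python_ok((3, 7)))  # True
print(python_ok((3, 5)))  # False -> recreate the env with a newer Python
```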

Steps

  1. Download everything above and git clone all the necessary repositories.

2. Patch PyTorch with LMS:
I am using PyTorch 1.5.0, so to check out that version, run:

cd /path/to/pytorch # go to pytorch's repo dir
git checkout v1.5.0
git submodule sync
git submodule update --init --recursive
git am /path/to/lms/repo/patches/pytorch_v1.5.0_large_model_support.patch

Now that PyTorch is patched, we can start downloading other dependencies.

3. Install the build dependencies in your conda environment: both the “Common” and the “Linux” dependencies listed in the PyTorch README.

4. Make sure nvcc is installed. Try doing this in the terminal:

nvcc --version

If your system cannot find nvcc, then you are seeing the reason why I wrote this article. If yours can, just move on to the next step. In my case, nvcc could not be found by my shell even though it exists in /usr/lib/cuda/bin.

Without nvcc on the PATH, the PyTorch build system (CMake) cannot locate the CUDA toolkit, so the build quietly proceeds without it and the resulting PyTorch is not CUDA-capable. Now let’s make sure our shell can find nvcc. I am using zsh, so if you are using another shell, please use the equivalent startup file for your shell. I put mine in ~/.zshenv.

echo 'export PATH=/usr/lib/cuda/bin:$PATH' >> ~/.zshenv
# Note the single quotes: they keep $PATH from expanding now,
# so the literal line lands in ~/.zshenv.
# Also note I put :$PATH at the end. This will be discussed later.
# For Ubuntu, nvcc probably exists in
# /usr/local/cuda-{version}/bin or
# /usr/local/cuda/bin
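Why prepend rather than append? The shell searches PATH directories left to right, so putting /usr/lib/cuda/bin first makes its nvcc win without hiding anything else. A minimal sketch of that lookup (the directories and contents here are illustrative, not read from a real system):

```python
# Sketch of how a shell resolves a command against PATH, like `which`:
# directories are searched in order and the first hit wins.
def resolve(cmd, path_dirs, installed):
    """Return the first path providing cmd, or None if unreachable."""
    for d in path_dirs:
        if cmd in installed.get(d, ()):
            return f"{d}/{cmd}"
    return None

installed = {
    "/usr/lib/cuda/bin": {"nvcc"},
    "/usr/bin": {"gcc", "python3"},
}

# Before the edit, /usr/lib/cuda/bin is not on PATH and nvcc is unreachable:
print(resolve("nvcc", ["/usr/bin"], installed))  # None
# After prepending, nvcc resolves while everything else still works:
print(resolve("nvcc", ["/usr/lib/cuda/bin", "/usr/bin"], installed))
print(resolve("gcc", ["/usr/lib/cuda/bin", "/usr/bin"], installed))
```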

Run exec $SHELL (or open a new session) to restart the shell, confirm nvcc is reachable with which nvcc, then repeat nvcc --version. The result should look like:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89

5. Now we can build PyTorch. Before we start, make sure to activate your conda (or python) env:

conda activate <YOUR_ENV>

Let’s put some variables into the current shell:

# Assuming you are using conda...
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
# DO NOT COPY PASTE BLINDLY!!!
# Read pytorch/setup.py for explanation
export USE_CUDA=1 USE_CUDNN=1 USE_MKLDNN=1
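The ${CONDA_PREFIX:-"…"} part of the export above is shell parameter substitution: use $CONDA_PREFIX if it is set and non-empty, otherwise fall back to the directory above the conda executable. A sketch of that logic in Python, with hypothetical paths:

```python
# Sketch of the shell expansion ${CONDA_PREFIX:-"$(dirname $(which conda))/../"}:
# prefer the active env's prefix, else derive it from conda's location.
import os

def cmake_prefix(env: dict, conda_bin: str) -> str:
    """env mimics os.environ; conda_bin mimics `which conda` output."""
    value = env.get("CONDA_PREFIX", "")
    if value:  # set and non-empty, as :- requires
        return value
    # $(dirname $(which conda))/../  ->  two levels up from the binary
    return os.path.dirname(os.path.dirname(conda_bin))

# With an env active, CONDA_PREFIX wins:
print(cmake_prefix({"CONDA_PREFIX": "/home/me/anaconda3/envs/pt"},
                   "/home/me/anaconda3/bin/conda"))
# Without one, we fall back to the base install:
print(cmake_prefix({}, "/home/me/anaconda3/bin/conda"))
```

This is also why the comment says not to copy-paste blindly: if neither conda nor CONDA_PREFIX is available, CMAKE_PREFIX_PATH ends up wrong.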

(Courtesy of Zhanwen Chen) Before building PyTorch, you might want to avoid an older Anaconda symbolic-linking problem by doing this (assuming you’re using anaconda3):

cd ~/anaconda3/envs/{YOUR_ENV}/compiler_compat
mv ld ld.old

If the PyTorch build cannot find OpenMP, you might want to install it:

sudo apt install libomp-dev

To build PyTorch, run:

cd /path/to/pytorch
python setup.py install

When the build finishes and everything runs correctly, congrats! Make sure your torch build is correct by running these in the Python interpreter:

import torch
torch.cuda.is_available() # should return True
torch.cuda.set_enabled_lms(True)

Rename the Anaconda compiler linker back:

cd ~/anaconda3/envs/{YOUR_ENV}/compiler_compat
mv ld.old ld

Continue installing torchvision:

cd /path/to/torchvision
python setup.py install

To allow torchvision to download its datasets, install tqdm:

pip install tqdm

Now you are done! But if you saw this while building PyTorch:

cuda unsupported GNU version! gcc versions later than 8 are not supported

Then that means your system GCC is newer than the maximum version this CUDA toolkit supports.
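Each CUDA toolkit caps the host compiler it accepts; nvcc enforces the cap at compile time, which is exactly the error above. A rough sketch of that relationship follows; the version caps are assumptions pulled from memory of NVIDIA’s installation guides, so verify them against the docs for your exact toolkit:

```python
# Assumed map of CUDA toolkit -> maximum supported host GCC major version.
# These caps are approximate; check NVIDIA's installation guide for your
# toolkit before relying on them.
MAX_GCC_FOR_CUDA = {
    "10.0": 7,
    "10.1": 8,
    "10.2": 8,
    "11.0": 9,
}

def needs_older_gcc(cuda: str, gcc_major: int) -> bool:
    """True when the system compiler is too new for this CUDA toolkit."""
    return gcc_major > MAX_GCC_FOR_CUDA[cuda]

# Pop!_OS 20.04 ships GCC 9.3, but we build against CUDA 10.2:
print(needs_older_gcc("10.2", 9))  # True -> install gcc-8, as below
```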

Solving the problem

  1. Take note of the maximum GCC version in the error message above and install it. In this case, I need to install GCC 8:
sudo apt install gcc-8 g++-8

2. Take note of the paths of the newly installed compilers using which gcc-8 and which g++-8. Take note of the path given by which gcc too. If a newer gcc exists (e.g. 9), go to step 3.

3. Use update-alternatives to manage multiple versions of gcc & g++ (in case other versions of gcc exist):

sudo update-alternatives --install [WHICH GCC] gcc [WHICH GCC-8] 8
sudo update-alternatives --install [WHICH G++] g++ [WHICH G++-8] 8
sudo update-alternatives --install [WHICH GCC] gcc [WHICH GCC-9] 9
sudo update-alternatives --install [WHICH G++] g++ [WHICH G++-9] 9
# Last number means priority. See man update-alternatives for detail.

Run sudo update-alternatives --config gcc and choose which version to use.
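The --config step matters because in auto mode update-alternatives selects the registered alternative with the highest priority number, which here would be GCC 9, not the GCC 8 we need. A small sketch of that selection rule, with hypothetical paths and the priorities used above:

```python
# Sketch of update-alternatives' auto mode: among registered alternatives,
# the entry with the highest priority wins unless manually overridden.
def auto_choice(alternatives: dict) -> str:
    """alternatives maps alternative path -> priority; return the winner."""
    return max(alternatives, key=alternatives.get)

gcc_alts = {"/usr/bin/gcc-8": 8, "/usr/bin/gcc-9": 9}
print(auto_choice(gcc_alts))  # /usr/bin/gcc-9
```

Since auto mode would keep pointing gcc at version 9, running --config and manually selecting gcc-8 is what actually makes the CUDA-compatible compiler active.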

Once all that is done, run git clean -xfd in the pytorch repo directory, then try rebuilding.

Hopefully, this will help you build PyTorch from source. This problem took me around two days to solve, and I hope you (and my future self) will not repeat the same mistakes. That’s it for now.

If my solution does not work, you might want to take a look at this SO post:

https://stackoverflow.com/questions/6622454/cuda-incompatible-with-my-gcc-version

References:

https://medium.com/repro-repo/build-pytorch-from-source-on-ubuntu-18-04-1c5556ca8fbf
https://github.com/pytorch/pytorch/
https://github.com/pytorch/vision

Edit 1: Use update-alternatives instead of manually creating symlinks.
