How to build PyTorch with LMS support for CUDA 10.2 in Pop!_OS 20.04.
First time writing on Medium. I wanted to document the process while troubleshooting the problem, so I took it as an opportunity to write a Medium post. Let’s do this!
Background
At the time of writing, I was learning super-resolution with deep learning using fast.ai. My PC has 8 GB of VRAM, but my model produced OOM errors while training. The solution was simple: just use a Compute Engine instance on GCP and call it a day.
While using the GCP CE instance, I noticed that training always consumed almost all of the instance’s VRAM, yet never produced any OOM errors. That’s when I saw the Python process also consuming around 20 GB of system RAM. I googled away and found that IBM’s Large Model Support (LMS) for PyTorch would let my model spill over into system RAM.
As a self-proclaimed system-tinkering hobbyist, I wanted that kind of “unified memory” (not a valid term) on my own PC. Besides, remembering to keep your GCP instance off when you’re not using it is quite a hassle for me. That, and doing it all without a credit card is a turn-off (I used a Visa-capable debit card, but I had to continuously remind myself to keep funds on that card, which is also a turn-off).
I was not aware of any other solution for PyTorch specifically, so here it is: how to let PyTorch use the underutilized RAM on my PC using IBM’s LMS patch.
Pre-requisites
- Assumes you are using a Debian-based Linux distribution:
I use Pop!_OS 20.04 because it conveniently ships with the Nvidia drivers already installed.
- Assumes you are using zsh.
- CUDA Toolkit and cuDNN (specifically for Pop!_OS 20.04):
If you are not using Pop!_OS, install CUDA directly from Nvidia’s official documentation here. My system driver version is 440.82 with CUDA 10.2.89.
- An Anaconda environment:
Remember to create and activate an environment. If you are using pip instead, set up a Python virtual environment to make sure all the necessary dependencies are installed correctly.
- PyTorch, torchvision, and IBM’s LMS patch for PyTorch:
I am using PyTorch 1.5.0, which requires at least Python 3.6, so keep track of the Python version in your environment. I am using Python 3.7.7.
- Take note of your GCC version and PyTorch’s GCC requirements. In my case, Pop!_OS 20.04 ships with GCC 9.3, while PyTorch 1.5.0 (CUDA 10.2) cannot be compiled with anything above GCC 8 by default. If you cannot determine which GCC version comes with your system, just continue; we will come back to this later.
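A quick pre-flight check for that last GCC point; this is my own sketch, so adjust it if your distro names the compiler differently:

```shell
# Print the default gcc and its major version; PyTorch 1.5.0 + CUDA 10.2
# needs GCC 8 or older.
if command -v gcc >/dev/null 2>&1; then
  major=$(gcc -dumpversion | cut -d. -f1)
  echo "default gcc major version: $major"
  if [ "$major" -gt 8 ]; then
    echo "warning: CUDA 10.2 will reject GCC > 8"
  fi
else
  echo "gcc not found on PATH"
fi
```

If the warning fires, don’t worry; the “Solving the problem” section at the end covers it.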
Steps
1. Download everything above and git clone all the necessary repositories.
2. Patch PyTorch with LMS:
I am using PyTorch 1.5.0, so to get to that version do:
cd /path/to/pytorch # go to pytorch's repo dir
git checkout v1.5.0
git submodule sync
git submodule update --init --recursive
git am /path/to/lms/repo/patches/pytorch_v1.5.0_large_model_support.patch
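If you want to see what git am is doing before touching the real tree, here is a self-contained sketch of the same patch workflow in a throwaway repo (the file names and commit messages are made up for illustration):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com
git config user.name you
echo base > model.py
git add model.py
git commit -qm "base"
# Make a commit and export it as a mailbox-style patch, the same
# format the LMS repo ships.
echo patched > model.py
git commit -qam "demo large model support patch"
git format-patch -1 -o patches >/dev/null
git reset -q --hard HEAD~1   # back to the unpatched tree
git am patches/*.patch       # apply it, just like the LMS patch above
git log --oneline -1         # top commit is now the patch
```

If git am stops with a conflict on the real tree (e.g. a version mismatch between the patch and your checkout), git am --abort restores the tree to its pre-patch state.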
Now that PyTorch is patched, we can start downloading other dependencies.
3. Install these dependencies in your conda environment. Get both the Common and the Linux dependencies.
4. Make sure nvcc is installed. Try doing this in the terminal:
nvcc --version
If your system cannot find nvcc, you are looking at the reason I wrote this article. If yours can, just move on to the next step. In my case, nvcc could not be found by my shell even though it exists in /usr/lib/cuda/bin.
Without nvcc on the PATH, PyTorch’s build system (CMake) cannot detect the CUDA toolkit, and the resulting PyTorch build will not be CUDA-capable. Now let’s make sure our shell can find nvcc. I am using zsh; if you are using another shell, use the equivalent startup file for your shell. I put mine in ~/.zshenv.
echo '\nexport PATH=/usr/lib/cuda/bin:$PATH' >> ~/.zshenv
# Single quotes keep $PATH unexpanded until the shell starts,
# and the :$PATH at the end preserves your existing PATH entries.
# On Ubuntu, nvcc probably lives in
# /usr/local/cuda-{version}/bin or
# /usr/local/cuda/bin
Run exec $SHELL (or create a new session) to restart the shell, then confirm nvcc is reachable with which nvcc and repeat nvcc --version. The result should be something like:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
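If which nvcc still comes up empty, the toolkit may be installed under a different prefix than the ones listed above. This little helper is my own sketch (not part of CUDA); it just searches the usual install locations:

```shell
# find_nvcc PREFIX...: print any nvcc binaries under the given prefixes.
find_nvcc() {
  find "$@" -maxdepth 4 -name nvcc -type f 2>/dev/null
}

# Typical prefixes on Pop!_OS / Ubuntu:
find_nvcc /usr/lib/cuda /usr/local /opt
```

Whatever directory it prints is what belongs on your PATH in the echo above.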
5. Now we can build PyTorch. Before we start, make sure to activate your conda (or python) env:
conda activate <YOUR_ENV>
Then let’s put some variables into the current shell:
# Assuming you are using conda...
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
# DO NOT COPY-PASTE BLINDLY!
# Read pytorch/setup.py for an explanation of these flags.
export USE_CUDA=1 USE_CUDNN=1 USE_MKLDNN=1
(Courtesy of Zhanwen Chen) Before building PyTorch, you might want to avoid an old Anaconda symbolic-linking problem by moving its bundled linker aside (assuming you’re using anaconda3):
cd ~/anaconda3/envs/{YOUR_ENV}/compiler_compat
mv ld ld.old
If the build complains that PyTorch cannot find OpenMP, install it:
sudo apt install libomp-dev
To build PyTorch, run:
cd /path/to/pytorch
python setup.py install
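The build is long and memory-hungry; if the compile itself starts OOM-ing (a different OOM than the GPU one), you can cap the number of parallel compiler jobs. PyTorch’s setup.py honors the MAX_JOBS environment variable:

```shell
# Cap parallel C++ compile jobs (pick a number that fits your RAM),
# then rerun python setup.py install.
export MAX_JOBS=4
echo "building with at most $MAX_JOBS parallel jobs"
```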
When you are finished and everything runs correctly, congrats! Verify your torch build by running these in the Python interpreter:
import torch
torch.cuda.is_available() # should return True
torch.cuda.set_enabled_lms(True)
Rename the Anaconda compiler linker back:
cd ~/anaconda3/envs/{YOUR_ENV}/compiler_compat
mv ld.old ld
Continue installing torchvision:
cd /path/to/torchvision
python setup.py install
To allow torchvision to download its datasets, install tqdm:
pip install tqdm
Now you are done! But if you saw this while building PyTorch:
cuda unsupported GNU version! gcc versions later than 8 are not supported
Then that means your system GCC is newer than the version CUDA supports.
Solving the problem
1. Take note of the maximum supported GCC version in the error above and install it. In my case, I need GCC 8:
sudo apt install gcc-8 g++-8
2. Take note of the paths of the newly installed compilers using which gcc-8 and which g++-8. Note the path given by which gcc too. If a newer gcc exists (e.g. 9), go to step 3.
3. Use update-alternatives to manage multiple versions of gcc and g++ (in case other versions of gcc exist):
sudo update-alternatives --install [WHICH GCC] gcc [WHICH GCC-8] 8
sudo update-alternatives --install [WHICH G++] g++ [WHICH G++-8] 8
sudo update-alternatives --install [WHICH GCC] gcc [WHICH GCC-9] 9
sudo update-alternatives --install [WHICH G++] g++ [WHICH G++-9] 9
# The last number is the priority. See man update-alternatives for details.
Run sudo update-alternatives --config gcc and choose which version to use.
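Filled in with the usual Debian paths (an assumption on my part; confirm with which gcc-8 and friends before running), the registration looks like the following. The loop only echoes the commands as a dry run; drop the echo to actually execute them:

```shell
# Register gcc/g++ 8 and 9 as alternatives; the trailing number is the
# priority, so gcc-9 stays the auto-selected default until you --config.
for v in 8 9; do
  echo sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-$v $v
  echo sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-$v $v
done
```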
Once that is done, run git clean -xfd in the pytorch repo directory, then try rebuilding.
Hopefully, this will help you build PyTorch from source. This problem took me around two days to solve, and I hope you (and my future self) will not repeat the same mistakes. That’s it for now.
If my solution does not work, you might want to take a look at this SO post:
https://stackoverflow.com/questions/6622454/cuda-incompatible-with-my-gcc-version
References:
https://medium.com/repro-repo/build-pytorch-from-source-on-ubuntu-18-04-1c5556ca8fbf
https://github.com/pytorch/pytorch/
https://github.com/pytorch/vision
Edit 1 : Use update-alternatives instead of manually using softlinks