Machine Learning Series: Summary of GPU Development Environment Setup

Background

Environment used in this article: Ubuntu 16.04

GPU: NVIDIA GTX 1080

Part 1: GPU Environment

1.1 GPU Types

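Commercial GPUs on the market today come mainly from NVIDIA ("N cards") and AMD ("A cards"). This article covers setting up an NVIDIA environment.

The graphics cards installed in the machine, and the kernel driver each one uses, can be listed with lspci: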

root@deeplearning:~# lspci -k | grep -A 2 -i "VGA"
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 630 (rev 04)
	DeviceName:  Onboard IGD
	Subsystem: ASUSTeK Computer Inc. Device 872f
--
01:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)
	Subsystem: eVga.com. Corp. Device 5186
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
--
	Subsystem: eVga.com. Corp. Device 5186
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel
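
The same check can be scripted. The following is a minimal sketch, not part of the original setup: it scans lspci text for NVIDIA VGA controllers. The function name find_nvidia_vga and the embedded sample (two lines copied from the output above) are illustrative assumptions; on a real machine the text would come from running lspci.

```python
import re

# Sample lines copied from the lspci output above. On a real machine the
# text would come from running `lspci` instead of this hard-coded string.
SAMPLE_LSPCI = """\
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 630 (rev 04)
01:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)
"""


def find_nvidia_vga(lspci_text):
    """Return (pci_address, description) for each NVIDIA VGA controller line."""
    pattern = re.compile(
        r"^(\S+) VGA compatible controller: (NVIDIA Corporation .+)$"
    )
    found = []
    for line in lspci_text.splitlines():
        match = pattern.match(line.strip())
        if match:
            found.append((match.group(1), match.group(2)))
    return found


print(find_nvidia_vga(SAMPLE_LSPCI))
```

An empty list would mean that no NVIDIA display adapter was detected, in which case the driver steps below do not apply.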

1.2 Installing the Driver

Check which driver versions are available for the graphics card:

root@deeplearning:# ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
modalias : pci:v000010DEd00001B80sv00003842sd00005186bc03sc00i00
vendor   : NVIDIA Corporation
driver   : xserver-xorg-video-nouveau - distro free builtin
driver   : nvidia-410 - third-party free
driver   : nvidia-418 - third-party free
driver   : nvidia-415 - third-party free
driver   : nvidia-384 - third-party non-free
driver   : nvidia-390 - third-party non-free
driver   : nvidia-430 - third-party free recommended

The recommended version here is nvidia-430 (marked "third-party free recommended").
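
Picking the recommended entry can also be automated. This is a small sketch under the assumption that the output keeps the "driver : NAME - ... recommended" layout shown above; the helper name recommended_driver and the shortened sample are illustrative only.

```python
# Shortened sample based on the `ubuntu-drivers devices` output above.
SAMPLE_DRIVERS = """\
vendor   : NVIDIA Corporation
driver   : xserver-xorg-video-nouveau - distro free builtin
driver   : nvidia-410 - third-party free
driver   : nvidia-384 - third-party non-free
driver   : nvidia-430 - third-party free recommended
"""


def recommended_driver(text):
    """Return the package name on the line flagged 'recommended', or None."""
    for line in text.splitlines():
        key, _, value = line.partition(":")
        fields = value.split()
        # Only "driver" lines whose last word is "recommended" qualify.
        if key.strip() == "driver" and fields[-1:] == ["recommended"]:
            return fields[0]
    return None


print(recommended_driver(SAMPLE_DRIVERS))
```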

Install it with the following commands:

sudo apt-get update

# Stop the graphical session
sudo service lightdm stop
# Install the driver
sudo apt-get install nvidia-430
# Start the graphical session again
sudo service lightdm start
# Reboot the OS
sudo reboot

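The method above fails fairly often, so downloading the driver installer and installing it by hand is the more reliable route. The installer is available from the official NVIDIA site: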

https://www.nvidia.cn/geforce/drivers/

After searching for your card on that page, choose the latest installer, for example NVIDIA-Linux-x86_64-510.54.run (as of 2022-03-22).

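# Run as root
# Remove leftovers of any old driver first (quoted so the shell does not expand the pattern)
$ apt-get remove --purge "nvidia*"
# Install build dependencies
$ apt-get update
$ apt-get install dkms build-essential linux-headers-generic
# Run the installer
$ chmod u+x NVIDIA-Linux-x86_64-510.54.run
$ ./NVIDIA-Linux-x86_64-510.54.run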

Reboot the OS once the installation has finished.

Then run the following command and confirm the driver version:

root@deeplearning:# nvidia-smi
Sun Mar 20 19:17:13 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54       Driver Version: 510.54       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   31C    P0    34W / 180W |      0MiB /  8192MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
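
To confirm the versions from a script, the banner line can be parsed. This is a minimal sketch that assumes the banner keeps the "Driver Version: X   CUDA Version: Y" layout shown above; parse_banner is a hypothetical helper name. Note that the CUDA version reported by nvidia-smi is the highest CUDA version the driver supports, not the version of an installed toolkit.

```python
import re

# Banner line copied from the nvidia-smi output above.
BANNER = "| NVIDIA-SMI 510.54       Driver Version: 510.54       CUDA Version: 11.6     |"


def parse_banner(line):
    """Return (driver_version, cuda_version) as strings, or None if not found."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", line)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", line)
    if not (driver and cuda):
        return None
    return driver.group(1), cuda.group(1)


print(parse_banner(BANNER))
```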

1.3 Installing CUDA

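CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model that the GPU vendor NVIDIA released in 2007. CUDA runs only on machines with an NVIDIA GPU, and not every NVIDIA card supports it; at present the GeForce, ION, Quadro, and Tesla series do. Which CUDA versions are supported depends on the capabilities of the specific card.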

First check the GPU model of the current system:

root@deeplearning:~# lspci | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GP104 High Definition Audio Controller (rev a1)

Download the installer from the official site (https://developer.nvidia.com/cuda-toolkit):

Select the options that match your system; the page then shows the commands to run:

wget https://developer.download.nvidia.com/compute/cuda/11.6.1/local_installers/cuda_11.6.1_510.47.03_linux.run
sudo sh cuda_11.6.1_510.47.03_linux.run

Uninstall any older version before installing. The uninstall script is located at:

/usr/local/cuda-9.0/bin/uninstall_cuda_9.0.pl

Verify the installation:

root@deeplearning:/usr/local/cuda/extras/demo_suite# ./deviceQuery
# The output should end with: Result = PASS

1.4 Installing cuDNN

Official site: https://developer.nvidia.com/cudnn

Download the package, then install it as follows:

Navigate to your <cudnnpath> directory containing the cuDNN tar file.
Unzip the cuDNN package.
$ tar -xvf cudnn-linux-x86_64-8.3.3.40_cuda11.5-archive.tar.xz
Copy the following files into the CUDA toolkit directory.

$ sudo cp cudnn-*-archive/include/cudnn*.h /usr/local/cuda/include 
$ sudo cp -P cudnn-*-archive/lib/libcudnn* /usr/local/cuda/lib64 
$ sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
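
One way to confirm that the headers were copied correctly is to read the version macros back. The sketch below assumes that the copied cudnn_version.h defines CUDNN_MAJOR, CUDNN_MINOR, and CUDNN_PATCHLEVEL, as cuDNN 8 releases do. The sample text is a hand-written excerpt for illustration, not the real file, and cudnn_version is a hypothetical helper name.

```python
import re

# Hand-written excerpt for illustration; the real macros live in
# /usr/local/cuda/include/cudnn_version.h after the copy step above.
SAMPLE_HEADER = """\
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 3
#define CUDNN_PATCHLEVEL 3
"""


def cudnn_version(header_text):
    """Return 'major.minor.patch' read from the macros, or None if any is missing."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(rf"^#define\s+{name}\s+(\d+)", header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(match.group(1))
    return ".".join(parts)


print(cudnn_version(SAMPLE_HEADER))
```

On the target machine, the function would be called on the contents of the installed header file instead of SAMPLE_HEADER.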

Part 2: TensorFlow-GPU Environment Setup

Install directly with the following command:

pip install tensorflow-gpu

Test:

import tensorflow as tf
tf.config.list_physical_devices('GPU')
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
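
For use in a setup script, the same check can be wrapped so that a missing TensorFlow installation is reported instead of raising. This is only a sketch; the function name tensorflow_gpu_names is an assumption, and the only TensorFlow call it makes is the tf.config.list_physical_devices call shown above.

```python
def tensorflow_gpu_names():
    """Return the names of GPUs visible to TensorFlow, or None if it is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


names = tensorflow_gpu_names()
if names is None:
    print("TensorFlow is not installed in this environment")
elif names:
    print("GPUs visible to TensorFlow:", names)
else:
    print("TensorFlow is installed but sees no GPU")
```

An empty list usually points at a driver, CUDA, or cuDNN version that does not match what the installed TensorFlow build expects.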

Part 3: PyTorch-GPU Environment Setup

The mapping between PyTorch versions and cudatoolkit versions is listed here:

https://pytorch.org/get-started/previous-versions/

Install with the following command:

pip3 install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
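
The +cu113 suffix names the CUDA version the wheel was built against (11.3). As a rule of thumb, that version should not exceed the CUDA version the driver reports in nvidia-smi (11.6 in this article). The sketch below encodes that rule; the helper names are hypothetical, and it assumes that the last digit of the tag is the minor version, which holds for tags such as cu102 and cu113.

```python
import re


def wheel_cuda_version(requirement):
    """Extract (major, minor) from a local version tag such as '+cu113'."""
    match = re.search(r"\+cu(\d+)(\d)$", requirement)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def driver_supports(requirement, driver_cuda):
    """True if the wheel's CUDA version is not newer than the driver's.

    driver_cuda is the 'CUDA Version' string printed by nvidia-smi.
    Returns None when the requirement carries no +cuXYZ tag.
    """
    wheel = wheel_cuda_version(requirement)
    if wheel is None:
        return None
    major, minor = (int(part) for part in driver_cuda.split(".")[:2])
    return wheel <= (major, minor)


print(driver_supports("torch==1.11.0+cu113", "11.6"))
```

This is a simplified comparison. The version table linked above remains the authoritative source for which combinations are supported.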

Test:

import torch
print(torch.cuda.is_available())
# True
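
A slightly more informative check reports which device and CUDA build PyTorch is using. This is a sketch only: torch_gpu_summary is a hypothetical helper, and it falls back to a message when PyTorch is missing or CUDA is unavailable.

```python
def torch_gpu_summary():
    """Describe the first CUDA device PyTorch can use, or explain why none is usable."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return "PyTorch is installed but CUDA is not available"
    # torch.version.cuda is the CUDA version the installed wheel was built against.
    name = torch.cuda.get_device_name(0)
    return f"{name} (PyTorch {torch.__version__}, CUDA {torch.version.cuda})"


print(torch_gpu_summary())
```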
