Windows 11 WSL2 Deep Learning Environment Setup

Posted on 2023-08-02 | Categories: os, windows

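This post describes how to set up a deep learning environment on Windows 11 using WSL2. First install the NVIDIA graphics driver to get GPU support; the official GeForce Experience tool is the recommended way to install or update it. Then install WSL2, and install CUDA and cuDNN inside it. As a test, a small and a large neural network are each run on the CPU and on the GPU, and the run times are printed. The results show that GPU computation can significantly speed up execution.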

CUDA on WSL User Guide

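1. Install the NVIDIA driver for GPU support

1.1 Method 1 (recommended)

The best way to install or update an NVIDIA graphics driver is the official GeForce Experience tool. Each driver update may change the highest CUDA version the driver supports.

Download GeForce Experience from https://www.nvidia.cn/geforce/geforce-experience/, install it, then open the program and install or update the GeForce Game Ready driver.

1.2 Method 2

Get the driver that matches your graphics card from https://www.nvidia.com/Download/index.aspx?lang=en-us, then install the NVIDIA GeForce Game Ready or NVIDIA RTX Quadro Windows 11 display driver on the system.

Note: this is the only driver you need to install. Do not install any Linux display driver inside WSL.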

2. Install WSL2

Launch your preferred terminal (Windows Terminal, Command Prompt, or PowerShell) and install WSL:

wsl.exe --install

Make sure you have the latest WSL kernel:

wsl.exe --update

Check the WSL kernel version:

PS Microsoft.PowerShell.Core\FileSystem::\\wsl$\Ubuntu\root> wsl cat /proc/version
Linux version 5.10.102.1-microsoft-standard-WSL2 (oe-user@oe-host) (x86_64-msft-linux-gcc (GCC) 9.3.0, GNU ld (GNU Binutils) 2.34.0.20200220) #1 SMP Wed Mar 2 00:30:59 UTC 2022

Note: WSL 2 Support Constraints

Ensure you are on the latest WSL Kernel or at least 4.19.121+. We recommend 5.10.16.3 or later for better performance and functional fixes.
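The constraint above can be checked programmatically. The following is a minimal sketch, assuming the kernel string has the same format as the /proc/version output shown earlier; the names `parse_kernel_version` and `kernel_status` are illustrative and not part of any WSL tooling.

```python
import re

MINIMUM = (4, 19, 121)        # lowest kernel version named in the note above
RECOMMENDED = (5, 10, 16, 3)  # version recommended for performance and fixes


def parse_kernel_version(proc_version: str) -> tuple:
    """Extract the numeric kernel version from a /proc/version string."""
    match = re.search(r"Linux version (\d+(?:\.\d+)*)", proc_version)
    if match is None:
        raise ValueError("no kernel version found")
    return tuple(int(part) for part in match.group(1).split("."))


def kernel_status(version: tuple) -> str:
    """Classify a kernel version against the WSL 2 support constraints."""
    if version < MINIMUM:
        return "unsupported"
    if version < RECOMMENDED:
        return "supported"
    return "recommended"


sample = "Linux version 5.10.102.1-microsoft-standard-WSL2 (oe-user@oe-host)"
print(kernel_status(parse_kernel_version(sample)))
```

For the kernel shown in this post (5.10.102.1) this prints `recommended`. Versions are compared as integer tuples so that 5.10.102 sorts after 5.10.16, which a plain string comparison would get wrong.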

3. Install CUDA

3.1 Check the CUDA version supported by the NVIDIA driver

C:\Windows>nvidia-smi
Wed Jul 27 22:57:07 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 516.59       Driver Version: 516.59       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ... WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   42C    P0    11W /  N/A |      0MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

C:\Windows>nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3050 Laptop GPU (UUID: GPU-f55a7fb2-0dc9-1d4f-dca9-8f7ea311ec67)

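GPU model: NVIDIA GeForce RTX 3050 Laptop GPU
Current driver version: 516.59
CUDA version: 11.7

This means that driver version 516.59 on my laptop supports CUDA up to 11.7 (CUDA 11.7 and earlier versions are compatible). To use a newer CUDA version, look up the latest driver available for the NVIDIA GeForce RTX 3050 Laptop GPU and upgrade the driver.

For the detailed mapping between driver versions and CUDA versions, see the NVIDIA documentation.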
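The same reading of the nvidia-smi header can be automated. This is a small sketch, assuming the header line keeps the "Driver Version: ... CUDA Version: ..." layout shown above; `parse_smi_header` and `toolkit_supported` are hypothetical helper names, and the check only compares major.minor numbers.

```python
import re


def parse_smi_header(line: str) -> tuple:
    """Return (driver_version, max_cuda_version) from the nvidia-smi header line."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", line)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", line)
    if driver is None or cuda is None:
        raise ValueError("not an nvidia-smi header line")
    return driver.group(1), cuda.group(1)


def as_tuple(version: str) -> tuple:
    """Turn '11.7' into (11, 7) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def toolkit_supported(toolkit: str, max_cuda: str) -> bool:
    """True when the toolkit's major.minor does not exceed the driver's maximum."""
    return as_tuple(toolkit)[:2] <= as_tuple(max_cuda)[:2]


header = "| NVIDIA-SMI 516.59       Driver Version: 516.59       CUDA Version: 11.7     |"
driver_version, max_cuda = parse_smi_header(header)
print(driver_version, max_cuda, toolkit_supported("11.2.2", max_cuda))
```

With the header from this post, the toolkit 11.2.2 used in the following sections is reported as supported.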

3.2 Download CUDA

Get a suitable CUDA Toolkit version from https://developer.nvidia.com/cuda-toolkit-archive. This post uses version 11.2.2 as the example.

Open the CUDA Toolkit 11.2.2 download page
Select the Linux operating system
Select the x86_64 architecture
Select the WSL-Ubuntu distribution
Select the runfile (local) installer type

The page then shows the installation instructions:

$ wget https://developer.download.nvidia.com/compute/cuda/11.2.2/local_installers/cuda_11.2.2_460.32.03_linux.run
$ sudo sh cuda_11.2.2_460.32.03_linux.run
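The runfile name itself carries version information. The sketch below splits it into its parts; reading the second field as the Linux driver version bundled with the installer is my interpretation of NVIDIA's naming rather than something stated in this post, and `parse_runfile_url` is a hypothetical helper.

```python
import re
from urllib.parse import urlparse


def parse_runfile_url(url: str) -> dict:
    """Split a CUDA runfile URL into file name, toolkit version and bundled driver."""
    name = urlparse(url).path.rsplit("/", 1)[-1]
    # Assumed pattern: cuda_<toolkit>_<driver>_linux.run
    match = re.fullmatch(r"cuda_(\d+\.\d+\.\d+)_(\d+(?:\.\d+)+)_linux\.run", name)
    if match is None:
        raise ValueError(f"unexpected runfile name: {name}")
    return {"file": name, "toolkit": match.group(1), "bundled_driver": match.group(2)}


url = ("https://developer.download.nvidia.com/compute/cuda/11.2.2/"
       "local_installers/cuda_11.2.2_460.32.03_linux.run")
print(parse_runfile_url(url))
```

Under WSL2 only the toolkit part matters, since the driver comes from Windows.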

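3.3 Install CUDA in WSL2

Following the instructions from the previous step, download the installer:

wget -c https://developer.download.nvidia.com/compute/cuda/11.2.2/local_installers/cuda_11.2.2_460.32.03_linux.run

Run the installer (as root, or with sudo):

sh cuda_11.2.2_460.32.03_linux.run

Type accept and press Enter to agree to the end user license agreement, then move the cursor to Install and install all listed components. When installation finishes, the following summary is printed: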

===========
= Summary =
===========

Driver:   Not Selected
Toolkit:  Installed in /usr/local/cuda-11.2/
Samples:  Installed in /root/, but missing recommended libraries

Please make sure that
 -   PATH includes /usr/local/cuda-11.2/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-11.2/lib64, or, add /usr/local/cuda-11.2/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-11.2/bin
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 460.00 is required for CUDA 11.2 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
    sudo <CudaInstaller>.run --silent --driver

Logfile is /var/log/cuda-installer.log

The warning "WARNING: Incomplete installation! This installation did not install the CUDA Driver." in the summary is expected, because the NVIDIA driver only needs to be installed on Windows, not inside WSL.
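One way to convince yourself the warning is harmless is to compare the minimum driver version it names with the driver version that nvidia-smi reported on Windows. This is a rough sketch, assuming Windows and Linux driver version numbers can be compared directly (a simplification); `required_driver` and `driver_satisfies` are hypothetical helper names.

```python
import re


def required_driver(installer_summary: str) -> str:
    """Pull the minimum driver version out of the installer's warning text."""
    match = re.search(r"driver of version at least ([\d.]+) is required", installer_summary)
    if match is None:
        raise ValueError("no driver requirement found")
    return match.group(1)


def version_tuple(version: str) -> tuple:
    """Convert '516.59' to (516, 59) for numeric comparison."""
    return tuple(int(part) for part in version.split("."))


def driver_satisfies(installed: str, required: str) -> bool:
    # Assumption: Windows and Linux driver numbers are directly comparable.
    return version_tuple(installed) >= version_tuple(required)


warning = ("***WARNING: Incomplete installation! This installation did not install "
           "the CUDA Driver. A driver of version at least 460.00 is required for "
           "CUDA 11.2 functionality to work.")
minimum = required_driver(warning)
print(minimum, driver_satisfies("516.59", minimum))
```

The deviceQuery test later in this post is the more reliable check, since it exercises the driver and the runtime together.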

3.4 Set environment variables

Edit /etc/profile or ~/.bashrc. Using ~/.bashrc as the example:

vim ~/.bashrc

Append the following configuration to the end of the file:

# Cuda Environment
export CUDA_HOME=/usr/local/cuda-11.2
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

export PATH=$CUDA_HOME/bin:$PATH

Apply the configuration:

source ~/.bashrc
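To confirm that the variables took effect, the two search paths can be inspected. A minimal sketch, assuming the install prefix /usr/local/cuda-11.2 used in this post; `check_cuda_env` is a hypothetical helper that works on any mapping of environment variables.

```python
def check_cuda_env(env: dict, cuda_home: str = "/usr/local/cuda-11.2") -> dict:
    """Report whether PATH and LD_LIBRARY_PATH contain the CUDA directories."""
    # Linux search paths are colon-separated lists of directories.
    path_entries = env.get("PATH", "").split(":")
    lib_entries = env.get("LD_LIBRARY_PATH", "").split(":")
    return {
        "PATH": f"{cuda_home}/bin" in path_entries,
        "LD_LIBRARY_PATH": f"{cuda_home}/lib64" in lib_entries,
    }


# Example values as they would look after sourcing the ~/.bashrc lines above
# in a shell where LD_LIBRARY_PATH was previously empty.
example_env = {
    "PATH": "/usr/local/cuda-11.2/bin:/usr/bin:/bin",
    "LD_LIBRARY_PATH": "/usr/local/cuda-11.2/lib64:",
}
print(check_cuda_env(example_env))
```

To check the current shell instead of the example values, pass `dict(os.environ)` after `import os`.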

3.5 Test CUDA

Go to the directory containing the CUDA samples. By default they are in the home directory, under "NVIDIA_CUDA-11.2_Samples". Navigate to "1_Utilities" -> "deviceQuery", then open a new terminal and run:

root@ubuntu:~# cd ~/NVIDIA_CUDA-11.2_Samples/1_Utilities/deviceQuery
root@ubuntu:~/NVIDIA_CUDA-11.2_Samples/1_Utilities/deviceQuery# make
/usr/local/cuda/bin/nvcc -ccbin g++ -I../../common/inc  -m64    --threads 0 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o deviceQuery.o -c deviceQuery.cpp
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
/usr/local/cuda/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o deviceQuery deviceQuery.o
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
mkdir -p ../../bin/x86_64/linux/release
cp deviceQuery ../../bin/x86_64/linux/release
root@ubuntu:~/NVIDIA_CUDA-11.2_Samples/1_Utilities/deviceQuery# ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA GeForce RTX 3050 Laptop GPU"
  CUDA Driver Version / Runtime Version          11.7 / 11.2
  CUDA Capability Major/Minor version number:    8.6
  Total amount of global memory:                 4096 MBytes (4294508544 bytes)
  (16) Multiprocessors, (128) CUDA Cores/MP:     2048 CUDA Cores
  GPU Max Clock rate:                            1057 MHz (1.06 GHz)
  Memory Clock rate:                             5501 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.7, CUDA Runtime Version = 11.2, NumDevs = 1
Result = PASS

If "Result = PASS" appears, the installation succeeded.
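When the check is part of a script, the last two lines of the deviceQuery output are enough to decide pass or fail. The sketch below parses them, assuming the output format shown above; `parse_device_query` is a hypothetical helper.

```python
import re


def parse_device_query(output: str) -> dict:
    """Extract driver version, runtime version and result from deviceQuery output."""
    driver = re.search(r"CUDA Driver Version = ([\d.]+)", output)
    runtime = re.search(r"CUDA Runtime Version = ([\d.]+)", output)
    result = re.search(r"Result = (\w+)", output)
    if not (driver and runtime and result):
        raise ValueError("output does not look like a deviceQuery summary")
    return {
        "driver": driver.group(1),
        "runtime": runtime.group(1),
        "passed": result.group(1) == "PASS",
    }


tail = ("deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.7, "
        "CUDA Runtime Version = 11.2, NumDevs = 1\n"
        "Result = PASS\n")
print(parse_device_query(tail))
```

Here the driver version (11.7) comes from the Windows driver and the runtime version (11.2) from the toolkit installed in WSL, which matches the versions chosen earlier.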

4. Install cuDNN

4.1 Download cuDNN

Get a cuDNN release that matches your CUDA version from https://developer.nvidia.com/rdp/cudnn-archive. The example here uses CUDA 11.2, so choose "Download cuDNN v8.1.1 (February 26th, 2021), for CUDA 11.0, 11.1 and 11.2", then download "cuDNN Library for Linux (x86_64)".

wget -c https://developer.nvidia.com/compute/machine-learning/cudnn/secure/8.1.1.33/11.2_20210301/cudnn-11.2-linux-x64-v8.1.1.33.tgz

Note: downloading requires a registered NVIDIA account.
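Before installing, it can help to check that the downloaded archive targets the CUDA version you installed. A small sketch, assuming the archive follows the naming of the one file used in this post (other releases may be named differently); `parse_cudnn_archive` and `matches_toolkit` are hypothetical helpers.

```python
import re


def parse_cudnn_archive(name: str) -> dict:
    """Read the CUDA and cuDNN versions from a cuDNN archive file name."""
    # Assumed pattern: cudnn-<cuda>-linux-x64-v<cudnn>.tgz
    match = re.fullmatch(r"cudnn-(\d+\.\d+)-linux-x64-v(\d+(?:\.\d+)+)\.tgz", name)
    if match is None:
        raise ValueError(f"unexpected archive name: {name}")
    return {"cuda": match.group(1), "cudnn": match.group(2)}


def matches_toolkit(archive_name: str, cuda_home: str) -> bool:
    """Compare the archive's CUDA version with a versioned CUDA install directory."""
    installed = cuda_home.rstrip("/").rsplit("-", 1)[-1]  # e.g. "11.2"
    return parse_cudnn_archive(archive_name)["cuda"] == installed


archive = "cudnn-11.2-linux-x64-v8.1.1.33.tgz"
print(parse_cudnn_archive(archive), matches_toolkit(archive, "/usr/local/cuda-11.2"))
```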

4.2 Install cuDNN in WSL2

Installation simply copies the cuDNN header files into the CUDA include directory and the cuDNN libraries into the CUDA library directory.

Extract the archive:

root@ubuntu:~/Downloads# tar -zxvf cudnn-11.2-linux-x64-v8.1.1.33.tgz
cuda/include/cudnn.h
cuda/include/cudnn_adv_infer.h
cuda/include/cudnn_adv_infer_v8.h
cuda/include/cudnn_adv_train.h
cuda/include/cudnn_adv_train_v8.h
cuda/include/cudnn_backend.h
cuda/include/cudnn_backend_v8.h
cuda/include/cudnn_cnn_infer.h
cuda/include/cudnn_cnn_infer_v8.h
cuda/include/cudnn_cnn_train.h
cuda/include/cudnn_cnn_train_v8.h
cuda/include/cudnn_ops_infer.h
cuda/include/cudnn_ops_infer_v8.h
cuda/include/cudnn_ops_train.h
cuda/include/cudnn_ops_train_v8.h
cuda/include/cudnn_v8.h
cuda/include/cudnn_version.h
cuda/include/cudnn_version_v8.h
cuda/NVIDIA_SLA_cuDNN_Support.txt
cuda/lib64/libcudnn.so
cuda/lib64/libcudnn.so.8
cuda/lib64/libcudnn.so.8.1.1
cuda/lib64/libcudnn_adv_infer.so
cuda/lib64/libcudnn_adv_infer.so.8
cuda/lib64/libcudnn_adv_infer.so.8.1.1
cuda/lib64/libcudnn_adv_train.so
cuda/lib64/libcudnn_adv_train.so.8
cuda/lib64/libcudnn_adv_train.so.8.1.1
cuda/lib64/libcudnn_cnn_infer.so
cuda/lib64/libcudnn_cnn_infer.so.8
cuda/lib64/libcudnn_cnn_infer.so.8.1.1
cuda/lib64/libcudnn_cnn_train.so
cuda/lib64/libcudnn_cnn_train.so.8
cuda/lib64/libcudnn_cnn_train.so.8.1.1
cuda/lib64/libcudnn_ops_infer.so
cuda/lib64/libcudnn_ops_infer.so.8
cuda/lib64/libcudnn_ops_infer.so.8.1.1
cuda/lib64/libcudnn_ops_train.so
cuda/lib64/libcudnn_ops_train.so.8
cuda/lib64/libcudnn_ops_train.so.8.1.1
cuda/lib64/libcudnn_static.a
cuda/lib64/libcudnn_static_v8.a

Copy the cuDNN header files:

cp cuda/include/* /usr/local/cuda-11.2/include/

Copy the cuDNN library files:

cp cuda/lib64/* /usr/local/cuda-11.2/lib64/

Add execute permission:

chmod +x /usr/local/cuda-11.2/include/cudnn.h
chmod +x /usr/local/cuda-11.2/lib64/libcudnn*
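A quick way to confirm that the copy worked is to read the version macros from the installed header. The sketch below assumes cudnn_version.h defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL, as cuDNN 8.x headers do; the function names are hypothetical.

```python
import re
from pathlib import Path


def cudnn_version_from_header(text: str) -> str:
    """Build 'major.minor.patch' from the #define lines of cudnn_version.h."""
    fields = {}
    for key in ("MAJOR", "MINOR", "PATCHLEVEL"):
        match = re.search(rf"^#define\s+CUDNN_{key}\s+(\d+)", text, flags=re.MULTILINE)
        if match is None:
            raise ValueError(f"CUDNN_{key} not found")
        fields[key] = match.group(1)
    return "{MAJOR}.{MINOR}.{PATCHLEVEL}".format(**fields)


def installed_cudnn_version(cuda_home: str = "/usr/local/cuda-11.2") -> str:
    """Read the version from the header copied into the CUDA include directory."""
    header = Path(cuda_home) / "include" / "cudnn_version.h"
    return cudnn_version_from_header(header.read_text())


# Excerpt in the style of a cuDNN 8.1.1 header, used here instead of a real file.
sample_header = (
    "#define CUDNN_MAJOR 8\n"
    "#define CUDNN_MINOR 1\n"
    "#define CUDNN_PATCHLEVEL 1\n"
    "#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)\n"
)
print(cudnn_version_from_header(sample_header))
```

On a machine set up as described here, `installed_cudnn_version()` should report the version of the archive that was copied.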

4.3 Test cuDNN

Open a new terminal:

root@ubuntu:~# conda activate ailab
(ailab) root@ubuntu:~# ipython
Python 3.7.13 (default, Mar 29 2022, 02:18:16)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.34.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import tensorflow as tf

In [2]: tf.test.is_gpu_available()
WARNING:tensorflow:From <ipython-input-2-17bb7203622b>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2022-07-27 23:46:19.645726: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-27 23:46:21.108297: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:961] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2022-07-27 23:46:21.136684: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:961] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2022-07-27 23:46:21.137199: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:961] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2022-07-27 23:46:22.147971: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:961] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2022-07-27 23:46:22.148598: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:961] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2022-07-27 23:46:22.148635: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2022-07-27 23:46:22.149146: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:961] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2022-07-27 23:46:22.149214: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /device:GPU:0 with 1626 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3050 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6
Out[2]: True

In [3]:

The output contains the message "Your kernel may not have been built with NUMA support". This warning is harmless and can be ignored. For background, see:

https://stackoverflow.com/questions/51733128/your-kernel-may-have-been-built-without-numa-support

https://forums.developer.nvidia.com/t/numa-error-running-tensorflow-on-jetson-tx2/56119

https://stackoverflow.com/questions/40426502/is-there-a-way-to-suppress-the-messages-tensorflow-prints
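The session above also prints a deprecation notice that points to `tf.config.list_physical_devices('GPU')`. The sketch below uses that call and lowers the log verbosity first; setting `TF_CPP_MIN_LOG_LEVEL` to "1" to hide the INFO-level NUMA lines is my suggestion rather than something this post tested, and `list_gpus` is a hypothetical helper.

```python
import os

# Must be set before TensorFlow is imported. "1" filters INFO messages from the
# C++ backend, which is the level of the NUMA lines (they start with "I").
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "1"


def list_gpus():
    """Return the names of visible GPUs, or None if TensorFlow is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # Non-deprecated replacement for tf.test.is_gpu_available()
    return [device.name for device in tf.config.list_physical_devices("GPU")]


print(list_gpus())
```

An empty list means TensorFlow is installed but cannot see a GPU, which usually points back to the CUDA or cuDNN steps.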

5. CPU vs GPU

Hardware	Model
CPU	AMD Ryzen 7 5800H with Radeon Graphics
GPU	NVIDIA GeForce RTX 3050 Laptop GPU

5.1 Small neural network

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# TensorFlow and tf.keras
import tensorflow as tf
# Helper libraries
from time import time

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Run on the CPU
startTime1 = time()

with tf.device('/cpu:0'):
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'],
    )
    model.fit(x_train, y_train, epochs=10)
    model.evaluate(x_test, y_test)
t1 = time() - startTime1

# Run on the GPU
startTime2 = time()

with tf.device('/gpu:0'):
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'],
    )
    model.fit(x_train, y_train, epochs=10)
    model.evaluate(x_test, y_test)

t2 = time() - startTime2

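# Print the run times
print('Time on CPU:', t1)
print('Time on GPU:', t2)

Results

Time on CPU: 26.906126737594604
Time on GPU: 54.50972127914429

For this small network the GPU run took roughly twice as long as the CPU run, probably because the per-batch computation is too small to offset GPU overhead such as kernel launches and host-to-device data transfers.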
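The two timings are easier to compare as a ratio. A minimal sketch using the numbers printed above; `speedup` is a hypothetical helper, and a value below 1 means the GPU run was slower.

```python
def speedup(cpu_seconds: float, gpu_seconds: float) -> float:
    """Return how many times faster the GPU run was than the CPU run."""
    if gpu_seconds <= 0:
        raise ValueError("gpu_seconds must be positive")
    return cpu_seconds / gpu_seconds


ratio = speedup(26.906126737594604, 54.50972127914429)
print(f"GPU speedup over CPU: {ratio:.2f}x")
```

For the small network this prints `GPU speedup over CPU: 0.49x`.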

5.2 Large neural network

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# TensorFlow and tf.keras
import tensorflow as tf
# Helper libraries
from time import time

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Run on the CPU
startTime1 = time()

with tf.device('/cpu:0'):
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(1000, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(1000, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'],
    )
    model.fit(x_train, y_train, epochs=10)
    model.evaluate(x_test, y_test)

t1 = time() - startTime1

# Run on the GPU
startTime2 = time()

with tf.device('/gpu:0'):
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(1000, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(1000, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])

    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'],
    )
    model.fit(x_train, y_train, epochs=10)
    model.evaluate(x_test, y_test)

t2 = time() - startTime2

# Print the run times
print('Time on CPU:', t1)
print('Time on GPU:', t2)
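Both scripts repeat the same start and stop bookkeeping around each training block. One possible way to factor that out is a small timing context manager, sketched here with the standard library only; the `timed` helper is hypothetical, and the summation loops are placeholders standing in for the `with tf.device(...)` blocks.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, results: dict):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("cpu", timings):
    sum(i * i for i in range(100_000))  # placeholder for the CPU training block
with timed("gpu", timings):
    sum(i * i for i in range(100_000))  # placeholder for the GPU training block

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```

`time.perf_counter()` is used instead of `time.time()` because it is monotonic, so the measured interval is not affected by system clock adjustments.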