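0. Preface
I needed to train a DDQN (double DQN) agent for a task, and to speed up training I ran it on a server with a GPU. This post records the whole process and the pitfalls I hit along the way.
1. Choosing the template code
The reference code comes from GitHub; it was last updated on Mar 24, 2020.
Environment: python3.8. Run the following setup commands: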
apt-get update
apt-get install xvfb
apt-get install python-opengl
apt-get install python3-pip
python -m pip install --upgrade pip -i https://pypi.tuna.tsinghua.edu.cn/simple/
Next, install the Python requirements. The requirements file contains:
tensorflow
tensorlayer
opencv-python-headless
matplotlib
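pyglet==1.5.27
gym==0.20.0
Install them in one step:
python -m pip install -r requirements -i https://pypi.tuna.tsinghua.edu.cn/simple/
Run the template code:
xvfb-run -s "-screen 0 1400x900x24" python double_DQN\ \&\ dueling_DQN.py
2. Preparing the Ubuntu environment
(This section is a record of pitfalls; you do not need to follow it.)
Someone else had used the server before (or the system ships with Python), so a Python environment already existed. First check the Python version: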
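python --version
This shows that the python command maps to version 2.7.17. Next, find where the symlink for that command lives:
which python
This prints the symlink used by the python command:
/usr/bin/python
Check whether that directory contains other Python versions:
ls -al /usr/bin | grep python
Output: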
lrwxrwxrwx 1 root root 9 Apr 16 2018 python -> python2.7
lrwxrwxrwx 1 root root 9 Apr 16 2018 python2 -> python2.7
-rwxr-xr-x 1 root root 3628904 Nov 29 02:51 python2.7
lrwxrwxrwx 1 root root 9 Jun 22 2018 python3 -> python3.6
-rwxr-xr-x 2 root root 4526456 Nov 25 22:10 python3.6
-rwxr-xr-x 2 root root 4526456 Nov 25 22:10 python3.6m
-rwxr-xr-x 1 root root 1018 Oct 29 2017 python3-jsondiff
-rwxr-xr-x 1 root root 3661 Oct 29 2017 python3-jsonpatch
-rwxr-xr-x 1 root root 1342 May 2 2016 python3-jsonpointer
-rwxr-xr-x 1 root root 398 Nov 16 2017 python3-jsonschema
lrwxrwxrwx 1 root root 10 Jun 22 2018 python3m -> python3.6m
So the python command points to 2.7, while python3 gives 3.6. Since current TensorFlow releases no longer support 2.7, I started with 3.6.
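The same check can be scripted. The sketch below is only an illustration, not part of the original workflow: the helper name find_interpreters is made up here, and it simply asks which file each command name resolves to after following symlinks.

```python
import os
import shutil


def find_interpreters(names=("python", "python2", "python3")):
    """Map each command name to the real file it resolves to, or None."""
    found = {}
    for name in names:
        # shutil.which searches PATH, much like the shell's `which`
        path = shutil.which(name)
        # realpath follows symlinks such as /usr/bin/python -> python2.7
        found[name] = os.path.realpath(path) if path else None
    return found


if __name__ == "__main__":
    for name, target in find_interpreters().items():
        print(f"{name}: {target or 'not on PATH'}")
```

Run it with whichever interpreter is available; a command that is not on PATH is reported as such instead of raising an error.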
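Next, check the requirements of each tensorflow-gpu version on the official site. Based on the compatibility table there, I chose 2.0.0: it was released on September 30, 2019, close to when the template code was last updated, and it supports Python 3.6.
I planned to install TensorFlow with pip, but pip was not installed, so install it first:
apt install python3-pip
Then run the install command:
python -m pip install tensorflow-gpu==2.0.0
Annoyingly, the pip index has no 2.0.0, and switching to the Tsinghua mirror did not help either. The output was:
Could not find a version that satisfies the requirement tensorflow-gpu==2.0.0 (from versions: 1.13.1, 1.13.2, 1.14.0, 2.12.0)
No matching distribution found for tensorflow-gpu==2.0.0
The only recent version on offer is 2.12.0, so the latest version it has to be. That also means choosing a newer Python, since at least 3.7 is required. For convenience later on, I pointed the python symlink directly at a newly installed python3.7: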
apt install python3.7
rm -f /usr/bin/python
ln -s /usr/bin/python3.7 /usr/bin/python
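python --version
The version now reads 3.7.5, so the switch worked.
pip needs upgrading at this point. Installing TensorFlow pulls in many dependencies, and without a newer pip a lot of them fail to install; one such error is "Failed building wheel for grpcio".
python -m pip install --upgrade pip
Continue installing tensorflow-gpu:
python -m pip install tensorflow-gpu
Output:
The tensorflow-gpu package has been removed!
Please install tensorflow instead.
Other than the name, the two packages have been identical
since TensorFlow 2.1, or roughly since Sep 2019. For more
information, see: pypi.org/project/tensorflow-gpu
In other words, since TensorFlow 2.1 there is no separate GPU package; a plain pip install tensorflow is enough. The download was painfully slow, so I switched to the Tsinghua mirror:
python -m pip install tensorflow -i https://pypi.tuna.tsinghua.edu.cn/simple/
Installation succeeded.
3. Running the template code
(This section is a record of pitfalls; you do not need to follow it.)
After downloading the code into a folder, the DDQN template can be run with:
python double_DQN\ \&\ dueling_DQN.py
Naturally this raises a string of "no module" errors. I installed the missing packages with pip one by one; the resulting requirements are: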
tensorlayer
opencv-python
opencv-python-headless
matplotlib
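pyglet
Write the list above to a file and install everything in one step:
python -m pip install -r requirements -i https://pypi.tuna.tsinghua.edu.cn/simple/
gym also needs to be installed; it is a collection of environments commonly used to test reinforcement learning. In recent versions step returns more values than before, so the statement below fails. Besides the extra return values, the value assigned to s_ also looked wrong. I therefore switched to an earlier version.
s_, r, done, _ = self.env.step(a)
Going by the date of the template code, I looked through gym's tags for one from about the same time, then opened gym's core file and checked the step function: it does return the expected number of values. Install it:
python -m pip install gym==0.20.0 -i https://pypi.tuna.tsinghua.edu.cn/simple/
This fails with: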
Collecting gym==0.20.0
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/f1/16/a421155206e7dc41b3a79d4e9311287b88c20140d567182839775088e9ad/gym-0.20.0.tar.gz (1.6 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [1 lines of output]
      error in gym setup command: 'extras_require' must be a dictionary whose values are strings or lists of strings containing valid project/version requirement specifiers.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
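It took a long time to track down the cause. The fix, which I found in an online reference, is to pin setuptools to a specific version:
python -m pip install --upgrade pip setuptools==57.5.0 -i https://pypi.tuna.tsinghua.edu.cn/simple/
Run the code:
python double_DQN\ \&\ dueling_DQN.py
It turned out that pyglet requires at least Python 3.8. After reinstalling with python3.8, I used the requirements file to install everything into the new environment in one step.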
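As an aside on the step change described earlier: pinning gym, as done here, is the simplest fix. A possible alternative, which I did not use, is a small adapter that accepts both return shapes. The sketch below assumes the newer API returns (state, reward, terminated, truncated, info) from step and (state, info) from reset; unpack_step and unpack_reset are hypothetical helper names, not gym functions.

```python
def unpack_step(result):
    """Normalize env.step() output to (state, reward, done, info).

    Accepts the 4-tuple of older gym releases and the 5-tuple
    (state, reward, terminated, truncated, info) of newer ones.
    """
    if len(result) == 5:
        state, reward, terminated, truncated, info = result
        # An episode is over if it either terminated or was truncated
        return state, reward, terminated or truncated, info
    state, reward, done, info = result
    return state, reward, done, info


def unpack_reset(result):
    """Return only the state, whether reset() gave state or (state, info)."""
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    return result
```

With these helpers the template line would become s_, r, done, _ = unpack_step(self.env.step(a)). Note that the reset check is a heuristic and would misfire for an environment whose state is itself a 2-tuple ending in a dict.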
Running it raised an error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/gym/envs/classic_control/rendering.py", line 27, in <module>
    from pyglet.gl import *
  File "/usr/local/lib/python3.8/dist-packages/pyglet/gl/__init__.py", line 47, in <module>
    from pyglet.gl.gl import *
  File "/usr/local/lib/python3.8/dist-packages/pyglet/gl/gl.py", line 7, in <module>
    from pyglet.gl.lib import link_GL as _link_function
  File "/usr/local/lib/python3.8/dist-packages/pyglet/gl/lib.py", line 98, in <module>
    from pyglet.gl.lib_glx import link_GL, link_GLX
  File "/usr/local/lib/python3.8/dist-packages/pyglet/gl/lib_glx.py", line 11, in <module>
    gl_lib = pyglet.lib.load_library('GL')
  File "/usr/local/lib/python3.8/dist-packages/pyglet/lib.py", line 134, in load_library
    raise ImportError(f'Library "{names[0]}" not found.')
ImportError: Library "GL" not found.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "double_DQN & dueling_DQN.py", line 195, in <module>
    ddqn.train(200)
  File "double_DQN & dueling_DQN.py", line 161, in train
    if self.is_rend:self.env.render()
  File "/usr/local/lib/python3.8/dist-packages/gym/core.py", line 254, in render
    return self.env.render(mode, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/gym/envs/classic_control/cartpole.py", line 179, in render
    from gym.envs.classic_control import rendering
  File "/usr/local/lib/python3.8/dist-packages/gym/envs/classic_control/rendering.py", line 29, in <module>
    raise ImportError(
ImportError:
    Error occurred while running `from pyglet.gl import *`
    HINT: make sure you have OpenGL installed. On Ubuntu, you can run 'apt-get install python-opengl'.
    If you're running on a server, you may need a virtual frame buffer; something like this should work:
    'xvfb-run -s "-screen 0 1400x900x24" python <your_script.py>'
The final message explains the cause: OpenGL is missing, and rendering on a server without a display needs a virtual frame buffer. I followed the suggested fix:
apt-get install python-opengl
xvfb-run -s "-screen 0 1400x900x24" python double_DQN\ \&\ dueling_DQN.py
After that, another error appeared:
Traceback (most recent call last):
  File "double_DQN & dueling_DQN.py", line 195, in <module>
    ddqn.train(200)
  File "double_DQN & dueling_DQN.py", line 161, in train
    if self.is_rend:self.env.render()
  File "/usr/local/lib/python3.8/dist-packages/gym/core.py", line 254, in render
    return self.env.render(mode, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/gym/envs/classic_control/cartpole.py", line 229, in render
    return self.viewer.render(return_rgb_array=mode == "rgb_array")
  File "/usr/local/lib/python3.8/dist-packages/gym/envs/classic_control/rendering.py", line 126, in render
    self.transform.enable()
  File "/usr/local/lib/python3.8/dist-packages/gym/envs/classic_control/rendering.py", line 232, in enable
    glPushMatrix()
NameError: name 'glPushMatrix' is not defined
The cause is that the pyglet version is too new; downgrading to 1.5.27 fixes it.
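To catch this before a long run, the installed pyglet version can be checked up front. This is a minimal sketch under one assumption that goes beyond the post: I treat every 1.x release as compatible and every 2.x release as incompatible, because as far as I understand pyglet 2.0 dropped the legacy OpenGL calls such as glPushMatrix from pyglet.gl. The function names are hypothetical.

```python
from importlib import metadata  # standard library since Python 3.8


def parse_version(text):
    """Turn '1.5.27' into (1, 5, 27); parsing stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def pyglet_supports_legacy_gl(version_text):
    """Assumption: only pyglet releases below 2.0 work with gym 0.20.0 rendering."""
    return parse_version(version_text) < (2, 0)


if __name__ == "__main__":
    try:
        installed = metadata.version("pyglet")
    except metadata.PackageNotFoundError:
        print("pyglet is not installed")
    else:
        verdict = "ok" if pyglet_supports_legacy_gl(installed) else "too new, pin pyglet==1.5.27"
        print(f"pyglet {installed}: {verdict}")
```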
4. Installing GPU support
Choose the CUDA and cuDNN versions that match the TensorFlow version.
4.1 Installing CUDA 11.2
wget https://developer.download.nvidia.com/compute/cuda/11.2.2/local_installers/cuda_11.2.2_460.32.03_linux.run
chmod +x cuda_11.2.2_460.32.03_linux.run
sudo ./cuda_11.2.2_460.32.03_linux.run
After installation, add the CUDA paths to the environment variables. Open ~/.bashrc and append these two lines at the end:
export PATH=/usr/local/cuda-11.2/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
Then reload the file so the variables take effect:
source ~/.bashrc
Verify that CUDA installed correctly:
nvcc -V
4.2 Installing cuDNN 8.1
Download it from the NVIDIA website, then upload it to the server:
tar -xzvf cudnn-11.2-linux-x64-v8.1.1.33.tgz
cp -P cuda/include/cudnn*.h /usr/local/cuda-11.2/include
cp -P cuda/lib64/libcudnn* /usr/local/cuda-11.2/lib64/
chmod a+r /usr/local/cuda-11.2/include/cudnn*.h /usr/local/cuda-11.2/lib64/libcudnn*
Use the following code to test whether the GPU works:
import tensorflow as tf

# Show the GPU devices TensorFlow can see
print(tf.config.list_physical_devices('GPU'))

# Create a TensorFlow session and run a simple computation in it
with tf.compat.v1.Session() as sess:
    # Create a constant tensor
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    # Create a variable tensor
    b = tf.Variable(tf.random.normal([3, 2], stddev=0.1), name='b')
    # Multiply the two matrices
    c = tf.matmul(a, b, name='c')
    # Initialize all variables
    sess.run(tf.compat.v1.global_variables_initializer())
    # Run the graph
    print(sess.run(c))
The output below shows that the GPU works:
2023-03-05 14:43:55.699137: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-05 14:43:55.861866: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-03-05 14:43:56.628130: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library libnvinfer.so.7; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.2/lib64
2023-03-05 14:43:56.628241: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library libnvinfer_plugin.so.7; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.2/lib64
2023-03-05 14:43:56.628256: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2023-03-05 14:43:58.016301: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:58.031069: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:58.032359: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
2023-03-05 14:43:58.035332: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-05 14:43:58.036833: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:58.038127: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:58.039367: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:59.106971: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:59.108369: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:59.109663: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-03-05 14:43:59.110894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30969 MB memory: -> device: 0, name: Tesla V100S-PCIE-32GB, pci bus id: 0000:00:06.0, compute capability: 7.0
2023-03-05 14:43:59.127126: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:357] MLIR V1 optimization pass is not enabled
[[0.50951785 0.10452858]
 [1.070737   0.27480656]]
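If the device list had come back empty instead, one thing worth ruling out is that the two CUDA directories never made it into the environment of the current shell. The sketch below only inspects environment variables. The function name is hypothetical, and the default location /usr/local/cuda-11.2 is taken from the paths used in this post.

```python
import os


def missing_cuda_entries(environ, cuda_home="/usr/local/cuda-11.2"):
    """Return {variable: directory} for each CUDA directory absent from environ."""
    wanted = {
        "PATH": f"{cuda_home}/bin",
        "LD_LIBRARY_PATH": f"{cuda_home}/lib64",
    }
    missing = {}
    for var, directory in wanted.items():
        # Split on ':' and ignore empty entries and trailing slashes
        entries = [e.rstrip("/") for e in environ.get(var, "").split(":") if e]
        if directory not in entries:
            missing[var] = directory
    return missing


if __name__ == "__main__":
    problems = missing_cuda_entries(os.environ)
    if problems:
        for var, directory in problems.items():
            print(f"{var} does not contain {directory}")
    else:
        print("CUDA directories found in PATH and LD_LIBRARY_PATH")
```

An empty result only says the variables are set; it does not prove that the libraries in those directories load correctly.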
With GPU acceleration the template code did not feel much faster, possibly because the neural network has only a few layers.
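One rough way to probe that guess is to time the same operation at a small and a large size on each device. This is only a sketch: the matrix sizes and function names are arbitrary choices of mine, a matrix multiplication is just a stand-in for the template's network, and the numbers will vary from machine to machine. If TensorFlow is not installed, the comparison returns None.

```python
import time


def time_matmul(tf, device, n, repeats=10):
    """Average seconds per n x n matrix multiplication on the given device."""
    with tf.device(device):
        a = tf.random.normal([n, n])
        b = tf.random.normal([n, n])
        tf.matmul(a, b).numpy()  # warm-up run, excluded from the timing
        start = time.perf_counter()
        for _ in range(repeats):
            c = tf.matmul(a, b)
        c.numpy()  # copying the result back waits for the device to finish
        return (time.perf_counter() - start) / repeats


def compare_devices(sizes=(32, 2048)):
    """Return {device: {n: seconds}}, or None if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    devices = ["/CPU:0"]
    if tf.config.list_physical_devices("GPU"):
        devices.append("/GPU:0")
    return {d: {n: time_matmul(tf, d, n) for n in sizes} for d in devices}


if __name__ == "__main__":
    print(compare_devices())
```

If the GPU only pulls ahead at the larger size, that would be consistent with the idea that a small network leaves the GPU mostly idle.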