Slowness of first cudaMalloc (K40 vs K20), even after cudaSetDevice
Question
I understand that CUDA performs initialization during the first API call, but the time spent is just too much, even after a separate cudaSetDevice call.
Test program:
The same program, built with CUDA 7.0 (compute_35) + Visual Studio 2012 + NSight 4.5, was run on 2 separate machines (no rebuilding).
Before the 1st cudaMalloc, I call cudaSetDevice.
On my PC (Win7 + Tesla K20), the 1st cudaMalloc takes 150ms.
On my server (Win2012 + Tesla K40), it takes 1100ms!!
On both machines, subsequent cudaMalloc calls are much faster.
My questions are:
1, Why does the K40 take so much longer (1100ms vs 150ms) for the 1st cudaMalloc? The K40 is supposed to be better than the K20.
2, I thought cudaSetDevice would capture the init time? e.g. this answer from talonmies
3, If the initialization is unavoidable, can process A maintain its status (or context) on the GPU while process B is running on the same GPU? I understand I'd better run the GPU in "exclusive" mode, but can process A "suspend" so that it doesn't need to initialize the GPU again later?
Thanks in advance.
Recommended answer
1, Why does the K40 take so much longer (1100ms vs 150ms) for the 1st cudaMalloc? The K40 is supposed to be better than the K20.
The details of the initialization process are not specified; however, by observation, the amount of system memory affects initialization time. CUDA initialization usually includes establishment of UVM, which involves harmonizing the device and host memory maps. If your server has more system memory than your PC, that is one possible explanation for the disparity in initialization time. The OS may have an effect as well, and the memory size of the GPU may also have an effect.
2, I thought cudaSetDevice would capture the init time? e.g. this answer from talonmies
The CUDA initialization process is a "lazy" initialization. That means that just enough of the initialization process will be completed to support the requested operation. If the requested operation is cudaSetDevice, this may require less of the initialization to be complete (which means the apparent time required may be shorter) than if the requested operation is cudaMalloc. As a result, some of the initialization overhead may be absorbed into the cudaSetDevice operation, while some additional initialization overhead may be absorbed into a subsequent cudaMalloc operation.
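To see how the lazy initialization cost is distributed on a given machine, each call can be timed separately. The following is a minimal sketch, not the asker's program: the helper names time_ms and check, the use of device 0, and the 1 MiB allocation size are assumptions made for illustration. The cudaFree(0) call is a commonly used idiom for triggering context creation before the first real allocation; how much of the overhead it actually absorbs may vary with the CUDA version, driver, and OS.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: time a callable with a host-side clock, in milliseconds.
template <typename F>
double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Hypothetical helper: report a failed CUDA call without aborting,
// so the remaining timings are still printed.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess)
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
}

int main() {
    void* a = nullptr;
    void* b = nullptr;
    const size_t bytes = 1 << 20;  // 1 MiB, an arbitrary size for illustration

    // Device 0 is an assumption; pick the device your application uses.
    double t_set = time_ms([&] { check(cudaSetDevice(0), "cudaSetDevice"); });
    // Common idiom intended to force context creation up front.
    double t_ctx = time_ms([&] { check(cudaFree(0), "cudaFree(0)"); });
    double t_m1  = time_ms([&] { check(cudaMalloc(&a, bytes), "1st cudaMalloc"); });
    double t_m2  = time_ms([&] { check(cudaMalloc(&b, bytes), "2nd cudaMalloc"); });

    std::printf("cudaSetDevice : %8.2f ms\n", t_set);
    std::printf("cudaFree(0)   : %8.2f ms\n", t_ctx);
    std::printf("1st cudaMalloc: %8.2f ms\n", t_m1);
    std::printf("2nd cudaMalloc: %8.2f ms\n", t_m2);

    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

If most of the time lands in cudaFree(0) rather than in the first cudaMalloc, the delay is initialization overhead rather than allocation cost, and it can be moved to application startup.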
3, If the initialization is unavoidable, can process A maintain its status (or context) on the GPU while process B is running on the same GPU? I understand I'd better run the GPU in "exclusive" mode, but can process A "suspend" so that it doesn't need to initialize the GPU again later?
Independent host processes will generally spawn independent CUDA contexts. Each CUDA context has its own initialization requirement, so the fact that another, separate CUDA context may already be initialized on the device will not provide much benefit if a new CUDA context needs to be initialized (perhaps from a separate host process). Normally, keeping a process active involves keeping an application running in that process. Applications have various mechanisms to "sleep" or suspend behavior. As long as the application has not terminated, any context established by that application should not require re-initialization (except, perhaps, if cudaDeviceReset is called).
In general, some benefit may be obtained on systems that allow the GPUs to go into a deep idle mode by setting GPU persistence mode (using nvidia-smi). However, this will not be relevant for GeForce GPUs, nor will it generally be relevant on a Windows system.
Additionally, on multi-GPU systems, if the application does not need multiple GPUs, some initialization time can usually be avoided by using the CUDA_VISIBLE_DEVICES environment variable to restrict the CUDA runtime to only the necessary devices.
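One way to confirm that CUDA_VISIBLE_DEVICES has taken effect is to list the devices the runtime reports. This is a small sketch for checking the setting, not part of the original answer; the helper name visible_device_count is an assumption.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of devices the CUDA runtime can see,
// or -1 if the query fails (for example, when no device is visible).
int visible_device_count() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return -1;
    }
    return count;
}

int main() {
    int count = visible_device_count();
    if (count < 0) return 1;

    std::printf("visible devices: %d\n", count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        // Visible devices are renumbered starting from 0.
        if (cudaGetDeviceProperties(&prop, i) == cudaSuccess)
            std::printf("  device %d: %s\n", i, prop.name);
    }
    return 0;
}
```

The variable has to be set in the environment before the program starts, for example with set CUDA_VISIBLE_DEVICES=0 in a Windows command prompt. With that setting, the program should report a single device, listed as device 0.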