In the following example, a CUDA application that comes with the CUDA samples is run. In the output, GPU 0 is the fastest in a DGX Station A100, and GPU 4 (the DGX Display GPU) is the slowest.

The sample is built first:

    /usr/local/cuda/bin/nvcc -ccbin g++ -I../../common/inc -m64 --threads 0 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o p2pBandwidthLatencyTest.o -c p2pBandwidthLatencyTest.cu
    nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
    … -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 … -gencode arch=compute_86,code=compute_86 -o p2pBandwidthLatencyTest p2pBandwidthLatencyTest.o
    cp p2pBandwidthLatencyTest …

The test is then run from the release directory:

    $ cd /usr/local/cuda-11.2/samples/bin/x86_64/linux/release
    $ ./p2pBandwidthLatencyTest
    Device: 0, Graphics Device, pciBusID: 1, pciDeviceID: 0, pciDomainID:0
    Device: 1, Graphics Device, pciBusID: 47, pciDeviceID: 0, pciDomainID:0
    Device: 2, Graphics Device, pciBusID: 81, pciDeviceID: 0, pciDomainID:0
    Device: 3, Graphics Device, pciBusID: c2, pciDeviceID: 0, pciDomainID:0
    Device: 4, DGX Display, pciBusID: c1, pciDeviceID: 0, pciDomainID:0

    ***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
    So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

The test then reports the following matrices:

    Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
    Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
    Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
    Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
    P2P=Enabled Latency (P2P Writes) Matrix (us)

    NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

The example above shows the peer-to-peer bandwidth and latency test across all five GPUs. The application also shows that there is no peer-to-peer connectivity between any GPU and GPU 4. This indicates that GPU 4 should not be used for high-performance workloads.

Run the example one more time by using the CUDA_VISIBLE_DEVICES variable, which limits the number of GPUs that the application can see:

    /usr/local/cuda-11.2/samples/bin/x86_64/linux/release$ CUDA_VISIBLE_DEVICES=0,1,2,3 ./p2pBandwidthLatencyTest
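
The CUDA_VISIBLE_DEVICES restriction described in this section can also be applied from a launcher script. The following is a minimal Python sketch, not part of the CUDA samples. The DEVICES table is copied by hand from the device list printed by p2pBandwidthLatencyTest, and the helper names visible_devices and compute_env are hypothetical. It assumes the display GPU can be recognized by the name "DGX Display".

```python
import os

# Device index -> name, copied by hand from the p2pBandwidthLatencyTest
# device list in this section (an assumption; nothing is queried at runtime).
DEVICES = {
    0: "Graphics Device",
    1: "Graphics Device",
    2: "Graphics Device",
    3: "Graphics Device",
    4: "DGX Display",
}


def visible_devices(devices, exclude_name="DGX Display"):
    """Return a CUDA_VISIBLE_DEVICES value that omits GPUs named exclude_name."""
    keep = [str(idx) for idx, name in sorted(devices.items()) if name != exclude_name]
    return ",".join(keep)


def compute_env(devices, base_env=None):
    """Copy an environment and restrict CUDA to the remaining GPUs."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = visible_devices(devices)
    return env


if __name__ == "__main__":
    # Prints the value to export before launching a CUDA application.
    print(visible_devices(DEVICES))  # -> 0,1,2,3
```

The dictionary returned by compute_env can be passed as the env argument of subprocess.run when starting the test binary, so the restriction applies only to that child process and not to the whole shell session.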