Think Deep,Work Lean

2025-06-19-在POD中验证GPU是否可用

Posted on By zack

背景

使用Hami后,可以创建一个Pod申请vGPU,但是申请到的GPU是否真实可用呢?因此,本文提供一种方法来检查该pod是否真实可用。

环境

K8s: v1.23.17 GPU: H20 Hami: v2.6.0

步骤

创建POD yaml,并apply到K8s:

apiVersion: v1
kind: Pod
metadata:
  name: torch-gpu-test
spec:
  containers:
  - name: torch-container
    image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
    command: [ "python", "-c" ]
    args:
      - |
        import torch
        print('PyTorch version:', torch.__version__)
        if torch.cuda.is_available():
            print('CUDA is available!')
            print('GPU count:', torch.cuda.device_count())
            print('Current GPU:', torch.cuda.get_device_name(0))
            # 运行简单的 GPU 计算
            x = torch.tensor([1.0, 2.0]).cuda()
            y = torch.tensor([3.0, 4.0]).cuda()
            print('GPU计算结果:', x + y)
        else:
            print('CUDA不可用,使用CPU')
    resources:
      limits:
        nvidia.com/gpu: 1  # 请求 1 个 GPU
  restartPolicy: Never  # 执行一次后不再重启

创建完这个pod后,进入这个Pod直接执行命令,可以验证。

root@torch-gpu-test:/workspace# python3
Python 3.10.11 (main, Apr 20 2023, 19:02:41) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_
torch.cuda.is_available()                 torch.cuda.is_bf16_supported()            torch.cuda.is_current_stream_capturing()  torch.cuda.is_initialized()               
>>> torch.cuda.is_available()
True

看日志也是正常:

/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py:173: UserWarning: 
NVIDIA H20 with CUDA capability sm_90 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 sm_80 sm_86 compute_37.
If you want to use the NVIDIA H20 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
PyTorch version: 2.0.1
CUDA is available!
GPU count: 1
Current GPU: NVIDIA H20
GPU计算结果: tensor([4., 6.], device='cuda:0')

使用nvidia-smi查看GPU卡:

docker run --gpus all -it --rm nvidia/cuda:11.8.0-base-ubuntu22.04 bash

nvida-smi