Installing the NVIDIA Kubernetes device plugin and DCGM

kmaster 2024. 5. 11. 23:02

This post walks through the steps for using NVIDIA GPUs in Kubernetes and monitoring them with Prometheus.

 

The NVIDIA driver must already be installed on each GPU node before you start.
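
A quick way to confirm this prerequisite is to ask nvidia-smi for the driver version on the node. The helper below is only a sketch: the function name check_nvidia_driver is made up for illustration, and it assumes nvidia-smi was installed together with the driver.

```shell
# Hypothetical helper (not part of any NVIDIA tool): checks that the driver's
# nvidia-smi utility is on PATH and prints each GPU's name and driver version.
check_nvidia_driver() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found: install the NVIDIA driver on this node first" >&2
    return 1
  fi
  # One CSV line per GPU: "<name>, <driver version>"
  nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
}
```

Run check_nvidia_driver on every GPU node. A non-zero exit status means the driver still needs to be installed there.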

 

Installation

 

1. Install the NVIDIA Container Toolkit

# Add the NVIDIA Container Toolkit yum repository
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
# Optional: also enable the experimental packages
sudo yum-config-manager --enable nvidia-container-toolkit-experimental
# Install the toolkit
sudo yum install -y nvidia-container-toolkit

 

Post-install configuration

sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
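
Before moving on, it can help to confirm that the configure step actually registered an nvidia runtime with containerd. The sketch below assumes the config file is at /etc/containerd/config.toml, which is the usual location but may differ on your node. The function name has_nvidia_runtime is hypothetical.

```shell
# Hypothetical helper: succeeds if the containerd config contains a
# "runtimes.nvidia" section, which nvidia-ctk is expected to add.
# $1 = path to the config file (assumed default: /etc/containerd/config.toml)
has_nvidia_runtime() {
  grep -q 'runtimes\.nvidia' "${1:-/etc/containerd/config.toml}"
}

# Usage:
#   has_nvidia_runtime && echo "nvidia runtime is configured"
```

If the device plugin pod later reports that it cannot find the NVIDIA libraries, one thing to check is whether nvidia needs to be the default containerd runtime on that node. nvidia-ctk runtime configure accepts a --set-as-default option for this, but whether you need it depends on your cluster setup.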

 

2. Enable GPU support in Kubernetes

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.15.0/deployments/static/nvidia-device-plugin.yml
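
Once the device plugin pods are running, each GPU node should advertise an nvidia.com/gpu resource. The following sketch lists the nodes that do. The function name gpu_nodes is made up for illustration, and it assumes kubectl is already pointed at the target cluster.

```shell
# Hypothetical helper: prints "<node> <gpu count>" for every node that reports
# at least one allocatable nvidia.com/gpu resource.
gpu_nodes() {
  kubectl get nodes --no-headers \
    -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' |
    awk '$2 != "<none>" && $2 != "0" { print $1, $2 }'
}
```

If gpu_nodes prints nothing, the plugin has not registered any GPUs yet. The logs of the nvidia-device-plugin pods in kube-system are usually the first place to look.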

 

 

Test

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF
$ kubectl logs -f -n default gpu-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

 

Monitoring

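NVIDIA DCGM (Data Center GPU Manager) is a suite of tools for managing and monitoring NVIDIA GPUs in large-scale, Linux-based cluster environments. The dcgm-exporter chart installed below exposes DCGM metrics in a format Prometheus can scrape.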

 

$ helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
$ helm repo update

 

$ helm install \
    --generate-name \
    gpu-helm-charts/dcgm-exporter -n kube-system
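
To check that the exporter is serving data before involving Prometheus, you can read its /metrics endpoint directly. The sketch below only filters the output. The function name dcgm_metric is hypothetical, and the usage comments assume the exporter's Service listens on port 9400 and is named after the generated Helm release, so verify both with kubectl get svc -n kube-system.

```shell
# Hypothetical helper: reads Prometheus text-format metrics on stdin and prints
# only the sample lines of the metric named in $1 (HELP/TYPE comments are skipped).
dcgm_metric() {
  grep -E "^$1(\{| )"
}

# Usage (service name and port are assumptions, adjust to your release):
#   kubectl -n kube-system port-forward svc/<release-name> 9400:9400 &
#   curl -s http://localhost:9400/metrics | dcgm_metric DCGM_FI_DEV_GPU_UTIL
```

Passing a fixed release name to helm install instead of --generate-name would make the Service and ServiceMonitor names predictable, which can be more convenient for checks like this.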

 

Installing the chart this way also creates a ServiceMonitor for dcgm-exporter:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    meta.helm.sh/release-name: dcgm-exporter-1715426946
    meta.helm.sh/release-namespace: kube-system
  generation: 1
  labels:
    #app.kubernetes.io/component: dcgm-exporter
    #app.kubernetes.io/instance: dcgm-exporter-1715426946
    #app.kubernetes.io/managed-by: Helm
    #app.kubernetes.io/name: dcgm-exporter
    #app.kubernetes.io/version: 3.4.1
    #helm.sh/chart: dcgm-exporter-3.4.1
    release: prometheus
  name: dcgm-exporter-1715426946
  namespace: kube-system
spec:
  endpoints:
  - interval: 15s
    path: /metrics
    port: metrics
    relabelings: []
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      app.kubernetes.io/component: dcgm-exporter
      app.kubernetes.io/instance: dcgm-exporter-1715426946
      app.kubernetes.io/name: dcgm-exporter

 

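Set the label release: prometheus on this ServiceMonitor so that Prometheus picks it up. (The value is the Helm release name of my Prometheus installation, which I deployed from the https://github.com/prometheus-community/helm-charts/ charts.)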
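
One way to apply that label without editing the manifest by hand is kubectl label. This is only a sketch: the function name label_servicemonitor is made up, and the label value must match whatever your own Prometheus selects, which is not necessarily prometheus.

```shell
# Hypothetical helper: sets release=<value> on a ServiceMonitor.
# $1 = ServiceMonitor name, $2 = namespace, $3 = label value
#      (e.g. the Helm release name of your Prometheus installation)
label_servicemonitor() {
  kubectl label servicemonitor "$1" -n "$2" "release=$3" --overwrite
}

# Usage:
#   # Show which labels your Prometheus instances select on ServiceMonitors
#   kubectl get prometheus -A -o jsonpath='{.items[*].spec.serviceMonitorSelector}'
#   label_servicemonitor dcgm-exporter-1715426946 kube-system prometheus
```

A label added this way may be overwritten by a later helm upgrade of the exporter. If the chart exposes a value for extra ServiceMonitor labels (helm show values gpu-helm-charts/dcgm-exporter lists the available values), setting it there would be more durable.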

 

 

References

https://github.com/NVIDIA/k8s-device-plugin

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

https://github.com/NVIDIA/dcgm-exporter
