Litmus Chaos Engineering - Testing
The general procedure for a chaos test is as follows.
1. Prepare the application under test and define its steady state (see the probe sketch after this list).
2. Define the chaos scenarios to test (for example, abnormal Pod termination, network latency, disk failure).
3. Monitor the application's state while the chaos runs.
4. Roll back the chaos and confirm that the application returns to its steady state.
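In Litmus, the steady state from step 1 can be expressed as a probe attached to the ChaosEngine; the run is marked as failed when the probe's criteria are violated. Below is a minimal sketch using the Litmus 2.x httpProbe schema, reusing the sampleapp host configured later in this post; the probe name and timing values are illustrative, not from the original setup.

    # Illustrative steady-state check; attach under experiments[].spec.probe in a ChaosEngine.
    probe:
      - name: check-sampleapp-http        # hypothetical probe name
        type: httpProbe
        mode: Continuous                  # evaluate throughout the chaos duration
        httpProbe/inputs:
          url: http://sampleapp.10.20.20.100.nip.io
          method:
            get:
              criteria: ==                # fail the run if the response code differs
              responseCode: "200"
        runProperties:
          probeTimeout: 5                 # illustrative values, in seconds
          interval: 2
          retry: 1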
Application Preparation
Let's deploy the application under test as follows.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sampleapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sampleapp
  template:
    metadata:
      labels:
        app: sampleapp
    spec:
      containers:
        - name: sampleapp
          image: nginx
          ports:
            - containerPort: 80
          imagePullPolicy: Always
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: nginx
  name: ingress-sampleapp
spec:
  rules:
    - host: sampleapp.10.20.20.100.nip.io
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: sampleapp
                port:
                  number: 80
---
apiVersion: v1
kind: Service
metadata:
  name: sampleapp
spec:
  type: ClusterIP
  selector:
    app: sampleapp
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
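Assuming the manifests above are saved as sampleapp.yaml (an illustrative file name), create the litmus-test namespace used throughout this post and apply them:

    # k create ns litmus-test
    # k apply -n litmus-test -f sampleapp.yaml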
The deployed resources are as follows.
# k get all -n litmus-test
NAME                             READY   STATUS              RESTARTS   AGE
pod/sampleapp-6c65c5cdb7-q9bjr   0/1     ContainerCreating   0          9s

NAME                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/sampleapp   ClusterIP   10.107.81.167   <none>        80/TCP    9s

NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/sampleapp   0/1     1            0           9s

NAME                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/sampleapp-6c65c5cdb7   1         1         0       9s
Calling the domain returns the default nginx welcome page.
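A quick check from the command line (the nip.io host resolves to the ingress controller's address); the title shown is the stock nginx welcome page:

    # curl -s http://sampleapp.10.20.20.100.nip.io | grep '<title>'
    <title>Welcome to nginx!</title>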
Now let's use Litmus to run a chaos scenario that drives up the application's memory usage.
First, create the following service account and RBAC policy (in the namespace where the app is deployed).
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-memory-hog-sa
  namespace: litmus-test
  labels:
    name: pod-memory-hog-sa
    app.kubernetes.io/part-of: litmus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-memory-hog-sa
  namespace: litmus-test
  labels:
    name: pod-memory-hog-sa
    app.kubernetes.io/part-of: litmus
rules:
  # Create and monitor the experiment & helper pods
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create","delete","get","list","patch","update","deletecollection"]
  # Perform CRUD operations on the events inside chaosengine and chaosresult
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create","get","list","patch","update"]
  # Fetch configmap details and mount them into the experiment pod (if specified)
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get","list"]
  # Track and get the runner, experiment, and helper pod logs
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get","list","watch"]
  # Execute commands inside the target container
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["get","list","create"]
  # Derive the parent/owner details of the pod (if the parent is a deployment, statefulset, replicaset, or daemonset)
  - apiGroups: ["apps"]
    resources: ["deployments","statefulsets","replicasets","daemonsets"]
    verbs: ["list","get"]
  # Derive the parent/owner details of the pod (if the parent is a deploymentConfig)
  - apiGroups: ["apps.openshift.io"]
    resources: ["deploymentconfigs"]
    verbs: ["list","get"]
  # Derive the parent/owner details of the pod (if the parent is a replicationController)
  - apiGroups: [""]
    resources: ["replicationcontrollers"]
    verbs: ["get","list"]
  # Derive the parent/owner details of the pod (if the parent is an argo rollout)
  - apiGroups: ["argoproj.io"]
    resources: ["rollouts"]
    verbs: ["list","get"]
  # Configure and monitor the experiment job via the chaos-runner pod
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create","list","get","delete","deletecollection"]
  # Create, poll, and delete the litmus chaos resources used within a chaos workflow
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines","chaosexperiments","chaosresults"]
    verbs: ["create","list","get","patch","update","delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-memory-hog-sa
  namespace: litmus-test
  labels:
    name: pod-memory-hog-sa
    app.kubernetes.io/part-of: litmus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-memory-hog-sa
subjects:
  - kind: ServiceAccount
    name: pod-memory-hog-sa
    namespace: litmus-test
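Apply the RBAC manifests (rbac.yaml is an illustrative file name) and confirm that the objects exist:

    # k apply -f rbac.yaml
    # k get sa,role,rolebinding -n litmus-test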
Before creating the scenario, create the ChaosExperiment that the experiment needs.
---
apiVersion: litmuschaos.io/v1alpha1
description:
  message: |
    Injects memory consumption on pods belonging to an app deployment
kind: ChaosExperiment
metadata:
  name: pod-memory-hog
  labels:
    name: pod-memory-hog
    app.kubernetes.io/part-of: litmus
    app.kubernetes.io/component: chaosexperiment
    app.kubernetes.io/version: 2.11.0
spec:
  definition:
    scope: Namespaced
    permissions:
      # Create and monitor the experiment & helper pods
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["create","delete","get","list","patch","update","deletecollection"]
      # Perform CRUD operations on the events inside chaosengine and chaosresult
      - apiGroups: [""]
        resources: ["events"]
        verbs: ["create","get","list","patch","update"]
      # Fetch configmap details and mount them into the experiment pod (if specified)
      - apiGroups: [""]
        resources: ["configmaps"]
        verbs: ["get","list"]
      # Track and get the runner, experiment, and helper pod logs
      - apiGroups: [""]
        resources: ["pods/log"]
        verbs: ["get","list","watch"]
      # Execute commands inside the target container
      - apiGroups: [""]
        resources: ["pods/exec"]
        verbs: ["get","list","create"]
      # Derive the parent/owner details of the pod (if the parent is a deployment, statefulset, replicaset, or daemonset)
      - apiGroups: ["apps"]
        resources: ["deployments","statefulsets","replicasets","daemonsets"]
        verbs: ["list","get"]
      # Derive the parent/owner details of the pod (if the parent is a deploymentConfig)
      - apiGroups: ["apps.openshift.io"]
        resources: ["deploymentconfigs"]
        verbs: ["list","get"]
      # Derive the parent/owner details of the pod (if the parent is a replicationController)
      - apiGroups: [""]
        resources: ["replicationcontrollers"]
        verbs: ["get","list"]
      # Derive the parent/owner details of the pod (if the parent is an argo rollout)
      - apiGroups: ["argoproj.io"]
        resources: ["rollouts"]
        verbs: ["list","get"]
      # Configure and monitor the experiment job via the chaos-runner pod
      - apiGroups: ["batch"]
        resources: ["jobs"]
        verbs: ["create","list","get","delete","deletecollection"]
      # Create, poll, and delete the litmus chaos resources used within a chaos workflow
      - apiGroups: ["litmuschaos.io"]
        resources: ["chaosengines","chaosexperiments","chaosresults"]
        verbs: ["create","list","get","patch","update","delete"]
    image: "litmuschaos/go-runner:2.11.0"
    imagePullPolicy: Always
    args:
      - -c
      - ./experiments -name pod-memory-hog
    command:
      - /bin/bash
    env:
      - name: TOTAL_CHAOS_DURATION
        value: '60'
      ## amount of memory in megabytes to be consumed by the application pod
      - name: MEMORY_CONSUMPTION
        value: '500'
      ## number of workers used to apply the stress
      - name: NUMBER_OF_WORKERS
        value: '1'
      ## percentage of total pods to target
      - name: PODS_AFFECTED_PERC
        value: ''
      ## period to wait before and after injection of chaos, in seconds
      - name: RAMP_TIME
        value: ''
      ## library used to execute the chaos
      ## default: litmus. Supported values: litmus, pumba
      - name: LIB
        value: 'litmus'
      ## used by the pumba lib only
      - name: LIB_IMAGE
        value: 'litmuschaos/go-runner:2.11.0'
      ## used by the pumba lib only
      - name: STRESS_IMAGE
        value: 'alexeiled/stress-ng:latest-ubuntu'
      ## cluster container runtime
      - name: CONTAINER_RUNTIME
        value: 'containerd'
      ## container runtime socket file path
      - name: SOCKET_PATH
        value: '/run/containerd/containerd.sock'
      ## sequence of chaos execution for multiple target pods
      ## supported values: serial, parallel
      - name: SEQUENCE
        value: 'parallel'
      - name: TARGET_PODS
        value: ''
      ## to select pods on specific node(s)
      - name: NODE_LABEL
        value: ''
    labels:
      name: pod-memory-hog
      app.kubernetes.io/part-of: litmus
      app.kubernetes.io/component: experiment-job
      app.kubernetes.io/runtime-api-usage: "true"
      app.kubernetes.io/version: 2.11.0
Note: my cluster uses containerd rather than docker, so CONTAINER_RUNTIME must be set to containerd. Adjust SOCKET_PATH to match your runtime as well.
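If you are unsure which runtime your cluster uses, kubectl reports it in the CONTAINER-RUNTIME column of the wide node listing, and you can confirm the experiment was registered at the same time:

    # k get nodes -o wide    # CONTAINER-RUNTIME column shows e.g. containerd://...
    # k get chaosexperiments -n litmus-test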
Now let's create the memory hog chaos scenario.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
  namespace: litmus-test
spec:
  # It can be active/stop
  engineState: 'active'
  annotationCheck: "false"
  appinfo:
    appns: 'litmus-test'
    # FYI, to see app labels, run: kubectl get pods --show-labels
    applabel: 'app=sampleapp'
    appkind: 'deployment'
  chaosServiceAccount: pod-memory-hog-sa
  experiments:
    - name: pod-memory-hog
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '300' # in seconds
            # amount of memory in megabytes to consume inside the target pod
            - name: MEMORY_CONSUMPTION
              value: '500'
            # name of the container runtime
            # the litmus LIB supports docker, containerd, and crio
            # the pumba LIB supports docker only
            - name: CONTAINER_RUNTIME
              value: 'containerd'
            # container runtime socket file path
            - name: SOCKET_PATH
              value: '/run/containerd/containerd.sock'
            # percentage of total pods to target
            - name: PODS_AFFECTED_PERC
              value: ''
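Because engineState is active, applying the manifest starts the experiment right away (engine.yaml is an illustrative file name):

    # k apply -f engine.yaml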
Let's look at the state of the running chaos test.
# k get po -n litmus-test
NAME                           READY   STATUS    RESTARTS   AGE
engine-nginx-runner            1/1     Running   0          24s
pod-memory-hog-helper-imjzld   1/1     Running   0          12s
pod-memory-hog-n6kpdk-mpfnr    1/1     Running   0          23s
sampleapp-6c65c5cdb7-q9bjr     1/1     Running   0          3h6m
Memory usage before the run:
# k top pod -n litmus-test
NAME                         CPU(cores)   MEMORY(bytes)
sampleapp-6c65c5cdb7-q9bjr   0m           24Mi
Memory usage after starting the experiment:
# k top pod -n litmus-test
NAME                           CPU(cores)   MEMORY(bytes)
engine-nginx-runner            1m           13Mi
pod-memory-hog-helper-yuvrvs   22m          8Mi
pod-memory-hog-ux9mb2-pngkj    2m           12Mi
sampleapp-6c65c5cdb7-q9bjr     792m         533Mi
As shown above, sampleapp's memory usage has grown by roughly 500Mi. With this setup you can go further and test memory limits, OOM killer behavior, and so on, as sketched below.
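For example, to watch the OOM killer in action, you could give the sampleapp container a memory limit below the 500MB the experiment injects; a minimal sketch of the container spec addition (the 256Mi and 128Mi values are illustrative, not from the original setup):

    # Add under the sampleapp container in the Deployment; with a 256Mi
    # limit, the injected 500MB of stress should drive the pod to OOMKilled.
    resources:
      requests:
        memory: "128Mi"
      limits:
        memory: "256Mi"

After a run, the verdict is recorded in a ChaosResult named <engine>-<experiment>, so here it can be inspected with: k describe chaosresult engine-nginx-pod-memory-hog -n litmus-test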